Tomatoes and data

You gotta prune tomato plants if you want good tomatoes.

If you let them go wild, they’ll grow aggressively, putting out tons of leaves and getting so dense that fungus and other blights take hold in the overcrowding. As the stems grow stronger and thicker, you often get buds at the pits where the sun leaves branch off, and each bud has the potential to grow just as much as the main stem (sun leaves, fruit, and more buds!). You can let a couple of these buds, or suckers, go (maybe 2-3 per plant), but each one will draw energy and spend it on trying to be a whole tomato plant on its own. Most people would prefer more and larger fruit over more stems, branches, and leaves. If you let the suckers all grow on their own, your tomato production will be minimal and the plant may die prematurely.

It’s kind of like data. You want an abundant, prolific data source to grow your dataset, but if you don’t filter aggressively, it might be too hard to find the nuggets (fruit) you want. It’s not a big deal when the dataset (plant) is small, but it can be a real hassle with large datasets (big plants). You may find new directions to take the dataset, too, and like the suckers, pursuing a few of those is fine. Pursue too many, though, and it may be hard to give each one enough attention to get the insights you need. If you spread yourself too thin, you might blind yourself to deeper ways to mine the data.

Now, if you have more land than time, you can grow lots of tomato plants. With 100x the plants, you will probably harvest more good fruit without spending intense effort on each one. This is like having Google-scale amounts of data. You don’t have the energy to prune and filter your data at this scale, but you can often still produce more results with simple algorithms on lots of dirtier data than with sophisticated algorithms on relatively little data.
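
Here’s a rough way to put numbers on that, as a toy sketch rather than a proof. Assume the "simple algorithm" is a plain average, and assume the dirt in the data is independent, unbiased noise (a big assumption: a systematic bias would not average out). The noise levels and sample counts are made up for illustration, and `standard_error` is just a helper name I picked.

```python
import math


def standard_error(noise_sd: float, n_samples: int) -> float:
    """Standard error of a plain average of n independent noisy samples."""
    return noise_sd / math.sqrt(n_samples)


# Hypothetical numbers, chosen only for illustration.
# A small, carefully pruned dataset: low noise, few samples.
small_clean = standard_error(noise_sd=1.0, n_samples=100)
# 100x the data with 5x the noise per sample: nobody pruned anything.
large_dirty = standard_error(noise_sd=5.0, n_samples=10_000)

print(f"small, clean dataset: {small_clean:.3f}")
print(f"large, dirty dataset: {large_dirty:.3f}")
```

Under these assumptions the big, dirty dataset gives the tighter estimate, because the error shrinks with the square root of the sample count. Flip the numbers (say, 20x the noise) and the pruned dataset wins again, so it really depends on how dirty the data is.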

What do you think?

About Daniel

I write distributed database software. Coding is fun. Love learning languages (spoken and computer). Always looking for opportunities to use advanced math in work and daily life. github: wangd