Tomatoes and data

You gotta prune tomato plants if you want good tomatoes.

If you let them go wild, they’ll grow aggressively, putting out tons of leaves, growing too densely, and eventually succumbing to fungus and other blight from the overcrowding. As the stems grow stronger and thicker, you often get buds at the pits where the sun leaves branch off, and each bud has the potential to grow just as much as the main stem (sun leaves, fruit, and more buds!). You can let a couple of these buds/suckers go (2-3?) per plant, but each one will draw energy and spend it on trying to be a whole tomato plant on its own. Most people would prefer more and larger fruit over more stems, branches, and leaves. If you let them grow on their own, your tomato production will be minimal and the plant may die prematurely.

It’s kind of like data. You want an abundant, prolific data source to grow your dataset, but if you don’t filter aggressively, it might be too hard to find the nuggets (fruit) you want. It’s not a big deal when the dataset (plant) is small, but it can be a real hassle with large datasets (big plants). You may find new directions to take the dataset, too, and like the suckers, pursuing a few of those is fine, but pursue too many and it may be hard to give each one enough attention to get the insights you need. If you spread yourself too thin, you might blind yourself to deeper ways to mine the data.

Now, if you have more land than time, you can grow lots of tomato plants. With 100x the plants, you will probably harvest more good fruit without spending intense effort on each one. This is like having Google-amounts of data. You don’t have the energy to prune and filter your data at this scale, but you can often still produce more results with simple algorithms on lots of dirty data than with sophisticated algorithms on relatively little data.

What do you think?

Posted in development | Leave a comment

systemd resource limiting for mortals


I’ve recently gotten completely fed up with my backup process (using duplicity) and the way it gobbles up huge amounts of memory (6GB+ out of 8GB physical memory) when doing its daily backup. This annoyance is amplified by code in the display driver that bails out under intense memory pressure and exits my Wayland session (likely a bug). I’m documenting my solution for Fedora 26, as I haven’t found trivially understandable articles elsewhere.


Duplicity is quite a nice program, and I am quite thankful for its development. While I am unhappy with the way its resource usage scales with 200GB backups, and with the limited knowledge available on the best tunings for larger and smaller scales, I have not found any superior backup solution. I just need to do something about its ability to gobble memory undeterred by the Linux kernel.

Intel video hardware is probably the best low-power graphics with “reasonable” Linux support in its open-source drivers, but unfortunately, some debilitating bugs are being triggered with the recent Wayland migration. Most recently, its eagerness to bail out (and kill the user session) under memory pressure has been a concern.

Resource management tools have been available on Linux (and other POSIX-style OSes) for quite some time, so why can’t you just do a “nice -n 19” and be done? Because nice only adjusts CPU scheduling priority; it does nothing to limit memory use. Some notes suggest that systemd may be of help.


TL;DR: We need to create a systemd slice and put the backup process systemd unit into that slice.

Create a slice by creating the file in /etc/systemd/system/. Here’s what mine looks like:

$ cat /etc/systemd/system/duplicity.slice 
Description=Duplicity resource-limited slice


I’ve set MemoryHigh to 2G, which is supposed to mean that once tasks inside the slice use more than 2GB of physical memory, they get throttled and their pages are preferentially reclaimed (swapped out). Other pages suggested setting MemoryMax, but as I still wanted my backup to complete, and a re-run of the backup process might demand even more memory, I did not want the MemoryMax behavior of “kill processes with OOM if the max is reached”. MemoryAccounting=true is a necessary prerequisite for any memory tracking controls in that slice.
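For reference, here is a minimal sketch of what such a slice file could look like, reconstructed only from the settings described above (the Description line, MemoryAccounting=true, and MemoryHigh=2G). The section layout and the scratch-directory staging step are my assumptions, not a copy of the original file.

```shell
# Stage the slice unit in a scratch directory so it can be reviewed
# before it is copied into /etc/systemd/system/ as root.
staging_dir="$(mktemp -d)"

cat > "$staging_dir/duplicity.slice" <<'EOF'
[Unit]
Description=Duplicity resource-limited slice

[Slice]
# Required before memory limits on this slice take effect.
MemoryAccounting=true
# Soft limit: above this the slice is throttled and its pages reclaimed,
# but nothing is killed (unlike MemoryMax).
MemoryHigh=2G
EOF

# Show the staged file for review.
cat "$staging_dir/duplicity.slice"

# Once it looks right, install it manually:
#   sudo install -m 0644 "$staging_dir/duplicity.slice" /etc/systemd/system/
```

Staging the file first keeps the example runnable without root; on a real system you could just as well write it directly under /etc/systemd/system/.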

With the slice file created, I just needed to add a Slice= setting naming the new slice (Slice=duplicity.slice) to the [Service] section of the corresponding duplicity.service file. While I was there, I also added (Nice=19, IOSchedulingClass=3, IOSchedulingPriority=7).
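As a sketch, the service-side settings could also live in a systemd drop-in instead of the service file itself. The drop-in approach and the file name resource-limits.conf are assumptions for illustration (I edited the service file directly); the setting values are the ones mentioned above.

```shell
# Stage a drop-in for duplicity.service in a scratch directory.
# The file name "resource-limits.conf" is hypothetical; any *.conf name works.
dropin_dir="$(mktemp -d)/duplicity.service.d"
mkdir -p "$dropin_dir"

cat > "$dropin_dir/resource-limits.conf" <<'EOF'
[Service]
# Run the backup inside the resource-limited slice.
Slice=duplicity.slice
# Lowest CPU priority.
Nice=19
# I/O scheduling class 3 is "idle".
IOSchedulingClass=3
IOSchedulingPriority=7
EOF

# Show the staged drop-in for review.
cat "$dropin_dir/resource-limits.conf"

# Once it looks right, install it manually:
#   sudo install -D -m 0644 "$dropin_dir/resource-limits.conf" \
#       /etc/systemd/system/duplicity.service.d/resource-limits.conf
```

A drop-in keeps the limits separate from the unit itself, which may be convenient if the service file is ever regenerated or replaced.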

To apply the changes:

$ systemctl daemon-reload

When the task is running, you can verify that it’s running in your new slice.

$ systemd-cgls # list running services by cgroup slice

$ systemd-cgtop # list running services sorted by resource usage

So far, the resource limiting has prevented the abrupt session-exits that used to consistently coincide with the backup trigger, but I will update this if I find anything new.


  • The necessity of creating a slice was only clear after reading through a particular systemd issue, which thankfully included a concise example. If this issue were more findable to people looking for a HOWTO, I may not have bothered with this post.

Notes from AWS summit 2014

I got a chance to stop by the AWS Summit 2014 at their stop in San Francisco at the Moscone. It was a free event, and mobbed by people apparently hungry to learn more about Amazon’s Cloud platform services. I was interested in the talk: Scaling on AWS for the first 10 million users. Here are some notes:

In 2014, the capacity AWS adds every day exceeds what it took to operate all of Amazon when it was a $7 billion business (around 2004).

Stages of growing:
Initial: single machine (e.g., EC2 instance) running web+db, with DNS routed by AWS Route 53.

Users>100: Split db off into a db service instance (e.g., using Amazon Relational Database Service (RDS)), or another VM instance dedicated to the db.

Users>1000: Replicate machine and db instance. Now we have two pairs of web+app and db. Let the dbs coordinate using replication. Balance between them using Elastic Load Balancer. Place each pair in different “availability zones” so that a data center failure in one of the zones affects only one web+app and db pair–your system is still functional in the other zone.

Users>10k: Add even more pairs. Consider more availability zones.

Beyond: Move static content to S3 and serve it using CloudFront for edge CDN service. Use ElastiCache and/or DynamoDB to cache state and reduce traffic to db instances.

Amazon Auto Scaling can use metrics from CloudWatch to make decisions to add more web nodes or db instances as load changes.

Use service-oriented architecture: this facilitates making your components stateless, replaceable, and scalable. Don’t let components talk directly to each other–use indirection so that either side of the communication can fail without affecting the other.

Overall advice: split processing into pieces that are as loosely coupled and stateless as possible.

Honestly though, I didn’t think of it so much as a deep-dive as advertised, but that may be more of a reflection of my deeply technical perspective.

The keynote talk this morning announced some steep price drops in AWS products like EC2 and S3. Most people think it was in response to Google’s price drops yesterday. TechCrunch articles: (Google) (AWS)

I also got a chance to talk with some people from the DynamoDB and CloudSearch groups at AWS, and was surprised to find out that both have implemented some tooling libraries for geospatial support. The CloudSearch support seemed really interesting. Apparently, you can push them your data, and they can do area searches (circle or cone) on latitude/longitude data, which is pretty neat. The showstopper might be the pricing (it seemed like about $1000/TB/month), but it seems worth a look at least–their latencies are really, really low, for returning batches of objects.

Anyway, the whole one-day event seemed well attended, and I found people pretty enthusiastic about their work and cloud-based work in general. Amazon really does have a dizzying array of building blocks for scalable systems. If I were building a scalable webapp, AWS really seems like a great way to get things (a) up and running, and (b) scaled up to Netflix-like scale.


How to log in to GNOME3 keyboard-only

Not being able to log in without a mouse/trackpad with the GNOME3 gdm greeter has bothered me for quite some time, and usually I just complain silently and use the mouse. But today, I rejoice!

…because someone has figured it out!

Naysayers will say, of course it works, the username: field has focus, and you can just type your u/p and hit enter a couple times, and it’ll work. And if you enter things in just after it boots up, or just after you log out, that works. But if you let the screensaver kick in, you can dismiss that with Esc or space or enter, but the username: field will not have focus. And you can let your cat play with the keyboard at this point, because GNOME3 will ignore you completely.

So the solution is a magic incantation, which I scrawl here for my own benefit (and yours, if your search engine doesn’t lead you to the above link).

Use Ctrl-Alt-Tab to manually switch focus among the screen elements. I don’t know where this is documented.


Web frameworks for web startups

I am not a web dev. I am mostly in the dark with regard to the latest web technologies. If you’re actually starting a company, you’ll want to hire a real web dev.

But occasionally, I need to throw up a webpage/website quickly. And while I can write up some basic HTML with images, lists, headings, fonts, and links from scratch, I am not a “Javascript guru” or a “Rails ninja”. So at last weekend’s startup weekend, I struggled with the web-facing portions.

I’d like to maintain a short list of frameworks one might use to throw up something really quickly for a web-facing company.

  • Bootstrap: A CSS/JS front-end framework. Originally Twitter Blueprint.
  • Twilio: A service for integrating with telecom (voice, SMS, MMS). This sounds simple, and it is, but it might be useful for bringing small bits of your product off the web and onto your phone.
  • AngularJS: A library for abstracting HTML5 to ease building dynamic pages. From Google.

Any suggestions? I’d like to maintain a *short* list of frameworks and tools that help get something off the ground. These may not be the same as the ones you use when you’re “serious” and have some sort of clue as to what you’re building.

Notes from last weekend: I think most of the teams used Bootstrap. One team used AngularJS, and it looked impressive. Our team used Groovy and Grails on top of MySQL and we had a working product.

Note to self: you don’t want to deal with SELinux problems on a startup weekend. That cost me a couple hours. *sigh*

Other frameworks I haven’t evaluated: YUI (Yahoo UI). Other tools: Highcharts (javascript charts).


Ignore whitespace in meld

This is probably *not news*, but I discovered that there is a way to ignore whitespace in meld. It’s under Edit->Preferences->Text Filters. There’s a setting (in hindsight, obviously) for ignoring changes that insert and delete lines and ones for ignoring whitespace in various forms (and other textual flimflam). Very useful when reviewing commits/branches in git where your colleague has taken the liberty (often rightly so) of cleaning up whitespace, since s/he’s “there already”.

This is more of a note to myself– I found the info via “EdmondsCommerce“.


People who use deques need drug rehab

Can you picture the conversation:
Alice: Hey, I wonder what the signature for deque::insert is…
Bob: What? deque? What are you smoking? Hey, I know this great place for rehab…

Apparently, people who are reading documentation on C++ STL deque are… chemically dependent? Addicted to…. I guess caffeine? 🙂

Thank you, advertising relevancy algorithms.

[Screenshot: a drug rehab ad shown while looking at the docs for deque]

Posted in Uncategorized | Tagged | Leave a comment

HOWTO: Enable hibernate in gnome 3.6

I upgraded to Fedora 18 from Fedora 17 using FedUp recently, and I was dismayed to find that there was no more “Hibernate” option in the status menu, even though I had the alternative-status-menu extension installed. Apparently in their wisdom, the devs of that extension decided that the hibernate option should not be visible by default for users of said extension.

Thankfully, hibernate can be re-enabled. The extension’s page describes how to do this:

Then use gsettings to turn on Hibernate: gsettings set org.gnome.shell.extensions.alternative-status-menu allow-hibernate true

after you’ve installed the extension and enabled it. I’d like to note that you can do this “graphically” with dconf-editor, by browsing to org/gnome/shell/extensions/alternative-status-menu and setting allow-hibernate to true.

Part of the reason for this post is to give myself a record of how to fix this, because the instructions are in a comment, which could potentially scroll off and get buried under other comments.
Have fun!


(Computer) languages worth learning

A friend of mine asked me what languages he should learn to get re-started/refreshed in programming.

I’ve listed my opinions here, partly so I’ll remember what I recommended in the past, partly so that others can learn too, and partly so that I can learn from others’ opinions.

Here we go:


Fedora 17 upgrade notes

Here are some personal notes on upgrading a home machine to Fedora 17 (beta).

The preupgrade requires about 165MB on /boot for the upgrade. This is in addition to whatever you needed to boot Fedora. I didn’t have enough space, and the preupgrade process didn’t warn me. This is a bug that wasn’t considered a blocker. I’m glad I heeded the spew of “No space left on device” messages and Python stack traces on the console rather than trusting the preupgrade dialog box that said “ready to reboot” into Beefy Miracle. So I aggressively removed all kernels except the active/current one. Moral of the story: if your boot partition is smaller than 200MB (200*2^20 bytes), preupgrade will probably not work.
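Since preupgrade didn’t warn me, a quick manual check before starting would have saved some grief. Here is a rough sketch; check_free_mib is a hypothetical helper of my own, and the 165 default simply mirrors the figure above rather than anything preupgrade documents.

```shell
# Report whether a path's filesystem has at least the required free space.
# Usage: check_free_mib [path] [required_mib]
# Defaults mirror the numbers in this post: /boot and 165 MiB.
check_free_mib() {
    path="${1:-/boot}"
    required_mib="${2:-165}"

    # POSIX-format df (-P) in 1 KiB blocks (-k); column 4 is available space.
    avail_kib="$(df -Pk "$path" 2>/dev/null | awk 'NR==2 {print $4}')"
    if [ -z "$avail_kib" ]; then
        echo "ERROR: could not read free space for $path" >&2
        return 2
    fi

    avail_mib=$((avail_kib / 1024))
    if [ "$avail_mib" -ge "$required_mib" ]; then
        echo "OK: $path has ${avail_mib} MiB free (need ${required_mib})"
    else
        echo "LOW: $path has only ${avail_mib} MiB free (need ${required_mib})"
        return 1
    fi
}

# Example (run before starting preupgrade):
#   check_free_mib /boot 165
```

If it reports LOW, removing old kernels (as I did) is one way to free up space on /boot.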

For the record, I fully support release naming in Fedora and have no problems with Beefy Miracle or Spherical Cow.
