
Dropbox-style Two-sided Sharing Incentives

Last weekend, among a whole schedule of other great presentations at the Startup Lessons Learned conference (you can watch the video here), the folks behind Dropbox had a presentation (video) about how they went about growing their business.  Apparently search ads were too expensive for them (due to bidding up by other venture-funded firms in their space) and the long tail of search was not panning out, but their referral program worked out really well for them.  Really, really ridiculously well for them.

(For those who haven’t used it, Dropbox is absurdly well-implemented file storage, backup, and sharing in the clooooooooooooud.  They have saved my life twice when hard drives died and save me hours of time schlepping files from my Windows PC to my Ubuntu box and back again.  Go try them out if you haven’t already — I can’t imagine not having them anymore.  Anyhow: the business model is “We’ll give you 2 GB of space for free, or you pay us to get more than that.”)

The biggest single thing about their referral program is that it has a two-sided incentive for sharing: the person who signs up for Dropbox through your referral link gets a better deal than they would have gotten from the homepage, and you also get a free bonus yourself.  That is marvelous, marvelous psychology there: it gives both parties a benefit so the email seems less like spam, and social relations being what they are, the person receiving the gift feels a wee bit obligated to accept it.  (This same dynamic is used by many of the social gaming companies on Facebook, to degrees which almost make me feel icky.)

This works great for Dropbox because they have a product which they can easily make more useful in a granular fashion: just add more space and stir.  The marginal space is truly minuscule as a customer acquisition cost: a few pennies a month if the user actually fills it, and most will not.  However, the passionate freebie-seeking techies in the audience will use their disproportionately sized online megaphones to scrape another few gigs onto their accounts.

The more I thought about it, the more impressed with this idea I was.  I had considered and rejected “Tell a friend” as a marketing scheme for BCC a few years back, on the theory that it just creates more spam and few of my customers would use it, but the double-sided incentive addresses both of those issues for me.  Plus I thought I could potentially implement it very quickly.  (I had a funny idea for a minimum viable tell-a-friend page: just ask for the friend’s email address and then have it ping me when someone submits anything.  I’d send the emails and credit both users manually.  That would have taken my development time from about six hours to one hour, but I decided to do it the “right way” on the theory that a few hours isn’t much of a risk to me anymore.)

So I decided to test out a version of this for Bingo Card Creator.  Historically, I have given free trial users 15 bingo cards for free.  (This neatly segments my markets between parents, who very rarely have 16 children, and teachers and professional users, who rarely play bingo with under a dozen players.)  I’m allowing them to invite friends: each successful invite gives both parties 3 extra cards, with a cap at 12 gained from inviting.  This theoretically will allow a large portion of my core customer base to get their program for free, but I think that paying is ridiculously more efficient for most of them, so it will only be the truly inveterate skinflints who sign up four of their closest friends so that they can get 27 cards for class.

The cost of allowing users to print extra bingo cards is, of course, too low to measure.
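For the morbidly curious, the crediting logic is only a few lines of Ruby.  Here's a minimal sketch of the idea (constant and column names are invented for illustration, not my actual schema):

CARDS_PER_INVITE = 3
INVITE_BONUS_CAP = 12  # maximum extra cards earnable by inviting

def credit_successful_invite(inviter, invitee)
  # The invitee always gets their signup bonus.
  invitee.update_attribute(:card_limit, invitee.card_limit + CARDS_PER_INVITE)

  # The inviter's bonus is capped so nobody can farm unlimited cards.
  if inviter.invite_bonus_earned < INVITE_BONUS_CAP
    inviter.invite_bonus_earned += CARDS_PER_INVITE
    inviter.card_limit          += CARDS_PER_INVITE
    inviter.save!
  end
end

Note the cap only applies to the inviter: the invitee's bonus always goes through, which keeps the offer made on the signup page honest.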

This Feature Is Surprisingly Hard

I’m used to pushing changes which only require ten lines of code, but this feature was a monster:

  • Tell a friend page
  • Processing email addresses put into that form
  • Actually sending emails
  • Facebook sharing integration
  • Customized signup page
  • I-Can’t-Believe-They’re-Not-Affiliate URLs
  • Properly crediting people for signups
  • Anti-abuse measures (mostly making it so that folks can’t use it to spam)
  • Minimum viable stats tracking (no charts yet)

All in all, it took a solid day.
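To give a flavor of the I-Can't-Believe-They're-Not-Affiliate URLs: each user gets an opaque token, and the customized signup page looks that token up to know whom to credit.  A sketch, with invented names:

class User < ActiveRecord::Base
  before_create :assign_invite_token

  # An opaque token in the URL lets the signup page credit the inviter
  # without exposing their email address or user ID.
  def assign_invite_token
    self.invite_token = ActiveSupport::SecureRandom.hex(8)
  end

  def invite_url
    "http://www.bingocardcreator.com/invitation/#{invite_token}"
  end
end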

I’m really pleased with a couple of implementation choices:

Getting the user’s first name:

My gut feeling (yeah yeah, A/B test incoming) is that users will be overwhelmingly more likely to respond to an invitation from Jane than from jsmith@example.com or from “a friend.”  So I ask customers to provide it if they haven’t already.  It is totally optional, but I’m thinking they’re overwhelmingly going to comply.

Highlighting the offer on the signup page:

I hit the social proof fairly hard on the signup page: mentioning again that Bob or whoever sent the invitation, and that both Bob and the user will benefit from accepting the offer.  This page could stand to be a lot prettier, and I could probably throw a testimonial in here somewhere…  In this example, our generous inviting user’s name is Bingo.

Facebook integration:

Recently, having spent far too much of my time playing Facebook games (market research, I swear!) and scouting out the ecosystem more, I’ve noticed something.  One, a quarter of the female members of my family aged 30+ are currently shearing sheep, planting pumpkins, or throwing pigs at each other.  Two, my friends seem to comment on things they share… a lot.  Whoa.  This whole Facebook thing might actually have legs.

It turns out that getting folks to share links on Facebook is child’s play: one line of code that you can copy/paste.  With a bit more work, you can customize the text Facebook will pull out of the page.  I customized the text to include a strong call to action with added social proof, naturally.  Facebook sharing: not just for blog posts.
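For reference, the one line really is just a link to Facebook’s sharer endpoint, and the customized text comes from meta tags on the page being shared.  A sketch of the Rails side (helper name invented):

# Rails helper sketch: a share link for the user's invite page.  Facebook
# pulls the displayed title/description from meta tags on that page, which
# is where the call to action and social proof live.
def facebook_share_link(user)
  share_url = "http://www.facebook.com/sharer.php?u=#{CGI.escape(user.invite_url)}"
  link_to "Share on Facebook", share_url, :target => "_blank"
end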


One feature that I particularly like about the Facebook option is that it only requires two mouse clicks from users (one to open, one to confirm — assuming they’re already cookied on FB), and it doesn’t require them to understand or recall email addresses.  My users have enough problems remembering and managing their own email addresses — I don’t want to include a “look up Anne’s email address” step in the workflow.

Finding folks when they’re ready:

Here’s an idea ripped straight off of the better Facebook games: give folks an opportunity to share stuff right when they hit the wall.  (Well, most of the Facebook games artificially construct the wall such that you have to share to get around it…  but I’m not that tricky.)  For example, if someone wants to print 22 cards and only has 15 quota, that would be a great time to remind them of the incentive.
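In code, the wall check is nothing fancy — a sketch of the shape of it (method and route names invented):

# Sketch: when a print request exceeds the user's card quota, pitch the
# invite incentive instead of showing a bare "no."
def print
  if requested_card_count > current_user.card_limit
    flash[:notice] = "Want #{requested_card_count} cards?  Invite a friend " +
                     "and you will both get 3 extra."
    redirect_to invite_friends_path
  else
    queue_print_job
  end
end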

Metrics Tracking

At the moment I put in very, very basic stats tracking:

  • Who ever accessed the invite page.
  • Who sent invites via email, and how many.
  • How many folks signed up as a result of invites.  (No source tracking, but GA should show me that, easily — referrer is Facebook, etc.)
  • Daily counts of all of the above.
  • As you could probably have guessed, I turned this on via an A/B test and will be watching to see if it hurts conversion to purchases.  (This is just a first cut, since it is possible that shaving 10% off sales from inviters would be worth getting the invitees, if invitees turn out to convert well.)
  • I’ve laid the groundwork for tracking the viral coefficient, although I strongly, strongly suspect it will be far below 1.  (I am not promoting this very aggressively, at all.  See the sketch below for how the number falls out.)
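The coefficient itself is back-of-the-envelope arithmetic.  A sketch, with invented counts (the real inputs come from the stats tracking above):

# k = (invites sent per user) * (fraction of invites that become signups).
# Below 1, invites are a nice bonus; at or above 1, growth compounds.
def viral_coefficient(users_in_cohort, invites_sent, invites_accepted)
  invites_per_user = invites_sent.to_f / users_in_cohort
  acceptance_rate  = invites_accepted.to_f / invites_sent
  invites_per_user * acceptance_rate
end

viral_coefficient(1000, 50, 10)  # => 0.01 -- nowhere close to 1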

Future Directions

In addition to the obvious (testing to see if this actually works), I have a few ideas for how to improve this in the future.

One obvious thing which I will probably not do is to ask folks for their webmail login details, grab their contact lists, and assist them in selecting folks to receive emails.  That would be stupidly effective, but it teaches bad Internet practices (do not give your Gmail details to random websites!) and frankly I don’t want it to be that easy to send invites.  We’ve all got that one aunt who has not figured out netiquette for Farmville and sends 14 lost kittens a day: I do not want to be her enabler.

I’ll probably also work on placement of the offer to invite, copy on the invite page, invite email, and invitation signup page (including graphical design), and will do some much more sophisticated metrics on this if early results look promising.

Will It Work?

Your guess is as good as mine.  In favor of it working, most of my customers and many of my trial users are very thrilled with Bingo Card Creator.  Many of them have figured out how to share it with friends despite me not giving them any good way to do so (not a trivial thing for elementary school English teachers — one bragged to me that she found out how to make a link to the website on her desktop and then bring it to her sister on a floppy, and if they’re getting over barriers to conversion that high, you know there must be something going on).  There is also the natural penny-pinching nature of teachers operating in my favor — the fact that the program is not free is far and away my #1 user complaint — and the fact that they tend to travel in packs.

In favor of the idea not working out so well: these are not very plugged-in people as compared to Dropbox’s early adopter userbase, the actual mechanics of sharing still require non-trivial technical expertise (understanding email addresses and knowing those of your friends, for the option I’m giving highest billing to), and there are non-trivial business risks if it either becomes too popular or if folks feel that the invitation emails are an imposition.

Speaking of which: I capped the number of invites I’ll send out per user at 5 (hard-capped at the moment), capped the number of invitations any individual will receive at 1, and have capped the system at a total of 500 a day until I have some idea of how many is safe to send.

Bug of the Year Award:

It is early, but I think this already won it: a poorly considered after_save callback on my user model caused users’ mailing list settings at MailChimp to be updated if their user record was updated.  That was previously desirable, since there was nothing in the user model which could be updated without touching either their email address or their mailing list settings, and all updates were at the user’s personal request.  However, when I put the users’ card limit in there, then updated that for 50,000 users to set it to the default value, the callback fired and suddenly I got about 30,000 Delayed jobs all waiting to ping MailChimp.  I was ignorant of this until — thank God for checklists — I was testing the deploy and found out that I could not print bingo cards.  I assumed I had botched the Delayed Job worker processes again, but no, they were up… and right after I confirmed that I got the email saying “Delayed Job has spiked to 30,000 jobs in the queue!”

As soon as I realized what had happened I hit the Big Red Button on DJ, but not before a few thousand of them had been processed.  For users who had actually confirmed the signup to the mailing list before, I don’t think anything bad happened.  For those who had second thoughts before the double opt-in, they were all hit with another email from MailChimp on my behalf, seemingly out of the blue.  I’ve now got an inbox full of “Who are you and why are you spamming me?” to deal with.  *sigh*

On the plate for tomorrow: figure out how I could have seen this one coming.  My testing and staging environments simply ignore API calls to MailChimp for the obvious reason — I wonder if I should have them throw exceptions instead unless they’re explicitly expected behavior.
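The throw-exceptions idea would look something like this — a sketch with invented class and method names, where the wrapper fails loudly outside production unless a test has explicitly declared the API call expected:

class MailchimpGateway
  # Tests flip this on when an API call is the behavior under test.
  cattr_accessor :calls_expected

  def self.sync_subscriber(user)
    if RAILS_ENV == "production"
      # ... actual MailChimp API call goes here ...
    elsif calls_expected
      # no-op: the test said this call should happen
    else
      raise "Unexpected MailChimp API call for user ##{user.id}"
    end
  end
end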

Building Highly Reliable Websites For Small Companies

Downtime severely annoys customers.  Downtime annoys sole proprietors even more, because it has a funny way of invariably striking at the worst possible time.  Apache has no respect for date night.  So if you’re a small company without a dedicated ops team, you might well be worried about whether you can reasonably promise customers that you’ll be able to avoid inconveniencing them while still maintaining some semblance of sanity in your own life.

Happily, you can, if you’re savvy about it.  I’ve supported thousands of customers and hundreds of thousands of trial users for four years without causing frequent outages, despite not being particularly skilled at server administration or having a huge money or time budget.

Setting Expectations

Let’s get this out of the way: are you a small company dependent on technology?  You will have downtime.  You will wear a communication device twenty-four hours a day for the next several years, and respond with alacrity if it goes off.  The purpose of the rest of this blog post is to minimize downtime and have that communication device do as little damage to your relationships and sanity as possible.

Specifically, you will want to:

  • Anticipate failure ahead of time
  • Minimize the incidence of failure
  • Be notified of failures in a timely manner
  • Quickly recover from failure
  • Learn from failures to prevent recurrence

Many of these tips are specific to my personal experiences as an entrepreneur with a small business.  If you work in a highly regulated industry, have a dedicated ops team, or are Google, you probably should not be reading my blog to solve your technical challenges.

Identify Risks To Your Service

The key to building reliable systems is first to know where the risks are.  Don’t be suckered into thinking that downtime is generally caused by unpredictable black swan events: that is an easy mistake to make when reading stories about reliability from Google et al.  This is partly because they have large teams of supergeniuses wielding nearly infinite budgets to build reliable systems, and partly because when they do have downtime they typically phrase the report of it in such a way that it sounds like it was a black swan and not systemic failure to follow routine, easily understood policies seven times in a row plus the black swan that let the system go down.  (We’ll be returning to the policy theme in a moment.)

No, your risks are quite predictable, and you can jot them on a piece of notebook paper right now.  I’d strongly suggest actually physically doing this, as it helps inform your thinking about what is likely to break and what you’ll need to do to mitigate the risk of that.

Not sure what your risks are?  I’ve worked for the last several years in Nagoya, the Town that Toyota Built, and even though I was never in the automotive industry my professional mentors were heavily influenced by it.  You know what causes 99% of problems with cars?  Moving parts.

It is astronomically more likely for something which moves to fail than something which doesn’t: it is subject to friction, wear, foreign particles, and a thousand other sources of failure.  By comparison, all the chassis of the car has to do is not decompose into its constituent atoms, and since it hasn’t done that until now it is a good bet that today will not be the day it picks to do so.

Software systems are also, overwhelmingly, killed by their moving parts.

Hard drives, to be very literal, are one of the only things that actually move in your server, and statistically speaking they’ve got the worst failure rate of any device in the box.  Serious engineers treat hard drives as a component that by grace of God has not died yet.  That is why we put them in RAID arrays, which abstract the life and death of particular hard drives away from users of the system.  I’m rationally ignorant of how RAID actually works: if you want to know, read a book; if you want to do something productive with your day, pay your VPS provider or managed hosting company to deal with this for you.  Trust me, you have better things to do than actually touch hard drives.

Side note: This is increasingly becoming true of just about everything [PDF link] in computer systems: scales are such that statistically speaking something is broken right now, so we’ll build our systems on the assumption that they’re broken to a degree unpredictable at run-time, and still squeeze some work out of them.  “The Cloud” has not quite brought web-scale computing principles to the smallest software companies yet, but I think it is highly likely we’ll adopt bits and pieces of them eventually.  After all, essentially all of us use RAIDs these days, many without knowing it, while they were a Serious Tool for Serious Businesses only a fairly short while ago.

Less literally speaking, your system is most vulnerable where it sees dynamism, complexity, and change.

It is at its most vulnerable when you are working on it and shortly after, because computer systems largely do not rot and once they’ve achieved a steady state tend to stay in it until something exceptional happens.  You are your own worst enemy, and you’ll take steps to mitigate the threat you pose, as described later.

Many web applications these days have easy dividing lines between static and dynamic requests.  The dynamic requests generally represent a small fraction of the overall total but will cause almost all of the failures.  If you have, for example, Nginx proxying to Mongrel, you can be quite confident that Mongrel will fail much, much more often than Nginx will.  (In point of fact, Nginx fails so seldom you can almost get away with ignoring the possibility of it happening, since something else will almost surely kill either your service or you personally prior to Nginx dying on you.  Carry life insurance and look both ways before you cross the street, but if you have too many things to do and not enough time to do them in, worrying about Nginx failing is something you can probably safely kick down the road.)

Your database is also a high-probability culprit for failing, partially for practical reasons (it is a very sophisticated bit of engineering which by its very nature is highly dynamic) and partially for philosophical ones.  For historical reasons, most databases are made/configured around the assumption that it is better to fail loudly and completely rather than fudge the ACID guarantees silently.  This is a perfectly reasonable engineering tradeoff for a bank’s transactional systems which might not be optimal for the sheepthrowing statistics your app may be tracking.  (The degree of this impedance mismatch is why some folks are very passionate about the whole NoSQL thing when they really just want to drop ACID.)

Then there are a whole host of systems outside of your direct control which can nonetheless bring your system down.  Your hosting provider’s network, for example, can fail at any moment, and typically a failure there is essentially a 100% loss of service for the typical web application.  Their upstream provider could similarly fail.  Any API your application depends on could fail in a hundred ways at any time.

Basically, keep writing on that piece of notebook paper until you run out of obvious sources of failure.

Here’s the abbreviated list of things that could go wrong at my business.

  • Operator error
  • Operator error
  • Operator error
  • Hardware failures on the server
  • Network failures at Slicehost
  • Mongrel fails
  • MySQL fails
  • Memcached fails
  • Delayed Job fails
  • Scheduled cron tasks fail
  • External APIs (e.g. Mixpanel) fail
  • Embedded Javascript (e.g. Google Analytics) fails

All of these have actually happened at one time or another, but most did not cause downtime for my customers.

Mitigating Failures Before They Happen

After you have some idea of what is likely to go wrong, you can start taking actions to mitigate it.

One conceptually easy step is decoupling: make it so that a failure of a particular component can’t bring down the entire system.  This isn’t always possible in a cost-effective fashion: for example, in a heavily dynamic web application, it is highly likely that the database failing means you’re going down hard.  That is OK.  Ideally, your business is not running the power generators at a hospital: you don’t have to eliminate all of the downtime, you just have to minimize it, so go after the low-hanging fruit before addressing “what happens if the database dies”.  (Answer: nothing… if you spent a few months of Very Expensive Engineer Time working out a replication/failover strategy.  That is overkill when a) your customers are the sort that will tolerate a bit of downtime once every blue moon and b) you’re much, much more likely to bring the server down because you didn’t spend ten minutes writing a deployment checklist.)

One example of decoupling: never call an external API from within an HTTP request/response cycle.  That essentially makes your system 100% dependent on the external API being constantly available.  Their downtime is now your downtime.  Their capacity problems are now your capacity problems.  Their operator error is now your operator error.

Instead, do all communication with external APIs asynchronously.  There are many common patterns for this.  My website calls out to Mixpanel for statistics tracking, but the end-user doesn’t care about that, so I just queue up a Delayed Job to do the API call asynchronously, and regardless of whether it eventually succeeds or fails, my user never notices.  This means that Mixpanel’s (quite infrequent) downtimes have not caused my users any loss of service.
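If you haven’t seen the pattern, it looks roughly like this with Delayed::Job (the job class and client names are invented; the real code has a bit more error handling):

# A tiny job object: if Mixpanel is down, the job fails and Delayed::Job
# retries it later.  The user who triggered the event never waits on it.
class MixpanelPingJob < Struct.new(:event_name, :properties)
  def perform
    MixpanelClient.track(event_name, properties)
  end
end

# In the controller: enqueue and move on with the response.
Delayed::Job.enqueue MixpanelPingJob.new("printed_cards", :count => 25)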

If your users actually need to see the results of the API call, you can schedule the call asynchronously and then do AJAX-y magic to poll the server asking if it has completed yet.  If it doesn’t complete in a reasonable amount of time, you can either tell the user so in a nice, customized error message, or you can fall back to something which you can accomplish locally.  In many applications, instantaneous response to changes in the underlying data is just “nice to have” rather than a genuine requirement — RSS readers, for example, usually won’t kill anybody if they are a few minutes out of date.  If updating an RSS feed fails for whatever reason, you can probably get by with showing the user your most recent cache of it — quite possibly without ever telling the user about the failure at all.  (Engineers often are excessively protective of users.  Personally, in most applications that don’t involve lives or money, I would rather tell the user a white lie than show them a red error message.  This is particularly true when they can’t really do anything to address the error message other than “Try again later [and pray a third party who you don’t even know exists has figured out why they are throwing 501 status codes and addressed it].”)

Similarly, you can decouple bits of your web infrastructure from other bits.  In Rails applications, Mongrel can (and will) fail independently of Nginx.  By default, this will result in Nginx showing a forbidding black and white page with a scary error message on it.  That is a terrible user experience and you can alleviate it in seconds: create a nice-looking page using nothing but static assets, and have Nginx serve it using the error_page directive.  Somewhat contrary to what many engineers might assume, users are often largely mollified by anything showing up on their screens, and a well-written error page is sometimes almost as good as a functioning system.
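The Nginx side is a few lines of configuration — a sketch, assuming your friendly page is a static file in your public directory (paths invented):

# If the upstream (Mongrel) is down or erroring, serve a friendly static
# page instead of the default scary error page.
error_page 500 502 503 504 /sorry.html;
location = /sorry.html {
  root /var/www/bingocardcreator/public;
}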

I know the failwhale is a running joke, but that is just because we see so much of it and are accustomed to our computers mostly working.  More typical users have computers eat their documents, freeze, and break all the time for no discernible reason at all, and if you do your job right they may never see your system fail twice, so their first and only encounter with your error page might not actually cost you too much customer goodwill.

Automated recovery is another smart mitigation step to take.  I use a process manager called god to watch over my Rails programs, and when a Mongrel starts consuming too much memory or failing to respond to “Are you alive?” messages, god forcibly restarts it.  This sounds almost crazy to the Big Freaking Enterprise engineer in me, but practically speaking it eliminates almost all common Rails problems (e.g. poor memory management caused by overly enthusiastic creation of objects and not garbage collecting them efficiently) before they cause a problem for my customers or myself.  The god daemon is, similarly, restarted daily to avoid it having memory leaks.  Yep, it smells strongly of duct tape, but the duct tape works.
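A sketch of what such a god watch looks like, along the lines of mine (ports, paths, and limits are illustrative — tune to taste):

God.watch do |w|
  w.name  = "mongrel-8000"
  w.start = "mongrel_rails start -p 8000 -d"
  w.stop  = "mongrel_rails stop -p 8000"

  w.restart_if do |restart|
    # Restart on memory bloat...
    restart.condition(:memory_usage) do |c|
      c.above = 150.megabytes
      c.times = 3                  # three bad readings in a row
    end
    # ...or on failing the "Are you alive?" check.
    restart.condition(:http_response_code) do |c|
      c.host        = "localhost"
      c.port        = 8000
      c.path        = "/"
      c.code_is_not = 200
      c.times       = 3
    end
  end
end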

Minimizing operator error is critically important, because you are the least reliable component of your system.  Because you rely on software to do most of the actual work, when you touch the system you’re almost by definition performing something novel that isn’t automated.  Novel operations are moving parts and vastly more likely to fail than known-good operations that your system crunches millions of times per day.  Additionally, even if what you want to do is absolutely flawlessly planned out, you’ll often not execute flawlessly on the plan.  This was one of the root causes of my worst downtime ever.

Happily, the steps to minimize operator error are well understood.  Unhappily, they require swallowing a bit of your ego and actually following them.  They’re well researched, reproducible, and will save you time in the long run: get over yourself.  I had to, too.

If you have to do it more than once, it should be automated or made into a checklist.  This includes things like:

  • server setup
  • server upgrades
  • upgrading code on the staging server
  • upgrading code on the production server
  • any maintenance task

Checklists are very simple: just a textual description of what the list describes, why you’d want to do it, what the exact steps are (i.e. down to what you type into the console), and what the exact steps are for verifying that the procedure was carried out correctly.  This last bit is non-optional: subtle failures in maintenance tasks are a frequent cause of downtime, sometimes weeks or months later.

This is one reason we spend so much time on root cause analysis, to demonstrate to skeptical engineers that checklists are like flossing.  My dentist tells me “If you think flossing is a nuisance, that’s fine: just floss the ones you intend to keep.”  If you think checklists are a nuisance, that is fine: you can feel free to skip checklists for systems where catastrophic failure is no big deal.

I personally keep checklists in text files, because I only have one person to worry about, but in a multi-user organization wikis are fantastic for them.  This is doubly true for wikis which keep version history, because frequently, as the system matures and as you respond to issues, the checklist will need modification, and rather than tracking Bob down and asking him what this command is supposed to do, a well-written changelog will tell you “That command is there to prevent you from hosing the DB like we did last August.”

Certain checklists are executed very infrequently.  Of particular note is one checklist that absolutely everyone should have: how to restore a machine (or machines) from the bare metal to a working copy of the production system.  Ideally, you should be capable of doing this in 15 minutes or less, because if you ever have to do it for real it means that disaster has struck and your site is now down.  Take 15 minutes out of your busy schedule every quarter and actually run that checklist to make sure it still works.  You will frequently find “Oh, effity, we use GraphicsMagick these days rather than ImageMagick but nobody wrote that on here.  Hah, silly me,” which, if this were an actual emergency, would have you scrambling to correct it while the site was still down.  “Scrambling” sounds a whole lot like moving parts, right?  Right: you’ll be introducing more errors at the worst possible time to have them, in the middle of an emergency.  Get emergency recovery down to a routine, so that when you actually have an emergency (and you will, eventually), dealing with it is a matter of routine.

Automation is your friend.  Some organizations get checklist happy and make checklists for procedures which essentially can’t fail and which require no judgement.  Those shouldn’t be checklists — they should be shell scripts.  That way, they save your engineers time and you can be confident that the latest script in version control (it is in version control, riiiiiight?) is well-tested (it is tested, riiiiight?) and will actually work.

Of particular note in the Rails world: deprec and Capistrano are wonderful tools which automate server setup (very well suited to deployment to a VPS like Slicehost) and application deployment.  These are absolutely lifesavers and, although I’ve bashed my head against both a few times (typically with integration issues with Windows) they save you from weeks and weeks of script writing.  I also sleep much more easily at night knowing that I’ve set up a staging environment in ~8 minutes using my Capistrano script this month and, if everything else went wrong, I could have a server reimaged and loading my database backup almost as soon as I got the phone call.
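If you’ve never seen one, a Capistrano recipe reads roughly like this — a sketch with invented server details:

set :application, "bingocardcreator"
set :repository,  "svn+ssh://svn.example.com/repo/trunk"
set :deploy_to,   "/var/www/#{application}"

role :app, "www.example.com"
role :web, "www.example.com"
role :db,  "www.example.com", :primary => true

# Restart the Delayed::Job workers on every deploy -- ask me how I know.
after "deploy:symlink", "deploy:restart_workers"

namespace :deploy do
  task :restart_workers do
    run "#{deploy_to}/current/script/delayed_job restart"
  end
end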

Be Notified Of Failures In A Timely Manner

Many failures can be solved or mitigated fairly quickly after you get to a computer, which means that time to recovery is dominated by the interval between the failure arising and you becoming aware of it.  There are easy, reproducible ways to bring that down to “a few minutes or less.”

Low-hanging fruit: The absolute easiest possible solution is to point Pingdom or Mon.itor.us at your home page and have them contact you if it doesn’t respond.  They’re fairly simple: if the server doesn’t respond with HTTP 200 when they try to access it (every 15 minutes or so), you get an email or SMS.  This is the simplest thing that could possibly work, but there are circumstances where it won’t catch failures.  (For example, applications of non-trivial complexity often have parts which can fail without taking the front page down.)

I recommend creating an internal status page which automatically checks all the things you think are crucial, risky, and tractable to resolution if you were to know about them.  (If an external API provider goes down and you already know your response is going to be “I wait until it comes back up”, then no sense disturbing your sleep about it, right?)  For example, mine will fail to return properly if Nginx, Mongrel, the Delayed::Job workers, memcached, or Redis is having a bad day.  You can then have your external monitoring poll that page and, if they don’t get the HTTP 200 all clear, send you an email.
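Mine is not much more complicated than this sketch (the specific checks and names are invented — the point is one URL that goes red if anything you care about is sick):

class StatusController < ApplicationController
  def show
    failures = []
    failures << "mysql"       unless healthy? { User.count >= 0 }
    failures << "memcached"   unless healthy? { Rails.cache.write("probe", Time.now.to_i) }
    failures << "delayed_job" unless healthy? { Delayed::Job.count < 10_000 }  # queue sanity

    if failures.empty?
      render :text => "ALL CLEAR"
    else
      render :text => "FAILING: #{failures.join(', ')}", :status => 500
    end
  end

  private

  def healthy?
    yield rescue false
  end
end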

For folks who are feeling adventurous (or excessively stingy), you can rig your own server monitoring terminating with a phone call or SMS with Twilio and about 30 minutes of work.  If you do this, you can escape the time or notification limitations which the notification services use to segment their customers into “hobbyists” and “enterprises who have an awful lot of money to spend to make sure nothing breaks.”  Personally I don’t think it is worth it but, hey, it is an option.  (Note that this introduces another moving part into your system which, if it fails, you will probably find out about only when your main system is down… a couple hours after the fact.  This is a compelling reason to not be a cheapskate.)

I have a very simple solution for going from notification mails to instant awareness of the problem: my cell phone, which is a cheapo Japanese model, can do custom ringtones for individual callers or mail senders.  In the event I get an email from my notification service, it plays Ride of the Valkyries.  As I recall, I’ve heard it twice, once while sound asleep and once on date night.  (The interesting question of whether I would have begged off of date night may not ever be answered, since I was able to successfully reboot before my girlfriend noticed.)

In addition to the phone-in method of server monitoring, you can also use the phone-out method.  I have been using Scout for the last couple of weeks, and it is wonderful.  Basically, a cron job running locally reports a variety of statistics to their server every few minutes, and if the statistics are anomalous or the report fails to happen as scheduled, you get detailed warnings of it.  My sole problem with Scout is that it is one chatty little robot: by default, it sends me emails about things like not-even-close-to-critical demand spikes (you went to 2% CPU utilization for a minute?  Poor baby! My business got linked to from Reddit and requests spiked 1,000% without any performance degradation: great, why are you telling me?).  After a few weeks of tuning I’ve mostly shut it up about non-critical notifications.  (One other gripe with Scout is that they reserve SMS notifications for their priciest plan.  It is market segmentation, I know, but in an age of Twilio it is almost petty.  Still, I feel quite satisfied on their $20 a month option.)

Quickly Recover From Failure

Ideally, you’re either recovering automatically or you’ve been given timely notice of a failure you anticipated and now all you have to do is open your checklist.  Neither of the above?  Well then, this is why you earn the big bucks.  Godspeed.

Learn From Failures To Prevent Recurrence

We’re big fans locally of Five Whys, which has lately achieved a bit of prominence in US startups due to Eric Ries and the rest of the Lean Startup crowd.  Boiled down to its essence, Five Whys says that no failure ever has one cause.  There might be a single surface-level immediate cause, but the failure is also a symptom of multiple process failures, because you had things in place to prevent that failure from happening and they did not trigger or were not effective.

I’m ruthless about doing the corrective action — root cause analysis — in my own business.  You can see an example here.  I’m more proud of that failure than a lot of my successes, because I learned from it.

Basically, you keep peeling layers of the failure onion until you’re satisfied that you’ve gone deep enough: five layers is a guideline, not a rule.  You then invest proportionate resources into making sure that each of the failures does not happen again.  This could mean updating procedures/checklists, adding features to your autorecovery code or diagnostics, beefing up employee training, etc etc etc.

A note for those with employees: Five Whys will frequently — very, very frequently — implicate human or cultural issues in your organization.  Just trust me on this:  you’re not alone, and you can persevere through the difficulties.  Resolving critical defects was ironically one of the least contentious processes we had at my ex-day job — even in a Japanese megacorp we had internalized that it was too important to coat with the usual amount of corporate horsepuckey.  (It helps that my Japanese megacorp is in Nagoya and that the development practices of a certain large automobile manufacturer are practically state religion here.)  Check the egos, chuck the org chart, and make the fixes you need to make to uphold your responsibilities to your customers.

Stats Bug In A/Bingo v1.0.0 and earlier

Many thanks to Ivan for reporting this one: there is a significant bug in A/Bingo calculation of z-scores for versions 1.0.0 and earlier, which borks substantially all z-score calculations and in some cases can change whether A/B test results are reported as statistically significant or not. 

The bug is all of one character long:

def zscore
  # omitted for clarity
  cr1 = alternatives[0].conversion_rate
  cr2 = alternatives[1].conversion_rate
  n1 = alternatives[0].participants
  n2 = alternatives[1].participants

  numerator = cr1 - cr2
  frac1 = cr1 * (1 - cr1) / n1
  frac2 = cr2 * (1 - cr1) / n2   # this line is bugged: (1 - cr1) should be (1 - cr2)
  numerator / ((frac1 + frac2) ** 0.5)
end

I have fixed the bug (via the Slicehost console, on a Japanese cafe Internet PC, because I am stuck in Nagoya today again) and pushed the fix to the git repository.

Does this make my results invalid?

You can probably still have confidence in results you got from A/Bingo previously. While the numerical calculation of the z-score was borked, it was borked in a subtle enough fashion that most statistically significant tests will retain their statistical significance under the borked calculation and most statistically insignificant tests will not gain statistical significance magically as a result of the borked calculation. (My quick eyeball suggests that it causes BCC to overstate the significance of tests which are very significant and understate the significance of tests which are insignificant, which is a very fortuitous set of properties for a random bug to have in an A/B testing framework.)

I have re-run statistical confidence tests for everything I’ve ever done for BCC that I still have data for, and no experimental results changed as a result of the error. Nonetheless, I deeply regret the bug, and will write unit tests for the statistics code as soon as I am physically capable of doing so to rule out the possibility of this sort of thing in the future.
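The tests I have in mind are of the pin-it-to-a-hand-computed-value variety — a sketch (build_experiment is a hypothetical test helper; the expected figure was computed independently from the corrected formula):

require 'test/unit'

class ZScoreTest < Test::Unit::TestCase
  def test_zscore_matches_hand_computed_value
    # 1000 participants per alternative; conversion rates 0.10 and 0.15.
    experiment = build_experiment(:participants => [1000, 1000],
                                  :conversions  => [100, 150])
    # By hand: (0.10 - 0.15) / sqrt(0.10*0.90/1000 + 0.15*0.85/1000)
    assert_in_delta(-3.39, experiment.zscore, 0.01)
  end
end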

I Had Downtime Today. Here's What I'm Doing About It.

I screwed up in a major way yesterday evening. This post is part of my attempt to fix it.

This morning I woke up to an email from a paying customer saying that they tried to print cards but couldn’t.  Specifically, they said that they were able to use the Print Preview feature, but that using the actual print button, quote, “caused the server to hang.”  That can’t actually happen, but it was sufficiently detailed as a bug report to immediately clue me in on what probably happened: the Delayed::Job workers must be down.  A quick check of the server (ps -A | grep ruby) showed that this was indeed the case.

I quickly restarted the Delayed::Job workers then logged into the Rails console to check how many jobs had piled up. Six thousand.  Oof.  Most of them were low priority tasks (e.g. pinging the Mixpanel server with stats updates, which I do asynchronously to avoid having a failure there affect my users), but sixty users were affected — their print jobs were delayed.  Print jobs normally take under five seconds to execute and are checked with a bit of AJAX magic which polls the server until the job is ready, which means that most of these users probably got an animated GIF spinner to look at until they got tired and closed the web page.  The worst affected jobs took over twelve hours.

Happily, the downtime hit on a Saturday, which is the lightest day of the week for me.  If this had happened a week ago right before Valentine’s Day over 5,000 users would have been affected.

Apologizing To Affected Users

I used the Rails console to create a list of users affected by this, and have sent individual apology emails to the 2 paying customers affected (including attachments for the cards they had tried to print).  I will be contacting the trial users in a more scalable fashion.  Since I don’t have permission to email free trial users (the anti-spam guarantee I give is fairly strict), I dropped the development I had planned for this morning and built a simple messaging system into the site (~20 lines of code — I love you, Rails).  It gives me one-way “drop a message directly to your dashboard” functionality.


I prefer using this feature to the standard industry responses to outages:

  • “Outage?  What outage?”
  • “Please see our status page, which we’ve conveniently located in electronic Siberia.”
  • “ATTENTION ALL USERS!  0.7% of you were affected by very serious sounding things yesterday!  Please be worried unnecessarily even if you weren’t affected, and swamp our support line, whom we will provide no effective tools to tell you whether you’ve been affected or not!”

It allows me to apologize directly to affected users, makes minimal demands on their attention while still almost certainly reaching them, and does not cause any issue for the other 25,000 users.  Plus I can re-use this feature later in the event of needing to contact specific users without needing to email them (one obvious candidate would be plopping something straight on the screens of anonymous guests if I found something they individually needed to know, for example, if one of my automated processes caught that a recent print job of theirs did not come out right).

Preventing It From Happening Again

I’m something of a fan of Toyota’s Five Whys methodology for investigating issues like this.  (It has recently been popular with the lean startup crew.  My coworkers at the day job enjoyed some mostly justifiable smirks when I told them that.)

  1. Why couldn’t my users print?   Because the Delayed::Job workers were terminated when I upgraded the production server to Ubuntu Karmic Koala last night.
  2. Why didn’t the post-deploy checklist catch that users couldn’t print?  The post deploy checklist has “manually verify you can print cards” on it. I didn’t follow the post-deploy checklist with sufficient attention to detail because it was late (midnight) and I was tired (because I worked a six day crunch week at the day job… 30 days to go).  Here, I used the Print Preview feature to verify that I could print cards (“Hey, it tests the same code path, right?”), not realizing that while it tests the same code path they have different failure scenarios if e.g. Delayed::Job workers are down.  Fix: Quit day job and, regardless of how tired you are, follow the freaking checklist.
  3. Why weren’t you woken up by the Ride of the Valkyries playing on your cell phone when the site failed?  Don’t we have a system in place to do that? It turns out that the automated diagnostic (an external service pings a URL, the URL runs various tests and throws an HTTP error if any fail, the service mails my cell phone if there is an HTTP error twice in a row) tests nginx, mongrel, the D/B, and core program logic but doesn’t test the Delayed::Job processes or sanity check the job counts.  Fixed.
  4. Why didn’t the ‘god’ process monitor detect the workers were down? God sees every sparrow, but god only knows about the processes you tell it to manage, and my god_config.rb file has the Delayed::Job bits commented out with the notation “#This is buggy.”  I don’t remember why it was buggy and my notes in SVN are similarly unhelpful.  New task: unbuggy it.
  5. Why don’t you have commit notes, comments, or a development journal telling you what you were thinking when you found it was “buggy”? Failure to keep adequate records for “minor” changes and failure to follow up on a bug that was prioritized “Eh, get to that whenever” and then never gotten to.  Fix:  Look into beefing up developer documentation practices.

In the course of investigating this I discovered the update to Koala also killed Memcached on the server.  (Thankfully, Memcachedb — where I persist long-term user data that for whatever reason isn’t in the database, such as A/B testing participation data — is on another server.)  Unbeknownst to me, my use of memcached fails totally silently: if Rails can’t find the data in the cache it just regenerates it.  That would have had very unpleasant consequences for users if it had continued until Monday, and none of my automated tests would have picked up on it, because they all ignore timing.  I’ve added an explicit check to see if memcached is up and running.  I’ll also look into doing something about monitoring response times.
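The explicit check is a round trip rather than a connection test, because silent cache misses look exactly like a dead cache.  A sketch (key name invented):

# Write a probe value and read it back; if memcached is down, Rails'
# cache store swallows the error and the read comes back nil.
def memcached_alive?
  probe = Time.now.to_i.to_s
  Rails.cache.write("memcached_probe", probe)
  Rails.cache.read("memcached_probe") == probe
rescue
  false
end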

What I Learned From Japanese Engineering

I’m indebted to my day job for teaching me both a) how to do this and b) the absolute necessity of doing it, in spite of my longtime cavalierness with software testing. It was quite a culture shock for me the first time I logged into the test server at work to deploy something and got a rap on the knuckles for not:

  • Having a written explanation of exactly what commands I was going to enter.
  • Having a written checklist describing what tests to perform to ensure the deploy worked, and what the expected results would be.
  • Writing in the wiki that I was doing the deploy for a particular version done to close out a particular bug, so that there would be a trail to follow if the version I was about to deploy failed years from now.

That’s what we do for the test server.

All of the writing, test suites, automated test processes, and monitoring take some time to set up, and much of it generates additional overhead on all your tasks.  However, in the last three years, I’ve come to recognize that it is a net time-savings over writing apology letters and doing emergency incident response, neither of which is ever fun or quick.

Alright, development journal entry over.  Back to new development.

Do You Debug Your Website?

I can’t say that the bug this post is about is my worst bug ever.  Or even my most embarrassing bug ever.  But it is certainly my most costly bug ever.  But first, a picture.

The above graph shows two years and change of monthly visitor counts to my business.  There are a bunch of milestones I could have put up there — the new versions of my software, the key events in my market’s annual cycle, the redesign of my website, the time I quadrupled my advertising budget.  And they all pale in comparison to squashing one stupid little bug.

What Could Possibly Have Done That?

Well, like many bugs, the impact of this one rose in direct proportion to how clever you are.  My business website runs on a custom-designed CMS which cranks out free printable bingo cards.  The home page, shopping cart, main level navigation, all of these things are valuable towers which sit on a gigantic foundation of free content.  That content attracts me links, traffic, and searchers.  The bright idea to make the production of this content scale is probably the single biggest thing I’ve ever done to grow my business.

And scale it does.  I pay a terrifically talented freelancer to write the actual word lists that become the bingo cards, and provide a little bit of color commentary.  I feed her input into the CMS, and *bam* web page, downloadable PDF, and GIF are created.  Now multiply this times a couple hundred, for very little additional work on my part.

The Best Laid Plans of Mice and Men

One design decision I made when building this automated marketing machine was to give it a time-based element: one bingo card got featured a day.  Originally this was largely because I had the site separate from my main site and wanted to call it Daily Bingo Cards.  I figured, hey, if a new card pops up to the “front” page every day then it will always be fresh, and that makes it stickier for folks.  (And, as it turns out, I do have a handful of folks who a year later are still loading the page several times a week to see what is new.)

However, I had (totally unfounded) performance worries about Rails when I wrote this application.  I thought hitting the database for every pageview was needlessly taxing on my server (bzz, I only get a few thousand a day — why the heck would the server complain?).  So I decided to turn on caching.

Caching is a wonderful tool.  You can use it to solve just about any performance problem.  It solves it by replacing the problem with a cache expiration problem.  (No performance problem?  Oh, you get to deal with cache expiration anyway, don’t worry.)

How Hard Can It Be?

Now, if you want to refresh your content on your website once a day, and you’re not really particular when you do it, caching should be dead easy: purge the cache once a day.  Which I did.  Sort of.

You see, Rails stores cached pages in the same directory your static HTML is being served out of, as static HTML.  Your web server treats the cached pages as it would any other web page if they exist, and just slurps them right out of the directory and spits them out at whoever is requesting that page.  This is blazingly, bugs-in-your-teeth fast.

And to clear the cache?  Simple — delete the HTML file and it is like it never existed.  Your web server, seeing no file matching the user’s request, will fallback to calling Rails to see what to do with the request, and after Rails does its magic there might be a new HTML file deposited into that directory.

And how to delete a file once a day?  Stick a task in crontab… done.

Your Testing Protocol Leaves Something To Be Desired

Now normally a line which is, roughly speaking, the complexity of rm -rf /some/file/name/goes/here is pretty hard to screw up.  But knowing well my penchant for screwups I tested those lines before I wrote them into the crontab.  As root (do you see where this is going?).  And they worked swimmingly.

Of course, when I wrote them into the crontab, I did not set them to execute as root.  No, I set them to execute as the user that I thought would be dropping the files — clearly the web server, right?  (WRONG.  The web server proxies requests to Rails, but Rails operates under another user to write the cached HTML.  I knew that on an intellectual level, but it did not occur to me one fateful evening when writing that cron file.)
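(The fix, for the record, was being explicit about which user runs the purge.  A sketch in /etc/cron.d style, where the user field is part of the entry; user and paths invented:)

# m  h  dom mon dow  user        command
0    0  *   *   *    rails_user  rm -f /var/www/dailybingocards/public/cache/index.html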

Compounding Errors

So, as a result, large portions of my website were not getting refreshed when I thought they were.  Instead of a dynamic bingo card publishing empire, with constantly refreshed content, the website was fixed at first generation and then went stale.

Had it stayed stale, I would have eventually figured things out.  But no — I was developing the site, occasionally, and every time I deployed a new version of it, one of the deployment steps had the side effect of nuking the cache directory.  And since the only reason I would be trawling through my own bingo card pages was to see that my changes were taking effect, I never saw the bug.

Meanwhile, between bursts of development activity, the page would look like an unattended ghost town for weeks or months at a time.

Diagnosing The Error, or, Reality Bites Your Hindquarters

Periodically I check what bingo cards my users find most appealing, both for curiosity’s sake and to guide development of new content.  Typically this is largely seasonal in nature — in November, for example, I can tell you without looking that bingo cards for Thanksgiving are going to have a total lock on the most popular crown.  But sometimes I’ll notice other patterns — a church sends my site out in their weekly flier and Bible Bingo gets popular for a week, some school has a unit on Chaucer and suddenly I see a surge in searches about the Canterbury Tales, and the like.

And then sometimes you see things that can’t be explained.  Insects being on the top 10 list for two months.  But you ignore it — maybe some people like bugs.  Cat Breeds for two weeks?  Maybe some people like cats.  But the one that finally clued me in was The Moon.  Because either NASA camp was playing Moon bingo every single day in July or something was up.

The Game Is Afoot

So I checked the database to see what card was scheduled for that day, because the card that gets top billing should typically be at or near the top of the most popular list.  And the database dutifully reported: none.  Which is impossible — had I run out of cards?  I thought I would still have about 100 in reserve.  I checked the database and saw, hmm, closer to 200.  Meaning either I was off on my mental count or about 100 had not been assigned.

One quick Ruby script later, iterating through all the days in summer, and I realized that only 6 days in summer had seen a new card assigned.  The error log showed nothing out of the ordinary for the other days.  As a matter of fact, it showed… nothing for them.  It was like the website wasn’t getting generated at all… or if it was, it was being served 100% out of cache.

Shoot.

Five minutes later I found and fixed the bug.  I was so mad I nearly posted a note about it here, but figured no one would care.  Had I known at the time how bad the impact was, I would have had apoplexy.

Fast Forward Three Months

If you know what happened, this seems like a pretty teeny bug conceptually.  Google sure didn’t see it that way — this bug made a live, growing website look largely dead to the world, and Google accordingly sent most of their searchers to get their bingo cards elsewhere.  (Ever wondered if Google values “fresh” web content?  This is my most convincing experience that says the answer is YES, although there are a couple of reasons why the pages are better with the daily updates happening — reasons which go a bit deeper than “they smell fresh that way” and are out of the scope of this post.)  After the website was restored to vibrant life, Google opened the long-tail floodgates.

And the difference?  I’ve more than doubled most types of traffic, both search related and otherwise.  (Funny: while no user thought this was important enough to email about — and would you email if a favorite website went dead for a while during the summer? — many sure took notice when it was fixed.)  There are a few confounding factors in there (increased advertising budget, seasonal fluctuations, etc) but nothing accounts for anything like the huge run-up you see in the graph.  My sales graph also shows a bit of a bump, to put it mildly.

Word to the wise

1)  Test your website with the same diligence you test your application.  Or, in my case, improved diligence.

2)  Pay attention to your website.  Even the boring bits.  Especially the boring bits that make you money.

3)  Never send a committed Windows programmer to do a Unix sysadmin’s job.

Quick Request Part #2

If someone with Java 1.3 or 1.4 installed on their computer could please download the Windows free trial from the Bingo Card Creator site, install it, and tell me whether you get to the main screen or not, I would be obliged. 

I believe the problem was that earlier I had assumed, having set the compiler to use version 1.3 in Eclipse, that I was getting Java 1.3 class files.  That worked for a long, long time.  Then I made an Ant script and, boom, it stopped working the first time I ran the clear target and Ant used the javac task (which defaults to 1.6) to make the new build.  Of course, having 1.6 on this machine, I never noticed the change.

All I can say is doh!

"Thats Funny, No One Has Bought a CD In Weeks"

I’ve had my best month of sales ever, but only 1 CD in that time.  Typically about half of my customers get the CD.  I had a vague feeling that there were fewer CD orders than usual this month but it didn’t raise any flags with me.  Then I got a fairly typical email saying “How do I purchase your software?”  Thanks for your interest, click the big red button which says Purchase Now.  “That doesn’t say it comes with a CD.”  Thanks for your continued interest, you need to click the one next to the text Purchase a CD.  “That doesn’t work.”  Thanks for your continued interest, you need to… oh, wait.  It actually doesn’t work. 

It seems that when I switched the item numbers in e-junkie (to accommodate SwiftCD integration) I forgot to also switch the item numbers in the e-junkie links on my page.  For some reason this actually didn’t cause a problem for at least the first two weeks.  It had to be working for one of my customers to get an order for a CD through on March 3rd.  At some point after that the e-junkie system began saying “Oh, wait, the item number that link references doesn’t exist anymore” and bailing when you tried to use it to add things to the cart.  I assume that most of my customers who saw this error shrugged and said “OK, I’ll take the download!”, and some, more worrisomely, probably left.

This is one of those bugs that just makes me want to die inside as a programmer.  The systems involved have well-understood interfaces but the inner workings are complex and totally opaque to me.  As a result, bugs are hard to predict except by seeing them, and if their visibility is obscured by whatever system interaction happens, it’s likely that I’ll be the last to know.  I guess the only solution to that is regular monitoring and applying enough concentration to know when the process is out of control.

As long as I’m on the subject of CDs: if you’ll excuse my own HTML coding errors, the integration of SwiftCD and e-junkie has been flawless in every respect.  It’s also cut the amount of customer support I had to do literally in half — back when half of my orders were CDs I spent as much time retyping addresses and invoice numbers into cd-fulfillment.com as I did answering customer emails.  Now delivering a CD takes as much marginal work as delivering a registration key: nothing.  Granted, at my level of sales that’s probably 5 minutes saved 3-4 times a week, but for some reason minor repetitive nuisances like that grate on me far more than their absolute time required would suggest.

Note to those using Inno Setup…

… don’t forget to set the working directory for the shortcuts you create.  I had assumed Windows would automatically default to the program directory.  This is apparently not the case on some systems, and it’s been causing some extraordinarily quirky behavior for some of my users.  (v1.05 and 1.051 use Java’s facility to locate the working directory rather than using the directory the .exe is in, because that logic was causing problems for the Mac port.  Unfortunately, on at least some systems they’ll default to somewhere else instead.  I hadn’t noticed this problem because Inno Setup actually does set a sensible default working directory when you execute the program directly from the installer.)

*sigh* Time for a 1.052.

Laugh To Keep From Crying…

I just found out this morning, through ironically the same customer that was having difficulties yesterday (this must be karma), that a key feature of my software has been disabled in the build on my website for the last 2 weeks.  I had been seeing traffic higher than ever, double the number of confirmed downloads I had in January, and was wondering “Why am I seeing 20-30% of my usual sales?”  Well, it turns out the reason was that I was performing an involuntary A-B test: A with the feature, and B without, based on what day you had first downloaded the software.  A wins by a longshot.

Here’s what happened: I ship Bingo Card Creator with about a dozen preprogrammed word lists.  Since I’m skeptical of my customers’ ability to do the Open File -> Navigate To Folder -> Select File routine, I make this very easy through a Wizards menu.  Click on the Wizards Menu, click your subject, click the menu item which describes what grade level/skill you are working on.  I had known this was a crucial feature when I included it in 1.02 because it alleviates the “empty screen problem” (I essentially can’t sell to someone who hasn’t seen a card printed out, and if you have to type in 25 words before you can print a card then you’re much less likely to invest the time).  I hadn’t known it was quite so crucial though.

Here’s how it broke.  The Wizard menu is autogenerated, and unlike the vast majority of my code the logic is pretty smart.  It spiders a particular subdirectory in the installation, doing a breadth-first search of the file system tree, and making every directory a submenu and every properly formatted file an entry in the submenu it appears in.  This lets me add new items to the Wizard menu without tweaking anything in the Java code.  If a directory is empty, it doesn’t get shown.  If no files are found at all, the Wizard menu doesn’t show, because nothing ticks people off like non-functional programs, right?  (Grumble grumble.)

Anyhow, this feature has worked since 1.02 and I test to make sure it works every time I make a build, because it’s just so critical and because without the feature I’d actually have to type in word lists to test printing, and that is slow.

I test in Eclipse and it works fine.

I test after exporting my JAR and it works fine.

I test after wrapping the JAR in the EXE and it works fine.

I test after building my installer and installing and it works fine.

My customer downloads the installer and sees no Wizards menu.

Can you spot what edge case my testing missed?  Oh, it’s fun — I accidentally introduced a single extra character into my Inno Setup script.  Since my setup script is not in my Subversion repository I never even noticed that I had made the change.  This resulted in the setup program copying the SampleLists (where the wizards reside) folder into the application directory, not into the application/SampleLists directory.  On my machine, since I was installing without uninstalling (which clobbers identically named files but DOES NOT DELETE files not present in the new installer), I still had the old C:\Program Files\Bingo Card Creator\SampleLists directory with the proper structure, and things worked peachy.  Then my old, loyal customers who were doing updates installed and everything worked peachy.  Then new people downloaded the installer and, boom, no Wizards for two weeks.

After work today I’ll release a .01 “upgrade” so that Download.com and all the other places that cache installers pick up the fix.

Lessons learned:

  • Setup script goes into the Subversion repository so that at any time it is either known-good or marked as changed since the last known-good.
  • Uninstall before installing to do final testing.  If I had another machine lying around I’d even do a virtual machine or something so that I was guaranteed a fresh test environment.
  • Keep the lines of communication open with customers.  They can save you from yourself.