Last week I went down to Osaka to give a presentation to the Design Matters group at the Apple Store. I originally prepared a very geeky software-centric dive into the magic of using statistics to improve your software, but I was informed that the audience wouldn’t be as geeky as I had expected, so with great help from Andreas and company I retooled the presentation into something less technical and more interesting on the same topic. I don’t believe it was videotaped, but you can see my presentation and notes on Data-driven Software Design below:
Data Driven Software Design Presentation (plus bonus interview)
Building Highly Reliable Websites For Small Companies
Downtime severely annoys customers. Downtime annoys sole proprietors even more, because it has a funny way of invariably striking at the worst possible time. Apache has no respect for date night. So if you’re a small company without a dedicated ops team, you might well be worried about whether you can reasonably promise customers that you’ll be able to avoid inconveniencing them while still maintaining some semblance of sanity in your own life.
Happily, you can, if you’re savvy about it. I’ve supported thousands of customers and hundreds of thousands of trial users for four years without causing frequent outages, despite not being particularly skilled at server administration or having a huge money or time budget.
Setting Expectations
Let’s get this out of the way: are you a small company dependent on technology? You will have downtime. You will wear a communication device twenty four hours a day for the next several years, and respond with alacrity if it goes off. The purpose of the rest of this blog post is to minimize downtime and have that communication device do as little damage to your relationships and sanity as possible.
Specifically, you will want to:
- Anticipate failure ahead of time
- Minimize the incidence of failure
- Be notified of failures in a timely manner
- Quickly recover from failure
- Learn from failures to prevent reoccurrence
Many of these tips are specific to my personal experiences as an entrepreneur with a small business. If you work in a highly regulated industry, have a dedicated ops team, or are Google, you probably should not be reading my blog to solve your technical challenges.
Identify Risks To Your Service
The key to building reliable systems is first to know where the risks are. Don’t be suckered into thinking that downtime is generally caused by unpredictable black swan events: that is an easy mistake to make when reading stories about reliability from Google et al. This is partly because they have large teams of supergeniuses wielding nearly infinite budgets to build reliable systems, and partly because when they do have downtime they typically phrase the report of it in such a way that it sounds like it was a black swan and not systemic failure to follow routine, easily understood policies seven times in a row plus the black swan that let the system go down. (We’ll be returning to the policy theme in a moment.)
No, your risks are quite predictable, and you can jot them on a piece of notebook paper right now. I’d strongly suggest actually physically doing this, as it helps inform your thinking about what is likely to break and what you’ll need to do to mitigate the risk of that.
Not sure what your risks are? I’ve worked for the last several years in Nagoya, the Town that Toyota Built, and even though I was never in the automotive industry my professional mentors were heavily influenced by it. You know what causes 99% of problems with cars? Moving parts.
It is astronomically more likely for something which moves to fail than something which doesn’t: it is subject to friction, wear, foreign particles, and a thousand other sources of failure. By comparison, all the chassis of the car has to do is not decompose into its constituent atoms, and since it hasn’t done that until now it is a good bet that today will not be the day it picks to do so.
Software systems are also, overwhelmingly, killed by their moving parts.
Hard drives, to be very literal, are one of the only things that actually move in your server, and statistically speaking they’ve got the worst failure rate of any device in the box. Serious engineers treat hard drives as a component that by grace of God has not died yet. That is why we put them in RAID arrays, which abstract the life and death of particular hard drives away from users of the system. I’m rationally ignorant of how RAID actually works: if you want to know, read a book; if you want to do something productive with your day, pay your VPS provider or managed hosting company to deal with this for you. Trust me, you have better things to do than actually touch hard drives.
Side note: This is increasingly becoming true of just about everything [PDF link] in computer systems: scales are such that statistically speaking something is broken right now, so we’ll build our systems in the assumption that they’re broken to a degree unpredictable at run-time, and still squeeze some work out of them. “The Cloud” has not quite brought web-scale computing principles to the smallest software companies yet, but I think it is highly likely we’ll adopt bits and pieces of them eventually. After all, essentially all of us use RAIDs these days, many without knowing it, while they were a Serious Tool for Serious Businesses only a fairly short while ago.
Less literally speaking, your system is most vulnerable where it sees dynamism, complexity, and change.
It is at its most vulnerable when you are working on it and shortly after, because computer systems largely do not rot and once they’ve achieved a steady state tend to stay in it until something exceptional happens. You are your own worst enemy, and you’ll take steps to mitigate the threat you pose, as described later.
Many web applications these days have easy dividing lines between static and dynamic requests. The dynamic requests generally represent a small fraction of the overall total but will cause almost all of the failures. If you have, for example, Nginx proxying to Mongrel, you can be quite confident that Mongrel will fail much, much more often than Nginx will. (In point of fact, Nginx fails so seldom you can almost get away with ignoring the possibility of it happening, since something else will almost surely kill either your service or you personally prior to Nginx dying on you. Carry life insurance and look both ways before you cross the street, but if you have too many things to do and not enough time to do them in, worrying about Nginx failing is something you can probably safely kick down the road.)
Your database is also a high-probability culprit for failing, partially for practical reasons (it is a very sophisticated bit of engineering which by its very nature is highly dynamic) and partially for philosophical ones. For historical reasons, most databases are made/configured around the assumption that it is better to fail loudly and completely rather than fudge the ACID guarantees silently. This is a perfectly reasonable engineering tradeoff for a bank’s transactional systems which might not be optimal for the sheepthrowing statistics your app may be tracking. (The degree of this impedance mismatch is why some folks are very passionate about the whole NoSQL thing when they really just want to drop ACID.)
Then there are a whole host of systems outside of your direct control which can nonetheless bring your system down. Your hosting provider’s network, for example, can fail at any moment, and typically a failure there is essentially a 100% loss of service for the typical web application. Their upstream provider could similarly fail. Any API your application depends on could fail in a hundred ways at any time.
Basically, keep writing on that piece of notebook paper until you run out of obvious sources of failure.
Here’s the abbreviated list of things that could go wrong at my business.
- Operator error
- Operator error
- Operator error
- Hardware failures on the server
- Network failures at Slicehost
- Mongrel fails
- MySQL fails
- Memcached fails
- Delayed Job fails
- Scheduled cron tasks fail
- External APIs (e.g. Mixpanel) fail
- Embedded Javascript (e.g. Google Analytics) fail
All of these have actually happened at one time or another, but most did not cause downtime for my customers.
Mitigating Failures Before They Happen
After you have some idea of what is likely to go wrong, you can start taking actions to mitigate it.
One conceptually easy step is decoupling: make it so that a failure of a particular component can’t bring down the entire system. This isn’t always possible in a cost-effective fashion: for example, in a heavily dynamic web application, it is highly likely that the database failing means you’re going down hard. That is OK. Ideally, your business is not running the power generators at a hospital: you don’t have to eliminate all of the downtime, you just have to minimize it, so go after the low-hanging fruit before addressing “what happens if the database dies”. (Answer: nothing… if you spent a few months of Very Expensive Engineer Time working out a replication/failover strategy. That is overkill when a) your customers are the sort that will tolerate a bit of downtime once every blue moon and b) you’re much, much more likely to bring the server down because you didn’t spend ten minutes writing a deployment checklist.)
One example of decoupling: never call an external API from within an HTTP request/response cycle. That essentially makes your system 100% dependent on the external API being constantly available. Their downtime is now your downtime. Their capacity problems are now your capacity problems. Their operator error is now your operator error.
Instead, do all communication with external APIs asynchronously. There are many common patterns for this. My website calls out to Mixpanel for statistics tracking but the end-user doesn’t care about that, so I just queue up a Delayed Job to do the API call asynchronously and regardless of it eventually succeeding or failing my user never cares. This means that Mixpanel’s (quite infrequent) downtimes have not caused my users any loss of service.
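A minimal sketch of that fire-and-forget pattern, in plain Ruby rather than Delayed Job itself. Delayed Job keeps its jobs in a database table, whereas this stand-in uses an in-memory queue and a single worker thread. The class name FireAndForgetQueue and the job names are made up for illustration. The point it tries to show is that the request path only enqueues, so a failing stats call is recorded somewhere and never reaches the user.

```ruby
require "thread"

# Hypothetical stand-in for a background job queue such as Delayed Job.
class FireAndForgetQueue
  attr_reader :failures

  def initialize
    @jobs = Queue.new
    @failures = []
    @worker = Thread.new { work_loop }
  end

  # Called from the request path: hands the work off and returns immediately.
  def enqueue(name, &job)
    @jobs << [name, job]
    nil
  end

  # Tells the worker to stop once the jobs already queued have run.
  def shutdown
    @jobs << :stop
    @worker.join
  end

  private

  def work_loop
    loop do
      item = @jobs.pop
      break if item == :stop
      name, job = item
      begin
        job.call
      rescue StandardError => e
        # The external API failing is logged here, not shown to the user.
        @failures << [name, e.message]
      end
    end
  end
end

queue = FireAndForgetQueue.new
queue.enqueue("track_signup") { raise IOError, "stats API is down" }
queue.enqueue("track_print") { :ok }
queue.shutdown
```

An in-memory queue loses its jobs if the process dies, which is one reason a database-backed queue is the more robust choice for real work.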
If your users actually need to see the results of the API call, you can schedule the call asynchronously and then do AJAX-y magic to poll the server asking if it has completed yet. If it doesn’t complete in a reasonable amount of time, you can either tell the user so in a nice, customized error message, or you can fall back to something which you can accomplish locally. In many applications, for example, instantaneous response to changes in the underlying data is just “nice to have” rather than a genuine requirement: RSS readers, for example, usually won’t kill anybody if they are a few minutes out of date. If updating an RSS feed fails for whatever reason, you can probably get by with showing the user your most recent cache of it, quite possibly without ever telling the user about the failure at all. (Engineers often are excessively protective of users. Personally, in most applications that don’t involve lives or money, I would rather tell the user a white lie than show them a red error message. This is particularly true when they can’t really do anything to address the error message other than “Try again later [and pray a third party who you don’t even know exists has figured out why they are throwing 501 status codes and addressed it].”)
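The "show the most recent cache" fallback might look something like the sketch below. It assumes a hypothetical StaleOkCache class and a two-second default time limit, neither of which comes from any particular library. The block stands in for whatever slow or flaky call refreshes the data.

```ruby
require "timeout"

# Hypothetical cache wrapper: serves the last good value when a refresh fails.
class StaleOkCache
  def initialize
    @values = {}
  end

  # Runs the block under a time limit. On success the result is cached and
  # returned as [value, true]. On an error or timeout the last cached value
  # is returned as [value, false]; value is nil if nothing ever succeeded.
  def fetch(key, timeout_seconds: 2)
    value = Timeout.timeout(timeout_seconds) { yield }
    @values[key] = value
    [value, true]
  rescue StandardError, Timeout::Error
    [@values[key], false]
  end
end
```

The second element of the return value lets the caller decide whether to mention staleness to the user or quietly serve the older copy.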
Similarly, you can decouple bits of your web infrastructure from other bits. In Rails applications, Mongrel can (and will) fail independently of Nginx. By default, this will result in Nginx showing a forbidding black and white page with a scary error message on it. That is a terrible user experience and you can alleviate it in seconds: create a nice-looking page using nothing but static assets, and have Nginx serve it using the error_page directive. Somewhat contrary to what many engineers might assume, users are often largely mollified by anything showing up on their screens, and a well-written error page is sometimes almost as good as a functioning system.
I know the failwhale is a running joke, but that is just because we see so much of it and are accustomed to our computers mostly working. More typical users have computers eat their documents, freeze, and break all the time for no discernable reason at all, and if you do your job right they may never see your system fail twice, so their first and only encounter with your error page might not actually cost you too much customer goodwill.
Automated recovery is another smart mitigation step to take. I use a process manager called god to watch over my Rails programs, and when a Mongrel starts consuming too much memory or failing to respond to “Are you alive?” messages, god forcibly restarts it. This sounds almost crazy to the Big Freaking Enterprise engineer in me, but practically speaking it eliminates almost all common Rails problems (e.g. poor memory management caused by overly enthusiastic creation of objects and not garbage collecting them efficiently) before they cause a problem for my customers or myself. The god daemon is, similarly, restarted daily to avoid it having memory leaks. Yep, it smells strongly of duct tape, but the duct tape works.
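The decision a watchdog makes on each poll can be sketched in a few lines of plain Ruby. This is not god's configuration syntax: the RestartPolicy class, its thresholds, and the shape of the sample Hash are all assumptions made for illustration. It only tries to capture the two triggers described above, too much memory or too many unanswered "Are you alive?" checks in a row.

```ruby
# Hypothetical watchdog policy, loosely modeled on what a process monitor does.
class RestartPolicy
  def initialize(max_memory_mb:, max_missed_pings:)
    @max_memory_mb = max_memory_mb
    @max_missed_pings = max_missed_pings
    @missed_pings = 0
  end

  # sample is a Hash like { memory_mb: 180, responded: true }.
  # Returns a reason Symbol when the process should be restarted, else nil.
  def check(sample)
    @missed_pings = sample[:responded] ? 0 : @missed_pings + 1
    reason = if sample[:memory_mb] > @max_memory_mb
               :memory
             elsif @missed_pings >= @max_missed_pings
               :unresponsive
             end
    @missed_pings = 0 if reason # a restart gives the process a clean slate
    reason
  end
end
```

A real monitor would call something like this on a timer, gather the sample from the operating system and a health-check request, and run the restart command whenever a reason comes back.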
Minimizing operator error is critically important, because you are the least reliable component of your system. Because you rely on software to do most of the actual work, when you touch the system you’re almost by definition performing something novel that isn’t automated. Novel operations are moving parts and vastly more likely to fail than known-good operations that your system crunches millions of times per day. Additionally, even if what you want to do is absolutely flawlessly planned out, you’ll often not execute flawlessly on the plan. This was one of the root causes of my worst downtime ever.
Happily, the steps to minimize operator error are well understood. Unhappily, they require swallowing a bit of your ego and actually following them. They’re well researched, reproducible, and will save you time in the long run: get over yourself. I had to, too.
If you have to do it more than once, it should be automated or made into a checklist. This includes things like:
- server setup
- server upgrades
- upgrading code on the staging server
- upgrading code on the production server
- any maintenance task
Checklists are very simple: just a textual description of what the list describes, why you’d want to do it, what the exact steps are (i.e. down to what you type into the console), and what the exact steps are for verifying that the procedure was carried out correctly. This last bit is non-optional: subtle failures in maintenance tasks are a frequent cause of downtime, sometimes weeks or months later.
This is one reason we spend so much time on root cause analysis, to demonstrate to skeptical engineers that checklists are like flossing. My dentist tells me “If you think flossing is a nuisance, that’s fine: just floss the ones you intend to keep.” If you think checklists are a nuisance, that is fine: you can feel free to skip checklists for systems where catastrophic failure is no big deal.
I personally keep checklists in text files, because I only have one person to worry about, but in a multi-user organization wikis are fantastic for them. This goes doubly true for wikis which keep version history, because frequently as the system matures and as you respond to issues the checklist will need modification, and rather than tracking Bob down and asking him what this command is supposed to do, a well-written changelog will tell you “That command is there to prevent you from hosing the DB like we did last August.”
Certain checklists are executed very infrequently. Of particular note is one checklist that absolutely everyone should have: how to restore a machine (or machines) from the bare metal to a working copy of the production system. Ideally, you should be capable of doing this in 15 minutes or less, because if you ever have to do it for real it means that disaster has struck and your site is now down. Take 15 minutes out of your busy schedule every quarter and actually run that checklist to make sure it still works. You will frequently find “Oh, effity, we use GraphicsMagick these days rather than ImageMagick but nobody wrote that on here. Hah, silly me.” which, if this were an actual emergency, would have you scrambling to correct while the site was still down. “Scrambling” sounds a whole lot like moving parts, right? Right, you’ll be introducing more errors at the worst possible time to have them, in the middle of an emergency. Get emergency recovery down to a routine, so that when you actually have an emergency (and you will, eventually), dealing with it is a matter of routine.
Automation is your friend. Some organizations get checklist happy and make checklists for procedures which essentially can’t fail and which require no judgement. Those shouldn’t be checklists — they should be shell scripts. That way, they save your engineers time and you can be confident that the latest script in version control (it is in version control, riiiiiight?) is well-tested (it is tested, riiiiight?) and will actually work.
Of particular note in the Rails world: deprec and Capistrano are wonderful tools which automate server setup (very well suited to deployment to a VPS like Slicehost) and application deployment. These are absolute lifesavers and, although I’ve bashed my head against both a few times (typically with integration issues with Windows), they save you from weeks and weeks of script writing. I also sleep much more easily at night knowing that I’ve set up a staging environment in ~8 minutes using my Capistrano script this month and, if everything else went wrong, I could have a server reimaged and loading my database backup almost as soon as I got the phone call.
Be Notified Of Failures In A Timely Manner
Many failures can be solved or mitigated fairly quickly after you get to a computer, which means that time to recovery is dominated by the gap between the failure arising and you being made aware of it. There are easy, reproducible ways to bring that down to “a few minutes or less.”
Low-hanging fruit: The absolute easiest possible solution is to point Pingdom or Mon.itor.us at your home page and have them contact you if it doesn’t resolve. They’re fairly simple: if the server doesn’t respond with HTTP 200 when they try to access it (every 15 minutes or so), you get an email or SMS. This is the simplest thing that could possibly work, but there are circumstances where it won’t catch failures. (For example, applications of non-trivial complexity often have parts which can fail without taking the front page down.)
I recommend creating an internal status page which automatically checks all the things you think are crucial, risky, and tractable to resolution if you were to know about them. (If an external API provider goes down and you already know your response is going to be “I wait until it comes back up”, then no sense disturbing your sleep about it, right?) For example, mine will fail to return properly if Nginx, Mongrel, the Delayed::Job workers, memcached, or Redis is having a bad day. You can then have your external monitoring poll that page and, if they don’t get the HTTP 200 all clear, send you an email.
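The core of such a status page can be quite small. The sketch below is one possible shape, not a description of any real setup: status_page is a hypothetical helper, the check names are placeholders, and answering with 503 on failure is an assumption (anything other than 200 would trip the external monitor). Each check is a lambda that should return a truthy value, for example by running a trivial query or reading a known cache key.

```ruby
# Runs each named check and builds the response for an internal status page.
# A check passes when its block returns a truthy value without raising.
def status_page(checks)
  failed = checks.reject do |_name, check|
    begin
      check.call
    rescue StandardError
      false
    end
  end.keys

  if failed.empty?
    [200, "OK"]
  else
    [503, "FAILING: #{failed.join(', ')}"]
  end
end

checks = {
  "database" => -> { true },
  "memcached" => -> { raise IOError, "connection refused" }
}
status, body = status_page(checks)
```

Listing the failed components in the body means the alert email already says where to start looking.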
For folks who are feeling adventurous (or excessively stingy), you can rig your own server monitoring terminating with a phone call or SMS with Twilio and about 30 minutes of work. If you do this, you can escape the time or notification limitations which the notification services use to segment their customers into “hobbyists” and “enterprises who have an awful lot of money to spend to make sure nothing breaks.” Personally I don’t think it is worth it but, hey, it is an option. (Note that this introduces another moving part into your system which, if it fails, you will probably find out about only when your main system is down… a couple hours after the fact. This is a compelling reason to not be a cheapskate.)
I have a very simple solution for going from notification mails to instant awareness of the problem: my cell phone, which is a cheapo Japanese model, can do custom ringtones for individual callers or mail senders. In the event I get an email from my notification service, it plays Ride of the Valkyries. As I recall, I’ve heard it twice, once while sound asleep and once on date night. (The interesting question of whether I would have begged off of date night may not ever be answered, since I was able to successfully reboot before my girlfriend noticed.)
In addition to the phone-in method of server monitoring, you can also use the phone-out method. I have been using Scout for the last couple of weeks, and it is wonderful. Basically, a cron job running locally reports a variety of statistics to their server every few minutes, and if the statistics are anomalous or the report fails to happen as scheduled, you get detailed warnings of it. My sole problem with Scout is that it is one chatty little robot: by default, it sends me emails about things like not-even-close-to-critical demand spikes (you went to 2% CPU utilization for a minute? Poor baby! My business got linked to from Reddit and requests spiked 1,000% without any performance degradation: great, why are you telling me?). After a few weeks of tuning I’ve mostly shut it up about non-critical notifications. (One other gripe with Scout is that they reserve SMS notifications for their priciest plan. It is market segmentation, I know, but in an age of Twilio it is almost petty. Still, I feel quite satisfied on their $20 a month option.)
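The "report fails to happen as scheduled" half of phone-out monitoring is essentially a dead man's switch. Here is a rough sketch of the receiving side, with everything in it assumed for illustration: the HeartbeatMonitor class, the host names, and the interval and grace period. It says nothing about how Scout itself is built.

```ruby
# Hypothetical receiver side of "phone-out" monitoring.
class HeartbeatMonitor
  def initialize(interval_seconds:, grace_seconds: 60)
    @deadline = interval_seconds + grace_seconds
    @last_seen = {}
  end

  # Called whenever a server's cron job checks in.
  def report(host, at: Time.now)
    @last_seen[host] = at
  end

  # Hosts whose latest report is older than interval + grace.
  def overdue(now: Time.now)
    @last_seen.select { |_host, seen| now - seen > @deadline }.keys.sort
  end
end
```

The grace period is there so that a report arriving a few seconds late does not wake anyone up.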
Quickly Recover From Failure
Ideally, you’re either recovering automatically or you’ve been given timely notice of a failure you anticipated and now all you have to do is open your checklist. Neither of the above? Well then, this is why you earn the big bucks. Godspeed.
Learn From Failures To Prevent Re-occurrence
We’re big fans locally of Five Whys, which has lately achieved a bit of prominence in US startups due to Eric Ries and the rest of the Lean Startups crowd. Boiled down to its essence, Five Whys says that no failure ever has one cause. There might be a single surface-level immediate cause, but the failure is also a symptom of multiple process failures because you had things in place to prevent that failure from happening and they did not trigger or were not effective.
I’m ruthless about doing the corrective action — root cause analysis — in my own business. You can see an example here. I’m more proud of that failure than a lot of my successes, because I learned from it.
Basically, you keep peeling layers of the failure onion until you’re satisfied that you’ve gone deep enough: five layers is a guideline, not a rule. You then invest proportionate resources into making sure that each of the failures does not happen again. This could mean updating procedures/checklists, adding features to your autorecovery code or diagnostics, beefing up employee training, etc etc etc.
A note for those with employees: Five Whys will frequently — very, very frequently — implicate human or cultural issues in your organization. Just trust me on this: you’re not alone, and you can persevere through the difficulties. Resolving critical defects was ironically one of the least contentious processes we had at my ex-day job — even in a Japanese megacorp we had internalized that it was too important to coat with the usual amount of corporate horsepuckey. (It helps that my Japanese megacorp is in Nagoya and that the development practices of a certain large automobile manufacturer are practically state religion here.) Check the egos, chuck the org chart, and make the fixes you need to make to uphold your responsibilities to your customers.
Quick Start For Rails on Windows Seven
Today I killed a few hours getting my Rails environment working on my brand new shiny 64 bit Windows Seven laptop. These instructions should also work with Windows Vista. I’m assuming you’re a fairly experienced Rails developer and just ended up in dependency purgatory like I did for the last few hours.
1. Grab the MySQL developer version for your architecture (32 bit or 64 bit as appropriate) here.
2. Grab Ruby here. I used the 1.8.6 RC2 installer for my 64 bit architecture.
3. Add C:\Ruby\bin to your path. You can do this on Windows by opening the Start Menu, right clicking My Computer, clicking Properties, clicking Advanced / System Settings, and then adding it to the end of the PATH variable on the lower of the two dialogs. Apologies for inexact setting names, my computer is Japanese so I’m working from memory.
4. Verify that your path includes C:\Ruby\bin by opening a new command line and executing “path”.
5. Good to go? OK, execute:
gem install --no-rdoc --no-ri rails
gem install mysql
You’ll get all manner of errors on that MySQL installation. That is OK.
6. Here’s the magic: copy libmySQL.dll from here to C:\Ruby\bin . If you do not do this, you will get ugly errors on Rails startup about not being able to load mysql_api.so.
You should now be able to successfully work with Rails as you have been previously, even from your Windows machine, and you will amaze your Mac-wielding friends.
Getting Interviewed By Andrew Warner at Mixergy
Andrew Warner of Mixergy will be interviewing me at 11 AM Pacific tomorrow, which is something like 14 hours from the timestamp on this post. If it is between 11 AM and noon Pacific, you can catch the live interview and participate in a chatroom. I’m told the main theme for the interview will be a business biography, so my regular readers are likely going to hear a lot of things they already know (“It makes bingo cards! Wow, fancy that.”), but Andrew has a way of wheedling secrets out of people so I’m sure you’ll still enjoy it.
If you have any subject you’d particularly like to hear about, please post it in the comments and I’ll tell Andrew to ask about it.
Interviewed by Gabriel Weinberg [video]
Gabriel Weinberg, the entrepreneur behind the search engine Duck Duck Go, interviewed me earlier today for his upcoming book on getting traction. The video, which runs about an hour in length, is available here.
I always look for a summary of contents prior to committing myself to a video (since they’re so much longer than reading a post), so here you go:
- An outline of Bingo Card Creator’s SEO strategy including…
- using mini-sites
- using widgets
- using scalable content generation
- My thoughts on conversion optimization
- A/B testing
- The multiplicative effect of improvements in your funnels.
- A wee bit of “How do I do it all?” while previously being employed (outsource, automate, eliminate).
- “How’d you end up in Japan, anyhow?”
In fact, I was so convinced that I’d rather read videos than watch them 99% of the time that I took the liberty of transcribing it, with Gabe’s permission.
Some links to things mentioned in the interview:
- The Conversion Optimizer case study Google wrote about me.
- A/Bingo, my OSS Rails A/B testing library.
- Hacker News and the Business of Software boards, who are the smartest minds about online software businesses anywhere, and keep me sane. (I went down to City Hall today and filled out a bunch of paperwork, and the clerk’s response on hearing I was a software developer was “Web applications? Wow, you’re an iPhone developer?!” sigh It is nice to have people who speak your language, and I don’t mean English.)
- SEOMoz and SEOBook. (P.S. I’m not sure if I adequately communicated this when speaking: both are great and I recommend them.)
In somewhat related news, I have an interview scheduled with Andrew Warner of Mixergy.com for April 30th at 11:00 AM Pacific time. Andrew tends to do his interviews live, so if you have any questions you want to ask, be sure to tune in to the live chat. Andrew has told me that he hopes to focus on my business biography, so I assume there will be less technical/marketing/SEO content and more storytelling — it should be fun.
Speaking of which, it looks like he has Peldi from Balsamiq booked for April 28th. I highly recommend all the uISVs in the audience watch that one — Peldi is near the top of our profession in every way, and quite generous with his insights.
I’m absolutely floored that I’m appearing on guest lists next to folks like the 37Signals crew or Eric Ries, who rank among some of the largest influences in how I run my business. Crikey. It is an honor.
CrazyEgg To The Rescue Again
CrazyEgg is really one of my favorite secrets of being a metrics junkie, simply because it makes problems look so freaking obvious, when they could be buried if you relied on quantified analysis. For example, there is nothing I can drill into in Google Analytics, A/Bingo, or my homegrown stats tracking which would have told me what this picture does:
That is my AdWords landing page. I paid good money to get folks to see it, and I want them clicking the photo or the big purple button, not clicking the non-active text in the sidebar! Thank you, CrazyEgg, you again earned your monthly keep and then some.
Sure, now that I know the problem is actually happening, fixing it is a matter of adding one line in a Rails template (to cause those bullet points to be hyperlinks to the conversion). OK, two lines: one line to make the “obvious” fix… and another line for the A/B test.
Running A Software Business On 5 Hours A Week
Some four years ago, I started Bingo Card Creator, a business which sells software to teachers. At the time, my big goal for the future was eventually making perhaps $200 a month, so that I could buy more video games without feeling guilty about it. The business has been successful beyond my wildest expectations and has made it possible to quit my day job at the end of this month. The amount of time I’ve spent on it has fluctuated: the peak was the week I launched (50 hours in 8 days), a very busy week in the last few years spiked up to as many as 20 hours, and the average over the period is (to my best estimate) about 5 hours.
During the majority of the time I’ve had the business, I’ve also been a Japanese salaryman at a company in Nagoya. For those of you who are not acquainted with the salaryman lifestyle, I leave the office at 7:30 PM on a very good day, and have an hour and a half of commute both ways. In our periodic bouts of crunch time, such as the last three months, I end up sleeping at a hotel next to the office (about 25 times this calendar year).
I’m not saying this to brag about my intestinal fortitude — this schedule is heck on your body and life, and absolutely no one should aspire to it. That said, I snort in the general direction of anyone saying a nine-to-five job is impossible to juggle with a business because “businesses require 100% concentration”.
Here are practical, battle-tested ways for you to improve the efficiency of your business and deal with some of the niggles of partial self-employment. They’ll hopefully be of use whether you intend to try running it in your spare time or just want to squeeze more results out of the time you’re already spending. Many of these suggestions are specific to the contours of running a software business on the Internet, which has a lot to recommend it as far as part-time businesses go — take care before trying these willy-nilly with an unrelated industry. (Part-time pacemaker research is probably not the best idea in the world.)
Time as Asset; Time as Debt
The key resource if you’re running a business by yourself is your time. Other businesses might worry about money — however, you’ve probably got all your needs and then some covered by your day job salary, and capital expenditures in our business are so low as to be insulting. (I started my business with $60. Literally.) And the key insight about time is that software lets us take the old saying about how “Everyone gets the same 24 hours per day” and break it open like a pinata.
Time can be stored. One of the great features about currency is that it functions as a store of value: you create some sort of value for someone via your labor, trade that value for currency, and then the currency will retain value even after the physical effect of the labor has faded. For example, a pumpkin farmer might not be able to conveniently store pumpkins, but if he sells them the currency will (under normal circumstances) not rot.
Most people intuitively think that time always rots. You get 24 hours today. Use them or lose them. The foundation of most time management advice is about squeezing more and more out of your allotted 24 hours, which has sharply diminishing returns. Other self-help books exhort you to spend more and more of your 24 hours on the business, which has severely negative effects on the rest of your life. (Trust the Japanese salaryman!)
Instead of doing either of these, build time assets: things which will save you time in the future. Code that actually does something useful is a very simple time asset for programmers to understand: you write it once today, then you can execute it tomorrow and every other day, saving you the effort of doing manually whatever it was the code does. Code is far from the only time asset, though: systems and processes for doing your work more efficiently, marketing which scales disproportionately to your time, documentation which answers customers’ questions before they ask you, all of these things are assets.
The inverse of time assets is time debt. Most programmers are familiar with technical debt, where poor technical decisions made earlier cause an eventual reckoning where forward progress on the program becomes impossible until the code is rearchitected. Technical debt is one programmer-specific form of time debt. Basically, time debt is anything that you do which will commit you to doing unavoidable work in the future.
Releasing shoddy software, for example, commits you to having to deal with customer complaints about it later. So don’t do that. Better yet, rather than a useless bromide like “don’t release bad software”, spend time creating systems and processes which raise the quality of your software — for example, write unit tests so that regressions don’t cause bugs for customers.
However, not all time debt comes from intrinsically negative activities: there are many things that successful businesses do which cause time debt and you probably do not have the luxury of engaging in them. For example, high touch sales processes incur time debt almost as soon as you put out your shingle: you’re committed to spending many, many hours wining and dining clients, often on a schedule that you cannot conveniently control. That is generally a poor state of affairs to be in for a part-time entrepreneur, even though there are many wonderful businesses, small and large, created in high-touch industries.
Code Is About 10% Of Your Business. Maybe Less.
Are you considering starting up a business because you wish to work on wonderfully interesting technical problems all of the time? Stop now — Google is hiring, go get a job with them. 90% of the results of your business, and somewhere around 90% of the effort, are caused by non-coding activities: dealing with pre-sales inquiries, marketing, SEO, marketing, customer support, marketing, website copywriting, marketing, etc.
Bingo Card Creator has been memorably described as “Hello World attached to a random number generator.” If anything, that probably overstates its complexity. Customers do not care, though — they have problems and seek solutions, regardless of whether the solution required thousands of man years of talented engineers (Excel) or one guy working part-time for a week. (You’ll note that you can make bingo cards in Excel, too. Well, you could. Many people can’t. If I sell to them, I don’t necessarily have to sell to you.)
Relentlessly Cut Scope
37Signals had many good ideas in their book Getting Real, but probably the best one is to “Build Less”. Every line of code you write is time debt: it is another line that has to be debugged, another line that has to be supported, another line that may require a rewrite later, another line that might cause an interaction with a later feature, another line to write documentation for.
Cutting your feature set to the bone is the single best advice I can give you which will get you to actually launching. Many developers, including myself, nurse visions of eventually releasing an application… but always shelve projects before they reach completion. First, understand that software is a work in progress at almost every stage of maturity. There is no magic “completion” day on an engineer’s schedule: “complete” is 100% a marketing decision that the software as it exists is Good Enough. If you have to cut scope by 50% to get the software out the door, you’re not launching with a 50% product: you’re launching with 100% of the feature set that is implemented, with 100% of (hopefully decent) ideas for expansion in the future.
Pick Your Problem Well
Long before you sit down to write code, you should know what your strengths are and what your constraints are. If you can only afford to spend 10 hours a week and your schedule is inflexible, then anything which requires calling customers in the middle of the day is out. Scratch B2B sales for you. If you have the graphical skills of a molerat, like myself, you probably should not develop for iPhones. (Minor heresy: while Mac developers are very graphically intensive people who will buy software just to lick it if the UI is good enough, many Mac users are just regular people. My Mac version has a conversion rate fully twice that of the Windows version, and it is not noticeably pretty.)
Some people profess difficulty at finding applications to write. I have never understood this: talk to people. People have problems — lots of problems, more than you could enumerate in a hundred lifetimes. Talk to a carpenter, ask him what about carpentry sucks. Talk to the receptionist at your dentist’s office — ask her what about her job sucks. Talk to a teacher — ask her what she spends time that she thinks adds the least value to her day. (I’ll bet you the answer is “Prep!” or “Paperwork!”)
After you’ve heard problems, find one which is amenable to resolution by software and that people will pay money for solving. One quick test is to see whether they pay money for solving the problem currently: if people are spending hundreds of thousands of dollars on inefficient, semi-manual ways to do something that you could do with Hello World and a random number generator, you may be on to something. (For example, even if you knew nothing about the educational market, you could infer that at least several hundred thousand dollars of reading vocabulary bingo cards are sold every year, just by seeing those cards stocked in educational stores across the country and doing some quick retail math. So clearly people are spending money on reading vocabulary bingo. It isn’t that much of a reach to assume they might pay money for software.)
Another thing to look for in your idea is anything you can see yourself using in the Benefits section of your website to entice people to buy it. (Benefits, not Features. People don’t buy software because of what it does, they buy it for the positive change it will make in their life.) If you think “People should buy this because it will make them money, save them time, and get them back to their kids faster”, then you probably have a viable idea.
Another thing I’d look for prior to committing to building anything is a marketing hook: something you can take advantage of to market your product in a time-effective way. For bingo cards, I knew there were more activities possible than any one company could ever publish, and that gave me hope that I could eventually out-niche the rest of the market. (This core idea still drives most of my marketing, four years later.) Maybe your idea has built-in virality (nice if you can get it; I really envy the Facebook crowd sometimes, although I suppose they probably envy having a customer base which pays money for software), a built-in hook for getting links, or something similar. If you can’t come up with anything, fix that before you build it.
This should go without saying, but talk to your customers prior to building anything. People love talking about their problems to anyone who will listen to them. Often they won’t have the first clue about what a solution looks like, but at the very least repeated similar emotional reactions from many people in a market should tell you that the problem is there and real. After that, it is “just” a matter of marketing.
One note about business longevity: you will likely be involved in this business until you decide to quit. That means planning for the long term. Markets which change quickly or where products rot, such as applications for the iPhone (which have a sales window measured in weeks for all but the most popular apps) or games (which have constantly increasing asset quality expectations and strong fad-seeking in mechanics/themes/etc) interact very poorly with the constraints you are under. I would advise going into those markets only with the utmost caution.
Get Your Day Job Onboard
Don’t do work on your business at your day job. DO NOT do work on your business at your day job. Do NOT do work on your business at your day job. It is morally and professionally inappropriate, it exposes you to legal liability (particularly if your business ends up successful), and it just causes headaches for all concerned.
As long as you follow that one iron law of doing a part time business, all other obstacles are tractable. Many engineers these days code outside the clock — for example, contributing to OSS projects. Tell your boss that you have a hobby which involves programming, that it will not affect your performance at work, and that you want to avoid any misunderstandings about who owns the IP. You can do something culturally appropriate to actually effect that: it might involve a contract, a memorandum of understanding, or even just a promise that there is no problem.
(Aside: I know many Americans consider the last option shockingly irresponsible. My ability to prevail over my employer — a major multinational — in a lawsuit is effectively nil. A contract is just a formalization of a promise. In Japan, the ongoing relationship with my bosses is the part of the agreement that provides security, not the piece of paper.)
One sweetener you can offer any employer: providing you with discretion to continue with your hobby costs the employer nothing, but it will result in you getting practical experience in technologies and techniques you wouldn’t normally get at the day job, and they can then make use of that expertise without having to send you to expensive training or seminars. I generated conservatively six figures in business for my day job as a result of things I learned from my “wee little hobby.” Feel free to promise them the moon on that score — all they have to do in return is not object to your hobby.
Speaking of day jobs: if you know entrepreneurship is in your future, you might pick a job which dovetails nicely with it. Prior to becoming a salaryman I was employed by a local government agency which had stable salaries and a work-day which ended at 4:30 PM. Hindsight is 20/20, but that would have been perfect for nurturing a small business on the side. (What did I do with my free time back when I had so much of it? I played World of Warcraft. Sigh. Youth, wasted on the young…)
Avoid Setting Publicly Visible Deadlines
One thing I did not know four years ago was how dangerous it is to promise things to customers. For example, suppose a customer asks for a feature which is on the release roadmap. I might, stupidly, commit to the customer that “Yes, this will be available in the next release, which I hope to have ready next Monday.” If the day job then has me spend the rest of the week at the hotel, or I have a family emergency, I will miss that deadline and have one ticked-off customer to deal with. That is 100% avoidable if you simply don’t commit to schedules. (Also note that committing to a schedule is time debt, by definition. If you ever say “Yes, I will implement that”, you’ve lost the ability to decide not to implement it if your priorities change.)
One of the most useful things I learned in college was a line from my software engineering professor. “The only acceptable response to a feature request is: ‘Thank you for your feedback. I will take it under advisement and consider it for inclusion in a later version of the software.'” That line actually works. (There are industries and relationships in which it won’t work — for example, if you’re in a regulated industry and the regulations change, you can’t fob the regulatory authority off with that. Don’t be in a regulated industry.)
Release schedules are not the only type of deadline out there. Ongoing relationships with freelancers will occasionally have deadline-like characteristics, too. For example, if you have a pipeline where you generate requests for work and then the freelancer fills it, if you unexpectedly are unable to do your part, the freelancer will be idle. Thus, you want a bit of scheduling flexibility with them, a store of To Be Done On A Rainy Day requests queued up, or a rethink of your relationship such that your brain is not required for them to be able to do their job.
Cultivate Relationships With Effective Freelancers
Dealing with outside talent is one of the most important skills of being a part-time entrepreneur. It lets you work more hours than you have personally available, it lets you use skills that you don’t possess, and especially when combined with software you’ve written you can do truly tremendous things with a little bit of elbow grease. Many folks get started with freelancing from posting to sites like Rentacoder (awesome article about which here) or Craigslist. That is fine: everyone has to start somewhere. However, you’ll quickly find that there is literally a world of people out there who are willing to work for $1.50 an hour… and would be terrifically overpaid at that price.
My suggestion is that, when you find a freelancer who you click with, hold onto them for dear life. Pay them whatever it takes to keep them happy. Additionally, since most clients are just as incompetent as most freelancers, don’t be one of the flakes.
- Pay freelancers as agreed, promptly. I jokingly refer to my payment terms as Net 30 (Minutes), and that ends up being true 90% of the time.
- Provide sufficient direction to complete the task without being overbearing. (Freelancers with a bit of personal initiative are worth their weight in gold.)
- Don’t schedule things such that freelancers are ever blocking on you or that you are ever blocking on freelancers. You have all the time in the world if you get things done well in advance of need. For example, I just got my St. Patrick’s Day WordPress theme done, for next year. If I were getting the Easter bingo site cranked out now, any hiccup would mean it missed my window. (Technically speaking it would already be too late for SEO purposes, but that is a long discussion.)
- Recurring tasks are a great thing to systemize and outsource. You can write software to do the painful or boring bits, greatly increasing productivity, and as your freelancers get more experienced at the task you take on less time debt for explanation and review of their work.
Speaking of which, the most successful freelancing relationships are ones where you correct the labor market’s estimation of someone’s value. (That is the positive way to say “You spend much less on them than you’d pay someone else for the same work and they’re happy to get it because you’re the highest paying offer.”) Much ink has been spilled about how the globalization of labor makes it possible to get work done by folks in low-wage countries. To the extent that you identify skilled, reliable workers, this is certainly one way to do things, but it is not the only way. The current economic malaise has left many folks in high-wage countries either unemployed or underemployed. In addition, the labor markets have huge structural impediments to correctly valuing the expertise of stay-at-home mothers, retirees, and college students. All of those are potentially good resources for you.
Understand the Two Types of Time
There are two types of time involved in business: wall clock time and calendar time.
Wall clock time: minutes/hours which you spend actually working.
Calendar time: days/weeks/months/years where time passes so that something can happen.
We expect the world to be very, very fast, because the Internet is very, very fast, but when dealing with non-Internet processes we are frequently reminded of how slow things are.
Paul Graham mentions this as one of the hard things to learn about startups. I really like his metaphor for how to deal with it: fork a process to deal with it, then get back to whatever you were doing. For example, while Google rebuilds its index in a matter of minutes these days (this blog post will be indexed within fifteen minutes of me hitting the post button, guaranteed), getting a new site to decent rankings still takes months of calendar time. That doesn’t mean you stand around waiting for months — you get your site out and aging as fast as humanly possible, and then start working on other things. Get good at task switching — you’ll be doing it a lot. (I literally just alt-tabbed to Gmail and squashed a support inquiry.)
You can incorporate calendar time into your planning, too, and since it is essentially free to you (you’re planning on being here in a week, right?) it is often advantageous to do it. For example, A/B testing requires lots of calendar time but very little wall-clock time: you spend 15 minutes coding up the test and then have to wait a week or two for results. That works very, very well in a part-time business. Often, you can get into a rhythm for feedback loops like that. Do whatever works for you: for me, Saturday typically sees me end my old tests and start new ones.
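To give a sense of how little code a simple test needs, here is a minimal sketch in Ruby. It is not A/Bingo's implementation: the `ABTest` class, its method names, the alternative labels, and the visitor ID are all hypothetical, chosen only for illustration. It hashes a visitor ID so that the same visitor always sees the same alternative, then tallies participants and conversions while the calendar does the rest of the work.

```ruby
require "digest/md5"

# Hypothetical, illustrative A/B test tracker (not A/Bingo's API).
class ABTest
  attr_reader :name, :alternatives, :participants, :conversions

  def initialize(name, alternatives)
    @name = name
    @alternatives = alternatives
    @participants = Hash.new(0)
    @conversions = Hash.new(0)
    @seen = {}       # visitor_id => alternative already counted
    @converted = {}  # visitor_id => true once a conversion is scored
  end

  # Deterministically pick an alternative by hashing the test name plus
  # the visitor ID, so repeat visits always get the same answer.
  def choose(visitor_id)
    digest = Digest::MD5.hexdigest("#{name}:#{visitor_id}")
    alternative = alternatives[digest.to_i(16) % alternatives.size]
    unless @seen.key?(visitor_id)
      @seen[visitor_id] = alternative
      @participants[alternative] += 1
    end
    alternative
  end

  # Record at most one conversion per visitor who took part in the test.
  def convert!(visitor_id)
    alternative = @seen[visitor_id]
    return if alternative.nil? || @converted[visitor_id]

    @converted[visitor_id] = true
    @conversions[alternative] += 1
  end

  def conversion_rate(alternative)
    return 0.0 if @participants[alternative].zero?

    @conversions[alternative].to_f / @participants[alternative]
  end
end

experiment = ABTest.new("landing_page", ["text_heavy", "image_and_button"])
shown = experiment.choose("visitor-42") # same alternative on every visit
experiment.convert!("visitor-42")       # call this when the visitor buys
puts "#{shown}: #{experiment.conversion_rate(shown)}"
```

Hashing rather than picking at random means there is no per-visitor assignment to store between requests, which keeps the wall-clock cost of setting up a test close to zero.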
Avoid Events, Plan For Processes
There is a temptation to see business as a series of disconnected events, but that should probably be avoided. For example, you might see a dozen emails as a dozen emails, but it is probably just as true that it is six of Email A, three of Email B, and three emails with fairly unique issues. You should probably turn your response to Email A and Email B into some sort of process: address the underlying issue, write your web page copy better, add it to your FAQ, create an auto-text to answer the problem, etc etc.
Similarly, spending your time on things which help your business once is almost always less effective than making improvements which you can keep. For example, running a sale may boost sales in the short term, but eventually the sale will end and then you cease getting additional advantage from it. There is a time overhead associated with running the sale: you have to promote it, create the graphics, code the logic, support customers who missed the sale by 30 minutes but want the price (give it to them, of course), etc etc. Spend your time on building processes and assets which you get to keep.
Another example: attempting to woo a large blog to post about you may require quite a bit of time in return for one fleeting exposure to a fickle audience. Instead, spend the time creating a repeatable process for contacting smaller blogs, for example something along the lines of Balsamiq’s very impressive approach. (Other examples: a repeatable piece of linkbait such as OKCupid’s series on dating also works, as does a repeatable method of building linkable content or a repeatable way of convincing customers to tell their friends about you.)
You can also avoid spending hours on incident response if you spend minutes planning your testing and QA procedures to avoid it. When they fail — and they will fail — fix the process which permitted the failure to happen, in addition to just responding to the failure.
Document. Everything.
I’m indebted to my day job for teaching me the importance of proper internal documentation. As weeks stretch into months stretch into years, no matter how good of a memory you have, you will eventually have things fall through the cracks. Your business is going to produce:
- Commit notes. Thousands of them.
- Bug reports.
- Feature requests.
- Pre-sales inquiries
- Strategic decisions
- Statistical analyses
… etc, etc. The exact method for recording these doesn’t matter — what matters is that you will be able to quickly recall necessary information when you need it.
I tend to have short-term storage and long-term storage. Short term things, like “What do I need to do this week?”, get written down in a notebook that I carry with me at all times. (I lock it in the drawer when I get to work, but feel no compunction about sketching things on my train ride.) Things that actually need to get preserved for later reference go into something with a search box. This blog actually serves as a major portion of my memory, particularly for strategic direction, but I also have SVN logs (with obsessive-compulsive commit notes… often referencing bugs or A/B tests by number), email archives, and the like. (One habit I picked up at the day job is sending an email when I make a major decision outlining it and asking for feedback. Note this works just as well even if you’re the only person you send it to — at least you’ll force yourself to verbalize your rationalizations and you can compare your expectations with the results later.)
There are a million-and-one pieces of software that will assist in doing this. My day job uses Trac, which has nice SVN integration. I have heard good things about 37Signals’ stuff for project planning/management and also about Fogbugz for bug tracking. Use whatever works for you.
Note that quality documentation of processes both prevents operator error and makes it possible for you to delegate the process to someone else. Also, if you have eventual designs on selling this business, comprehensible and comprehensive documentation is going to be a pre-requisite.
Dealing With The Government
I’ve been pleasantly surprised how little pain I’ve suffered in dealing with the government. Part of this is because software is such a new industry that we often slide by on regulation — if I ran an actual Italian restaurant instead of the software analogue, I would have to keep health inspectors happy on a regular basis, but there is (thankfully) no one auditing my code quality. Speak with competent legal advice if you’re not sure, but for the most part the only thing Japan and America want from me is that I pay my taxes on time.
Paying taxes is not weeks of hard work. It is really freaking easy. The typical Italian restaurant has to do lots of bookkeeping involving thousands of sales, most of them involving cash, has to juggle record-keeping demands for a half-dozen employees, and has expenditures ranging from rent to wages to capital improvements to food with a thousand rules about depreciation, etc, to worry about. By comparison, the typical software business gets half of bookkeeping for free (if you can’t tell me to a penny how much your software business has sold this year with a single SQL query… well, I don’t know whether to deride your intelligence or congratulate you on your evident success), we have absurdly high margins so if you forget to expense a few things it won’t kill you, the number of suppliers we deal with is typically much lower, and the vast majority of what we do is amenable to simple cash accounting.
Additionally, your local government almost certainly has a bureau devoted to promoting small businesses. They are happy to give you pamphlets explaining your legal responsibilities — in fact, sometimes it seems the only thing they do is create ten thousand varieties of pamphlets. Your local tax office will also fall over backwards telling you how quick and easy it is to pay them more money.
Incorporation? Incorporate when you have a good reason to. (I still don’t, but I might do it after I go full-time, largely for purposes of dealing with Immigration.) If you’re selling B2C software, your number one defense against getting sued is promptly refunding any customer who complains, and that pretty decisively moots the LLC’s (oft-exaggerated) ability to limit your personal liability. You’ll be personally liable for debts from the business, but since the business is fundable out of your personal petty cash that isn’t the worst thing in the world. If sales collapsed tomorrow I’d be on the hook for my credit card bill, which runs about $1,200 a month — not a financial catastrophe for an employed professional, particularly when the business generates far more than that in profits well in advance of the bill being due. Sole proprietorship — i.e. merely declaring “I have a business” — is the most common form of business organization, by far.
Ask Someone Else About Health Insurance
I’m only putting this here to mention I have no useful information, because I live in a country with national insurance. That isn’t a veiled political statement — I am not really emotionally attached to either model, I just don’t have useful experience here. (My impression is that young single businessmen around my age are probably well-served with getting cheap catastrophic coverage.)
Keep A Routine, When Appropriate
Through sickness, health, and mind-numbing tedium, I’ve woken up every day for the last four years, checked email, gone through the day, checked email, and gone to sleep. This is the single best guarantee that I would deliver on the promised level of service to customers — almost all questions answered within 24 hours. There have been many, many weeks where this is literally all I’ve done for the business.
I try to keep creative work (such as writing, coding, or thinking up new tacks for marketing) to a bit of a routine, too, with flexibility to account for days where I’m not mentally capable of pushing forward. For example, generally I do planning for the week at dinner on Monday and devote a four-hour block to the business on Saturday. If on Saturday it turns out that I can’t make forward progress on the business, I clock out and go enjoy life.
Routines aren’t limited to the business, either. They help me incorporate my other priorities — family, friends, church, gym, hobbies — into a schedule that would otherwise descend into total anarchy. (If you want to see what happens to the things that I don’t prioritize when the day job starts knocking, well, suffice it to say that I was cleaning today and removed 13 pizza boxes from my kitchen table. I hope to put both cleaning and cooking back in the rotation after separating from the day job.)
Seek The Advice Of Folks You Trust. Disregard Some Of It.
One of the major things which pushed me to (a small measure of) success these last four years has been advice from the communities at the Business of Software boards and Hacker News and the writings of folks like Joel Spolsky, Paul Graham, and the 37Signals team. Much of the advice I received has been invaluable. I disagree quite strongly with some of it. When reading advice from me or anyone else, keep in mind that it is a product of particular circumstances and may not be appropriate for your business. And always, always, always trust the data over me if the data says I’m wrong. (That’s the easy part. The hard part is trusting the data when it is overruling you.)
I’m thinking of making this first in a series. If you have topics you’d like me to cover in more detail, please, let me know in the comments.
Getting Ready For Going Full Time
I’m quitting my day job as of March 31st. Today I was running out to lunch prior to a day packed with various uISV-centered activities and it struck me: crikey, this is really happening. I’m exhilarated and have that weird not-scared-not-settled feeling one gets when one has run out the door in a hurry and is absolutely positive they have left something behind despite having verified the presence of cellphone, wallet, and day job ID card. (Oh, to no longer need the day job ID card.)
Let’s see, what is new and exciting:
Mini-sites: I met a new designer over the Internet who is helping me out with my stable of mini-sites, such as the (previously created) Easter bingo cards that I’m sure are going to be very popular over the next two weeks.
A/B tests: I have six A/B tests currently ongoing, which I think is my personal record for simultaneous A/B tests.
1) Landing page: Text heavy versus an image, a few sentences, and a signup button.
2) Shopping cart: A tweak of some microcopy (“Thank you for your interest in Bingo Card Creator!” -> “Get instant access to Bingo Card Creator!”)
3) Shopping cart: Addition of microcopy (“You don’t need an account to pay with your credit card through Paypal.”)
4) Online version dashboard: Addition of one-click access to top 10 word lists. Seems to be a crashing failure at increasing task completion, incidentally. I’ll be reverting this one in a moment.
5) Guest signup: Presence/absence of guest signup seems to have no effect whatsoever on sales. Good — I’m going to probably disable it, as they cost me support issues through the wazoo.
6) Button redesign. I found another talented designer recently, and decided to get a few dozen buttons drawn up to start testing versus my existing ones.
The old buttons:
The new buttons:
After seeing whether those new buttons are roughly comparable to the old ones or not, I had the designer make many, many variations on color and button texts. As always, I’m testing my way to victory, one 3~5% lift at a time.
Next Application: I have more or less mentally committed myself to my next application, although I should probably give it some more thought. Hint: it uses Twilio. Stay tuned to the same blog time, same blog channel for when I have something to announce.
Substantive Blog Posts: I’m working on a few long-form posts about what it is/was like running a business in my side time and how to do that without losing your sanity. If there are particular aspects of that you’d like to hear about, I’d love to hear them. Folks keep telling me to focus on time management techniques, so I guess I’ll be covering that, but I think they’re a little dry.
Stats Bug In A/Bingo v1.0.0 and earlier
Many thanks to Ivan for reporting this one: there is a significant bug in A/Bingo’s calculation of z-scores for versions 1.0.0 and earlier, which borks substantially all z-score calculations and in some cases can change whether A/B test results are reported as statistically significant or not.
The bug is all of one character long:
def zscore()
  #omitted for clarity
  cr1 = alternatives[0].conversion_rate
  cr2 = alternatives[1].conversion_rate
  n1 = alternatives[0].participants
  n2 = alternatives[1].participants
  numerator = cr1 - cr2
  frac1 = cr1 * (1 - cr1) / n1
  frac2 = cr2 * (1 - cr1) / n2 #this line is bugged: (1 - cr1) should be (1 - cr2)
  numerator / ((frac1 + frac2) ** 0.5)
end
I have fixed the bug (via the Slicehost console, on a Japanese cafe Internet PC, because I am stuck in Nagoya today again) and pushed the fix to the git repository.
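For reference, here is a sketch of the corrected calculation as a standalone Ruby function. It is not the code as committed to the repository: the free-standing `zscore` method and its four parameters are a simplification for illustration, standing in for the `alternatives` objects in the method above. The only substantive change is that the second variance term now uses `cr2` throughout.

```ruby
# Two-proportion z-score with unpooled variances, with the fix applied.
# cr1, cr2: conversion rates as floats in 0.0..1.0
# n1, n2:   participant counts for each alternative
# (Standalone signature is an assumption made for this sketch.)
def zscore(cr1, n1, cr2, n2)
  raise ArgumentError, "need participants in both alternatives" if n1 <= 0 || n2 <= 0

  numerator = cr1 - cr2
  frac1 = cr1 * (1 - cr1) / n1
  frac2 = cr2 * (1 - cr2) / n2 # fixed: previously (1 - cr1)
  numerator / ((frac1 + frac2) ** 0.5)
end

# Example: 10% of 1,000 participants versus 13% of 1,000 participants.
puts zscore(0.10, 1000, 0.13, 1000)
```

Note that if both conversion rates are exactly 0 or exactly 1, the denominator is zero and the result is not a usable number, so callers would want to guard against that case.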
Does this make my results invalid?
You can probably still have confidence in results you got from A/Bingo previously. While the numerical calculation of the z-score was borked, it was borked in a subtle enough fashion that most statistically significant tests will retain their statistical significance under the borked calculation and most statistically insignificant tests will not gain statistical significance magically as a result of the borked calculation. (My quick eyeball suggests that it causes BCC to overstate the significance of tests which are very significant and understate the significance of tests which are insignificant, which is a very fortuitous set of properties for a random bug to have in an A/B testing framework.)
I have re-run statistical confidence tests for everything I’ve ever done for BCC that I still have data for, and no experimental results changed as a result of the error. Nonetheless, I deeply regret the bug, and will write unit tests for the statistics code as soon as I am physically capable of doing so to rule out the possibility of this sort of thing in the future.
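As a sketch of what such tests might check: swapping the two alternatives should flip the sign of the z-score and change nothing else. The bugged formula breaks that symmetry, so a check along these lines would have caught it. Everything named below (`check_zscore`, `FIXED`, `BUGGED`, and the sample numbers) is hypothetical and not part of A/Bingo.

```ruby
# Corrected formula, written as a lambda so the checks can take any callable.
FIXED = lambda do |cr1, n1, cr2, n2|
  (cr1 - cr2) / ((cr1 * (1 - cr1) / n1 + cr2 * (1 - cr2) / n2) ** 0.5)
end

# Reproduces the one-character bug: (1 - cr1) in the second term.
BUGGED = lambda do |cr1, n1, cr2, n2|
  (cr1 - cr2) / ((cr1 * (1 - cr1) / n1 + cr2 * (1 - cr1) / n2) ** 0.5)
end

# Returns a list of failure messages; an empty list means every check passed.
def check_zscore(fn)
  failures = []
  z_ab = fn.call(0.10, 1000, 0.13, 1200)
  z_ba = fn.call(0.13, 1200, 0.10, 1000)

  if (z_ab + z_ba).abs > 1e-9
    failures << "swapping alternatives should only flip the sign"
  end
  unless fn.call(0.2, 500, 0.2, 500).abs < 1e-12
    failures << "identical alternatives should score zero"
  end
  unless z_ab < 0
    failures << "lower conversion rate listed first should score negative"
  end
  failures
end

puts "fixed:  #{check_zscore(FIXED).inspect}"
puts "bugged: #{check_zscore(BUGGED).inspect}"
```

Property-style checks like these do not depend on a hand-computed expected value, which makes them a cheap first line of defense; a real test suite would presumably add a few known-answer cases as well.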
Business Stats On A Photo Frame
I got inspired by a blog post from Panic, a Mac software company, and created myself a dashboard for the business, currently residing on a photo frame right on my desk. The full writeup is on my main site, including code if you want to use it.