I’m Joining Stripe to Work on Atlas

I’m joining Stripe to work on Atlas. Let me tell you why.

(If you don’t know me, howdy, I’m Patrick. I’ve run a succession of small software businesses from Japan for the last 10 years. They’re documented in a bit of depth on this blog and in HN comments, conference presentations, and the like.)

Some years ago, when I was still working for a Japanese company, I had a wee little side project called Bingo Card Creator. I wanted to be able to take payment for it over the Internet. Options for doing this in 2006 were absolutely terrible. I eventually signed with a shareware payment processor who wanted to keep 11% of sales for renting me their merchant account, and “signed” is to be read literally: it cost me over 1/4 of my budget to send them a fax internationally accepting their (egregious) terms and conditions.

I was quite excited when Stripe launched back in 2011, and quickly moved all of my businesses to it. In addition to being transformatively better from an API and business perspective, it was obvious from the first that Stripe gets people like me. (See, in particular, where their CEO Patrick Collison helped me debug a Ruby dependency issue with the use of git bisect. That actually happened.)

The Global Community of Small Software Entrepreneurs

Neither Chicago (where I grew up) nor Ogaki, Japan (where I lived when starting my business) is the beating heart of the global software industry. I knew no one in software growing up and assumed that the only options available were safe boring jobs at big boring companies. When a classmate at college interviewed with Google, everyone I knew advised her to get a job at a consulting company, instead, rather than taking a flyer on a flash-in-the-pan Internet thingy.

That social and educational background explains a bit of how I found myself doing soul-crushing J2EE Big Freaking Enterprise Web Applications as a Japanese salaryman. (Ask me for the full story sometime.) It doesn’t explain how I escaped that hellish existence.

That was largely due to a single decision of Joel Spolsky: he made a forum attached to his blog where people interested in the business of software could talk shop. I read it obsessively in my spare moments, and it introduced me to the unthinkable notion that regular old geeks like me could run software companies. I thought it was illegal, because, like Jon Snow, I knew nothing — nobody had ever told me that “creating intellectual property” is something that you don’t need to ask permission to do and my background therefore suggested it was either forbidden, risky, or risked being forbidden.

I never thought I could build a Fog Creek, but I saw a bunch of other geeks building Poker Co-pilot and Perfect Table Plan and skeet-shooting scoring software, and I was pretty sure I could at least do something like that.

So I released Bingo Card Creator, and — still cognizant of the fact that I knew nothing — I decided to blog about what I was learning in real time. And I kept doing that for the next few years. I knew precious little about software development, but luckily the Internet is full of people who do. I knew nothing about marketing or sales, but the forum regulars were happy to talk me through the basics of AdWords and the Internet had some decent guidelines for SEO. And so on.

Fast forward ten years: I’ve built three software companies, worked in others, and now know more-than-nothing. My wee little blog and HN comments now total about 3 million words, and other geeks have reported to me that they were able to use them to get their own businesses / careers off the ground, which makes me happier than anything else businesswise.

I’ve also spent a lot of time (a lot of time) online and offline with a worldwide community of practice. Just like there is a cognizable tribe of Rubyists or Pythonistas or ballet dancers, who share something with all other Rubyists or Pythonistas or ballet dancers regardless of borders or language, We Who Work For The Internet have gone out, found each other, and decided we rather like each other’s company.

We have forums. We have (nascent) professional journals. We have conferences. We started talking shop and we ended up as friends.

I’ve gone to work substantially every day for the last 10 years in my own software company, fixing bugs and running AdWords campaigns, but my biggest impact — and the most personally fulfilling one — has been helping others start and grow their own businesses.

Silicon Valley Is Full Of Crazy People

As part of rubbing elbows online with my tribe, I’ve had contact with another tribe over the years, which is venture-backed Silicon Valley entrepreneurs. To paraphrase a remark made by a Japanese businessman of my acquaintance, they’re a society organized around attempting to find the optimal level of crazy.

When you have too much crazy, you start a social network for cat photo sharing and say — in all earnestness — that it will change the world for all days to come.

When you have too little crazy, you end up taking a safe job at a megacorp and staying even though you hate it.

When you have just enough crazy, you found a payments company, heedless of the fact that founding a payments company is doomed to failure because it involves mountains of hard and boring work and the incumbents have billions of dollars.

Patrick and John Collison are close-to-optimally crazy. When Patrick says “Stripe’s goal is to increase the GDP of the Internet”, that isn’t like Cthulhu’s goal to make a dent in the universe. It implies real people are going to make real businesses selling real products and thereby experience real improvements in their real lives. I know this works. I have seen it. I have lived it. I have watched it work for hundreds of friends and acquaintances.

Atlas Is Crazy

A couple of months ago, Patrick Collison came to me with another crazy idea. He said Stripe wanted to make “simple incorporation as a service”, so that any entrepreneur worldwide could have a corporate entity and a bank account spun up about as easily as they could get an EC2 server.

This idea is crazy. I’ve incorporated four companies and opened business bank accounts for all of them. The most recent required over a hundred pages of documentation and six weeks of negotiation to assuage a risk department’s concerns about foreign tech entrepreneurs. (Thanks, Bitcoin.) You’re not supposed to be able to do this.

Stripe did it. With crazy speed: the project was in beta within 11 weeks of conception. It can take that long to form a single company in much of the world. Stripe solved the problem like an engineer: establishing one company requires an annoying amount of form-filling so instead of buckling down and doing it you just make a company-establishing web application and abstract away form-filling for all time.

And they’re crazily ambitious about where it ends up: not simply incorporating companies, but eating all of the crufty back office work which distracts Internet businesses from getting more real products into the hands of real customers. Payments, contracts, invoices, bookkeeping, incorporation, taxes, etc etc, all things you have to do even if what you’re actually doing is selling bingo cards to elementary schoolteachers.

Patrick describes the success case for Atlas as being visible in global macroeconomic indicators, which is crazy. That said: if you’ve seen how fragile new businesses are at the margins (“Can I get a bank account? Can I get an insurance policy issued in time to close this first deal? Can I get a corporate entity spun up to actually be able to sign a lease or write employment contracts?”), then interventions early in the business funnel may well increase the number of successful businesses surviving to major milestones like launch, profitability, bringing on employees, and sustained economic impact. And if you believe that new businesses are where economic growth is going to come from, that sounds very impactful. Perhaps even crazily impactful, for well-established economies like the US and Japan and for emerging economies worldwide.

The Crazy, It Is Catching

My co-founders and I made the decision recently to wind down our last business, Starfighter, and pursue new adventures. I thought I was going to sell Appointment Reminder (the SaaS business which I’ve run for the last few years, and never really loved) and start another small SaaS business. That would pay the rent and let me continue writing and speaking with other entrepreneurs, which is the part of this gig that I really enjoy.

Patrick had a crazy proposition for me: why don’t I come work for Stripe on Atlas.

I found myself saying yes, largely because I think the potential impact of Atlas (and Stripe generally) is crazily high.

There are probably about a hundred thousand We Who Work For The Internet. There will shortly be a few million, and there will eventually be hundreds of millions.

The firm is a technological innovation which changed the world forever. The Internet is a technological innovation which changed the world forever. Firms which live on the Internet are already happening. Their growth, in number and aggregate impact, is inevitable.

There Is No Future For Scarcity

(This section owes more than a little bit to Naval Ravikant’s thoughts on leverage, most succinctly captured here.)

There is no conceivable future in which Internet-enabled firms are less numerous than e.g. insurance agents, which the US alone has ~450,000 of.

There is no conceivable future where it becomes harder to make products people are willing to pay for than it is in 2016. The technologies will change, but Rails is now the lower bar of how productive a software platform can be. Getting your physical product manufactured by a contract factory in China is within the capacity of college students in the first world; that or similar capabilities will only get more broadly distributed.

There is no conceivable future in which it gets harder to charge people money than it is in 2016. It cost $250,000 and six months of integration to charge a single credit card online in 1999. By 2006, that was down to hundreds of dollars and weeks of effort. In 2016, the integration can be done in a morning and the costs have likewise cratered. We will not forget how to do it.

There is no conceivable future where the network-effect businesses we use to reach customers — Google, AdWords, Facebook, the App Store, Twitter, Kickstarter, Alibaba, etc — collectively retreat in the number or aggregate affluence of the potential customers they can address. (See footnote.)

There is no conceivable future in which the number of people connected to the Internet shrinks. There is no conceivable future where smartphones become more exclusive products than they currently are.

There is no conceivable future in which the percentage of transactions consummated online decreases from its present ~1% in the most connected economies. The number of transactions worldwide will rise, by orders of magnitude.

These fundamental economic forces will continue bringing down the cost of starting new businesses and increasing the potential impact they have, even at very low levels of capitalization. We will see an explosion of them.

So Why Hasn’t It Happened Yet?

Our future colleagues are presently prevented from starting by a host of logistical difficulties and informational barriers. They don’t have business bank accounts. They don’t know what bookkeeping is. They’ve never negotiated a master services agreement. These are all things that can be learned, but the depth and breadth required from an aspiring entrepreneur feels forbidding.

So I’m joining Atlas to work on community and communication, which means something like “scalably educate the world’s Internet-enabled entrepreneurs, reduce any barrier to entry, and assist the global community of practice in growing to accommodate a lot more people.”

Hopefully the next few years look a lot like the last few years: code, write, talk, present, and be as helpful to as many people as possible. The main difference is that I now get scored on that impact directly, as opposed to it being a fun sideline hobby while my day job is actually shipping and selling business productivity software.

Stripe’s interest in increasing the GDP of the Internet is fairly transparent: they intend for Stripe to be the obvious choice for payment rails of every Internet business, and for Atlas to be the obvious choice for the back office of every Internet business. If that comes to pass, Stripe will be an enormously successful company.

Back To Working In Japan

I spend most of my working time on the Internet, but I actually live in Japan, and rather enjoy it here. If I have one professional regret, it is that fairly little of my work directly improves my local community.

I’m excited for the future of Stripe Japan.

Stripe Japan has a weird constraint where it has to be true to being Stripe (an entrepreneur-friendly focused-on-developers company) and, simultaneously, an authentically Japanese company.

I happen to be a bilingual developer-turned-entrepreneur who can pull off Responsible Japanese Salaryman when required. I look forward to helping Stripe Japan earn the trust of Japanese companies and be a pleasant environment for a multinational, multicultural team to work in. Also, I can help the mothership understand when they’re asking Stripe Japan to operate at incorrect levels of crazy.

We hope to grow Stripe Japan to a thriving software company in its own right. In addition to directly employing people in jobs which are better than those historically available here, we hope to provide additional evidence that a better way of working is possible.

Naturally, we hope to reduce payments as a barrier to entry among Japanese companies. There is a mind-boggling amount of cost and complexity to taking non-cash payments in Japan. Ask me if you’re curious, but the short version is that historically minimum-viable-online-payments has required more than a million dollars in capital. (Thankfully, this is starting to change.) That’s why anecdotally Japanese entrepreneurs have told me for years that they’re more jealous of Americans having access to Stripe than any other company.

We’re going to give entrepreneurs at all stages the Stripe experience that US startups now take for granted and also access to the walled gardens of Japan’s native payments ecosystems.

There are plenty of barriers to running a company in this country. Historically, the mechanics of collecting money are a major one. We will solve that problem.

Back to Being An Employee

I didn’t think I was ever going to be an employee again, and honestly, I have mixed emotions about it. That’s why I turned down other opportunities in the past.

Stripe is offering a very compelling mixture of impact and autonomy. I’m particularly attracted to the company culture of employees not being restricted to individual job descriptions, but rather getting autonomy and ownership to bring projects to completion. That appeals to the broad-spectrum generalist in me. It helps that the Atlas team is presently tiny and in dire need of broad-spectrum generalists. Some people would feel overwhelmed if asked to write Ruby, negotiate with vendors, and construct lifecycle email sequences — that sounds to me like a fulfilling way to spend any given Tuesday.

At the same time, I’m looking forward to working with a team. One of my favorite parts of Starfighter was working with smart co-founders and not having sole responsibility for generating forward progress in the company, which is the nature of the beast in self-employment. It will be nice to know that payroll, legal, devops, and the like are sorted by competent professionals who enjoy doing them, so that I can be a competent professional on the things I’m actually good at and not have to worry if the server is down at 2 AM.

It’s going to be an adjustment to having a boss again. It might be a bit of an adjustment for Stripe, too, as their expectation that I operate autonomously will probably not always coincide with their immediate desires.

Fun story: I’m probably their only employee who asked for “Employee will be permitted to continue criticizing Bitcoin in their personal capacity, despite the fact that Bitcoin is a technology that Stripe uses and promotes” to be written into the employment contract. (One might sensibly wonder if that were actually intended as a “no brown M&Ms” clause simply to test how serious they were regarding autonomy, but no, I just wanted to continue criticizing Bitcoin. Everyone needs a hobby.)

What Happens To Your Other Businesses?

Starfighter: Starfighter will be winding down operations. We experienced the oldest startup story: we shipped a great product but the business didn’t end up working out. I wish Thomas and Erin the best in their new adventure and they are likewise rooting for me at Stripe. I also wish our players and clients success in their future endeavors — feel free to contact me down the line if I can help you out (in my personal capacity).

If you want to know more about Starfighter, please check Starfighter spaces.

Appointment Reminder: Appointment Reminder will be sold in the next few weeks, through FEI, who I used to sell Bingo Card Creator last year. As of the publication of this post, the sale is already deeply in progress. If you have questions, please route them to FEI.

I’ll likely write up what I learned from AR at some point.

Kalzumeus Software: I will continue periodically writing and speaking in my personal capacity here and at my usual watering holes, but imagine that Atlas will be more than enough work for the next few years. I will likely not do any new products under the Kalzumeus banner.

Incidentally, I’ve partnered with Nick Disabato to finally finish that conversion optimization course I have been working on for forever-and-a-day, as my last bit of major creative work prior to starting with Stripe. (We wrapped two days of video shoots five minutes ago.) More details from Nick and me on our respective mailing lists as that is ready to show.

Want To Work At Stripe?

Stripe is hiring. Atlas is hiring. Stripe Japan is hiring. Their jobs page is here.

The overwhelming majority of jobs in the technology industry do not go to people who cold apply via jobs pages. This is important for you to know and operationalize. In general, you want to find someone inside the company who has hiring authority, make a good impression on them, and then get them to start the ball rolling on an internal application process.

I don’t have hiring authority, but I’ve had a standing invitation to talk to anyone interested in the software industry open for several years, and that isn’t going anywhere. I work at Stripe; I work for the Internet. If you’re on the Internet, I work for you, too. If you’re interested in potentially exploring options at Stripe, feel free to email me. I’ll happily take you out to coffee and/or a Skype call, give you the inside scoop, and make sure you’re routed to one of the friendly, accessible people internally who can actually make the hiring process happen. You cannot possibly waste my time.

See you on the Internet, and thanks.

Footnotes


There is a conceivable future where the de-facto monopolies that control discovery levy a 30% business privilege tax on the entire Internet, but living in that hypothetical cyberpunk dystopia would still be superior to every economic arrangement prior to the Internet. Man, what a downer of a footnote. Have I mentioned that it’s a good thing to make payment rails that don’t go through AppAmaGooBookSoft?

Design and Implementation of CSV/Excel Upload for SaaS

I usually write more about marketing/sales than I do about actually making software products, but I have recently been working on the product side a little more intensively for Appointment Reminder.  One of the features we implemented was CSV upload.  This is a very, very common task for virtually every B2B SaaS product, so I thought I’d share how we did it, as I’m pretty happy with how it is working out. Hopefully it will be useful to some of you.

The Problem With Upload Interfaces

Substantially every B2B SaaS product benefits from interoperability with other recordkeeping systems your customers use, including “formal” software (your competitors or products you interoperate with) and “informal” software like e.g. spreadsheets, Trello lists, and email-inboxes-used-as-a-database.

This is particularly true early in your customer’s lifecycle with your company: most data which will be new to your system is not actually new.  It presently exists somewhere in the customer’s organization.  You presumably want to make it as easy as possible for them to put it into your system and make your system the “source of truth” about that data.

Frequently, savvy SaaS entrepreneurs do this with “concierge onboarding” — basically, using high-touch human handholding to substitute for features which do arbitrary data source to your DB schema importing.  Why?  Partially because the human touch earns a lot of customer loyalty.  Partially because you can offer concierge onboarding as soon as you have one smart person who has an inbox and free time, without necessarily having to build a whole lot of software support for them.

Why I Punted On CSV Import for 4 Years

My launch list of features for Appointment Reminder included CSV import for client data (names/phone numbers/emails) and for appointment data (client data plus date/times of appointments).  Unfortunately, I assessed this feature as probably costing $100,000 in engineering time to implement well, so I punted and implemented a quick “concierge upload feature.”

Concierge Onboarding Upload

This was backed by me literally SCPing the files to the server, then opening the Rails console, and typing commands to parse out the file format in real time.  It would typically look something like:

lines = File.readlines("/home/patrick/new-upload.csv")
u = User.find_by_email "someone@example.com"
a = u.account
#Split each line on commas; the column numbers differ for every client's file.
records = lines.map {|line| line = line.split(","); record = {}; record[:name] = line[3]; record[:email] = line[7]; record[:home_phone] = record[:mobile_phone] = line[12]; record}; records.size
records.last #Check to make sure that command actually worked.
clients = records.map {|record| c = a.clients.build; c.name = record[:name]; c.home_phone = "blah blah blah"; c}; clients.size
clients = clients[1..(clients.size - 1)] #Skip the header line
clients.map {|c| c.save}; "Clients saved!"
errors = clients.select {|c| c.errors.present?}; errors.size
#While there are still records with errors
c = errors.pop
puts c.inspect
#c.email is "bob@gmail", needs a .com
c.email += ".com"
c.save
#Worked!
c = errors.pop
#Blah blah blah.

For a well-behaved CSV file, this took about 5 to 15 minutes of work. For clients who had very unclean CSV files, I often ended up doing an hour of data entry into IRB. That certainly makes sense for clients with predicted LTVs of $5,000+, but ideally I wouldn’t be doing this sort of work when I could be doing other stuff to drive the business forward. Additionally, the latency in sending me an email and actually having one’s account ready was 24~48 hours in the best of times, and that often caused clients to seek alternatives.

So why didn’t I just have a generic CSV importer ready? Because it is hellaciously difficult to do CSV import well in the general case.

Why is this?

  • Column-to-column mapping: Clients’ existing understanding of their data very rarely matches with your understanding of how the same data is organized, so you have to map their columns to your columns. This is often not a one-to-one mapping. For example, Appointment Reminder gives each client a single freeform name field, but many customers have systems which differentiate between first and family names (presumably because they think those things exist), so you have to have a way to stitch together those columns.
  • Pervasively unclean data: Informal software often includes data which is not exactly machine readable. Email addresses like bob@aol. Phone numbers like [(555) 555-5555 or -5556 (his mother)]. Names like [” Catherine] which make sense if you’re a human seeing it under [Smith James] but make no sense to a CSV file reader (and, incidentally, will frequently choke it).
  • File formats: You would think the CSV file format is fairly well-standardized. Hah. Hah. Hah. (cries)
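To make the first two bullets concrete, here is a minimal Ruby sketch of stitching name columns together and flagging unclean cells. The helper names (stitch_name, problems_for), the email pattern, and the ten-digit phone rule are all assumptions made for illustration, not Appointment Reminder's actual code.

```ruby
# Hypothetical helpers for illustration only.

# Loose shape check: something@something.something, with no spaces.
EMAIL_SHAPE = /\A[^@\s]+@[^@\s]+\.[^@\s]+\z/

# Stitch separate first/family name columns into one freeform name,
# skipping cells that are missing or blank.
def stitch_name(*parts)
  parts.compact.map(&:strip).reject(&:empty?).join(" ")
end

# Return human-readable problems for one row (an empty array if it looks clean).
# Assumes US-style phone numbers with exactly 10 digits.
def problems_for(row)
  problems = []
  email = row[:email].to_s.strip
  unless email.empty? || email.match?(EMAIL_SHAPE)
    problems << "email looks incomplete: #{email.inspect}"
  end
  digits = row[:phone].to_s.gsub(/\D/, "")
  unless digits.length == 10
    problems << "phone should have exactly 10 digits, found #{digits.length}"
  end
  problems
end

puts stitch_name("James", " Smith ")
puts problems_for(email: "bob@aol", phone: "(555) 555-5555 or -5556 (his mother)")
```

Reporting every problem in a row at once, rather than stopping at the first, means the user can fix a record in one pass.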

As a result, the typical web upload workflow is very inadequate for CSV file upload. You’d like to just give them an upload control, grab the file, pass it to a background process, then update the UI with the result, but often the result is going to be “Lines 37, 45, and 392 had problems. Also, although you wouldn’t have expected it, we left off all the email addresses because you picked the wrong column. Whoopsie! Edit the file then re-upload.”

This is enormously frustrating for the user.

Enter SheetJS and Handsontable

I can’t remember how I stumbled upon SheetJS, but as soon as I saw their tech demo, I knew AR was finally going to get file uploads.

SheetJS is an OSS project which parses a variety of common spreadsheet formats, including (through libraries) Excel, OpenDoc, and CSV. Because it is Javascript, you can actually do this directly in the browser, which allows me to completely eliminate the “upload” stage of the “Upload your CSV/etc, we parse it, then we display feedback” workflow.

I can’t express how magical this is in mere words. You have to see it in action. Either use that tech demo or watch the following ~50 second video.

I’ve described this to my friends as “It’s like embedding Google Sheets into your application, for less than a day of work.”

Carefully Considered Glue Code

I had “minimal” import capabilities available in about 3 hours of work with SheetJS, but getting a decent editing workflow with Handsontable (the grid component which provides a lot of the “magic”) took a solid three days of work.

Why?

The above example shows a well-behaved Excel file. We heuristically guess what columns in the Excel file map to which columns in our data model. Files which are a) parsed and mapped 100% successfully and b) contain no errors at all… are not the most common case with CSV uploads.
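The post does not show the heuristics themselves, so the following is only a sketch of one plausible approach: match each header against a list of patterns and let each field be claimed once. COLUMN_PATTERNS and guess_columns are hypothetical names, and the patterns are assumptions that a real importer would tune against real customer files.

```ruby
# Hypothetical patterns, checked in this order. More specific fields come first
# so that "Cell Phone" is claimed as mobile before the generic phone pattern sees it.
COLUMN_PATTERNS = {
  name:         /\A(full\s*)?name\z|client|patient/i,
  email:        /e-?mail/i,
  mobile_phone: /mobile|cell/i,
  home_phone:   /home|phone|tel/i
}.freeze

# Guess which of our fields each header maps to. Each field is assigned to at
# most one column. Unrecognized headers map to nil, so a UI can ask the user
# about them instead of guessing silently.
def guess_columns(headers)
  claimed = []
  headers.map do |header|
    field, _pattern = COLUMN_PATTERNS.find do |candidate, pattern|
      !claimed.include?(candidate) && header.to_s.strip.match?(pattern)
    end
    claimed << field if field
    field
  end
end

p guess_columns(["Client Name", "E-mail", "Cell Phone", "Home Phone", "Notes"])
```

Returning nil for unknown columns is what makes the "mark this column" context menu necessary: the guess is a starting point that the user corrects, not a final answer.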

I had to build a way for users to both a) quickly understand that the column mapping was not correct (ideally, without having them have to understand the phrase “heuristics used to parse your document”, since most are office managers) and b) correct it, without requiring a heck of a lot of explanation.

UX-wise, I figured that, as long as we were using the typical right-click context menu that people would be familiar with from Excel, we’d extend that with the ability to mark columns appropriately. Additionally, we’d add a bit of external-to-the-grid feedback to the user to make it obvious what their file was missing.

Then there’s the question of “What do we do about errors?”

I figured that, for AR’s particular use case, errors in a given line shouldn’t block processing of the other lines, since customer records typically don’t “bleed” into each other. Accordingly, we’d save all records which had no errors (similar to my IRB session above), then ask the user to bulk update records with errors and try again.
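A rough sketch of that "save the clean rows, hand back the rest" policy, runnable without Rails. SketchClient is a hypothetical stand-in for an ActiveRecord model and its two validations are invented for the example. The only behavior borrowed from Rails is that save returns true or false and leaves the reasons in errors.

```ruby
# SketchClient is a hypothetical stand-in for an ActiveRecord model, so that
# this runs without Rails. Its validations are illustrative assumptions.
SketchClient = Struct.new(:name, :email, keyword_init: true) do
  def errors
    @errors ||= []
  end

  # Mimics ActiveRecord's save: returns true on success, false on failure,
  # and leaves the reasons in #errors.
  def save
    errors.clear
    errors << "name is blank" if name.to_s.strip.empty?
    errors << "email is malformed" unless email.to_s.match?(/\A[^@\s]+@[^@\s]+\.[^@\s]+\z/)
    errors.empty?
  end
end

# Try every row independently; one bad row never blocks the others.
def import_rows(rows)
  saved, failed = rows.map { |attrs| SketchClient.new(**attrs) }.partition(&:save)
  { saved: saved, failed: failed }
end

result = import_rows([
  { name: "Bob Smith", email: "bob@example.com" },
  { name: "Al Jones",  email: "al@gmail" }
])
puts "Saved #{result[:saved].size}, needs fixing: #{result[:failed].size}"
```

The failed list is what would go back into the grid for the user to bulk edit and resubmit.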

This is handy, since customers often have very consistent forms of errors in their documents, and the remedy for them (e.g. deleting an entire column of bad data) is often fairly easy to do in a full-featured spreadsheet interface. Putting that in the browser rather than back in Excel makes it minimally difficult for them to actually succeed in doing this.

Take a look at this 65 second video for a user who uploads 98 users successfully while also having two records with errors. (Don’t worry about violating anyone’s privacy: all of the data is faked using GenerateData.com, a really handy service for cranking out CSVs/Excel files/etc which exercise various parts of an upload feature.)

I rush through the process pretty quickly for the purpose of the demonstration, but in real life, it is close to this easy, particularly for a user who has done this before (like, e.g., my support reps or myself — see below).

What was tough about the glue code? Mostly, that it involves a lot of jQuery callback hell. Javascript development is, in general, not exactly my strong suit. 80%+ of development time for this feature was making sure that right clicking on the mobile column to mark it as mobile actually worked as expected. You won’t believe how many reloads I did during this process.

I give an A+ to Handsontable for documentation quality and a well-designed JS API. This project would have been enormously frustrating without that. SheetJS is a little newer of a project with a substantially broader brief, so it involved a little more source code spelunking to get working (particularly when integrating new file formats into the provided dropsheet.js magic which imports the dropped data into Handsontable), but I’d rate it as production-quality. Both sets of devs are quite responsive and substantially every B2B SaaS product should adopt them (hopefully adding a bit more polish on the UX than I did).

What Does The Backend Look Like?

Customers routinely upload thousands of records at once (which, n.b., neither of these OSS projects has any problem with). Given that my Rails app treats each row in isolation rather than using a batch insert statement, mostly to run validations, this means that doing the upload parsing in the request/response cycle is probably not optimal. You never want a request waiting several seconds unless you’re intentionally doing long polling.

I instead immediately stuff the uploaded data into a semi-ephemeral data store (Redis), then acknowledge to the AJAX request “OK, got it. ID number was $FOO.” The client then begins polling an endpoint for status updates about $FOO.
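The stash-then-poll handshake can be sketched as follows. To keep it runnable without a server, FakeStore is an in-memory stand-in exposing only setex(key, ttl, value) and get(key), the two redis-rb calls the sketch relies on. The key format, the TTL, and the method names stash_upload and poll_upload are assumptions, not the real upload_processor.rb.

```ruby
require "json"
require "securerandom"

# In-memory stand-in for Redis (hypothetical). It mirrors the shape of
# redis-rb's setex(key, ttl, value) and get(key), including expiry.
class FakeStore
  def initialize
    @data = {}
  end

  def setex(key, ttl_seconds, value)
    @data[key] = [value, Time.now + ttl_seconds]
    "OK"
  end

  def get(key)
    value, expires_at = @data[key]
    return nil if value.nil? || Time.now >= expires_at
    value
  end
end

UPLOAD_TTL = 24 * 60 * 60 # assumption: uploads expire after one day

# Store the posted rows and return the ID the client will poll with.
def stash_upload(store, rows)
  id = SecureRandom.hex(8)
  store.setex("upload:#{id}", UPLOAD_TTL, JSON.generate("status" => "queued", "rows" => rows))
  id
end

# What the polling endpoint would answer: just the status, never the raw rows.
def poll_upload(store, id)
  raw = store.get("upload:#{id}")
  return { "status" => "expired_or_unknown" } if raw.nil?
  JSON.parse(raw).slice("status")
end

store = FakeStore.new
id = stash_upload(store, [{ "name" => "Bob Smith" }])
p poll_upload(store, id)
```

Because the expiry is set at write time, nothing has to sweep old uploads later, which is the built-in ephemerality the next paragraph describes.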

(Worth mentioning: why Redis as opposed to e.g. just persisting the files to disk? One, our Redis database is encrypted and our standard file storage is not, so not putting the files on disk saves us from having to worry about whether they include HIPAA-privileged data at all. Two, while we could theoretically clear out uploads on a regular basis, Redis has the expiry logic already built into it, so the data has built-in ephemerality. Three, it’s easy for my app to assert “We’ll always have a Redis instance available!” but not quite so easy to assert “We’ll always have a local filesystem shared by all processes relevant to servicing a particular user’s account” — if we eventually move to a multi-machine architecture then the file system suddenly becomes a really bad place to be storing uploads. Four, uploaded files are inherently a security risk if a misconfiguration lets someone execute upload.csv as e.g. a PHP file, and I’m decently certain there is no URL which will ever route to an arbitrary key in Redis and cause it to be evaluated as anything other than a string.)

Our queue worker processes then process upload $FOO, updating the database directly for working records and saving errored records as a new job (also in Redis). We then have the client poll result return both instructions to the user and new records to repopulate the table with. While our upload infrastructure at the model level is reasonably well architected with separations of concerns and everything, the bit which synchronizes the jobs and the UI is incredibly tightly coupled between the UI and the controller layer. That decision will eventually make maintenance of this more painful than it would be if they were decoupled. I shipped it anyhow, because shipping beats not shipping.

To give you a rough idea of complexity:

upload_processor.rb contains the logic orchestrating jobs, persisting to/from Redis, and turning JSON blobs which were POSTed to the server into client records. It is 335 lines long, and required approximately 20 lines of additions to our Client class to support. Unit tests run another 500 or so lines.

uploads_controller.rb turns AJAX requests into upload processor method calls. It is 135 lines, of which 35 are the incredibly coupled method for answering poll requests. I’ll avoid showing a screenshot for fear of being blocked as obscenity in the offices of good engineering teams everywhere.

The JavaScript glue code is 450 lines long. No unit tests because, frankly, I have no clue how one would test this in an automated fashion. I’d rate it as C+ in terms of code quality: jQuery callback hell, what can I say.

Rolling Out To Customers

I don’t like exposing new features directly to customers, particularly features which I suspect are probably fiddly. For example, while I’ve tested this feature with documents in a variety of encodings and file formats, I have no clue what will happen if someone uses an outdated version of MS Excel on a Windows 98 machine to upload a file written in a Hebrew code page, and I’m honestly a little scared of finding out.

Accordingly, we put this behind a feature flag. Feature flags are structurally similar to A/B tests:

<%# In a view.  @account is, by convention, the currently logged in account. %>
<%# Account#allows?("feature") is our feature flag method. %>

<% if @account.allows? "uploads" %>
  Click here to <%= link_to "upload", {...} %> your client data.
<% end %>

Naturally, we also control access at the controller level (to prevent anyone from playing guess-the-URL), but this demonstrates the basic idea. We can assign users access to this feature individually or in groups, without having to roll it out to the entire userbase.
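
The controller-level check can be sketched in plain Ruby. Only `Account#allows?` comes from the post; the set-backed `Account` stand-in, `grant`, `FeatureNotAllowed`, and `require_feature!` are hypothetical names. In a Rails app the guard would more likely live in a `before_action`.

```ruby
require "set"

# Hypothetical stand-in for the real Account model.
class Account
  def initialize(features = [])
    @features = Set.new(features.map(&:to_s))
  end

  # Hand-grant a feature to this account.
  def grant(feature)
    @features << feature.to_s
    self
  end

  def allows?(feature)
    @features.include?(feature.to_s)
  end
end

class FeatureNotAllowed < StandardError; end

# Controller-side guard: refuse the request outright rather than relying
# on the link being hidden in the view.
def require_feature!(account, feature)
  unless account.allows?(feature)
    raise FeatureNotAllowed, "#{feature} is not enabled for this account"
  end
  true
end
```

Raising (and rendering a 404 or 403 from the handler) keeps guess-the-URL requests from reaching the upload code at all.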

At present, client data uploads are available in production for our internal users only. Our administrator accounts can upload data into any account, but users can’t access the corresponding upload feature for their own accounts unless I hand-grant their account that permission.

This forces clients to continue emailing us their CSV files to get them uploaded, which lets me have the catastrophic, terrible UX, blow-up-straight-in-my-face errors that still exist in this feature happen to a patient and incredibly invested product owner (me) rather than an impatient user on their first day. For example, it wasn’t obvious to me, but if JSON interpreted someone’s phone number as an integer (5555555555) rather than a string (555 555-5555), our upload job would die with an uncaught exception and the polling method would continue polling forever. I fixed that and the user never even knows it happened.
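
A rough sketch of the kind of fix involved, assuming rows arrive as parsed JSON hashes with a `phone` key. The helper names and the `[:ok, ...]` / `[:error, ...]` return shape are hypothetical. The point is that every row ends in a definite result, so the poller always has something terminal to report.

```ruby
require "json"

# Coerce whatever JSON handed us (Integer or String) into a digit string.
def normalize_phone(value)
  value.to_s.gsub(/\D/, "")
end

# Process one row; return [:ok, record] or [:error, message] and never raise.
def process_row(row)
  phone = normalize_phone(row["phone"])
  return [:error, "phone number is missing"] if phone.empty?
  [:ok, { "name" => row["name"].to_s, "phone" => phone }]
rescue StandardError => e
  [:error, e.message]
end

rows = JSON.parse('[{"name":"Ann","phone":5555555555},{"name":"Bob","phone":"555 555-5555"}]')
results = rows.map { |row| process_row(row) }
```

Both rows normalize to the same digit string, whichever type the JSON parser produced.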

This also lets us verify that the parsing/heuristics/etc work for someone’s particular “informal software” used at their business. Typically, they have one or a handful of Excel files, often maintained on a single version of Excel on a small set of machines. We can handle the first upload ourselves, verify that it works as expected, and then extend them the ability to do uploads. It is unlikely that they’ll add row-level data which comprehensively borks the upload feature assuming it works on their existing files at least once.

What’s Next In Our App

Now that we have the ability to do this for client data uploads, which are (by far) the easier of the two upload types for us, we’re going to deal with appointment data uploads next. These are a) less forgiving of error than client data, b) conceptually harder to deal with (parsing date/times… ick), and c) vastly more numerous than client data, since client data is typically essentially append-only with minimal editing after upload but appointment data changes on a multiple-times-a-day basis.

Luckily, we’ll be able to reuse a bit of the infrastructure which we built and also will, hopefully, have shaken out a lot of the UX/error handling/monitoring/etc challenges prior to playing the upload game on hard mode.

Presently, we have a single set of heuristics/parsers/etc which are used for all clients. Eventually, I’d like this to be plug-and-play on a per-customer basis, so that we could write ~50 lines of Ruby if required to slurp in a godawful file format required to support a particular enterprise client. Cf. the Strategy pattern in the GoF book, except that metaprogramming makes it much less painful than it would otherwise be.
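
One way the per-customer idea might look, as a sketch only: a registry of blocks keyed by customer, with a fallback to a default parser. `UploadParsers`, the `:acme` customer, and both file formats are hypothetical, and the default parser is a deliberately naive comma split rather than a real CSV parser.

```ruby
# Strategy pattern via a registry of blocks instead of a class per strategy.
module UploadParsers
  REGISTRY = {}

  # Register a parser block under a customer key.
  def self.register(customer_key, &parser)
    REGISTRY[customer_key.to_s] = parser
  end

  # Use the customer's parser if one exists, else the default.
  def self.parse(customer_key, raw)
    parser = REGISTRY.fetch(customer_key.to_s) { REGISTRY.fetch("default") }
    parser.call(raw)
  end
end

# Default: first line is a header row, cells are comma-separated (naive).
UploadParsers.register(:default) do |raw|
  header, *rows = raw.lines.map { |line| line.chomp.split(",") }
  rows.map { |cells| header.zip(cells).to_h }
end

# Hypothetical enterprise client: pipe-delimited, no header row.
UploadParsers.register(:acme) do |raw|
  raw.each_line.map do |line|
    name, phone = line.chomp.split("|")
    { "name" => name, "phone" => phone }
  end
end
```

Supporting a new format is then one more `register` call, without touching the code that handles everyone else.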

A Brief Meditation on OSS

I estimated this feature as costing $100k in engineering time if I were scratch building it. We got it done in +/- 3.5 days of work, which is easily a 90%+ savings, as a result of SheetJS and Handsontable being available. I simply didn’t feel right getting that amount of value for free from two projects which are run by very small teams, so I approached both and convinced them to sell me an enterprise license to their project. It is equivalent to the usual OSS license, except it comes with an invoice. (I don’t think it would be appropriate to name numbers, as that might constrain their ability to price enterprise licenses going forward. Let’s say that my initial outreach emails — titled “Can I pay you to work on $PROJECT?” — probably sounded like a less-wealthy-than-usual Nigerian prince who happened to live in Tokyo and have a really odd interest in CSV files.)

If you have an OSS project which is as useful for for-profit businesses as these two projects, I would strongly, strongly advise taking down any mention of “donations” and offer commercial licensing for the project. This doesn’t have to change anything about your project. (What is the difference between a commercial license that is completely discretionary and a donation? I can donate money, and like most middle class people, my actual behavior with regards to donations is “occasionally donates tens of dollars to poor people and other deserving causes.” My company is literally not allowed by the tax code to donate money, but it can buy any software it feels like, it does not require “You must be poor” to write a check to you, and it assumes that software costs hundreds/thousands of dollars. Ask for license revenue, not for donations.)

If you’re willing to let commercial viability influence your choice of licenses, the dual licensing model also works pretty well, which my buddies at Binpress explain in more detail here. (Disclaimer: I’m a small investor in Binpress.) For example, Appointment Reminder would be totally unwilling to put GPLed code anywhere in our codebase (viral infection of the rest of the code is a non-starter for me), but if you (the copyright holder) said “We’ll let you use our code on a basis equivalent to an MIT license, if you pay us $X,000”, then (assuming the code solved a $X,000 problem for me) I’d write a check immediately.

More broadly, I think that SaaS businesses and other heavy consumers of OSS owe it to the community to provide the funding for projects which represent significant advances, as we are, frankly, much better than the typical OSS dev at actually monetizing software. I’m less worried about the likes of Ubuntu or Chrome, which have massive corporate backing behind them, and even much smaller projects like Rails do pretty well with a few larger sponsors (Basecamp, NewRelic, Heroku, etc) which employ the largest contributors full-time. At the lower end of the scale, though, I think it is right and proper for for-profit businesses to help labor-of-love OSS projects go full-time if that meets their goals. So, where possible, I try to pay professional wages for professional work. There exist a variety of ways to contribute to OSS projects, but nobody’s landlord accepts pull requests as currency, so I prefer contributing with money.

Try it sometime: it’s fun and easy. I wrote two emails explaining that I wanted to buy a commercial license and that my only requirement from them was an invoice which said “commercial license for $SOFTWARE” and a figure on it. After collecting the invoices, I had my bank send checks/wire transfers as appropriate. Payments between parties in the global wealthy class (i.e. most software companies and developers) are a solved problem, so don’t let whole minutes of hassle scare you away from trying it.

Productizing Twilio Applications

This post includes video, slides, and a full-text writeup. I recommend bookmarking it if you’re on an iPhone right now.

I make extensive use of Twilio (a platform company that lets you do telephony with an API) in running Appointment Reminder, my core product focus at the moment.  (Wait around a day or two and I’ll tell you a bit about how it is doing in my annual end-of-the-year wrapup.)

Twilio has a very passionate developer community and fairly good documentation on their website, but I’ve sometimes been frustrated at it, because it teaches you the bare minimum to get phones ringing.  That is truly a wonderful thing and a necessary step to building a telephony application.  However, if you continue developing your application in the way that the Quick Start guides suggest, you will routinely run into problems regarding testing your code, maintaining it, securing it, and generally providing the best possible experience to your customers and the people they are calling.

I have a wee bit more than a year of practical experience with a Twilio application in production, so I went to TwilioConf and did a presentation about how to “productize” Twilio applications: to take them past the “cool weekend hack” stage and make them production-ready.  Twilio has graciously released videos of many of the presentations at TwilioConf, so I thought I’d write up my presentation for the benefit of folks who were not at the conference.

The Video (~30 Minutes)

Twilio Conference 2011: Patrick McKenzie – Productizing Twilio Apps from Twilio on Vimeo.

The Presentation (40 Slides)

The Writeup


Why I Think Twilio Will Take Over The World

(This was not actually in the presentation, because I didn’t have enough time for it, but I sincerely believe it and want to publish it somewhere.)

I think Twilio is, far and away, the most exciting technology I’ve ever worked with.  The world needs cat photos, local coupons, and mobifotosocial games, too, but it needs good telephony systems a lot more, as evidenced by companies paying billions of dollars for them.

Additionally, Twilio is the nascent, embryonic form of the first Internet that a billion people are going to have access to, because Twilio turns every phone into a smartphone.  The end-game for Zynga’s take-over-the-world vision is the human race slaved to artificial dopamine treadmills.  The endgame for Twilio’s vision is that every $2 handset in Africa is the moral equivalent of an iPhone.  I know which future I want to support.

Smartphones aren’t smart because of anything on the phones themselves, they’re smart because they speak HTTP and thus get always-on access to a universe of applications which are improving constantly.  Twilio radically reduces the amount of hardware support a phone needs to speak HTTP — it retroactively upgrades every phone in the world to do so.  After that, all you need is the application logic.  And what application logic there is — because the applications live on web servers, they have access to all the wonderful infrastructure built to run the Internet, from APIs that let you get Highly Consequential Data like e.g. weather reports or stock prices or whatever, to easy integration with systems which were never built to have a phone operating as part of them.

You can’t swing a stick in a business without hitting a problem which a phone application makes great sense for.  I filled up three pages of a notebook with them in just a week after being exposed to Twilio for the first time.  Order status checking for phone/fax/mail orders.  Integrated CRMs for phone customer service representatives.  Flight information.  Bank balances.  Server monitoring.  Appointment reminders.  Restaurant reservations.  Local search.  Loyalty programs.  Time card systems.  Retail/service employee support systems.  Shift management.  The list goes on and on and on.

Seriously, start writing Twilio apps.

What This Presentation Will Actually Cover

I’m tremendously optimistic about the futures of Twilio and the eventual futures of companies which make Twilio applications, but I’m pessimistic about your immediate future as an engineer writing a Twilio app, because it is going to be filled with pain.  You’re probably going to make some choices which will cause you and your customers intense amounts of suffering.  I’ve already made several of them, so use me as the inoculatory cowpox and avoid dying.

Crying In A Cold, Dark Room

Back in February 2011, I moved from my previous apartment to my current house.  I unwisely decided to push a trivial code change prior to boxing things up.  This trivial code change did not immediately take down the server, but did cause one component (queue worker processes) to fail some hours later.  The most immediate consequence of this was that outgoing appointment reminder phone calls / SMSes / emails failed to go out.  Since I was busy moving, I did not notice the automated messages about this failure.

When I discovered the failure (8 hours into customer-visible downtime), I panicked.  Rather than reasoning through what had happened, I reverted the code change and pushed reset on the queue worker processes.  This worked, and the queue quickly went from 2,000 pending jobs to 0 pending jobs.  I then went to bed.

At roughly 3 AM, I woke up with a vague feeling of unease about how I had handled the situation, and checked my email.  My customers (small businesses using AR to talk to their clients) had left several incensed messages about how their clients had reported receiving dozens of calls in a row on behalf of their business at 7:30 PM the night before (right after I had restarted the queue workers).

Here was the error: my application assumed that the queue would always be almost clear, because queue workers operate continuously.  A cron job was checking the DB every 5 minutes to see whether a particular client had been contacted about her appointment yet.  If she hadn’t, the cron job pushed another job to the queue to make the phone call / SMS / email.  When the queue came back up, each client received approximately 100 queued events simultaneously.  These did not themselves check, at the start of the job, whether the job was still valid, because the application assumed that the cron job would only schedule valid reminder requests and not execute 100 times in between queue clearings.

This resulted in approximately 15 people receiving a total of 600 phone calls, 400 SMSes, and 200 emails, in approximately a 5 minute period of time.

There are a variety of ways I could have avoided causing this problem for my customers:

  1. Don’t make code changes prior to planned unavailability, even if they look trivial.
  2. Don’t ever let the phone that gets emergency messages leave your pocket.
  3. Switch to idempotent queues, so that adding the same job multiple times does not result in multiple copies of the job.
  4. Add per-client rate limits, so that application logic errors don’t cause runaway contact attempts.
  5. Add failsafes for historically unprecedented levels of activity, shutting down the system until I could manually check it for correctness.
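
Items 3 and the missing start-of-job validity check can be sketched together. This is an in-memory illustration, not the production queue: `IdempotentQueue`, the job key format, and `run_reminder` are hypothetical names, and a real system would keep the "already contacted" state in the database.

```ruby
require "set"

# Queue that ignores a push if a job with the same key is already pending.
class IdempotentQueue
  def initialize
    @jobs = []
    @pending_keys = Set.new
  end

  # Returns true if enqueued, false if an identical job is already waiting.
  def push(key, payload)
    return false unless @pending_keys.add?(key)
    @jobs << [key, payload]
    true
  end

  def pop
    key, payload = @jobs.shift
    return nil if key.nil?
    @pending_keys.delete(key)
    payload
  end

  def size
    @jobs.size
  end
end

# The job itself re-checks validity before contacting anyone, so even a
# duplicate that slips through does not produce a second call.
def run_reminder(appointment, contacted_ids)
  return :skipped if contacted_ids.include?(appointment[:id])
  contacted_ids << appointment[:id]
  :sent
end
```

Under this sketch, a cron job that fires 100 times while the workers are down leaves one pending job per appointment, not 100.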

Testing Twilio Applications

Unit testing and integration testing are virtually required to produce production-quality Twilio applications, and will make it much less likely for you to create catastrophically bad bugs in production.  Unfortunately, testing Twilio applications is much harder than testing traditional CRUD web applications, because of how TwiML differs from HTML (minor syntax errors actually cause business problems), how hard it is to replicate telephone operation in integration testing, and how Twilio sometimes has poor separation of concerns between the MVC of a web application, the Twilio helper library, and the Twilio service itself.

Twilio testing is inherently dangerous, because non-production environments (testing, staging, development, etc) could conceivably generate actual, real-world phone calls to phone numbers which were in your database but not actually under your control.  The first and most important tip I have for Twilio testing is to make it explicitly impossible to contact anyone not on a whitelist from code when you’re not in production.  I have a quick snippet that I put in a Rails initializer which monkeypatches my Twilio library to force it to only make phone calls or SMSes to whitelisted numbers.  (I don’t suggest actually re-using this code, particularly as you may not be using Rails or the same Twilio library that I am using, but you can reuse the idea of enforcing safety in non-production environments.)
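
The idea can be illustrated without monkeypatching any particular library. The sketch below wraps a client object instead. It is not the initializer described above: `SafeTelephony`, `BlockedRecipientError`, and the assumption that the wrapped client responds to `call(to, ...)` and `sms(to, ...)` are all hypothetical.

```ruby
class BlockedRecipientError < StandardError; end

# Wraps any telephony client and refuses to contact numbers that are not
# on the whitelist unless the environment is production.
class SafeTelephony
  def initialize(client, environment:, whitelist:)
    @client = client
    @environment = environment.to_s
    @whitelist = whitelist
  end

  def call(to, *args)
    guard!(to)
    @client.call(to, *args)
  end

  def sms(to, *args)
    guard!(to)
    @client.sms(to, *args)
  end

  private

  # Fail closed: anything that is neither production nor whitelisted raises.
  def guard!(to)
    return if @environment == "production"
    return if @whitelist.include?(to)
    raise BlockedRecipientError, "refusing to contact #{to} from #{@environment}"
  end
end
```

Raising loudly, rather than silently dropping the message, also surfaces test data that contains real customers' numbers.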


A lot of Twilio testing will, unfortunately, require manual button-pressing (or scripts which simulate button-pressing on a telephone).  This is easier to accomplish if you can expose your local development machine to the actual Internet.  There are strong security reasons why you don’t want to do this but, if you’re comfortable with doing it, LocalTunnel is a great way to actually accomplish it.

Also see the section below on Modeling Phone Calls, because it will make Twilio phone trees and call logic much more tractable to unit testing.

You Should Have A Staging Server

A staging server is just a copy of the entire production system minus the actual customers.  (You probably shouldn’t put production data on it, because staging systems are designed to break and as a result they may leak data through e.g. SQL injections.  This is an easy way to lose your DB.)  You should use firewalls and/or server rules to make the staging server inaccessible to the world (aside from Twilio and any other APIs which need to access your site for it to work), but assume you will botch this.

Staging servers are virtually mandatory for Twilio applications, because Twilio apps can fail in ways which will not be detected until they are actually accessed over the Internet.  For example, even with unit and integration testing, failing to properly deploy all audio assets (MP3 files, etc) will cause Twilio to throw hard, customer-visible errors in production.  I have automated systems which check for this now, but since that isn’t an exhaustive list of things that can go wrong in production, part of my workflow for deploying all changes on Twilio is to push them to the staging server first, and then have automated scripts exercise the core functionality of the application and ensure that it continues to work.
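
A check for undeployed assets can be very small. This sketch assumes you keep a manifest of the files a deploy is expected to contain. `missing_assets`, the manifest, and the paths in the usage comment are hypothetical, and this is not the automated system mentioned above.

```ruby
# Given a deploy root and the relative paths the app expects to serve,
# return the ones that are not present as regular files.
def missing_assets(root, relative_paths)
  relative_paths.reject { |path| File.file?(File.join(root, path)) }
end

# Example use in a deploy script (paths are illustrative only):
#   missing = missing_assets("public/audio", ["greeting/prompt.mp3"])
#   abort("Missing assets: #{missing.join(', ')}") unless missing.empty?
```

Running it against the staging server's checkout before promoting a build turns a customer-visible Twilio error into a failed deploy step.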

How To Model Phone Calls

Twilio Quick Start guides generally don’t suggest modeling phone calls explicitly, instead relying on just taking user input and doing if/then or switch statements on it.  This is ineffective for non-trivial use cases, because as the application logic gets more complicated, it will tend to accumulate lots of technical debt, be hard to visually verify for correctness, and be extremely difficult to automatically test.  Instead, you should model Twilio calls as state machines.  I am a big fan of state_machine in the Ruby world.

I’ll skip the CS201 description of what a state machine actually is.  If you didn’t take that course, Google is your friend.

You should model calls such that they start in a predictable state and user input moves the call between states, causing side effects of a) running any business logic required and b) outputting TwiML to Twilio, to continue driving the call.  This lets you replace case statements with a lot of parallel structure with well-defined transition tables within the call models.  Those models are then trivial to unit test.  Additionally, adopting coding conventions such as “the TwiML to be executed at a given state is always state_name.xml and any audio assets go in /foo/bar/state_name/*.mp3” allows you to write trivial code which will test for their presence, which will save you from having to manually go through the entire phone tree every time on the staging server to verify that refactoring didn’t break anything.
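
A hand-rolled sketch of the transition-table idea, written in plain Ruby so it runs without the state_machine gem (whose DSL looks different). `ReminderCall` and the state names are hypothetical. Digits are compared as strings because that is how keypresses arrive in request parameters.

```ruby
# A call modeled as a tiny state machine driven by keypresses.
class ReminderCall
  # state => { digit string => next state }
  TRANSITIONS = {
    "greeting" => {
      "1" => "confirmed",
      "2" => "cancelled",
      "3" => "callback_requested"
    }
  }.freeze

  attr_reader :state

  def initialize
    @state = "greeting"
  end

  # Apply a keypress. Unknown input leaves the state unchanged, so the
  # same prompt is simply played again.
  def press(digits)
    next_state = TRANSITIONS.fetch(@state, {})[digits.to_s]
    @state = next_state if next_state
    @state
  end

  # Convention: the template for a state is always "<state>.xml".
  def twiml_template
    "#{@state}.xml"
  end
end
```

Because the table is data, a unit test can loop over every state and every digit rather than dialing through the phone tree by hand.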

Additionally, state machines are much easier to reason over than masses of spaghetti code which case statements tend to produce.  For example, consider the following code, which attempts to implement the phone prompt “Press 1 to confirm your appointment, press 2 to cancel your appointment, press 3 to ask for us to contact you about your appointment.”  Spot the bug.

There are actually over six bugs in that code, above the trivial ones you probably saw with numbers not lining up to action names:

  • The Twilio API will pass this code params[:Digits] not params[:digits], which will cause an error that won’t be caught until you physically pick up the phone.
  • The comparisons of params[:digits] with integers will fail, because it includes string representations of numbers.
  • There are several mistakes in mapping numbers to actions.
  • One of the action names is spelled improperly.

These are very easy to miss because our brains get lulled into a false sense of security by parallel structure.  Instead, the model should be taking care of that mapping between user input and state transitions.  This would radically simplify the code and make the controller virtually failure-free, while letting the model exhaustively unit-test possible user input, expected transitions, and business logic side effects.

State machines might seem like an unnecessary complication when you only have three branches in your code, but production Twilio applications can get very, very complicated.  Here is a state diagram from Appointment Reminder.  You do not want to have to test these transitions manually!

Dealing With Answering Machines

Dealing with the case where the phone call is answered by an answering machine or voicemail system has been the hardest application design problem for me in doing outgoing phone calls in Twilio.  The documentation suggests using an IfMachine feature, which will cause Twilio to listen to a few seconds of the phone call prior to executing your code.  They do some opaque AI magic to determine whether the entity speaking (or not speaking) in that interval is a machine or not, and tell your application whether it is talking to a machine or a human.  In my experience, this has error rates in the 20% region, and many customers intensely dislike the gap of dead air at the start of their phone calls.  Also, if the heuristic improperly detects the beep, your message will start playing early, causing the recording to be cut off in the middle.

There are several ways you could attempt to deal with this:

  • Ignore the issue and treat both machines and humans the same.  This will produce the optimal result for humans, but your system will be virtually unusable when it gets a machine.  (This happens very frequently in my use case.)
  • Force a keypress (“1”) prior to playing your message, then give all users the same message.  This will force most machines to start recording immediately, stopping the cut-off-in-the-middle problem but annoying some clients.
  • Play instructions such as “This is an automated message from $COMPANY.  To hear it, press 1.”  Assume that anyone who doesn’t press 1 in 5 seconds is a machine and play the machine message.  If they interact with the call, play the human message.  This is my preferred solution (although not actually implemented in AR publicly yet, because customers don’t really grok this issue until it bites them personally).
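
The third option above might translate to TwiML roughly as follows. This is a sketch that builds the XML as a string: `<Gather>` with `numDigits`, `timeout`, and `action`, plus `<Say>`, are standard TwiML, while `screening_twiml`, `xml_escape`, and the example URL are hypothetical. If nobody presses a key within the timeout, Twilio falls through to the verb after `<Gather>`, which here plays the machine message.

```ruby
XML_ESCAPES = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;", '"' => "&quot;" }.freeze

# Minimal XML escaping for text and attribute values.
def xml_escape(text)
  text.to_s.gsub(/[&<>"]/, XML_ESCAPES)
end

# Build the opening TwiML: prompt for a keypress, wait 5 seconds, and
# treat silence as a machine.
def screening_twiml(company:, human_url:, machine_message:)
  <<~XML
    <?xml version="1.0" encoding="UTF-8"?>
    <Response>
      <Gather numDigits="1" timeout="5" action="#{xml_escape(human_url)}">
        <Say>This is an automated message from #{xml_escape(company)}. To hear it, press 1.</Say>
      </Gather>
      <Say>#{xml_escape(machine_message)}</Say>
    </Response>
  XML
end

twiml = screening_twiml(
  company: "A & B Dental",
  human_url: "https://example.com/calls/human",
  machine_message: "Please call us back about your appointment."
)
```

The handler behind `human_url` still needs to check that the digit received was actually 1, since any keypress ends the gather.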

There is one particular problem with recording messages on answering machines: if you give a user instructions such as “Press 1 to confirm your message” and they follow that instruction when listening to their voicemail, that keypress will not be caught by your application, it will be caught by their voicemail system, with unpredictable results (such as deleting the message) and an absolute certainty of not doing what your keypress would normally do.  Users do not understand why this happens.  They expect your instructions to them to work.

Securing Twilio Applications

Twilio applications have a superset of the security issues of web applications.  In addition to the usual SQL injections / XSS issues / etc, use of the telephone has unique security issues associated with it.

One issue is that confidential information is only confidential until you repeat it into a telephone.  Even assuming that the phone call isn’t intercepted (which is, ahem, problematic), there are very common user errors and use cases which will cause that information to be disclosed to third parties.  For example:

  • User error in inputting telephone numbers causes the message to go to the wrong party.
  • The message goes to corporate voicemail, where it will routinely be accessible to third parties.
  • The message is played over a speakerphone / cell phone / etc within earshot of third parties.
  • The message is saved on a physical device which can predictably leave the physical control of authorized parties.
  • etc, etc

Don’t ever put confidential information into an outgoing message, unless you have an automated way to authenticate who you are speaking with.

For incoming phone calls, Caller ID is not sufficient authentication.  It can be trivially spoofed; indeed, your phone company will probably sell you a product whose sole aim is to spoof Caller ID.  Instead, you should use a circumstance where the user is already authenticated and authorized, such as a face-to-face meeting or using a username / password pair in a web application, and then give them one-time PINs to do whatever they need to do on your system.  Alternatively, you can implement an entire password system for your incoming phone calls, but users tend to hate them, so I try to keep to the one-time PIN metaphor.  (When a user does something on the AR site which requires calling the system, such as setting up a recording for a reminder, I tell them “Call 555-555-5555 and put in your Task Code 1234”, which (since it is time-sensitive) both helps me look up what they were doing on a multi-user system and also conclusively demonstrates that they were able to read a web page which already verified their identity.)
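
A sketch of time-limited, single-use task codes, under the assumption that codes are four digits and live for ten minutes. `TaskCodes`, the TTL, and the in-memory hash are hypothetical. A real system would persist the codes and limit guesses per caller.

```ruby
require "securerandom"

# Issues short-lived, single-use codes tied to a user and a task.
class TaskCodes
  def initialize(ttl_seconds: 600, clock: -> { Time.now })
    @ttl = ttl_seconds
    @clock = clock
    @codes = {}
  end

  # Called from the already-authenticated web session.
  def issue(user_id, task)
    now = @clock.call
    @codes.delete_if { |_, entry| now > entry[:expires_at] }
    code = nil
    loop do
      code = format("%04d", SecureRandom.random_number(10_000))
      break unless @codes.key?(code)
    end
    @codes[code] = { user_id: user_id, task: task, expires_at: now + @ttl }
    code
  end

  # Called when the phone system receives the digits. Each code works once.
  def redeem(code)
    entry = @codes.delete(code.to_s)
    return nil if entry.nil? || @clock.call > entry[:expires_at]
    entry
  end
end
```

The code doubles as a lookup key: redeeming it tells the phone side both who is calling and what they were in the middle of doing.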

Not in the presentation because the slide got deleted for some reason: the 4chan rule.  Even if your free trial is discovered by 4chan, the world should not become a darker, more evil place.  There exist tremendous possibilities for abuse of free-form input/output to people’s telephones.  I gate access to my trial by requiring a valid credit card, and demonstration calls and the like have strict rate limits which prevent them from being used to spam someone’s phone to death.  (I should also make it impossible to send demo calls outside of standard work hours.  This is easy to say but a little tricky to implement across multiple time zones while still encouraging legitimate use of demo calls, which is why I haven’t done it yet.)
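
A strict rate limit on demo calls could be as simple as a sliding window per destination number. The sketch below is in-memory and illustrative: `DemoCallLimiter` and its limits are hypothetical, and a multi-process deployment would keep the history somewhere shared.

```ruby
# Allows at most `max_calls` demo calls to a given number per window.
class DemoCallLimiter
  def initialize(max_calls:, per_seconds:, clock: -> { Time.now })
    @max = max_calls
    @window = per_seconds
    @clock = clock
    @history = Hash.new { |hash, key| hash[key] = [] }
  end

  # Returns true and records the call if it is within the limit,
  # false otherwise. Callers should not dial when this returns false.
  def allow?(phone_number)
    now = @clock.call
    recent = @history[phone_number]
    recent.reject! { |timestamp| now - timestamp >= @window }
    return false if recent.size >= @max
    recent << now
    true
  end
end
```

Keying the limit on the destination number, not the account, means several trial accounts cannot gang up on one phone.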

Twilio Scales Impressively

Twilio and modern web technologies scale impressively well by the standards of traditional businesses.  However, you should probably continue to rate-limit your systems, even though you could theoretically do substantially more volume.  Many customers who ask about scaling issues do not sufficiently understand that your application scales several orders of magnitude better than their business processes.  For example, a prospective client asked if my system could handle 10,000 phone calls a month.  I told them that I could handle that in under an hour.  They were quite excited about that, but as we continued to speak about their needs, it developed that actually doing that would have crushed their business.  They would have made 10,000 phone calls in an hour, received over 1,000 callbacks, and their two full-time telephone operators would have been overwhelmed by incoming demand for their time.

Grab Bag of Random Advice

  • Never contact Twilio, or any external API, inside the HTTP request/response cycle.  Doing so imposes an unacceptable delay in performance and slaves your reliability to that of the worst performing API you use.  (Twilio has never had user-visible downtime, but some APIs I rely on have.) Queue the request and tell the browser that you’ve done so.  You can drizzle AJAX magic on your website to make this feel responsive for your users.
  • The Twilio Say verb will have a robot read your message.  This is adequate for development, but for production, people prefer listening to people.  Fiverr.com is great for finding voice actresses for $5.
  • You can’t record too much information about Twilio requests, responses, and errors.  I stuff everything in Redis these days.  I strongly wish I had started doing this earlier, rather than writing “An error happened” to a log file and being unable to determine exactly what the error was or easily figure out whose account it actually affected.
  • When in doubt, don’t make that phone call.  Design your system to fail closed.  This is a continuous discipline, but it will drastically cut down on catastrophic problems.
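
The first item in the list above (queue external API calls, don't make them inside the request/response cycle) can be sketched with Ruby's built-in `Queue` and a worker thread. This is a stand-in only: `OutboundQueue` is a hypothetical name, and a production app would use separate, persistent queue worker processes, not an in-process thread.

```ruby
require "thread"

# Accepts jobs instantly and runs them on a background worker, so the
# request handler never waits on an external API.
class OutboundQueue
  def initialize(&handler)
    @queue = Queue.new
    @handler = handler
    @worker = Thread.new do
      while (job = @queue.pop) != :stop
        begin
          @handler.call(job)
        rescue StandardError => e
          # One failing job must not kill the worker.
          warn "job failed: #{e.message}"
        end
      end
    end
  end

  # Called from the request handler; returns immediately.
  def enqueue(job)
    @queue << job
    :queued
  end

  # Drain remaining jobs and stop the worker.
  def shutdown
    @queue << :stop
    @worker.join
  end
end
```

The request handler can return the `:queued` status to the browser straight away, and the page can poll for the outcome.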

Wrapup

That’s it for the presentation contents.  I remain very interested in Twilio apps, and am happy to talk to you about them whenever. My contact details are trivially discoverable.

I’m going to attempt to write a more comprehensive guide to developing Twilio applications, eventually. We’ll see what form that takes — I would really like to provide people an (even) easier way to get started, but at the same time I can’t justify dropping two months of my schedule to write a traditional book on it.

I Saw An Extremely Subtle Bug Today And I Just Have To Tell Someone

This post will not help you sell more software. If you’re not fascinated by the inner workings of complex systems, go do something more important. If you are, grab some popcorn, because this is the best bug I’ve seen in years.

Have you ever been logged into a site and suddenly been asked to log in again?  This is one of those minor nuisances of using the Internet, right?  If you’re not technically inclined, you think “Hmm, ghosts in the Googles.  Oh well, here is my username and password again.”  If you are technically inclined, you might think “Hmm, funny, my cookie picked a bad time to expire, or maybe my session was getting stored in Memcached on a machine which just went down.  Oh well, here is my username and password again.”

It turns out that Bingo Card Creator has been doing this pervasively to a fraction of my users for the last few months.  I never noticed it, and no one ever complained.

Here’s the scenario: Bingo Card Creator is a Rails application, originally coded against Rails 2.1.X and then gradually updated with Rails security releases.  Like many Rails applications, it stores sessions in a cookie (using CookieStore), and uses the session to hold only very limited data.  Specifically, it holds the (critical) user ID for logged in users and the (nice to have) pre-login session ID.  I use the pre-login session ID to tie some analytics stuff together on the back end — basically, it lets me associate newly created accounts with search terms and whatnot that bring them to my site.  The exact mechanism for doing that isn’t important to this bug — you just need to understand that the session resetting is a minor nuisance if it only happens once in a blue moon, and a huge nuisance if it happens pervasively.

Subtle Indications My Site Was Borked

BCC maintains a whole lot of internal analytics, because I’m something of a stats junkie.  Because BCC is in maintenance mode this year, I don’t actually view the stats on a regular basis; as long as the server stays up and users don’t have any complaints, I let the sleeping dog lie.  (I’ve been busy with other projects.)  Anyhow, one example of such a stat is “Of recently created trial accounts, how many were referred from the Halloween bingo cards mini-site?”  For most of the year, that should be a negligible number.

Except right about on Halloween, when the mini-site sees on the order of 30,000 visits or more.  This usually sells several thousand dollars worth of software.  That is fairly hard to miss, because if several thousand dollars don’t show up in my bank account, I’d know right away.  (Sidenote: I did lose about $1,000 due to an ill-timed server crash while I was on a cross-continental plane ride right during the middle of the pre-Halloween rush. Oof.)  So naturally, several thousand dollars implies a hundred or more sales (at $30 each) which implies thousands of trials, right?

Well, my internal analytics code was telling me that the Halloween site had referred ~100 trials of which 6 converted.   Which means that I should have expected a $200 bump in my bank balance.  Which was not what happened.

I mentally filed this away under “Hmm, that’s odd” but didn’t investigate immediately because I had not lost any money (or so I thought) and was busy that week.  Then recently, after doing an unrelated code push (I integrated Stripe, it is awesome, full details later), I did my usual post-deploy smoke test and, after creating a new account, I got suddenly logged out of the application.

“Hmm, that’s odd.”  And I tried it again, twice, couldn’t produce the error, and mentally wrote it off to gremlins.

In Which I Become Doubtful Of The Existence Of Gremlins

Four hours ago, my brain suddenly decided to put these facts together. The discrepancy for the sales statistics strongly suggests that, prior to accounts getting created, the session was getting cleared.  This meant that, when the account actually got created, the referrer was not associated with the account in the DB, which threw off subsequent stats gathered by my internal analytics.  Sessions getting randomly cleared would also cause the user to become instantly signed out.

I tried to reproduce the problem in development mode and was pervasively unable to do so.  Then I started trying to reproduce it on the live site and was able to, sporadically, but only in Incognito Mode in Chrome, and only if I clicked fairly fast.  (Don’t ask how many dozens of tries it took to figure out that fast clicking was the culprit.)

Having verified that it actually existed, I added instrumentation to tell me what my session ID was, and noticed — like expected — that it changed when I was suddenly logged out.  Sure enough, the session was getting wiped.  But why?

Racking my brains to figure out “What could reset a session in Rails other than explicitly trying to do it?”, I started listing up and discarding some candidates:

  • The cookie expired in the browser — nope, expiry was set correctly
  • The cookie got eaten by the hop from Nginx to Mongrel — nope, after investigation, cookies always matched on both sides (like expected)
  • The cookie got too big and failed to serialize properly — nope, after looking through the Rails codebase, that looked like it would throw an exception
  • The cookie got reset when Rails detected malicious behavior coming from the browser — bingo!

CSRF Protection: When It Breaks, It Breaks Very Quietly

Cross-site request forgery (CSRF) is tricking the browser with a malicious (or compromised) site B to access something on site A.  Since requests for site A will carry A’s cookie whether requested by A or not, an image tag or embedded Javascript on B can do anything on A that a logged-in user can do, like accessing /accounts/wire_all_your_money_to_switzerland with the appropriate POST parameters to make it happen.  This is, to put it mildly, a bad thing.  Rails has lovely magic which defends against CSRF for you: all you have to do is include two lines of code
#In application_controller.rb
protect_from_forgery

#In your templates' HEAD somewhere
<%= csrf_meta_tag %>

Rails will then basically generate a cryptographically secure random number, totally transparently to you. This is called the CSRF token.

One copy goes in your Rails session, where only your server and the client can see it.  (n.b. Rails sessions are a bit opaque since they are Base64 encoded, but they can be trivially decoded by anyone who can read the cookie, including the end-user.  They can’t be forged because of another security feature, but don’t put anything you don’t want the user to know in the session.)
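To make "readable but not forgeable" concrete, here is a minimal sketch of a signed cookie in plain Ruby. It is only loosely modeled on the idea behind CookieStore: the JSON serialization, the `SECRET` constant, and the method names are all assumptions made for illustration, not Rails's actual code.

```ruby
require "base64"
require "json"
require "openssl"

# Hypothetical secret for illustration; a real app keeps this private.
SECRET = "not-a-real-secret"

# Sign a payload with an HMAC so tampering can be detected.
def sign(payload, secret)
  OpenSSL::HMAC.hexdigest(OpenSSL::Digest.new("SHA1"), secret, payload)
end

# Encode a session hash as "<base64 payload>--<hex signature>".
def encode_session(session, secret = SECRET)
  payload = Base64.strict_encode64(JSON.generate(session))
  "#{payload}--#{sign(payload, secret)}"
end

# Anyone holding the cookie can read it. No secret is required.
def peek(cookie)
  payload, _signature = cookie.split("--", 2)
  JSON.parse(Base64.strict_decode64(payload))
end

# The server only trusts a cookie whose signature matches its payload.
# (A production implementation would compare in constant time.)
def decode_session(cookie, secret = SECRET)
  payload, signature = cookie.split("--", 2)
  return nil unless signature == sign(payload, secret)
  JSON.parse(Base64.strict_decode64(payload))
end

cookie = encode_session({ "user_id" => 42 })
peek(cookie)           # the end user can see user_id 42
decode_session(cookie) # the server accepts it

forged = encode_session({ "user_id" => 1 }, "attackers-guess")
decode_session(forged) # nil: signed with the wrong secret, so rejected
```

The takeaway is the asymmetry: `peek` needs nothing but the cookie, while producing a cookie that `decode_session` accepts requires the secret.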

Another copy goes in the document’s HEAD (for access via Javascript) and in Rails-generated forms as a hidden value.  When Rails makes a PUT or POST request to the server (via helper-generated form or helper-generated Javascript), Rails will submit the copy included in the HTML code with the request, compare it to the one in the session, and bounce requests where they don’t match. Bad actors on other sites shouldn’t be able to read either a) a page on your site (the same origin policy prevents this) or b) the contents of your cookie from your site, so this is secure.
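The two-copies idea can be sketched in a few lines of plain Ruby. This is an illustration under assumptions, not the Rails implementation: a plain Hash stands in for the session, and `csrf_token_for` and `token_matches?` are hypothetical names.

```ruby
require "securerandom"

# Generate the token once per session and remember it there.
# A plain Hash stands in for the Rails session in this sketch.
def csrf_token_for(session)
  session[:_csrf_token] ||= SecureRandom.base64(32)
end

# True only when the token submitted with the request matches the session's copy.
def token_matches?(session, submitted_token)
  !submitted_token.nil? && submitted_token == session[:_csrf_token]
end

session = {}
token_in_page = csrf_token_for(session) # the copy embedded in forms and the HEAD

token_matches?(session, token_in_page) # true: a form rendered by your own site
token_matches?(session, nil)           # false: a cross-site request with no token
token_matches?(session, "a-guess")     # false: the attacker cannot read the real one
```

Site B can make the browser send A's cookie, but it has no way to learn the value that was rendered into A's page, so it cannot supply the second copy.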

The specifics of how it “bounces requests” are very important.

Point Releases Sometimes Contain Doozies

My personal understanding of Rails up until an hour ago was that a CSRF violation would raise an exception.  This would practically never get seen by a legitimate user operation, so few people are aware of that, but I had seen it a time or two when security auditing BCC.  (Some of my geeky friends had, back in the day, exploited BCC with a CSRF and helpfully told me how to fix it.  Naturally, after fixing it I verified that the site worked as expected with the fix.)

So if the CSRF protection was somehow eating sessions, I would expect to see that exception getting logged and emailed to me by Airbrake (formerly Hoptoad — it emails you when an exception happens in production, highly recommended).   That wasn’t happening.

Then I decided to dig into the Rails source.  Whereupon I learned that Rails 2.3.11 changed the behavior of CSRF protection: instead of throwing exceptions, it would silently just clear the session and re-run the request.  For most sensitive operations (e.g. those which require a signed in user), this would force a signout and then any potentially damaging operation would be averted.

Here’s the relevant code in Rails 2.3.11:

def verify_authenticity_token
  verified_request? || handle_unverified_request
end

def handle_unverified_request
  reset_session
end

Versus the relevant code in Rails 2.3.10 (sidenote: you can see all of this easily in Github because Rails is diligent about tagging releases, a practice you should certainly follow in your own development):

def verify_authenticity_token
  verified_request? || raise(ActionController::InvalidAuthenticityToken)
end

And, sure enough, checking Subversion showed that I upgraded the version of Rails I was using in January of this year in response to this security advisory. I read that, made the required modifications to my application, tested, and had no problems.
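Stripped of Rails, the difference between the two versions looks roughly like the following sketch. The `InvalidToken` class and both filter methods are hypothetical stand-ins that mirror the shape of the snippets above; a Hash plays the part of the session.

```ruby
# Hypothetical stand-in for ActionController::InvalidAuthenticityToken.
class InvalidToken < StandardError; end

# 2.3.10-style: a failed check is loud.
def strict_filter(session, verified)
  raise InvalidToken unless verified
  session
end

# 2.3.11-style: a failed check quietly swaps in an empty session
# and lets the request carry on.
def lenient_filter(session, verified)
  verified ? session : {}
end

session = { user_id: 42 }

begin
  strict_filter(session, false)
rescue InvalidToken
  # This is the point where an exception notifier would have emailed me.
end

lenient_filter(session, false) # => {}
```

With the lenient version nothing is raised, so nothing is logged or emailed. The only evidence is a user who is mysteriously no longer signed in.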

So What Went Wrong Then?

After I was sure that sessions were being reset (but only in production), I added a bit of instrumentation to the live site to record the session ID for people coming from my IP address and to log when it changed. This let me find the culprit: a bit of Javascript that A/Bingo, my A/B testing library, uses to verify that people are human. It assumes that robots generally won’t run Javascript engines capable of doing POST requests, so it does an ajax-y POST to my server to assert humanity of the end-user, thus keeping almost all bots out of my stats.

That code has been live over a year. Why did it suddenly start causing session resets? Oh, another change in the 2.3.11 upgrade:

The old code:

  # Returns true or false if a request is verified.
  # Comment truncated by Patrick
  def verified_request?
      !protect_against_forgery?     ||
        request.method == :get      ||
        request.xhr?                ||
        !verifiable_request_format? ||
        form_authenticity_token == form_authenticity_param
  end

Notice that request.xhr? will cause this request to be verified if it evaluates to true, regardless of the other things in the OR statements. request.xhr? tests whether a request is ajax-y in nature. A/Bingo’s humanity-verifying POST is, so it didn’t trigger the CSRF check.

The new code, however:

  # Returns true or false if a request is verified.
  # Comment truncated by Patrick
  def verified_request?
    !protect_against_forgery?                            ||
      request.get?                                       ||
      form_authenticity_token == form_authenticity_param ||
      form_authenticity_token == request.headers['X-CSRF-Token']
  end

Yep, as announced in the patch notes, we lost the exemption for XHR requests. So the A/Bingo mark_human request will, because it makes no particular effort to include a CSRF token (which I will be changing very quickly, as A/Bingo is my project), with certainty cause the CSRF check to fail in 2.3.11. This will result in not a noisy exception (the previous behavior) but instead a silent reset followed by re-running the action. A/Bingo, which doesn’t care a whit whether you’re logged in, will then mark your freshly new session as human. If the previous contents of your session mattered, for example to keep you signed in, they are now gone. A/Bingo will not reaudit your humanity, though, because your session now marks you as human, so this will only ever happen to your browser once.
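Here is a small plain-Ruby sketch of that change in the check. `FakeRequest` and both predicate methods are hypothetical, and the `protect_against_forgery?` and `verifiable_request_format?` clauses are left out to keep the comparison short.

```ruby
# Hypothetical stand-in for a request; the field names are illustrative only.
FakeRequest = Struct.new(:http_method, :xhr, :form_token, :header_token, keyword_init: true)

# Old rule: any Ajax request is waved through.
def verified_old?(req, session_token)
  req.http_method == :get || req.xhr || req.form_token == session_token
end

# New rule: no Ajax exemption; the token must arrive as a form param or a header.
def verified_new?(req, session_token)
  req.http_method == :get ||
    req.form_token == session_token ||
    req.header_token == session_token
end

session_token = "abc123"

# The humanity check as it stood: an Ajax POST with no token anywhere.
mark_human = FakeRequest.new(http_method: :post, xhr: true)
verified_old?(mark_human, session_token) # => true
verified_new?(mark_human, session_token) # => false

# The same POST once it sends the token in a header.
patched = FakeRequest.new(http_method: :post, xhr: true, header_token: "abc123")
verified_new?(patched, session_token) # => true
```

The request did not change at all between the two versions. Only the rule judging it did.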

Race Conditions: Not Just In Java Thread Programming

So why did this never show up in development and why did it show up only sporadically in production? Well, consider how a browser interprets a page presented to it: it first downloads the HTML, then downloads the assets, blocking when it discovers e.g. CSS or Javascript which alters the document. This means that Javascript very low on a page may never execute if something above it blocks it until the user navigates away. (This is a pretty gross simplification of how multiple pieces of very complicated and often incompatible software do something very difficult. If you want details, read stuff by the people behind YSlow. They’re geniuses and taught me all that I successfully learned about this process.) Something like, say, external analytics utilities loaded over the public Internet. My page includes a few such scripts, like Google Analytics and CrazyEgg. They are off in development to avoid polluting my stats.

This plus the lack of network latency means that, on development, a browser which sees a page that includes the humanity testing Javascript will almost certainly execute it. That will cause the session to be burned, once, on the first page load. Since my invariable workflow for manual testing is “Start with a new browser at the homepage or landing page, do stuff, make sure it works”, the order of execution is:

  1. I load the front page or a landing page. The session is initialized to some value S1.
  2. (A few milliseconds later.) The A/Bingo Javascript checks for my humanity, resetting the session to some new value S2.
  3. I hit the registration or login button, and the site works as I expect it to.
  4. Since the site knows I am human now, that never gets checked again, and the session never gets reset again.

In production though, the workflow could very well be:

  1. The user arrives at the front page or landing page. The session is initialized to some value S1, including (say) their referrer information.
  2. A bunch of Javascript starts loading ahead of the A/Bingo check.
  3. The user, within one or two seconds (depending on network latency to those external scripts), either logs in or creates an account.
  4. The browser never successfully executes the A/Bingo check.
  5. The user arrives at their dashboard. When it is rendered, the server (robotically) decides it isn’t quite sure if they are human yet, and includes that Javascript again. (This behavior is designed because I was aware of the timing issue, I just didn’t realize how it would shake out with the 2.3.11 upgrade.)
  6. This time, the user ponders their dashboard enough for the A/Bingo Javascript to post successfully. This resets their session to some new value S2.
  7. The user clicks anything on the page, and (because S2 doesn’t include their logged in user ID) gets taken to a login screen.
  8. The user is now durably marked as human, so the A/Bingo check never fires again, preventing a second unceremonious logout.

This neatly explains the logged out users. How to explain the missing referrer information? Well, if the user is NOT very fast on the click on the first page, they’ll have their referrer cleared out of the session before they successfully sign up. They’ll get marked as a human prior to creating their account, though, so they’ll never even notice the unceremonious logout. This is the behavior of the overwhelming bulk of new users, which is why the stats were getting comprehensively borked but almost no users thought to complain.
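The two orderings can be replayed in a toy model. Everything here is a hypothetical sketch (the `FakeApp` class, the `run` helper, and the referrer string are made up for illustration), but it captures the one rule that matters: the tokenless humanity POST wipes the session before marking it human.

```ruby
# Hypothetical toy model of the app; a Hash stands in for the session.
class FakeApp
  attr_reader :session

  def initialize(referrer)
    @session = { referrer: referrer }
  end

  # The humanity POST carries no CSRF token, so the session is reset first
  # and only then marked human. Once marked, the check never fires again.
  def mark_human
    return if @session[:human]
    @session = { human: true }
  end

  # Logs the user in and returns the referrer that gets saved with the account.
  def sign_up(user_id)
    @session[:user_id] = user_id
    @session[:referrer]
  end

  def logged_in?
    !@session[:user_id].nil?
  end
end

# Replay a sequence of events and report what the user and the stats see.
def run(events)
  app = FakeApp.new("halloween-mini-site")
  recorded_referrer = nil
  events.each do |event|
    case event
    when :mark_human then app.mark_human
    when :sign_up    then recorded_referrer = app.sign_up(42)
    end
  end
  { referrer: recorded_referrer, logged_in: app.logged_in? }
end

slow_clicker = run([:mark_human, :sign_up]) # referrer lost, user stays logged in
fast_clicker = run([:sign_up, :mark_human]) # referrer kept, user gets logged out
```

Same code, same user, different outcome depending purely on which event lands first. A test that pins the event order like this is also the kind of reproducible series of steps that would have caught the bug.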

This difference in behavior based on the hidden interaction of two concurrent processes is called a race condition. Race conditions are why sane programmers don’t program with threads or, if they do, they use shared-nothing architecture and pass all communication between the threads through a message queue written by someone who knows what they are doing (if you have to ask, it isn’t you; seriously, multithreaded programming is hard). I haven’t seen a race condition in years, because the genre of web applications I write and their architectures make me mostly immune to them. Well, I just got busted back to CS102. Sadly, the core lesson of CS102 hasn’t changed: reasoning through why race conditions happen is very hard.

Saved By Unsophisticated Users, Sort Of

Users returning after the session naturally expired (2 weeks) would go through the dance again, potentially getting asked to log in twice. However, it took most of them enough time for the human check to fire prior to finding where the Sign In button was, so the percentage of users who actually visibly saw the bug was fairly small. (I’m guessing, from a quick heuristic run on my log files, that it was below 1% of accounts. That’s the optimistic way to say it. The pessimistic way is to say that this bug negatively affected the better part of a thousand people, and probably cost me sales from some of them.)

Whose Fault Is This?

If my users are inconvenienced, it is my fault, always. I should have read the patch notes for 2.3.11 more diligently, to discover the very consequential line “In addition to altering the templates, an application’s javascript must be changed to send the token with Ajax requests.”, and I should have been more aware that there was a one-line Javascript method pulled in by a library (which I wrote, so that is no excuse) which was not automatically upgraded with the Rails helper methods.

I’m not sure if more diligent testing would have caught this. Race conditions are hard to diagnose, and while I might have caught it by including production levels of external Javascript in my development environment, the symptoms would only have been visible a fraction of the time anyhow, and in ways which didn’t visibly break the application most of the time. (Who checks their stats for the development version to make sure they’re sensible after implementing that function correctly the first time?)

What I really should have done about this is address it earlier, when I first got the inkling that there was some weird edge case which would cause a logged in user to become logged out. I futzed around with my configuration once or twice and saw the problem go away (because it was non-deterministic), but rather than futzing I should have figured out a complicated but reducible series of steps that would always cause the issue. That would have sent me down the right road for fixing it.

So How Do You Address This

Immediate term, a one-line patch turns off CSRF protection for the A/Bingo mark_human action, preventing it from accidentally resetting the session.

skip_filter :verify_authenticity_token, :only => :mark_human

I also added a note about this to the A/Bingo documentation. I’ll patch A/Bingo after I have enough brain cells left to do that in a way which won’t break anyone’s applications. After I patch A/Bingo, that work-around won’t be necessary.

Why’d You Write This Post?

Because, after hours spelunking in Firebug, my codebase, and the innards of an obsolete version of Rails to understand what was happening, I had to tell somebody. Some people have water coolers. I have the Internet. Hopefully, someone in this wide world will find this discussion useful.

If you’re wondering what the day-to-day life of an engineer is like or why it’s so dang hard some of the time, this might be a good example (of the pathological case; the typical case is writing boring code which solves boring problems, like laying out a 5×5 grid on a bingo card and randomizing the word order). Bingo Card Creator is not terribly complicated software when compared to most applications, but it sits on top of other pieces of code (Rails, the web server, the browser, the TCP/IP stack, the underlying OS, the hardware on both ends, etc) which collectively are orders of magnitude more complicated than any physical artifact ever created by the human race.

Most of the time that complexity is abstracted away from both the user and the developer, both as blissfully ignorant of the layers below as an ant walking on an aircraft carrier is ignorant of the depth of the ocean. But when a problem bubbles up and writing it off to gremlins isn’t getting the job done, you have to start looking at the lower levels of abstraction. That is rather harder than dealing with just the higher levels of abstraction. (Joel Spolsky has an article about this subject.)

Don't Call Yourself A Programmer, And Other Career Advice

If there was one course I could add to every engineering education, it wouldn’t involve compilers or gates or time complexity.  It would be Realities Of Your Industry 101, because we don’t teach them and this results in lots of unnecessary pain and suffering.  This post aspires to be README.txt for your career as a young engineer.  The goal is to make you happy, by filling in the gaps in your education regarding how the “real world” actually works.  It took me about ten years and a lot of suffering to figure out some of this, starting from “fairly bright engineer with low self-confidence and zero practical knowledge of business.”  I wouldn’t trust this as the definitive guide, but hopefully it will provide value over what your college Career Center isn’t telling you.

90% of programming jobs are in creating Line of Business software: Economics 101: the price for anything (including you) is a function of the supply of it and demand for it.  Let’s talk about the demand side first.  Most software is not sold in boxes, available on the Internet, or downloaded from the App Store.  Most software is boring one-off applications in corporations, under-girding every imaginable facet of the global economy.  It tracks expenses, it optimizes shipping costs, it assists the accounting department in preparing projections, it helps design new widgets, it prices insurance policies, it flags orders for manual review by the fraud department, etc etc.  Software solves business problems.  Software often solves business problems despite being soul-crushingly boring and of minimal technical complexity.  For example, consider an internal travel expense reporting form.  Across a company with 2,000 employees, that might save 5,000 man-hours a year (at an average fully-loaded cost of $50 an hour) versus handling expenses on paper, for a savings of $250,000 a year.  It does not matter to the company that the reporting form is the world’s simplest CRUD app, it only matters that it either saves the company costs or generates additional revenue.

There are companies which create software which actually gets used by customers, which describes almost everything that you probably think of when you think of software.  It is unlikely that you will work at one unless you work towards making this happen.  Even if you actually work at one, many of the programmers there do not work on customer-facing software, either.

Engineers are hired to create business value, not to program things:  Businesses do things for irrational and political reasons all the time (see below), but in the main they converge on doing things which increase revenue or reduce costs.  Status in well-run businesses generally is awarded to people who successfully take credit for doing one of these things.  (That can, but does not necessarily, entail actually doing them.)  The person who has decided to bring on one more engineer is not doing it because they love having a geek around the room, they are doing it because adding the geek allows them to complete a project (or projects) which will add revenue or decrease costs.  Producing beautiful software is not a goal.  Solving complex technical problems is not a goal.  Writing bug-free code is not a goal.  Using sexy programming languages is not a goal.  Add revenue.  Reduce costs.  Those are your only goals.

Peter Drucker (you haven’t heard of him, but he is a prophet among people who sign checks) came up with the terms Profit Center and Cost Center.  Profit Centers are the parts of an organization that bring in the bacon: partners at law firms, sales at enterprise software companies, “masters of the universe” on Wall Street, etc etc.  Cost Centers are, well, everybody else.  You really want to be attached to Profit Centers because it will bring you higher wages, more respect, and greater opportunities for everything of value to you.  It isn’t hard: a bright high schooler, given a paragraph-long description of a business, can usually identify where the Profit Center is.  If you want to work there, work for that.  If you can’t, either a) work elsewhere or b) engineer your transfer after joining the company.

Engineers in particular are usually very highly paid Cost Centers, which sets MBAs’ optimization antennae to twitching.  This is what brings us wonderful ideas like outsourcing, which is “Let’s replace really expensive Cost Centers who do some magic which we kinda need but don’t really care about with less expensive Cost Centers in a lower wage country”.  (Quick sidenote: You can absolutely ignore outsourcing as a career threat if you read the rest of this guide.)  Nobody ever outsources Profit Centers.  Attempting to do so would be the setup for MBA humor.  It’s like suggesting replacing your source control system with a bunch of copies maintained on floppy disks.

Don’t call yourself a programmer: “Programmer” sounds like “anomalously high-cost peon who types some mumbo-jumbo into some other mumbo-jumbo.”  If you call yourself a programmer, someone is already working on a way to get you fired.  You know Salesforce, widely perceived among engineers to be a Software as a Service company?  Their motto and sales point is “No Software”, which conveys to their actual customers “You know those programmers you have working on your internal systems?  If you used Salesforce, you could fire half of them and pocket part of the difference in your bonus.”  (There’s nothing wrong with this, by the way.  You’re in the business of unemploying people.  If you think that is unfair, go back to school and study something that doesn’t matter.)

Instead, describe yourself by what you have accomplished for previous employers vis-a-vis increasing revenues or reducing costs.  If you have not had the opportunity to do this yet, describe things which suggest you have the ability to increase revenue or reduce costs, or ideas to do so.

There are many varieties of well-paid professionals who sling code but do not describe themselves as slinging code for a living.  Quants on Wall Street are the first and best-known example: they use computers and math as a lever to make high-consequence decisions better and faster than an unaided human could, and the punchline to those decisions is “our firm makes billions of dollars.”  Successful quants make more in bonuses in a good year than many equivalently talented engineers will earn in a decade or lifetime.

Similarly, even though you might think Google sounds like a programmer-friendly company, there are programmers and then there’s the people who are closely tied to 1% improvements in AdWords click-through rates.  (Hint: provably worth billions of dollars.)  I recently stumbled across a web-page from the guy whose professional bio is “wrote the backend billing code that 97% of Google’s revenue passes through.”  He’s now an angel investor (a polite synonym for “rich”).

You are not defined by your chosen software stack: I recently asked via Twitter what young engineers wanted to know about careers.  Many asked how to know what programming language or stack to study.  It doesn’t matter.  There you go.

Do Java programmers make more money than .NET programmers?  Anyone describing themselves as either a Java programmer or .NET programmer has already lost, because a) they’re a programmer (you’re not, see above) and b) they’re making themselves non-hireable for most programming jobs.  In the real world, picking up a new language takes a few weeks of effort and after 6 to 12 months nobody will ever notice you haven’t been doing that one for your entire career.  I did back-end Big Freaking Java Web Application development as recently as March 2010.  Trust me, nobody cares about that.  If a Python shop was looking for somebody technical to make them a pile of money, the fact that I’ve never written a line of Python would not get held against me.

Talented engineers are rare — vastly rarer than opportunities to use them — and it is a seller’s market for talent right now in almost every facet of the field.  Everybody at Matasano uses Ruby.  If you don’t, but are a good engineer, they’ll hire you anyway.  (A good engineer has a track record of — repeat after me — increasing revenue or decreasing costs.)  Much of Fog Creek uses the Microsoft Stack.  I can’t even spell ASP.NET and they’d still hire me.

There are companies with broken HR policies where lack of a buzzword means you won’t be selected.  You don’t want to work for them, but if you really do, you can add the relevant buzzword to your resume for the costs of a few nights and weekends, or by controlling technology choices at your current job in such a manner that it advances your career interests.  Want to get trained on Ruby at a .NET shop?  Implement a one-off project in Ruby.  Bam, you are now a professional Ruby programmer: you coded Ruby and you took money for it.  (You laugh?  I did this at a Java shop.  The one-off Ruby project made the company $30,000.  My boss was, predictably, quite happy and never even asked what produced the deliverable.)

Co-workers and bosses are not usually your friends: You will spend a lot of time with co-workers.  You may eventually become close friends with some of them, but in general, you will move on in three years and aside from maintaining cordial relations you will not go out of your way to invite them over to dinner.  They will treat you in exactly the same way.  You should be a good person to everyone you meet — it is the moral thing to do, and as a sidenote will really help your networking — but do not be under the delusion that everyone is your friend.

For example, at a job interview, even if you are talking to an affable 28 year old who feels like a slightly older version of you, he is in a transaction.  You are not his friend, you are an input for an industrial process which he is trying to buy for the company at the lowest price.  That banter about World of Warcraft is just establishing a professional rapport, but he will (perfectly ethically) attempt to do things that none of your actual friends would ever do, like try to talk you down several thousand dollars in salary or guilt-trip you into spending more time with the company when you could be spending time with your actual friends.  You will have other coworkers who, affably and ethically, will suggest things which go against your interests, from “I should get credit for that project you just did” (probably not phrased in so many words) to “We should do this thing which advances my professional growth goals rather than yours.”  Don’t be surprised when this happens.

You radically overestimate the average skill of the competition because of the crowd you hang around with:  Many people already successfully employed as senior engineers cannot actually implement FizzBuzz.  Just read it and weep.  Key takeaway: you probably are good enough to work at that company you think you’re not good enough for.  They hire better mortals, but they still hire mortals.

“Read ad.  Send in resume.  Go to job interview.  Receive offer.” is the exception, not the typical case, for getting employment: Most jobs are never available publicly, just like most worthwhile candidates are not available publicly (see here).  Information about the position travels at approximately the speed of beer, sometimes lubricated by email.  The decisionmaker at a company knows he needs someone.  He tells his friends and business contacts.  One of them knows someone — family, a roommate from college, someone they met at a conference, an ex-colleague, whatever.  Introductions are made, a meeting happens, and they achieve agreement in principle on the job offer.  Then the resume/HR department/formal offer dance comes about.

This is disproportionately true of jobs you actually want to get.  “First employee at a successful startup” has a certain cachet for a lot of geeks, and virtually none of those got placed by sending in a cover letter to an HR department, in part because two-man startups don’t have enough scar tissue to form HR departments yet.  (P.S. You probably don’t want to be first employee for a startup.  Be the last co-founder instead.)  Want to get a job at Google?  They have a formal process for giving you a leg up because a Googler likes you.  (They also have multiple informal ways for a Googler who likes you an awful lot to short-circuit that process.  One example: buy the company you work for.  When you have a couple of billion lying around you have many interesting options for solving problems.)

There are many reasons why most hiring happens privately.  One is that publicly visible job offers get spammed by hundreds of resumes (particularly in this economy) from people who are stunningly inappropriate for the position.  The other is that other companies are so bad at hiring that, if you don’t have close personal knowledge about the candidate, you might accidentally hire a non-FizzBuzzer.

Networking: it isn’t just for TCP packets: Networking just means a) meeting people who at some point can do things for you (or vice versa) and b) making a favorable impression on them.

There are many places to meet people.  Events in your industry, such as conferences or academic symposia which get seen by non-academics, are one.  User groups are another.  Keep in mind that user groups draw a very different crowd than industry conferences and optimize accordingly.

Strive to help people.  It is the right thing to do, and people are keenly aware of who has in the past done favors for them or theirs.  If you ever can’t help someone but know someone who can, pass them to the appropriate person with a recommendation.  If you do this right, two people will be happy with you and favorably disposed to helping you out in the future.

You can meet people over the Internet (oh God, can you), but something in our monkey brains makes in-the-flesh meeting a bigger thing.  I’ve Internet-met a great many people who I’ve then gone on to meet in real life.  The physical handshake is a major step up in the relationship, even when Internet-meeting led to very consequential things like “Made them a lot of money through good advice.”  Definitely blog and participate on your industry-appropriate watering holes like HN, but make it out to the meetups for it.

Academia is not like the real world: Your GPA largely doesn’t matter (modulo one high profile exception: a multinational advertising firm).  To the extent that it does matter, it only determines whether your resume gets selected for job interviews.  If you’re reading the rest of this, you know that your resume isn’t the primary way to get job interviews, so don’t spend huge amounts of effort optimizing something that you either have sufficiently optimized already (since you’ll get the same number of interviews at 3.96 as you will at 3.8) or that you don’t need at all (since you’ll get job interviews because you’re competent at asking the right people to have coffee with you).

Your major and minor don’t matter.  Most decisionmakers in industry couldn’t tell the difference between a major in Computer Science and a major in Mathematics if they tried.  I was once reduced to tears because a minor academic snafu threatened my ability to get a Bachelor of Science with a major in Computer Science, which my advisor told me was more prestigious than a Bachelor of Science in Computer Science.  Academia cares about distinctions like that.  The real world does not.

Your professors might understand how the academic job market works (short story: it is ridiculously inefficient in engineering and fubared beyond mortal comprehension in English) but they often have quixotic understandings of how the real world works.  For example, they may push you to get extra degrees because a) it sounds like a good idea to them and b) they enjoy having research-producing peons who work for ramen.  Remember, market wages for people capable of producing research are $80~100k+++ in your field.  That buys an awful lot of ramen.

The prof in charge of my research project offered me a spot in his lab, a tuition waiver, and a whole $12,000 as a stipend if I would commit 4~6 years to him.  That’s a great deal if, and only if, you have recently immigrated from a low-wage country and need someone to intervene with the government to get you a visa.

If you really like the atmosphere at universities, that is cool.  Put a backpack on and you can walk into any building at any university in the United States any time you want.  Backpacks are a lot cheaper than working in academia.   You can lead the life of the mind in industry, too — and enjoy less politics and better pay.  You can even get published in journals, if that floats your boat.  (After you’ve escaped the mind-warping miasma of academia, you might rightfully question whether Published In A Journal is really personally or societally significant as opposed to close approximations like Wrote A Blog Post And Showed It To Smart People.)

How much money do engineers make?

Wrong question.  The right question is “What kind of offers do engineers routinely work for?”, because salary is one of many levers that people can use to motivate you.  The answer to this is, less than helpfully, “Offers are all over the map.”

In general, big companies pay more (money, benefits, etc) than startups.  Engineers with high perceived value make more than those with low perceived value.  Senior engineers make more than junior engineers.  People working in high-cost areas make more than people in low-cost areas.  People who are skilled in negotiation make more than those who are not.

We have strong cultural training to not ask about salary, ever.  This is not universal.  In many cultures, professional contexts are a perfectly appropriate time to discuss money.  (If you were a middle class Japanese man, you could reasonably be expected to reveal your exact salary to a 2nd date, anyone from your soccer club, or the guy who makes your sushi.  If you owned a company, you’d probably be cagey about your net worth but you’d talk about employee salaries the way programmers talk about compilers — quite frequently, without being embarrassed.)   If I were a Marxist academic or a conspiracy theorist, I might think that this bit of middle class American culture was specifically engineered to be in the interests of employers and against the interests of employees.  Prior to a discussion of salary at any particular target employer, you should speak to someone who works there in a similar situation and ask about the salary range for the position.  It is <%= Date.today.year %>; you can find these people online.  (LinkedIn, Facebook, Twitter, and your (non-graph-database) social networks are all good to lean on.)

Anyhow.  Engineers are routinely offered a suite of benefits.  It is worth worrying, in the United States, about health insurance (traditionally, you get it and your employer foots most or all of the costs) and your retirement program, which is some variant of “we will match contributions to your 401k up to X% of salary.”  The value of that is easy to calculate: X% of salary.  (It is free money, so always contribute to your 401k at least up to the employer match.  Put it in index funds and forget about it for 40 years.)
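A rough sketch of that arithmetic in Ruby, if it helps.  The helper name, the salary, and the 4% cap are made-up illustrations, and I am assuming a dollar-for-dollar match; real plans differ in their matching formulas and vesting rules.

```ruby
# Hypothetical helper: yearly value of "we match contributions up to X% of salary".
# Assumes a dollar-for-dollar match; real plans vary (partial matches, vesting, etc).
def employer_match(salary:, match_percent:, your_contribution:)
  cap = salary * match_percent / 100.0
  # The employer never pays more than the cap, and never more than you put in.
  [your_contribution, cap].min
end

# Illustrative numbers only.
full    = employer_match(salary: 80_000, match_percent: 4, your_contribution: 5_000)
partial = employer_match(salary: 80_000, match_percent: 4, your_contribution: 1_000)

puts "Contributing past the cap earns the full match: $#{full}"
puts "Contributing below the cap leaves money on the table: $#{partial}"
```

Under those assumptions, anything you contribute below the cap is matched in full, which is why the cap is the number to hit.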

There are other benefits like “free soda”, “catered lunches”, “free programming books”, etc.  These are social signals more than anything else.  When I say that I’m going to buy you soda, that says a specific thing about how I run my workplace, who I expect to work for me, and how I expect to treat them.  (It says “I like to move the behavior of unsophisticated young engineers by making this job seem fun by buying 20 cent cans of soda, saving myself tens of thousands in compensation while simultaneously encouraging them to ruin their health.”  And I like soda.)  Read social signals and react appropriately — someone who signals that, e.g., employee education is worth paying money for might very well be a great company to work for — but don’t give up huge amounts of compensation in return for perks that you could trivially buy.

How do I become better at negotiation?  This could be a post in itself.  Short version:

a)  Remember you’re selling the solution to a business need (raise revenue or decrease costs) rather than programming skill or your beautiful face.

b)  Negotiate aggressively with appropriate confidence, like the ethical professional you are.  It is what your counterparty is probably doing.  You’re aiming for a mutually beneficial offer, not for saying Yes every time they say something.

c)  “What is your previous salary?” is employer-speak for “Please give me reasons to pay you less money.”  Answer appropriately.

d)  Always have a counteroffer.  Be comfortable counteroffering around axes you care about other than money.  If they can’t go higher on salary then talk about vacation instead.

e)  The only time to ever discuss salary is after you have reached agreement in principle that they will hire you if you can strike a mutually beneficial deal.  This is late in the process after they have invested a lot of time and money in you, specifically, not at the interview.  Remember that there are large costs associated with them saying “No, we can’t make that work” and, appropriately, they will probably not scuttle the deal over comparatively small issues which matter quite a bit to you, like e.g. taking their offer and countering for that plus a few thousand bucks then sticking to it.

f)  Read a book.  Many have been written about negotiation.  I like Getting To Yes.  It is a little disconcerting that negotiation skills are worth thousands of dollars per year for your entire career but engineers think that directed effort to study them is crazy when that could be applied to trivialities about a technology that briefly caught their fancy.

How to value an equity grant:

Roll d100.  (Not the right kind of geek?  Sorry.  rand(100) then.)

0~70: Your equity grant is worth nothing.

71~94: Your equity grant is worth a lump sum of money which makes you about as much money as you gave up working for the startup, instead of working for a megacorp at a higher salary with better benefits.

95~99: Your equity grant is a lifechanging amount of money.  You won’t feel rich — you’re not the richest person you know, because many of the people you spent the last several years with are now richer than you by definition — but your family will never again give you grief for not having gone into $FAVORED_FIELD like a proper $YOUR_INGROUP.

100: You worked at the next Google, and are rich beyond the dreams of avarice.  Congratulations.

Perceptive readers will note that 100 does not actually show up on a d100 or rand(100).
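For the rand(100) crowd, here is the table above as a minimal Ruby sketch.  The method name and outcome symbols are my own labels for illustration.  Ruby’s rand(100) returns an integer from 0 to 99, which is why the last branch never fires.

```ruby
# Sketch of the table above. rand(100) returns an integer in 0..99,
# so the 100 branch is unreachable unless you pass the roll in by hand.
def equity_outcome(roll = rand(100))
  case roll
  when 0..70  then :worth_nothing
  when 71..94 then :roughly_break_even
  when 95..99 then :life_changing
  when 100    then :next_google
  end
end

# Roll for 10,000 hypothetical startup employees and count the outcomes.
tally = Hash.new(0)
10_000.times { tally[equity_outcome] += 1 }
p tally
```

Run it a few times: the counts wobble, but :next_google never appears in the tally.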

Why are you so negative about equity grants?

Because you radically overestimate the likelihood that your startup will succeed and radically overestimate the portion of the pie that will be allocated to you if the startup succeeds.  Read about dilution and liquidation preferences on Hacker News or Venture Hacks, then remember that there are people who know more about negotiating deals than you know about programming and imagine what you could do to a program if there were several hundred million on the line.

Are startups great for your career as a fresh graduate?

The high-percentage outcome is you work really hard for the next couple of years, fail ingloriously, and then are jobless and looking to get into another startup.  If you really wanted to get into a startup two years out of school, you could also just go work at a megacorp for the next two years, earn a bit of money, then take your warchest, domain knowledge, and contacts and found one.

Working at a startup, you tend to meet people doing startups.  Most of them will not be able to hire you in two years.  Working at a large corporation, you tend to meet other people in large corporations in your area.  Many of them either will be able to hire you or will have the ear of someone able to hire you in two years.

So would you recommend working at a startup?  Working in a startup is a career path but, more than that, it is a lifestyle choice.  This is similar to working in investment banking or academia.  Those are three very different lifestyles.  Many people will attempt to sell you those lifestyles as being in your interests, for their own reasons.  If you genuinely would enjoy that lifestyle, go nuts.  If you only enjoy certain bits of it, remember that many things are available a la carte if you really want them.  For example, if you want to work on cutting-edge technology but also want to see your kids at 5:30 PM, you can work on cutting-edge technology at many, many, many megacorps.

(Yeah, really.  If it creates value for them, heck yes, they’ll invest in it.  They’ll also invest in a lot of CRUD apps, but then again, so do startups; they just market making CRUD apps better than most megacorps do.  The first hour of the Social Network is about making a CRUD app seem sexy, the second is a Lifetime drama about a divorce improbably involving two heterosexual men.)

Your most important professional skill is communication: Remember how engineers are not hired to create programs but to create business value?  The dominant quality which gets you jobs is the ability to give people the perception that you will create value.  This is not necessarily coextensive with ability to create value.

Some of the best programmers I know are pathologically incapable of carrying on a conversation.  People disproportionately a) wouldn’t want to work with them or b) will underestimate their value-creation ability because they gain insight into that ability through conversation and the person just doesn’t implement that protocol.  Conversely, people routinely assume that I am among the best programmers they know entirely because a) there exists observable evidence that I can program and b) I write and speak really, really well.

(Once upon a time I would have described myself as “Slightly below average” in programming skill.  I have since learned that I had a radically skewed impression of the skill distribution, that programming skill is not what people actually optimize for, and that modesty is against my interests.  These days if you ask me how good of a programmer I am I will start telling you stories about how I have programmed systems which helped millions of kids learn to read or which provably made companies millions.  The question of where I am on the bell curve matters to no one, so why bother worrying about it?)

Communication is a skill.  Practice it: you will get better.  One key sub-skill is being able to quickly, concisely, and confidently explain how you create value to someone who is not an expert in your field and who does not have a priori reasons to love you.  If when you attempt to do this technical buzzwords keep coming up (“Reduced 99th percentile query times by 200 ms by optimizing indexes on…”), take them out and try again.  You should be able to explain what you do to a bright 8 year old, the CFO of your company, or a programmer in a different specialty, at whatever the appropriate level of abstraction is.

You will often be called to do Enterprise Sales and other stuff you got into engineering to avoid: Enterprise Sales is going into a corporation and trying to convince them to spend six or seven figures on buying a system which will either improve their revenue or reduce costs.  Every job interview you will ever have is Enterprise Sales.  Politics, relationships, and communication skills matter a heck of a lot, technical reality not quite so much.

When you have meetings with coworkers and are attempting to convince  them to implement your suggestions, you will also be doing Enterprise Sales.  If getting stuff done is your job description, then convincing people to get stuff done is a core job skill for you.  Spend appropriate effort on getting good at it.  This means being able to communicate effectively in memos, emails, conversations, meetings, and PowerPoint (when appropriate).  It means understanding how to make a business case for a technological initiative.  It means knowing that sometimes you will make technological sacrifices in pursuit of business objectives and that this is the right call.

Modesty is not a career-enhancing character trait: Many engineers have self-confidence issues (hello, self).  Many also come from upbringings where modesty with regards to one’s accomplishments is culturally celebrated.  American businesses largely do not value modesty about one’s accomplishments.  The right tone to aim for in interviews, interactions with other people, and life is closer to “restrained, confident professionalism.”

If you are part of a team effort and the team effort succeeds, the right note to hit is not “I owe it all to my team” unless your position is such that everyone will understand you are lying to be modest.  Try for “It was a privilege to assist my team by leading their efforts with regards to $YOUR_SPECIALTY.”  Say it in a mirror a thousand times until you can say it with a straight face.  You might feel like you’re overstating your accomplishments.  Screw that.  Someone who claims to Lead Efforts To Optimize Production while having the title Sandwich Artist is overstating their accomplishments.  You are an engineer.  You work magic which makes people’s lives better.  If you were in charge of the database specifically on an important project involving people then heck yes you led the database effort which was crucial for the success of the project.  This is how the game is played.  If you feel poorly about it, you’re like a batter who feels poorly about stealing bases in baseball: you’re not morally superior, you’re just playing poorly.

All business decisions are ultimately made by one or a handful of multi-cellular organisms closely related to chimpanzees, not by rules or by algorithms: People are people.  Social grooming is a really important skill.  People will often back suggestions by friends because they are friends, even when other suggestions might actually be better.  People will often be favorably disposed to people they have broken bread with.  (There is a business book called Never Eat Alone.  It might be worth reading, but that title is whatever the antonym of deceptive advertising is.)  People routinely favor people who they think are like them over people they think are not like them.  (This can be good, neutral, or invidious.  Accepting that it happens is the first step to profitably exploiting it.)

Actual grooming is at least moderately important, too, because people are hilariously easy to hack by expedients such as dressing appropriately for the situation, maintaining a professional appearance, speaking in a confident tone of voice, etc.  Your business suit will probably cost about as much as a computer monitor.  You only need it once in a blue moon, but when you need it you’ll be really, really, really glad that you have it.  Take my word for it: if I wear everyday casual when I visit e.g. City Hall I get treated like a hapless awkward twenty-something; if I wear the suit I get treated like the CEO of a multinational company.  I’m actually the awkward twenty-something CEO of a multinational company, but I get to pick which side to emphasize when I want favorable treatment from a bureaucrat.

(People familiar with my business might object to me describing it as a multinational company because it is not what most people think of when “multinational company” gets used in conversation.  Sorry — it is a simple conversational hack.  If you think people are pissed off at being manipulated when they find that out, well, some people passionately hate business suits, too.  That doesn’t mean business suits are valueless.  Be appropriate to the circumstances.  Technically true answers are the best kind of answers when the alternative is Immigration deporting you, by the way.)

At the end of the day, your life happiness will not be dominated by your career.  Either talk to older people or trust the social scientists who have: family, faith, hobbies, etc etc generally swamp career achievements and money in terms of things which actually produce happiness.  Optimize appropriately.  Your career is important, and right now it might seem like the most important thing in your life, but odds are that is not what you’ll believe forever.  Work to live, don’t live to work.

Speaking At Microconf — Free Ticket Inside

I met Rob Walling of Software By Rob at the Business of Software conference last year, after a couple years of swapping emails.  He and I hit it off, largely since we come from similar places on the “building small profitable software businesses” solution space.  So when he asked if I would fly around the world to speak at MicroConf, a conference he was organizing for small software businesses, I of course said yes.  It is June 6th and 7th in Las Vegas, and tickets are still on sale.  (Special promo code: BINGO gets you $100 off.  I don’t get compensated for that.)

Update: I have since given away the one free ticket, but feel free to use the above promo code.  The original offer: I also have a ticket to give away.  It is yours if

  • you have a small software business with an actual product which sells to actual people
  • you have sold at least one copy
  • you can get yourself to Vegas
  • you can find my address and email me explaining what you hope to get out of the conference

Offer good to one person, judging based solely on who I think would benefit most from it.

I have not written my speech yet, but intend to make it worthwhile for folks coming there, both in terms of motivation and in terms of teaching stuff they can actually use for their businesses.  I generally tend to talk marketing when it comes to that.

I’m currently kicking around an extended metaphor about icebergs for the talk.  You see, every business is an iceberg: of the value created by the business, much of it is hidden within the company or (at best) exposed to existing customers, and only a small portion peeks above the waterline, outside of the existing community around the business.  This is unfortunate, because both traditional marketing and SEO revolve around maximizing the visible bit of the iceberg.  There are practical ways to do that which work well for software businesses.  I will likely talk about several of them.

If you have any suggestions on things I should absolutely cover, I’d love to hear them in the comments.

If you come to Microconf, talk to me.  I know most of the other speakers and they’re all very personable people.  You should probably talk to them, too.  But this is an explicit invitation: talk to me.  I’m literally flying halfway around the world and I have no agenda item other than talking to you about your software business.  Ask me for advice.  I can’t promise it will be good advice, but I intend to give lots of it.  I’ll be the tall jet-lagged white guy in the bright red Twilio jacket. (<Plug>Twilio: it’s awesome, you should be using it.  Plus they make awesome red jackets.</Plug>)

Appointment Reminder at 6 Months

<Plug>

The guys at AppSumo approached me and said “Hey, we’d like to do a video of you talking strategy with Andrew Warner.  You guys script it, we’ll edit it and sell it.”  Ordinarily I don’t really do e-books and whatnot but that pitch had me at “Andrew”, because Mixergy is one of the best sources of consistently actionable advice I’ve seen, and I want to help him succeed in whatever little way I can.

The topic of the video is Scalable Content Generation.  It’s the same SEO strategy that I’ve talked about on my blog for years (see Greatest Hits section).  Take my word for it: that is the highest single expected ROI of anything I’ve ever talked about on this blog.  However, those posts are stream-of-consciousness notes from a strategy which evolved organically over years.  Many people tell me that the idea is wonderful; very few actually end up implementing it.

The video is scripted, professionally edited, and organized so that hopefully people will actually implement it this time.  Andrew and I walk you through exactly what I did to turn $3,000 of freelancer writing into $30,000 of sales last year, and discuss how to apply that to an arbitrary online business.

Here was my pitch to AppSumo for why they should have me talk on Scalable Content Generation:

  1. It lets small businesses achieve top rankings for relevant search terms on Google for minimal costs.
  2. It lets small businesses develop an asset which continues to grow in value over time, rather than leasing traffic via e.g. AdWords ads, where you have to continue paying or the spigot turns off.
  3. It allows you to provide huge amounts of actual value to customers without spending a huge amount of time on it.

The video is for sale over here.  Apparently there is an option for getting it free for the next 24 hours (on Friday, April 15th, 2011 US time); after that it will be somewhere south of $100.  They’re also throwing in an hour of consultation with me for somebody who writes a review.

Ask me about my thoughts on e-books and info marketing some other time, but suffice it to say a) I am doing this for free, b) I would not have done it but for Andrew’s involvement, and c) I believe the video has value to online businesses or I wouldn’t be associated with it.

</Plug>

Appointment Reminder Update

Early in December I launched my second software business, Appointment Reminder.  I can’t be as open with it as I am with Bingo Card Creator (you can literally see my sales stats for that one), but I hope to keep folks informed about how things are going.  Long story short: egads, I had forgotten how long it takes to get these things off the ground.

One would assume that, after leaving the 70 ~ 90 hour a week day job, I would be able to devote 100% of my concentration to the new business.   That turned out to be grossly over-optimistic: a combination of burnout, reacquainting myself with human life, and distractions from consulting meant that I accomplished almost nothing on AR between May and late October of last year.  Similarly, after launching I took the month off for Christmas, and when I got back in January, instead of immediately applying myself to marketing AR, I floundered around for quite a bit.  There was a consulting engagement, a few side projects (Achievement Unlocked: Published in Academic Journal), an earthquake, and now we’re almost to Easter and I’m wondering where 2011 has gone.

So that’s the bad news.  The good news:

Got Customers.

AR has had about fifty people sign up for the trial, either by doing it themselves on the website or by me giving them an account manually.  That’s much, much, much slower than BCC — these days, BCC routinely gets 250 signups a day.  The saving grace is that their conversion rates are high: keeping in mind that customers have a 30 day free trial and that many are still within it, about 10 of them have already paid me money and about 10 more look likely to.  The revenue run rate is still inconsequential (south of $500 a month), but AR is cash flow positive (pays for server, calls, credit card processing, etc), and the unit economics for those customers turned out better than expected.

For example, my most popular plan currently is the Professional one, at $29 a month.  This entitles the customer to up to 100 appointments a month.  The worst-case scenario for cost to service that customer is about $20 a month paid to Twilio.  My hypothesis was that the actual cost to service the customer would be lower than $5 or so, which makes the economics attractive.  It turns out that most customers on that plan are below $3 apiece.  This means that, if I could just scale customer acquisition, I would be in a very happy place.
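Back-of-the-envelope, in Ruby.  The price and the three cost figures come from the paragraph above; the helper name is mine, and the calculation deliberately ignores server costs and credit card processing, so treat it as a sketch of per-customer margin on telephony alone.

```ruby
# Monthly price of the Professional plan, from the post.
PLAN_PRICE = 29.0

# Hypothetical helper: margin as a percentage of price, counting only the Twilio bill.
def gross_margin(price, cost_to_serve)
  ((price - cost_to_serve) / price * 100).round(1)
end

puts "Worst case ($20 to Twilio): #{gross_margin(PLAN_PRICE, 20.0)}%"
puts "My hypothesis ($5):         #{gross_margin(PLAN_PRICE, 5.0)}%"
puts "What happened ($3):         #{gross_margin(PLAN_PRICE, 3.0)}%"
```

Even the worst case leaves roughly a third of the price as margin, and the observed case leaves close to nine tenths.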

They Love It.

My customers have sent over a thousand reminders regarding 800 or so appointments.  It is anecdotally making a big impact for their businesses: my biggest fan has seen his no-show rate decline to virtually nothing, which singlehandedly “pays for the mortgage.”  Many other customers report that they didn’t previously have a problem with no-shows, but that making reminder calls was a source of frustration for them, and that AR removes the frustration and makes it much more likely that any given client actually gets contacted.

Somewhat surprisingly to me, my customers’ customers love Appointment Reminder, too.  My favorite: “I wish all my [service providers] used this.”  The context for me hearing that was that customer relaying his customers’ opinions to tell me why he wouldn’t stop using Appointment Reminder after getting bitten by a real doozy of a bug.

Bugs Suck

I very carefully avoided doing anything “Mission Critical” when I was an employee, because I didn’t feel like I could offer the requisite level of service.  BCC going down can inconvenience a teacher, but nobody is going to have their day totally ruined by it.

Appointment Reminder is more than capable of totally ruining somebody’s day.  If it just broke, that would be annoying but survivable: clients do not expect to get automated reminders yet from my customers, and most will come in for their appointments regardless of whether they get a reminder or not.  However, “failure to deliver reminders in a timely fashion” is not nearly the worst possible failure case.

An example: during my apartment move in February, due to an ill-considered code push the night before the move, the DelayedJob queue which handles (among other things) outgoing reminders fell over.  Thanks to the magic of Scout, I heard about this essentially instantly.  Well, my cell phone did, at any rate.  My cell phone was packed up with my laptop and other essential computer stuff for transport by hand.  I didn’t hear about the queue falling over until after the move was mostly complete, by which time it was already 8 PM for many of my customers (in the US).

I panicked.  Mistake #3.  I was worried about many customers not getting their reminders for appointments tomorrow, so instead of doing the smart thing and purging the outgoing queue, turning on my In Case Shit Happens button (which prevents any outgoing reminders without my explicit approval), and manually restarting then verifying that the system was stable, I decided to improvise.  Mistake #4.

I visually inspected the queue, which was 1,000+ jobs ranging from outgoing reminders to low-priority requests to external analytics APIs.  I saw one type of queued item that would be annoying to try again — demo calls, which have to occur when a user is still on the website rather than hours later — and purged them.  Then I just restarted the queue workers and watched the queue go from 1,000 jobs down to 0.  Mission accomplished, right?

That night, for some reason I couldn’t sleep, so I turned on my iPad and checked email.  I had several very irate emails from customers, who had just had their morning appointments come in and complain about getting contacted by Appointment Reminder.  Repeatedly.  See, for the several hours that the queue workers were down, a cron job kept saying “Who has an appointment tomorrow?  Millie Smith?  Have we called Millie Smith yet?  OK then, queuing a call for Millie Smith and ignoring her for 5 minutes while that call takes place.”  There are an awful lot of 5 minute intervals in several hours, and the queue was not idempotent, so Millie Smith got many, many calls queued for her.

As soon as I hit “go”, the backed up queue workers blasted through 600 calls, 400 SMSes, and 200 emails, and my website and Twilio received an impromptu stress test.  We passed with flying colors.  Millie Smith’s phone, on the other hand, did not.  The worst affected user got 40 calls, back to back, essentially DDOSing their phone line for 15 minutes straight.
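For the curious, here is a minimal sketch of what “idempotent” would have meant for that cron job.  This is not AR’s actual code and it does not use the DelayedJob API: ReminderQueue, enqueue_once, and the key format are all hypothetical, and the queue is a plain in-memory array so the example runs on its own.

```ruby
require "set"

# Hypothetical in-memory stand-in for a job queue (not the DelayedJob API).
class ReminderQueue
  attr_reader :jobs

  def initialize
    @jobs = []
    @seen_keys = Set.new
  end

  # Enqueue at most one job per key, no matter how often the scheduler asks.
  # Set#add? returns nil when the key is already present.
  def enqueue_once(key, payload)
    return false unless @seen_keys.add?(key)
    @jobs << payload
    true
  end
end

queue = ReminderQueue.new

# Simulate the cron job firing every 5 minutes for 3 hours while the workers are down.
36.times do
  key = "call:appointment-42:2011-02-15" # made-up appointment id and date
  queue.enqueue_once(key, { type: :call, to: "Millie Smith" })
end

puts "Calls queued for Millie Smith: #{queue.jobs.size}"
```

In a real system the “have we already queued this?” check would presumably live in the database (a unique index on the key, say) rather than in memory, so that it survives restarts.  The point is only that asking the same question 36 times should produce one call.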

I didn’t have Internet at my new apartment yet, so I picked up my laptop and walked 45 minutes across town at 3 AM to my old apartment to perform damage control.  First I hit the In Case Shit Happens button like I should have hours ago; it stayed on for the next several days.  Then I started making phone calls.  This was, unquestionably, the low point of my entrepreneurial career: picture me in a freezing, pitch black apartment at 4 AM crying in between calls to apoplectic customers of customers.

Things looked much better in the morning.  Surprisingly to me, I only lost two customers in the debacle, and one of them resubscribed after seeing how I handled it.

High Touch Sales Processes Are Not My Cup of Tea

I’m fairly decent at marketing software on the Internet with low-touch sales: you click on my AdWords ad or SEO’d piece of content, the website convinces you to take a spin, you like the software, and a sale happens without ever speaking to me.  This is born of necessity: I simply couldn’t routinely talk to people when I had a day job.  Happily, there exist at least a few people who will buy AR on this model.

Also happily, for a different kind of happy, there is a channel for AR that I wasn’t aware existed: white label sellers.  Picture a technology consultant or web development shop which has a relationship with a few dozen small businesses in their area.  Many of them sell hours-for-dollars but they would really love to have recurring revenue sources.  Their clients have business models which involve appointments.  They would like to sell AR to their clients as if it were their own software — it lets them have all of the upside of SaaS businesses (recurring revenue, low support, etc) without actually having to write SaaS.  This also has obvious benefits for me: they have boots on the ground to sell AR to their clients, and I don’t.

I had had this in the back of my mind as an option, but it was on the backburner until somebody came to me with a dream client.  Suffice it to say they were just about ready to sign on the dotted line, and it would have involved enough Small Business ($80 / month) accounts to singlehandedly make AR a smashing success.  I immediately dropped what I was doing and built up the infrastructure to actually offer white label accounts and let the white label customers customize their off-brand AR sites.  (You can see one for a fictitious Ocean Waves Spa here.)  All hosting and software gets taken care of by me.

Then that sale fell through.  It was nobody’s fault, really: the contact’s client just happened to decide to exit the line of business which used appointments.  Oof.  This sort of thing happens quite a bit in sales.  One would think I would be used to it, since it isn’t unknown in consulting either, but it still snuck up on me.

Similarly, actually riding herd on white label accounts has been more difficult than I would have expected.  I have had a dozen leads to folks very interested in offering it and then they just dry up, largely because I am not aggressive enough about pushing the deals forward.  My typical customer support workflow is responding to all email and then thinking that I am done.  It is a new experience when a) people are not trying to tell me about problems and b) this means I have work to do.  For example, many folks need marketing support (brochures, questions answered, and whatnot) to make the sale to their clients, and since they have the relationship but know nothing about AR, I need to figure out a way to get them that support in a timely and proactive manner without interrupting everything else I need to do.

Another niggle I had not expected: some B2B customers are unqualified and it is to your advantage to figure that out early and stop pursuing them.  I had a long exchange of emails with a prospect who does professional development for a particular type of business.  Think salon, but crucially, at a much lower price point than salons operate at.  We were 15 emails and thousands of words into discussing possibilities when she indicated that the $9 a month plan would simultaneously be a) too costly and b) too limited for most of her customers.

“Ah, I don’t believe that my business is the right fit for your needs.  Best of luck in your search for an alternative.”

A business is defined both by what it does and what it does not do.  I don’t want to spend time marketing the service to customers who think $9 is an appreciable amount of money.  (For that and related reasons, I’ll be killing the $9 account tier for new customers as soon as I get the pricing page redesigned.)

What’s Next?

Same old same old!  I’m continuing to develop AR in response to observed customer needs and requests.  The product is very stable these days — I was able to virtually ignore it during a client engagement with no harm done.  Although I don’t know if I would have agreed with it at the time, I’m glad to have taken my licks when I had five customers as opposed to when I have five thousand — that would have made for a very long night of apologies.

I started implementing Scalable Content Generation (see above plug section) for AR.  Currently, I’m at the “experiment by hand” stage.  The site does not have sufficient link equity to rank for much yet, and I’m not totally wowed with my first concept for the content, so I’m going to try something else towards the end of April.  I also have a project or two in the queue along the lines of A/Bingo: produce something of value to people who are not my customers, put it on the website, collect links, use them to bootstrap rankings for commercially valuable keywords.

I’m still tentatively targeting 200 paying accounts by the end of this year.  It will take a bit of acceleration to happen, but after May (going back to the US for family and a bit of the consulting/conference circuit), I’ll have most of summer to concentrate on scaling the marketing plan. I am strongly considering various options for taking things to the next level if I can get things that far.  It will depend on a few factors, some business and some personal, but it looks highly likely that there is a viable micro-ISV in AR and quite likely that there is a bigger business there if I want to go after it.

Any questions?

Software For Underserved Markets

They’ve posted my talk at Business of Software 2010.  I highly recommend watching the video first, prior to reading the slides.

BOS2010 was one of the defining moments of my professional career (notes here). I strongly, strongly advise that you come to it in 2011 if you’re interested in taking your software business to the next level: it’s chock full of very, very smart people who are running real, profitable software businesses. (I will probably be speaking again this year, but for longer and with less joking.)

Some Perspective On The Japan Earthquake

[To Japanese readers: a reader has kindly translated this post into Japanese.  Please refer to that version.]

I run a small software business in central Japan.  Over the years, I’ve worked both in the local Japanese government (as a translator) and in Japanese industry (as a systems engineer), and have some minor knowledge of how things are done here.  English-language reporting on the matter has been so bad that my mother is worried for my safety, so in the interests of clearing the air I thought I would write up a bit of what I know.

A Quick Primer On Japanese Geography

Japan is an archipelago made up of many islands, of which there are four main ones: Honshu, Shikoku, Hokkaido, and Kyushu.  The one that almost everybody outside of the country will think of when they think “Japan” is Honshu: in addition to housing Tokyo, Nagoya, Osaka, Kyoto, and virtually every other city that foreigners have heard of, it has most of Japan’s population and economic base.  Honshu is the big island that looks like a banana on your globe, and was directly affected by the earthquake and tsunami…

… to an extent, anyway.  See, the thing that people don’t realize is that Honshu is massive. It is larger than Great Britain.  (A country which does not typically refer to itself as a “tiny island nation.”)  At about 800 miles long, it stretches from roughly Chicago to New Orleans.  Quite a lot of the reporting on Japan, including that which is scaring the heck out of my friends and family, is the equivalent of someone ringing up Mayor Daley during Katrina and saying “My God man, that’s terrible — how are you coping?”

The public perception of Japan, at home and abroad, is disproportionately influenced by Tokyo’s outsized contribution to Japanese political, economic, and social life.  It also gets more news coverage than warranted because one could poll every journalist in North America and not find one single soul who could put Miyagi or Gifu on a map.  So let’s get this out of the way: Tokyo, like virtually the whole island of Honshu, got a bit shaken and no major damage was done.  They have reported 1 fatality caused by the earthquake.  By comparison, on any given Friday, Tokyo will typically have more deaths caused by traffic accidents.  (Tokyo is also massive.)

Miyagi is the prefecture hardest hit by the tsunami, and Japanese TV is reporting that they expect fatalities in the prefecture to exceed 10,000.  Miyagi is 200 miles from Tokyo.  (Remember — Honshu is massive.)  That’s about the distance between New York and Washington DC.

Japanese Disaster Preparedness

Japan is exceptionally well-prepared to deal with natural disasters: it has spent more on the problem than any other nation, largely as a result of frequently experiencing them.  (Have you ever wondered why you use Japanese for “tsunamis” and “typhoons”?)  All levels of the government, from the Self Defense Forces to technical translators working at prefectural technology incubators in places you’ve never heard of, spend quite a bit of time writing and drilling on what to do in the event of a disaster.

For your reference, as approximately the lowest person on the org chart for Ogaki City (it’s in Gifu, which is fairly close to Nagoya, which is 200 miles from Tokyo, which is 200 miles from Miyagi, which was severely affected by the earthquake), my duties in the event of a disaster were:

  • Ascertain my personal safety.
  • Report to the next person on the phone tree for my office, which we drilled once a year.
  • Await mobilization in case response efforts required English or Spanish translation.

Ogaki has approximately 150,000 people.  The city’s disaster preparedness plan lists exactly how many come from English-speaking countries.  It is less than two dozen.  Why have a maintained list of English translators at the ready?  Because Japanese does not have a word for excessive preparation.

Another anecdote: I previously worked as a systems engineer for a large computer consultancy, primarily in making back office systems for Japanese universities.  One such system is called a portal: it lets students check on, e.g., their class schedule from their cell phones.

The first feature of the portal, printed in bold red ink and obsessively tested, was called Emergency Notification.  Basically, we were worried about you attempting to check your class schedule while there was a wall of water coming to inundate your campus, so we built in the capability to take over all pages and say, essentially, “Forget about class.  Get to shelter now.”
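For readers who think in code, here is a minimal sketch of what a "take over all pages" switch can look like.  It is not the real portal's implementation: the `Portal` and `EmergencyState` names, the `/schedule` route, and the page text are all made up for illustration.  The only idea taken from the description above is that the emergency check runs before anything else, so no page can slip past it.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class EmergencyState:
    """Hypothetical switch for one university; not the real data model."""
    active: bool = False
    message: str = "Forget about class. Get to shelter now."


class Portal:
    """Toy portal: maps paths to page handlers (illustrative only)."""

    def __init__(self) -> None:
        self.emergency = EmergencyState()
        self.pages: Dict[str, Callable[[], str]] = {}

    def route(self, path: str, handler: Callable[[], str]) -> None:
        self.pages[path] = handler

    def render(self, path: str) -> str:
        # The emergency check comes before routing, so every page,
        # including unknown ones, is replaced while the switch is on.
        if self.emergency.active:
            return self.emergency.message
        handler = self.pages.get(path)
        return handler() if handler else "404 Not Found"


portal = Portal()
portal.route("/schedule", lambda: "Mon 09:00 Algorithms")
```

Flipping `portal.emergency.active` to `True` makes every call to `portal.render(...)` return the shelter message, whatever path was requested.  Putting the check in one place, ahead of routing, is what makes the feature easy to test obsessively.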

Many of our clients are in the general vicinity of Tokyo.  When Nagoya (again, same island but very far away) started shaking during the earthquake, here’s what happened:

  1. T-0 seconds: Oh dear, we’re shaking.
  2. T+5 seconds: Where was that earthquake?
  3. T+15 seconds: The government reports that we just had a magnitude 8.8 earthquake off the coast of East Japan.  Which clients of ours are implicated?
  4. T+30 seconds: Two or three engineers in the office start saying “I’m the senior engineer responsible for X, Y, and Z universities.”
  5. T+45 seconds: “I am unable to reach X University’s emergency contact on the phone.  Retrying.”  (Phones were inundated virtually instantly.)
  6. T+60 seconds: “I am unable to reach X University’s emergency contact on the phone.  I am declaring an emergency for X University.  I am now going to follow the X University Emergency Checklist.”
  7. T+90 seconds: “I have activated emergency systems for X University remotely.  Confirm activation of emergency systems.”
  8. T+95 seconds: (second most senior engineer) “I confirm activation of emergency systems for X University.”
  9. T+120 seconds: (manager of group)  “Confirming emergency system activations, sound off: X University.”  “Systems activated.”  “Confirmed systems activated.”  “Y University.”  “Systems activated.”  “Confirmed systems activated.” …
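The pattern in that timeline (one engineer activates, a second engineer confirms, the manager calls the roll) can be sketched roughly as follows.  This is my own illustration of the procedure, not the consultancy's tooling: the `Activation` class, the `sound_off` function, and the engineer labels are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Activation:
    """Tracks emergency-system activation for one university (hypothetical)."""
    university: str
    activated_by: Optional[str] = None
    confirmed_by: Optional[str] = None

    def activate(self, engineer: str) -> None:
        self.activated_by = engineer

    def confirm(self, engineer: str) -> None:
        # Confirmation only counts if it comes from a different person
        # than the one who performed the activation.
        if self.activated_by is None:
            raise ValueError("nothing to confirm: systems not yet activated")
        if engineer == self.activated_by:
            raise ValueError("confirmation must come from a second engineer")
        self.confirmed_by = engineer

    @property
    def status(self) -> str:
        if self.activated_by is None:
            return "not activated"
        if self.confirmed_by is None:
            return "activated, awaiting confirmation"
        return "confirmed"


def sound_off(activations: List[Activation]) -> List[str]:
    """Manager's roll call: list universities that are not yet confirmed."""
    return [a.university for a in activations if a.status != "confirmed"]
```

An empty list from `sound_off` is the manager's signal that every client has been both activated and independently confirmed.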

While this is happening, it’s somebody else’s job to confirm the safety of the colleagues of these engineers, at least a few of whom are out of the office at client sites.  Their checklist helpfully notes that confirmation of the safety of engineers should be done by visual inspection first, because they’ll be really effing busy for the next few minutes.

So that’s the view of the disaster from the perspective of a wee little office several hundred miles away, responsible for a system which, in the scheme of things, was of very, very minor importance.

Scenes like this started playing out up and down Japan within, literally, seconds of the quake.

When the mall I was in started shaking, I at first thought it was because it was a windy day (Japanese buildings are designed to shake because the alternative is to be designed to fail catastrophically in the event of an earthquake), until I looked out the window and saw the train station.  A train pulling out of the station had hit the emergency brakes and was stopped within 20 feet: again, just someone doing what he was trained for.  A few seconds after the train stopped, after reporting his status, he would have gotten on the loudspeakers and apologized for inconvenience caused by the earthquake.  (Seriously, it’s in the manual.)

Everything Pretty Much Worked

Let’s talk about trains for a second.  All Japanese trains survived the tsunami without incident. [Edited to add: Initial reports that trains had been washed away by the tsunami were incorrect.  Contact was initially lost with 5 trains, but all passengers and crew were rescued.  See here, in Japanese.]  All of the rest, including ones travelling in excess of 150 miles per hour, made immediate emergency stops and no one died.  There were no derailments.  There were no collisions.  There was no loss of control.  The story of Japanese railways during the earthquake and tsunami is the story of an unceasing drumbeat of everything going right.

This was largely the story up and down Honshu.  Planes stayed in the sky.  Buildings stayed standing.  Civil order continued uninterrupted.

On the train line between Ogaki and Nagoya, one passes dozens of factories, including notably a brewery which holds beer in pressure tanks painted to look like gigantic beer bottles.  Many of these factories have large amounts of extraordinarily dangerous chemicals maintained, at all times, in conditions which would resemble fuel-air bombs if they had a trigger attached to them.  None of them blew up.  There was a handful of very photogenic failures out east, which is an occupational hazard of dealing with large quantities of things that have a strongly adversarial response to materials like oxygen, water, and chemists.  We’re not going to stop doing that because modern civilization and its luxuries like cars, medicine, and food are dependent on industry.

The overwhelming response of Japanese engineering to the challenge posed by an earthquake larger than any in the last century was to function exactly as designed.  Millions of people are alive right now because the system worked and the system worked and the system worked.

That this happened was, I say with no hint of exaggeration, one of the triumphs of human civilization.  Every engineer in this country should be walking a little taller this week.  We can’t say that too loudly, because it would be inappropriate with folks still missing and many families in mourning, but it doesn’t make it any less true.

Let’s Talk Nukes

There is currently a lot of panicked reporting about the problems with two of Tokyo Electric’s nuclear power generation plants in Fukushima.  Although few people would admit this out loud, I think it would be fair to include these in the count of systems which functioned exactly as designed.  For more detail on this from someone who knows nuclear power generation, which rules out him being a reporter, see here.

  • The instant response — scramming the reactors — happened exactly as planned and, instantly, removed the Apocalyptic Nightmare Scenarios from the table.
  • There were some failures of important systems, mostly related to cooling the reactor cores to prevent a meltdown.  To be clear, a meltdown is not an Apocalyptic Nightmare Scenario: the entire plant is designed such that when everything else fails, the worst thing that happens is somebody gets a cleanup bill with a whole lot of zeroes in it.
  • Failure of the systems is contemplated in their design, which is why there are so many redundant ones.  You won’t even hear about most of the failures up and down the country because a) they weren’t nuclear related (a keyword which scares the heck out of some people) and b) redundant systems caught them.
  • The tremendous public unease over nuclear power shouldn’t be allowed to overpower the conclusion: nuclear energy, in all the years leading to the crisis and continuing during it, is absurdly safe.  Remember the talk about the trains and how they did exactly what they were supposed to do within seconds?  Several hundred people still drowned on the trains.  That is a tragedy, but every person connected with the design and operation of the railways should be justifiably proud that that was the worst thing that happened.  At present, in terms of radiation risk, the tsunami appears to be a wash: on the one hand there’s a near nuclear meltdown, on the other hand the tsunami disrupted something really dangerous: international flights.  (One does not ordinarily associate flying commercial airlines with elevated radiation risks.  Then again, one doesn’t normally associate eating bananas with it, either.  When you hear news reports of people exposed to radiation, keep in mind, at the moment we’re talking a level of severity somewhere between “ate a banana” and “carries a Delta Skymiles platinum membership card”.)

What You Can Do

Far and away the worst thing that happened in the earthquake was that a lot of people drowned.  Your thoughts and prayers for them and their families are appreciated.  This is terrible, and we’ll learn ways to better avoid it in the future, but considering the magnitude of the disaster we got off relatively lightly.  (An earlier draft of this post said “lucky.”  I have since reworded because, honestly, screw luck.  Luck had absolutely nothing to do with it.  Decades of good engineering, planning, and following the bloody checklist are why this was a serious disaster and not a nation-ending catastrophe like it would have been in many, many other places.)

Japan’s economy just got a serious monkey wrench thrown into it, but it will be back up to speed fairly quickly.  (By comparison, it was probably more hurt by either the Lehman Shock or the decision to invent a safety crisis to help out the US auto industry.  By the way, wondering what you can do for Japan?  Take whatever you’re saying currently about “We’re all Japanese”, hold onto it for a few years, and copy it into a strongly worded letter to your local Congresscritter the next time nativism runs rampant.)

A few friends of mine have suggested coming to Japan to pitch in with the recovery efforts.  I appreciate your willingness to brave the radiological dangers of international travel on our behalf, but that plan has little upside to it: when you get here, you’re going to be a) illiterate b) unable to understand instructions and c) a productivity drag on people who are quite capable of dealing with this but will instead have to play Babysit The Foreigner.  If you’re feeling compassionate and want to do something for the sake of doing something, find a charity in your neighborhood.  Give it money.  Tell them you were motivated to by Japan’s current predicament.  You’ll be happy, Japan will recover quickly, and your local charity will appreciate your kindness.

On behalf of myself and the other folks in our community, thank you for your kindness and support.

[本投稿を日本語にすると思っておりますが、より早くできる方がいましたら、ご自由にどうぞ。翻訳を含めて二次的著作物を許可いたします。詳細はこちらまで

This post is released under a Creative Commons license.  I intend to translate it into Japanese over the next few days, but if you want to translate it or otherwise use it, please feel free.]

[Edit: Due to overwhelming volume and a poor signal-to-noise ratio, I am closing comments on this post, but I encourage you to blog about it if you feel strongly about something.]

Japanese Disaster Micro-Update

Apologies for not posting this earlier — I put notices on my business websites but forgot that a lot of folks know me solely through the blog:

  • I live in Gifu, which is quite far from the earthquake epicenter.  We got shaken up a bit, but no permanent damage was done.  We’re landlocked so, unless the mountains fall into the sea, tsunamis are not an issue for us.
  • The people I’m close to in Japan are all OK.
  • We really appreciate your expressions of concern and prayers.
  • If you are wondering “What can I do?”, every day is a good day for charity.  I recommend the Red Cross or your local favorite charity.  In particular, disaster relief charities will use money collected today to help the folks affected by the next major incident, and it is highly probable that they are less well-situated than Japan is — we’re probably as well-prepared as anybody could be.

Thanks as always.  We’ll pull through this, don’t worry.

Regards,

Patrick