Why we have DevOps at Prezi.com

DevOps has become a global movement. As such, it means many things to many people. At Prezi.com, DevOps has come to mean clear ownership of a component. Disclaimer: I work at Prezi.com, but my views are not necessarily those of my employer. Before I get into why DevOps has served Prezi so well, let’s start with a brief history lesson.

The classical product development model

In desktop software companies, product moves on a conveyor belt between homogeneous departments: from design to development, then QA. When the product is finished, it is placed in a box with a pretty user manual and shipped off to customers.

When web applications started to appear, the product development cycle remained essentially unchanged. Developers still wrote the application in isolation from the other parts of the company. Once the software was thoroughly tested by QA and ready to be released, the ops team took over instead of the product being shipped off in little boxes. Their job was to create supporting infrastructure (databases, network configuration, etc.) and install the app in the production environment.

Operations teams were typically well-equipped to handle hardware failures, but minor changes in the software ecosystem used by the application could result in strange and subtle errors which were often impossible to debug without a deep understanding of how the application worked. Whenever such an issue caused downtime, massive finger-pointing ensued between operations and development (sometimes QA got involved, completing the triangle).

No matter which department won the blame game, the customer always lost. In the case of Prezi, not being able to load a presentation could cost someone a promotion or credit for a university class. Not something people shrug off easily.

Owning software versus using a service

Users intuitively understand that they own desktop software, regardless of what the license agreement actually says. They purchased a box with a disk inside, and in most people’s minds the ownership of a physical object extends to its contents (just try telling someone they don’t have the right to lend a book they own to a friend because their license does not permit it).

With web applications, it’s not so clear-cut. Have you ever heard anyone claim they own Google Docs or Farmville or Dropbox? People are more likely to say they have an account with the service. The mental shift from “an object I own” to “a service I use” brings about changes in expectations.

If your two-year-old car breaks down, you probably won’t blame the car salesman who sold it to you. Perhaps you’ll blame yourself for not being diligent enough in changing the oil. Maybe you’ll write it off as bad luck. A rental car is an entirely different story: you’re paying for the ability to drive from point A to point B. The fact that Hertz gave you a car for your money is an insignificant detail. For all you care, it could have been a magic carpet as long as it gets you where you’re going. Conversely, if your rental car breaks down, even a Ferrari was not worth the price.

Web applications, like any other service, are expected to work when users need them. Since most web applications (including Prezi) are not restricted to a single timezone, this means 24/7 uptime is a basic requirement, not a “feature”.

Success comes to those who measure it correctly

One of the reasons the old product development cycle was adopted by online software companies is that it seemed very efficient. Developers could concentrate on writing code. QA personnel tested, and sysops built and monitored infrastructure. There was no need to waste time with inter-department meetings or get interrupted by strange questions coming from people who didn’t even use the correct lingo.

A major downside of this approach is that in isolation each department lacks the information necessary to create a winning service. For example, let’s assume there is a spike in user traffic due to a press release touting some exciting new product feature. No one anticipated such interest and the hardware on which the database runs clearly can’t cope with the load. The operators see that the majority of the long-running queries seem to be search related. One way to keep the database afloat could be to temporarily disable the search feature on the website. Unfortunately, the development team had no time to contemplate this scenario, so they did not provide an easy way to turn it off. Who’s to blame when the site goes down?
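To make the missing off switch concrete, here is a minimal sketch of a runtime feature flag guarding search. It is an illustration only, not Prezi’s code: FeatureFlags, search, and the in-memory document list are hypothetical names, and a real deployment would keep the flag in some shared configuration store rather than in process memory.

```python
class FeatureFlags:
    """Hypothetical in-memory flag store; operators flip flags at runtime."""

    def __init__(self, defaults):
        self._flags = dict(defaults)

    def enable(self, name):
        self._flags[name] = True

    def disable(self, name):
        self._flags[name] = False

    def is_enabled(self, name):
        # Unknown flags are treated as disabled.
        return self._flags.get(name, False)


def search(query, documents, flags):
    """Return matching documents, or a degraded response when search is off."""
    if not flags.is_enabled("search"):
        # Skip the expensive query entirely so the database is left alone.
        return {"available": False, "results": []}
    matches = [doc for doc in documents if query.lower() in doc.lower()]
    return {"available": True, "results": matches}
```

With something like this in place, the operator on duty can call flags.disable("search") during the traffic spike and the rest of the site keeps working, at the cost of one temporarily missing feature.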

The big mismatch here is how operations, development and users measure success. A successful development project completes all requested features before the deadline. Operations is successful if the site is reachable at all times. Users, however, don’t care if all the development deadlines were kept. Nor do they care if Apache is running on all the webservers and it’s just the stupid application that’s not servicing requests. In their view, if the application doesn’t work as expected, the whole company failed.

DevOps combines the development and operations teams into one. If the service works, the team is successful. If not, it really doesn’t matter which part of the stack caused the downtime. The important thing is to get it running again as soon as possible. Merging the two teams helps in preventing outages as well as in coping with them. Operators who understand how the application works are better equipped to dimension the hardware, monitor the right alarms, etc. In turn, developers who have a good understanding of how the infrastructure running their code is put together write better code. They are less likely to make unrealistic assumptions about performance. If a basic property of the underlying system, like the average execution time of an SQL query, changes with the number of users, informed developers can adapt their code (by adding an extra layer of caching, for example).
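As an illustration of the caching idea, here is a minimal sketch of a time-based cache placed in front of a slow query. The names (TTLCache, get_or_compute) are hypothetical and this is not how any Prezi component is implemented; it only assumes that slightly stale results are acceptable for a few seconds.

```python
import time


class TTLCache:
    """Tiny time-to-live cache: entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries = {}   # key -> (expiry_time, value)

    def get_or_compute(self, key, compute):
        """Return the cached value for key, calling compute() only on a miss."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = compute()  # the slow part, e.g. an SQL query
        self._entries[key] = (now + self._ttl, value)
        return value
```

A caller would wrap the expensive lookup, for example cache.get_or_compute(("user", user_id), lambda: load_user(user_id)), where load_user stands in for whatever function hits the database. The trade-off is freshness: the longer the time-to-live, the less load on the database and the staler the data users may see.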

What system ownership means at Prezi

So far, I haven’t mentioned much that is Prezi-specific. I work in the infrastructure team, responsible for developing the components which make up Prezi’s backend. Each of the components has multiple owners within the Prezi infrastructure team. One of the components I own is the conversion service. Any time a user inserts content into a prezi which Flash cannot handle natively, the source media is converted to something Flash-compatible like a bitmap or an SWF file.

It’s possible to create presentations without the conversion service, but the failure of this component greatly reduces the productivity of our users. As a result, the owners of the conversion service take turns being on call. PagerDuty notifies the lucky team member when things go south.

How does PagerDuty know there’s a problem? We have several thousand Nagios checks defined for each system supporting the Prezi backend. We also mine every log file for clues about the health of any application running on our servers. We write manual checks for things like Amazon SQS queue length and the number of EC2 instances.
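To give a feel for what such a check looks like, here is a minimal sketch of a queue-length check that follows the standard Nagios plugin convention of exit codes 0 (OK), 1 (WARNING), 2 (CRITICAL) and 3 (UNKNOWN). It is not one of our actual checks: check_queue_length, the thresholds, and the get_length callable are all hypothetical, and fetching the real number from SQS is deliberately left to the caller.

```python
# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_queue_length(get_length, warn, crit):
    """Return (exit_code, status_line) for a queue-length check.

    get_length is a caller-supplied function returning the current number
    of messages waiting in the queue (hypothetical; it might wrap an SQS
    API call in a real check).
    """
    try:
        length = get_length()
    except Exception as exc:
        # If the queue cannot be read, report UNKNOWN rather than guessing.
        return UNKNOWN, "QUEUE UNKNOWN - could not read queue length: %s" % exc
    if length >= crit:
        return CRITICAL, "QUEUE CRITICAL - %d messages waiting" % length
    if length >= warn:
        return WARNING, "QUEUE WARNING - %d messages waiting" % length
    return OK, "QUEUE OK - %d messages waiting" % length
```

A script wrapping this function would print the status line and pass the code to sys.exit(), which is how Nagios learns the result. Whether a given state actually pages someone is then a matter of notification configuration, not of the check itself.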

Being an owner of the conversion service means I wrote some of the alerts that might wake me up at night (and understand all of them). It means I constantly keep up with what gets committed into git and write reviews for other contributors’ code. It means I frequently think about how to make the service faster, smarter and more resistant to failure. It means I have a deep understanding of which parts of the system can fail, and what the associated symptoms are. If this sounds a little intimidating, well, initially it was. Thankfully nobody has ever been decapitated for making a mistake (which happens to all of us occasionally).

There is a common misunderstanding about DevOps. People often assume switching to DevOps means developers and operators must learn each other’s skillsets overnight. Nothing could be further from the truth. We sit next to each other because our skills complement each other. Each time I develop a backend service which requires an Apache configuration to run, I know somebody who can write those in their sleep is ready to help. As long as I take my ownership role seriously and understand why the final webserver config looks the way it does, I learn. Over time, my skillset grows (as does everyone else’s in the team). I know we would have built an inferior Prezi backend and used more calendar time if we weren’t doing DevOps, because in isolation each of us knows less than what’s required to build something this complex. We can make it work only by sharing what we know and creating an environment where everyone is encouraged to ask when in doubt.

It’s not just the infrastructure team that works this way. Prezi has no dedicated “operations” team: every team is responsible for deploying and operating the code they write. For example, the data mining team regularly tunes Linux kernel parameters when optimization of the Hadoop cluster calls for it. Front-end developers often drop by to discuss their ideas about how to further improve the existing backend components used by their code.

Along with the responsibility of ownership comes a great feeling when I see a speaker at TED making her point with a video embedded in her prezi: I know it’s there because I’m doing my job. That’s the satisfaction of knowing I truly make a difference!