Why we have DevOps at Prezi.com

DevOps has become a global movement. As such, it means many things to many people. At Prezi.com, DevOps has come to mean clear ownership of a component. Disclaimer: I work at Prezi.com, but my views are not necessarily those of my employer. Before I get into why DevOps has served Prezi so well, let’s start with a brief history lesson.

The classical product development model

In desktop software companies, product moves on a conveyor belt between homogeneous departments: from design to development, then QA. When the product is finished, it is placed in a box with a pretty user manual and shipped off to customers.

When web applications started to appear, the product development cycle remained essentially unchanged. Developers still wrote the application in isolation from the other parts of the company. Once the software was thoroughly tested by QA and ready to be released, instead of shipping it off in little boxes, the ops team took over. Their job was to create supporting infrastructure (databases, network configuration, etc) and install the app in the production environment.

Operations teams were typically well-equipped to handle hardware failures, but minor changes in the software ecosystem used by the application could result in strange and subtle errors which were often impossible to debug without a deep understanding of how the application worked. Whenever such an issue caused downtime, massive finger-pointing ensued between operations and development (sometimes QA got involved, completing the triangle).

No matter which department won the blame game, the customer always lost. In the case of Prezi, not being able to load a presentation could cost someone a promotion or credit for a university class. Not something people shrug off easily.

Owning software versus using a service

Users intuitively understand that they own desktop software, regardless of what the license agreement actually says. They purchased a box with a disk inside, and the ownership of a physical object in most people’s mind extends to its contents (just try telling someone they don’t have the right to lend a book they own to a friend because their license does not permit it).

With web applications, it’s not so clear-cut. Have you ever heard anyone claim they own Google Docs or Farmville or Dropbox? People are more likely to say they have an account with the service. The mental shift from “an object I own” to “a service I use” brings about changes in expectations.

If your 2 year-old car breaks down, you probably won’t blame the car salesman who sold it to you. Perhaps you’ll blame yourself for not being diligent enough in changing the oil. Maybe you’ll write it off as bad luck. A rental car is an entirely different story: you’re paying for the ability to drive from point A to point B. The fact that Hertz gave you a car for your money is an insignificant detail. For all you care, it could have been a magic carpet as long as it gets you where you’re going. If your rental car breaks down, on the other hand, even a Ferrari wasn’t worth the price.

Web applications, like any other service, are expected to work when users need them. Since most web applications (including Prezi) are not restricted to a single timezone, 24/7 uptime is a basic requirement, not a “feature”.

Success comes to those who measure it correctly

One of the reasons the old product development cycle was adopted by online software companies is that it seemed very efficient. Developers could concentrate on writing code, QA personnel tested, and sysops built and monitored infrastructure. There was no need to waste time with inter-department meetings or get interrupted by strange questions coming from people who didn’t even use the correct lingo.

A major downside of this approach is that in isolation each department lacks the necessary information to create a winning service. For example, let’s assume there is a spike in user traffic due to a press release touting some exciting new product feature. No one anticipated such interest and the hardware on which the database runs clearly can’t cope with the load. The operators see that the majority of the long-running queries seem to be search related. One way to keep the database afloat could be to temporarily disable the search feature on the website. Unfortunately, the development team had no time to contemplate this scenario, so they did not provide an easy way to turn it off. Who’s to blame when the site goes down?
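The “easy way to turn it off” usually takes the form of a feature kill switch. Here is a minimal sketch in JavaScript; all names are illustrative, and in a real system the flag would live in a shared config store that operators can flip without a deploy:

```javascript
// Illustrative kill-switch sketch (not Prezi’s actual code): the app checks
// a flag before running an expensive code path like search, so operators can
// shed load without redeploying.
var featureFlags = { search: true };   // in practice: a shared config service

function setFlag(name, enabled) {
  featureFlags[name] = enabled;
}

function handleSearch(query, runQuery) {
  if (!featureFlags.search) {
    // degrade gracefully instead of hammering the overloaded database
    return { status: 503, body: 'Search is temporarily disabled' };
  }
  return { status: 200, body: runQuery(query) };
}
```

With a switch like this in place, “disable search until the traffic spike passes” becomes an operational decision rather than an emergency code change.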

The big mismatch here is how operations, development and users measure success. A successful development project completes all requested features before the deadline. Operations is successful if the site is reachable at all times. Users, however, don’t care if all the development deadlines were kept. Nor do they care that Apache is running on all the webservers if it’s just the stupid application that’s not servicing requests. In their view, if the application doesn’t work as expected, the whole company failed.

DevOps combines the development and operations teams into one. If the service works, the team is successful. If not, it really doesn’t matter which part of the stack caused the downtime. The important thing is to get it running again as soon as possible. Merging the two teams is beneficial in preventing outages as well as coping with them. Operators who understand how the application works are better equipped to dimension the hardware, monitor the right alarms, etc. In turn, developers who have a good understanding of how the infrastructure running their code is put together write better code. It’s less likely they’ll make unrealistic assumptions about performance in their code. If a basic property of the underlying system like the average execution time of an SQL query changes with the number of users, informed developers can adapt their code (by adding an extra layer of caching, for example).
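The “extra layer of caching” can be as small as a wrapper that remembers recent results of a slow lookup for a limited time. A minimal sketch, with illustrative names (the clock is injectable only to make the behavior easy to demonstrate):

```javascript
// Wrap a slow function (an SQL query, say) in a cache with a time-to-live.
// Sketch only: a production cache would also bound its size and handle
// concurrent lookups of the same key.
function cached(fn, ttlMs, now) {
  now = now || Date.now;               // injectable clock, for testing
  var store = {};                      // key -> { value, expires }
  return function (key) {
    var hit = store[key];
    if (hit && hit.expires > now()) return hit.value;  // serve from cache
    var value = fn(key);                               // fall through to the slow path
    store[key] = { value: value, expires: now() + ttlMs };
    return value;
  };
}
```

The call site stays unchanged; the developer only decides how stale an answer is acceptable.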

What system ownership means at Prezi

So far, I haven’t mentioned much that was Prezi-specific. I work in the infrastructure team, responsible for developing the components which make up Prezi’s backend. Each of the components has multiple owners within the Prezi infrastructure team. One of the components I own is the conversion service. Any time a user inserts content into a Prezi which Flash cannot handle natively, the source media is converted to something Flash-compatible like a bitmap or an SWF file.

It’s possible to create presentations without the conversion service, but the failure of this component greatly reduces the productivity of our users. As a result, the owners of the conversion service take turns being on call. PagerDuty notifies the lucky team member when things go south.

How does PagerDuty know there’s a problem? We have several thousand Nagios checks defined for each system supporting the Prezi backend. We also mine every log file for clues about the health of any application running on our servers. We write custom checks by hand for things like Amazon SQS queue length and the number of EC2 instances.
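The decision logic behind such a hand-written check is simple: Nagios plugins signal status through their exit code (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN). A sketch in the spirit of the queue-length checks mentioned above; the thresholds are made up for illustration:

```javascript
// Nagios plugin exit codes
var OK = 0, WARNING = 1, CRITICAL = 2, UNKNOWN = 3;

// Map a measured queue length to a Nagios status. A real check would first
// fetch ApproximateNumberOfMessages from SQS, then call process.exit() with
// the code this function returns.
function checkQueueLength(length, warnAt, critAt) {
  if (typeof length !== 'number' || isNaN(length)) return UNKNOWN; // couldn’t measure
  if (length >= critAt) return CRITICAL;  // page someone
  if (length >= warnAt) return WARNING;   // worth a look during the day
  return OK;
}
```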

Being an owner of the conversion service means I wrote some of the alerts that might wake me up at night (and understand all of them). It means I constantly keep up with what gets committed into git and write reviews for other contributors’ code. It means I frequently think about how to make the service faster, smarter and more resistant to failure. It means I have a deep understanding of which parts of the system can fail, and what the associated symptoms are. If this sounds a little intimidating, well, initially it was. Thankfully, nobody has ever been decapitated for making a mistake (which happens to all of us occasionally).

There’s a common misunderstanding about DevOps. People often assume switching to DevOps means developers and operators must learn each other’s skillsets overnight. Nothing could be further from the truth. We sit next to each other because our skills complement each other. Each time I develop a backend service which requires an Apache configuration to run, I know somebody who can write those in their sleep is ready to help. As long as I take my ownership role seriously and understand why the final webserver config looks the way it does, I learn. Over time, my skillset grows (as does everyone else’s in the team). I know we would have built an inferior Prezi backend and used more calendar time if we weren’t doing DevOps, because in isolation each of us knows less than what’s required to build something this complex. We can make it work only by sharing what we know and creating an environment where everyone is encouraged to ask when in doubt.

It’s not just the infrastructure team that works this way. Prezi has no dedicated “operations” team: every team is responsible for deploying and operating the code they write. For example, the data mining team regularly tunes Linux kernel parameters when optimization of the Hadoop cluster calls for it. Front-end developers often drop by to discuss their ideas about how to further improve the existing backend components used by their code.

Along with the responsibility of ownership comes a great feeling when I see a speaker at TED making her point with a video embedded in her prezi: I know it’s there because I’m doing my job. That’s the satisfaction of knowing I truly make a difference!

Securely backing up your website to Google Docs

Hard drives fail: it’s a fact of life. The more frequently you back up your data, the safer you are. But making backups by hand is tedious. It’s difficult to find a truly safe place for your files, especially if they are confidential. This post shows working code for automatically encrypting and backing up your most important files or databases to Google Docs on a daily basis. Continue reading

Wrapper functions: manage resources without going insane

Every once in a while, you’ll stumble upon a technique so elegant, it makes you wonder how you got by without it. Wrapper functions, the subject of this post, definitely belong in this category. As the title states, I’ve come to believe they’re the only sane way of managing resources. Best of all, they work in almost any programming language.
Continue reading

Roman numerals in JavaScript

Organizers of the excellent ViennaJS meetup suggested we write a function to convert integers to Roman numerals and vice versa. It’s a nice logic puzzle which can be surprisingly useful. Try my solution here:

[Interactive converter: Arabic ↔ Roman]
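One possible approach, sketched here (not necessarily the solution from the post): greedily consume the largest representable value, with the subtractive pairs (CM, XC, IV, ...) listed explicitly.

```javascript
// Value/symbol pairs in descending order, subtractive forms included.
var PAIRS = [
  [1000, 'M'], [900, 'CM'], [500, 'D'], [400, 'CD'],
  [100, 'C'], [90, 'XC'], [50, 'L'], [40, 'XL'],
  [10, 'X'], [9, 'IX'], [5, 'V'], [4, 'IV'], [1, 'I']
];

function toRoman(n) {
  var out = '';
  PAIRS.forEach(function (p) {
    while (n >= p[0]) { out += p[1]; n -= p[0]; }  // greedy subtraction
  });
  return out;
}

function fromRoman(s) {
  var n = 0;
  PAIRS.forEach(function (p) {
    while (s.indexOf(p[1]) === 0) { n += p[0]; s = s.slice(p[1].length); }
  });
  return n;
}
```

For example, `toRoman(1987)` yields `'MCMLXXXVII'`, and `fromRoman` reverses it.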

Read on for how it works!
Continue reading

Bidirectional browser-webserver RPC with Thrift

Most people don’t know that you can use Apache Thrift from JavaScript to make RPC calls to a webserver. My proof of concept goes even further: it lets the webserver make RPC requests which are run in the browser.

With several JavaScript clients, you could use the webserver as a hub which routes RPC requests and responses among the clients. Best of all, since Thrift interface definitions make it easy to automatically distinguish well-formed messages from most types of attacks, it becomes not only possible but also safe to make RPC calls between browsers running on different computers. Imagine real-time web applications like chat rooms and games with no custom server-side code.

It should be noted that the current code does not yet connect several browsers to each other. The “hub” functionality is missing. It does all the heavy lifting, though: the webserver can make RPC requests which are then run in the browser. Needless to say, the browser can also be the client. All the standard features of Thrift work: services, custom exceptions, interface inheritance, etc. Another disclaimer: by proof of concept I mean code that just works. You will find hard-wired file paths, no Makefile, outdated comments, etc. With time I plan to clean this up. For now, take it for what it’s worth: a demonstration that Thrift can be used in new and exciting ways in the browser.

So how does it work?

First of all, there needs to be some way to push requests from the server to the client. Ideally, we could use WebSockets, but since most of the world has not caught up to HTML5 yet, I chose socket.io. Actually, socket.io-erlang, since the server code for the project is written in Erlang. Although Apache Thrift does not support socket.io clients and servers out of the box, adding support was relatively easy.

One of the nice things about Thrift is the clear separation between so-called transports, protocols and messages. A transport is responsible for reading and writing bytes; Thrift doesn’t care if that happens over the network, on a disk, or by carrier pigeon. Protocols turn messages into bytes. Protocols are transport-agnostic, so in theory any protocol could be used in conjunction with socket.io. Due to the difficulty of handling binary data in JavaScript, Thrift has a JSON-based protocol used by the JavaScript client. Messages can be RPC requests, responses or exceptions. The data types used as parameters or return values are described by a Thrift IDL file. An example is tutorial.thrift, which is used by the client-server calculator application (the standard Thrift tutorial, which is implemented in all the languages Thrift supports). This is also the application showcased in this proof-of-concept demo.
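To make the transport/protocol split concrete, here is a toy transport: it only knows how to buffer whatever bytes a protocol writes and ship the whole payload on flush via a pluggable send function (a socket.io emit, in this post’s setup). This mimics the shape of the interface; it is not the real thrift.js code.

```javascript
// Toy write-buffering transport. The protocol calls write() with serialized
// data and flush() when a complete message is ready; the transport neither
// knows nor cares what the bytes mean.
function BufferedTransport(send) {
  this.buffer = [];
  this.send = send;                     // e.g. function (msg) { socket.emit('thrift', msg); }
}
BufferedTransport.prototype.write = function (chunk) {
  this.buffer.push(chunk);
};
BufferedTransport.prototype.flush = function () {
  this.send(this.buffer.join(''));      // one message per flush
  this.buffer = [];
};
```

Swapping socket.io for any other delivery mechanism means swapping only the `send` function; the protocol layer above never changes.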

Once the Erlang webserver and the JavaScript client could send Thrift messages to each other over the socket.io transport using the Thrift JSON protocol, I thought I was done. I still had a long way to go.

Thrift messages become objects in object-oriented languages through code generation. There are two code generators for JavaScript: one for the browser, and one for nodejs. Only the nodejs version produces code which can service RPC requests. The problem is that the nodejs code is not AMD-loader friendly. In order to use it with RequireJS, I had to introduce a rather convoluted dance with a fake require object.
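A sketch of the kind of shim involved (not the actual code from the project): evaluate a nodejs-style module body with a fake require in scope, so its `require(...)` calls resolve to objects we registered ourselves instead of touching the filesystem.

```javascript
// Load CommonJS-style source in an environment that has no real require().
// The stubs object maps module names to pre-built replacements.
function loadCommonJs(source, stubs) {
  var module = { exports: {} };
  function fakeRequire(name) {
    if (!(name in stubs)) throw new Error('no stub registered for ' + name);
    return stubs[name];
  }
  // compile the module body with require/module/exports as parameters
  new Function('require', 'module', 'exports', source)(fakeRequire, module, module.exports);
  return module.exports;
}
```

The real dance with the generated Thrift code is messier than this, but the principle is the same: intercept the module system the code expects and hand it what it asks for.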

But all is well that ends well: after many iterations of trial and error, the entire calculator demo works for me. Give it a spin!

My future plans are to implement the “hub” functionality described at the beginning of the post to allow JavaScript-to-JavaScript calls. I’m also considering switching to SockJS, as it is better supported in Erlang and has a simpler deployment story.