Jack and I are both long-time software guys who now spend somewhat less of our time thinking about what to build, and somewhat more about how to keep a large number of systems running well. The emergent DevOps culture of the last few years has made it easy for people like us to move between these worlds. Are they really separate worlds anymore? In 2009 John Allspaw and Paul Hammond, heads at the time of ops and development, respectively, at Flickr, gave an influential talk at Velocity called “10+ deploys per day: Dev and Ops Cooperation at Flickr”. It’s about their deploy tool and IRC bots, sure, but it’s also about culture, and especially how to get dev and ops teams out of adversarial habits and into a productive state, with a combination of good manners and proper tooling. We have cribbed quite a bit from that family of techniques since 2010, but Wayfair has had an evolution of stack and culture that’s distinctive, and we’re going to try to give a close-up picture of it.
Let’s start with a brief overview of the customer-facing Wayfair stack over the years since our founding. We’re going to stick to the customer-facing systems, because although there is a lot more to Wayfair tech than what you see here, we’re afraid this will turn into a book if we don’t bound it somehow.
Late 2002: ASP + SQL Server
The middle years: Forget about the tech stack for a while, add VSS (Windows source control) relatively early. Hosting goes from Yahoo shopping, to Hostway shared, to Hostway dedicated, to Savvis managed, to a Savvis cage (now CenturyLink) in Waltham. Programmers focus on building a custom RDBMS-backed platform supporting the business up to ~$300M in sales…
2010: Add Solr for search because database-backed search was slow, small experiments with Hadoop, Netezza, Seattle data center. Yes, you read that right. 2010. Eight. Years. Later. That’s what happens when you’re serious about the whole lean-startup/focus-on-the-business/minimum-viable-stack thing. But now we’ve broken the seal.
2011-2012: Add PHP on FreeBSD, Memcached, MongoDB, Redis, MySQL, .NET and PHP services, Hadoop and Netezza powering site features, RabbitMQ, Zookeeper, Storm, Celery, Subversion everywhere and git experiments, ReviewBoard, Jenkins, deploy tool. Whoah! What happened there!? More on that below.
2013: Serve the last customer-facing ASP request, serve west-coast traffic from Seattle data center on a constant basis, start switching from FreeBSD to CentOS, put git into widespread use.
2015: Dublin data center.
SiteSpect has been our A/B testing framework from very early on, and at this point we have in-line, on-premise SiteSpect boxes between the perimeter and the php-fpm servers in all three data centers.
Here’s a boxes-and-arrows diagram of what’s in our data centers, with a ‘Waltham only’ label on the things that aren’t duplicated, except for disaster recovery purposes. Strongview is an email system that we’re not going to discuss in depth here, but it’s an important part of our system. Tableau is a dashboarding tool that is pointed at the OLAP-style SQL Servers and Vertica.
Why did we move from ASP to PHP, and why not to .NET? That’s one of the most fascinating things about the whole sequence. Classic ASP worked for us for years, enabling casual coders, as well as some hard-core programmers who weren’t very finicky about the elegance of the tools they were using, to be responsive to customer needs and competitor attacks. Of course there was a huge pile of spaghetti code: we’ll happily buy a round of drinks for the people with elegant architectures who went out of business in the meantime!
But by 2008 or so, classic ASP had started to look like an untenable platform, not least because we had to make special rules for certain sensitive files: we could only push them at low-traffic times of day, which was becoming a serious impediment to sound operations and developer productivity. Microsoft was pushing ASP.NET as an alternative, and on its face that is a natural progression. We gave it a try. We found it to be very unnatural for us, a near-total change in tech culture, in the opposite direction from where we wanted to go: expensive tools, laborious build process, no meaningful improvement in the complexity of calling library code from scripts, etc., etc. We eventually found our way to PHP, which, like ASP, allowed web application developers to do keep-it-simple-stupid problem solving, while letting us rationalize caching and move our code deployment and configuration management into a good place. In the early days of Wayfair, when there was not even version control, coders would ftp ASP scripts to production. That’s a simple model with a fire-and-forget feel that application developers find very pleasant. Something goes wrong? Redeploy the old version, work up a fix, and fire again, with no server lifecycle management to worry about. It is a lot easier to write a functional tool to deploy code when you don’t have to do a lot of server lifecycle management, as you do with Java, .NET, and in fact most platforms. So we got that working for PHP on FreeBSD, but soon applied it to ASP, Solr, Python and .NET services, and SQL Server stored procedures or ‘sprocs’. Obviously, by that point we had had to figure out server lifecycles after all, but it’s hard to overstate the importance of the ease of getting to a simple system that worked. The operational aspects of php-fpm were a great help in that area.
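The spirit of that fire-and-forget model survives in modern deploy tools as an atomic symlink swap. Here is a minimal sketch of the idea in Python (this is an illustration, not our actual deploy tool): build the new release in its own directory, then atomically repoint a `current` link at it.

```python
import os
import tempfile

def deploy(release_dir, current_link):
    """Atomically repoint the 'current' symlink at a new release.

    In-flight requests keep serving the old tree; if something goes
    wrong, rollback is just another call with the previous release_dir.
    """
    tmp_link = current_link + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(release_dir, tmp_link)
    os.replace(tmp_link, current_link)  # rename(2) is atomic on POSIX

# Demo in a scratch directory
root = tempfile.mkdtemp()
for v in ("v1", "v2"):
    os.makedirs(os.path.join(root, v))
link = os.path.join(root, "current")
deploy(os.path.join(root, "v1"), link)
deploy(os.path.join(root, "v2"), link)
```

The web server only ever sees a complete tree on either side of the swap, which is what makes “redeploy the old version” a one-step operation.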
The core of Wayfair application development is what it has always been: pushing code and data structures independently, and pushing small pieces of code independently from the rest of the code base. It’s just that we’re now doing it on a gradually expanding farm of more than a thousand servers, that spans three data centers in Massachusetts, Washington state, and Ireland.
It’s funny. Microservices are all the rage right now, and I was speaking with a microservices luminary a couple of weeks ago. I described how we deploy PHP files to our services layer, either by themselves, or in combination with complementary front-end code, data structures, etc., and theorized that as soon as I had a glorified ftp of all that working in 2011, I had a microservices architecture. He said something like, “Well, if you can deploy one service independently of the others, I guess that’s right.” Still, I wouldn’t actually call what we have a microservices architecture, or an SOA, even though we have quite a few services now. On the other hand, there’s too much going on in that diagram for it to be called a simple RDBMS-backed web site. So what is it? When I need a soundbite on the stack, I usually say, “PHP + databases and continuous deployment of the broadly Facebook/Etsy type. With couches.” So that’s a thing now.
Let’s dig in on continuous deployment, and our deploy tool. Here’s a chart of all the deploys for the last week, one bar per day, screenshot from our engineering blog’s chrome:
Between 110 and 210 per day, Monday to Friday, stepping down through the week, and then a handful of fixes on the weekend. What do those numbers really mean in the life of a developer? There’s actually some aggregation behind the numbers in this simple histogram. We group individual git changesets into batches, and then test them and deploy them, with zero downtime. The metric in the histogram is changesets, not ‘deploys’. Individual changesets can be deployed one by one, but there’s usually so much going on that the batching helps a lot. Database changes are deployed through the same tool, although never batched. The implementation of what ‘deploy’ means is very different for a php/css/js file on the one side, and a sproc on the other, but the developer’s interface to it is identical. Most deploys are pretty simple, but once in a while, to do the zero downtime thing for a big change, a developer might have to make a plan to do a few deploys, along the lines of: (1) change database adding new structures, (2) deploy backwards-compatible code, (3) make sure we really won’t have to go back, (4) clean up tech debt, (5) remove backwards-compatible database constructs. From the point of view of engineering management, the important thing is to allow the development team to go about their business with occasional check-ins and assistance from DBAs, rather than a gating function.
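As a toy illustration of that multi-deploy expand/contract sequence, here is the pattern with SQLite standing in for SQL Server (the table and column names are invented for the example):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, price_cents INTEGER)")
con.execute("INSERT INTO products VALUES (1, 4999)")

# Deploy 1: change the database, adding the new structure alongside the old.
con.execute("ALTER TABLE products ADD COLUMN price_minor_units INTEGER")

# Deploy 2: backwards-compatible code dual-writes both columns, while a
# backfill copies the existing rows.
con.execute("UPDATE products SET price_minor_units = price_cents")

# Deploys 3-5: once we're sure we won't roll back, code reads only the new
# column, and a final deploy drops the old one.
row = con.execute("SELECT price_minor_units FROM products WHERE id = 1").fetchone()
```

At every step both the previous and the next version of the code can run against the current schema, which is what makes each individual deploy a zero-downtime event.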
Memcached and Redis are half-way house caches and storage for simple and complex data structures, respectively, but what about MongoDB and MySQL? Great question. In 2010 we launched a brand new home-goods flash-sale business called Jossandmain.com. We outsourced development at first, and the business was a big success. We went live with an in-house version a year later, in November 2011. Working with key-value stores that have sophisticated blob-o-document storage-and-retrieval capabilities has been a fun thing for developers for a while now. It had the freshness of new things to us in 2011. There were no helpful DAOs in our pile of RDBMS-backed spaghetti code at the time, so the devs were in the classic mode of having to think about storage way too often. Working with medium-sized data structures (a complex product, an ‘event’, etc.) that we could quickly stash in a highly available data store felt like a big productivity gain for the four-person team that built that site in a few months. So why didn’t we switch the whole kit and caboodle over to this productivity-enhancing stack? First of all, we’re not twitchy that way. But secondly, and more importantly, what sped up new feature development had an irritating tendency to slow down analysis. And those document/k-v databases definitely slow you down for analysis, unless you have a large number of analysts with exotic ETL and programming skills. We love how MongoDB has worked out for our flash-sale site, but as we extrapolated the idea of adopting it across the sites that use our big catalogue, we foresaw quagmire. By 2011, we were a large enough business that a big hit to analyst productivity was much worse than a small cramp in the developers’ style.
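To illustrate why that felt so productive: a flash-sale ‘event’ can live as one self-contained document, written and read in a single operation, with no object-relational mapping in between. The document shape below is invented for the example, and a plain dict stands in for the MongoDB collection (with pymongo this would be `insert_one`/`find_one`):

```python
import json

# A flash-sale 'event' as one self-contained document, roughly the kind
# of thing we could stash in MongoDB with a single write.
event = {
    "_id": "spring-rugs-2011",
    "starts": "2011-11-01T09:00:00Z",
    "products": [
        {"sku": "RUG-001", "price_cents": 12999, "qty": 40},
        {"sku": "RUG-002", "price_cents": 8999, "qty": 15},
    ],
}

# Stand-in for the collection: one key, one serialized document.
store = {}
store[event["_id"]] = json.dumps(event)

# Reading it back gives the whole nested structure in one round trip.
fetched = json.loads(store["spring-rugs-2011"])
```

The flip side, as noted above, is that a pile of nested blobs like this is much harder to point analysts and SQL tools at than a set of normalized tables.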
Around the same time, we began to experiment with moving some data that had been in SQL Server into MySQL and MySQL Cluster. The idea was to cut licensing cost and remove some of the cruft that had accumulated in our sproc layer. We have since backed off that idea, because after a little experimentation it began to seem like a fool’s errand. We would have been moving to a database with worse join-algorithm implementations and a more limited sproc language, which in practice would have meant porting all our sprocs to application code, a huge exercise with no obvious benefit to our customers. Since the sprocs are already part of the deployment system, the only compensation besides licensing cost would have been increased uniformity of production operations, which would have been standardized on Linux, but in the end we did not like that trade-off.
Wow! Stored procedures along with application code, colo instead of cloud. We’re really checking all the boxes for Luddite internet companies, aren’t we!? I can’t tell you how many times I’ve gotten a gobsmacked look and a load of snark from punks at meetups who basically ask me how we can still be in business with those choices.
Let’s take these questions one at a time. First, the sprocs. When we say sproc, of course, we mean any kind of code that lives in the DBMS. In SQL Server, these can be stored procedures, triggers, or functions. We also have .NET assemblies, which you can call like a function from inside T-SQL. Who among us programmers does not have a visceral horror of these things? I know I do. The coding languages (T-SQL, PL/SQL and their ilk) are unpleasant to read and write, and in many shops the deployment procedure can be usefully compared to throwing a Molotov cocktail over a barbed-wire fence, to where an angry platoon of DBAs will attempt to minimize the damage it might do. Not that we don’t have deployment problems with sprocs once in a while, but they’re deploy-tool-enabled pieces of code like anything else, and the programmers are empowered to push them.
Secondly, the cloud. If we were starting Wayfair today, would we start it on AWS or GCP? Probably. But Wayfair grew to be a large business before the cloud was a thing. Our traffic can be spiky, but not as bad as sports sites or the internet as a whole. We need to do some planning to turn servers on ahead of anticipated growth, particularly around the holiday season, but it’s typically an early month of the new year when our average traffic is above the peak for the holidays of the previous year, so we don’t think we’re missing a big cost-control opportunity there. Cloud operations certainly speed up one’s ability to turn new things on quickly, but the large-scale operations that make that economical typically have to assign a team to write code that turns things *off*. One way or the other, nobody avoids spending some mindshare to nanny the servers, unless they don’t care about the unit economics of their business. In early startup mode, that’s often fine. Where we are? Meh. It’s a problem, among many others, that we throw smart people at. We think our team is pretty efficient, and we know they’re good at what they do. What is the Wayfair ‘cloud’, which is to say the thing that allows our developers to have the servers they need, more or less when they need them? It looks something like this:
We’re afraid of vendor lock-in, of course, with some of the hardware, which we mostly buy and don’t rent:
But the gentleman on the right makes sure we get good deals.
That’s it for now. Another day, we’ll dig in on the back end for all this.
I wrote a post last year on consistent hashing for Redis and Memcached with ketama: http://engineering.wayfair.com/consistent-hashing-with-memcached-or-redis-and-a-patch-to-libketama/. We’ve evolved our system a lot since then, and I gave a talk about the latest developments at Facebook’s excellent Data@Scale Boston conference in November: https://www.youtube.com/watch?v=oLjryfUZPXU. We have some updates to both design and code that we’re ready to share.
To recap the talk: at any given point over the last four years, we have had what I’d call a minimum viable caching system. The stages were:
- Stand up a master-slave Memcached pair.
- Add sharded Redis, each shard a master-slave pair, with loosely Pinstagram-style persistence, consistent hashing based on fully distributed ketama clients, and Zookeeper to notify clients of configuration changes.
- Replace the master-slave Memcached pair with Wayfair-ketamafied Memcached: no master-slave pairs, just ketama failover, also managed by Zookeeper.
- Put Twemproxy in front of Memcached, with our Wayfair-ketamafied hashing hacked into it. The ketama code moves from the clients, such as PHP scripts and Python services, to the proxy. The two systems, one with configuration fully distributed, one proxy-based, maintain interoperability, and a few fully distributed clients remain alive to this day.
- Add Redis configuration improvements, especially 2 simultaneous hash rings for transitional states during cluster expansion.
- Switch all Redis keys to ‘Database 0’.
- Put Wayfairized Twemproxy in front of Redis.
- Stand up a second Redis cluster in every data center, with essentially the same configuration as Memcached, where there’s no slave for each shard, and every key can be lazily populated from an interactive (non-batch) source.
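To make the ketama mechanics in those stages concrete, here is a toy ring in Python. Our real implementation is the patched C libketama linked below; the md5 hashing and the 160 points per server follow ketama's design, and the two-ring arrangement mirrors the transitional-state trick described above:

```python
import bisect
import hashlib

def _hash(key):
    # ketama derives ring positions from md5; we take the first 4 bytes
    return int(hashlib.md5(key.encode()).hexdigest()[:8], 16)

class Ring:
    """A ketama-style consistent hash ring: each server owns many points
    on a circle, and a key belongs to the first point at or after its hash."""
    def __init__(self, servers, points_per_server=160):
        self._points = sorted(
            (_hash("%s-%d" % (s, i)), s)
            for s in servers
            for i in range(points_per_server)
        )
        self._hashes = [h for h, _ in self._points]

    def server_for(self, key):
        i = bisect.bisect(self._hashes, _hash(key)) % len(self._points)
        return self._points[i][1]

# Two simultaneous rings during cluster expansion: write to the new ring,
# read from the new ring and fall back to the old one on a miss.
old_ring = Ring(["mc1", "mc2", "mc3"])
new_ring = Ring(["mc1", "mc2", "mc3", "mc4"])
moved = sum(
    old_ring.server_for("key%d" % i) != new_ring.server_for("key%d" % i)
    for i in range(1000)
)
```

The payoff over naive modulo hashing is that adding a node moves only roughly 1/(n+1) of the keys, so the transitional two-ring state stays cheap.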
The code we had to write was
- Some patches to Richard Jones’s ketama, described in full detail in the previous blog post: https://github.com/wayfair/ketama.
- Some patches to Twitter’s Twemproxy : https://github.com/wayfair/twemproxy, a minor change, making it interoperable with the previous item.
- Revisions to php-pecl-memcached, removing a ‘version’ check
- A Zookeeper script to nanny misbehaving cluster nodes. Here’s a gist to give the idea.
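We won't reproduce the whole script here, but the decision logic at its core looks something like this sketch. The function name and quorum threshold are ours for illustration; the actual ZooKeeper reads and writes, done via a client library, are only indicated in comments:

```python
# Sketch of the nanny's decision step; the real script uses a ZooKeeper
# client to update the server list that the proxies read.
def surviving_servers(servers, is_healthy, min_quorum=2):
    """Return the servers that should stay in the ring.

    If too few nodes look healthy, assume the checker itself is the
    problem (e.g. a network partition on our side) and change nothing.
    """
    healthy = [s for s in servers if is_healthy(s)]
    if len(healthy) < min_quorum:
        return servers  # refuse to eject most of the cluster at once
    return healthy

# In the real script, a change in this list is written to a ZooKeeper
# znode, and watchers push the new configuration out to the proxies.
```

The important property is the conservative failure mode: a flapping health check can eject one node, but it can't depopulate the ring.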
Twemproxy/Nutcracker has had Redis support from early on, but apparently Twitter does not run Twemproxy in front of Redis in production, as Yao Yue of Twitter’s cache team discusses here: https://www.youtube.com/watch?v=rP9EKvWt0zo. So we are not necessarily surprised that it didn’t ‘just work’ for us without a slight modification, and the addition of the Zookeeper component.
Along the way, we considered two other solutions for all or part of this problem space: mcrouter and Redis Cluster. There’s not much to the mcrouter decision. Facebook released mcrouter last summer. Our core use cases were already covered by our evolving composite system, and it seemed like a lot of work to hack Redis support into it, so we didn’t do it. mcrouter is an awesome piece of software, and in the abstract it is more full-featured than what we have. But since we’re already down the road of using Redis as a Twitter-style ‘data structures’ server, instead of something more special-purpose like Facebook’s Tao, which is the other thing that mcrouter supports, it felt imprudent to go out on a limb of Redis/mcrouter hacking. The other decision, the one where we decided not to use Redis Cluster, was more of a gut-feel thing at the time: we did not want to centralize responsibility for serious matters like shard location with the database. Those databases have a lot to think about already! We’ll certainly continue to keep an eye on that product as it matures.
There’s a sort of footnote to the alternative-technologies analysis that’s worth mentioning. We followed the ‘Database 0’ discussion among @antirez and his acolytes with interest. Long story short: numbered databases will continue to exist in Redis, but they are not supported in either Redis Cluster or Twemproxy. That looks to us like the consensus of the relevant community. Like many people, we had started using the numbered databases as a quick and dirty set of namespaces quite some time ago, so we thought about hacking *that* into Twemproxy, but decided against it. And then of course we had to move all our data into Database 0, and get our namespace act together, which we did.
Mad props to the loosely confederated cast of characters that I call our distributed systems team. You won’t find them in the org chart at Wayfair, because having a centralized distributed systems team just feels wrong. They lurk in a seemingly random set of software and systems groups throughout Wayfair engineering. Special honors to Clayton and Andrii for relentlessly cutting wasteful pieces of code out of components where they didn’t belong, and replacing them with leaner structures in the right subsystem.
Even madder props to the same pair of engineers, for seamless handling of the operational aspects of transitions, as we hit various milestones along this road. Here are some graphs, from milestone game days. In the first one, we start using Twemproxy for data that was already in Database 0. We cut connections to Redis in half:
Then we take another big step down.
Add the two steps, and we’re going from 8K connections to 219. Sorry for the past, network people, and thanks for your patience! We promise to be good citizens from now on.
Andrew has given a couple of talks at national conferences recently, on other front-end topics. First there was CSS Dev Conf 2014, on web components (slides here: http://www.slideshare.net/andrewrota/web-components-and-modular-css), and more recently React.js Conf 2015, where he spoke about the interoperation of web components and React. Wow, that was a hot conference! Tickets could be had for only a few minutes before it was sold out. Fortunately the Facebook conference people are always really great about posting video, and here’s his talk on Youtube, with slides: http://www.slideshare.net/andrewrota/the-complementarity-of-reactjs-and-web-components.
Update: the slides from the presentation last Thursday night are up, here: http://www.slideshare.net/andrewrota/an-exploration-of-frameworks-and-why-we-built-our-own-46467292
Wayfair Engineering places special emphasis on software testing as a means of maintaining stability in production. The DevTools team, which I am a member of, has built and integrated a number of tools into our development and deploy process in order to catch errors as early as possible, especially before they land in master. If you missed it, last week we released sp-phpunit, a script to manage running PHPUnit tests in parallel.
Today we’re open-sourcing hussar, another tool we’ve been using as part of our deploy process, that performs PHP static analysis using HHVM. The name comes partly from the cavalry unit in Age of Empires II — a classic strategy game where a few of us on DevTools still fight to the end — but mainly from the fact that it’s a good, open name that shares a few letters with the tool’s main feature: HHVM static analysis.
Put simply, hussar builds and compares HHVM static analysis reports. It maintains a project workspace and shows errors introduced by applying patches or merging branches. With hussar, projects that cannot yet run on HHVM are able to realize the benefits of static analysis and catch potentially fatal errors prior to runtime. Here is a list of errors hussar can catch. The tool displays these errors in a formatted report. This means our engineers get the safety of strong typing and static code analysis in addition to the flexibility of development they’re accustomed to.
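The heart of that build-and-compare step can be sketched in a few lines of Python. The error-tuple shape, file names, and messages here are made up for illustration; hussar itself parses HHVM's analysis output:

```python
def introduced_errors(baseline, patched):
    """Errors present after the patch that the baseline didn't have.

    Comparing on (file, message) and ignoring line numbers keeps the
    diff stable when unrelated code shifts within a file.
    """
    seen = {(f, msg) for f, _line, msg in baseline}
    return [e for e in patched if (e[0], e[2]) not in seen]

baseline = [("cart.php", 10, "UndefinedVariable: $total")]
patched = [
    ("cart.php", 12, "UndefinedVariable: $total"),   # pre-existing, just moved
    ("cart.php", 40, "TooManyArguments: addItem"),   # newly introduced
]
new_errors = introduced_errors(baseline, patched)
```

Only the newly introduced errors fail the build, which is what lets a codebase with a backlog of pre-existing issues still enforce a no-backsliding rule.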
We wrote hussar as a preparatory step toward possibly running Wayfair on HHVM. When we first tried to use HHVM, we discovered that it lacked support for a number of features and extensions used throughout our codebase. Recognizing both the performance and code quality benefits it could provide, we hacked together a script that would get at least the code analysis component working. Over the past few months this script has gone through several iterations as we worked on edge cases and ironed out false-positives to increase its accuracy and utility. The result is a tool that reliably reports legitimate errors.
We’re using hussar as part of our deploy process in addition to our integration and unit tests. Since we’ve started using the tool it has made us aware of a number of errors that slipped through our other tests. This multi-faceted approach to testing has allowed us to be more confident in deployments while keeping productivity high.
Our engineers are also able to run hussar against their code in advance of the deployment process, so ideally any errors are caught even before code review. We’re using a remotely triggered Jenkins job to coordinate running hussar builds on a dedicated testing cluster. HHVM is a bit heavy, so each machine has 6 GB of RAM, and reports are written to a shared filesystem to avoid repeating work. Run time is usually five minutes or less.
We also generate a full report nightly and are working through the backlog of existing errors. Each resolved error improves our codebase and brings us one step closer to the possibility of running our sites on the HHVM platform.
We think hussar solves the “backsliding” problem likely faced by all project maintainers with large PHP codebases when considering a migration onto HHVM. That is, it’s usually impractical to address all the static analysis issues at once, since tech debt continues to accumulate as developers work through the backlog. This is solved by hussar’s focus on preventing new errors, which allows momentum to build in the efforts to clean the rest of the codebase. For us, the proof of this is that the number of errors found by static analysis across our codebase has been steadily declining since we started using hussar.
For more details on how hussar works and information on how you can start using it to analyze your own code, head over to the project’s GitHub page. We hope you find it useful, and welcome any contributions!
We write a lot of PHP unit tests at Wayfair, and we want to be able to run them as fast as possible, which seems like a good use case for parallelization. Running tests in parallel is not built in to PHPUnit, but there are ways to do it. When we looked we found three: parallel-phpunit, ParaTest, and GNU Parallel. All met some of our needs, but none was exactly what we wanted, so we got to work.
After hacking for a bit, we settled on these requirements:
- Easy to set up
- Fast to run
- Minimalist configuration and resource usage
- Not dependent on PHP, because chicken-before-egg
and these specifications for input, operation and output:
- Use suites: work with existing test suites, or make suites as needed out of individual test files
- Run suites in parallel
- Preserve exit codes and errors
We looked at GNU Parallel. It worked, but it was an additional dependency, and it is not conveniently packaged for a broad set of platforms. It also ended up running more slowly than backgrounding in the shell, and since we didn’t need any of the fancier/nicer features of it, we cut it out of our scripts.
ParaTest is awesome, but it uses PHP, which complicates things when we’re testing new versions or features of PHP.
Parallel-phpunit was the closest existing tool to what we wanted, but we didn’t like the overhead of invoking a separate PHPUnit process for each file. The logical design of our new ‘Sweet Parallel PHPUnit’ is a suite-enabled bash test-runner script, similar to parallel-phpunit, with output and error codes handled to our liking. The Linux PIPESTATUS array variable was the key to doing this last part in bash.
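The real runner is a bash script, but its control flow translates to a few lines of Python. The suite commands below are placeholders; in practice each would be a `phpunit` invocation against a generated suite file:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_suites(commands, workers=6):
    """Run one test command per suite in parallel and return the worst
    exit code, the way PIPESTATUS lets the bash version preserve each
    pipeline's status instead of only the last command's."""
    def run(cmd):
        proc = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        return proc.returncode
    with ThreadPoolExecutor(max_workers=workers) as pool:
        codes = list(pool.map(run, commands))
    return max(codes)
```

Preserving the worst exit code is the part that is easy to get wrong in shell: pipe phpunit's output through a formatter and `$?` reflects the formatter, not the tests, which is exactly what `PIPESTATUS` exists to fix.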
So we finally got everything working, as you can see at https://github.com/wayfair/sp-phpunit on GitHub, and it was time for the moment of truth. Did it actually work any faster than our other options, on our own largest battery of tests? YES! We cut our run time down by 36% relative to the fastest alternative, while maintaining a small memory footprint! Before open-sourcing it, we also generated some generic tests, to convince ourselves that our success wasn’t a coincidental artifact of our own test suite.
We wrote scripts to handle a few different scenarios. Here is what they generate:
- One massive file with 2500 unit tests.
- 25 folders each with 100 files, each containing exactly 1 unit test.
- 10 folders each containing 10 files that have 1 unit test that sleeps for between 0 and 2 seconds.
- Same thing, but with one anomalous file, hand-edited, that sleeps for 30 seconds instead of 0-2.
- 25 files each containing 100 unit tests.
Below you can see the results of each suite run against the other parallel options, as well as PHPUnit directly for comparison.
Running with 6 parallel threads, average over 5 runs (minutes:seconds):

| Scenario | sp-phpunit | parallel-phpunit | ParaTest | PHPUnit |
|---|---|---|---|---|
| 2500 files with 1 test each | 00:08.93 | 02:00.00 | 03:05.67 | 00:34.87 |
| 25 files with 100 tests | 00:01.35 | 00:02.12 | 00:02.65 | 00:01.47 |
| One file with 2500 tests | 00:01.83 | 00:01.99 | 00:01.73 | 00:01.49 |
| 100 files with sleeps | 00:18.51 | 00:19.45 | 00:22.85 | 01:40.36 |
| 100 files with sleeps (one file sleeps for 30 seconds) | 00:45.47 | 00:30.47 | 00:32.55 | 02:10.55 |
You can see that the more files you have, the more sp-phpunit really shines. We happen to have many small files, with many quick tests spread out among them, so our real test suite is most like the first line in the table, and the improvements are dramatic.
The TODO list for this project is not empty. The way that sp-phpunit generates its temporary suites has no knowledge of how long each sub test/file will take. This can lead to some bad luck where, for example, if you do 6 parallel runs, 5 might finish in 3 seconds, but the one that happens to contain the slow tests will take, say, another minute to finish. This is clearly shown in the last row of our table. The ‘sleep 30’ is added to a bunch of other tests, and the cumulative effect, because of the grouping that we do, pushes the cumulative time for sp-phpunit higher than the other frameworks.
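One possible fix, if we recorded per-file durations from a previous run, is greedy longest-first assignment. This is a sketch of the idea, not something sp-phpunit does today:

```python
import heapq

def makespan(durations, workers=6):
    """Greedy longest-processing-time scheduling: hand each file to the
    currently least-loaded worker, longest files first, and return the
    wall-clock time of the slowest worker. With timing data, a
    30-second outlier ends up in a suite mostly by itself."""
    heap = [(0.0, i) for i in range(workers)]
    heapq.heapify(heap)
    loads = [0.0] * workers
    for d in sorted(durations, reverse=True):
        load, i = heapq.heappop(heap)
        loads[i] = load + d
        heapq.heappush(heap, (loads[i], i))
    return max(loads)
```

Against the pathological last row of the table, this would pin the run time near the 30-second outlier instead of stacking other tests behind it.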
In upcoming versions I’d like to implement something so that you can pass in a folder to create a suite from. Also since this was created for our system only, I’m sure there are some options that other people will want or need that we have not implemented simply because the default behavior works for us. I hope this has given some insight into how and why we built sp-phpunit. We hope others find it as useful as we have. If you do happen to check it out and have some results you would like to share, please reach out. We’d be very excited to hear about it!
Catchpoint is running an event today, called “WebPerf on Location Boston”, part of a series of such events in different cities. It starts at 2, and our very own Jack Wood, Wayfair CIO, is speaking at 3:30. Details and the link to register are here. It’s at Battery Ventures in the seaport area, and it should be an excellent afternoon.
Andrew Rota, an engineer in our client technologies group, is speaking in a bit, at the CSS Dev conference in New Orleans, on “Web Components and Modular CSS #components”. It’s a great talk, about these emerging standards and what’s possible to do with them. If you’re there, check it out: http://cssdevconf2014.sched.org/. We’ll post the links to slides and video when they come out.
Here’s our Frontline team at work, in our spiffy network operations center:
Selling home goods on the internet isn’t rocket science, but if you actually wanted to send a couch to the moon, you’d want to plan for and monitor that from a room like this. If you can’t make out the clocks on the wall, those are the times in Seattle, Ogden, Utah, Boston/NYC/Hebron, Kentucky, London/Galway, Berlin, and Sydney.
Wayfair (W): ready for lift-off.