When you write your first web application, chances are you’re going to query a database. When you write it in PHP, chances are it’ll look like this:

$mysqli = new mysqli("example.com", "user", "password", "database");
$result = $mysqli->query("SELECT * FROM product");
$row = $result->fetch_assoc();

Before long, you have to start handling user input, which means escaping:

$mysqli = new mysqli("example.com", "user", "password", "database");
$result = $mysqli->query("SELECT * FROM product WHERE name = '" . mysqli_real_escape_string($mysqli, $product_name) . "'");
$row = $result->fetch_assoc();

As your application grows, you start writing code like this a lot. You may start encapsulating it in DAOs, but they do little besides erect walls around this chimeric code. “Okay,” you say. “This is fine, because it’s only me. I’m a Responsible Engineer and I don’t have to sugar-coat things for myself.” But soon, this project is going gangbusters. You’ve got a team, and then a large one, and now there’s no rug large enough to hide this mess under. And woe unto you should you decide you need connection pooling or any other resource management.

One solution to this problem is an ORM. But, some people prefer having their database interactions more “managed” than “abstracted away.” Instead your code could look more like this:

$pdo = new PDO("mysql:host=example.com;dbname=database", "user", "password");
$statement = $pdo->prepare("SELECT * FROM product WHERE name = :name");
$statement->bindParam(":name", $product_name);
$statement->execute();
$row = $statement->fetch(PDO::FETCH_ASSOC);

A little more verbose, yes, but also easier to read and less error-prone. This is PDO. It’s a PHP extension that provides a vendor-agnostic interface to various relational databases. It pairs a well-structured API for performing queries with a series of different database drivers.

When Wayfair began adopting PDO, our database access was relatively managed. An in-house library managed connections over the course of a request, but building queries involved a whole lot of string concatenation. Complex queries would get unwieldy. Engineers with prior PDO experience wanted to know why we weren’t using it. However, to convince engineers new to PDO that it would make their lives easier, it had to be as low friction as the existing library and produce output in the same format.

Simplifying PDO syntax was the easy part. Technically, the example given is shy on error handling. The PDO constructor can throw exceptions. Related functions return a boolean value, indicating whether they succeeded. So a “correct” PDO example would look like this:

$pdo = null;
try {
  $pdo = new PDO("mysql:host=example.com;dbname=database", "user", "password");
} catch (Exception $e) {
  // logging, etc., if you want to note when you were unable to get a connection
}

$statement = false;  // PDO::prepare() will return false if there’s an error
if ($pdo) {
  $statement = $pdo->prepare("SELECT * FROM product WHERE name = :name");
}

$row = null;
if ($statement) {
  $statement->bindParam(":name", $product_name);

  if ($statement->execute()) {
    $row = $statement->fetch(PDO::FETCH_ASSOC);
  }
}

// now, do something with $row

Awesome, I know, right? Sure, the PDO API is, on the whole, “nicer,” but no one’s going to want to deal with it if they’re forced to jump through these kinds of hoops. And who could blame you? At Wayfair, we place a lot of value on developer ergonomics. These are problems we strive to solve well when rolling out new internal tools. We landed on a slight extension to PDO that would yield this syntax:

$statement = PDO::new_statement("PT", "SELECT * FROM product WHERE name = :name"); // the first argument refers to the desired host/database
$statement->bindParam(":name", $product_name);
$statement->execute();
$row = $statement->fetch(); // PDO::FETCH_ASSOC is now the default fetch style

We pulled all the boilerplate into a factory function. It does the necessary error handling and reporting. If everything succeeds, it’ll return a standard-issue PDO statement object. If there are errors, it will return a special object which acts like a statement that’s failed, but will return empty result sets if asked. We felt comfortable that this would remove most of the friction around using PDO while preserving the underlying interface. Anyone who wants finer-grained control can still utilize the stock API.

The trickier problem was “make output the same.” While PDO looks the same with each driver, the drivers don’t necessarily behave the same. The documentation isn’t always clear about these differences. We needed to do a fair amount of testing and source code reading to suss out the effects.

While my examples have used MySQL, Wayfair is an MSSQL shop. We had been using the mssql extension. It uses a C API called DBLIB to talk to the server. Microsoft doesn’t maintain an open source version. FreeTDS is the commonly-used free implementation. One of the PDO drivers also uses DBLIB, but it returns column data differently. Instead of returning strings as strings and ints as ints, the PDO DBLIB driver returns everything as a string. We had to patch it to use the expected data types. To be able to differentiate between quoting strings as VARCHAR vs. NVARCHAR, we also added a parameter type. We also added support for setting connection timeouts (PDO defines a PDO::ATTR_TIMEOUT constant, but it has no effect with the DBLIB driver).

Another reason we were first attracted to PDO was for prepared statements. Since MSSQL supports them, it seemed like this could be an opportunity for a performance gain. However, after digging into the driver internals, we found that the DBLIB driver only emulates them. Microsoft has an ODBC driver for Linux. We tested it in conjunction with PDO’s ODBC driver, but found the two to be incompatible. We were able to get it working with the plain odbc extension, but (amazingly) found prepared statements to be slower than regular queries. Since using prepared statements would’ve necessitated a nontrivial change in coding style, we decided against investigating the speed difference.

We’re currently working on deploying SQL Relay. Preliminary tests have proven out that it reduces network load without adding much overhead. It has a PDO driver, so we’ll be able to swap it into our stack without having to change how queries are made.

Tungsten.js: UI Framework with Virtual DOM + Mustache Templates

Performance is top priority here at Wayfair, because improved performance means an improved customer experience. A significant piece of web performance is the time it takes to render, or generate, the markup for a page. Over the last several months we’ve worked hard to improve the render performance on our customer facing sites, and ensure it’s easier for our engineers to write code that results in page renders with optimal performance.

We had been using Backbone.js and Mustache templates for our JavaScript modules at Wayfair for some time, but last year we realized that our front-end performance needed an upgrade. We identified two areas for improvement: efficiency of client-side DOM updates and abstracting DOM manipulation away from engineers.

The first issue was a result of the standard render implementation in Backbone. By default, the Backbone render implementation does nothing. It is up to developers to implement the render function as they see fit. A common implementation of this (and the example given in Backbone Docs) looks something like this:

render: function() {
  this.$el.html(this.template(this.model.attributes));
  return this;
}

The problem with this implementation is two-fold: first, the entire view is unnecessarily re-rendered with jQuery’s $().html() whenever render is called, and second, the render method always manipulates the DOM regardless of whether the data changed, so engineers must be explicit about when render is called to avoid unnecessary expensive DOM updates. The solution to both of these problems is a mix of only calling render when the engineer is sure the entire view needs to be re-rendered and then writing low-level DOM manipulation code when only portions of the view need to be updated, or the update needs to be more precise (e.g., changing a single class on an element in the view). All of this means that engineers have to be very aware of the state of the DOM at all times, and have to be aware of the performance consequences of any DOM manipulations. This makes for view modules that are hard to reason about and include low-level DOM manipulation code.

To address both of these problems, we investigated front-end frameworks that would abstract the DOM from the developer while also providing high-performance updates. The primary library we looked at was React.js, a UI library open-sourced by Facebook that utilizes a one-way data flow with virtual DOM rendering to support high-performance client-side DOM updates. We really liked React.js, but encountered one major issue: the lack of support for templates that would enable high-performance server-side rendering.

On modern web sites and applications, HTML rendering occurs at two points: once on page load when the DOM is built from HTML delivered from the server, and again (0 to many times) when JavaScript updates the DOM after page load, usually as a result of the user interacting with the page. The initial rendering happens on the server with a multi-page site like Wayfair, and we’ve put a lot of work into making sure it’s as fast as it can be. HTML markup is written in Mustache templates and rendered via a C++ mustache renderer implemented as an extension for PHP. This gives us server-side rendering at speeds faster even than native PHP views.

Since server-side rendering is an important part of our web site, we were glad that React.js comes with this feature out of the box. Unfortunately while server-side rendering is available with React.js, it’s significantly slower than our existing C++ Mustache setup. In addition to performance, rendering React.js on the server would have required Node.js servers to supplement our PHP servers. This new requirement for UI rendering would have introduced complexity as well as a new single point of failure into our server stack. For these reasons, as well as the fact that we already had existing Mustache templates we wished to reuse, we decided React.js wasn’t a good fit.

Where do we go from here? We liked many of the concepts React.js introduced us to, such as reactive data-driven views and virtual DOM rendering, but we didn’t want our choice of a front-end framework to dictate our server-side technologies, and dictate a replacement of Mustache rendering via C++ and PHP. So, after some investigation of what else was available, we decided to take the concepts we liked from React.js and implement them ourselves with features that made sense for our tech stack.

Earlier this year, we wrote Tungsten.js, a modular web UI framework that leverages shared Mustache templates to enable high-performance rendering on both server and client. A few weeks ago we announced that we were open sourcing Tungsten.js, and today we’re excited to announce that primary development on Tungsten.js will be “GitHub first,” and all new updates to the framework can be found on our GitHub repo: https://github.com/wayfair/tungstenjs.

Tungsten.js is the bridge we built between Mustache templates, virtual-DOM, and Backbone.js. It uses the Ractive compiler to pre-compile Mustache templates to functions that return virtual DOM objects. It uses the virtual-DOM diff/patch library to make intelligent updates to the DOM. And it uses Backbone.js views, models, and collections as the developer-facing API. At least, it uses all these libraries for us here at Wayfair. Tungsten.js emphasizes modularity above all else. Any one of these layers in Tungsten.js can be swapped out for a similar library paired with an adaptor. Backbone could be swapped out for Ampersand. virtual-DOM could be swapped out for another implementation. Mustache could be swapped out for Handlebars, Jade, or even JSX. So, more generally, Tungsten.js is a bridge between any combination of markup notation (templates), a UI updating mechanism, and a view layer for developers.
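To make the "template function returns a virtual DOM, then diff/patch" idea concrete, here is a minimal sketch in plain JavaScript. It is not the Tungsten.js or virtual-dom API: the `h`, `diff`, and `productView` names, the node shape, and the patch format are all hypothetical, and the patches are only computed, not applied to a real DOM.

```javascript
// Hypothetical, minimal virtual DOM: a node is a string (text) or
// an object of the form { tag, props, children }.
function h(tag, props, children) {
  return { tag: tag, props: props || {}, children: children || [] };
}

// Walk two virtual trees and collect the patches that turn oldNode into newNode.
// `path` is the list of child indexes leading from the root to the current node.
function diff(oldNode, newNode, path, patches) {
  path = path || [];
  patches = patches || [];
  if (oldNode === undefined) {
    patches.push({ type: "INSERT", path: path, node: newNode });
  } else if (newNode === undefined) {
    patches.push({ type: "REMOVE", path: path });
  } else if (typeof oldNode === "string" || typeof newNode === "string") {
    // Text nodes (or text vs. element): replace only if they differ.
    if (oldNode !== newNode) {
      patches.push({ type: "REPLACE", path: path, node: newNode });
    }
  } else if (oldNode.tag !== newNode.tag) {
    patches.push({ type: "REPLACE", path: path, node: newNode });
  } else {
    // Same tag: compare properties one by one, then recurse into children.
    const names = new Set(
      Object.keys(oldNode.props).concat(Object.keys(newNode.props))
    );
    names.forEach(function (name) {
      if (oldNode.props[name] !== newNode.props[name]) {
        patches.push({
          type: "SET_PROP", path: path, name: name, value: newNode.props[name]
        });
      }
    });
    const len = Math.max(oldNode.children.length, newNode.children.length);
    for (let i = 0; i < len; i++) {
      diff(oldNode.children[i], newNode.children[i], path.concat(i), patches);
    }
  }
  return patches;
}

// Stand-in for a pre-compiled template: data in, virtual DOM out.
function productView(data) {
  return h("li", { class: data.selected ? "product selected" : "product" }, [
    h("span", {}, [data.name])
  ]);
}

const before = productView({ name: "Couch", selected: false });
const after = productView({ name: "Couch", selected: true });
const patches = diff(before, after);
console.log(patches);
```

Re-rendering the whole view here yields a single `SET_PROP` patch for the `class` attribute, which is the kind of precise update that otherwise has to be hand-written as low-level DOM code.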

We don’t expect Tungsten.js to be the best fit for everyone, but we think it fits a common set of use cases very well. We’ve been using it on customer-facing pages in production for a while here at Wayfair and so far we’ve been very happy with it. Our full-stack engineers frequently tell us they far prefer using Tungsten to vanilla Backbone.js + jQuery, and we’ve improved client-side performance now that DOM manipulation is abstracted away from developers. And while we weren’t trying to be the “fastest” front-end framework around, it turns out that when we re-implemented Ryan Florence’s DBMonster demo from React.js Conf in Tungsten.js, the browser’s frame rate ends up being, give or take, at the same level as both React and Ember with Glimmer.

Here at Wayfair we have a saying that “we’re never done”. That’s certainly the case with Tungsten.js, which we’re constantly improving. We have a lot of ideas for Tungsten.js in the coming months, so watch the repo for updates. And of course we welcome contributions!

No-follow SEO link highlighter Chrome extension

Cari, who is a developer on our SEO team, just wrote a Chrome extension that’s up on both GitHub (https://github.com/wayfair/nofollow_highlighter) and the Chrome web store. If you don’t know this subject matter, here’s a classic explanation from Matt Cutts of Google, from a few years ago: https://www.mattcutts.com/blog/pagerank-sculpting/. A lot has changed in SEO since then, but this basic idea has become a constant in internet life: if you’re compensating people for a promotional activity, they need to make that clear to Google with a ‘nofollow’ link. Here’s an example of a blogger who is doing it right, with a disclaimer saying she was compensated with a gift card, to keep the FTC happy, and a ‘nofollow’ link with the yellow highlight from our plug-in, indicating that the link properly warns Google not to pass page rank, or whatever they’re calling ‘Google mojo’ these days, through to the destination. The other links on the page don’t show up color-coded one way or the other, because they go to domains we don’t care about.

A green highlight, indicating that Google thinks we want it to pass page rank, would be a problem. If you don’t like the idea of green being a problem, the default yellow and green colors are configurable to whatever you want.

If your promotions people are working with a blogger who forgets to do that, or misspells ‘nofollow,’ or anything along those lines, it’s on you to get that cleaned up in a hurry. It’s suboptimal to have to ‘view-source’ on every page or write your own crawler: enter the browser-based ‘no-follow’ extension. There have been a few of these for different browsers over the years, but none did exactly what we wanted, so we rolled our own.

The features of ours that we like are:

  • Configurable list of domains whose reputation you are trying to defend.
  • Click-button activation/deactivation on pages, which is persistent.
  • Aggressive defense against misspellings, bad formatting, etc.

The configurable list is important, because if you’re looking over a page that links to one of your sites, and it links to several other sites, it’s best if you don’t have to puzzle over which links you care about.

The persistent flagging of pages that you care about is important, because if you’re engaged in a promotional activity with a site, odds are someone at your company is going to be on that site from time to time. Tell your colleague to enable the plugin and to be on the lookout for green links, and you’ve got a visual cue that’s hard to miss, for problems that might arise.

The defense against misspellings, special characters, and the like, is for this scenario. Brian, head of Wayfair SEO: “Hey Bob, did you put ‘nofollow’ on those links?” Bob: “Yup”. Brian: “kthxbye”. But in fact, although Bob is telling the truth, he actually put smart quotes, rather than ascii quotes, around ‘nofollow,’ so Google will not recognize the instruction. It’s funny: browsers do a great job of supplying missing closing tags, guessing common spelling mistakes, etc., because their mission in life is to paint the page, regardless of the foibles and carelessness of web page authors. Nationwide proofreaders’ strike? No problem, browsers will more or less read your mind! But Google’s mission in life is to crawl the web and pass page rank, so you have to tell it very clearly not to do that.
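As a rough illustration of the Bob scenario, here is a sketch of how a link’s rel value could be classified. This is not the extension’s actual code: the `classifyRel` name, the three labels, and the near-miss heuristic are assumptions made for the example.

```javascript
// Hypothetical classifier for the value of a link's rel attribute.
// Returns "nofollow" when the exact token is present, "broken-nofollow" when
// something that looks like a mangled nofollow is present, and "follow" otherwise.
function classifyRel(relValue) {
  // rel is a space-separated, case-insensitive list of tokens.
  const tokens = (relValue || "").toLowerCase().split(/\s+/).filter(Boolean);
  if (tokens.includes("nofollow")) {
    return "nofollow";
  }
  // Assumed heuristic: strip everything except letters and hyphens, then look
  // for common near-misses such as smart-quoted, hyphenated, or misspelled forms.
  const nearMiss = tokens.some(function (token) {
    return /no-?fol+ow/.test(token.replace(/[^a-z-]/g, ""));
  });
  return nearMiss ? "broken-nofollow" : "follow";
}

console.log(classifyRel("nofollow"));               // exact token
console.log(classifyRel("\u201Cnofollow\u201D"));   // smart quotes became part of the token
console.log(classifyRel("noopener"));               // unrelated token
```

The smart-quote case arises because a browser treats the curly quotes as part of an unquoted attribute value, so the token is no longer exactly `nofollow`, even though Bob typed the right word.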

And of course, when all your links are clean, you can have a luau party in your Tiki hut, like our SEO team:
[Photo: SEO tiki hut luau]

TechJam 2015

Come hang out with us tomorrow, June 11th, from 4 pm to 9 pm, at TechJam. Not to be too transparent, but we’re hiring! We will be at booth 43, bostontechjam@wayfair.com, #btj2015. Steve Conine, Wayfair Founder and CTO, and I will be there, along with a bunch of our colleagues in Wayfair engineering.

We will have a Money Booth where people can enter by checking in on Facebook or tagging Wayfair in a picture on Instagram/Twitter, #wayfairbtj2015. The person who grabs the most money will win that amount in Wayfair Bucks.

We will also have some Wayfair swag at the booth.

vim emacs talk by Aaron Bieber

I can’t believe I’m writing a post about vim and emacs in the year 2015! But our very own Aaron Bieber just spoke at the vim meetup on how he’s been secretly using emacs all the time for a few months, and is now coming out of the closet as an emacs user. Vim vs. emacs is an eternal holy war, and pretty much the opposite of a topic that I would normally want to write about. But Aaron is the opposite of a holy warrior, as anyone at Wayfair Engineering can tell you. Here’s the announcement of the talk: http://www.meetup.com/The-Boston-Vim-Meetup/events/222395931/, here’s his personal blog post on the topic: http://blog.aaronbieber.com/blog/2015/01/11/learning-to-love-emacs/, and here’s the video: https://www.youtube.com/watch?v=JWD1Fpdd4Pc, with cool jazz!

Evil mode is what makes this possible, of course. I used to be Aaron’s manager, and as someone who has used his .vimrc / .vim-folder setup, after watching this talk I’m at least going to try his .emacs file, as an adjunct to my crusty old pseudo-Python-IDE thing. As an engineering manager, the key comment to me was the thing about how ctags (for tab completion) works just as well in both environments. As long as people are using something that helps them save time that would otherwise be spent on meaningless drudgery, to each his own!

Announcing Tungstenjs

Matt DeGennaro and Andrew Rota of our JavaScript team recently spoke at the BostonJS meetup on a library we have written called Tungstenjs, which we have open sourced today. It takes the fast-virtual-dom-update idea from React.js and makes it usable with other frameworks, including Backbone.js. It ships with a Backbone adapter. There’s a server-side component too, using npm and Mustache templating, but perhaps I should just let the ‘readme’ tell you: https://github.com/wayfair/tungstenjs. We’ve been using it on Wayfair, and it’s awesome.

Stackdive: the evolution of Wayfair’s stack

Jack Wood and I, CIO and Chief Architect of Wayfair, spoke at Stackdive, at Wayfair’s on April 23. Here’s the matching blog post we published on the Stackdive site, now crossposted here.

Jack and I are both long-time software guys who now spend somewhat less of our time thinking about what to build, and somewhat more about how to keep a large number of systems running well. The emergent DevOps culture of the last few years has made it easy for people like us to move between these worlds. Are they really separate worlds anymore? In 2009 John Allspaw and Paul Hammond, heads at the time of ops and development, respectively, at Flickr, gave an influential talk at Velocity called “10+ deploys per day: Dev and Ops Cooperation at Flickr”. It’s about their deploy tool and IRC bots, sure, but it’s also about culture, and especially how to get dev and ops teams out of adversarial habits and into a productive state, with a combination of good manners and proper tooling. We have cribbed quite a bit from that family of techniques since 2010, but Wayfair has had an evolution of stack and culture that’s distinctive, and we’re going to try to give a close-up picture of it.

Let’s start with a brief overview of the customer-facing Wayfair stack over the years since our founding. We’re going to stick to the customer-facing systems, because although there is a lot more to Wayfair tech than what you see here, we’re afraid this will turn into a book if we don’t bound it somehow.

Early 2002 (founding, in a back room of Steve Conine’s house): Yahoo Shopping for about 5 minutes, then ASP + Microsoft Access.


Late 2002: ASP + SQL Server

The middle years: Forget about the tech stack for a while, add VSS (Windows source control) relatively early.  Hosting goes from Yahoo shopping, to Hostway shared, to Hostway dedicated, to Savvis managed, to a Savvis cage (now CenturyLink) in Waltham.  Programmers focus on building a custom RDBMS-backed platform supporting the business up to ~$300M in sales…

2010: Add Solr for search because database-backed search was slow, small experiments with Hadoop, Netezza, Seattle data center.  Yes, you read that right.  2010.  8. years. later.  That’s what happens when you’re serious about the whole lean-startup/focus-on-the-business/minimum-viable-stack thing.  But now we’ve broken the seal.

2011-2012: Add PHP on FreeBSD, Memcached, MongoDB, Redis, MySQL, .NET and PHP services, Hadoop and Netezza powering site features, RabbitMQ, Zookeeper, Storm, Celery, Subversion everywhere and git experiments, ReviewBoard, Jenkins, deploy tool.  Whoah!  What happened there!?  More on that below.

2013: Serve the last customer-facing ASP request, serve west-coast traffic from Seattle data center on a constant basis, start switching from FreeBSD to CentOS, put git into widespread use.

2014: Vertica.

2015: Dublin data center.

SiteSpect has been our A/B testing framework from very early on, and at this point we have in-line, on-premise SiteSpect boxes between the perimeter and the php-fpm servers in all three data centers.

Here’s a boxes-and-arrows diagram of what’s in our data centers, with a ‘Waltham only’ label on the things that aren’t duplicated, except for disaster recovery purposes.  Strongview is an email system that we’re not going to discuss in depth here, but it’s an important part of our system.  HP Tableau is a dashboard thing that is pointed at the OLAP-style SQL Servers and Vertica.



Why did we move from ASP to PHP, and why not to .NET?  That’s one of the most fascinating things about the whole sequence.  Classic ASP worked for us for years, enabling casual coders, as well as some hard-core programmers who weren’t very finicky about the elegance of the tools they were using, to be responsive to customer needs and competitor attacks.  Of course there was a huge pile of spaghetti code: we’ll happily buy a round of drinks for the people with elegant architectures who went out of business in the mean time!

But by 2008 or so, classic ASP had started to look like an untenable platform, not least because we had to make special rules for certain sensitive files: we could only push them at low-traffic times of day, which was becoming a serious impediment to sound operations and developer productivity. Microsoft was pushing ASP.NET as an alternative, and on its face that is a natural progression. We gave it a try. We found it to be very unnatural for us, a near-total change in tech culture, in the opposite direction from where we wanted to go: expensive tools, laborious build process, no meaningful improvement in the complexity of calling library code from scripts, etc., etc. We eventually found our way to PHP, which, like ASP, allowed web application developers to do keep-it-simple-stupid problem solving, but to rationalize caching and move our code deployment and configuration management into a good place.  In the early days of Wayfair, when there was not even version control, coders would ftp ASP scripts to production.  That’s a simple model that has a fire-and-forget feel to an application developer that is very pleasant.  Something goes wrong?  Redeploy the old version, work up a fix, and fire again, with no server lifecycle management to worry about.  It is a lot easier to write a functional tool to deploy code, when you don’t have to do a lot of server lifecycle management, as you do with Java, .NET, in fact most platforms.  So we got that working for PHP on FreeBSD, but soon applied it to ASP, Solr, Python and .NET services, and SQL Server stored procedures or ‘sprocs’.  Obviously, by that point we had had to figure out server lifecycles after all, but it’s hard to overstate the importance of the ease of getting to a simple system that worked. The operational aspects of php-fpm were a great help in that area.  
The core of Wayfair application development is what it has always been: pushing code and data structures independently, and pushing small pieces of code independently from the rest of the code base.  It’s just that we’re now doing it on a gradually expanding farm of more than a thousand servers, that spans three data centers in Massachusetts, Washington state, and Ireland.

It’s funny.  Microservices are all the rage right now, and I was speaking with a microservices luminary a couple of weeks ago.  I described how we deploy PHP files to our services layer, either by themselves, or in combination with complementary front-end code, data structures, etc., and theorized that as soon as I had a glorified ftp of all that working in 2011, I had a microservices architecture.  He said something like, “Well, if you can deploy one service independently of the others, I guess that’s right.”   Still, I wouldn’t actually call what we have a microservices architecture, or an SOA, even though we have quite a few services now.  On the other hand, there’s too much going on in that diagram for it to be called a simple RDBMS-backed web site.  So what is it?  When I need a soundbite on the stack, I usually say, “PHP + databases and continuous deployment of the broadly Facebook/Etsy type.  With couches.”  So that’s a thing now.

Let’s dig in on continuous deployment, and our deploy tool.  Here’s a chart of all the deploys for the last week, one bar per day, screenshot from our engineering blog’s chrome:


Between 110 and 210 per day, Monday to Friday, stepping down through the week, and then a handful of fixes on the weekend.  What do those numbers really mean in the life of a developer?  There’s actually some aggregation behind the numbers in this simple histogram.  We group individual git changesets into batches, and then test them and deploy them, with zero downtime.  The metric in the histogram is changesets, not ‘deploys’.  Individual changesets can be deployed one by one, but there’s usually so much going on that the batching helps a lot.  Database changes are deployed through the same tool, although never batched.  The implementation of what ‘deploy’ means is very different for a php/css/js file on the one side, and a sproc on the other, but the developer’s interface to it is identical.  Most deploys are pretty simple, but once in a while, to do the zero downtime thing for a big change, a developer might have to make a plan to do a few deploys, along the lines of: (1) change database adding new structures, (2) deploy backwards-compatible code, (3) make sure we really won’t have to go back, (4) clean up tech debt, (5) remove backwards-compatible database constructs.  From the point of view of engineering management, the important thing is to allow the development team to go about their business with occasional check-ins and assistance from DBAs, rather than a gating function.
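The batching rule described above (code changesets grouped, database changes never batched) can be sketched as follows. This is an illustration, not our deploy tool: the `planDeploys` name, the `kind` field, the size cap, and the choice to preserve changeset order are all assumptions.

```javascript
// Hypothetical sketch: group changesets into deploys. Code changesets are
// batched together, up to maxBatchSize; each database changeset is deployed
// on its own. Order is preserved, so a database change also closes the
// batch that came before it.
function planDeploys(changesets, maxBatchSize) {
  const deploys = [];
  let batch = [];
  for (const changeset of changesets) {
    if (changeset.kind === "database") {
      if (batch.length > 0) {
        deploys.push(batch);
        batch = [];
      }
      deploys.push([changeset]); // never batched
    } else {
      batch.push(changeset);
      if (batch.length === maxBatchSize) {
        deploys.push(batch);
        batch = [];
      }
    }
  }
  if (batch.length > 0) {
    deploys.push(batch);
  }
  return deploys;
}

const queue = [
  { id: "a", kind: "code" },
  { id: "b", kind: "code" },
  { id: "c", kind: "database" },
  { id: "d", kind: "code" }
];
// Three deploys: [a, b], then [c] alone, then [d].
console.log(planDeploys(queue, 10));
```

Counting changesets rather than deploys, as the histogram does, would report 4 for this queue even though only 3 deploys ran.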

Memcached and Redis are half-way house caches and storage for simple and complex data structures, respectively, but what about MongoDB and MySQL?  Great question.  In 2010 we launched a brand new home-goods flash-sale business called Jossandmain.com.  We outsourced development at first, and the business was a big success.  We went live with an in-house version a year later, in November, 2011.  Working with key-value stores that have sophisticated blob-o-document storage-and-retrieval capabilities has been a fun thing for developers for a while now.  It had the freshness of new things to us in 2011.  There were no helpful DAOs in our pile of RDBMS-backed spaghetti code at the time, so the devs were in the classic mode of having to think about storage way too often.  Working with medium-sized data structures (a complex product, an ‘event’, etc.) that we could quickly stash in a highly available data store felt like a big productivity gain for the 4-person team that built that site in a few months.  So why didn’t we switch the whole kit and caboodle over to this productivity-enhancing stack?  First of all, we’re not twitchy that way.  But secondly, and most importantly, what sped up new feature development had an irritating tendency to slow down analysis.  And those document/k-v databases definitely slow you down for analysis, unless you have a large number of analysts with exotic ETL and programming skills.   We love how MongoDB has worked out for our flash sale site, but as we extrapolated the idea of adopting it across the sites that use our big catalogue, we foresaw quagmire.  By 2011, we were a large enough business that a big hit to analyst productivity was much worse than a small cramp in the developers’ style.

Around the same time, we began to experiment with moving some data that had been in SQL Server into MySQL and MySQL Cluster.  The idea was to cut licensing cost and remove some of the cruft that has accumulated in our sproc layer.  We have since backed off that idea, because after a little experimentation it began to seem like a fool’s errand.  We would have been moving to a database with worse join algorithm implementations and a more limited sproc language, which in practice would have meant porting all our sprocs to application code, a huge exercise of zero obvious benefit to our customers.  Since the sprocs are already part of the deployment system, the only compensation besides licensing cost would have been increased uniformity of production operations, which would have been standardized on Linux, but in the end we did not like that trade-off.

Wow! Stored procedures along with application code, colo instead of cloud.  We’re really checking all the boxes for Luddite internet companies, aren’t we!?  I can’t tell you how many times I’ve gotten a gobsmacked look and a load of snark from punks at meetups who basically ask me how we can still be in business with those choices.

Let’s take these questions one at a time.  First, the sprocs.  When we say sproc, of course, we mean any kind of code that lives in the DBMS.  In SQL Server, these can be stored procedures, triggers, or functions.  We also have .NET assemblies, which you can call like a function from inside T-SQL.  Who among us programmers does not have a visceral horror for these things?  I know I do.  The coding languages (T-SQL, PL/SQL and their ilk) are unpleasant to read and write, and in many shops, the deployment procedures can be usefully compared to throwing a Molotov cocktail over a barbed-wire fence, to where an angry platoon of DBAs will attempt to minimize the damage it might do.  Not that we don’t have deployment problems with sprocs once in a while, but they’re deploy-tool-enabled pieces of code like anything else, and the programmers are empowered to push them.

Secondly, the cloud.  If we were starting Wayfair today, would we start it on AWS or GCP?  Probably.  But Wayfair grew to be a large business before the cloud was a thing.  Our traffic can be spiky, but not as bad as sports sites or the internet as a whole.  We need to do some planning to turn servers on ahead of anticipated growth, particularly around the holiday season, but it’s typically an early month of the new year when our average traffic is above the peak for the holidays of the previous year, so we don’t think we’re missing a big cost-control opportunity there.  Cloud operations certainly speed up one’s ability to turn new things on quickly, but the large-scale operations who make that economical typically have to assign a team to write code that turns things *off*.  One way or the other, nobody avoids spending some mindshare to nanny the servers, unless they don’t care about the unit economics of their business.  In early startup mode, that’s often fine.  Where we are?  Meh.  It’s a problem, among many others, that we throw smart people at.  We think our team is pretty efficient, and we know they’re good at what they do.  What is the Wayfair ‘cloud’, which is to say that thing that allows our developers to have the servers they need, more or less when they need them?  It looks something like this:


We’re afraid of vendor lock-in, of course, with some of the hardware, which we mostly buy and don’t rent:


But the gentleman on the right makes sure we get good deals.

That’s it for now.  Another day, we’ll dig in on the back end for all this.

Scaling Redis and Memcached at Wayfair

I wrote a post last year on consistent hashing for Redis and Memcached with ketama: http://engineering.wayfair.com/consistent-hashing-with-memcached-or-redis-and-a-patch-to-libketama/. We’ve evolved our system a lot since then, and I gave a talk about the latest developments at Facebook’s excellent Data@Scale Boston conference in November: https://www.youtube.com/watch?v=oLjryfUZPXU. We have some updates to both design and code that we’re ready to share.

To recap the talk: at any given point over the last four years, we have had what I’d call a minimum viable caching system. The stages were:

  1. Stand up a master-slave Memcached pair.
  2. Add sharded Redis, each shard a master-slave pair, with loosely Pinstagram-style persistence, consistent hashing based on fully distributed ketama clients, and Zookeeper to notify clients of configuration changes.
  3. Replace (1) with Wayfair-ketamafied Memcached, with no master-slaves, just ketama failover, also managed by Zookeeper.
  4. Put Twemproxy in front of the Memcached, with Wayfair-ketamafied Twemproxy hacked into it. The ketama code moves from clients, such as PHP scripts and Python services, to the proxy component. The two systems, one with configuration fully distributed, one proxy-based, maintain interoperability, and a few fully distributed clients remain alive to this day.
  5. Add Redis configuration improvements, especially 2 simultaneous hash rings for transitional states during cluster expansion.
  6. Switch all Redis keys to ‘Database 0’.
  7. Put Wayfairized Twemproxy in front of Redis.
  8. Stand up a second Redis cluster in every data center, with essentially the same configuration as Memcached, where there’s no slave for each shard, and every key can be lazily populated from an interactive (non-batch) source.
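
The consistent hashing that runs through these stages can be sketched in a few lines of Python. This is only an illustration of the general ketama idea, not the libketama code or our patches to it: the `HashRing` class, the `points_per_node` value, the `down` parameter and the server names are all made up for the example. The `candidate_nodes` helper is one guess at how two simultaneous rings might be consulted during a cluster expansion (new placement first, old placement second); the real transitional logic may well differ.

```python
import bisect
import hashlib


class HashRing:
    """A ketama-style consistent hash ring (illustrative, not libketama)."""

    def __init__(self, nodes, points_per_node=160):
        # Each node is placed at many pseudo-random points on the ring.
        self._ring = sorted(
            (self._hash(f"{node}-{i}"), node)
            for node in nodes
            for i in range(points_per_node)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(value):
        # First 4 bytes of the MD5 digest as an unsigned 32-bit integer.
        digest = hashlib.md5(value.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big")

    def get_node(self, key, down=frozenset()):
        """Return the first live node at or clockwise from the key's point."""
        start = bisect.bisect_left(self._points, self._hash(key))
        for offset in range(len(self._ring)):
            _, node = self._ring[(start + offset) % len(self._ring)]
            if node not in down:
                return node
        raise RuntimeError("no live nodes in ring")


def candidate_nodes(key, new_ring, old_ring):
    """Assumed transitional read order: new placement, then old placement."""
    new_node = new_ring.get_node(key)
    old_node = old_ring.get_node(key)
    return [new_node] if new_node == old_node else [new_node, old_node]


# Hypothetical server names, for demonstration only.
old_ring = HashRing(["cache1:11211", "cache2:11211", "cache3:11211"])
new_ring = HashRing(["cache1:11211", "cache2:11211", "cache3:11211", "cache4:11211"])

keys = [f"product:{i}" for i in range(1000)]
moved = [k for k in keys if old_ring.get_node(k) != new_ring.get_node(k)]

# Failover: if the primary node is marked down, the key lands on another node.
primary = old_ring.get_node("product:42")
fallback = old_ring.get_node("product:42", down={primary})
```

The point of the exercise is visible in `moved`: adding a fourth node relocates only the keys that now belong to that node, while every other key stays where it was. The same walk around the ring gives failover without master-slave pairs, since a key whose node is down simply falls through to the next live node.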

The code we had to write was:

  1. Some patches to Richard Jones’s ketama, described in full detail in the previous blog post: https://github.com/wayfair/ketama.
  2. Some patches to Twitter’s Twemproxy: https://github.com/wayfair/twemproxy, a minor change, making it interoperable with the previous item.
  3. Revisions to php-pecl-memcached, removing a ‘version’ check.
  4. A Zookeeper script to nanny misbehaving cluster nodes. Here’s a gist to give the idea.

Twemproxy/Nutcracker has had Redis support from early on, but apparently Twitter does not run Twemproxy in front of Redis in production, as Yao Yue of Twitter’s cache team discusses here: https://www.youtube.com/watch?v=rP9EKvWt0zo. So we are not necessarily surprised that it didn’t ‘just work’ for us without a slight modification, and the addition of the Zookeeper component.

Along the way, we considered two other solutions for all or part of this problem space: McRouter and Redis Cluster. There’s not much to the McRouter decision. Facebook released McRouter last summer. Our core use cases were already covered by our evolving composite system, and it seemed like a lot of work to hack Redis support into it, so we didn’t adopt it. McRouter is an awesome piece of software, and in the abstract it is more full-featured than what we have. But since we’re already down the road of using Redis as a Twitter-style ‘data structures’ server, instead of something more special-purpose like Facebook’s TAO, which is the other thing that McRouter supports, it felt imprudent to go out on a limb of Redis/McRouter hacking. The other decision, the one where we decided not to use Redis Cluster, was more of a gut-feel thing at the time: we did not want to centralize responsibility for serious matters like shard location with the database. Those databases have a lot to think about already! We’ll certainly continue to keep an eye on that product as it matures.

There’s a sort of footnote to the alternative technologies analysis that’s worth mentioning. We followed the ‘Database 0’ discussion among @antirez and his acolytes with interest. Long story short: numbered databases will continue to exist in Redis, but they are not supported in either Redis Cluster or Twemproxy. That looks to us like the consensus of the relevant community. Like many people, we had started using the numbered databases as a quick and dirty set of namespaces quite some time ago, so we thought about hacking *that* into Twemproxy, but decided against it. And then of course we had to move all our data into Database 0, and get our namespace act together, which we did.
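
To give a flavor of what getting a namespace act together can mean, here is a minimal Python sketch of replacing numbered databases with key prefixes inside Database 0. Everything specific in it is invented for the example: the `NAMESPACES` table, the prefix names and the `namespaced_key` helper are assumptions, not our actual key scheme.

```python
# Hypothetical mapping from old numbered databases to key prefixes.
NAMESPACES = {0: "core", 1: "session", 2: "cart"}


def namespaced_key(db_number, key):
    """Translate (numbered database, key) into a single Database 0 key."""
    try:
        prefix = NAMESPACES[db_number]
    except KeyError:
        raise ValueError(f"no namespace assigned for database {db_number}") from None
    return f"{prefix}:{key}"


# A key that used to live in database 1 now lives in Database 0 under a prefix.
migrated = namespaced_key(1, "user:1234")
```

Once every key carries its namespace in its name, a proxy that only understands Database 0 has everything it needs, and two keys that used to be told apart only by database number can no longer collide.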

Mad props to the loosely confederated cast of characters that I call our distributed systems team. You won’t find them in the org chart at Wayfair, because having a centralized distributed systems team just feels wrong. They lurk in a seemingly random set of software and systems groups throughout Wayfair engineering. Special honors to Clayton and Andrii for relentlessly cutting wasteful pieces of code out of components where they didn’t belong, and replacing them with leaner structures in the right subsystem.

Even madder props to the same pair of engineers, for seamless handling of the operational aspects of transitions, as we hit various milestones along this road. Here are some graphs, from milestone game days. In the first one, we start using Twemproxy for data that was already in Database 0. We cut connections to Redis in half:


Then we take another big step down.


Add the two steps together, and we’re going from 8K connections to 219. Sorry for the past, network people, and thanks for your patience! We promise to be good citizens from now on.

Front end talks

Andrew Rota and Matt DeGennaro of Wayfair Engineering are giving a talk on a JavaScript framework we have written at Wayfair called ‘Tungsten,’ which shares goals and ideas with React.js but interoperates with Backbone, which we use heavily. It’s at the BostonJS meetup tonight at Bocoup, with a $5 cover: http://www.meetup.com/boston_JS/events/221038649/, on a double bill with Calvin Metcalf. It should be a great night.

Andrew has given a couple of talks at national conferences recently, on other front-end topics. First there was cssdevconf 2014, on web components (slides here: http://www.slideshare.net/andrewrota/web-components-and-modular-css), and more recently React.js Conf 2015, where he spoke about the interoperation of web components and React. Wow, that was a hot conference! Tickets could be had for only a few minutes before it was sold out. Fortunately the Facebook conference people are always really great about posting video, and here’s his talk on YouTube, with slides: http://www.slideshare.net/andrewrota/the-complementarity-of-reactjs-and-web-components.

Update: the slides from the presentation last Thursday night are up, here: http://www.slideshare.net/andrewrota/an-exploration-of-frameworks-and-why-we-built-our-own-46467292

PHP Static Analysis with HHVM and Hussar

Wayfair Engineering places special emphasis on software testing as a means of maintaining stability in production. The DevTools team, which I am a member of, has built and integrated a number of tools into our development and deploy process in order to catch errors as early as possible, especially before they land in master. If you missed it, last week we released sp-phpunit, a script to manage running PHPUnit tests in parallel.

Today we’re open-sourcing hussar, another tool we’ve been using as part of our deploy process, that performs PHP static analysis using HHVM. The name comes partly from the cavalry unit in Age of Empires II — a classic strategy game where a few of us on DevTools still fight to the end — but mainly from the fact that it’s a good, open name that shares a few letters with the tool’s main feature: HHVM static analysis.

Put simply, hussar builds and compares HHVM static analysis reports. It maintains a project workspace and shows errors introduced by applying patches or merging branches. With hussar, projects that cannot yet run on HHVM are able to realize the benefits of static analysis and catch potentially fatal errors prior to runtime. Here is a list of errors hussar can catch. The tool displays these errors in a formatted report. This means our engineers get the safety of strong typing and static code analysis in addition to the flexibility of development they’re accustomed to.
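
The compare step is the heart of the idea, and it can be sketched in a few lines of Python. This is not hussar’s code: the `Finding` record, its fields and the sample data are assumptions made for the example, and the real HHVM report format is richer. The sketch deliberately leaves line numbers out of the comparison, on the assumption that a patch which merely shifts existing code around should not make old errors look new.

```python
from collections import Counter
from typing import NamedTuple


class Finding(NamedTuple):
    """One static analysis error (hypothetical, simplified record)."""
    path: str
    kind: str
    message: str


def new_findings(baseline, patched):
    """Return findings in the patched report that the baseline lacks.

    Works as a multiset difference, so a second copy of an existing
    error in the same file still counts as newly introduced.
    """
    remaining = Counter(baseline)
    introduced = []
    for finding in patched:
        if remaining[finding] > 0:
            remaining[finding] -= 1
        else:
            introduced.append(finding)
    return introduced


# Made-up sample reports for the base branch and for base plus a patch.
baseline = [
    Finding("cart.php", "UnknownFunction", "checkout_total"),
    Finding("user.php", "TooManyArgument", "login"),
]
patched = baseline + [Finding("cart.php", "UnknownClass", "PriceFormatter")]

introduced = new_findings(baseline, patched)
```

Only the entries in `introduced` need to block a change, which is what lets a codebase with a long backlog of existing errors still benefit from the analysis.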

We wrote hussar as a preparatory step toward possibly running Wayfair on HHVM. When we first tried to use HHVM, we discovered that it lacked support for a number of features and extensions used throughout our codebase. Recognizing both the performance and code quality benefits it could provide, we hacked together a script that would get at least the code analysis component working. Over the past few months this script has gone through several iterations as we worked on edge cases and ironed out false positives to increase its accuracy and utility. The result is a tool that reliably reports legitimate errors.

We’re using hussar as part of our deploy process in addition to our integration and unit tests. Since we’ve started using the tool it has made us aware of a number of errors that slipped through our other tests. This multi-faceted approach to testing has allowed us to be more confident in deployments while keeping productivity high.

Our engineers are also able to run hussar against their code in advance of the deployment process, so ideally any errors are caught even before code review. We’re using a remotely triggered Jenkins job to coordinate running hussar builds on a dedicated testing cluster. HHVM is a bit heavy, so each machine has 6 GB of RAM, and reports are written to a shared filesystem to avoid repeating work. Run time is usually five minutes or less.
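
Writing reports to a shared filesystem to avoid repeated work amounts to a simple cache. A rough Python sketch of the pattern follows; the `cached_report` function, the idea of keying reports by commit hash, the JSON format and the stand-in `fake_build` are all assumptions for illustration, not a description of how hussar lays out its workspace.

```python
import json
import tempfile
from pathlib import Path


def cached_report(cache_dir, commit, build):
    """Return the report for a commit, building it only if not yet cached."""
    path = Path(cache_dir) / f"{commit}.json"
    if path.exists():
        return json.loads(path.read_text())
    report = build(commit)
    # Write to a temporary name first, then rename, so that another
    # machine never reads a half-written report.
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(report))
    tmp.replace(path)
    return report


build_calls = []


def fake_build(commit):
    """Stand-in for an expensive static analysis run."""
    build_calls.append(commit)
    return {"commit": commit, "errors": []}


with tempfile.TemporaryDirectory() as shared_dir:
    first = cached_report(shared_dir, "abc123", fake_build)
    second = cached_report(shared_dir, "abc123", fake_build)
```

The second call is served from disk, so `fake_build` runs only once. Any machine in the cluster that can see the same directory gets the same saving.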

We also generate a full report nightly and are working through the backlog of existing errors. Each resolved error improves our codebase and brings us one step closer to the possibility of running our sites on the HHVM platform.

We think hussar solves the “backsliding” problem likely faced by all project maintainers with large PHP codebases when considering a migration onto HHVM. That is, it’s usually impractical to address all the static analysis issues at once, since tech debt continues to accumulate as developers work through the backlog. This is solved by hussar’s focus on preventing new errors, which allows momentum to build in the efforts to clean the rest of the codebase. For us, the proof of this is that the number of errors found by static analysis across our codebase has been steadily declining since we started using hussar.

For more details on how hussar works and information on how you can start using it to analyze your own code, head over to the project’s GitHub page. We hope you find it useful, and welcome any contributions!