Wayfair Engineering FAQ, a conversation with the Chief Architect

Q1: What’s the tech stack in a nutshell?
A: PHP on Linux, and a few data backends, with continuous deployment at the pace of ~250 zero-downtime code pushes a day.

Q2: Whoah, that’s a lot of code pushes! Break that down for me.
A: What I’m actually counting there is git changesets that are going to some type of production system. We group them in batches, so merges and testing don’t get too hairy. The rumors that we’re so cowboy we only test in production are an exaggeration.

Q3: What’s the main idea of Wayfair Engineering?
A: Stay fast while growing to enormous size.

Q4: That sounds cool, but how do you measure ‘enormous size’?
A: Anything you can think to measure: revenue, active customers, daily visitors, site traffic, terabytes of data, engineering headcount, warehouse square footage, linear miles traveled by the packages we deliver, square miles covered by our proprietary transportation network, number of products we sell. The numbers are out there every quarter in our reports, and in independent coverage: however you count, we’re huge. I’d give you the actual numbers today, but I want to be able to leave this page up for a while, and at the rate we’re growing, snapshot numbers get stale fast.

Q5: How do you measure ‘fast’?
A: Two things. First and foremost, speed to market for new business ideas and features. We also look at performance metrics: web page load times, lead times for delivery, etc., etc. We measure everything.

Q6: How’s the relationship between tech and the business?
A: It’s very good, and it’s one of mutual respect and cooperation. It’s a founder-led business. Steve Conine (tech) and Niraj Shah (business), co-founders/owners, provide a good model of that kind of relationship for the rest of the company, and we take our cue from them. Steve is a very business-minded tech entrepreneur who’s also a phenomenal programmer, and Niraj is an unusually tech-savvy business person. They were both engineering majors at Cornell. There’s less of a divide there in the first place than at many companies. The core of Wayfair is an innovative, completely custom e-commerce platform, and management has consistently described that to the outside world as being a big part of the equity value of the company. On the tech side, let me ask people out there: how valuable would it be to you, to have a deep reserve of confidence that you’re working at a well managed business, where your engineering efforts aren’t going to be wasted on some improperly vetted hunch that will send the balance sheet into the red for no good reason? Combine that analytical sense and good judgment with Wayfair’s characteristic aggressive business innovation, and we all feel pretty good about working together. The home goods niche of the economy isn’t going to go on line all by itself: we’re going to need to make it easy for people, and that’s going to come from a strenuous, combined effort by all parts of Wayfair.

Q7: The home goods niche? You’re taking credit for a whole sector of the retail economy going on line?
A: Of course it’s not *all* us, but we do account for a hefty percentage of every dollar that goes on line for the purchase of home goods.

Q8: ‘Innovative, completely custom e-commerce platform,’ you say? Do you want to elaborate?
A: It’s not that we never buy third-party software, but we have a strong bias towards ‘build’ in build-vs-buy discussions, and that has only become more pronounced over time. At the core, there’s no third-party platform like Demandware or Magento, just an evolving set of data models and architectural principles, a lot of code, and some great components that our developers know what to do with. We’re very patient with the early stages of DIY efforts that aren’t necessarily up to industry standards at first, if we think we can gain a sustainable advantage over time. Most recently, we’ve insourced some big parts of our marketing tech stack, for which we formerly used outside vendors or commercial software. It’s satisfying when we can leave behind vendor-based point solutions to individual problems, and stand up a new part of the living, breathing, integrated whole, which allows us to take advantage of everything our platform has to offer.

Q9: What languages do you write code in?
A: PHP and JavaScript are the bread and butter, including our opensource Tungsten.js framework. We have also written important things in Python and C#, and some key components in Java. Objective-C for iOS mobile apps, Java for Android, and some emerging language platforms for VR and AR. We use Puppet for configuration management, so there’s a certain amount of Ruby hacking as well, and a lot of systems scripting with Python. Once in a while we write some C or C++, for optimized numerical work, PHP extensions, and opensource infrastructure like Twitter’s Twemproxy (patches) and Statsdcc (from scratch, inspired by things in Node.js and other languages).

Q10: So you opensource code. Where can I find that?
A: Yes, we do that all the time, most of it on https://github.com/wayfair. Check it out!

Q11: VR and AR?
A: Virtual and augmented reality. We’ve got a lot going on in that space, particularly in a small department we call Wayfair Next that Steve Conine is leading. Right now the biggest push is to model the catalogue. From this we get excellent 2D imagery for the site, and next-gen experiences on things like the Google Tango AR devices that are becoming available to the general public in September. If you have a dev kit, check out the ‘WayfairView’ app. Big picture: we want to make it easier and easier to buy, say, an easy chair from your couch. VR/AR is going to be a big part of that.

Q12: What are your data platforms?
A: We’re proud of how far we got as a business, from our founding in 2002 until 2010, on a keep-it-simple-stupid or KISS architecture of relational-database-backed web scripting. SQL Server was and still is our core for OLTP, and it allows us to plug new tools into our integrated operational and analytical infrastructure very quickly. But to drive innovative customer experiences, we now rely on Solr-backed search and browse, and Redis/Memcache for fast access to complex data structures and ordinary caching. We have a modern, on-premise big data infrastructure consisting of Hadoop and Vertica clusters, and some specialized, vertically-scaled big-memory and GPU machines, for analytical workloads. We do our machine learning, statistical analysis and other types of computation on that setup, and funnel the results to the ‘Storefront,’ as we call it, and to the operational business systems. RabbitMQ and Kafka provide a kind of circulatory system for the data, and they are gradually replacing what traditional ETL we have. As I speak with other architects and CTOs around the industry, I actually think it’s pretty rare, at the biggest and most successful companies, to junk your relational databases, even when you’re many years into adopting these next-generation auto-sharding platforms. We’re fine with that.

Q13: OK, so with all these relational databases, do you use an ORM?
A: There’s a joke around Wayfair Engineering that if you use the word ‘ORM’ in a positive way, you might notice a sudden drop in your career prospects. Joking aside, I do think excessive reliance on ORMs tends to foster careless data access code, and excessive round trips to the back end. We mostly use the ‘phrasebook’ pattern and hand-make our data access layer. Besides, it’s not as if an easily generated mock would really help you: by the time you’ve horizontally partitioned your data to the extent we have, ORMs are close to useless. We try to make it easy for everybody to develop against a readily accessible development database infrastructure. On the other hand, there is actually a bit of Hibernate in the Java, SQLAlchemy in the Python, and both Entity Framework and NHibernate in the C#. ORMs on language platforms like those can have some engineering benefits in addition to the convenience features, such as connection pooling, caching of various kinds, etc. None of that works, or at least works well, in PHP, so we just use PDO like the rest of the PHP world, and we’re experimenting with SQL Relay for some other kinds of optimization and encapsulation of the details of how we talk to the databases. At the higher levels, we have some pretty handy traits (PHP’s version of multiple inheritance) to inject common functionality into our codebase in a DRY way. No fanaticism, one way or the other.
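To make the ‘phrasebook’ idea concrete, here’s a minimal sketch, in Python rather than PHP for brevity; the phrase names and schema are invented for illustration, not taken from our codebase:

```python
import sqlite3

# The "phrasebook": every SQL statement lives in one named, hand-written,
# reviewable place, instead of being generated by an ORM at call time.
# The phrase names and schema here are hypothetical.
PHRASEBOOK = {
    "product_by_id": "SELECT id, name FROM product WHERE id = ?",
    "products_by_class": "SELECT id, name FROM product WHERE class_id = ?",
}

def query(conn, phrase, params=()):
    # Only statements that exist in the phrasebook can run; there is no
    # entry point for ad-hoc, string-built SQL.
    return conn.execute(PHRASEBOOK[phrase], params).fetchall()

# Demo against an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (id INTEGER, name TEXT, class_id INTEGER)")
conn.execute("INSERT INTO product VALUES (1, 'easy chair', 7)")
print(query(conn, "product_by_id", (1,)))  # [(1, 'easy chair')]
```

The point is the shape: queries are named, hand-written, and reviewable in one place, and the call sites stay free of string-built SQL.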

Q14: What are your thoughts on web services?
A: We have a handful of important web services behind the scenes at Wayfair. Search and browse for products, orders, and some other things, are powered by Solr, which is an opensource, Java-based web service that we have patched a few times for our own needs. Our Python-based customer recommendations and search enhancements, and our C#-based inventory service, deliver a lot of value. There are other examples.

Q15: Do you have any other kinds of services?
A: Good question. Some of the highest-value systems we have are data processing services that ingest data from our messaging platforms (Rabbit, Kafka) and push value-added results to where we can use them to move faster on behalf of customers and suppliers. There are some DIY ones, but most live in the frameworks Celery, Storm or Spark. You could call our caching system a service. It’s a composite Redis/Memcache/consistent-hashing thing with smart proxies everywhere. You’re using regular Redis and Memcache commands, not going through an adapter layer, but that’s true of Elasticache too, so we’re far from eccentric in this way. Unlike in Elasticache, the sharding is taken care of for you. We built it on the back of work by Twitter, Pinterest and Instagram, but we added some innovative elements of our own. It has some similarities with Facebook’s McRouter, which is pretty awesome, and which we might well have chosen instead, if it had Redis support.

Q16: What about micro-services, or SOA?
A: We’re not really into all that, although as I said, we have some pretty awesome services back there. Is your code base, or a big part of it, really a monolith, in any pejorative sense, if it has decent separation of concerns, and you can deploy small modifications to any layer of it without rebuilding the whole thing, and without down time? We’ve had all of that for years. Many of the best big tech companies have largely monolithic code bases, and they’re too busy adding awesome features to the core to want to replatform. But don’t get me wrong: there are some cool micro-service set-ups out there. If we keep developing very valuable *macro* services at the rate we’ve been doing that, we’ll eventually have so many of them that micro-service-style orchestration techniques will start to make sense for us. Our Python services are the most numerous ones we have, and we’re already experimenting with Docker, Mesos and Kubernetes, for them. It’s just that over time I have seen the importance of web services diminishing, as data platforms become easier to scale horizontally, and server-side-of-the-front-end programming becomes easier and more powerful. The data is just too readily available for these layer cakes of http indirection to make any sense in a well-designed, modern setup.

Q17: Why do you like PHP?
A: I’m not sure I *do* like it, but it attracts the right kind of people: neither ivory-tower language snobs, nor hipster code posers. No fanatics, but no luddites either. We have some fun with all that, when we’re trying to make sure the culture stays strong. I tried to depict both sides of the ivory-tower/hipster thing in this picture a few years ago, in a comic-strip-style blog post on our Python ops: I think the tweed jacket combined with the Brooklyn t-shirt really gets the point across. (To MIT professors, and to my former neighbors in Park Slope: I kid because I love!)
[Image: eye-rolling sysadmin]
With every other language, there are a lot of fanatics who think it’s the answer to every problem, and will wear your ears out explaining why. Even people who love PHP don’t think that about PHP. It’s just a solid platform for web development, the kind the tattooed web ops expert in the picture would think is a fine thing to have running on his servers. There’s no server lifecycle management to worry about, and practical problem-solvers gravitate to it. It’s also very readable, even if you don’t know it well, so let’s all just pause to give it the big thank you it deserves for killing Perl (with a substantial assist from Python, of course, on the systems scripting side). That needed to happen, and in retrospect it’s obvious that neither Java nor .NET had the slightest chance to do it. Something like 80% of the websites whose server-side language is known run on PHP, including a bunch of the biggest sites, which we’re rapidly becoming one of. It’ll do.

Q18: So PHP is a cultural thing?
A: Yes. Let me draw an analogy, which I sometimes use in talks for new hires. Do you remember the scene in Star Wars, when Luke Skywalker sees the Millennium Falcon for the first time, and says “What a piece of junk!”? Han Solo responds, “She’ll make point five past lightspeed. She may not look like much, but she’s got it where it counts, kid. I’ve made a lot of special modifications myself.” H/t @danmil for that analogy. Our PHP-and-friends stack might seem inelegant to language snobs, but it takes us where we want to go fast. The Millennium Falcon is still a space ship, after all! Let’s try a ‘car’ analogy: anyone who has ever done the coding equivalent of putting a Porsche engine in a VW Golf, or can show us the chops and attitude for that and wants to try, is welcome at Wayfair Engineering. Adding lambdas to the opensource php_mustache extension, which we did, is a great example of something that fits that mold. If you’re more of an “I won’t drive at all unless I have a Lamborghini” person, you should seek a company more willing to splurge on the shiniest tools, before thinking about whether there is really a need. If your mindset is more “My tank-like SUV keeps me safe, and I don’t care that it handles poorly,” there are plenty of J2EE shops out there.

Q19: OK, I’m getting the picture. Why use the other languages at all?
A: PHP is not a great fit for every programming task. Sometimes you need a long-running daemon that can respond to requests with little startup/wake-up overhead. We have excellent services in Python, C# and Java for that. The C# code grew out of our early-stage Microsoft heritage, but we are now doing some phenomenal things with it, and we have added some very elegant functional programming in F#. Python is our favorite language for data science, machine learning and the like, and it combines low-latency service qualities, the way we run it, with the convenience and productivity of loose typing, and that super-handy mix of the functional and object-oriented styles. Java allows us to tap directly into platform-level infrastructure such as Solr, Elasticsearch, Hadoop, Kafka and Storm.

Q20: You’ve been talking about speed, and you mentioned earlier that you measure web performance. Can you give a bit more detail on that?
A: Sure. Web performance measurement is basically a 3-legged stool: RUM, or real user monitoring; synthetic monitoring, which is externally-located bots that measure page speed; and server-side execution metrics. We have a centralized performance team that makes sure we have the right tools and dashboards to be proactive about all of that. They also work on framework-level changes that can make a big impact, when those aren’t naturally more specialized with another group. They play a strategic role for us, but that team wouldn’t be very effective if we didn’t have a good culture of thinking about web performance in a broader context of putting a great user experience into the hands, and onto all the devices, of our customers. The RUM instrumentation gives us great insight into what our customers are actually experiencing. The name isn’t original with me, but my joke name for that department is the RUM distillery, and you can imagine the joking about operating precise instruments in the right state of mind. We have some cool ‘responsive’ experiences here and there, but the RUM tells us that our decision to emphasize adaptive delivery over responsive design was a good one. Check out Wayfair mobile web, on a small iOS or Android device, and you’ll see what I mean. Our native apps are fast too, but that’s a separate discipline, where server-side execution and expert Java and Objective-C programming are the key components.

Q21: Thoughts on the cloud?
A: We run a few elastic workloads on public cloud infrastructure, but that’s a drop in the ocean of Wayfair tech. Don’t get me wrong: if we were starting Wayfair today, we would do it on public cloud infrastructure, for the speed-to-market aspect, for sure. In fact, Wayfair was very briefly a Yahoo! Store in 2002, before Steve built the first version of the platform to run in a colocated cage in a data center. We run colo-style to this day. Wayfair was already a multi-hundred-million-dollar company before the cloud was a thing. We think about it, and do some analysis and experiments periodically. But ultimately our traffic is not extremely spiky, and we grow into the holiday spike provisioning pretty early the following year. The economics, control and convenience have not yet aligned to make it worthwhile to go through a big process of switching. We’re not big enough to have whole data centers, at least not yet, but we have our /22 ARIN range, and we use the Border Gateway Protocol to make sure we have the kind of relationship with our ISPs where we have a lot more control than when we were smaller. Wrestling with these types of configurations is interesting work, and it attracts really good network and systems people. Let’s face it: the public cloud is awesome, but when the problem is under the hood of the hypervisor, you’re in for a frustrating day at the office. We do a lot of virtualization, and we like it, but when various types of systems become very cookie-cutter or have certain types of requirements, we run physical boxes. Virtualization adds overhead, and it’s one more thing that can break. If you can provision basic types ahead of demand, the IaaS side of the cloud becomes just another provider, and of course the higher-level services are fraught with problems of vendor lock-in. The way cloud adoption presents itself to new or small companies, it’s kind of ironic that we’re moving too fast to be bothered with moving to the cloud. But never say never.

Q22: OK, sign me up. How do you succeed in Wayfair Engineering?
A: It’s hard to answer that question without using some cliches, but I’ll try to use the ones that are characteristic and relevant. Programmers with the polyglot, DevOps-savvy innovator background tend to do really well here. Boy Scout principle for refactoring, rather than a penchant for from-scratch rewrites. Bias for action: if you’re not embarrassed by the first version, you waited too long to ship it. Just ship! If you find yourself tempted by a months-long science project, don’t do it. Instead, fast-follow/adopt something that’s already here in the general area (whether we wrote it or it’s open source from outside), and innovate at the margins for now. But when you see a quick win that you think is on a path to a real breakthrough, pounce.

Boston events during the week of 25 July 2016, on augmented/virtual reality, data science and PHP

There are three events in Boston this week where Wayfair engineers will be speaking.

On Monday, July 25th, I am on a panel called “What’s Hot in Tech: Providing Next Gen Experiences,” which is part of the MITX E-Commerce Summit 2016, being held all day, from 8:30 to 5, at Google in Cambridge. Details here: http://mitxecommerce.org. Fair warning: paid admission. I’ll be speaking about Wayfair Next and our augmented and virtual reality applications. The panel is from 1:45 to 2:30 in the afternoon (details here: http://mitxecommerce.org/session/tech/), but Shrenik Sadalgi of Wayfair Next and I will be there all day, demo-ing next-gen technology, including our Tango app WayfairView, which is already in the Tango app store in Google Play, and will be available to the general public when the Lenovo Tango devices hit the market in September.

On Tuesday, July 26th, at 7:30 pm, Robby Grodin of our Marketing Engineering group is speaking on data science at General Assembly in Boston: https://generalassemb.ly/education/data-science-lets-break-it-down/boston/27296.

On Wednesday, July 27th, it’s open mike night at the Boston PHP Meetup, hosted at Wayfair. Pizza at 6, talks from 7-8:30, Q&A, wrap-up, out for beers? after that. Adam Baratz, George Carrette and I will get the ball rolling with some material about PHP 7 and a couple of other things. If you’ve been wondering why we haven’t been blogging about PHP 7, it’s because we’re just rolling it out now, after wrestling with some interesting APC issues. Details on Wednesday.

Statsdcc

Statsdcc is a Statsd-compatible, high-performance, multi-threaded network daemon written in C++. It aggregates stats and sends the results to backends, Graphite in particular. We are proud to announce that we are opensourcing it today. Check out the code at https://github.com/wayfair/statsdcc.

At Wayfair we’re big believers in “measure anything, measure everything,” as the “Statsd is reborn in node.js at Etsy” announcement put it. We do application performance monitoring with the opensource tools Graphite, Grafana, the ELK stack (Elasticsearch/Logstash/Kibana), and some homegrown tools. Until recently we had been using Flickr/Etsy’s 2nd-generation, node.js-based Statsd to collect metrics for Graphite. As the volume of these metrics increased, we noticed inconsistencies in the data, and realized that some metrics were being dropped. Long story short, we tried some architectural changes, scaling Statsd and Carbon horizontally (details below), but as the operational complexity of that increased, we began to wonder why we needed so many boxes. We found a bottleneck in the way Statsd buffers and flushes data to Carbon, and we decided we needed a different version.

Alternatives:

There are already quite a few alternative Statsd implementations available, but none of them really came close to meeting all of our needs. Brubeck by GitHub is one that we found interesting, because it promised high throughput. Unfortunately, it was released after we had Statsdcc implemented and were ready to put it into production. At that point, we had no reason to take Brubeck and extend it to support the features we needed. However, we borrowed the idea of integrating a webserver to view application health from Brubeck. Statsdcc and Brubeck try to solve similar problems, and I would recommend checking out all these implementations and picking the one that best fits your needs.

TL;DR

If you’re interested in what we tried before starting to hack the C++, read on.

Attempts at Horizontal Scaling with Statsd:

Statsd performs aggregations on incoming metrics and sends the aggregates to a Carbon process, which in turn saves received metrics to a Whisper database.

To scale, we use multiple Statsd/Carbon chains. Each chain goes to a different disk. Proxy daemons hash metric names to determine which chain to use. Which proxy daemon is chosen depends on round-robin DNS. Consistent hashing ensures metric names are well balanced.
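A minimal Python sketch of the consistent-hashing piece, with made-up chain names, and MD5 standing in for whatever hash function a given proxy actually uses:

```python
import hashlib
from bisect import bisect

class HashRing:
    """Maps metric names to Statsd/Carbon chains. With virtual nodes the
    keys spread evenly, and adding a chain only remaps ~1/n of them."""

    def __init__(self, chains, vnodes=128):
        self.points = []  # sorted (hash point, chain) pairs forming the ring
        for chain in chains:
            for i in range(vnodes):
                self.points.append((self._hash(f"{chain}#{i}"), chain))
        self.points.sort()

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def chain_for(self, metric):
        # Walk clockwise to the first point at or past the metric's hash.
        keys = [p for p, _ in self.points]
        i = bisect(keys, self._hash(metric)) % len(self.points)
        return self.points[i][1]

ring = HashRing(["chain-a", "chain-b", "chain-c"])
# The same metric name always lands on the same chain, so all samples
# for one metric end up in one Whisper file on one disk.
assert ring.chain_for("web.pageload.ms") == ring.chain_for("web.pageload.ms")
```

Because a metric name always hashes to the same chain, any proxy (whichever one round-robin DNS picks) routes it to the same place.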

The diagram below depicts the architecture.

[Image: Statsd architecture diagram]

Issues:

A year ago we noticed that a certain set of metrics were being dropped, resulting in inconsistent monitoring data. We realized that this was due to a maxing out of UDP receive buffers on Statsd. So we tried adding more Statsd processes with increased UDP buffer sizes.

However, adding a new process is complicated. When a new Statsd instance is added, consistent hashing by a reverse proxy will re-route some metrics to the new process, resulting in duplicate files on different Carbon nodes for the same metric – one for the old data and one for the new data. To save space, and for Graphite to show all data, the old Whisper data files should be merged into the new ones.

In the end we were unhappy with how much traffic an individual node could handle. We discovered that the problem was a design decision in Statsd, where the same thread is responsible for both buffering incoming metrics and performing aggregations on them at every flush interval. When computing aggregations, the thread stops listening for incoming metrics, which are stored in the UDP buffer. As the rate of metrics increases, the UDP buffer overflows and drops metrics. We use single-threaded, event-looping frameworks in a few places (Node.js-based daemons for a couple of things, Python-based gunicorn+gevent for several), and we have seen this type of problem before. The event loops don’t help you when you have a blocking IO operation that can bring processing to a halt. Sometimes we work around or solve such problems within the event-loop paradigm, and sometimes we take a completely different approach.

After finding the actual root cause, we decided to rewrite Statsd as a multi-threaded application with a focus on effective use of socket-IO and CPU cycles.

Statsdcc:

Statsdcc is an alternative implementation of Statsd, written in C++ for high performance. In Statsdcc, one or more server threads actively listen for incoming metrics. Server threads distribute incoming metrics among multiple workers using the formula worker = hash(metric name) % #workers. Worker threads read from their dedicated queues and update their ledgers until signaled to flush by a clock thread. Upon receiving this signal, the worker threads hand off their ledgers to short-lived flush threads, and continue with new ledgers until the next signal. To avoid lock contention and to pass metrics faster between server and worker threads, Boost’s lock-free queues are used.
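Here is that server-to-worker handoff reduced to a few lines of Python (the real thing is multi-threaded C++; this sketch handles only counters, and the metric names are invented):

```python
from zlib import crc32
from collections import defaultdict

NUM_WORKERS = 4

def worker_for(name):
    # worker = hash(metric name) % #workers: every sample of a given metric
    # goes to the same worker, so each ledger is touched by one thread only.
    return crc32(name.encode()) % NUM_WORKERS

# One ledger per worker; on the clock signal, a ledger is handed off
# wholesale to a flush thread and the worker starts a fresh one.
ledgers = [defaultdict(float) for _ in range(NUM_WORKERS)]

def handle(line):
    # Minimal parse of the Statsd counter wire format, e.g. "checkout.orders:1|c"
    name, rest = line.split(":", 1)
    value, _kind = rest.split("|", 1)
    ledgers[worker_for(name)][name] += float(value)

for pkt in ["checkout.orders:1|c", "checkout.orders:2|c", "search.queries:5|c"]:
    handle(pkt)

print(ledgers[worker_for("checkout.orders")]["checkout.orders"])  # 3.0
```

Pinning a metric to one worker is what lets the aggregation state go lock-free: no two threads ever race on the same counter.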

[Image: Statsdcc thread model diagram]

We have not gotten rid of consistent hashing, as we did not want to lose the ability to scale horizontally. However, to solve the scaling problem in our previous architecture, where adding a new process required cleanup on the Carbon end, we moved consistent hashing from the proxies to the aggregators. The proxies distribute the incoming metrics among multiple aggregators using the formula aggregator = hash(metric name) % #aggregators. Each aggregator then sends the metric aggregation to its respective Carbon process by using the consistent hash. The difference from the previous architecture is that each Carbon process has more TCP connections open, one with each aggregator. However, unlike Statsd, instead of reopening the connection on each flush, Statsdcc reuses established TCP connections, thereby avoiding the overhead of a TCP handshake. The diagram below describes the current architecture.

[Image: Statsdcc architecture diagram]

Statsdcc can handle up to 10 times more load (up to 400,000 metrics/sec) than Etsy’s Statsd. Only one instance of the Statsdcc aggregator handles all our production traffic, in contrast to the previous 12 Statsd instances. Statsdcc has been used in production for about 7 months. We hope more people will find Statsdcc as useful as we have at Wayfair.

Tungsten in the news

There’s a great interview with our own Matt DeGennaro by Paul Krill of InfoWorld that came out a few days ago. The topic is Tungsten.js, our awesome framework that ‘lights up’ the DOM with fast, virtual-DOM-based updates, React-style, and can be integrated with Backbone.js and pretty much whatever other framework you want. It’s spiffy, it has a logo,
[Image: Tungsten.js logo]
we do it github-first, and we’re getting a lot of mileage out of it at Wayfair. Matt mentions the templating aspect of our composite system: we use server-side PHP, including Mustache templates, and then our client-side pages, also including Mustache templates as needed, get dynamic updates via Tungsten.js. That works great for us because Mustache has implementations in both Javascript and PHP, among many other languages.

What’s that you say? The PHP implementation of Mustache is not fast enough for you? Well, we’ve got you covered! Adam Baratz just put up a blog post yesterday on a server-side optimization that’s been working well for us. We use John Boehr’s excellent PHP mustache extension, which is written in C++, and is much faster than vanilla PHP Mustache. Inspired by another snippet of PHP/Mustache code, we’ve even added lambdas to that, as Adam explains. I had to do a double-take the first time he explained that to me. As far as I can tell, the PHP community, of all groups of web programmers, is the least likely to care about lambdas in particular, and any kind of functional programming in general. And yet, we’re finding lambdas very useful for our globalization efforts, and we’re starting to use them for other things as well.

We’re still working on a date, but Adam, Matt and Andrew Rota will be giving a talk on all of this at the Boston Web Performance Meetup, hosted at Wayfair, in the near future.

Wayfair Labs in the news

Scott Kirsner has a terrific piece about tech talent wars in Boston, that was in Beta Boston on Friday, and then in the print edition of the Boston Globe on Sunday, October 26th. It features Wayfair Labs, which is our hiring and onboarding program for level 1 engineers in most of the department (a few specialized roles excepted). I am the director of it, so if you have any questions, please reach out.


Rendering Mustache templates with PHP

For the past couple of years, Wayfair’s front-end stack has relied heavily on Mustache templates. They’ve let our growing front-end team focus on the front-end, and they allow us to share more code between server and client as we push towards a Tungsten-powered future.

Anyone who’s seen a Mustache template knows that they’re pretty simple to write. Rendering them can be another story. We began with a pure PHP implementation. This got us off the ground, but as we expanded our use of templates, we ran into the unfortunate truth that such a library could never be faster than the pure PHP pages we were hoping to replace. Still, we wanted to make it work, for the organizational and architectural reasons mentioned above.

To understand why rendering Mustache in PHP is slow, first you have to understand what goes into rendering Mustache. Consider a canonical template/data example:

Hello {{name}}
You have just won {{value}} dollars!
{{#in_ca}}
Well, {{taxed_value}} dollars, after taxes.
{{/in_ca}}

{
  "name": "Chris",
  "value": 10000,
  "taxed_value": 10000 - (10000 * 0.4),
  "in_ca": true
}

This template must be tokenized, then each token must be rendered. Mustache is simple enough that tokenizing mostly amounts to splitting on curlies. Not a big deal for PHP. The real trick is in replacing {{name}} with the correct content. This is fine when you have a flat set of key/value pairs. Consider this example:

{{inner}}
{{#outer}}{{inner}}{{/outer}}

{
  "inner": 1,
  "outer": {
    "inner": 2
  }
}

The output should be “12”. When rendering the {{#outer}} section, the renderer must know which “inner” to display. This is typically implemented by turning the data hash into a stack. When entering/exiting sections, data is pushed/popped. To get a value, start at the top and descend until you find a match.
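A minimal Python sketch of that context stack, with the push/pop driven by hand where a real renderer would do it while walking the tokens:

```python
class ContextStack:
    def __init__(self, root):
        self.frames = [root]

    def push(self, frame):   # entering a {{#section}}
        self.frames.append(frame)

    def pop(self):           # leaving a {{/section}}
        self.frames.pop()

    def get(self, name):
        # Start at the top of the stack and descend until a frame has the key.
        for frame in reversed(self.frames):
            if isinstance(frame, dict) and name in frame:
                return frame[name]
        return ""  # Mustache renders a missing key as nothing

data = {"inner": 1, "outer": {"inner": 2}}
stack = ContextStack(data)
out = str(stack.get("inner"))      # "1": found in the root frame
stack.push(stack.get("outer"))     # entering {{#outer}} pushes its hash
out += str(stack.get("inner"))     # "2": the section frame shadows the root
stack.pop()                        # leaving {{/outer}}
print(out)  # 12
```

Every `{{name}}` lookup pays for a linear walk of the stack, which is the cost that adds up in a deeply nested template.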

This is an easy operation to describe, but it makes for some slow PHP. It was the major performance bottleneck with the Mustache implementation we first used, and it’s an issue with another popular implementation.

Bearing this in mind, we sought a more radical solution. Enter php-mustache, a C++ implementation of Mustache as a PHP extension. C++ is much better than PHP at traversing stacks. Witness this before/after from when we first deployed php-mustache:

[Chart: product-grid render time before and after deploying php-mustache]

This chart shows the render time for the product grid on our browse pages (for example, Beds). It’s a complex mustache template with a lot of data and a lot of nesting. The X-axis is clock time, the Y-axis is render time in milliseconds.

This kind of lift allowed us to justify making Mustache a standard instead of an occasional tool. And, courtesy of the open source world, we didn’t even have to write it. However, it became something of a double-edged sword. As Wayfair operates stores in multiple countries, we have to localize a lot of strings. We started handling this by building all of them in PHP and loading them into the template. This led to some thick code in some cases, which occasionally created friction around using Mustache. The typical i18n solution for Mustache involves lambdas, which unfortunately were not implemented in php-mustache… until now! If you’re a performance-minded Mustache user, we hope you’ll check it out.
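A Mustache lambda is just a function placed in the view data: the renderer hands it the section’s raw text plus a render callback, and uses whatever the function returns. A minimal sketch of the i18n pattern follows; the translation table and the `i18n` key are invented for illustration, and the renderer is simulated by hand rather than being a real Mustache engine:

```javascript
// Sketch of the Mustache lambda pattern for i18n. A real renderer would call
// the lambda with the section's unrendered text and a render() callback when
// it encounters {{#i18n}}...{{/i18n}}; here we invoke it manually.
const translations = { 'Add to cart': 'In den Warenkorb' }; // illustrative table

const view = {
  i18n: function () {
    return function (text, render) {
      const rendered = render(text);             // expand any inner tags first
      return translations[rendered] || rendered; // then look up a translation
    };
  }
};

// Simulate how a renderer would handle {{#i18n}}Add to cart{{/i18n}}:
const identityRender = (s) => s; // no inner tags in this example
const result = view.i18n()('Add to cart', identityRender);
console.log(result); // "In den Warenkorb"
```

With lambda support in the extension, the translation lookup can live in the template layer instead of being pre-built into PHP strings.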

PDO and MSSQL

When you write your first web application, chances are you’re going to query a database. When you write it in PHP, chances are it’ll look like this:

$mysqli = new mysqli("example.com", "user", "password", "database");
$result = $mysqli->query("SELECT * FROM product");
$row = $result->fetch_assoc();

Before long, you have to start handling user input, which means escaping:

$mysqli = new mysqli("example.com", "user", "password", "database");
$result = $mysqli->query("SELECT * FROM product WHERE name = '" . $mysqli->real_escape_string($product_name) . "'");
$row = $result->fetch_assoc();

As your application grows, you start writing code like this a lot. You may start encapsulating it in DAOs, but they do little besides erect walls around this chimeric code. “Okay,” you say. “This is fine, because it’s only me. I’m a Responsible Engineer and I don’t have to sugar-coat things for myself.” But soon the project is going gangbusters. You’ve got a team, and then a large one, and now there’s no rug large enough under which to hide this mess. And woe unto you should you decide you need connection pooling or any other resource management.

One solution to this problem is an ORM. But, some people prefer having their database interactions more “managed” than “abstracted away.” Instead your code could look more like this:

$pdo = new PDO("mysql:host=example.com;dbname=database", "user", "password");
$statement = $pdo->prepare("SELECT * FROM product WHERE name = :name");
$statement->bindParam(":name", $product_name);
$statement->execute();
$row = $statement->fetch(PDO::FETCH_ASSOC);

A little more verbose, yes, but also easier to read and less error-prone. This is PDO. It’s a PHP extension that provides a vendor-agnostic interface to various relational databases. It pairs a well-structured API for performing queries with a series of different database drivers.

When Wayfair began adopting PDO, our database access was relatively managed: an in-house library handled connections over the course of a request, but building queries involved a whole lot of string concatenation, and complex queries would get unwieldy. Engineers with prior PDO experience wanted to know why we weren’t using it. However, to convince engineers new to PDO that it would make their lives easier, it had to be as low-friction as the existing library and produce output in the same format.

Simplifying PDO syntax was the easy part. Strictly speaking, the example above is shy on error handling: the PDO constructor can throw exceptions, and the related methods signal failure through their return values. So a “correct” PDO example would look like this:

$pdo = null;
try {
  $pdo = new PDO("mysql:host=example.com;dbname=database", "user", "password");
} catch (Exception $e) {
  // logging, etc., if you want to note when you were unable to get a connection
}

$statement = false;  // PDO::prepare() will return false if there’s an error
if ($pdo) {
  $statement = $pdo->prepare("SELECT * FROM product WHERE name = :name");
}

$row = null;
if ($statement) {
  $statement->bindParam(":name", $product_name);

  if ($statement->execute()) {
    $row = $statement->fetch(PDO::FETCH_ASSOC);
  }
}

// now, do something with $row

Awesome, I know, right? Sure, the PDO API is, on the whole, “nicer,” but no one’s going to want to deal with it if they’re forced to jump through these kinds of hoops. And who could blame them? At Wayfair, we place a lot of value on developer ergonomics, and these are problems we strive to solve well when rolling out new internal tools. We landed on a slight extension to PDO that would yield this syntax:

$statement = PDO::new_statement("PT", "SELECT * FROM product WHERE name = :name"); // the first argument refers to the desired host/database
$statement->bindParam(":name", $product_name);
$statement->execute();
$row = $statement->fetch(); // PDO::FETCH_ASSOC is now the default fetch style

We pulled all the boilerplate into a factory function. It does the necessary error handling and reporting. If everything succeeds, it’ll return a standard-issue PDO statement object. If there are errors, it will return a special object which acts like a statement that’s failed, but will return empty result sets if asked. We felt comfortable that this would remove most of the friction around using PDO while preserving the underlying interface. Anyone who wants finer-grained control can still utilize the stock API.
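The “special object which acts like a statement that’s failed” is a classic null-object pattern. Our actual helper is PHP inside the extension, but the shape of the idea can be sketched in JavaScript; every name below (`NullStatement`, `newStatement`, the fake connection) is invented for illustration:

```javascript
// Null-object sketch of the factory idea: callers always get back something
// that behaves like a statement, so they never need per-call error handling.
class NullStatement {
  bindParam() { return false; } // accept calls quietly, report failure
  execute() { return false; }
  fetch() { return null; }      // empty result
  fetchAll() { return []; }     // empty result set
}

// Hypothetical factory: try to prepare a real statement; on any failure,
// log and hand back the safe stand-in instead of throwing at the caller.
function newStatement(connect, sql) {
  try {
    const conn = connect();
    return conn.prepare(sql);
  } catch (err) {
    // real code would log err here
    return new NullStatement();
  }
}

const stmt = newStatement(
  () => { throw new Error('no connection'); }, // simulate a down database
  'SELECT * FROM product WHERE name = :name'
);
stmt.bindParam(':name', 'lamp'); // no crash
stmt.execute();
console.log(stmt.fetchAll()); // []
```

The payoff is that the happy path and the failure path look identical at the call site, which is what made the thin wrapper low-friction enough to win over skeptics.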

The trickier problem was “make output the same.” While PDO looks the same with each driver, the drivers don’t necessarily behave the same. The documentation isn’t always clear about these differences. We needed to do a fair amount of testing and source code reading to suss out the effects.

While my examples have used MySQL, Wayfair is an MSSQL shop. We had been using the mssql extension, which uses a C API called DBLIB to talk to the server. Microsoft doesn’t maintain an open source version; FreeTDS is the commonly used free implementation. One of the PDO drivers also uses DBLIB, but it returns column data differently: instead of returning strings as strings and ints as ints, the PDO DBLIB driver returns everything as a string. We had to patch it to use the expected data types. To be able to differentiate between quoting strings as VARCHAR vs. NVARCHAR, we also added a parameter type. And we added support for setting connection timeouts (PDO defines a PDO::ATTR_TIMEOUT constant, but it has no effect with the DBLIB driver).

Another reason we were first attracted to PDO was for prepared statements. Since MSSQL supports them, it seemed like this could be an opportunity for a performance gain. However, after digging into the driver internals, we found that the DBLIB driver only emulates them. Microsoft has an ODBC driver for Linux. We tested it in conjunction with PDO’s ODBC driver, but found the two to be incompatible. We were able to get it working with the plain odbc extension, but (amazingly) found prepared statements to be slower than regular queries. Since using prepared statements would’ve necessitated a nontrivial change in coding style, we decided against investigating the speed difference.

We’re currently working on deploying SQL Relay. Preliminary tests have shown that it reduces network load without adding much overhead. It has a PDO driver, so we’ll be able to swap it into our stack without changing how queries are made.

Tungsten.js: UI Framework with Virtual DOM + Mustache Templates

Performance is a top priority here at Wayfair, because improved performance means an improved customer experience. A significant piece of web performance is the time it takes to render, or generate, the markup for a page. Over the last several months we’ve worked hard to improve render performance on our customer-facing sites, and to make it easier for our engineers to write code that renders pages with optimal performance.

We had been using Backbone.js and Mustache templates for our JavaScript modules at Wayfair for some time, but last year we realized that our front-end performance needed an upgrade. We identified two areas for improvement: the efficiency of client-side DOM updates, and the abstraction of DOM manipulation away from engineers.

The first issue was a result of the standard render implementation in Backbone. By default, Backbone’s render does nothing; it’s up to developers to implement the render function as they see fit. A common implementation (and the example given in the Backbone docs) looks something like this:

render: function() {
    this.$el.html(this.template(this.model.attributes));
}

The problem with this implementation is twofold: first, the entire view is re-rendered with jQuery’s $().html() whenever render is called, even when only a small part of it changed; and second, render manipulates the DOM regardless of whether the data changed, so engineers must be deliberate about when they call it to avoid unnecessary, expensive DOM updates. In practice, the workaround is a mix of calling render only when the entire view truly needs to be re-rendered, and writing low-level DOM manipulation code when only portions of the view need updating or the update needs to be more precise (e.g., changing a single class on an element). All of this means engineers have to be aware of the state of the DOM at all times, and of the performance consequences of every DOM manipulation. The result is view modules that are hard to reason about and riddled with low-level DOM manipulation code.
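One common mitigation for the second problem is a guarded render that skips DOM work when nothing changed. This is a generic sketch of the technique (not Wayfair code, and not tied to Backbone itself), and it still leaves the first problem, full re-renders, unsolved:

```javascript
// Generic sketch of a guarded render: compare a serialized snapshot of the
// data and skip the (expensive) DOM write when the state hasn't changed.
function makeView(template, el) {
  let lastState = null;
  let renderCount = 0;
  return {
    render(data) {
      const state = JSON.stringify(data);
      if (state === lastState) return false; // nothing changed; skip the DOM
      lastState = state;
      renderCount++;
      el.innerHTML = template(data); // still a full re-render when it runs
      return true;
    },
    renders: () => renderCount
  };
}

const el = { innerHTML: '' }; // stand-in for a real DOM node
const view = makeView((d) => `<p>${d.name}</p>`, el);
view.render({ name: 'lamp' }); // renders
view.render({ name: 'lamp' }); // skipped: identical state
view.render({ name: 'sofa' }); // renders
console.log(view.renders()); // 2
```

Guards like this reduce wasted work but push bookkeeping onto every view, which is precisely the burden a virtual-DOM diff removes.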

To address both of these problems, we investigated front-end frameworks that would abstract the DOM away from the developer while also providing high-performance updates. The primary library we looked at was React.js, a UI library open-sourced by Facebook that uses a one-way data flow with virtual DOM rendering to support high-performance client-side DOM updates. We really liked React.js, but encountered one major issue: the lack of template support that would enable high-performance server-side rendering.

On modern web sites and applications, HTML rendering occurs at two points: once on page load, when the DOM is built from HTML delivered by the server, and again (zero to many times) when JavaScript updates the DOM after page load, usually in response to user interaction. On a multi-page site like Wayfair, the initial rendering happens on the server, and we’ve put a lot of work into making it as fast as it can be. HTML markup is written in Mustache templates and rendered via a C++ Mustache renderer implemented as a PHP extension. This gives us server-side rendering at speeds even faster than native PHP views.

Since server-side rendering is an important part of our web site, we were glad that React.js comes with this feature out of the box. Unfortunately, while server-side rendering is available with React.js, it’s significantly slower than our existing C++ Mustache setup. Beyond performance, rendering React.js on the server would have required Node.js servers to supplement our PHP servers, a new requirement that would have introduced complexity as well as a new single point of failure into our server stack. For these reasons, and because we already had existing Mustache templates we wished to reuse, we decided React.js wasn’t a good fit.

Where do we go from here? We liked many of the concepts React.js introduced us to, such as reactive data-driven views and virtual DOM rendering, but we didn’t want our choice of front-end framework to dictate our server-side technologies or force a replacement of our C++/PHP Mustache rendering. So, after some investigation of what else was available, we decided to take the concepts we liked from React.js and implement them ourselves, with features that made sense for our tech stack.

Earlier this year, we wrote Tungsten.js, a modular web UI framework that leverages shared Mustache templates to enable high-performance rendering on both server and client. A few weeks ago we announced that we were open sourcing Tungsten.js, and today we’re excited to announce that primary development on Tungsten.js will be “GitHub first,” and all new updates to the framework can be found on our GitHub repo: https://github.com/wayfair/tungstenjs.

Tungsten.js is the bridge we built between Mustache templates, virtual-DOM, and Backbone.js. It uses the Ractive compiler to pre-compile Mustache templates into functions that return virtual DOM objects. It uses the virtual-DOM diff/patch library to make intelligent updates to the DOM. And it uses Backbone.js views, models, and collections as the developer-facing API. At least, that’s the combination it uses for us here at Wayfair; Tungsten.js emphasizes modularity above all else, and any one of these layers can be swapped out for a similar library paired with an adaptor. Backbone could be swapped out for Ampersand. virtual-DOM could be swapped out for another implementation. Mustache could be swapped out for Handlebars, Jade, or even JSX. So, more generally, Tungsten.js is a bridge between any combination of markup notation (templates), a UI updating mechanism, and a view layer for developers.
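The job of the virtual-DOM layer can be illustrated with a toy diff. This is a drastically simplified sketch of the concept, not the algorithm the virtual-dom library (or Tungsten.js) actually uses:

```javascript
// Toy virtual-DOM sketch: trees of plain objects, and a diff that emits a
// patch list touching only the nodes that changed. Assumes same-length
// child lists for simplicity; real libraries handle inserts/removes/keys.
function h(tag, children) { return { tag, children }; }

function diff(oldNode, newNode, path = 'root') {
  if (typeof oldNode === 'string' || typeof newNode === 'string') {
    return oldNode === newNode ? [] : [{ path, type: 'text', value: newNode }];
  }
  if (oldNode.tag !== newNode.tag) {
    return [{ path, type: 'replace', value: newNode }];
  }
  const patches = [];
  for (let i = 0; i < newNode.children.length; i++) {
    patches.push(...diff(oldNode.children[i], newNode.children[i], `${path}.${i}`));
  }
  return patches;
}

const before = h('div', [h('span', ['Hello Chris']), h('span', ['$10000'])]);
const after  = h('div', [h('span', ['Hello Chris']), h('span', ['$6000'])]);
console.log(diff(before, after));
// Only the one changed text node produces a patch; the rest of the tree
// is left alone, which is where the performance win comes from.
```

Because the templates compile to functions returning trees like these, engineers can “re-render everything” in their code while the framework translates that into the minimal set of real DOM writes.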

We don’t expect Tungsten.js to be the best fit for everyone, but we think it fits a common set of use cases very well. We’ve been using it on customer-facing pages in production for a while here at Wayfair, and so far we’ve been very happy with it. Our full-stack engineers frequently tell us they far prefer Tungsten to vanilla Backbone.js + jQuery, and we’ve improved client-side performance now that DOM manipulation is abstracted away from developers. And while we weren’t trying to build the “fastest” front-end framework around, it turns out that when we re-implemented Ryan Florence’s DBMonster demo from React.js Conf in Tungsten.js, the browser’s frame rate ended up roughly on par with both React and Ember with Glimmer.

Here at Wayfair we have a saying that “we’re never done”. That’s certainly the case with Tungsten.js, which we’re constantly improving. We have a lot of ideas for Tungsten.js in the coming months, so watch the repo for updates. And of course we welcome contributions!

No-follow SEO link highlighter Chrome extension

Cari, a developer on our SEO team, just wrote a Chrome extension that’s up on both GitHub (https://github.com/wayfair/nofollow_highlighter) and the Chrome web store (click here to add to Chrome). If you’re new to this subject matter, here’s a classic explanation from Matt Cutts of Google, from a few years ago: https://www.mattcutts.com/blog/pagerank-sculpting/. A lot has changed in SEO since then, but this basic idea has become a constant in internet life: if you’re compensating people for a promotional activity, they need to make that clear to Google with a ‘nofollow’ link. Below is an example of a blogger who is doing it right: a disclaimer saying she was compensated with a gift card, to keep the FTC happy, and a ‘nofollow’ link with the yellow highlight from our plug-in, indicating that the link properly warns Google not to pass PageRank, or whatever they’re calling ‘Google mojo’ these days, through to the destination. The other links on the page aren’t color-coded either way, because they go to domains we don’t care about.
[Screenshot: a blogger’s disclosure post with the ‘nofollow’ link highlighted in yellow]

A green highlight, indicating that Google thinks we want it to pass page rank, would be a problem. If you don’t like the idea of green being a problem, the default yellow and green colors are configurable to whatever you want.

If your promotions people are working with a blogger who forgets to do that, or misspells ‘nofollow,’ or anything along those lines, it’s on you to get that cleaned up in a hurry. It’s suboptimal to have to ‘view-source’ on every page or write your own crawler: enter the browser-based ‘no-follow’ extension. There have been a few of these for different browsers over the years, but none did exactly what we wanted, so we rolled our own.

The features of ours that we like are:

  • Configurable list of domains whose reputation you are trying to defend.
  • Click-button activation/deactivation on pages, which is persistent.
  • Aggressive defense against misspellings, bad formatting, etc.

The configurable list is important, because if you’re looking over a page that links to one of your sites, and it links to several other sites, it’s best if you don’t have to puzzle over which links you care about.

The persistent flagging of pages you care about is important, because if you’re engaged in a promotional activity with a site, odds are someone at your company is going to be visiting that site from time to time. Tell your colleagues to enable the plugin and be on the lookout for green links, and you’ve got a hard-to-miss visual cue for problems that might arise.

The defense against misspellings, special characters, and the like is for this scenario. Brian, head of Wayfair SEO: “Hey Bob, did you put ‘nofollow’ on those links?” Bob: “Yup”. Brian: “kthxbye”. But in fact, although Bob is telling the truth, he actually put smart quotes, rather than ASCII quotes, around ‘nofollow,’ so Google will not recognize the instruction. It’s funny: browsers do a great job of supplying missing closing tags, guessing common spelling mistakes, etc., because their mission in life is to paint the page regardless of the foibles and carelessness of web page authors. Nationwide proofreaders’ strike? No problem, browsers will more or less read your mind! But Google’s mission in life is to crawl the web and pass PageRank, so you have to tell it very clearly not to do that.
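A tolerant check along these lines can be sketched as follows. This is an invented helper for illustration, not the extension’s actual code: it flags smart-quoted rel values as broken and catches common near-misses of ‘nofollow’:

```javascript
// Sketch of a tolerant rel-attribute check: valid nofollows are reported as
// such, while smart quotes and common misspellings are flagged as suspicious,
// since Google will not honor them.
function classifyRel(relAttr) {
  if (relAttr == null) return 'follow';
  // Smart quotes inside the attribute value mean the markup is broken.
  if (/[\u2018\u2019\u201C\u201D]/.test(relAttr)) return 'suspicious';
  const tokens = relAttr.toLowerCase().trim().split(/\s+/);
  if (tokens.includes('nofollow')) return 'nofollow';
  // Catch near-misses like "no-follow", "no_follow", or "nofolow".
  if (tokens.some((t) => /^no[-_]?fol+ow$/.test(t))) return 'suspicious';
  return 'follow';
}

console.log(classifyRel('nofollow'));             // "nofollow"
console.log(classifyRel('\u201Cnofollow\u201D')); // smart quotes → "suspicious"
console.log(classifyRel('no-follow'));            // "suspicious"
console.log(classifyRel('sponsored'));            // "follow"
```

In the extension, “nofollow” and “suspicious/missing” map to the yellow and green highlights described above, so Bob’s smart-quote mistake shows up at a glance instead of requiring a view-source audit.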

And of course, when all your links are clean, you can have a luau party in your Tiki hut, like our SEO team:
[Photo: the SEO team’s tiki hut luau]

TechJam 2015

Come hang out with us tomorrow, June 11th, from 4 pm to 9 pm, at TechJam. Not to be too transparent, but we’re hiring! We will be at booth 43, bostontechjam@wayfair.com, #btj2015. Steve Conine, Wayfair Founder and CTO, and I will be there, along with a bunch of our colleagues in Wayfair engineering.

We will have a Money Booth where people can enter by checking in on Facebook or tagging Wayfair in a picture on Instagram/Twitter, #wayfairbtj2015. The person who grabs the most money will win that amount in Wayfair Bucks.

We will also have some Wayfair swag at the booth.