Upgrading our stack for web performance: early flushing, http/2, and more

What’s fast, jank-free, and has what you need for your home? Wayfair’s mobile web site, that’s what!
android-screenshot-wfdc
Behind this spiffy Android experience, and behind all of the ways into Wayfair that we make available to our customers, are some fresh technical upgrades and an evolution of our programming and product culture, with increased attention to web performance.

For some time we have used RUM, synthetic and server-side execution metrics to stay on top of how fast or slow we are. We recently set out to adopt a group of the best practices that have emerged over the last few years, hoping to observe big changes in our numbers. We were a little surprised by the trickiness of configuration and verification. These techniques have been possible for a while now, but many proxies, monitoring and measurement tools, web servers, and supporting libraries either don’t support them at all, support them but in versions that have only very recently been packaged by popular Linux distributions, or must be carefully massaged into non-default states for them to work. Our guides were the blog posts, books, and Velocity talks of web performance industry leaders Ilya Grigorik, Patrick Meenan, and Steve Souders. We figure a practically minded how-to from a big site like ours might be useful to people. We’ll focus on how to make sure these things are working, rather than on the mechanics of configuration beyond the basic idea, because the details of different web and CDN platforms will differ widely from site to site.

Our goal for the web performance program is simple: a better user experience through faster perceived performance. As proxies for that somewhat vague goal, we focused on measurements of the experience-ready type, using events in the performance-timing RUM data, and various metrics that are available in synthetic tools like WebPageTest and Phantomas. We have found Google’s RAIL model to be an excellent frame of reference. We have tinkered with the metrics we use, and TTFB (time to first byte), first paint, above-the-fold render time, DOM-ready, ‘interaction-ready,’ document-complete, and speed index have figured in our thinking at various points. We have a lot of tactics and techniques on our roadmap, but we decided to focus at first on early flushing and, to a lesser extent, http/2 or ‘h2’ as it appears in some tools. Our analysis told us they would have a big impact. We also felt that if we didn’t do them first, or at least early, we would be messing around with smaller optimizations that would probably not work exactly the same way without those two things in place, so we would just have to redo them.

Before we get into the details, here’s a quick thought on internal justifications for the effort involved: we don’t really buy the web performance industry’s simple sales math for e-commerce sites. The standard argument goes something like this: visitors who experience faster page views convert at higher rates. Move the slower crowd to faster, speed everyone up and hammer down the long tail of slow experiences, and they will convert at the same rate as the fast people. I think a lot of speed tools have been sold based on that argument. If only! We’re not buyers. What if the people who are converting at a high rate for some combination of reasons just tend to have fast computers? Unfortunately the ROI on speed is not just a simple calculus problem, where you raise up the juiciest part of the speed curve, and sit back and count the money. This is a deep question, and one worth pondering before you put too much of your life energy into a performance effort, especially if divorced from other product concerns. But that doesn’t mean there isn’t real money at stake, and it’s certainly not worth dwelling on when you’re working hard, as we have been doing, to give your customers a fast and excellent experience.

Let’s break down the techniques of early flushing and h2.

If you’re unfamiliar with early flushing, it’s an easy idea to understand: as early as reasonably possible in the process of building up a stream of html, send or flush the first part, including instructions for downloading critical assets like CSS and Javascript. There are some fine choices to make. Can you inline the most critical CSS, JS or even images? Can you get something meaningful out in 14K or less, so it will fit in the first TCP congestion window? Also, there are typically some obstacles. If, after flushing the first batch of bytes, including HTTP code 200/OK, you encounter an error that previously would have triggered a code other than 200, what do you do? Is *anything* in your stack buffering the stream? In our case this meant PHP (easy enough to change the code and the configuration file), Nginx (proxy_buffering directive), two inline appliances we had doing various things in our data center, and our CDNs. All things being equal, and as a general rule, buffering is your friend, and we probably could have gotten all this to work by reducing buffer sizes. However, in the spirit of keeping things simple and cutting to the chase, we just removed the appliances and made other plans for what they were doing, turned buffering off in all the other places, observed only a very slight increase in server load, and went with that.
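To make the idea concrete, here is a minimal sketch in Python. It is not our production code (our stack is PHP behind Nginx, and the handler and asset URLs below are made up), but the pattern is the same: emit the head, with its critical asset references, before the slow work of building the body.

```python
def render_page(build_body):
    """Generator that early-flushes the document head before the body is ready."""
    head = (
        "<!DOCTYPE html><html><head>"
        '<link rel="stylesheet" href="/assets/critical.css">'  # hypothetical asset URLs
        '<script src="/assets/app.js" async></script>'
        "</head><body><nav>top nav</nav>"
    )
    # First chunk goes out immediately; ideally it fits in the ~14K initial
    # TCP congestion window, so the browser can start fetching CSS and JS
    # while the server is still rendering the rest of the page.
    yield head.encode("utf-8")
    # The expensive server-side work happens after the flush.
    yield build_body().encode("utf-8")
    yield b"</body></html>"
```

With a streaming-capable server, returning a generator like this (instead of one joined string) is what lets the bytes leave as they are produced, provided nothing downstream (language runtime, proxy, appliance, CDN) re-buffers them.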

H2 is a big topic. In short, it’s a new protocol, based on Google’s SPDY, that aims to reduce network overhead, round trips, and payloads. It’s available, for practical purposes, over TLS/HTTPS only, and it allows you to re-use a single connection for multiple downloads. It’s supported, we think fairly typically for North American and European sites, by more than half of the browsers we see. You need graceful degradation to http/1.1 for the other 40+%, but there are good ways to do that. To see it in action in Google Chrome, add the HTTP/2 and SPDY indicator extension, available here: https://chrome.google.com/webstore/detail/http2-and-spdy-indicator/mpbpobfflnpcgagjijhmgnchggcjblin. It shows you h2 working on a site as a blue thunderbolt, h2 plus QUIC (a topic for a different day, and Google-only for now) as a red one, and a blank grey icon for neither. Here it is on wayfair.com:

http2-spdy-on-wayfair

Click on the thunderbolt, and you’ll see the screen below. It shows all the h2 sessions your browser has going. I’ve opened up Google, Facebook and Twitter, which are three sites that populate this tab for many people. Interspersed among those, you can see the domains wayfair.com, wayfair.co.uk, and the separate domains we use for our CDNs to serve static assets. http2-wayfair-domains

The two domains for assets are an http/1.1 graceful degradation hack, where you resolve two different domains to the same IP address. H2 is smart enough to make one connection to the IP, but http/1.1 treats it as two, so the browser’s connection limit is doubled. In this way, you get the benefits of domain sharding for http/1.1 clients, and the benefits of connection re-use for h2. Domain sharding is the best of the optimizations from the older set of best practices that we’re replacing, so even as we abandon some of the other techniques, we want to continue to take advantage of this one for the legacy-client population, which is still quite large.

How do you know early flush and h2 are working? Open up Chrome developer tools and look at the timeline view. Let’s look at early flush first.
wayfaircouk-timeline

Notice how the purple CSS and yellow Javascript downloads start halfway through the blue html. Also, see how we get a first ‘paint’ of just the top nav, before the full page shows up two frames later. If you’re on a slow connection and you type in the home page URL, but really want to go somewhere else on the site, you can actually use that nav to get there. So it’s useful to some users, and kicking off all the downloads early speeds everything up across the board. Here is a WebPageTest video of early flushing vs. the control:


Notice that you see something of interest sooner, and the overall time is 2.6 seconds vs 3.5. These are under synthetic test conditions: the discrepancy is more dramatic than in real life, on average, and the raw numbers are also different. But it illustrates perfectly the idea of what you’re looking for, and what people are actually experiencing on the site now.

For h2, back in Chrome developer tools, go to the network tab, right-click on the column headings, and enable ‘Protocol’. A new column will appear between ‘Status’ and ‘Domain’. You’re looking for the value ‘h2’ instead of ‘http/1.1’. Then fire up webpagetest.org, run a test, and look at the details tab. The waterfall view shows a large number of requests to one of our CDN URLs.
wpt-detail-waterfall

But then the connection view shows that many requests are multiplexed over the connections to the static asset urls.
wpt-detail-connection2

WebPageTest is telling this story pretty well, but we’re still seeing more connections than we should, according to the theory. So was the theory implemented incorrectly by Akamai, Fastly, or Google Chrome? Or is WebPageTest misleading us? Let’s fire up Wireshark and find out. And… everything’s encrypted, so we can’t see it. Let’s go see Chris the protocol wizard (wearing his wizard cap today!), who’s savvier with Wireshark than I am.
chris-the-protocol-wizard

A quick, friendly man-in-the-middle attack on our own TLS infrastructure later, and Wireshark shows this (click to enlarge, and note that HTTP2 is magic!):

wireshark-http2-arrows

And then we see one TCP stream to the IP address behind our asset URLs.

one-tcp-stream

Victory! WPT is showing more ‘connections’ than there actually are TCP streams, but there can be only one stream to that address if it’s working right, and that’s what we see.

Obviously, we’re missing an opportunity here to serve everything off one domain, to reduce the number of connection handshakes even further and get the most out of server push. In the example, that domain would have to be www.wayfair.co.uk for our UK site. We may be more aggressive about that in the future, but it would require more refactoring, and for now we’re happy with the browsers being able to grab all their assets from one place, without domain sharding or other tricks.

Bottom line: we got a 15-20% decrease in the metrics we care about from early flush, and 5-10% from h2. The examples above are for desktop, because it’s easier to see what’s really going on than with mobile. But these techniques work for mobile too, and of course the reduction in network round trips is much more of a benefit for phones than for desktops on fast networks. Early flush works the same way on both. Like a lot of e-commerce sites, we use adaptive delivery more than responsive design, although we’re doing more of the latter than we used to, and we deliver a smaller payload to phones. Whatever number of bytes you can flush, if it’s constant across platforms, will be a bigger percentage of the total, and so a more meaningful part of the page, on mobile than on desktop.

A couple of configuration notes. Your experience of CDNs may differ from ours. Just as with Nginx, turning proxy buffering off is probably not the default configuration. We terminate TLS at the CDN, so their support for http/2 was important to us. If you’re doing this yourself, note that as of this writing, a recent Ubuntu is the only mainstream server operating system whose default OpenSSL package is new enough for http/2 to work. We set that up during testing: Ubuntu + OpenSSL + Nginx, with proxy_buffering off and h2 on, and everything worked. If you are using a CDN, even if you are encrypting your origin endpoints, you don’t actually have to enable h2 at the origin. As long as the edge node does the h2 thing properly, the long-haul traffic from edge to origin can use http/1.1 without a huge opportunity cost. This may be of help to you if you have existing proxies, test scripts, etc., that benefit from an easy-to-use text-based protocol (http/1.1), or that do not yet support h2.
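For reference, the Nginx side of what we just described boils down to a couple of directives. This is a sketch, not our actual configuration; the upstream name and certificate paths are placeholders:

```nginx
server {
    listen 443 ssl http2;                        # h2 needs a new-enough OpenSSL (ALPN)
    ssl_certificate     /etc/ssl/example.crt;    # placeholder paths
    ssl_certificate_key /etc/ssl/example.key;

    location / {
        proxy_pass http://php_backend;           # placeholder upstream name
        proxy_buffering off;                     # let early-flushed bytes through immediately
    }
}
```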

For metrics and monitoring, there are a lot of available tools and platforms. There are great cloud resources you can use, but that’s not always very practical to run internally, and the emulations of real devices that they provide don’t have the right chips to give you a realistic idea of what’s happening. We use them to some extent, and if we didn’t have a more accurate option we would be using them more. To get a more accurate read, we set up a lab in the office, which has a very “Pat Meenan’s basement from 10 years ago” vibe to it. Here’s Sam, from our Performance Engineering team, among the Intel NUCs, phones and tablets that we set up:

sam-in-the-lab

This way, we get chips like in the real world, and we use traffic shaping software to control for the fact that we’re on a fast network in the office.

A word on opportunistic roadmapping (not to be confused with its evil twin, scope creep). When we began this work, our main focus was early flushing, and our metrics improvements have now demonstrated that this was indeed the most important of the things we have tried so far. However, halfway through the project, by pure happy coincidence, the SEO community started to notice that Google had finally cleaned up its 301-redirect penalty problem, which had been foiling its own attempts to get everyone to go all-TLS, all the time. We had wanted to do that for our biggest sites for a while, but the prospect of losing 15% of natural search traffic was holding us back. It was incredible good fortune that this barrier was removed when it was, while we were in the middle of a concerted effort to improve the stack, with all the associated rounds of thinking, coding, probing, and testing already lined up. When we saw that, we decided to really push on the h2 roadmap items, to get them done before the holidays this year. The reward, if the technique worked out, was going to apply not just to our smaller sites that were already all-TLS, but to our largest site, wayfair.com. We got all that done, and we’re very happy with the results.

With these pieces in place, we are psyched about our roadmap going forward. We can start really coding for http/2, which will be great for desktop browsers, single-page apps for mobile web, and everything in between. Other things are just out, imminent, or planned for pretty soon: resource hints, pre-rendering, dns-prefetch, HSTS, and OCSP stapling. We’ll share more later on all that, if there’s something interesting to share. In retrospect, I would describe the web performance optimizations we had been doing before we made all these changes as strenuous efforts with one arm tied behind our backs. On this modern stack, we’re looking forward to being able to use all the latest clever tricks to give our customers a really awesome experience. In the meantime, internally, we have put in place well-socialized KPIs, training for our developers on all these techniques, and better tooling than we had before, with the goal of building sound principles of web performance into our product and engineering culture, and into everything we build. To give one example (and we’re not the first to do this), we made a bookmarklet based on Pat Meenan’s Javascript function for calculating Speed Index, which he open-sourced here; it’s on all our developers’ Chrome toolbars now.

How to build an Elasticsearch analytics platform in a weekend

order_heatmap

That’s a heat map of 3 months of Wayfair North American orders, exported via Logstash from our relational orders database into Elasticsearch, and visualized with Kibana 4. For some time now, ELK, or Elasticsearch / Logstash / Kibana, has been our opensource platform for text-based log messages, complemented by Graphite and Grafana for numeric logging of counts, averages, etc. We have long used ELK for realtime analysis of emergent problems. With the Kibana 4 upgrade, we’re starting to make it a first-class citizen of our analytics suite, which also includes Hadoop, Vertica, Tableau, and cube-style SQL Server. This blog post lays out the details of a hackathon project that served as a POC for that move. The basic idea was to put some data from and about our search systems into Kibana 4.

Our search platform is Solr, so this meant we were going to put Solr-based data into Elasticsearch. What could go wrong? Plenty. To see why, just start typing Solr into Google:

1

Click on through to “Solr vs. Elasticsearch” and you’ll find useful comparisons, lively discussion about strengths and weaknesses, bold declarations of imminent dominance or eclipse, trash talk, etc. In the end they are both web-service wrappers around inverted indices built with the Lucene Java library. They have more in common than they have dividing them, and they shouldn’t be fighting. Could Solr and ES get along? Could we put Solr data into ES without having ES talk trash about Solr? We would soon find out. Cognizant of the historical tribal disagreement between the Solr and ES communities, we called our effort “Project Forbidden Marriage.”

Our search usage data is stored in Hive/Hadoop, and our order data in MSSQL. Elastic provides a connector for Hive via jdbc to directly connect to Elasticsearch, and we used pymssql and Kafka to send the order data to logstash and on to ES.

Now that we had data flowing, we hit our first roadblock. We wanted to see order history on a timeline based on order completion date, not the time the message is indexed by Elasticsearch or processed through Logstash, which is the default that appears in the @timestamp field that drives default-behavior time series charting in Kibana. So we changed our query to name the column @timestamp in the Python.

2

Uh…well that didn’t work. OK, plan B – we tried using the date filter in Logstash to replace the @timestamp field data with our order_complete_date field data.

date {
    match => [ "order_complete_date", "dd/MM/yy HH:mm:ss" ]
    target => "@timestamp"
}

Awesome! That should work perfectly!

2

Hm…that didn’t seem to make a difference at all. To make a long story a little shorter, I tried a few other things, even going as far as writing some Ruby in the Logstash config file to remove the @timestamp field entirely, or modify it, or do ANYTHING with it pre-Elasticsearch. Fun fact: you can’t do that. While it’s not documented as such, in my experience (and frantic head-banging) this field could not be modified without hacking the infrastructure. We have hacked Solr on several occasions, but we really try not to go there if we don’t have to, because it complicates upgrades, and we have been making do with vanilla ELK.

All right, plan C. I renamed my date field in my SQL query to event_timestamp, and set that as the timestamp reference field in Kibana.
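In the Python feed, the fix amounted to renaming the column before shipping the row off to Logstash. A simplified sketch (the helper and the field names other than event_timestamp are illustrative, not our actual code):

```python
from datetime import datetime

def row_to_event(row):
    """Turn a database row (as a dict) into an event bound for Logstash/ES.

    The order-completion date is renamed to event_timestamp, which Kibana
    is pointed at, instead of fighting with the reserved @timestamp field.
    """
    event = dict(row)
    completed = event.pop("order_complete_date")
    # Normalize to one of the formats declared in the index mapping.
    event["event_timestamp"] = completed.strftime("%Y-%m-%d %H:%M:%S")
    return event
```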

3

Victory! Now we just need to adjust the rest of our timeline-based data to send to that field properly, account for the differences in timestamp formatting across our data sets, and we’re good to go. Luckily for us, a date-type field in Elasticsearch will accept multiple date formats, as long as you specify them properly.

"event_timestamp": {
    "type": "date",
    "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||strict_date_optional_time||epoch_millis"
}

 

Now that the time handling is taken care of, we need to normalize our fields so all of this data plays nicely together. Elasticsearch is a schema-less document store: it allows you to enter data without a defined schema, but it’s still important that our data be coherent so we can visualize all of it together, and that means keeping field names consistent between data sources. Since we want to combine these two non-relational sets, we need to make sure the common field names match. Luckily, for our purposes, only a few fields are shared between our data sources: customer_id, store_name, ip_address, sku, and city. Keeping these named the same across sources allows us to maintain the relationship between them; when I query for

sku:abc123

I’d expect to see results from both data sets, rather than having to search for

sku:abc123 OR skus:abc123
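A tiny sketch of what that normalization step looks like in practice. The per-source rename maps below are illustrative, not our real schema:

```python
# Map each source's field names onto the shared names used across indices.
FIELD_ALIASES = {
    "orders": {"skus": "sku", "client_ip": "ip_address"},   # hypothetical source fields
    "search": {"search_ip": "ip_address"},
}

def normalize(source, doc):
    """Rename source-specific fields so both data sets share one vocabulary."""
    renames = FIELD_ALIASES.get(source, {})
    return {renames.get(key, key): value for key, value in doc.items()}
```

With every feed passed through a step like this before indexing, a single query such as sku:abc123 covers both data sets.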

 

It’s definitely worth discussing a few of these field design choices in detail, as they ultimately shape how we visualize this data. The first noteworthy one is SKU. In Elasticsearch 2.x, multi-value fields have been removed as a defined field type; however, the functionality still remains by default. If we send an array of SKUs, we can still read and aggregate on them independently. For example:

{"sku": "ABC123"}
{"sku": ["ABC123", "DEF456", "GHI789"]}

If we run a query in Kibana for sku:ABC123, both documents come back. The array elements also aggregate as individual SKUs, and every operation works this way. This winds up being really important: some records from orders or search history contain one SKU, and some can contain dozens. Having them all accessible lets us visualize our data properly; rather than aggregating on whole arrays, we can aggregate on each individual element.
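A toy illustration of why this matters for aggregation: a terms-style count over documents treats each element of an array independently, the way an Elasticsearch terms aggregation does. This is a pure-Python stand-in for the ES behavior, not the ES API:

```python
from collections import Counter

def terms_agg(docs, field):
    """Count field values the way an Elasticsearch terms aggregation would:
    an array contributes one count per element, not one for the whole array."""
    counts = Counter()
    for doc in docs:
        value = doc.get(field, [])
        values = value if isinstance(value, list) else [value]
        counts.update(values)
    return counts

docs = [
    {"sku": "ABC123"},
    {"sku": ["ABC123", "DEF456", "GHI789"]},
]
```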

Another important field to mention is keyword. This field holds the search string a user entered on Wayfair.com, and we want to do a few different things with it. We want to be able to run a full-text search, but we also want to aggregate the results. For those who don’t know, aggregating on an analyzed field can be extremely expensive, depending on the contents of the field. For example: if we have the keyword "blue area rugs", we want to see the count for that entire phrase. If it’s an analyzed field, we’ll wind up seeing counts for blue, area, and rugs independently. However, we need an analyzed field to do a full-text search. Enter the multi-field. The mapping looks like this:

"keyword": {
    "type": "string",
    "fields": {
        "analyzed": {
            "type": "string",
            "analyzer": "english"
        },
        "raw": {
            "type": "string",
            "index": "not_analyzed"
        }
    }
}

This allows us a lot of flexibility with our searching and aggregation – we can ask questions like “what are the most common search phrases that include ‘blue’?”. So we search for keyword.analyzed:‘blue’, and aggregate on keyword.raw.
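Concretely, the request for that question looks something like this (the aggregation name and sizes are illustrative):

```json
{
  "query": { "match": { "keyword.analyzed": "blue" } },
  "aggs": {
    "top_phrases": {
      "terms": { "field": "keyword.raw", "size": 10 }
    }
  },
  "size": 0
}
```

The match query runs against the analyzed sub-field, while the terms aggregation buckets on the untouched raw sub-field, so whole phrases come back as single buckets.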

4

Perfect! Full-text search without expensive aggregation. The last field worth mentioning is our geo_point field, which we called coords. We decided to go IP-based, since IP is a common item between the search and order data. To get geo information, we ran the IPs through a few different facilities: on the Hive side, through UDFs that look coordinates up in an IP database we have, and the Python side of things did something similar.

On geo data –

  1. Go zip code based where possible; this is only relevant to a portion of our data, however. IP is used due to it being common among all data sets
  2. Filter out anything with a default location

To point 1: geolocation based on IP is inherently unreliable. It’s getting better over time, sure, but even in the best circumstances it can only be so accurate. Looking back, I would only use IP information for search data, and I would use zip code for order data, since zip code is significantly more reliable for geolocation. And on that note, IP geolocation returns a default location when it doesn’t know where an IP geo-locates to. There are two hot spots on this map. One is on top of New York City, where we do actually do a ton of business, and the other is in Kansas, which is a default address for test orders. Those are outliers. If we wanted to clean this up and see more meaningful data, we’d probably exclude the Kansas one, overlay with population data, and get orders per capita.
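Filtering out such a sentinel is straightforward once you know the coordinates your IP database reports for unknown addresses. A sketch (the sentinel coordinates below are placeholders, not the real values):

```python
# Placeholder "unknown location" sentinel, not the coordinates our database uses.
DEFAULT_COORDS = {"lat": 38.0, "lon": -97.0}

def drop_default_locations(docs):
    """Discard documents whose geo_point is the IP database's 'unknown' sentinel,
    so the heat map shows real customers instead of one artificial hot spot."""
    return [doc for doc in docs if doc.get("coords") != DEFAULT_COORDS]
```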

5

So now that we have all of this data, what can we do with it? A LOT, it turns out, can come out of just two tables. We were able to use our search and order data independently and together, and to display some data that no one had really seen here at Wayfair before. Also, we accidentally built a product recommendation engine.

You may have noticed that this dashboard won’t work without a SKU to search for. As I mentioned previously, we send an array of SKUs to the SKU field where applicable, so this search returns any SKUs associated with the one searched for. That lets us show “customers also viewed” type material, which is what the top-left histogram represents. In this screenshot, we searched for a SKU that we know to be a barbecue grill, and the other SKUs shown in the top left happen to be grills as well. The top-right histogram is similar, except it is order-based: our “customers also purchased” view, which lists the grill (of course) and grill accessories like covers, spatulas, and patio furniture. Furthermore, the bottom-left visual shows exactly which search terms led to this item, which is extremely useful for analysis. The rest of the data here was just for fun: where did people buy this, what are their names (Jennifers like grills!), plus the fraudulent order ratio and average order total.

6

Our next item is a holistic search dashboard. It breaks down all of our search data in a general sense. What we have here first is our Keyword Clickthrough panel. This is a simple idea: after someone searched for something, did they click a SKU? If so, that’s a successful search. Further down, we have our most popular search phrases (using keyword.raw) and most popular search terms. Search terms is an aggregation on an analyzed field (oh no!) but totally useful in this case. We can see that our most common search phrase is storage cabinet, but our most popular term is table.

7

Furthermore, we can see search volume by store, and then search success by store, with counts. This allows us some flexibility as well: what if we want to see search success and most popular terms on Birch Lane, which is one of our lifestyle brands?

It turns out that wall mirrors are most popular!

Next, we have a bit more of a granular search dashboard. This allows us to search for a search term, product, or otherwise, and view the results in a few ways. First, we have the count of searches by channel on the top. In this example, we’re looking at our friend “storage cabinets”. In the first chart, we have the counts of searches for this phrase by channel – we can see if this phrase was most popular through mobile, desktop, or otherwise. Below that – we have another visualization of our success. The cool thing about this dashboard is we can geo-locate this data as well. It looks like storage cabinets are popular everywhere, but what about something like pineapples?

This changes our geo-location information quite drastically. Apparently people in the Midwest don’t care about pineapples!

Furthermore, we can use this dashboard to determine which SKUs people clicked on afterwards, and which store they were on. If we also filter by store, we can see the popular SKUs for that term in that store, how successful those searches were, and so on.

Next useful visual: fraudulent orders. The nature of being an online retailer is that people are going to try to scam us, use stolen credit cards, or whatever. The good news is, our fraud evaluation system catches these. We decided to plot all fraudulent orders on a map and break them down by state/city, as well as by SKU and customer ID. We now know where the majority of our fraud comes from, and which SKUs are commonly marked as fraudulent.

Another dashboard we were able to pull out of this data is a detailed order source visualization.

We are able to see order percentage by channel by store on top. It turns out Birch Lane receives a ton of mobile traffic, and Wayfair Canada doesn’t! Then we broke down order count by store, and percent of total orders by store. At the bottom, we have revenue information by store and channel, and then the average order cost (blurred, of course, since we are a publicly traded company and all…).

Next up is a detailed order-costs breakdown. This takes all of the parts of an order (shipping, tax, SKU quantity, total, etc.) and puts them on a heat map. It lets us see potential shipping-cost discrepancies in an area, and lets us say “maybe we need a warehouse near <city>”. For the sake of screen space, there are a few other heat maps on this dashboard, each representing a different portion of an order’s cost. This gets particularly powerful when filtering by store: we can see exactly where people are or are not spending money, store by store, and maybe use that data for more marketing!

And then… the finale. This is possibly the most important dashboard we built, and it proves the true power of this engine. We used it to stalk our company founders!

We tracked their searches, orders, and activity. You can see that they’re mostly in Boston, but travel to New York City once in a while. Click to embiggen, and you can see what they were looking for: the results show bathroom vanities and a high order count. When we showed the results to Steve Conine, he said, “Oh wow, look at that! I just remodeled my bathroom!”

stalkerdash

This just goes to show how deep this tool can go: from a high-level view of all of our search traffic, all the way down to the behavior of individual customers.

Medals of Freedom for Software

Last week, President Obama awarded his Presidential Medals of Freedom. There were some incredible people on the list of recipients, and we salute them all. Most notably for us, he recognized two women, Margaret Hamilton and Grace Hopper, who blazed a path in software development with their incredible breakthroughs, led the pack for women in engineering, and helped to inspire our Women in Technology group here at Wayfair. This is a fantastic step forward in recognition for women in engineering, especially as we aim to further close the gender gap in STEM fields. We are thrilled to see two fierce women representing the technology field, and so we here at Wayfair Engineering would like to extend our enthusiastic congratulations to both Margaret and Grace! We truly appreciate everything they’ve done for us all. There’s so much to learn from these women, including that you can’t stop when faced with adversity.

Wayfair hopes to keep the legacy they’ve created alive and to support all kinds of diversity in engineering. We’ve recently started a grassroots Women in Technology group here at Wayfair as a way to provide support to our own women in engineering and reach out to the larger community. We’re excited to announce some key partnerships that we believe will help us grow and contribute.

We will be hosting a screening of “Code: Debugging the Gender Gap” on January 19th at our Wayfair HQ. Details and tickets can be found here. We are inviting several companies throughout Boston and local universities to attend.  We hope to see everyone there!

We are also partnering with ChickTech, a national nonprofit dedicated to retaining women in the technology workforce and increasing the number of women and girls pursuing technology-based careers.  We will be hosting the Boston launch of their high school program here at Wayfair on February 11-12th.

Looking forward, we hope to continue to create useful, interesting, and inspiring events that foster and support diversity in engineering so that there may be many more Margarets and Graces of all stripes in the future.

Wayfair’s 3D Model API

When we started Wayfair Next, Wayfair’s R&D group that explores next-generation experiences, we spoke to a lot of AR and VR application developers who were interested in creating experiences for us. In line with our engineering culture, we wanted to create the experiences ourselves and figure out how to help our customers better visualize our products. Home decor lends itself very well to the application of AR and VR technology. As we ventured further into AR and VR territory, we realized that visualization technology was not the barrier. The bottleneck was the dearth of 3D content out there, especially in the e-commerce/v-commerce space.

In January 2016, we were beginning to digitize our catalog and create 3D models of our products to be used for photographic rendering. More on that later. While developing WayfairView, our AR application for Google’s Tango platform, we started laying down the foundation for a 3D content pipeline to create app-ready real-scale 3D models of our products.

We saw an opportunity here – why not empower other application developers to create amazing 3D V-commerce experiences in AR and VR just like we are, and push the envelope of mixed reality!

Real estate developers are beginning to use VR experiences to showcase spaces to their clients and customers. Why not decorate the space with real products!

Game developers can begin to embed 3D models of real furniture products in their games!

3D interior design applications that assist with room planning and creating inspirational content can use and offer Wayfair products!

Is this beginning to sound like an affiliate model for product sales via partners… but in mixed reality apps and 3D games? That’s exactly what it could be – developers can choose to redirect users to Wayfair and earn money when product purchases originate from their apps.

So lo and behold, we’re releasing our first public API that offers 3D models of our products for developers, so that you as developers can spend less time worrying about content and more time enhancing your app’s experience.

There are two ways to interact with our API, using standard HTTP GET requests or using GraphQL. If you haven’t heard of GraphQL, it’s Facebook’s internally developed interface to their data, which they have opensourced for general use, with implementations in a number of languages. It’s a very fast on-ramp for React.js-based apps, but we have found it quite useful at Wayfair, along with its ecosystem of tooling, even though we don’t use React. You as a developer can use Wayfair’s instance of GraphiQL to get information about our 3D ready products.

What kind of 3D models, you ask? For now we’re offering models in the universally supported Autodesk FBX format. If you’re a Unity developer, we also offer platform-specific Unity asset bundles.

Below are some examples of data users will be able to retrieve from the API –

Information about all dining chairs that are 3D ready –

{
...
  "pdp": "https://www.wayfair.com/Daniel-Tufted-Side-Chair-HOHN2381-HOHN2381.html?refid=3DAPI.2"
},
{
  "sku": "HOHN2563",
  "product_name": "Maza Parsons Chair",
  "sale_price": 153.99,
  "product_description": "A dining chair ...",
  "image_url": "https://secure.img2.wfrcdn.com/lf/43/hash/37313/28458225/1/custom_image.jpg",
  "class_name": "Dining Chairs",
  "class_id": 168,
  "pdp": "https://www.wayfair.com/Parsons-Chair-HOHN2563-HOHN2563.html?refid=3DAPI.2"
},
{
  "sku": "HOHN3044",
...

List of 3D models for the Maza Parsons Chair –

{
  "HOHN2563": {
    "viewer": "https://www.wayfair.com/v/api/model_viewer...",
    "android": "http://img.wfrcdn.com/docresources/37313/40/408920.android",
    "web": "http://img.wfrcdn.com/docresources/37313/44/441369.web",
    "fbx": "http://img.wfrcdn.com/docresources/37313/44/444894.fbx"
  }
}

Sample GraphQL query for fetching 3D models for the Maza Parsons Chair –

{
  product(sku: "HOHN2563") {
    sku
    name
    three_d_models(page: 2, user: "xxx", api_key: "xxx") {
      fbx
      web
      android
      win
      viewer
    }
  }
}
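
Under the hood, GraphQL rides on plain HTTP: the query goes in a JSON body via POST. Here is a minimal Python sketch of sending the query above; the endpoint URL is a placeholder, so use the one you receive with your API key and documentation.

```python
import json
from urllib.request import Request, urlopen

# the query from above, parameterized on sku, user, and api_key
MODELS_QUERY = """
{
  product(sku: "%s") {
    sku
    name
    three_d_models(page: 2, user: "%s", api_key: "%s") {
      fbx
      web
      android
      win
      viewer
    }
  }
}
"""

def fetch_models(endpoint, sku, user, api_key):
    """POST a GraphQL query; `endpoint` is whatever URL the API docs specify."""
    body = json.dumps({"query": MODELS_QUERY % (sku, user, api_key)}).encode()
    req = Request(endpoint, data=body,
                  headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:
        return json.load(resp)
```

The same function works for any query: GraphQL servers take the query text as data, so there is nothing endpoint-specific to change per request.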


Screenshot of the 3D model of the Maza Parsons Chair being displayed in a web viewer

At the time of writing this post, we have ~10,000 products, distributed across all our categories, that are 3D ready. We are hard at work applying our engineering chops to scale our 3D catalog. We’ll talk about how we’re doing that in another post.

Meanwhile, if you would like to start using this API, e-mail us at Next3DApi@wayfair.com with a short description of how you would like to use our 3D models, and we’ll hook you up with an API key and documentation on how to use the API.

We would love to hear from you!

Happy developing!


If you’re local, and this sounds exciting, we would highly recommend checking out the Reality, Virtually Hackathon hosted by the MIT Media Lab from Oct 7 – 10, 2016. We’re going to be giving participants beta access to the API for the weekend and are also conducting a workshop on how to use the API.

Testing PHP7

PHP7 is out. This isn’t news. It’s been out since last December, with nine minor revisions since then. What’s new is that it’s serving all of Wayfair’s customer-facing traffic.

Performance-wise, PHP7 is the rocket ship people said it would be. We’re nothing but pleased. If you can upgrade, you should do it yesterday. With barely any changes to our code, pages render in about half the time. CPU utilization dropped by about 30%.

Part of what made us nothing but pleased is that we haven’t had to roll anything back after flipping the switch. For context, Wayfair’s PHP presence is 3.5M LoC over 28K files. Coding conventions span several versions of PHP. There’s a similarly diverse range of third-party packages (composer and otherwise). We also use 66 PHP extensions – that is, C/C++ – including some forked third-party and some totally custom code. We serve 30M+ unique visitors a month from all around the world.

In a way, upgrading to PHP7 was less about the language and more about testing a complex codebase. Testing is a tricky thing with dynamically-typed languages. At the risk of stating the obvious, this means that every error will be a runtime error. This is great for fast development, but makes it difficult to ensure code “correctness.”

Here’s a brief guide to upgrading, as we experienced it. YMMV.

Planning

A first step was to get the lay of the land. Use PHP built-ins (phpinfo() or `php -i`) to find out which extensions you rely on and how they’re configured. Check out the official migration guide and the goPHP7 notes on extension compatibility.

With a decent mental map, you can grep your way to a punchlist of compatibility fixes. For example, knowing that the ereg extension has been dropped, apply /\b(ereg_replace|ereg|eregi_replace|eregi|split|spliti|sql_regcase)\(/ to your codebase.
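
If you want that punchlist in machine-readable form, the same sweep is easy to script. A sketch in Python (the `.php` suffix check and the directory layout are assumptions about your tree):

```python
import os
import re

# call sites of functions that left with the ereg extension in PHP7;
# longer names first so the alternation matches them whole
DROPPED = re.compile(
    r"\b(ereg_replace|eregi_replace|eregi|ereg|spliti|split|sql_regcase)\s*\(")

def find_dropped_calls(root):
    """Yield (path, line_number, line) for each suspect call site under root."""
    for dirpath, _, files in os.walk(root):
        for name in files:
            if not name.endswith(".php"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, errors="replace") as fh:
                for num, line in enumerate(fh, 1):
                    if DROPPED.search(line):
                        yield path, num, line.rstrip()
```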

Static Analysis

There are a few ways you can check your code without running it. The above grep example is, technically speaking, a form of static analysis. You can use `php -l` to check basic syntax. php7mar will pair that check with a group of others (mostly regex-based) for common migration gotchas.

If you want to be more robust, Phan and HHVM perform true static analysis on your code. (The Phan readme lists examples of what this includes.) It’s a tempting concept – marrying the rigor of strongly-typed languages with the agility of PHP – but it can be imperfect in practice. We used HHVM with PHP5.6 and found it difficult to maintain alongside our build process. We tested Phan and had some concerns about false positives and other noise on our aging codebase. That’s not to say these tools won’t have value for you, just that they require careful evaluation.

Automated Testing

Automated testing comes in many guises: unit tests, integration tests, browser tests. There are no universal truths in this world. So much depends on how your team works. But codified, reproducible tests can reveal subtle regressions that could impact your customers.

Replay Testing

Use web logs and other information you already have about how people use your product. No unit test coverage? No problem! Take a day’s worth of URLs and curl a test server. Naturally, that means you’ll need to be able to detect when things go wrong.
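
A sketch of the idea in Python, assuming combined-format access logs (the host name and log path are placeholders); the parsing is naive, but it’s enough to turn yesterday’s traffic into a regression suite:

```python
from urllib.error import HTTPError
from urllib.request import urlopen

def get_paths(log_path):
    """Unique request paths for GET lines in a combined-format access log."""
    paths = set()
    with open(log_path) as fh:
        for line in fh:
            parts = line.split()
            # field 6 is the quoted method, field 7 the path
            if len(parts) > 6 and parts[5] == '"GET':
                paths.add(parts[6])
    return sorted(paths)

def replay(host, log_path):
    """Replay every logged GET against a test host; return non-200 results."""
    failures = []
    for path in get_paths(log_path):
        try:
            with urlopen(host + path) as resp:
                code = resp.status
        except HTTPError as err:
            code = err.code
        if code != 200:
            failures.append((code, path))
    return failures

# e.g.: replay("https://php7-test.internal", "/var/log/nginx/access.log")
```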

Monitoring

Does your application already have a safety net? Logging and alerting are invaluable when preparing for large systems changes. (Really, they’re invaluable, period. If your application doesn’t have monitoring in place, stop reading this right now and set it up. Trees can fall in the woods without making a sound.)

Stress Testing

Before you roll out to real traffic, try hammering a test server with ApacheBench. You might have an extension with memory allocation issues that aren’t revealed until you apply load in parallel.
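
If ApacheBench isn’t handy, even a crude thread pool will apply the kind of parallel load that surfaces these bugs. A sketch (the URL and counts are placeholders):

```python
# a crude stand-in for `ab -n 1000 -c 50`: fire GETs at one endpoint from a
# thread pool to surface bugs that only appear under concurrent load
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def hammer(url, requests=1000, concurrency=50):
    """Return the HTTP status code of every response."""
    def one(_):
        with urlopen(url) as resp:
            return resp.status
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(one, range(requests)))
```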

Extensions

Writing custom extensions is a niche activity, but one that presents a significant testing burden. The internals of PHP have changed a bit – the slimmer data structures behind PHP variables are actually responsible for much of the speed gains in PHP7 – though there is code out there to help with some of the changes. A strong suite of .phpt tests, run with Valgrind, is invaluable. Code coverage tools, like gcov, can help validate the breadth and depth of your tests. A sandboxed development environment, like the php7dev vagrant box, can streamline your process.

Even if you’re not rolling your own, you may rely on pecl or other third-party extensions. You may need to isolate test cases to help guide others. Basic comfort with capturing and examining core dumps can go a long way here.

The Community

If you encounter issues, don’t forget about the existing PHP bug tracker. It’s a big community, so it’s not unlikely that someone else has already had your problem. Follow maillists. There’s a whole suite for PHP core. Third-party extensions and libraries may have their own. Familiarize yourself with the GitHub repositories for any third-party code. PHP7 is still relatively new. A bug fix you need might not yet be in a stable release.

Above all, remember that open source projects work when you give back to them. Share findings on maillists, file bug reports, submit patches for improvements.

When You’re Done

A big question is how to determine when you’ve tested something just enough. No method is totally comprehensive. Diminishing returns are a fact of life. We didn’t do anything particularly special in this department. We kept good notes, recorded questions as they arose, and ensured low-friction communication within the working group. At a certain point, you – the group you – will feel comfortable enough with the state of things to move to the next step: asking manual testers to have a look, beginning stress testing, exercising your deploy scripts, and, finally, sharing your work with the world.

Wayfair Engineering FAQ, a conversation with the Chief Architect

Q1: What’s the tech stack in a nutshell?
A: PHP on Linux, and a few data backends, with continuous deployment at the pace of ~250 zero-downtime code pushes a day.

Q2: Whoah, that’s a lot of code pushes! Break that down for me.
A: What I’m actually counting there is git changesets that are going to some type of production system. We group them in batches, so merges and testing don’t get too hairy. The rumors that we’re so cowboy we only test in production are an exaggeration.

Q3: What’s the main idea of Wayfair Engineering?
A: Stay fast while growing to enormous size.

Q4: That sounds cool, but how do you measure ‘enormous size’?
A: Anything you can think to measure: revenue, active customers, daily visitors, site traffic, terabytes of data, engineering headcount, warehouse square footage, linear miles traveled by our packages being delivered, square miles covered by our proprietary transportation network, number of products we sell. The numbers are out there every quarter in our reports, and in the independent venues: however you count, we’re huge. I’d give you the actual numbers today, but I want to be able to leave this page up for a while, and at the rate we’re growing, snapshot numbers get stale fast.

Q5: How do you measure ‘fast’?
A: Two things. First and foremost, speed to market for new business ideas and features. We also look at performance metrics: web page load times, lead times for delivery, etc., etc. We measure everything.

Q6: How’s the relationship between tech and the business?
A: It’s very good, and it’s one of mutual respect and cooperation. It’s a founder-led business. Steve Conine (tech) and Niraj Shah (business), co-founders/owners, provide a good model of that kind of relationship for the rest of the company, and we take our cue from them. Steve is a very business-minded tech entrepreneur who’s also a phenomenal programmer, and Niraj is an unusually tech-savvy business person. They were both engineering majors at Cornell. There’s less of a divide there in the first place than at many companies. The core of Wayfair is an innovative, completely custom e-commerce platform, and management has consistently described that to the outside world as being a big part of the equity value of the company. On the tech side, let me ask people out there: how valuable would it be to you, to have a deep reserve of confidence that you’re working at a well managed business, where your engineering efforts aren’t going to be wasted on some improperly vetted hunch that will send the balance sheet into the red for no good reason? Combine that analytical sense and good judgment with Wayfair’s characteristic aggressive business innovation, and we all feel pretty good about working together. The home goods niche of the economy isn’t going to go on line all by itself: we’re going to need to make it easy for people, and that’s going to come from a strenuous, combined effort by all parts of Wayfair.

Q7: The home goods niche? You’re taking credit for a whole sector of the retail economy going on line?
A: Of course it’s not *all* us, but we do account for a hefty percentage of every dollar that goes on line for the purchase of home goods.

Q8: ‘Innovative, completely custom e-commerce platform,’ you say? Do you want to elaborate?
A: It’s not that we never buy third-party software, but we have a strong bias towards ‘build’, in build-vs-buy discussions, and that has only become more pronounced over time. At the core, there’s no third-party platform like DemandWare or Magento, just an evolving set of data models and architectural principles, a lot of code, and some great components that our developers know what to do with. We’re very patient with the early stages of DIY efforts that aren’t necessarily up to industry standards at first, if we think we can gain a sustainable advantage over time. Most recently, we’ve insourced some big parts of our marketing tech stack, that we formerly used outside vendors or commercial software for. It’s satisfying when we can leave behind vendor-based point solutions to individual problems, and stand up a new part of the living, breathing, integrated whole, which allows us to take advantage of everything our platform has to offer.

Q9: What languages do you write code in?
A: PHP and Javascript are the bread and butter, including our opensource Tungstenjs framework. We have also written important things in Python and C#, and some key components in Java. Objective C for iOS mobile apps, Java for Android, and some emerging language platforms for VR and AR. We use Puppet for configuration management, so there’s a certain amount of Ruby hacking as well, and a lot of systems scripting with Python. Once in a while we write some C or C++, for optimized numerical work, PHP extensions, and opensource infrastructure like Twitter’s Twemproxy (patches) and Statsdcc (from scratch, inspired by things in Node.js and other languages).

Q10: So you opensource code. Where can I find that?
A: Yes, we do that all the time, most of it on https://github.com/wayfair. Check it out!

Q11: VR and AR?
A: Virtual and augmented reality. We’ve got a lot going on in that space, particularly in a small department we call Wayfair Next that Steve Conine is leading. Right now the biggest push is to model the catalogue. From this we get excellent 2D imagery for the site, and next-gen experiences on things like the Google Tango AR devices that are becoming available to the general public in September. If you have a dev kit, check out the ‘WayfairView’ app. Big picture: we want to make it easier and easier to buy, say, an easy chair from your couch. VR/AR is going to be a big part of that.

Q12: What are your data platforms?
A: We’re proud of how far we got as a business, from our founding in 2002 until 2010, on a keep-it-simple-stupid or KISS architecture of relational-database-backed web scripting. SQL Server was and still is our core for OLTP, and it allows us to plug new tools into our integrated operational and analytical infrastructure very quickly. But to drive innovative customer experiences, we now rely on Solr-backed search and browse, and Redis/Memcache for fast access to complex data structures and ordinary caching. We have a modern, on-premise big data infrastructure consisting of Hadoop and Vertica clusters, and some specialized, vertically-scaled big-memory and GPU machines, for analytical workloads. We do our machine learning, statistical analysis and other types of computation on that setup, and funnel the results to the ‘Storefront,’ as we call it, and to the operational business systems. RabbitMQ and Kafka provide a kind of circulatory system for the data, and they are gradually replacing what traditional ETL we have. As I speak with other architects and CTOs around the industry, I actually think it’s pretty rare, at the biggest and most successful companies, to junk your relational databases, even when you’re many years into adopting these next-generation auto-sharding platforms. We’re fine with that.

Q13: OK, so with all these relational databases, do you use an ORM?
A: There’s a joke around Wayfair Engineering, that if you use the word ‘ORM’ in a positive way, you might notice a sudden drop in your career prospects. Joking aside, I do think excessive reliance on ORMs tends to foster careless data access code, and excessive round trips to the back end. We mostly use the ‘phrasebook’ pattern and hand-make our data access layer. Besides, it’s not as if an easily generated mock would really help you: by the time you’ve horizontally partitioned your data to the extent we have, ORMs are close to useless. We try to make it easy for everybody to develop against a readily accessible development database infrastructure. On the other hand, there is actually a bit of Hibernate, SQLAlchemy and both the Entity Framework and nHibernate in the Java, Python and C#, respectively. ORMs on language platforms like those can have some engineering benefits in addition to the convenience features, such as connection pooling, caching of various kinds, etc. None of that works, or at least works well, in PHP, so we just use PDO like the rest of the PHP world, and we’re experimenting with SQL Relay for some other kinds of optimization and encapsulation of the details of how we talk to the databases. At the higher levels, we have some pretty handy traits, which are the multiple inheritance thing in PHP, to inject common functionality into our codebase in a DRY way. No fanaticism, one way or the other.
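
For readers who haven’t seen it, the phrasebook pattern is just hand-written SQL collected in one named map per module, with a thin executor instead of an ORM. A toy sketch, with SQLite standing in for SQL Server and made-up query names:

```python
import sqlite3

# the "phrasebook": every query this module needs, hand-written, in one place
PRODUCT_PHRASEBOOK = {
    "product_by_sku": "SELECT sku, name, price FROM products WHERE sku = ?",
    "products_in_class": "SELECT sku, name FROM products WHERE class_id = ?",
}

class ProductData:
    """Thin data-access layer: look up SQL by name, bind parameters, run it."""
    def __init__(self, conn, phrasebook=PRODUCT_PHRASEBOOK):
        self.conn = conn
        self.phrasebook = phrasebook

    def fetch_all(self, phrase, *params):
        return self.conn.execute(self.phrasebook[phrase], params).fetchall()
```

The win over an ORM here is that every query is visible, reviewable, and tunable as plain SQL; the cost is writing it yourself.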

Q14: What are your thoughts on web services?
A: We have a handful of important web services behind the scenes at Wayfair. Search and browse for products, orders, and some other things, are powered by Solr, which is an opensource, Java-based web service that we have patched a few times for our own needs. Our Python-based customer recommendations and search enhancements, and our C#-based inventory service, deliver a lot of value. There are other examples.

Q15: Do you have any other kinds of services?
A: Good question. Some of the highest-value systems we have are data processing services that ingest data from our messaging platforms (Rabbit, Kafka) and push value-added results to where we can use them to move faster on behalf of customers and suppliers. There are some DIY ones, but most live in the frameworks Celery, Storm or Spark. You could call our caching system a service. It’s a composite Redis/Memcache/consistent-hashing thing with smart proxies everywhere. You’re using regular Redis and Memcache commands, not going through an adapter layer, but that’s true of Elasticache too, so we’re far from eccentric in this way. Unlike in Elasticache, the sharding is taken care of for you. We built it on the back of work by Twitter, Pinterest and Instagram, but we added some innovative elements of our own. It has some similarities with Facebook’s McRouter, which is pretty awesome, and which we might well have chosen instead, if it had Redis support.
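
For the curious, the consistent-hashing part of a setup like that works roughly like this generic sketch (not Wayfair’s actual proxy code): each cache node is hashed onto a ring many times, a key belongs to the first node at or after the key’s hash point, and adding a node remaps only a small slice of keys instead of reshuffling everything.

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hash ring with virtual nodes."""
    def __init__(self, nodes, replicas=100):
        self.replicas = replicas
        self.ring = []  # sorted list of (point, node)
        for node in nodes:
            self.add(node)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node):
        # place `replicas` virtual points for this node on the ring
        for i in range(self.replicas):
            bisect.insort(self.ring, (self._hash("%s#%d" % (node, i)), node))

    def node_for(self, key):
        # first virtual point at or after the key's hash, wrapping around
        idx = bisect.bisect(self.ring, (self._hash(key), ""))
        if idx == len(self.ring):
            idx = 0
        return self.ring[idx][1]
```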

Q16: What about micro-services, or SOA?
A: We’re not really into all that, although as I said, we have some pretty awesome services back there. Is your code base, or a big part of it, really a monolith, in any pejorative sense, if it has decent separation of concerns, and you can deploy small modifications to any layer of it without rebuilding the whole thing, and without down time? We’ve had all of that for years. Many of the best big tech companies have largely monolithic code bases, and they’re too busy adding awesome features to the core to want to replatform. But don’t get me wrong: there are some cool micro-service set-ups out there. If we keep developing very valuable *macro* services at the rate we’ve been doing that, we’ll eventually have so many of them that micro-service-style orchestration techniques will start to make sense for us. Our Python services are the most numerous ones we have, and we’re already experimenting with Docker, Mesos and Kubernetes, for them. It’s just that over time I have seen the importance of web services diminishing, as data platforms become easier to scale horizontally, and server-side-of-the-front-end programming becomes easier and more powerful. The data is just too readily available for these layer cakes of http indirection to make any sense in a well-designed, modern setup.

Q17: Why do you like PHP?
A: I’m not sure I *do* like it, but it attracts the right kind of people: neither ivory-tower language snobs, nor hipster code posers. No fanatics, but no luddites either. We have some fun with all that, when we’re trying to make sure the culture stays strong. I tried to depict both sides of the ivory-tower/hipster thing in this picture a few years ago, in a comic-strip-style blog post on our Python ops: I think the tweed jacket combined with the Brooklyn t-shirt really gets the point across. (To MIT professors, and to my former neighbors in Park Slope: I kid because I love!)
eye-rolling sysadmin
With every other language, there are a lot of fanatics who think it’s the answer to every problem, and will wear your ears out explaining why. Even people who love PHP don’t think that about PHP. It’s just a solid platform for web development, the kind the tattooed web ops expert in the picture would think is a fine thing to have running on his servers. There’s no server lifecycle management to worry about, and practical problem-solvers gravitate to it. It’s also very readable, even if you don’t know it well, so let’s all just pause to give it the big thank you it deserves for killing Perl (with a substantial assist from Python, of course, on the systems scripting side). That needed to happen, and in retrospect it’s obvious that neither Java nor .NET had the slightest chance to do it. 80% of the internet runs on PHP, including a bunch of the biggest sites, which we’re rapidly becoming one of. It’ll do.

Q18: So PHP is a cultural thing?
A: Yes. Let me draw an analogy, which I sometimes use in talks for new hires. Do you remember the scene in Star Wars, when Luke Skywalker sees the Millennium Falcon for the first time, and says “What a piece of junk!”? Han Solo responds, “She’ll make point five past lightspeed. She may not look like much, but she’s got it where it counts, kid. I’ve made a lot of special modifications myself.” H/t @danmil for that analogy. Our PHP-and-friends stack might seem inelegant to language snobs, but it takes us where we want to go fast. The Millennium Falcon is still a space ship, after all! Let’s try a ‘car’ analogy: anyone who has ever done the coding equivalent of putting a Porsche engine in a VW Golf, or can show us the chops and attitude for that and wants to try, is welcome at Wayfair Engineering. Adding lambdas to the opensource php_mustache extension, which we did, is a great example of something that fits that mold. If you’re more of an “I won’t drive at all unless I have a Lamborghini” person, you should seek a company more willing to splurge on the shiniest tools, before thinking about whether there is really a need. If your mindset is more “My tank-like SUV keeps me safe, and I don’t care that it handles poorly,” there are plenty of J2EE shops out there.

Q19: OK, I’m getting the picture. Why use the other languages at all?
A: PHP is not a great fit for every programming task. Sometimes you need a long-running daemon that can respond to requests with little startup/wake-up overhead. We have excellent services in Python, C# and Java for that. The C# code grew out of our early-stage Microsoft heritage, but we are now doing some phenomenal things with it, and we have added some very elegant functional programming in F#. Python is our favorite language for data science, machine learning and the like, and it combines low-latency service qualities, the way we run it, with the convenience and productivity of loose typing, and that super-handy mix of the functional and object-oriented styles. Java allows us to tap directly into platform-level infrastructure such as Solr, Elastic Search, Hadoop, Kafka and Storm.

Q20: You’ve been talking about speed, and you mentioned that you measure web performance earlier. Can you give a bit more detail on that?
A: Sure. Web performance measurement is basically a 3-legged stool: RUM, or real user monitoring; synthetic monitoring, which is externally-located bots that measure page speed; and server-side execution metrics. We have a centralized performance team that makes sure we have the right tools and dashboards to be proactive about all of that. They also work on framework-level changes that can make a big impact, when those aren’t naturally more specialized with another group. They play a strategic role for us, but that team wouldn’t be very effective if we didn’t have a good culture of thinking about web performance in a broader context of putting a great user experience into the hands, and onto all the devices, of our customers. The RUM instrumentation gives us great insight into what our customers are actually experiencing. I’m not original with this name, but my joke name for that department is the RUM distillery, and you can imagine the joking about operating precise instruments in the right state of mind. We have some cool ‘responsive’ experiences here and there, but the RUM tells us that our decision to emphasize adaptive delivery over responsive design was a good one. Check out Wayfair mobile web, on a small iOS or Android device, and you’ll see what I mean. Our native apps are fast too, but that’s a separate discipline, where server-side execution and expert Java and Objective C programming are the key components.

Q21: Thoughts on the cloud?
A: We run a few elastic workloads on public cloud infrastructure, but that’s a drop in the ocean of Wayfair tech. Don’t get me wrong: if we were starting Wayfair today, we would do it on public cloud infrastructure, for the speed-to-market aspect, for sure. In fact, Wayfair was very briefly a Yahoo! Store in 2002, before Steve built the first version of the platform to run in a colocated cage in a data center. We run colo-style to this day. Wayfair was already a multi-hundred-million-dollar company before the cloud was a thing. We think about it, and do some analysis and experiments periodically. But ultimately our traffic is not extremely spiky, and we grow into the holiday spike provisioning pretty early the following year. The economics, control and convenience have not yet aligned to make it worthwhile to go through a big process of switching. We’re not big enough to have whole data centers, at least not yet, but we have our /22 ARIN range, and we use the border gateway protocol to make sure we have the kind of relationship with our ISPs where we have a lot more control than when we were smaller. Wrestling with these types of configurations is interesting work, and it attracts really good network and systems people. Let’s face it: the public cloud is awesome, but when the problem is under the hood of the hypervisor, you’re in for a frustrating day at the office. We do a lot of virtualization, and we like it, but when various types of systems become very cookie-cutter or have certain types of requirements, we run physical boxes. Virtualization adds overhead, and it’s one more thing that can break. If you can provision basic types ahead of demand, the IAAS side of the cloud becomes just another provider, and of course the higher-level services are fraught with problems of vendor lock-in. The way cloud adoption presents itself to new or small companies, it’s kind of ironic that we’re moving too fast to be bothered with moving to the cloud. But never say never.

Q22: OK, sign me up. How do you succeed in Wayfair Engineering?
A: It’s hard to answer that question without using some cliches, but I’ll try to use the ones that are characteristic and relevant. Programmers with the polyglot, DevOps-savvy innovator background tend to do really well here. Boyscout principle for refactoring, rather than a penchant for from-scratch rewrites. Bias for action: if you’re not embarrassed by the first version, you waited too long to ship it. Just ship! If you find yourself tempted by a months-long science project, don’t do it. Instead, fast-follow/adopt something that’s already here in the general area (whether we wrote it or it’s open source from outside), and innovate at the margins for now. But when you see a quick win that you think is on a path to a real breakthrough, pounce.

Boston events during the week of 25 July 2016, on augmented/virtual reality, data science and PHP

There are three events in Boston this week where Wayfair engineers will be speaking.

On Monday, July 25th, I am on a panel called “What’s Hot in Tech: Providing Next Gen Experiences,” which is part of the MITX E-Commerce Summit 2016, being held all day, from 8:30 to 5, at Google in Cambridge. Details here: http://mitxecommerce.org. Fair warning: paid admission. I’ll be speaking about Wayfair Next and our augmented and virtual reality applications. The panel is from 1:45 to 2:30 in the afternoon (details here: http://mitxecommerce.org/session/tech/), but Shrenik Sadalgi of Wayfair Next and I will be there all day, demo-ing next-gen technology, including our Tango app WayfairView, which is already in the Tango app store in Google Play, and will be available to the general public when the Lenovo Tango devices hit the market in September.

On Tuesday, July 26th, at 7:30 pm, Robby Grodin of our Marketing Engineering group is speaking on data science at General Assembly in Boston: https://generalassemb.ly/education/data-science-lets-break-it-down/boston/27296.

On Wednesday, July 27th, it’s open mike night at the Boston PHP Meetup, hosted at Wayfair. Pizza at 6, talks from 7-8:30, Q&A, wrap-up, out for beers? after that. Adam Baratz, George Carrette and I will get the ball rolling with some material about PHP 7 and a couple of other things. If you’ve been wondering why we haven’t been blogging about PHP7, it’s because we’re just rolling it out now, after wrestling with some interesting APC issues. Details on Wednesday.

Statsdcc

Statsdcc is a Statsd-compatible, high-performance, multi-threaded network daemon written in C++. It aggregates stats and sends the results to backends, especially Graphite. We are proud to announce that we are open-sourcing it today. Check out the code at https://github.com/wayfair/statsdcc.

At Wayfair we’re big believers in “measure anything, measure everything,” as the “Statsd is reborn in node.js at Etsy” announcement put it. We do application performance monitoring with the open-source tools Graphite, Grafana, and the ELK stack (Elasticsearch/Logstash/Kibana), plus some homegrown tools. Until recently we had been using Flickr/Etsy’s second-generation, node.js-based Statsd to collect metrics for Graphite. As the volume of these metrics increased, we noticed inconsistencies in the data and realized that some metrics were being dropped. Long story short, we tried some architectural changes, scaling Statsd and Carbon horizontally (details below), but as the operational complexity increased, we began to wonder why we needed so many boxes. We found a bottleneck in the way Statsd buffers and flushes data to Carbon, and we decided we needed a different implementation.

Alternatives:

There are already quite a few alternative Statsd implementations available, but none of them really came close to meeting all of our needs. Brubeck, from GitHub, is one we found interesting because it promised high throughput. Unfortunately, it was released after we had Statsdcc implemented and ready for production, so at that point we had no reason to take Brubeck and extend it to support the features we needed. We did, however, borrow from Brubeck the idea of integrating a web server to view application health. Statsdcc and Brubeck try to solve similar problems, and I would recommend checking out all of these implementations and picking the one that best fits your needs.

TL;DR

If you’re interested in what we tried before starting to hack the C++, read on.

Attempts at Horizontal Scaling with Statsd:

Statsd performs aggregations on incoming metrics and sends the aggregates to a Carbon process, which in turn saves received metrics to a Whisper database.
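To make that division of labor concrete, here is a minimal Python sketch of the wire format and per-flush aggregation Statsd performs. This is illustrative only, not Statsd’s actual code: the metric names and the exact summary fields are assumptions for the sketch. Counters like `req:1|c` are summed, and timers like `load:100|ms` are summarized, once per flush interval, before the results are shipped to Carbon.

```python
from collections import defaultdict

def aggregate(lines):
    """Aggregate raw statsd-style lines into per-metric summaries."""
    counters = defaultdict(int)
    timers = defaultdict(list)
    for line in lines:
        name, rest = line.split(":", 1)
        value, kind = rest.split("|", 1)
        if kind == "c":          # counter: sum the increments
            counters[name] += int(value)
        elif kind == "ms":       # timer: collect samples for summarizing
            timers[name].append(float(value))
    summaries = {name: {"count": total} for name, total in counters.items()}
    for name, values in timers.items():
        values.sort()
        summaries[name] = {
            "count": len(values),
            "upper": values[-1],
            "mean": sum(values) / len(values),
        }
    return summaries
```

For example, `aggregate(["req:1|c", "req:2|c", "load:100|ms", "load:200|ms"])` rolls four datagrams up into one counter total and one timer summary per flush.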

To scale, we use multiple Statsd/Carbon chains, each writing to a different disk. Proxy daemons hash metric names to determine which chain to use, and which proxy daemon a client hits is determined by round-robin DNS. Consistent hashing ensures metric names are well balanced across the chains.
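As a rough illustration of the chain-selection step, here is a toy consistent-hash ring in Python. The chain names, replica count, and choice of md5 are all assumptions for the sketch, not our proxies’ actual implementation: each chain gets many virtual points on a ring, and a metric name maps to the first point at or after its own hash.

```python
import bisect
import hashlib

class HashRing:
    def __init__(self, chains, replicas=100):
        # Place `replicas` virtual points per chain on the ring, sorted by hash.
        self.ring = sorted(
            (self._hash(f"{chain}#{i}"), chain)
            for chain in chains
            for i in range(replicas)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def chain_for(self, metric_name):
        # First ring point at or after the metric's hash, wrapping around.
        idx = bisect.bisect(self.keys, self._hash(metric_name)) % len(self.ring)
        return self.ring[idx][1]
```

The payoff of the ring over plain modulo hashing is that adding one chain to N only remaps roughly 1/(N+1) of the metric names, rather than nearly all of them.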

The diagram below depicts the architecture.

[Diagram: statsd_architecture]

Issues:

A year ago we noticed that a certain set of metrics was being dropped, resulting in inconsistent monitoring data. We traced this to the UDP receive buffers on the Statsd hosts maxing out, so we tried adding more Statsd processes with larger UDP receive buffers.

However, adding a new process is complicated. When a new Statsd instance is added, consistent hashing at the proxies re-routes some metrics to the new process, resulting in duplicate files on different Carbon nodes for the same metric: one holding the old data and one the new. To save space, and for Graphite to show the complete history, the old Whisper data files have to be merged into the new ones.

In the end we were unhappy with how much traffic an individual node could handle. We discovered that the problem was a design decision in Statsd: the same thread is responsible both for buffering incoming metrics and for performing aggregations on them at every flush interval. While computing aggregations, the thread stops listening for incoming metrics, which pile up in the UDP receive buffer. As the rate of metrics increases, the UDP buffer overflows and metrics are dropped. We use single-threaded, event-loop frameworks in a few places (Node.js-based daemons for a couple of things, Python-based gunicorn+gevent for several), and we have seen this type of problem before. The event loop doesn’t help you when a long blocking operation brings processing to a halt. Sometimes we work around or solve such problems within the event-loop paradigm, and sometimes we take a completely different approach.
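This failure mode is easy to reproduce in a toy model. The sketch below is illustrative Python, not a benchmark, and all the parameters are made up: it simulates a single thread that drains a bounded “UDP buffer” on every tick except flush ticks, when it is busy aggregating. Those pauses are enough to make the buffer overflow and drop metrics even when the drain rate matches the arrival rate.

```python
from collections import deque

def simulate(arrivals_per_tick, buffer_size, drain_per_tick, flush_every, ticks):
    """Return (processed, dropped) counts for a single-threaded daemon."""
    buffer = deque()
    dropped = processed = 0
    for tick in range(ticks):
        for _ in range(arrivals_per_tick):
            if len(buffer) >= buffer_size:
                dropped += 1          # kernel drops the datagram
            else:
                buffer.append(1)
        if tick % flush_every == flush_every - 1:
            continue                  # thread is busy aggregating: no drain
        for _ in range(min(drain_per_tick, len(buffer))):
            buffer.popleft()
            processed += 1
    return processed, dropped
```

With arrivals and drain both at 10 per tick and a 20-slot buffer, `simulate(10, 20, 10, 5, 50)` loses metrics steadily, while the same workload with no flush pauses drops nothing.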

After finding the root cause, we decided to rewrite Statsd as a multi-threaded application, with a focus on effective use of socket I/O and CPU cycles.

Statsdcc:

Statsdcc is an alternative implementation of Statsd, written in C++ for high performance. In Statsdcc, one or more server threads actively listen for incoming metrics and distribute them among multiple workers using the formula worker = hash(metric name) % #workers. Worker threads read from their dedicated queues and update their ledgers until signaled to flush by a clock thread. Upon receiving this signal, the worker threads hand off their ledgers to short-lived flush threads and continue with fresh ledgers until the next signal. To avoid lock contention, and to pass metrics between the server and worker threads faster, we use Boost’s lock-free queues.

[Diagram: statsdcc_thread]
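The fan-out rule above is simple enough to sketch in a few lines of Python. The hash function here (`zlib.crc32`) and the metric names are stand-ins for illustration, not what Statsdcc actually uses; the point is that a stable hash keeps routing deterministic, so each metric name always lands on the same worker and that worker’s ledger needs no locking.

```python
import zlib

def worker_for(metric_name, num_workers):
    """Shard a metric onto a worker queue: worker = hash(name) % #workers."""
    return zlib.crc32(metric_name.encode()) % num_workers

# Server threads would push each metric onto its worker's dedicated queue:
queues = [[] for _ in range(4)]
for name in ["web.checkout.tti", "web.ttfb", "db.query.ms"]:
    queues[worker_for(name, 4)].append(name)
```

Because all samples of, say, `web.checkout.tti` go through one queue, only one worker ever touches that metric’s running aggregate between flushes.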

We have not gotten rid of consistent hashing, because we did not want to lose the ability to scale horizontally. However, to solve the scaling problem in our previous architecture, where adding a new process required cleanup on the Carbon end, we moved consistent hashing from the proxies to the aggregators. The proxies distribute incoming metrics among multiple aggregators using the formula aggregator = hash(metric name) % #aggregators. Each aggregator then uses the consistent hash to send each aggregated metric to its respective Carbon process. The difference from the previous architecture is that each Carbon process has more TCP connections open, one per aggregator. However, unlike Statsd, which reopens its connection on every flush, Statsdcc reuses established TCP connections, avoiding the overhead of repeated TCP handshakes. The diagram below describes the current architecture.

[Diagram: statsdcc_architecture]
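The two-stage routing just described can be sketched in Python. This is an illustrative sketch, not Statsdcc’s code: the md5 choice, node names, and replica count are assumptions. Proxies shard by simple modulo across aggregators, and each aggregator then walks a small consistent-hash ring to pick the Carbon node for a metric.

```python
import hashlib

def _hash(key):
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

def aggregator_for(metric, num_aggregators):
    """Stage 1, at the proxies: plain modulo sharding across aggregators."""
    return _hash(metric) % num_aggregators

def carbon_for(metric, carbon_nodes, points=100):
    """Stage 2, at the aggregators: consistent hash onto a Carbon node."""
    ring = sorted(
        (_hash(f"{node}:{i}"), node)
        for node in carbon_nodes
        for i in range(points)
    )
    h = _hash(metric)
    for pos, node in ring:
        if pos >= h:
            return node
    return ring[0][1]  # wrap around the ring
```

Modulo is fine at stage 1 because aggregators are stateless between flushes; the consistent hash is reserved for stage 2, where each Carbon node owns Whisper files on disk and remapping is expensive.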

Statsdcc can handle up to 10 times the load (up to 400,000 metrics/sec) of Etsy’s Statsd. A single Statsdcc aggregator instance now handles all of our production traffic, in contrast to the previous 12 Statsd instances. Statsdcc has been in production for about 7 months. We hope more people will find it as useful as we have at Wayfair.

Tungsten in the news

There’s a great interview with our own Matt DeGennaro by Paul Krill of InfoWorld that came out a few days ago. The topic is Tungsten.js, our awesome framework that ‘lights up’ the DOM with fast, virtual-DOM-based updates, React-style, and can be integrated with Backbone.js and pretty much any other framework you want. It’s spiffy, it has a logo,
[Image: the Tungsten.js logo]
we do it GitHub-first, and we’re getting a lot of mileage out of it at Wayfair. Matt mentions the templating aspect of our composite system: we use server-side PHP, including Mustache templates, and then our client-side pages, also including Mustache templates as needed, get dynamic updates via Tungsten.js. That works great for us because Mustache has implementations in both JavaScript and PHP, among many other languages.

What’s that you say? The PHP implementation of Mustache is not fast enough for you? Well, we’ve got you covered! Adam Baratz just put up a blog post yesterday on a server-side optimization that’s been working well for us. We use John Boehr’s excellent PHP mustache extension, which is written in C++, and is much faster than vanilla PHP Mustache. Inspired by another snippet of PHP/Mustache code, we’ve even added lambdas to that, as Adam explains. I had to do a double-take the first time he explained that to me. As far as I can tell, the PHP community, of all groups of web programmers, is the least likely to care about lambdas in particular, and any kind of functional programming in general. And yet, we’re finding lambdas very useful for our globalization efforts, and we’re starting to use them for other things as well.

We’re still working on a date, but Adam, Matt and Andrew Rota will be giving a talk on all of this at the Boston Web Performance Meetup, hosted at Wayfair, in the near future.

Wayfair Labs in the news

Scott Kirsner has a terrific piece about the tech talent wars in Boston; it ran in Beta Boston on Friday and then in the print edition of the Boston Globe on Sunday, October 26th. It features Wayfair Labs, which is our hiring and onboarding program for level 1 engineers in most of the department (a few specialized roles excepted). I am the director of it, so if you have any questions, please reach out.
