Consistent hashing with Memcached or Redis, and a patch to libketama

Summary

This is a how-to for consistent hashing of keys in key-value stores, with a focus on cache servers.  We have code to share, in the form of a set of patches to ketama on GitHub.  At Wayfair, we have been using this code, or slightly less evolved versions of it, with our production Redis servers for a long while, and with our production Memcached servers for a short while.  It is now mature enough that we can release the code and describe our results.  The code is cache-server agnostic: there is nothing specific to Memcached or Redis in it, and it could be used with other servers.  The internal project that led to this was one of those little efforts intended to smooth over some blips in our fault log, and it has now produced a very useful set of building blocks for scalable systems.

Before I get started, I have three public messages, for three audiences:

  • To the non-free key-value-store salesmen of the world: one of my main purposes here is to continue to be able to use free key-value stores, as my team helps to scale Wayfair up into larger and more numerous data centers.  This relationship that you keep trying to initiate is just not going to work.  Please stop calling me.
  • To Brad Fitzpatrick and the other authors of Memcached: Memcached, you’re still the one, after all these years!  You rock the low-latency performance like nobody else.
  • To the author of Redis, with gratitude I say this: I’m sure Redis cluster is going to be an awesome thing for a lot of users.  But Salvatore, don’t sweat it too hard.  We’ve got you covered, with this technique, for some scaling scenarios that people can go very far with.  As long as plain old Redis is fast and awesome, we won’t be needing Redis cluster in order to go wide and large with Redis.

Background

Consistent hashing is an idea that has been around for a while. It solves a common problem with sharded cache server clusters, which can be disrupted when a single node fails and sets off a festival of rehashing, key shuffling and cache-filling database calls. The history of the idea and its implementations is roughly this: Danny Lewin, an Akamai founder, thought of it. (I won’t dwell on the details, but if you don’t know the story, look it up: in addition to being a hero of applied mathematics for this invention, he is a hero, full stop, for what he tried to do just before he was killed during 9/11.)  A group of MIT professors and students including him wrote a paper in 1997 (Karger, Leighton, Lewin et al.), and Akamai has an in-house implementation.  Various people, at various times, have written versions in Perl, Python, PHP, and other languages. Richard Jones, then CTO of Last.fm, wrote an open source C library called ‘ketama’ in 2007, with wrappers or ports in a few languages.  Akamai alumni at Basho have incorporated an Erlang version of the idea into Riak.  Jeremy Zawodny and others have written excellent blog posts describing configuration niceties of the way the idea is used at Craig’s List and other places.  I have put a list of links at the end of this post.

Of the implementations we looked at, ketama has the broadest language/platform coverage.  We started using it early in our process of design and prototyping, and although we have found reasons to patch it, we never felt compelled to switch to something else or write an alternative from scratch.  My thanks go to Peter Newell, now of Srsly Software, for finding it, and for convincing me that it was going to work.  He did that by making sure you could write the keys from client code in any one of the supported languages, read them from any of the other languages, and be sure they would always be on the right servers.

I’m not sure about the other implementations, but the Riak and ketama versions strictly observe the original Akamai design of scattering a large number of points (160 of them) around a unit circle, a new set for each shard.  Ketama calls this circle the ‘continuum’.  I call it the ‘ring’.  This is done so that when a node disappears, the keys from the vanished shard are distributed to the other members of the ring in even quantities.  If you have n shards, with m as the maximum amount of memory used on any one of them, you only need m/(n-1) extra space on each shard to hold all the extra keys for a single node failure, 2m/(n-2) to support 2 node failures, etc.  If you were to divide the key space naively as with a pizza cutter, with one slice per shard, you would need room for 2m on each shard (your own m, plus a failed neighbor’s m) to support a single node failure, 3m to support 2 failures (worst case, when the failed nodes are contiguous in the ring), etc.
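
To make the ring concrete, here is a minimal sketch of the idea in Python: every server contributes many points to a sorted ring, and a key belongs to the first server point at or after the key’s own hash.  It follows ketama’s shape (160 points per server, MD5) but it is only an illustration; it is not wire-compatible with the real ketama continuum, and the server addresses are made up.

    import bisect
    import hashlib

    class HashRing:
        """Toy consistent-hash ring: many points per server, keys walk clockwise."""

        def __init__(self, servers, points_per_server=160):
            self.ring = []
            for server in servers:
                for i in range(points_per_server):
                    self.ring.append((self._hash(f"{server}-{i}"), server))
            self.ring.sort()
            self.points = [point for point, _ in self.ring]

        @staticmethod
        def _hash(key):
            # MD5, like ketama's default; four bytes of the digest are plenty here
            return int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")

        def get_server(self, key):
            # first ring point at or after the key's hash, wrapping around the circle
            idx = bisect.bisect_left(self.points, self._hash(key)) % len(self.points)
            return self.ring[idx][1]

    ring = HashRing(["10.0.0.1:11211", "10.0.0.2:11211", "10.0.0.3:11211"])
    print(ring.get_server("user:42:cart"))

Remove one server from the list and only the keys that hashed to its points move to other servers; everything else stays put, which is the whole trick.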

What’s up with the name ‘ketama’?  A Google search tells us that ‘ketama’ can be a type of marijuana or a flamenco band.  Last.fm is a music site, so the flamenco interpretation is tempting.  On the other hand, ‘hashing’ and ‘hash’ make an obvious play on words.  A quick look at the code removes any ambiguity.  There are functions in the library called ‘ketama_roll’ (create) and ‘ketama_smoke’ (destroy).  Draw your own conclusions about people in the music business, their sly senses of humor, and what they might be smoking at parties.  Not to mention that there were a couple of half-baked things in the code.  Heh.  Absent our patches, it used to read from the configuration files too often, didn’t use shared memory in exactly the way we wanted, and could not modify its configuration on the fly.  We’re actually not making a pull request yet, because we have a not-quite-fully-tested version that eliminates shared memory from the system altogether.  But these are quibbles, about problems we were able to solve easily, or to work around with wrapper code up the stack.  Mad props, and many thanks, to Mr. Jones for his excellent library.

Pre-existing state at Wayfair

Before implementing the consistent hash, we had been using Memcached at Wayfair in a very simple configuration: twin large-memory machines, one master, one slave.  Why not a sharded system?  Simple.  We could afford the big boxes, and we hadn’t ever had a very large key space.  But we knew that we would eventually come to a maximum height for those twin mounds of memory, and that we would want to move to a larger number of cache servers that we could scale horizontally.  As it says in all the Memcached documentation: if you need more space, just add more nodes!  So we decided to get ahead of the problem.

Why both Memcached and Redis, you may ask?  The short answer is that we’re in a period of experimenting with key-value stores for various purposes.  For now, we’re using Memcached for completely transient in-memory data, where every cache key has a lazy-initialization path.  We use Redis for in-memory data that is backed by disk storage, where each Redis node replicates to a slave, which writes to the disk.  The master does not write to disk, and when failover occurs to the slave, they switch roles.  This is typically, but not exclusively, for values that do not change very often, and are populated by a batch process of some kind.  If there’s no lazy-initialization path for a key, it’s going in Redis, at least for now.
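
For the Memcached side, ‘lazy-initialization path’ just means every read is a get-or-compute.  Here is a rough sketch of that pattern; the pymemcache client and the key/loader names are stand-ins, not our actual wrapper code.

    from pymemcache.client.base import Client

    cache = Client(("memcached-node", 11211))

    def get_or_compute(key, loader, ttl=3600):
        """Read-through cache: on a miss (or a vanished node), fall back to the source."""
        value = cache.get(key)
        if value is None:
            value = loader()          # e.g. a database query
            cache.set(key, value, expire=ttl)
        return value

Keys that have no loader like this are exactly the ones that end up on the Redis side.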

Design considerations

How should the clients find and talk to the nodes of the cluster?  Put the list of nodes in a configuration file, and you’re all set, right?  Heh.  After looking at the literature and giving the matter some thought, we came up with three candidate architectures.

  1. 3-tiered design. All the clients talk to a central dispatcher (with a SPOF-avoiding twin, naturally), which keeps track of which nodes are in the cluster, probably with some sort of monitor background thread that updates an in-memory configuration. The dispatcher helps with race conditions and ‘thundering herd’ problems in failover situations, and reduces the number of connections among different servers.
  2. Fully distributed configuration. The clients are occasionally updated with fresh lists of all the nodes in the cluster, and they are set up to try a progression of candidates if their default choice is unavailable.  This is also good for thundering herds and race conditions.  However, it might lead to a lot of running through lists of possible shard locations under stress.  This seems wasteful, but is perhaps not disastrous.
  3. Distributed configuration, but with an external nanny, or let’s say an ‘animal trainer’.  In this scenario the clients would check a primary shard location, possibly with a locally configured failover-to-secondary option.  But the overall system would depend on an independent monitor of the members of the ring. Zookeeper is good at this kind of thing, so we started experimenting with it in this role.

Our decision-making process was primarily focused around failover scenarios.  We had two very different ones to consider, between Memcached and Redis.  We’ll call Memcached, as we use it, the lazy cache, and Redis the persistent cache.  With the lazy cache, if you can’t find the node on the ketama ring, you need to look for the next node on the ring. You don’t need a master-slave configuration, because the keys migrate lazily to different arcs on the ring when a node disappears.  However, once you’ve done that a few times, you might not want all your clients to be repeating the process of checking with the dead node on every request.  So we asked Zookeeper to monitor which members of the ring are available, and to notify the agents on the client nodes when the list of members of the ring has changed.
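
The client-side piece of that is small.  Here is a hedged sketch of the idea using the kazoo Zookeeper client: each live cache node registers an ephemeral znode under a ring path, and every client wrapper watches that path and rebuilds its ketama continuum when the children change.  The znode path and the rebuild function are hypothetical, not our production layout.

    from kazoo.client import KazooClient

    zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
    zk.start()

    def rebuild_ring(servers):
        # In the real wrapper this would rewrite the ketama server list and
        # re-create the continuum, so new requests skip the dead node.
        print("ring members:", sorted(servers))

    @zk.ChildrenWatch("/cache/memcached-ring")
    def on_ring_change(children):
        rebuild_ring(children)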

With the persistent cache, you can’t do that, because there is no lazy initialization path for the cached items.  We configure our persistent cache ring with the consistent hash for simplicity’s sake, and so we can scale all this out horizontally as far as we like.  But if an arc disappears, the failover is to a waiting hot spare slave backup, which can be hustled into service quickly by Zookeeper, whose agents give the IP address of the shard to the slave node and turn it into the master.
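
For the Redis ring, the promotion step itself is tiny once Zookeeper has decided to act.  A sketch with redis-py, with hypothetical host names; the surrounding orchestration, moving the master’s address over to the promoted node, is what our Zookeeper agents actually handle.

    import redis

    def promote_hot_spare(slave_host, slave_port=6379):
        """Turn a waiting slave into a master (SLAVEOF NO ONE)."""
        spare = redis.StrictRedis(host=slave_host, port=slave_port)
        spare.slaveof()   # no arguments: stop replicating, start acting as a master
        return spare

    promote_hot_spare("redis-shard3-slave")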

Implementation

The first operating system on which we started to use ketama was FreeBSD. The library wouldn’t compile, so I wrote some autotools scripts, and hacked the code to get it to build on Mac OS X, Solaris, FreeBSD, and Linux.  Ketama uses MD5 as its hashing algorithm by default, but we discovered a patch file already in the tarball that implemented a faster algorithm, FNV-1a.  So Steve and Viktoras from our team merged all those changes together, ran the resulting system through Valgrind to look at memory usage, and made some improvements.  They also changed the way the code loads its configuration, to make it easier to change the list of servers on the fly.
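
The patch in question is C, but for reference, FNV-1a itself is tiny; the 32-bit version is just an xor and a multiply per byte.  Here is the algorithm sketched in Python:

    def fnv1a_32(data: bytes) -> int:
        """32-bit FNV-1a hash."""
        h = 0x811C9DC5                          # FNV offset basis
        for byte in data:
            h ^= byte
            h = (h * 0x01000193) & 0xFFFFFFFF   # multiply by the FNV prime, mod 2^32
        return h

    assert fnv1a_32(b"") == 0x811C9DC5

It is cheaper per key than MD5, which matters when the hash runs on every cache get.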

In the meantime we switched our servers from FreeBSD to Linux.  Thanks anyway, FreeBSD!  Your strict, old-school C compiler and standard library made it a lot easier to see the need for some improvements that will help us even on Linux, which had been allowing the looser code to run.

Deployment, Results

Rolling this out was quite a thing.  We handed the project off to Andrii from Site Reliability Engineering, and Aaron, from our Storefront team, to tame the animal trainer, as it were: write a Zookeeper script, and refactor the cache client wrapper code until it behaved as desired.  Here is a picture of a test we ran, of the spiffy new 14-node cluster in our Massachusetts data center (it has a twin in Washington state, and more coming soon).  In these graphs, we’re counting requests to the different nodes of the cluster.  The gaps are periods where we forced a particular node, or more than one, off line.  You can see that traffic to the other nodes picks up when that happens.  We set Zookeeper to fail over upon 5 consecutive failed health checks, so it won’t be too twitchy.  The failover takes 5-7 seconds, and during that time, it’s not as if the requests are failing; they are just going back to the source, or origin, as the content delivery network people would say.
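
The monitoring side is not much more than a counter.  Here is a hedged sketch of what an agent like that can look like, again with kazoo and made-up addresses and paths; the real script differs, but the "five consecutive misses, then drop the node" shape is the point.

    import socket
    import time
    from kazoo.client import KazooClient

    NODE = "10.0.0.7:11211"      # the cache node this agent is responsible for
    THRESHOLD = 5                # fail over only after 5 consecutive failed checks

    def is_alive(hostport, timeout=1.0):
        host, port = hostport.split(":")
        try:
            socket.create_connection((host, int(port)), timeout=timeout).close()
            return True
        except OSError:
            return False

    zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
    zk.start()
    zk.ensure_path("/cache/memcached-ring")
    zk.create("/cache/memcached-ring/" + NODE, ephemeral=True)

    misses = 0
    while True:
        misses = 0 if is_alive(NODE) else misses + 1
        if misses >= THRESHOLD:
            # removing the ephemeral znode is what notifies the watching clients
            zk.delete("/cache/memcached-ring/" + NODE)
            break
        time.sleep(1)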

[Charts of requests to the 14 nodes of a Memcached cluster, with distribution governed by ketama]

Happy hashing with this, if it fits the size and performance requirements of the cache that you have.  And if it doesn’t, and you’re willing to use some even more involved techniques, you can always scale Memcached to Facebook scale!  The Facebookies describe all that in this blog post, which has a link to a Usenix paper they wrote.

References

Here’s a little bibliography of consistent hashing.  It includes the original research paper, some blog posts about its use at various sites, and links to a bunch of code:

  • Wikipedia article on consistent hashing
  • Lecture: Consistent Hashing, Danny Lewin, and the Creation of Akamai, by Tom Leighton, Akamai founder
  • MIT/Akamai paper (Karger, Leighton, Lewin et al.); other downloads here, if that goes away
  • Consistent hashing in Riak
  • Mathias Meyer, author of Riak Handbook, on practical consistent hashing
  • Another general treatment, by the audiogalaxy guy
  • On sharding in Redis, by the author of Redis
  • Jeremy Zawodny on sharding at Craig’s List
  • Richard Jones’s blog post on using his code at Last.fm
  • Richard Jones’s ketama code on GitHub
  • Perl consistent hashing module
  • Python ‘hash_ring’ module

Exchange 2010 Upgrade: The Archiving Edition

Over this past summer, Wayfair upgraded from Exchange 2007 to a new, highly available Exchange 2010 email environment.  It brought many improvements, but with these enhancements came new methods for performing common tasks.  One operation affected by the upgrade was the mailbox export process.  To export a mailbox in Exchange 2007, one would simply grant the engineer or process performing this task administrative rights to the target mailbox and run the PowerShell command:

Export-Mailbox -Identity Username -PSTFolderPath \\Server\Destination

The procedure for exporting mailboxes in Exchange 2010 is a bit more complicated. Continue reading

PHP APC and Handling Dynamic Inheritance Opcodes

Gopal Vijayaraghavan, one of the core APC developers, has a very good write-up (APC Autofilter: The Real Story) on the consequences of static and dynamic inheritance on opcodes generated by the compiler as well as how APC handles them. If you’re familiar with it, you might want to skip right to ‘Handling Dynamic Opcodes’ where we focus on some discrepancies we found as well as how to make things perform better. Continue reading

BarCampBoston is so good it’s spooky!

A group of engineers from Wayfair will pause trying to make Halloween pictures in Graphite and stop looking for a ghost in the machine with Skyline this weekend to head over to the 8th reincarnation of BarCampBoston.

BarCampBoston is a great example of an unconference. These conferences are filled with their own kind of tech conf magic, so if you haven’t registered yet, put down your magic wand and sign up quickly! 600+ attendees will be headed to the MIT Stata Center this weekend like a swarm of undead, showing how very much alive the Boston tech scene is.

Over the past 7 years the event has continued to grow in size and popularity without losing the unconference spirit and feel. It will be interesting to see how this year’s event keeps to the unconference model at such a scale. Some of the normal processes at an unconference are geared more toward smaller groups, so I wouldn’t be frightened if things need to be tweaked a little to work for such a large crowd.

We look forward to seeing all of the talks that are conjured up at BarCampBoston this weekend!

Improving German Solr/Lucene search results

On the search and recommendations team at Wayfair, we like to think of ourselves as sophisticated men and women of the world.  We have speakers of French, Chinese, Hebrew, German and Persian in the crew.  Some of us host couchsurfers at our swanky apartments, and then turn around and couchsurf all over Europe.  Others travel to Singapore, attend music festivals in Thailand, etc., etc.  But until recently, when it came to giving our German-speaking customers a decent search experience, we could fairly have been characterized as inhospitable xenophobes. Continue reading

Dynamic WAN Optimizer Routing

Wayfair recently deployed Steelhead WAN optimizers in our network.  We were unsatisfied with the 3 (2.5?) main deployment methods suggested by Riverbed.  So we designed our own deployment.  But first, the vendor-suggested methods: Continue reading

Stormin’ OMS

Overview

There’s storm at Wayfair! And yes, the “a” article before the word “storm” is purposely not there. When referring to “storm” at Wayfair, we do not mean a conglomerate of barometric circumstances that lead to downpours from the skies and other natural phenomena (a storm); we mean real-time computation, horizontal scalability, and system robustness. We mean bleeding-edge technologies (storm). Wayfair’s Order Management System (OMS) team introduced storm into our ever-growing technical infrastructure in February to implement event-driven processes. Continue reading

/dev/null vs. MongoDB benchmark bake-off

We’ve used MongoDB at Wayfair for a subset of our customer data for a while.  But we’re always looking for opportunities to speed up our infrastructure and give our customers a more responsive user experience.  So when we heard about a new database platform called ‘/dev/null’, we became pretty excited.  We can’t post a link, because it’s in a very private beta testing phase, but we can assure you that the stealth-mode startup that’s working on it is supported by a pair of high-class Silicon Valley VCs.  The technology is supposed to be too cutting-edge for stodgy Boston, so we felt pretty lucky to be included.  /dev/null is web scale, we heard, and it supports sharding!  The slashes in the name certainly give it an edgy feel.  IMHO it’s a bold move to name it that, because of the potential for gfail (weird names doing badly in Google search) and unexpected placement in alphabetical lists.  But hey, as an NYC cabbie character said in Taxi Driver, they’re way ahead of us out there in California.

Everything comes with trade-offs, and the word on the street is that /dev/null is so heavily optimized for write performance that ‘read’ reliability can be less than ideal.  But who knows?  Maybe that’s the balance we want for our write-heaviest workloads.

So we got out our testing tools and went to work on a bake-off.  We wanted to simulate real-world conditions as much as possible, so we wrote some PHP scripts that connected to our sharded development Mongo cluster.  On the /dev/null side, configuration of a cluster was pretty easy, as long as you start from a standard POSIX-style system.

After function-testing the PHP, we wrote a quick Apache Bench script to bake off the two systems.  The results speak for themselves:

Past the upper 90%, MongoDB is a classic hockey stick.  /dev/null starts fast and stays flat, out to the far horizon.  Love it.  I don’t know that we’re ready to switch right away: I’ll be a little uncomfortable while it’s still in beta, and I’ll have to get to the bottom of those unreliable read operations.  But this is looking *very* promising.

For now, I’m going to follow our internal process for new technologies like this, which is to email around a Wayfair Technical Finding (‘WTF’) to all the senior software engineers and architects, so we can put our heads together, evaluate further, and eventually make a plan to roll this out across all our data centers.

Hat tip gar1t on xtranormal.

Proposal to Control Wheelchairs with Google Glass

When our company’s co-founder encouraged all of our Engineering department to participate in the Google Glass Explorer contest, I thought about project ideas that could help people by using the unique features of this new augmented-reality technology. I remembered a project that some fellow students did during a robotics class that I took in graduate school. It used eye-tracking technology to remotely control the motors on a vehicle. After confirming that Google planned to embed eye-tracking technology in their new product, I realized this idea could work for applications such as wheelchairs.

My plan is to provide feedback about the wearer’s surroundings, including obstacles and suggested paths, and enable him or her to control the wheelchair with eye movements. The original student project used patterns of a user’s eyes being opened or closed to change between types of motion. For my project, I want to use subtle yet deliberate movements of the eye to let the user interact seamlessly with the surrounding environment. I think this technology could be life-changing for persons with disabilities. I hope that being able to work on this project with the support of the Google Glass Explorer program will help make it a reality.

I wrote up my idea, posted it on Google Plus with the #ifihadglass hashtag, and some fellow Wayfairians tweeted about it. Walter Frick of BostInno saw a tweet, did an interview with me, and then wrote an article about it. You can read the full story at these links:

BostInno: http://bit.ly/X7nD6b
Popular Science write-up: http://bit.ly/16unjVS

Tesla: Feel the Power

Data Warehousing at Wayfair

In 2009 Wayfair’s database infrastructure was based almost entirely on Microsoft SQL Server. Our Business Intelligence team was using a SQL Server data warehouse to prepare a large amount of data for import into Analysis Services (SSAS) each day. We populated our data warehouse using transaction log shipping from production servers, which required about 3 hours of downtime on the data warehouse at midnight each night to restore the previous day’s logs. Once that was done, a series of stored procedures were kicked off by jobs that would crunch through data from several different servers to produce a star schema that could be pulled into SSAS. Wayfair was scaling rapidly, and this approach started to become painfully slow, often taking 10-16 hours to crunch through the previous day’s data.

The BI team decided to look into other solutions for data warehousing, and ultimately purchased a Netezza appliance. Netezza is essentially a fork of PostgreSQL that takes a massively parallel cluster of nodes (24 in our case) and makes them look like one database server to the client. In our tests, Netezza could crunch through our data in roughly a quarter of the time, bringing 10-16 hours down to a much more reasonable 2-4 hours. The dream of updating our data warehouse multiple times each day was starting to look feasible. The feedback loop on business decisions would become dramatically shorter, enabling us to iterate more quickly and make well informed decisions at a much faster pace. There was just one glaring problem.

Great, But How Are We Going to Get Data Into It?

As soon as the DBA team heard that the Netezza purchase had been finalized, our first question was “great, but how are we going to get data into it?” The folks at Netezza didn’t have an answer for us, but they did send us an engineer to help devise a solution. As it turned out, the problem of how to incrementally replicate large amounts of data into a data warehouse was a common one, and there were surprisingly few open source solutions. Google it, and most people will tell you that they just reload all their data every day, or that they only have inserts so they can just load the new rows each day. “Great, but what if you want incremental replication throughout the day? What if you have updates or deletes? How do you deal with schema changes?” Crickets.

The First Solution

The solution we arrived at was to use SQL Server Change Tracking to keep track of which rows had changed on each table, and we built a replication system around that.  We created stored procedures for each table that contained the commands required to use the CHANGETABLE() function to generate change sets, dump those to flat files on a network share using bcp, pipe them through dos2unix to fix the line endings, and load them into Netezza using the proprietary nzload command.  Over the course of a few months we came up with an elaborate series of REPLACE() functions for text fields to escape delimiters, eliminate line breaks and clean up other data anomalies that had the potential to break the nzload.  The whole process was driven by SSIS packages.
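
For flavor, here is roughly what one of those per-table change-set pulls boils down to, sketched in Python rather than the stored procedures and SSIS we actually used; the table, columns, DSN and version bookkeeping are all hypothetical.

    import csv
    import pyodbc

    LAST_SYNCED_VERSION = 12345   # in the real system this was tracked per table

    conn = pyodbc.connect("DSN=SourceSqlServer")
    rows = conn.execute(
        """
        SELECT ct.SYS_CHANGE_OPERATION, ct.OrderID, t.CustomerID, t.Total
        FROM CHANGETABLE(CHANGES dbo.Orders, ?) AS ct
        LEFT JOIN dbo.Orders AS t ON t.OrderID = ct.OrderID
        """,
        LAST_SYNCED_VERSION,
    )

    # dump a Unix-style, tab-delimited flat file for nzload to pick up
    with open("Orders.changes.tsv", "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t", lineterminator="\n")
        for row in rows:
            writer.writerow(row)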

This solution worked, but it was a maintenance nightmare.  We frequently had to edit stored procedures when adding new columns, and we had to edit the SSIS packages to add new tables.  SSIS uses GUI-based programming, and the editor for it (Business Intelligence Development Studio) is extremely slow and clunky, so even making simple changes was a painful process.  Adding a new table into change tracking was a 14-step process that took over an hour of DBA time, and setting up a new database took roughly 28 steps and around two days of DBA time.  We also had no solution for schema changes: we had to apply them to the Netezza server manually, and if we forgot to do so the change tracking jobs would fail.

Release Early, Then Iterate Like Hell

Over the next few years, we iterated on this solution and added a number of useful features:

  • We got rid of the per-table stored procedures and switched to a single stored procedure that used dynamic SQL instead.
  • We created a solution for automated schema changes based on DDL triggers.
  • We created a single stored procedure to handle adding new tables into change tracking, turning it into a one-step process.
  • We added features to publish a subset of a table’s columns, because Netezza had a fixed row size limit that some of our tables exceeded.
  • We added a feature to trim the length of text fields, because large blobs of text usually aren’t needed on the data warehouse and they slowed down the process.
  • We added logging of performance and health metrics to statsD, with alerts in Tattle.
  • We added the ability to replicate changes from sharded master databases and consolidate them into one database on the slave.
  • We added the ability to replicate to multiple SQL Server data warehouses in addition to Netezza.
  • We had data on our masters that was moved into archive tables when certain criteria were met, so we added a feature to apply changes to a table and its archive in one transaction on the slave, eliminating the temporary appearance of duplicate data.

Not Good Enough

Ultimately, we were still unhappy with the solution. It was too heavily based on stored procedures, functions, configuration via database tables, xp_cmdshell, and worst of all, linked servers. It was still a nightmare to set up new databases, and when we wanted to make changes we had to edit the same stored procedures in 20+ different places.  It was still single threaded. Worse still, it was tightly coupled: if one slave server fell behind, the others suffered for it. It was also extremely specific to the use case of replicating data from SQL Server to either SQL Server or Netezza, and Wayfair was beginning to make much more use of open source databases like MySQL and MongoDB. In early 2012, we realized this solution wasn’t going to scale any further through iteration. We needed a redesign. We needed something fresh.

Enter Tesla

Redesigned from the ground up and inspired by the Tesla Replicator in The Prestige, Tesla was the solution to our data warehousing woes. We completely avoided stored procedures, functions, configuration tables, SSIS, dynamic SQL, xp_cmdshell and linked servers. Instead, we wrote Tesla in C# (primarily due to one incredibly useful .NET class for copying data between SQL servers) and moved all the logic into the application. Tesla is a single console application that takes care of everything we were doing with stored procedures and SSIS before. Its configuration is based on files rather than tables, which we can version control and deploy using our push tool. It’s multi-threaded and uses a configurable number of threads, allowing us to replicate the most important databases as quickly as possible. It’s completely decoupled, meaning that if one slave falls behind it doesn’t impact the others. It was also designed to be extensible to other data technologies, both as sources and destinations.

Tesla’s Design

Tesla is built as a few agents, such as Master and Slave. These agents are run as scheduled tasks in the scheduler of your choice, and they each have their own configuration files. They are completely decoupled and can be run on separate servers and at separate times.

The design for Tesla was inspired by LinkedIn’s article about DataBus. Specifically, the idea of a master server publishing its change sets to a relay server and the slaves polling the relay for those changes was appealing to us. It meant less load on the masters, and it also meant we could store the change sets in such a way that if a slave fell behind it would be able to get consolidated deltas to more efficiently catch up. The biggest difference between Tesla and DataBus is that we focus on batch-based change sets, rather than streaming. Batches are captured on the master as one semi-consistent view of a database at a given point in time, reducing the chance of orphaned or incomplete data on the data warehouse. It also makes the most sense for a technology like Netezza, which is terrible at small transactions and great at large batches.
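
To make the relay idea concrete, here is a purely conceptual sketch of the slave’s side of that loop; the objects and method names are invented for illustration and are not Tesla’s actual API.

    import time

    def follow_relay(relay, slave, poll_interval=60):
        """Slave loop: pull batched change sets from the relay, never from the master."""
        last_applied = slave.last_applied_version()
        while True:
            # a slave that has fallen behind gets a consolidated delta,
            # so it catches up with one large batch instead of many small ones
            batch = relay.changes_since(last_applied)
            if batch is not None:
                slave.apply_batch(batch)     # one big transaction per batch
                last_applied = batch.version
            time.sleep(poll_interval)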

Open Source

Tesla is fully open source and available on GitHub. It currently supports SQL Server as a master, slave and relay server, and Netezza as a slave. It was designed with extensibility in mind, so we expect to add more technologies on both sides over time. We already have a slave adapter for Hive in the works. Feel free to hack away, add features, and submit pull requests!