Data Warehousing at Wayfair
In 2009 Wayfair’s database infrastructure was based almost entirely on Microsoft SQL Server. Our Business Intelligence team was using a SQL Server data warehouse to prepare a large amount of data for import into Analysis Services (SSAS) each day. We populated our data warehouse using transaction log shipping from production servers, which required about 3 hours of downtime on the data warehouse at midnight each night to restore the previous day’s logs. Once that was done, a series of stored procedures were kicked off by jobs that would crunch through data from several different servers to produce a star schema that could be pulled into SSAS. Wayfair was scaling rapidly, and this approach started to become painfully slow, often taking 10-16 hours to crunch through the previous day’s data.
The BI team decided to look into other solutions for data warehousing, and ultimately purchased a Netezza appliance. Netezza is essentially a fork of PostgreSQL that takes a massively parallel cluster of nodes (24 in our case) and makes them look like one database server to the client. In our tests, Netezza could crunch through our data in roughly a quarter of the time, bringing 10-16 hours down to a much more reasonable 2-4 hours. The dream of updating our data warehouse multiple times each day was starting to look feasible. The feedback loop on business decisions would become dramatically shorter, enabling us to iterate more quickly and make well informed decisions at a much faster pace. There was just one glaring problem.
Great, But How Are We Going to Get Data Into It?
As soon as the DBA team heard that the Netezza purchase had been finalized, our first question was “great, but how are we going to get data into it?” The folks at Netezza didn’t have an answer for us, but they did send us an engineer to help devise a solution. As it turned out, the problem of how to incrementally replicate large amounts of data into a data warehouse was a common one, and there were surprisingly few open source solutions. Google it, and most people will tell you that they just reload all their data every day, or that they only have inserts so they can just load the new rows each day. “Great, but what if you want incremental replication throughout the day? What if you have updates or deletes? How do you deal with schema changes?” Crickets.
The First Solution
The solution we arrived upon was to use SQL Server Change Tracking to keep track of which rows had changes on each table, and we built a replication system around that. We created stored procedures for each table that contained the commands required to use the CHANGETABLE() function to generate change sets, dump those to flat files on a network share using bcp, pipe them through dos2unix to fix the line endings, and load them into netezza using the proprietary nzload command. Over the course of a few months we came up with an elaborate series of REPLACE() functions for text fields to escape delimiters, eliminate line breaks and clean up other data anomalies that had the potential to break the nzload. The whole process was driven by SSIS packages.
This solution worked, but it was a maintenance nightmare. We frequently had to edit stored procedures when adding new columns, and we had to edit the SSIS packages to add new tables. SSIS uses GUI based programming, and the editor for it (Business Intelligence Development Studio) is extremely slow and clunky, so even making simple changes was a painful process. Adding a new table into change tracking was a 14-step process that took over an hour of DBA time, and setting up a new database took roughly 28 steps and around two days of DBA time. We also had no solution for schema changes – we needed to manually apply them to the Netezza server, and if we forgot to do so the change tracking jobs would fail.
Release Early, Then Iterate Like Hell
Over the next few years, we iterated on this solution and added a number of useful features. We got rid of the stored procedures per table and switched to a single stored procedure that used dynamic SQL instead. We created a solution for automated schema changes based off of DDL triggers. We created a single stored procedure to handle adding new tables into change tracking, turning it into a one-step process. We added features to publish a subset of a table’s columns, because Netezza had a fixed row size limit that some of our tables exceeded. We added a feature to trim the length of text fields, because large blobs of text usually aren’t needed on the data warehouse and they slowed down the process. We added logging of performance and health metrics to statsD with alerts in Tattle. We added the ability to replicate changes from sharded master databases and consolidate them into one database on the slave. We added the ability to replicate to multiple SQL Server data warehouses in addition to Netezza. We had data on our masters that was moved into archive tables when certain criteria were met, so we added a feature to apply changes to a table and its archive in one transaction on the slave to eliminate the temporary appearance of duplicate data.
Not Good Enough
Ultimately, we were still unhappy with the solution. It was too heavily based on stored procedures, functions, configuration via database tables, xp_cmdshell, and worst of all - linked servers. It was still a nightmare to set up new databases, and when wanted to make changes we had to edit the same stored procedures in 20+ different places. It was still single threaded. Worst of all, it was tightly coupled. If one slave server fell behind, the others suffered for it. It was also extremely specific to the use case of replicating data from SQL Server to either SQL Server or Netezza, and Wayfair was beginning to make much more use of open source databases like MySQL and MongoDB. In early 2012, we realized this solution wasn’t going to scale any further through iteration. We needed a redesign. We needed something fresh.
Redesigned from the ground up and inspired by the Tesla Replicator in The Prestige, Tesla was the solution to our data warehousing woes. We completely avoided stored procedures, functions, configuration tables, SSIS, dynamic SQL, xp_cmdshell and linked servers. Instead, we wrote Tesla in C# (primarily due to one incredibly useful .NET class for copying data between SQL servers) and moved all the logic into the application. Tesla is a single console application that takes care of everything we were doing with stored procedures and SSIS before. Its configuration is based on files rather than tables, which we can version control and deploy using our push tool. It’s multi-threaded and uses a configurable number of threads, allowing us to replicate the most important databases as quickly as possible. It’s completely decoupled, meaning that if one slave falls behind it doesn’t impact the others. It was also designed to be extensible to other data technologies, both as sources and destinations.
Tesla is built into a few agents such as Master and Slave. These agents are run as scheduled tasks in the scheduler of your choice, and they each have their own configuration files. They are completely decoupled and can be run on separate servers and at separate times.
The design for Tesla was inspired by LinkedIn’s article about DataBus. Specifically, the idea of a master server publishing its change sets to a relay server and the slaves polling the relay for those changes was appealing to us. It meant less load on the masters, and it also meant we could store the change sets in such a way that if a slave fell behind it would be able to get consolidated deltas to more efficiently catch up. The biggest difference between Tesla and DataBus is that we focus on batch-based change sets, rather than streaming. Batches are captured on the master as one semi-consistent view of a database at a given point in a time, reducing the chance of orphaned or incomplete data on the data warehouse. It also makes the most sense for a technology like Netezza, which is terrible at small transactions and great at large batches.
Tesla is fully open source and available on github. It currently supports SQL Server as a master, slave and relay server, and Netezza as a slave. It was designed with extensibility in mind, so we expect to add more technologies on both sides over time. We already have a slave adapter for Hive in the works. Feel free to hack away, add features, and submit pull requests!