Why not give Code Deploy Clients access to the repository?

We’ve received a few questions like this, both online and in person, so I figured it was worth explaining in a little more detail.

On the Deployment server, we deploy a variety of applications: Windows .NET services, Python, Classic ASP, CSS/JS, and PHP, to name a few.

We chose to standardize the interface to the Deployment server to make creating new code deployment clients simpler. Our Deployment server is essentially an on-demand package creation and deployment system. Continue reading
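To make the idea a bit more concrete, here is a rough sketch of what a standardized client call might look like. The endpoint, parameters, and hostname below are hypothetical; they only illustrate the shape of the interface, not our actual API.

```python
# Hypothetical sketch of a deployment client talking to a standardized
# Deployment server API. Endpoint paths, parameters, and hostnames are
# illustrative only, not the real interface.
import json
import urllib.request

DEPLOY_SERVER = "https://deploy.example.internal"

def request_deploy(app_name, revision, target_env):
    """Ask the Deployment server to build a package for app_name at the
    given revision and push it to target_env."""
    payload = json.dumps({
        "app": app_name,           # e.g. "storefront-php" or "search-python"
        "revision": revision,      # source control revision to package
        "environment": target_env  # e.g. "staging" or "production"
    }).encode("utf-8")
    req = urllib.request.Request(
        DEPLOY_SERVER + "/api/deploy",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# The client never touches the repository directly; the server checks out
# the code, builds the package, and deploys it.
# request_deploy("search-python", "a1b2c3d", "staging")
```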

Lessons from a datacenter move

Last winter we were discussing all of our upcoming projects and what they would require in new hardware in the datacenter. Then we took a look at the space we had in our cage at our main datacenter. It turned out we didn’t have enough space, the facility wouldn’t give us any more power in our current footprint, and there was no room to expand the cage. We had two basic options. One was to add cage space, either in the same building or in another facility, and rely on cross connects or WAN connections. We weren’t wild about this approach because we knew it would come back to bite us later, as we continuously fought with the concept and had to decide which systems should live in which space. The other option was to move entirely into a bigger footprint. We opted to stay in the same facility, which made moving significantly easier, and moved to a space that is 70% larger than our old one, giving us lots of room as we grow. Another major driver in the decision to move entirely was that it afforded us the opportunity to completely redo our network infrastructure from the ground up, with a much more modular setup and, finally, 10Gb everywhere in our core and aggregation layers.

Some stats on the move:

  • Data migrated for NAS and SAN block storage: 161 TB
  • Network cables plugged in: 798
  • Physical servers moved or newly installed: 99 rack mount and 50 blades
  • Physical servers decommissioned to save power and simplify our environment: 49
  • VMs newly stood up or migrated: 619

It’s worth noting that the physical moves were done over the course of 2 months. Why so long? Unlike many companies, which can take a weekend to bring things down, we aren’t afforded that luxury. We have customer service working in our offices 7 days a week, in both the US and Europe, and we have our website to think about, which never closes. In fact, we pulled this off with only a single 4-hour outage to our storefront, plus several very small outages to our internal and backend systems on weeknights throughout the project.

Lessons Learned:

No matter how good your documentation is, it’s probably not good enough. Most folks’ documentation concentrates on break/fix and the general architecture of a system: what’s installed, how it’s configured, and so on. Since we drastically changed our network infrastructure, we had to re-IP every server as it was moved, and we had to come up with procedures for everything else that needed to happen when a machine suddenly had a new IP address. We use DNS for some things, but not everything, so we had to make sure that inter-related systems were also updated when we moved things.
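As one small, hypothetical example of the kind of check that ends up in such a procedure (the hostnames and addresses below are made up), a quick script can confirm that DNS agrees with a server’s new address after it has been re-IP’d:

```python
# Illustrative sketch only: verify that DNS answers match the new IPs we
# expect after re-IP'ing servers. Hostnames and the expected mapping are
# hypothetical, not a real inventory.
import socket

# New addresses each host should resolve to after the move.
EXPECTED_IPS = {
    "db01.example.internal": "10.20.1.11",
    "web01.example.internal": "10.20.2.21",
    "cache01.example.internal": "10.20.3.31",
}

def check_dns(expected):
    """Resolve each hostname and report any that still point at an old IP."""
    problems = []
    for host, new_ip in expected.items():
        try:
            resolved = socket.gethostbyname(host)
        except socket.gaierror:
            problems.append((host, "does not resolve"))
            continue
        if resolved != new_ip:
            problems.append((host, "resolves to %s, expected %s" % (resolved, new_ip)))
    return problems

if __name__ == "__main__":
    for host, issue in check_dns(EXPECTED_IPS):
        print("%s: %s" % (host, issue))
```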

Get business leads involved in the timeline. This sounds funny, but one of the biggest metrics in measuring the success of a project like this is the perception of the users. Since a good percentage of the systems being moved had particular business units as their main “customers,” we worked with leaders from those business units to make sure we understood their use of the systems: which days or times of day they used them the most, and whether they had any concerns about off-hours operations during different times of the week. Once we had this information from many different groups, we sat down in a big room with all the engineers responsible for these systems, came up with a calendar for the move, and then got final approval for the dates from the business leads. This was probably the smartest thing we did, and it went a long way toward helping our “customer satisfaction.”

Another thing we learned early on was to separate the work of physically moving equipment from the work done by subject matter experts to make system changes and verify that things were working properly after the move. This freed the subject matter experts to get right to work, without having to worry about other, unrelated systems being moved in the same maintenance window. How did we pull this off? Again, include everyone. We have a large Infrastructure Engineering team, 73 people as of this writing. We got everyone involved, from our frontline and IT Support groups all the way up to directors; even Steve Conine, one of our co-founders, did an overnight stint at the datacenter helping with the physical move of servers. It was an amazing team effort, and we would never have had such a smooth transition if everyone hadn’t stepped up in a big way.

I hope these little tidbits are helpful to anyone taking on such a monumental task as moving an entire data center.  As always, thanks for reading.

Rest@Wayfair.com

Some Background

At Wayfair, we are working on a next generation of systems to power our business. The decade-old systems that currently keep us running in stride have allowed Wayfair.com to vault from nothing to where it is today. But as with all systems, they have started to show their age. Continue reading

Better Lucene/Solr searches with a boost from an external naive Bayes classifier

Me: Doug, what are you doing?

Doug: Solving the problem of class struggle with one of Greg’s classifiers.

Me:  Karl Marx should call his office.  What do you mean by that?

Doug: Let me explain… Continue reading
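For readers who don’t follow the link, the title gives away the technique, and a heavily simplified sketch of it might look like the following. The field name, boost weight, Solr URL, and toy training data are invented for illustration; the real details are in the full post.

```python
# Rough sketch of the idea in the title, not our actual implementation:
# classify the query text with an external naive Bayes model, then boost
# Solr results whose class matches the prediction. The "class" field,
# boost weight, Solr URL, and toy training data are all hypothetical.
import json
import urllib.parse
import urllib.request

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training set: query text -> product class.
train_queries = ["red bar stool", "leather sofa", "modern sectional", "oak dining table"]
train_classes = ["seating", "seating", "seating", "tables"]

vectorizer = CountVectorizer()
classifier = MultinomialNB().fit(vectorizer.fit_transform(train_queries), train_classes)

def search_with_class_boost(query, solr_url="http://localhost:8983/solr/products/select"):
    """Predict a class for the query and pass it to Solr as an edismax boost query."""
    predicted = classifier.predict(vectorizer.transform([query]))[0]
    params = {
        "q": query,
        "defType": "edismax",
        "qf": "name description",
        "bq": "class:%s^2.0" % predicted,  # additive boost for the predicted class
        "wt": "json",
    }
    url = solr_url + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read())

# search_with_class_boost("swivel bar stool")  # boosts documents in the "seating" class
```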

Better three-word searches with SOLR

We use the Apache SOLR search platform behind the scenes at Wayfair.  Sometimes, when vanilla SOLR doesn’t quite do what we want, we improve it for our purposes. When we suspect that others might have the same purposes, and we think that we have solved our problems in a generally useful way, we contribute our solutions back to the open source community, either on github, or through a more project-specific distribution channel.  SOLR is an Apache project, so for SOLR, this means attaching a patch to a ‘Jira’.  This blog post is about SOLR Jira 1093. Continue reading

Information Week Interviews Wayfair on its use of Markov Clustering

These days, in the big data community, we often hear how biologists have adopted and are using distributed computing technologies that were first introduced to solve problems in software engineering. The fact that Wayfair has done the inverse, using a tool initially developed to help biologists cluster similar proteins to solve a problem in e-commerce, piqued the curiosity of Information Week magazine, which asked us for an interview about our February blog post on using Markov clustering to generate recommendations: http://engineering.wayfair.com/recommendations-with-markov-clustering/. Read the interview here: http://www.informationweek.com/big-data/news/big-data-analytics/240007850/online-retailer-uses-dna-research-to-connect-with-customers
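For anyone curious about the algorithm itself, here is a minimal, illustrative implementation of Markov clustering (MCL) on a toy graph. It is not our production recommendation code, which the linked post describes; the example graph and parameters are made up.

```python
# Minimal Markov clustering (MCL) sketch for illustration only.
import numpy as np

def mcl(adjacency, expansion=2, inflation=2.0, iterations=50):
    """Cluster a graph given its symmetric, non-negative adjacency matrix."""
    # Add self-loops and column-normalize to get a stochastic matrix.
    m = adjacency.astype(float) + np.eye(adjacency.shape[0])
    m /= m.sum(axis=0)
    for _ in range(iterations):
        m = np.linalg.matrix_power(m, expansion)  # expansion: spread random-walk flow
        m = m ** inflation                        # inflation: strengthen strong flows
        m /= m.sum(axis=0)                        # renormalize columns
    # Rows with remaining mass identify attractors; the non-zero columns in
    # such a row form one cluster.
    clusters = []
    for row in m:
        members = set(np.nonzero(row > 1e-6)[0])
        if members and members not in clusters:
            clusters.append(members)
    return clusters

# Toy co-view graph: items 0-2 are linked to each other, as are items 3-4.
graph = np.array([
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
])
print(mcl(graph))  # expected: two clusters, {0, 1, 2} and {3, 4}
```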


Northeast PHP Recap

Last weekend was the inaugural run of the Northeast PHP Conference in Boston. Wayfair was a gold sponsor, so we bought t-shirts, paid for apps and beer at the Saturday night event, and sent about 15 engineers. I gave a talk on High Performance PHP, and we had a blast. Check out the slides from my talk. The feedback was great, and we look forward to sponsoring the conference again next year!

You can also take a look at some of the other talks that we really enjoyed:

Thanks to Michael Bourque and the other organizers for putting on a great event!

Measuring CDN Performance Benefits with Real Users

A couple of weeks ago I ran a test with WebPagetest that was designed to quantify how much a CDN improves performance for users that are far from your origin.  Unfortunately, the test indicated that there was no material performance benefit to having a CDN in place.  This conclusion sparked a lively discussion in the comments and on Google+, with the overwhelming suggestion being that Real User Monitoring data was necessary to draw a firm conclusion about the impact of CDNs on performance.  To gather this data I turned to the Insight product and its “tagging” feature.

Before I get into the nitty-gritty details I’ll give you the punch line: the test with real users confirmed the results from the synthetic one, showing no major performance improvement due to the use of a CDN. Continue reading
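The tagging details are in the full post, but the analysis step is simple to sketch: split real-user beacons into CDN and origin groups and compare the distributions, not just a single average. The CSV layout and field names below are hypothetical.

```python
# Hedged sketch of the analysis step: given real-user timing beacons that
# were tagged "cdn" or "origin", compare page-load distributions between
# the two groups. The CSV format and field names are hypothetical.
import csv
import statistics

def load_times_by_tag(path):
    """Read beacons with columns: tag, load_time_ms."""
    groups = {"cdn": [], "origin": []}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            tag = row["tag"]
            if tag in groups:
                groups[tag].append(float(row["load_time_ms"]))
    return groups

def summarize(times):
    """Report count, median, and rough 95th percentile for one group."""
    if not times:
        return {"count": 0}
    times = sorted(times)
    return {
        "count": len(times),
        "median": statistics.median(times),
        "p95": times[int(0.95 * (len(times) - 1))],
    }

if __name__ == "__main__":
    for tag, times in load_times_by_tag("rum_beacons.csv").items():
        print(tag, summarize(times))
```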

Webops for Python, part 2: the how-to

In part 1 of this 2-part series we used a comic strip to depict Python programmers and web operations folk working together to figure out how to deploy some scientific computing to an e-commerce site.  Joking aside, let’s describe exactly what we were trying to accomplish, and how we did it. Continue reading

WebOps for Python, part 1: the comic strip

Python is my favorite computer language for data science, but it is a poorly standardized beast when it comes to packaging, deployment, web operations, and so on.  Plenty of people are deploying Python code to the web effectively, but especially in the data science area, there is no equivalent of the LAMP stack that you can just plug in and start coding against.  We have a way (among other possible ways) of solving these problems that we think people might find useful, and I am going to describe our methods in a couple of blog posts.  The first one will tell the story as a comic strip.  The next one will have the code and instructions. Continue reading