Wayfair Code Deployment, Part 3

If you haven’t already read the overview of our deployment system or the architecture of our deployment server, you might want to check those posts out first.

In this article I will discuss how we deploy code in a unique way that gives us all the flexibility of the symlink method without requiring a symlink.

The design of our deployment system was driven in part by the need to roll deployments forward and back as quickly as possible. We also needed to do this atomically, so that no errors occurred while the code was physically being copied to the webservers.

We initially tried the normal symlink method, which we thought would meet the two requirements above. Unfortunately we ran into problems with this approach.

Those of you who have tried the symlink deployment method with PHP and APC stat caching enabled will know that, under heavy load, even clearing the cache leads to mixed file versions in the cache, due to a combination of filesystem and APC caching. A fairly common solution is to restart PHP, which clears APC's cache and starts fresh. With Apache, a graceful restart is fairly unobtrusive, but if you rely on APC for cached data as well as for the opcode cache, you purge both of those caches when you restart Apache (unless you are using Apache with FastCGI).

In the FastCGI context, the webserver and the FastCGI server are two separate, independently running processes. You can run multiple PHP FastCGI processes, both local to the webserver and on remote hosts, to distribute processing. This gives more flexibility from a backend processing and load balancing perspective, but doesn't directly help with the APC caching issue. It just contains the problem within the PHP process instead of the webserver.

While researching the problem, we stumbled upon a blog post about a similar situation and solution when running NGINX with PHP. The 10-second summary of their solution is that they automated a change to their NGINX conf on each deploy, altering the fastcgi_param SCRIPT_FILENAME so that it pointed directly at the new code base instead of the old one. A side benefit is that this avoids the need for a symlink altogether.

Unfortunately, it still requires a restart of the webserver and a scripted change to the running config. We run Lighttpd, which does not have the same graceful restart functionality that NGINX and Apache have. At the rate at which we deploy code, we didn't want to be doing a graceful restart on all of our servers each time, let alone a hard restart. The benefit of this option, though, is that it doesn't require restarting PHP and still avoids the APC caching issues. So our next question was: how do we implement the same idea, using the webserver to set a PHP FastCGI parameter, without restarting PHP or Lighttpd?

Lighttpd has a very powerful module called mod_magnet, which allows Lua scripts to be included in line with the configuration (similar to NGINX’s embedded Perl module). This lets you add a lot more smarts during request execution, before the request ever hits the filesystem or PHP code.

Normally Lighttpd passes the file path to PHP via the standard SCRIPT_FILENAME, and PHP executes the code in the standard path. With Lua, we are able to modify the file path variable during the request and tell PHP to serve the content from a particular location instead of the default. So now we can change the location via Lua, and we're done, right?

Caching strikes again! The Lua script is cached in Lighttpd and doesn't get reloaded instantly when the file is altered. Since the goal was to avoid restarting Lighttpd, that put us right back at the drawing board for a few minutes.

Luckily, Lua has features like global variables and I/O functions built in. Using these, we are able to open a file, check the currently deployed version, and serve the PHP content from that path. Example Lua function snippet:

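function get_deploy_version()
  -- Read the currently deployed version; fall back to 'data' if the
  -- file is missing or empty.
  local inf = io.open("deployed_version.txt", "r")
  if (nil == inf) then
    return 'data'
  end
  local line = inf:read("*line")
  inf:close()
  if (nil == line or '' == line) then
    return 'data'
  end
  return line
end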

This function grabs the value out of deployed_version.txt and returns it. If the file doesn't exist or its contents are empty, we default to the "data" folder. We do keep the symlink around for ease of administration. It is always updated during deploys, so we can fall back to it in a pinch.
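Because the value read from the file ends up inside a filesystem path, it may be worth checking it before use. The following is a minimal sketch, not part of our production script: the helper name sanitize_version is hypothetical, and it assumes the file holds a plain numeric SVN revision. Anything else falls back to the same "data" default.

```lua
-- Hypothetical helper (not from the original script): accept only an
-- all-digit revision string, otherwise return the default folder name.
local function sanitize_version(raw, default)
  default = default or "data"
  if type(raw) ~= "string" then
    return default
  end
  -- Trim surrounding whitespace, e.g. a stray "\r" or trailing space.
  local trimmed = raw:match("^%s*(.-)%s*$")
  if trimmed:match("^%d+$") then
    return trimmed
  end
  return default
end

print(sanitize_version("12345"))     -- 12345
print(sanitize_version(" 12345\r"))  -- 12345
print(sanitize_version("../etc"))    -- data
print(sanitize_version(nil))         -- data
```

Under these assumptions, the result of get_deploy_version() could be passed through such a check before it is concatenated into the document root.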

So now we have a way to dynamically read from a text file and almost instantly switch codebases (wait, ALMOST instantly?). To minimize the number of reads of this file on a busy server, we cache the version in a Lua global variable and set a TTL on it. This adds a delay of up to 10 seconds but reduces contention on the file. Below is a code snippet showing an example of what this might look like in Lua:

current_time = os.time()
if (nil == _G['deployed_version']) then
  -- debug("No deployed version set",1)
  _G['timer'] = nil
end
if (nil == _G['timer']) then
  _G['timer'] = current_time
  _G['deployed_version'] = get_deploy_version()
else
  expired_time = _G['timer'] + 10
  if (expired_time > current_time) then
    -- debug("Cached",1)
  else
    _G['timer'] = current_time
    _G['deployed_version'] = get_deploy_version()
    -- debug("Updated",1)
  end
end
-- Modify Path to include deployed_version
-- /usr/local/www/<version_number>/<deployed_application_name>
lighty.env["physical.doc-root"] = "/usr/local/www/" .. _G['deployed_version'] .. "/application/"
lighty.env["physical.path"] = lighty.env["physical.doc-root"] .. lighty.env["physical.rel-path"]
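The TTL logic above can also be pulled into a small function so it can be exercised outside of Lighttpd. This is only a sketch under stated assumptions: the name cached_version and its parameters are hypothetical, the state table stands in for _G, and the current time is passed in so the expiry behavior can be checked with hand-picked timestamps.

```lua
-- Hypothetical refactoring of the TTL check. `store` holds the cached
-- state (in mod_magnet this role is played by _G), `loader` fetches a
-- fresh value, `ttl` is in seconds, `now` is the current time.
local function cached_version(store, loader, ttl, now)
  if store.deployed_version == nil
     or store.timer == nil
     or now - store.timer >= ttl then
    store.deployed_version = loader()
    store.timer = now
  end
  return store.deployed_version
end

-- Usage with a fake loader that counts how often it is called.
local store, calls = {}, 0
local function fake_loader()
  calls = calls + 1
  return "rev" .. calls
end

print(cached_version(store, fake_loader, 10, 100)) -- first call loads: rev1
print(cached_version(store, fake_loader, 10, 105)) -- within the TTL: rev1
print(cached_version(store, fake_loader, 10, 110)) -- TTL elapsed: rev2
```

Inside mod_magnet the store would need to be _G (or a table kept in _G), as in the snippet above, since that is where the value is kept between requests.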

All of these changes put together allow us to deploy our code via a folder switch/symlink model, while avoiding the apc.stat caching issues and without restarting Lighttpd or PHP. For us this has been a huge win all around and a great way to allow us to deploy as frequently as needed without fear of the deployment not being atomic due to caching issues or fear of blowing away our APC user cache.

The client-side code polls the deployment server every minute, looking for a new version. When it finds one, it downloads the content, untars the files into a folder named after the SVN revision number, and updates deployed_version.txt to contain the newly deployed SVN revision number for Lighttpd to read.
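One detail worth illustrating is the update of deployed_version.txt itself, since Lighttpd may read the file at any moment. The sketch below is an assumption about how such an update could be made safe, not a description of our client code: the function name write_version_atomically is hypothetical, and it relies on os.rename replacing the target atomically, which holds on POSIX systems when both paths are on the same filesystem.

```lua
-- Hypothetical helper: write `version` to `path` via a temporary file
-- plus rename, so a reader sees either the old contents or the new
-- ones, never a partially written file.
local function write_version_atomically(path, version)
  local tmp_path = path .. ".tmp"
  local out, err = io.open(tmp_path, "w")
  if not out then
    return nil, err
  end
  out:write(tostring(version), "\n")
  out:close()
  -- os.rename wraps C rename(); on POSIX it replaces an existing target.
  local ok, rename_err = os.rename(tmp_path, path)
  if not ok then
    os.remove(tmp_path)
    return nil, rename_err
  end
  return true
end

-- Example call (the revision number here is made up):
-- write_version_atomically("deployed_version.txt", 123456)
```

The function returns true on success, or nil plus an error message if the temporary file cannot be opened or the rename fails.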

The client code runs on every webserver, polling the deployment server for changes, and with the current 1-minute schedule, we are able to deploy code on average in under 90 seconds. We are pretty excited to be well under our 2-minute SLA, but in the true Wayfair spirit of never settling for good enough, we are looking at ways to improve on this still.
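As a rough back-of-the-envelope check, the two timers described in this article (the 1-minute poll and the 10-second Lua cache TTL) bound how long a finished build can wait before webservers start serving it. The sketch below assumes a build can land at any point in either interval with equal likelihood, which is a simplification, and it does not model download or untar time.

```lua
-- Delay contributed by the two timers described in this article.
local poll_interval = 60  -- seconds, the client's 1-minute polling schedule
local cache_ttl     = 10  -- seconds, the TTL on the cached version in Lua

-- Worst case: the build lands just after a poll, then just after a cache refresh.
local worst_case_wait = poll_interval + cache_ttl
-- Average, assuming uniformly distributed timing (an assumption).
local average_wait = poll_interval / 2 + cache_ttl / 2

print(string.format("worst case: %d s, average: %d s",
                    worst_case_wait, average_wait))
-- worst case: 70 s, average: 35 s
```

Under these assumptions, polling accounts for most of the timer-induced wait.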

The low-hanging fruit is our cron job, which currently adds up to 60 seconds to each deployment. We could hack that by having the job run every minute but poll every 30 seconds, or even every 5 seconds, via a quick shell script, but that starts to add a lot of overhead and requests to the deployment server. We are currently looking to convert our deployment code from a once-a-minute cron job to an "always-on" AMQP subscriber. This will allow the deployment server to publish new versions to a queue and have them automagically delivered, semi-instantly, to all of our servers, bringing deployment time down even further.