Wayfair Tech Blog

Dynamic WAN Optimizer Routing

Wayfair recently deployed Steelhead WAN optimizers in our network.  We were unsatisfied with the 3 (2.5?) main deployment methods Riverbed suggests, so we designed our own.  But first, the vendor-suggested methods:

Physical In-Path – Physically puts the device in-line with your traffic.  Deployment is easy, but if you ever need to remove the device you will lose connectivity (fail-to-wire still needs wires attached to work!).  You also lose the ability to dynamically remove the device from the traffic path entirely for troubleshooting purposes.

Virtual In-Path – Logically directs traffic through the Steelhead appliance, but depends on redirection mechanisms like WCCP, PBR, or Layer-4 switching/routing.  WCCP is proprietary, so right off the bat that’s out the window.  PBR is incredibly powerful, but not dynamic.  You can make it “dynamic” by utilizing “verify-availability,” but this feature is not widely supported, particularly on the hardware we run in our branch offices, where this will be deployed.  So that option is out as well.
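For context, here is a rough sketch of what the PBR “verify-availability” approach looks like on a platform that does support it.  Every name and address in it (the SLA and track numbers, the ACL and route-map names, Vlan100, and 192.0.2.10 standing in for a Steelhead in-path address) is a hypothetical placeholder, not something from our deployment:

```
! Hypothetical example of the rejected option; all names and addresses are placeholders
ip sla 1
 icmp-echo 192.0.2.10
 frequency 5
ip sla schedule 1 life forever start-time now
!
track 1 ip sla 1 reachability
!
ip access-list extended WAN-OPT-TRAFFIC
 permit tcp 10.100.100.0 0.0.0.255 any
!
route-map TO-STEELHEAD permit 10
 match ip address WAN-OPT-TRAFFIC
 ! only policy-route to the Steelhead while track 1 is up
 set ip next-hop verify-availability 192.0.2.10 10 track 1
!
interface Vlan100
 ip policy route-map TO-STEELHEAD
```

When the tracked object goes down, the policy stops applying and traffic falls back to the normal routing table, which is the “dynamic” behavior we wanted but could not get on our branch hardware.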

Out-of-Path – This option is only supported for server-side Steelheads (in the datacenter) and requires the remote Steelheads to be in one of the aforementioned, already disqualified, configurations.

So with the predefined options not making the grade, we set out to find another way.  Our goal was an easy physical deployment, with the flexibility to manually or dynamically remove the Steelhead from the traffic path completely, using the same deployment model company-wide.  We ended up designing our own “Dynamic-Physical-In-Path” deployment (DPIP?).  We derived it from the classic, albeit ugly, way of moving data between two distinct VRFs with a physical cable; we just extended that cable to include the Steelhead as well.  This allowed us to keep the design contained within the switch and the Steelhead, which fit our goal of a nice, compact and relatively simple deployment.

In our remote locations the hardware we’re running this on is 3560s with IOS 15.0(1)SE3 and the “IP Services” feature set.  In our datacenter we’re running this on 4500Xs with IOS XE 3.3.0SG and the “Enterprise Services” license.  I normally wouldn’t bring up this level of minutiae, but we ran into a number of issues that were specific to IOS version and model line.

3560 CSCtr94182 – In 12.2(58)SE2, ARP did not work correctly between two interfaces on the same switch, even if they were in separate VRFs (but it DID work in earlier versions!).  To get around this we upgraded to IOS 15.0(1).  Version 15.0(1) was not without faults either: it has an issue with overlapping IP space between the implicit, default “global” table and our VRF, so we had to explicitly create another VRF that takes the place of the default table.

4500X CSCue71580 – The 4500X reverts its unique interface MAC address to the system’s burned-in MAC address (BIA) whenever that interface switches from L2 mode to L3 mode.  The Steelhead does its WAN-optimization magic with a blend of transparent L2 forwarding and L3 traffic originated at the device itself.  In our situation the Steelhead resides on a link between VRFs on the same device, so it sees the same BIA MAC address for both interfaces.  The L2 traffic flows right through without an issue, but the traffic sourced from the IP address of the Steelhead runs into a strange condition.  It has the correct L3 addresses and routes for both sides of the link, but the MAC addresses for both IPs are the same, meaning that when the Steelhead passes this traffic down to the NIC, it doesn’t know the correct physical interface to use and just “picks” one at random.  By default Cisco gear will send ICMP redirects to correct a situation where you’re talking to the wrong router on a segment, but since this redirect “updates” the Steelhead with the same MAC address it already knew of, it has no effect aside from dropping traffic while the redirect is happening.  To “fix” this we disabled ICMP redirects on the interfaces.  Now instead of sending an ICMP redirect to the Steelhead, the switch just sends that traffic right back out with the correct next hop, even if that traffic is going back out the interface it was received on.  When this traffic re-traverses the Steelhead it is in L2 form, and is just pushed through to the other side without issue.  This “fix” results in a tangible amount of packet ricochet, but that is still preferable to packet loss.
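As a rough sketch of that workaround on the 4500X side: the interface names below are hypothetical stand-ins for the two routed ports facing the Steelhead, and the “show” command is just one way to compare the MAC address and BIA each port reports.

```
! Hypothetical interface names for the two ends of the Steelhead link
interface TenGigabitEthernet1/1
 no ip redirects
!
interface TenGigabitEthernet1/2
 no ip redirects
!
! From exec mode: compare the address and BIA reported by each port
show interfaces TenGigabitEthernet1/1 | include bia
show interfaces TenGigabitEthernet1/2 | include bia
```

If both ports report the same address, the Steelhead is in the situation described above and the redirects are worth turning off.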

 

Now with all of that out of the way, let’s get to the actual config!

We start with a normal 3560-8PC operating in the SDM “desktop routing” template, and use the first 4 ports for this configuration.  This portion of the config is just for getting traffic through the Steelhead, or around it.  You’ll obviously also need interfaces for your hosts in the “global” VRF, and an interface to your next hop in the “rb” VRF, to have a fully functional data path.

Step 1: Create your VRFs

ip vrf global
 description global vrf for normal traffic
ip vrf rb
 description vrf for wan traffic going through the steelhead
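With the VRFs in place, the host-facing and next-hop-facing interfaces mentioned earlier might look something like the sketch below.  The interface names (Vlan100, GigabitEthernet0/1) and the interface addresses are assumptions chosen only to line up with the example host network (10.100.100.0/24) and next hop (10.3.3.1) used in the routing step; substitute your own.

```
! Hypothetical host-facing SVI in the "global" VRF
interface Vlan100
 description hosts
 ip vrf forwarding global
 ip address 10.100.100.1 255.255.255.0
!
! Hypothetical routed uplink toward the WAN next hop in the "rb" VRF
interface GigabitEthernet0/1
 description uplink to WAN next hop
 no switchport
 ip vrf forwarding rb
 ip address 10.3.3.2 255.255.255.252
```

Keep in mind that applying “ip vrf forwarding” to an interface removes any IP address already configured on it, so set the VRF first and the address second.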

Step 2: Create the Steelhead path

interface FastEthernet0/1
 description Steelhead WAN
 no switchport
 ip vrf forwarding rb
 ip address 10.1.1.6 255.255.255.248
 no cdp enable
 no ip redirects
 spanning-tree portfast
 spanning-tree bpdufilter enable
!
interface FastEthernet0/2
 description Steelhead LAN
 no switchport
 ip vrf forwarding global
 ip address 10.1.1.1 255.255.255.248
 no cdp enable
 no ip redirects
 spanning-tree portfast
 spanning-tree bpdufilter enable

Step 3: Create your fallback data path

interface FastEthernet0/3
 description vrf-loop (rb)
 no switchport
 ip vrf forwarding rb
 ip address 10.2.2.2 255.255.255.252
 no cdp enable
 no ip redirects
 spanning-tree portfast
 spanning-tree bpdufilter enable
!
interface FastEthernet0/4
 description vrf-loop (global)
 no switchport
 ip vrf forwarding global
 ip address 10.2.2.1 255.255.255.252
 no cdp enable
 no ip redirects
 spanning-tree portfast
 spanning-tree bpdufilter enable
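Before moving on, it is worth a quick sanity check that both cables actually pass traffic between the VRFs, since this is exactly where the ARP bug mentioned above would bite.  A minimal sketch using standard exec commands (this check is our suggestion, and the addresses are the ones configured on Fa0/1 through Fa0/4):

```
! Confirm each interface landed in the intended VRF
show ip vrf interfaces
!
! Across the Fa0/2 -> Steelhead -> Fa0/1 pair
ping vrf global 10.1.1.6 source 10.1.1.1
!
! Across the Fa0/3 -> Fa0/4 vrf-loop cable
ping vrf rb 10.2.2.1 source 10.2.2.2
```

If either ping fails with the cabling in place, check the ARP tables for each VRF before suspecting the Steelhead.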

Step 4: Create the SLAs and track elements

ip sla 10
 icmp-echo 10.1.1.6 source-ip 10.1.1.1
 vrf global
 threshold 500
 timeout 1000
 frequency 1
ip sla schedule 10 life forever start-time now
ip sla 20
 icmp-echo 10.2.2.1 source-ip 10.2.2.2
 vrf rb
 threshold 500
 timeout 1000
 frequency 1
ip sla schedule 20 life forever start-time now
!
track 10 ip sla 10 reachability
track 20 ip sla 20 reachability
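Once the probes are scheduled, something like the following should show whether they are succeeding and whether the track objects agree.  This is only a suggested check using standard IOS 15.x show commands; the exact output varies by release, so none is reproduced here.

```
! Probe results for the Steelhead path (10) and the vrf-loop path (20)
show ip sla statistics 10
show ip sla statistics 20
!
! State of the track objects the static routes will reference
show track 10
show track 20
```

SLA 10 pings across the Steelhead link and SLA 20 pings across the loop cable, so in normal operation both track objects should report reachability as up.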

Step 5: Add the routing that pulls it all together

! routes to whatever your host networks are
ip route vrf rb 10.100.100.0 255.255.255.0 10.1.1.1 10 track 10
ip route vrf rb 10.100.100.0 255.255.255.0 10.2.2.1 250 track 20
! whatever your next hop is
ip route vrf rb 0.0.0.0 0.0.0.0 10.3.3.1 10
!
ip route vrf global 0.0.0.0 0.0.0.0 10.1.1.6 10 track 10
ip route vrf global 0.0.0.0 0.0.0.0 10.2.2.2 250 track 20
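To see the failover in action, here is a rough sketch of one way to check which path is active and to pull the Steelhead out of the path by hand.  Shutting the two Steelhead-facing ports is our assumption about a reasonable manual bypass; anything else that takes track 10 down has the same effect.

```
! Which next hop is currently installed in each VRF?
show ip route vrf global 0.0.0.0
show ip route vrf rb 10.100.100.0
!
! Manual bypass: take down the Steelhead-facing ports
configure terminal
 interface range FastEthernet0/1 - 2
  shutdown
 end
!
! The distance-250 routes via the vrf-loop cable should now be installed
show ip route vrf global 0.0.0.0
show ip route vrf rb 10.100.100.0
```

With Fa0/1 and Fa0/2 down, SLA 10 fails, track 10 goes down, and the distance-10 routes are withdrawn, leaving the distance-250 routes through Fa0/3 and Fa0/4 (as long as track 20 is still up).  A “no shutdown” on the same range puts the Steelhead back in the path once SLA 10 recovers.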