This paper describes SPAIN, a method to build data center networks (load balanced networks) from commodity ethernet switches and specialized hosts.

The hosts are required with no special hardware but a specialized driver so that all the packets sending out are in a VLAN and all the hosts are capable of doing some kind of signalling to other hosts. The idea is, for all hosts to any other hosts, we find \(k\) paths, preferably disjoin but not required, and combine them into a small number of VLAN spanning trees. Then the switches are configured with those VLANs with hand-coded forwarding tables. The host is then send packets with a per-flow random VLAN so that the traffic can be balanced. The merit of this paper is in the algorithm to combine paths into VLANs.

Summary

Problem of ethernet:

  • Spanning tree
    • STP makes a tree, which means inefficient use of cross-sectional bandwidth
    • STP makes network susceptible to failure near root
    • Making STP network of high bandwidth need ultra-fast cores, which is expensive
  • Packet floods
    • Ethernet learns and relearns the location of MAC
    • If packet’s address is not known at the moment, packets are flood to all ports
  • Host broadcasts
    • Protocols such as ARP, DHCP relies on bcast

SPAIN: Use VLAN to select different trees, so that total comm. bandwidth can be larger than a single VLAN

VLAN building:

  • Find k paths from node s to node t, disjoint paths preferred, but not required
  • Among all the paths, build loop-free aggregates (VLAN trees)
    • Each path belongs to a unique tree
    • Number of trees are minimized
  • Optimal solution is NP-hard, so heuristic algorithm is used

Performance

  • Compare different topologies: FatTree, BCube, 2D HyperX, CiscoDC
  • The algorithm works good: Number of VLANs is close to optimal
  • The algorithm is doing stochastic VLAN packing, the time required is reasonable

Fault tolerance

  • Either by Per-VLAN Spanning Tree (Cisco) / Multiple Spanning Tree (IEEE 802.1s)
  • Or by end-host failure mechanisms
  • End-host failure mechanisms can do faster repair

FIB pinning: Directly program the FIB tables on the switches

  • Requires central knowledge of MAC locations
  • Program the VLAN map and FIB tables on all switches
  • Disadvantage: May produce a larger FIB table
    • If FIB is learning-based, unused host are not recorded
    • FIB pinning will always have an entry for all possible destinations
  • Typical switch FIB table: 16K entries on SRAM, 128K entries on DRAM

End host algorithm

  • Five goals:
    • Spread load
    • Minimize overhead of bcast and flooding
    • Detect and react to failures
    • Facilitate mobility (e.g. VM migration)
    • Enable incremental deployment
  • Send packet:
    • Select an usable VLAN and send (randomly select)
    • If no candidate VLAN, send on default VLAN (VLAN 1)
    • Probe on all candidate but not usable VLANs
      • Send unicast chirp message to the destination on a VLAN
      • Rx of chirp signals VLAN usable at rx side
      • Respond of chirp may be requested
  • Re-pinning: Change a flow’s selected VLAN
    • Only in case of
      • Failure (immediate re-pin)
      • VM migration (immediate re-pin)
      • Improve load balance
      • Probe for revived VLANs
    • Re-pin algorithm for a flow
      • This host moved
      • My destination host moved
      • This flow is new
      • Non-TCP flow hasn’t re-pinned for too long
      • TCP flow becomes too slow (cwnd < threshold) due to rexmit
  • Receive packet
    • If chirp packet, do respond as requested on unicast
    • Any incoming packet is a proof of the health of the source on this VLAN
    • If chirp hasn’t been sent, send one to the source to signal for healthy VLAN
  • End hosts keeps the following information
    • VLAN in use for a dest switch (addr)
    • VLAN usable for a dest switch (addr)
    • These info are cleared periodically
  • Failure detection: By no packet received in VLAN
    • Stop using that VLAN for that destination

Bibliographic data

@techreport{
   title = "SPAIN: Design and Algorithms for Constructing Large Data-Center Ethernets from Commodity Switches",
   author = "Jayaram Mudigonda and Praveen Yalagandula and Mohammad Al-Fares and Jeffrey C. Mogul",
   howpublished = "HP TechRep",
   institution = "HP",
   number = "2009-241",
   year = "2009",
}