Solution to incast problem using switch, transparent to the end hosts.

Data center switch usually requires extremely low delay and extremely high capacity. This is a stringent requirement. TCP incast is caused by fast and bursty transmissions that overfilled the switch buffer, causing intense packet loss and TCP timeout.

While TCP incast usually is about synchronized short flows, this paper also described the starvation of long flows. The simulation they did is having 400 long TCP NewReno flows with RTT=100us sending concurrently. This is a goodput collapse because the long flows starved for tens of seconds while the total switch throughput stays high. This is also caused by TCP timeout, and the timeout cause unfairness among flows, as they have high variation in delay.

The algorithm proposed, Hashed Credits Fair (HCF) in as follows:

The switch holds two queues, the high-priority (HP) and low-priority (LP) queues, each of size $B/2$, making up the total buffer size of $B$. The HP queue holds packets of flows that have recently been under-served; and the LP queue are for all other packets. The switch also has an array of $K$ counters, one counter for one “bin”.

When a flow arrives to the switch, the flow is hashed into one of the K bins. The hash function is updated regularly to avoid persistent hash collision between flows. The hash function is changed for each priority period. A priority period is defined as the time from the HP queue just emptied until the HP queue contains some packet and emptied again.

Initially, at the beginning of a priority period, besides picking a new hash function $f$, the $K$ counters are initialized with $c(k)=c_0$ credits. Then upon the arrival of packet $p$, the bin correspond to the packet is computed by $k=f(p)$. Then if the bin has credits, i.e. $c(k)>0$, and the HP queue is not full, packet $p$ joins the HP queue and the credit in $c(k)$ is deduced. Otherwise, the packet join LP queue or dropped if it is full. Packets departure is always in HP queue; unless HP queue is empty, then the head of line packet in LP queue is dispatched.

To guarantee in-order transmission of the packet in the same flow, upon transiting to a new priority period, the LP queue and HP queue are exchanged, so that the packet arrived in the previous priority period is always transmitted first. The HCF algorithm has a complexity of $O(1)$

This paper found that, with HCF, the incast scenario is mitigated. It has a higher goodput then FIFO switches during incast. [But why the goodput decreases as the number of servers increases but the goodput increases again when we have even more servers?] The paper also found that, HCF achieved a better fairness in terms of more even throughput distribution between concurrent flows.

Further readings:

  • R. Morris, “TCP Behaviour with many flows,” ICNP’97, Oct. 1997
  • R. Morris, “Scalable TCP Congestion Control”, INFOCOM’00, Mar. 2000
  • D. Stiliadis and A. Varma, “Latency-rate servers: A General Model for Analysis of Traffic Scheduling Algorithms”, IEEE/ACM Trans Netw, Oct. 1997
  • M. Shreedhar and G. Varghese, “Efficient Fair Queeuing using Deficit Round-Robin”, IEEE/ACM Trans Netw, 4(3):375–385, Jun. 1996
  • P. E. McKenney, “Stochastic Fairness Queueing”, INFOCOM’90, Vol.2, pp.733-740, Jun. 1990
  • C. Estan and G. Varghese, “New Directions in Traffic Measurement and Accounting”, ACM SIGCOMM’02, Vol.32 No.4 pp.323-336, 2002

Bibliographic data

   author = "Alexander Shpiner and Isaac Keslassy",
   title = "A Switch-Based Approach to Throughput Collapse and Starvation in Data Centers",
   booktitle = "Proc. IWQoS",
   year = "2010",