[Bloat] Jumbo frames and LAN buffers (was: RE: Burst Loss)

Jonathan Morton chromatix99 at gmail.com
Sun May 15 20:31:41 EDT 2011


On 15 May, 2011, at 11:49 pm, Fred Baker wrote:

> 
> On May 15, 2011, at 11:28 AM, Jonathan Morton wrote:
>> The fundamental thing is that the sender must be able to know when sent frames can be flushed from the buffer because they don't need to be retransmitted.  So if there's a NACK, there must also be an ACK - at which point the ACK serves the purpose of the NACK, as it does in TCP.  The only alternative is a wall-time TTL, which is doable on single hops but requires careful design.
> 
> To a point. NORM holds a frame for possible retransmission for a stated period of time, and if retransmission isn't requested in that interval forgets it. So the ack isn't actually necessary; what is necessary is that the retention interval be long enough that a nack has a high probability of succeeding in getting the message through.

Okay, so because it can fall back to TCP's retransmit, the retention requirements can be relaxed.
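
To make that concrete, here's a rough sketch of a sender-side retention buffer along those lines - the names, the fixed interval and the structure are entirely mine, not NORM's actual machinery:

import time

RETENTION_SECONDS = 0.05  # assumed retention interval, comfortably above the link RTT

class RetentionBuffer:
    """Toy sender-side buffer: frames are kept just long enough for a NACK
    to plausibly arrive, then forgotten - no ACK is ever needed."""

    def __init__(self):
        self.frames = {}  # seq -> (payload, time sent)

    def sent(self, seq, payload):
        self.frames[seq] = (payload, time.monotonic())

    def nack(self, seq):
        entry = self.frames.get(seq)
        if entry is not None:
            return entry[0]   # still retained: retransmit at the link layer
        return None           # too late: end-to-end (TCP) recovery takes over

    def expire(self):
        now = time.monotonic()
        self.frames = {s: (p, t) for s, (p, t) in self.frames.items()
                       if now - t < RETENTION_SECONDS}

If the NACK misses the window, the frame is simply gone and end-to-end recovery takes over - which is exactly the relaxation described above.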

>> ...recent versions of Ethernet *do* support a throttling feedback mechanism, and this can and should be exploited to tell the edge host or router that ECN *might* be needed.  Also, with throttling feedback throughout the LAN, the Ethernet can for practical purposes be treated as almost-reliable.  This is *better* in terms of packet loss than ARQ or NACK, although if the Ethernet's buffers are large, it will still increase delay.  (With small buffers, it will just decrease throughput to the capacity, which is fine.)
> 
> It increases the delay anyway. It just pushes the retention buffer to another place. What do you think the packet is doing during the "don't transmit" interval?

Most packets delayed by Ethernet throttling would, with small buffers, end up waiting in the sending host (or router).  They thus spend that time in a queue which can potentially be actively managed, instead of in a dumb one.  But even if the host queue is dumb, the overall delay is no worse than with the larger Ethernet buffers.

> Throughput never exceeds capacity. If I have a 10 GBPS link, I will never get more than 10 GBPS through it. Buffer fill rate is statistically predictable. With small buffers, the fill rate achieves the top sooner. They increase the probability that the buffers are full, which is to say the drop probability. Which puts us to an end to end retransmission, which is the worst case of what you were worried about.

Let's suppose someone has generously provisioned an office with GigE throughout, using a two-level hierarchy of switches.  Some dumb schmuck then schedules every single computer to run its backups (to a single fileserver) at the same time.  That's, say, 100 computers all competing for one GigE link to the fileserver.  If the switches are fair, each computer should get 10Mbps - that's the capacity.

With throttling, each computer sees the link closed 99% of the time.  It can send at link rate for the remaining 1% of the time.  On medium timescales, that looks like a 10Mbps bottleneck at the first link.  So the throughput on that link equals the capacity, and hopefully the goodput matches it.  The only queue that is likely to overflow is the one on the sending computer, and one would hope there is enough feedback in a host's own TCP/IP stack to prevent that.
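
The arithmetic, as a quick sanity check (just the numbers from the example above):

link_rate_bps = 1_000_000_000   # the GigE uplink to the fileserver
clients = 100                   # computers all backing up at once

fair_share_bps = link_rate_bps / clients
print(fair_share_bps / 1e6, "Mbps per client")   # 10.0 Mbps - the capacity share

# With PAUSE-style throttling each client is simply told "don't transmit"
# most of the time; bursting at full link rate for the remaining fraction
# averages out to exactly the fair share.
duty_cycle = fair_share_bps / link_rate_bps
print(duty_cycle)                                # 0.01 - the link is open ~1% of the time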

Without throttling but with ARQ, NACK or whatever you want to call it, the host has no signal telling it to slow down - so the throughput on the edge link is more than 10Mbps (but the goodput will be less).  The buffer in the outer switch fills up - no matter how big or small it is - and starts dropping packets.  The switch then won't ask for retransmission of packets it has just dropped, because it has nowhere to put them.  The same process then repeats at the inner switch.  Finally, the server sees the missing packets and asks for retransmission - but those requests have to be switched all the way back to the clients, because the missing packets aren't in the switches' buffers.  It's therefore no better than a TCP SACK retransmission.
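
Here's a toy discrete-time model of that, purely to illustrate the point that the steady-state drop rate is set by the overload and not by the buffer size (the model and its numbers are made up, not measurements):

import random

def drop_rate(buffer_frames, offered_load=2.0, slots=100_000):
    """Toy single-queue switch: forwards one frame per slot, while
    'offered_load' frames per slot arrive on average."""
    queue = arrived = dropped = 0
    for _ in range(slots):
        arrivals = sum(random.random() < offered_load / 10 for _ in range(10))
        arrived += arrivals
        for _ in range(arrivals):
            if queue < buffer_frames:
                queue += 1
            else:
                dropped += 1   # nowhere to put it, so the switch also cannot
                               # honour any later retransmission request for it
        if queue:
            queue -= 1         # one frame forwarded this slot
    return dropped / arrived

for size in (16, 256, 4096):
    print(size, round(drop_rate(size), 3))   # once the buffer fills, the drop
                                             # rate barely depends on its size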

So there you have a classic congested network scenario in which throttling solves the problem, but link-level retransmission can't.

Where ARQ and/or NACK come in handy is where the link itself is unreliable, such as on WLANs (hence the use in amateur radio) and last-mile links.  In that case, the reason for the packet loss is not a full receive buffer, so asking for a retransmission is not inherently self-defeating.

> I'm not going to argue against letting retransmission go end to end; it's an endless debate. I'll simply note that several link layers, including but not limited to those you mention, find that applications using them work better if there is a high probability of retransmission in an interval on the order of the link RTT as opposed to the end to end RTT. You brought up data centers (aka variable delays in LAN networks); those have been heavily the province of Fibre Channel, which is a link layer protocol with retransmission. Think about it.

What I'd like to see is a complete absence of need for retransmission on a properly built wired network.  Obviously the capability still needs to be there to cope with the parts that aren't properly built or aren't wired, but TCP can do that. Throttling (in the form of Ethernet PAUSE) is simply the third possible method of signalling congestion in the network, alongside delay and loss - and it happens to be quite widely deployed already.
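
(For what it's worth, on Linux you can check whether an interface actually has 802.3x PAUSE negotiated with "ethtool -a"; a trivial wrapper - the interface name here is just an example:)

import subprocess

def pause_settings(iface="eth0"):   # "eth0" is just an example name
    """Show the IEEE 802.3x PAUSE (flow control) settings for an interface,
    as reported by 'ethtool -a'."""
    result = subprocess.run(["ethtool", "-a", iface],
                            capture_output=True, text=True, check=True)
    return result.stdout

if __name__ == "__main__":
    # Enabling, where the NIC supports it, is "ethtool -A <iface> rx on tx on".
    print(pause_settings())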

 - Jonathan



