[Bloat] Burst Loss

Wed May 11 04:53:16 EDT 2011

> Within the context of a given "priority" at least, NICs are
> setup/designed to do things in order.  I too cannot claim to be a NIC
> designer, but suspect it would be a non-trivial, if straight-forward
> exercise to get a NIC to cycle through multiple GSO/TSO sends.  Yes,
> they could probably (ab)use any prioritization support they have.
>
> NICs and drivers are accustomed to "in order" processing - grab packet,
> send packet, update status, lather, rinse, repeat (modulo some
> pre-fetching).  Those rings aren't really amenable to "out of order"
> completion notifications, so the NIC would have to still do "in order"
> retirement of packets or the driver model will lose simplicity.
>
> As for the issue below, even if the NIC(s) upstream did interleave
> between two GSO'd sends, you are simply trading back-to-back frames of a
> single flow for back-to-back frames of different flows.  And if there is
> only the one flow upstream of this bottleneck, whether GSO is on or not
> probably won't make a huge difference in the timing - only how much CPU
> is burned on the source host.

Well, the transmit descriptors (header + pointer to the data to be
segmented) are in the hands of the hw driver...
The hw driver could at least check whether the current list of transmit
descriptors covers different tcp sessions (or interspersed non-tcp
traffic), and could interleave these descriptors (reorder them before
they are processed by hardware), while obviously maintaining relative
ordering between the descriptors belonging to the same flow.
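
To make the reordering idea concrete, here is a minimal sketch in Python
(a real driver would do this in C against its own descriptor ring). The
(flow_id, payload) tuples and the name interleave_by_flow are made up for
illustration and do not correspond to any actual driver API. The only
point is that a round-robin over per-flow queues interleaves flows while
keeping each flow's descriptors in their original relative order.

```python
from collections import OrderedDict, deque


def interleave_by_flow(descriptors):
    """Round-robin across flows, preserving per-flow order.

    `descriptors` is a list of (flow_id, payload) tuples in the order the
    stack queued them. Both fields are hypothetical stand-ins for a real
    transmit descriptor.
    """
    # One FIFO per flow, remembered in order of first appearance.
    queues = OrderedDict()
    for flow_id, payload in descriptors:
        queues.setdefault(flow_id, deque()).append((flow_id, payload))

    reordered = []
    while queues:
        # Take one descriptor from each flow that still has work pending.
        for flow_id in list(queues):
            queue = queues[flow_id]
            reordered.append(queue.popleft())
            if not queue:
                del queues[flow_id]
    return reordered


pending = [("A", 1), ("A", 2), ("A", 3), ("B", 1), ("B", 2), ("C", 1)]
reordered = interleave_by_flow(pending)
```

In this sketch the reordering happens before anything is handed to the
hardware, so the NIC would still process and retire its ring strictly in
order.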

Also, I think this feature could be utilized for pacing to some extent:
intersperse the (valid) traffic descriptors with descriptors that cause
"invalid" packets to be sent (i.e. dst mac == src mac; these should be
dropped by the first switch). It is well known that properly paced
traffic is much more resilient than traffic sent in short wire-speed
bursts of back-to-back packets. (TSO defeats the self-clocking of TCP
with ACKs.)
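
As a back-of-the-envelope sketch of how many filler frames that would
take (Python, assuming for simplicity that filler and data frames have
the same size, and ignoring preamble and inter-frame gap; the function
names are hypothetical):

```python
def filler_frames_per_data_frame(line_rate_bps, target_rate_bps):
    """Filler frames needed after each data frame to pace a flow down to
    target_rate_bps on a link that always transmits at line_rate_bps.

    Assumes filler and data frames are the same size (a simplification
    made for this illustration only).
    """
    if not 0 < target_rate_bps <= line_rate_bps:
        raise ValueError("target rate must be positive and not exceed line rate")
    return line_rate_bps / target_rate_bps - 1


def serialization_time_us(frame_bytes, line_rate_bps):
    """Time one frame occupies the wire, in microseconds."""
    return frame_bytes * 8 * 1_000_000 / line_rate_bps


# 1 Gbit/s link, flow paced down to 100 Mbit/s, 1500 byte frames.
ratio = filler_frames_per_data_frame(1_000_000_000, 100_000_000)
spacing_us = (1 + ratio) * serialization_time_us(1500, 1_000_000_000)
```

For a 1 Gbit/s link and a 100 Mbit/s target this gives nine filler frames
per data frame, i.e. one 1500 byte data frame every 120 us instead of
every 12 us.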

Just a thought...

Richard

----- Original Message ----- 
From: "Rick Jones" <rick.jones2 at hp.com>
To: "Richard Scheffenegger" <rscheff at gmx.at>
Cc: "Neil Davies" <Neil.Davies at pnsol.com>; "Stephen Hemminger" 
<shemminger at vyatta.com>; <bloat at lists.bufferbloat.net>
Sent: Monday, May 09, 2011 8:06 PM
Subject: Re: [Bloat] Burst Loss

> On Sun, 2011-05-08 at 14:42 +0200, Richard Scheffenegger wrote:
>> I'm not an expert in TSO / GSO and NIC driver design, but what I
>> gathered is that with these schemes, and modern NICs that do
>> scatter/gather DMA of dozens of "independent" header/data chunks
>> directly from memory, the NIC will typically send out non-interleaved
>> trains of segments all belonging to a single TCP session. This comes
>> with the implicit assumption that these bursts of up to 180 segments
>> (Intel supports 256kB data per chain) can be absorbed by the buffer at
>> the bottleneck and spread out in time there...
>>
>> From my perspective, having such GSO / TSO "cycle" through all the
>> different chains belonging to different sessions (so as not to
>> introduce reordering at the sender) should already help pace the
>> segments per session somewhat; a slightly more sophisticated DMA
>> engine could check how much data each of the chains has to send, and
>> then clock out an appropriate number of interleaved segments... I do
>> understand that this is "work" for a HW DMA engine and slows down GSO
>> software implementations, but it may severely reduce the instantaneous
>> rate of a single session, and thereby the impact of burst loss due to
>> momentary buffer overload...
>>
>> (Let me know if I should draw a picture of the way I understand TSO /
>> HW DMA is currently working, and where it could be improved upon):
>
> GSO/TSO can be thought of as a symptom of standards bodies (eg the IEEE)
> refusing to standardize an increase in frame sizes.  Put another way,
> they are a "poor man's jumbo frames."
>
> Within the context of a given "priority" at least, NICs are
> setup/designed to do things in order.  I too cannot claim to be a NIC
> designer, but suspect it would be a non-trivial, if straight-forward
> exercise to get a NIC to cycle through multiple GSO/TSO sends.  Yes,
> they could probably (ab)use any prioritization support they have.
>
> NICs and drivers are accustomed to "in order" processing - grab packet,
> send packet, update status, lather, rinse, repeat (modulo some
> pre-fetching).  Those rings aren't really amenable to "out of order"
> completion notifications, so the NIC would have to still do "in order"
> retirement of packets or the driver model will lose simplicity.
>
> As for the issue below, even if the NIC(s) upstream did interleave
> between two GSO'd sends, you are simply trading back-to-back frames of a
> single flow for back-to-back frames of different flows.  And if there is
> only the one flow upstream of this bottleneck, whether GSO is on or not
> probably won't make a huge difference in the timing - only how much CPU
> is burned on the source host.
>
>> Best regards,
>>    Richard
>>
>>
>> ----- Original Message ----- 
>> > Back to back packets see higher loss rates than packets more spread
>> > out in time. Consider a pair of packets, back to back, arriving over
>> > a 1Gbit/sec link into a queue being serviced at 34Mbit/sec. The first
>> > packet being 'lost' is equivalent to saying that the first packet
>> > 'observed' the queue full - the system's state is no longer a random
>> > variable - it is known to be full. The second packet (let's assume it
>> > is also a full one) 'makes an observation' of the state of that queue
>> > about 12us later - but that is only 3% of the time that it takes to
>> > service such large packets at 34 Mbit/sec. The system has not had any
>> > time to 'relax' anywhere near back to its steady state; it is highly
>> > likely that it is still full.
>> >
>> > Fixing this makes a phenomenal difference to the goodput (with the
>> > usual delay effects that implies). We've even built and deployed
>> > systems with this sort of engineering embedded (deployed as a network
>> > 'wrap'), which mean that end users can sustainably (days on end)
>> > achieve effective throughput that is better than 98% of (the
>> > transmission media imposed) maximum. What we had done is make the
>> > network behave closer to the underlying statistical assumptions made
>> > in TCP's design.
>> >
>> > Neil
>> >
>> >
>> >
>> >
>> > On 5 May 2011, at 17:10, Stephen Hemminger wrote:
>> >
>> >> On Thu, 05 May 2011 12:01:22 -0400
>> >> Jim Gettys <jg at freedesktop.org> wrote:
>> >>
>> >>> On 04/30/2011 03:18 PM, Richard Scheffenegger wrote:
>> >>>> I'm curious, has anyone done some simulations to check if the
>> >>>> following qualitative statement holds true, and if so, what the
>> >>>> quantitative effect is:
>> >>>>
>> >>>> With bufferbloat, the TCP congestion control reaction is unduly
>> >>>> delayed. When it finally happens, the tcp stream is likely facing a
>> >>>> "burst loss" event - multiple consecutive packets get dropped. Worse
>> >>>> yet, the sender with the lowest RTT across the bottleneck will
>> >>>> likely start to retransmit while the (tail-drop) queue is still
>> >>>> overflowing.
>> >>>>
>> >>>> And a lost retransmission means a major setback in bandwidth (except
>> >>>> for Linux with bulk transfers and SACK enabled), as the standard
>> >>>> (RFC documented) behaviour asks for an RTO (1sec nominally, 200-500
>> >>>> ms typically) to recover such a lost retransmission...
>> >>>>
>> >>>> The second part (more important as an incentive to the ISPs,
>> >>>> actually): how does the fraction of goodput vs. throughput change
>> >>>> when AQM schemes are deployed and TCP CC reacts in a timely manner?
>> >>>> Small ISPs have to pay for their upstream volume, regardless of
>> >>>> whether that is "real" work (goodput) or unnecessary retransmissions.
>> >>>>
>> >>>> When I was at a small cable ISP in Switzerland last week, sure
>> >>>> enough bufferbloat was readily observable (17ms -> 220ms after 30
>> >>>> sec of a bulk transfer), but at first they had the "not our problem"
>> >>>> view, until I started discussing burst loss / retransmissions /
>> >>>> goodput vs throughput - with the last point being a real commercial
>> >>>> incentive to them. (They promised to check if AQM would be available
>> >>>> in the CPE / CMTS, and to put latency bounds in their tenders going
>> >>>> forward).
>> >>>>
>> >>> I wish I had a good answer to your very good questions.  Simulation
>> >>> would be interesting though real data is more convincing.
>> >>>
>> >>> I haven't looked in detail at all that many traces to try to get a
>> >>> feel for how much bandwidth waste there actually is, and more formal
>> >>> studies like Netalyzr, SamKnows, or the Bismark project would be
>> >>> needed to quantify the loss on the network as a whole.
>> >>>
>> >>> I did spend some time last fall with the traces I've taken.  In
>> >>> those, I've typically been seeing 1-3% packet loss in the main TCP
>> >>> transfers. On the wireless trace I took, I saw 9% loss, but whether
>> >>> that is bufferbloat induced loss or not, I don't know (the data is
>> >>> out there for those who might want to dig).  And as you note, the
>> >>> losses are concentrated in bursts (probably due to the details of
>> >>> Cubic, so I'm told).
>> >>>
>> >>> I've had anecdotal reports (and some first hand experience) with much
>> >>> higher loss rates, for example from Nick Weaver at ICSI; but I
>> >>> believe in playing things conservatively with any numbers I quote and
>> >>> I've not gotten consistent results when I've tried, so I just report
>> >>> what's in the packet captures I did take.
>> >>>
>> >>> A phenomenon that could be occurring is that during congestion
>> >>> avoidance (until TCP loses its cookies entirely and probes for a
>> >>> higher operating point) TCP is carefully timing its packets to keep
>> >>> the buffers almost exactly full, so that competing flows (in my case,
>> >>> simple pings) are likely to arrive just when there is no buffer space
>> >>> to accept them, and therefore you see higher losses on them than you
>> >>> would on the single flow I've been tracing and getting loss
>> >>> statistics from.
>> >>>
>> >>> People who want to look into this further would be a great help.
>> >>>                 - Jim
>> >>
>> >> I would not put a lot of trust in measuring loss with pings.
>> >> I heard that some ISPs do different processing on the ICMP packets
>> >> used for ping. They either prioritize them high to provide
>> >> artificially good response (better marketing numbers), or
>> >> prioritize them low since they aren't useful traffic.
>> >> There are also filters that only allow N ICMP requests per second,
>> >> which means repeated probes will be dropped.
>> >>
>> >>
>> >>
>> >> -- 
>> >> _______________________________________________
>> >> Bloat mailing list
>> >> Bloat at lists.bufferbloat.net
>> >> https://lists.bufferbloat.net/listinfo/bloat