[Bloat] [Cerowrt-devel] Ubiquiti QOS

dpreed at reed.com dpreed at reed.com
Wed May 28 08:33:42 PDT 2014


Same concern I mentioned with Jim's message.   I was not clear what I meant by "pacing" in the context of optimization of latency while preserving throughput.  It is NOT just a matter of spreading packets out in time that I was talking about.   It is a matter of doing so without reducing throughput.  That means transmitting as *early* as possible while avoiding congestion.  Building a "backlog" and then artificially spreading it out by "add-on pacing" will definitely reduce throughput below the flow's fair share of the bottleneck resource.
 
It is pretty clear to me that you can't get to a minimal latency, optimal throughput control algorithm by a series of "add ons" in LART.  It requires rethinking of the control discipline, and changes to get more information about congestion earlier, without ever allowing a buffer queue to build up in intermediate nodes - since that destroys latency by definition.
 
As long as you require buffers to grow at bottleneck links in order to get measurements of congestion, you probably are stuck with long-time-constant control loops, and as long as you encourage buffering at OS send stacks you are even worse off at the application layer.
 
The problem is in the assumption that buffer queueing is the only possible answer.  The "pacing" being included in Linux is just another way to build bigger buffers (on the sending host), by taking control away from the TCP control loop.
 
 


On Tuesday, May 27, 2014 1:31pm, "Dave Taht" <dave.taht at gmail.com> said:



> This has been a good thread, and I'm sorry it was mostly on
> cerowrt-devel rather than the main list...
> 
> It is not clear from observing google's deployment that pacing of the
> IW is not in use. I see clear 1ms boundaries for individual flows on
> much lower than iw10 boundaries. (e.g. I see 1-4 packets at a time
> arrive at 1ms intervals - but this could be an artifact of the capture,
> intermediate devices, etc)
> 
> sch_fq comes with explicit support for spreading out the initial
> window, (by default it allows a full iw10 burst however) and tcp small
> queues and pacing-aware tcps and the tso fixes and stuff we don't know
> about all are collaborating to reduce the web burst size...
> 
> sch_fq_codel used as the host/router qdisc basically does spread out
> any flow if there is a bottleneck on the link. The pacing stuff
> spreads flow delivery out across an estimate of srtt by clock tick...
> 
> It makes tremendous sense to pace out a flow if you are hitting the
> wire at 10gbit and know you are stepping down to 100mbit or less on
> the end device - that 100x difference in rate is meaningful... and at
> the same time to get full throughput out of 10gbit some level of tso
> offloads is needed... and the initial guess
> at the right pace is hard to get right before a couple RTTs go by.
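
A rough sketch of the arithmetic behind the 10gbit-to-100mbit step-down above. The function names and the example numbers (iw10, 1500-byte packets, 20 ms srtt) are assumptions for illustration only, and the even spreading shown here is a simplification, not how sch_fq computes its rate internally.

```python
def serialization_time_us(packet_bytes, link_bps):
    """Time to put one packet on the wire, in microseconds."""
    return packet_bytes * 8 / link_bps * 1e6


def paced_gap_us(window_packets, srtt_s):
    """Gap between packet starts if a window is spread evenly over one srtt."""
    return srtt_s / window_packets * 1e6


def queue_builds(gap_us, drain_us):
    """A queue grows at the slow link when packets arrive faster than they drain."""
    return gap_us < drain_us


if __name__ == "__main__":
    fast = serialization_time_us(1500, 10e9)    # sender's 10gbit wire
    slow = serialization_time_us(1500, 100e6)   # 100mbit end device
    paced = paced_gap_us(10, 0.020)             # assumed iw10 over a 20 ms srtt

    print(f"per-packet time at 10gbit:  {fast:.1f} us")
    print(f"per-packet time at 100mbit: {slow:.1f} us")
    print(f"paced gap (iw10 / 20 ms):   {paced:.1f} us")
    print("unpaced burst queues at the slow link:", queue_builds(fast, slow))
    print("paced window queues at the slow link: ", queue_builds(paced, slow))
```

With these assumed numbers an unpaced burst arrives 100 times faster than the slow link can drain it, while the paced window leaves the slow link idle between packets.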
> 
> I look forward to learning what's up.
> 
> On Tue, May 27, 2014 at 8:23 AM, Jim Gettys <jg at freedesktop.org> wrote:
> >
> >
> >
> > On Sun, May 25, 2014 at 4:00 PM, <dpreed at reed.com> wrote:
> >>
> >> Not that it is directly relevant, but there is no essential reason to
> >> require 50 ms. of buffering.  That might be true of some particular
> >> QOS-related router algorithm.  50 ms. is about all one can tolerate in
> >> any router between source and destination for today's networks - an
> >> upper-bound rather than a minimum.
> >>
> >>
> >>
> >> The optimum buffer state for throughput is 1-2 packets worth - in other
> >> words, if we have an MTU of 1500, 1500 - 3000 bytes. Only the bottleneck
> >> buffer (the input queue to the lowest speed link along the path) should
> >> have this much actually buffered. Buffering more than this increases
> >> end-to-end latency beyond its optimal state.  Increased end-to-end
> >> latency reduces the effectiveness of control loops, creating more
> >> congestion.
> 
> This misses an important facet of modern MACs (wifi, wireless, cable, and gpon),
> which can aggregate 32k or more in packets.
> 
> So the ideal size in those cases is much larger than an MTU, and has additional
> factors governing the ideal - such as the probability of a packet loss inducing
> a retransmit....
> 
> Ethernet, sure.
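
To put numbers on the two buffer sizes discussed above, here is a small sketch of the delay a standing queue adds at the bottleneck. Reading "32k" as 32 KiB of bytes is an assumption, as are the link rates and the function name.

```python
def queue_delay_ms(buffered_bytes, link_bps):
    """Delay a packet sees behind `buffered_bytes` on a link of `link_bps`."""
    return buffered_bytes * 8 / link_bps * 1000


if __name__ == "__main__":
    cases = (
        ("2 x 1500-byte packets", 3000),
        ("32 KiB aggregate (assumed bytes)", 32 * 1024),
    )
    for label, size in cases:
        for rate in (1e6, 10e6, 100e6, 1e9):
            delay = queue_delay_ms(size, rate)
            print(f"{label:34s} at {rate / 1e6:6.0f} mbit: {delay:8.3f} ms")
```

The same backlog costs very different latency depending on the drain rate, which is one reason a byte count alone does not settle the "ideal size" question.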
> 
> >>
> >>
> >>
> >> The rationale for having 50 ms. of buffering is probably to avoid
> >> disruption of bursty mixed flows where the bursts might persist for 50
> >> ms. and then die. One reason for this is that source nodes run operating
> >> systems that tend to release packets in bursts. That's a whole other
> >> discussion - in an ideal world, source nodes would avoid bursty packet
> >> releases by letting the control by the receiver window be "tight"
> >> timing-wise.  That is, to transmit a packet immediately at the instant
> >> an ACK arrives increasing the window.  This would pace the flow -
> >> current OS's tend (due to scheduling mismatches) to send bursts of
> >> packets, "catching up" on sending that could have been spaced out and
> >> done earlier if the feedback from the receiver's window advancing were
> >> heeded.
> 
> This loop has got ever tighter since linux 3.3, to where it's really as tight
> as a modern cpu scheduler can get it. (or so I keep thinking -
> but successive improvements in linux tcp keep proving me wrong. :)
> 
> I am really in awe of linux tcp these days. Recently I was benchmarking
> windows and macos. Windows only got 60% of the throughput linux tcp
> did at gigE speeds, and osx had a lot of issues at 10mbit and below
> (stretch acks and holding the window too high for the path).
> 
> I keep hoping better ethernet hardware will arrive that can mix flows
> even more.
> 
> >>
> >>
> >>
> >> That is, endpoint network stacks (TCP implementations) can worsen
> >> congestion by "dallying".  The ideal end-to-end flows occupying a
> >> congested router would have their packets paced so that the packets end
> >> up being sent in the least bursty manner that an application can
> >> support.  The effect of this pacing is to move the "backlog" for each
> >> flow quickly into the source node for that flow, which then provides
> >> back pressure on the application driving the flow, which ultimately is
> >> necessary to stanch congestion.  The ideal congestion control mechanism
> >> slows the sender part of the application to a pace that can go through
> >> the network without contributing to buffering.
> >
> >
> > Pacing is in Linux 3.12(?).  How long it will take to see widespread
> > deployment is another question, and as for other operating systems, who
> > knows.
> >
> > See: https://lwn.net/Articles/564978/
> 
> Steinar drove some of this with persistence and results...
> 
> http://www.linux-support.com/cms/steinar-h-gunderson-paced-tcp-and-the-fq-scheduler/
> 
> >>
> >>
> >>
> >> Current network stacks (including Linux's) don't achieve that goal -
> >> their pushback on application sources is minimal - instead they
> >> accumulate buffering internal to the network implementation.
> >
> >
> > This is much, much less true than it once was.  There have been substantial
> > changes in the Linux TCP stack in the last year or two, to avoid generating
> > packets before necessary.  Again, how long it will take for people to deploy
> > this on Linux (and implement on other OS's) is a question.
> 
> The data centers I'm in (linode, isc, google cloud) seem to be
> tracking modern kernels pretty good...
> 
> >>
> >> This contributes to end-to-end latency as well.  But if you think about
> >> it, this is almost as bad as switch-level bufferbloat in terms of
> >> degrading user experience.  The reason I say "almost" is that there are
> >> tools, rarely used in practice, that allow an application to specify
> >> that buffering should not build up in the network stack (in the kernel
> >> or wherever it is).  But the default is not to use those APIs, and to
> >> buffer way too much.
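
The message above does not name the APIs it has in mind. One knob of this general kind is the standard SO_SNDBUF socket option, which caps how much unsent data the kernel will hold for a socket; whether this is what the author meant is an assumption. (Linux 3.12 and later also offer TCP_NOTSENT_LOWAT, not shown here.) The helper name and the 16 KiB request are hypothetical.

```python
import socket


def make_small_sndbuf_socket(requested_bytes=16 * 1024):
    """Create a TCP socket and ask the kernel for a small send buffer.

    Returns (sock, effective_bytes). The kernel may round or clamp the
    request; Linux, for instance, reports double the requested value.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, requested_bytes)
    effective = sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
    return sock, effective


if __name__ == "__main__":
    sock, effective = make_small_sndbuf_socket()
    try:
        print(f"requested 16384 bytes, kernel reports {effective} bytes")
    finally:
        sock.close()
```

With a small send buffer, a blocking send() stalls the application sooner, which is the kind of back pressure the paragraph above asks for. It does not by itself pace packets on the wire.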
> >>
> >>
> >>
> >> Remember, the network send stack can act similarly to a congested switch
> >> (it is a switch among all the user applications running on that node).
> >> If there is a heavy file transfer, the file transfer's buffering acts to
> >> increase latency for all other networked communications on that machine.
> >>
> >>
> >>
> >> Traditionally this problem has been thought of only as a within-node
> >> fairness issue, but in fact it has a big effect on the switches in
> >> between source and destination due to the lack of dispersed pacing of
> >> the packets at the source - in other words, the current design does
> >> nothing to stem the "burst groups" from a single source mentioned above.
> >>
> >>
> >>
> >> So we do need the source nodes to implement less "bursty" sending
> >> stacks.  This is especially true for multiplexed source nodes, such as
> >> web servers implementing thousands of flows.
> >>
> >>
> >>
> >> A combination of codel-style switch-level buffer management and the
> >> stack at the sender being implemented to spread packets in a particular
> >> TCP flow out over time would improve things a lot.  To achieve best
> >> throughput, the optimal way to spread packets out on an end-to-end basis
> >> is to update the receive window (sending ACK) at the receive end as
> >> quickly as possible, and to respond to the updated receive window as
> >> quickly as possible when it increases.
> >>
> >>
> >>
> >> Just like the "bufferbloat" issue, the problem is caused by applications
> >> like streaming video, file transfers and big web pages that the
> >> application programmer sees as not having a latency requirement within
> >> the flow, so the application programmer does not have an incentive to
> >> control pacing.  Thus the operating system has got to push back on the
> >> applications' flow somehow, so that the flow ends up paced once it
> >> enters the Internet itself.  So there's no real problem caused by large
> >> buffering in the network stack at the endpoint, as long as the stack's
> >> delivery to the Internet is paced by some mechanism, e.g. tight
> >> management of receive window control on an end-to-end basis.
> >>
> >>
> >>
> >> I don't think this can be fixed by cerowrt, so this is out of place
> >> here.  It's partially ameliorated by cerowrt, if it aggressively drops
> >> packets from flows that burst without pacing. fq_codel does this, if the
> >> buffer size it aims for is small - but the problem is that the OS stacks
> >> don't respond by pacing... they tend to respond by bursting, not because
> >> TCP doesn't provide the mechanisms for pacing, but because the OS stack
> >> doesn't transmit as soon as it is allowed to - thus building up a burst
> >> unnecessarily.
> >>
> >>
> >>
> >> Bursts on a flow are thus bad in general.  They make congestion happen
> >> when it need not.
> >
> >
> > By far the biggest headache is what the Web does to the network.  It has
> > turned the web into a burst generator.
> >
> > A typical web page may have 10 (or even more) images.  See the "connections
> > per page" plot in the link below.
> >
> > A browser downloads the base page, and then, over N connections, essentially
> > simultaneously downloads those embedded objects.  Many/most of them are
> > small in size (4-10 packets).  You never even get near slow start.
> >
> > So you get an IW amount of data/TCP connection, with no pacing, and no
> > congestion avoidance.  It is easy to observe 50-100 packets (or more) back
> > to back at the bottleneck.
> >
> > This is (in practice) the amount you have to buffer today: that burst of
> > packets from a web page.  Without flow queuing, you are screwed.  With it,
> > it's annoying, but can be tolerated.
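
The 50-100 packet figure above follows from simple multiplication, sketched below. The 10 connections and 4-10 packet objects come from the text; the 1500-byte packet size, the 10 mbit bottleneck, and the function names are assumptions for illustration.

```python
def burst_packets(connections, packets_per_object, initial_window=10):
    """Packets sent back to back when every object fits inside the IW."""
    return connections * min(packets_per_object, initial_window)


def drain_time_ms(packets, link_bps, packet_bytes=1500):
    """Time for the bottleneck link to forward the whole burst."""
    return packets * packet_bytes * 8 / link_bps * 1000


if __name__ == "__main__":
    for per_object in (5, 10):
        burst = burst_packets(connections=10, packets_per_object=per_object)
        delay = drain_time_ms(burst, 10e6)  # assumed 10 mbit bottleneck
        print(f"10 connections x {per_object} packets = {burst} packets, "
              f"{delay:.0f} ms to drain at 10 mbit")
```

Without flow queuing, a packet from any other flow that arrives behind such a burst waits for the whole drain time.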
> >
> >
> > I go over this in detail in:
> >
> >
> > http://gettys.wordpress.com/2013/07/10/low-latency-requires-smart-queuing-traditional-aqm-is-not-enough/
> >
> > So far, I don't believe anyone has tried pacing the IW burst of packets.
> > I'd certainly like to see that, but pacing needs to be across TCP
> > connections (host pairs) to be possibly effective to outwit the gaming the
> > web has done to the network.
> >                                                                  - Jim
> >
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> On Sunday, May 25, 2014 11:42am, "Mikael Abrahamsson"
> >> <swmike at swm.pp.se> said:
> >>
> >> > On Sun, 25 May 2014, Dane Medic wrote:
> >> >
> >> > > Is it true that devices with less than 64 MB can't handle QOS? ->
> >> > >
> >> > > https://lists.chambana.net/pipermail/commotion-dev/2014-May/001816.html
> >> >
> >> > At gig speeds you need around 50ms worth of buffering. 1 gigabit/s =
> >> > 125 megabyte/s meaning for 50ms you need 6.25 megabyte of buffer.
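
The buffer-size arithmetic above, written out as a short sketch. The function name is hypothetical; integer math is used so the result is exact.

```python
def buffer_bytes(link_bps, buffering_ms):
    """Bytes needed to hold `buffering_ms` of traffic at `link_bps`."""
    # bits/s * ms / 1000 gives bits; dividing by 8 gives bytes.
    return link_bps * buffering_ms // 8000


if __name__ == "__main__":
    needed = buffer_bytes(1_000_000_000, 50)
    print(f"50 ms at 1 gigabit/s = {needed / 1e6:.2f} megabyte")
    print("fits in a 64 MB device:", needed < 64 * 1000 * 1000)
```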
> >> >
> >> > I also don't see why performance and memory size would be relevant, I'd
> >> > say forwarding performance has more to do with CPU speed than anything
> >> > else.
> >> >
> >> > --
> >> > Mikael Abrahamsson email: swmike at swm.pp.se
> >> > _______________________________________________
> >> > Cerowrt-devel mailing list
> >> > Cerowrt-devel at lists.bufferbloat.net
> >> > https://lists.bufferbloat.net/listinfo/cerowrt-devel
> >> >
> >>
> >>
> >>
> >
> >
> >
> 
> 
> 
> --
> Dave Täht
> 
> NSFW:
> https://w2.eff.org/Censorship/Internet_censorship_bills/russell_0296_indecent.article
>