On Fri, Nov 23, 2012 at 09:57:34AM +0100, Dave Taht wrote:
> David Woodhouse and I fiddled a lot with adsl and openwrt and a
> variety of drivers and network layers in a typical bonded adsl stack
> yesterday. The complexity of it all makes my head hurt. I'm happy that
> a newly BQL'd ethernet driver (for the geos and qemu) emerged from it,
> which he submitted to netdev...

Cool!!!  ;-)

> I made a recording of us last night discussing the layers, which I
> will produce and distribute later...
>
> Anyway, along the way, we fiddled a lot with trying to analyze where
> the 350ms or so of added latency was coming from in the traverse geo's
> adsl implementation and overlying stack....
>
> Plots: http://david.woodhou.se/dwmw2-netperf-plots.tar.gz
>
> Note 1:
>
> The netperf sample rate on the rrul test needs to be higher than
> 100ms in order to get a decent result at sub 10Mbit speeds.
>
> Note 2:
>
> The two nicest graphs here are nofq.svg vs fq.svg, which were taken on
> a gigE link from a Mac running Linux to another gigE link. (in other
> words, NOT on the friggin adsl link) (firefox can display svg, I don't
> know what else) I find the T+10 delay before stream start in the
> fq.svg graph suspicious and think the "throw out the outlier" code in
> the netperf-wrapper code is at fault. Prior to that, codel is merely
> buffering up things madly, which can also be seen in the pfifo_fast
> behavior, with its 1000-packet default.

I am using these two in a new "Effectiveness of FQ-CoDel" section.
Chrome can display .svg, and if it becomes a problem, I am sure that
they can be converted.  Please let me know if some other data would
make the point better.

I am assuming that the colored throughput spikes are due to occasional
packet losses.  Please let me know if this interpretation is overly
naive.

Also, I know what ICMP is, but the UDP variants are new to me.  Could
you please expand the "EF", "BK", "BE", and "CS5" acronyms?

> (Arguably, the default queue length in codel can be reduced from 10k
> packets to something more reasonable at GigE speeds)
>
> (the indicator that it's the graph, not the reality, is that the
> fq.svg pings and udp start at T+5 and grow minimally, as is usual with
> fq_codel.)

All sessions were started at T+5, then?

> As for the *.ps graphs, well, they would take david's network topology
> to explain, and were conducted over a variety of circumstances,
> including wifi, with more variables in play than I care to think
> about.
>
> We didn't really get anywhere on digging deeper. As we got to purer
> tests - with a minimal number of boxes, running pure ethernet,
> switched over a couple of switches, even in the simplest two-box case,
> my HTB-based "ceroshaper" implementation had multiple problems in
> cutting median latencies below 100ms, on this very slow ADSL link.
> David suspects problems on the path along the carrier backbone as a
> potential issue, and the only way to measure that is with two one-way
> trip time measurements (rather than rtt), time synced via ntp... I
> keep hoping to find an rtp test, but I'm open to just about any option
> at this point. anyone?
>
> We also found a probable bug in mtr in that multiple mtrs on the same
> box don't co-exist.

I must confess that I am not seeing all that clear a difference between
the behaviors of ceroshaper and FQ-CoDel.  Maybe somewhat better
latencies for FQ-CoDel, but not unambiguously so.

> Moving back to more scientific clarity and simpler tests...
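Returning to my acronym question above: my current guess (which needs
checking) is that these are diffserv code points, so EF = Expedited
Forwarding, BK = Background (class selector CS1), BE = Best Effort, and
CS5 = Class Selector 5.  If so, they can be poked at from a shell with
ping's -Q option, which sets the TOS byte.  The values below assume the
standard DSCP-shifted-left-two-bits encoding and are untested on my
part:

	ping -Q 0x00 $HOST	# BE:  DSCP 0
	ping -Q 0x20 $HOST	# BK:  CS1, DSCP 8
	ping -Q 0xa0 $HOST	# CS5: DSCP 40
	ping -Q 0xb8 $HOST	# EF:  DSCP 46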
> The two graphs, taken a few weeks back, on pages 5 and 6 of this:
>
> http://www.teklibre.com/~d/bloat/Not_every_packet_is_sacred-Battling_Bufferbloat_on_wifi.pdf
>
> appear to show the advantage of fq_codel (fq + codel + head drop) over
> tail drop during the slow start period on a 10Mbit link - (see how
> squiggly slow start is on pfifo_fast?) as well as the marvelous
> interstream latency that can be achieved with BQL=3000 (on a 10 mbit
> link.) Even that latency can be halved by reducing BQL to 1500, which
> is just fine on a 10mbit. Below those rates I'd like to be rid of BQL
> entirely, and just have a single packet outstanding... in everything
> from adsl to cable...
>
> That said, I'd welcome other explanations of the squiggly slowstart
> pfifo_fast behavior before I put that explanation on the slide.... ECN
> was in play here, too. I can redo this test easily, it's basically
> running a netperf TCP_RR for 70 seconds, and starting up a TCP_MAERTS
> and TCP_STREAM for 60 seconds at T+5, after hammering down on BQL's
> limit and the link speeds on two sides of a directly connected laptop
> connection.

I must defer to others on this one.  I do note the much lower latencies
on slide 6 compared to slide 5, though.

Please see attached for update including .git directory.

							Thanx, Paul

> ethtool -s eth0 advertise 0x002 # 10 Mbit
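For when I get around to reproducing the 10Mbit test above, here is my
reading of the recipe as a sketch, combining that ethtool line with the
BQL limit and the netperf runs you describe.  The eth0 interface name,
the $SERVER address, and the exact BQL sysfs knob are assumptions on my
part, so please correct me if I have it wrong:

	# Advertise only 10baseT-Full, pinning the link at 10 Mbit.
	ethtool -s eth0 advertise 0x002
	# Hammer down BQL's limit (assuming transmit queue 0 on eth0).
	echo 3000 > /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_max
	# 70-second latency measurement, with one bulk transfer in each
	# direction starting at T+5 and running for 60 seconds.
	netperf -H $SERVER -t TCP_RR -l 70 &
	sleep 5
	netperf -H $SERVER -t TCP_MAERTS -l 60 &
	netperf -H $SERVER -t TCP_STREAM -l 60 &
	wait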