On Fri, Nov 23, 2012 at 09:57:34AM +0100, Dave Taht wrote:
> David Woodhouse and I fiddled a lot with adsl and openwrt and a
> variety of drivers and network layers in a typical bonded adsl stack
> yesterday. The complexity of it all makes my head hurt. I'm happy that
> a newly BQL'd ethernet driver (for the geos and qemu) emerged from it,
> which he submitted to netdev...

Cool!!!  ;-)

> I made a recording of us last night discussing the layers, which I
> will produce and distribute later...
>
> Anyway, along the way, we fiddled a lot with trying to analyze where
> the 350ms or so of added latency was coming from in the traverse geo's
> adsl implementation and overlying stack....
>
> Plots: http://david.woodhou.se/dwmw2-netperf-plots.tar.gz
>
> Note 1:
>
> The netperf sample rate on the rrul test needs to be higher than
> 100ms in order to get a decent result at sub 10Mbit speeds.
>
> Note 2:
>
> The two nicest graphs here are nofq.svg vs fq.svg, which were taken on
> a gigE link from a Mac running Linux to another gigE link. (in other
> words, NOT on the friggin adsl link) (firefox can display svg, I don't
> know what else) I find the T+10 delay before stream start in the
> fq.svg graph suspicious and think the "throw out the outlier" code in
> the netperf-wrapper code is at fault. Prior to that, codel is merely
> buffering up things madly, which can also be seen in the pfifo_fast
> behavior, with its 1000-packet default.

I am using these two in a new "Effectiveness of FQ-CoDel" section.
Chrome can display .svg, and if it becomes a problem, I am sure that
they can be converted.  Please let me know if some other data would
make the point better.

I am assuming that the colored throughput spikes are due to occasional
packet losses.  Please let me know if this interpretation is overly
naive.

Also, I know what ICMP is, but the UDP variants are new to me.  Could
you please expand the "EF", "BK", "BE", and "CS5" acronyms?

> (Arguably, the default queue length in codel can be reduced from 10k
> packets to something more reasonable at GigE speeds)
>
> (the indicator that it's the graph, not the reality, is that the
> fq.svg pings and udp start at T+5 and grow minimally, as is usual with
> fq_codel.)

All sessions were started at T+5, then?

> As for the *.ps graphs, well, they would take david's network topology
> to explain, and were conducted over a variety of circumstances,
> including wifi, with more variables in play than I care to think
> about.
>
> We didn't really get anywhere on digging deeper. As we got to purer
> tests - with a minimal number of boxes, running pure ethernet,
> switched over a couple of switches, even in the simplest two-box case,
> my HTB-based "ceroshaper" implementation had multiple problems in
> cutting median latencies below 100ms, on this very slow ADSL link.
> David suspects problems on the path along the carrier backbone as a
> potential issue, and the only way to measure that is with two one-way
> trip time measurements (rather than rtt), time synced via ntp... I
> keep hoping to find an rtp test, but I'm open to just about any option
> at this point. anyone?
>
> We also found a probable bug in mtr in that multiple mtrs on the same
> box don't co-exist.

I must confess that I am not seeing all that clear a difference between
the behaviors of ceroshaper and FQ-CoDel.  Maybe somewhat better
latencies for FQ-CoDel, but not unambiguously so.

> Moving back to more scientific clarity and simpler tests...
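Returning to my acronym question above: my current guess (which needs
checking) is that these are diffserv code points, so EF = Expedited
Forwarding, BK = Background (class selector CS1), BE = Best Effort, and
CS5 = Class Selector 5.  If so, they can be poked at from a shell with
ping's -Q option, which sets the TOS byte.  The values below assume the
standard DSCP-shifted-left-two-bits encoding and are untested on my
part:

	ping -Q 0x00 $HOST	# BE:  DSCP 0
	ping -Q 0x20 $HOST	# BK:  CS1, DSCP 8
	ping -Q 0xa0 $HOST	# CS5: DSCP 40
	ping -Q 0xb8 $HOST	# EF:  DSCP 46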
> The two graphs, taken a few weeks back, on pages 5 and 6 of this:
>
> http://www.teklibre.com/~d/bloat/Not_every_packet_is_sacred-Battling_Bufferbloat_on_wifi.pdf
>
> appear to show the advantage of fq_codel (fq + codel + head drop) over
> tail drop during the slow start period on a 10Mbit link - (see how
> squiggly slow start is on pfifo_fast?) as well as the marvelous
> interstream latency that can be achieved with BQL=3000 (on a 10 mbit
> link.) Even that latency can be halved by reducing BQL to 1500, which
> is just fine on a 10mbit. Below those rates I'd like to be rid of BQL
> entirely, and just have a single packet outstanding... in everything
> from adsl to cable...
>
> That said, I'd welcome other explanations of the squiggly slowstart
> pfifo_fast behavior before I put that explanation on the slide.... ECN
> was in play here, too. I can redo this test easily, it's basically
> running a netperf TCP_RR for 70 seconds, and starting up a TCP_MAERTS
> and TCP_STREAM for 60 seconds at T+5, after hammering down on BQL's
> limit and the link speeds on two sides of a directly connected laptop
> connection.

I must defer to others on this one.  I do note the much lower latencies
on slide 6 compared to slide 5, though.

Please see attached for update including .git directory.

							Thanx, Paul

> ethtool -s eth0 advertise 0x002 # 10 Mbit
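For when I get around to reproducing the 10Mbit test above, here is my
reading of the recipe as a sketch, combining that ethtool line with the
BQL limit and the netperf runs you describe.  The eth0 interface name,
the $SERVER address, and the exact BQL sysfs knob are assumptions on my
part, so please correct me if I have it wrong:

	# Advertise only 10baseT-Full, pinning the link at 10 Mbit.
	ethtool -s eth0 advertise 0x002
	# Hammer down BQL's limit (assuming transmit queue 0 on eth0).
	echo 3000 > /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_max
	# 70-second latency measurement, with one bulk transfer in each
	# direction starting at T+5 and running for 60 seconds.
	netperf -H $SERVER -t TCP_RR -l 70 &
	sleep 5
	netperf -H $SERVER -t TCP_MAERTS -l 60 &
	netperf -H $SERVER -t TCP_STREAM -l 60 &
	wait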