[Cerowrt-devel] [Bloat] Network tests as discussed in Washington, DC

Sun Nov 11 08:39:05 EST 2012

On Sun, Nov 11, 2012 at 9:40 AM, Daniel Berger <dberger at student.ethz.ch> wrote:
> Hi everybody,
>
> I totally love the idea to test for browsing performance. Thanks for
> that ;-)

Jim's demos of the effect of network load on the performance of web
sites are quite revealing,

http://gettys.wordpress.com/2012/02/01/bufferbloat-demonstration-videos/

using the chrome web page benchmark available here:

https://chrome.google.com/webstore/detail/page-benchmarker/channimfdomahekjcahlbpccbgaopjll

You can fairly easily replicate his results on your own hardware, both
locally and over the internet. Go for it!

However, in attempting to get to a general-purpose test, the
simplicity of his demo (which used a very short path to MIT) didn't
work well, so I came up with the methods described in the RRUL
document. They seem to scale fairly well up past 60ms RTT. More
testers would be nice!

One of the things that really bugs me about today's overbuffered
networks is doing something like a file upload via scp, which nearly
completes, then stops, and retransmits, over and over again, like Jon
Corbet's example of what happened to him at a conference hotel last
year.

http://lwn.net/Articles/496509/

> Nevertheless, I have another critical question on this 40s network test
> idea:
> Did someone consider the robustness of the results? That is, did sb
> check for statistical significance?

Presently the effects on multiple sorts of networks are interesting.
As one example, here is a run of one rrul prototype on wired and wifi
that Toke put together:

http://www.teklibre.com/~d/bloat/rrul-denmark-germany-wired-pfifo-fast.pdf

vs

http://www.teklibre.com/~d/bloat/rrul-denmark-germany-wlan2.pdf

I LOVE the first graph (configured for pfifo_fast on the gateways) as
it clearly shows classic drop tail "TCP global synchronization" on the
egress gateway, and the resulting loss of utilization. It's nice to
have been able to get it on a 50+ms *real-world* path.

It also shows how traffic classification of TCP doesn't work very
well across the internet, as the TCP flows, classified in different
ways, evolve and change places.

The second (taken on a good wifi) shows how noisy the data is:

http://www.teklibre.com/~d/bloat/rrul-denmark-germany-wlan2.pdf

(I note that using a TCP "ping" is a bad idea except for showing why
tcp encapsulated inside tcp is a bad idea, which gets progressively
worse at longer RTTs. Anyone have a decent RTP test we can replace
this with?)

A graph taken against a loaded wifi network is pretty horrifying...

http://www.teklibre.com/~d/bloat/Not_every_packet_is_sacred-Battling_Bufferbloat_on_wifi.pdf

(don't look. Halloween is over)

I have a ton of interesting statistics gathered at IETF and LinuxCon
this past week, but finding good ways to present them remains a
problem. I also note that most of the stuff above is intended as a
BACKGROUND process: loading web pages and doing useful work like
making phone calls under that load is the real intended result of the
benchmark.

So, no, the only statistical significance so far calculated is that
tests like this can cause a network to have one to three orders of
magnitude of latency inserted into it.  Compared to that, I'm not
terribly concerned with a few percentage points here or there, at this
time, but I'd welcome analysis.

The biggest unknowns in the test are the optimal TCP ack count, and
TCP's response to packet loss (retransmits), which could account for
a great deal of the actual data transmitted, vs the amount of useful
data transmitted.

"useful data transmitted under no load and under load" would be
a tremendously useful statistic.
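
A minimal sketch of that statistic, assuming you can get two byte
counts per run: everything put on the wire (including retransmits),
and the payload the receiver actually got. The function name and the
numbers are made up for illustration; nothing here comes from the
rrul prototype.

```python
def useful_fraction(bytes_on_wire, bytes_delivered):
    """Fraction of transmitted bytes that ended up as useful payload."""
    if bytes_on_wire <= 0:
        raise ValueError("bytes_on_wire must be positive")
    if bytes_delivered > bytes_on_wire:
        raise ValueError("cannot deliver more than was transmitted")
    return bytes_delivered / bytes_on_wire

# Hypothetical counters for the same 10 MB transfer, idle vs loaded link.
no_load = useful_fraction(10_500_000, 10_000_000)
under_load = useful_fraction(14_000_000, 10_000_000)
print(f"useful data: {no_load:.1%} unloaded, {under_load:.1%} under load")
```

Comparing the two fractions shows how much of the loaded transfer was
spent on retransmits and overhead rather than useful data.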

It is my hope that the volume of web and DNS traffic projected to be
in the test is going to be fairly minimal compared to the rest of it,
but I'm not counting on it. That needs to be measured too.

It's a pretty big project to do this up right in other words!

> I currently see that there are two steps:
> First, the test with few load, which shows (I guess) low jitter/variance.
> Second, busy queues.
>  This second "phase" is probably when jitter/variance will inflate a
> lot, right?
>  Then, also the mean (and most other statistical summary-measures) won't
> be stable.

Correct.

>   Thus, I doubt that in order to compute an aggregate "score" we can
> rely on this, in all cases.

The "score" as a ratio of various measured parameters from unloaded to
loaded seems viable.
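
As a rough sketch of what one such ratio might look like: the function
name, the choice of the median as the summary, and the sample numbers
below are all illustrative assumptions, not part of the rrul
prototype.

```python
import math
import statistics

def latency_score(unloaded_ms, loaded_ms):
    """Return (ratio, orders_of_magnitude) of loaded vs unloaded latency.

    Uses the median of each sample set so a few outliers do not dominate.
    """
    base = statistics.median(unloaded_ms)
    under_load = statistics.median(loaded_ms)
    if base <= 0 or under_load <= 0:
        raise ValueError("latencies must be positive")
    ratio = under_load / base
    return ratio, math.log10(ratio)

# Made-up RTT samples in milliseconds, for illustration only.
idle = [48.0, 50.0, 52.0, 50.0, 49.0]
busy = [480.0, 510.0, 500.0, 2000.0, 495.0]
ratio, orders = latency_score(idle, busy)
print(f"loaded/unloaded = {ratio:.1f}x ({orders:.1f} orders of magnitude)")
```

Reporting the log10 of the ratio keeps the number readable when the
inflation spans one to three orders of magnitude.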

> Obviously the best solution would be to run the test long enough so that
> confidence intervals appear to be small and similar for both steps.

There is nothing stopping a network engineer, device driver writer, or
device maker, or mathematician or network queue theorist or sysadmin,
or manager or concerned citizen...

from running the test continuously, going from unloaded, to load, to
unload, to load, and tweaking various underlying variables in the
network stack and path. I do this all the time!

It is my hope, certainly, that those that should do so, will do so. A
core component IS the "mtr" tool which will point at the issues on the
bottleneck link, which might be anything from the local OS, or device,
to wireless ap, to cpe, to somewhere else on the path. Giving the end
user data with (occasionally) something other than their ISP to blame
would be a goodness, and having tools available to find and fix it,
even better.

However, the average citizen is not going to sit still for 60 seconds
on a regular basis, which is the purpose of trying to come up with a
score and panel of useful results that can be presented briefly and
clearly.

I also have hope that a test as robust and stressful as this can be
run on edge gateways automatically, in the background, on selected
routers throughout the world, much as bismark already does. See
examples at:

http://networkdashboard.org/

> Probably is not feasible to expand the test into unusual long intervals
> but at least computing a 95% confidence interval would give me a better
> sense of results.

Go for it!
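
For anyone who wants a starting point, here is a minimal sketch of one
way to get a 95% interval without assuming the samples are normally
distributed (they won't be, under load): a percentile bootstrap of the
median. The function name, parameters, and RTT samples are assumptions
made up for illustration.

```python
import random
import statistics

def bootstrap_ci(samples, stat=statistics.median, n_resamples=2000,
                 confidence=0.95, seed=0):
    """Percentile-bootstrap confidence interval for a statistic of samples."""
    if not samples:
        raise ValueError("need at least one sample")
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    n = len(samples)
    # Resample with replacement, recompute the statistic each time.
    estimates = sorted(
        stat(rng.choices(samples, k=n)) for _ in range(n_resamples)
    )
    tail = (1.0 - confidence) / 2.0
    lo = estimates[int(tail * n_resamples)]
    hi = estimates[min(n_resamples - 1, int((1.0 - tail) * n_resamples))]
    return lo, hi

# Made-up RTT samples (ms) from a loaded phase, for illustration only.
rtts = [180.0, 220.0, 950.0, 210.0, 400.0, 1900.0, 230.0, 205.0, 610.0, 240.0]
low, high = bootstrap_ci(rtts)
print(f"median = {statistics.median(rtts):.0f} ms, "
      f"95% CI = [{low:.0f}, {high:.0f}] ms")
```

Running the same function on the unloaded and loaded phases separately
would show how much wider the interval gets once the queues fill.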

"Bufferbloat.net: Making network research fun since 2011!"

I note that the RRUL work being done right now is the spare-time
project of myself and one grad student, leveraging the hard work that
so many have put into the Linux OS over the last year, and the
multitude of useful enhancements, like classification, priority, and
congestion control algorithm selection, that Rick Jones has put into
netperf over the past year, also in his spare time.

No funding for this work has yet arrived. Various proposals for grants
have been ignored, but we're not grant writing experts.

Cerowrt is getting some lovely support from interested users, but the
size of the task to get code written, analyzed, and tests deployed is
intimidating.

There is a wealth of other tests that can be performed while under an
RRUL-like load. For example, this December I'll be at the CoNEXT
conference in Nice, with some early results from the lincs.fr lab
regarding the interactions of AQM and LEDBAT. I hope to be doing some
follow-up work on that paper, also in December, against codel and
fq_codel, and more realistic representations of uTP.

An RRUL-like test would be useful for analyzing and comparing the
results from any congestion control algorithm, alone or in
combination, such as TCP-LP, or DCTCP, or (as one potentially very
interesting example) the latest work done at MIT on their TCP, whose
name I forget right now.

I am very interested in how video sharding technologies work. What
often happens there is an HTTP GET of a 10-second shard of video,
available at various rates. The client measures the delivery time of
that 10-second shard and raises or lowers the rate of the next GET to
suit.

This generally pushes TCP into slow start, repeatedly, and slams the
downstream portion of the network, repeatedly.
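
A minimal sketch of that client-side loop, assuming a simple
"throughput times a safety factor" rule. Real players use their own,
more elaborate heuristics; the function name, the rate ladder, and the
0.8 safety factor are hypothetical.

```python
def next_bitrate(ladder_kbps, current_kbps, chunk_seconds, fetch_seconds,
                 safety=0.8):
    """Pick the bitrate for the next shard from the last shard's fetch time.

    Throughput is estimated as (bits in the shard) / (time to fetch it);
    the highest ladder rate within a safety fraction of that is chosen,
    falling back to the lowest rate if none fits.
    """
    if fetch_seconds <= 0:
        raise ValueError("fetch time must be positive")
    measured_kbps = current_kbps * chunk_seconds / fetch_seconds
    usable = [r for r in sorted(ladder_kbps) if r <= safety * measured_kbps]
    return usable[-1] if usable else min(ladder_kbps)

ladder = [400, 800, 1500, 3000, 6000]  # hypothetical encoding rates, kbit/s
print(next_bitrate(ladder, 1500, 10.0, 3.0))   # shard arrived quickly: step up
print(next_bitrate(ladder, 1500, 10.0, 14.0))  # shard arrived slowly: step down
```

The sketch only models the rate decision, not what TCP does between
fetches, which is the part that hurts the network.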

Then there's videoconferencing. Which I care about a lot. I like it
when people's lips match up with what they are saying, being partially
deaf, myself.

And gaming. I'd like very much to have a better picture (packet
captures!) of how various online games such as Quake, StarCraft, and
World of Warcraft interact with the network.

(I think this last item would be rather fun for a team of grad
students to take on. Heck, I'd enjoy "working" on this portion of the
problem, too. :) )

> Doing this might also be a means to account for a broad variety of
> testing/real-world environment and still get reliable results.

I would argue that settling on a clear definition of the tests,
writing the code, and collecting a large set of data would be
"interesting". As for being able to draw general conclusions from it,
I generally propose that we prototype tests, and iterate, going deeply
into packet captures, until we get things that make sense in the lab
and in the field...

and rapidly bug report everything that is found.

A great number of pathological behaviors we've discovered so far have
turned out to be bugs at various levels in various stacks. It's
generally been rather difficult to get to a "paper-writing stage";
the way my life seems to work looks like this:

> Anyone else with this thought?

An example of how you can fool yourself with network statistics, misapplied:

https://lists.bufferbloat.net/pipermail/bloat/2011-November/000715.html

Frank Rowand gave a very good (heretical!) presentation on core
analysis and presentation ideas at last week's LinuxCon, particularly
when it comes to analyzing the real-time performance of anything.

I don't know if it's up yet.

I have generally found that mountain and CDF plots are the best ways
to deal with the extremely noisy data collected from wifi and over
the open internet, and that having packet captures and TCP
instrumentation is also useful.
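
As a small illustration of why a CDF helps with noisy samples, here is
a sketch that builds an empirical CDF and reads percentiles off it.
The helper names and the RTT samples are invented for the example;
feed the (value, fraction) pairs to any plotting tool to get the
actual plot.

```python
import statistics

def empirical_cdf(samples):
    """Return one (value, cumulative fraction) point per sorted sample.

    For tied values, the last point of the tie carries the true CDF value.
    """
    ordered = sorted(samples)
    n = len(ordered)
    return [(v, (i + 1) / n) for i, v in enumerate(ordered)]

def percentile(samples, fraction):
    """Smallest sample whose cumulative fraction is >= fraction."""
    for value, cum in empirical_cdf(samples):
        if cum >= fraction:
            return value
    raise ValueError("fraction must be in (0, 1] and samples non-empty")

# Made-up RTT samples (ms) with a long tail, for illustration only.
rtts = [20, 21, 22, 23, 25, 30, 45, 120, 400, 1500]
print(f"mean   = {statistics.mean(rtts):.1f} ms")
print(f"median = {percentile(rtts, 0.5)} ms")
print(f"95th   = {percentile(rtts, 0.95)} ms")
```

With a tail like this the mean lands far from what most packets saw,
while the CDF shows both the typical case and the tail at once.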

-- 
Dave Täht

Fixing bufferbloat with cerowrt: http://www.teklibre.com/cerowrt/subscribe.html