From: Sebastian Moeller
To: David Lang
Cc: cerowrt-devel@lists.bufferbloat.net
Date: Sun, 27 Jul 2014 13:17:16 +0200
Subject: Re: [Cerowrt-devel] Ideas on how to simplify and popularize bufferbloat control for consideration.
Message-Id: <7539AE21-DB18-45AF-B1F7-0F502B575867@gmx.de>

Hi David,

On Jul 27, 2014, at 02:49 , David Lang wrote:

> On Sun, 27 Jul 2014, Sebastian Moeller wrote:
>
>> On Jul 27, 2014, at 00:53 , David Lang wrote:
>>
>>> On Sun, 27 Jul 2014, Sebastian Moeller wrote:
>>>
>>>> Hi David,
>>>>
>>>> On Jul 26, 2014, at 23:45 , David Lang wrote:
>>>>
>>>>> On Sat, 26 Jul 2014, Sebastian Moeller wrote:
>>>>>
>>>>>> On Jul 26, 2014, at 22:39 , David Lang wrote:
>>>>>>
>>>>>>> By how much tuning is required, I wasn't meaning how frequently to tune, but how close default settings can come to the performance of an expertly tuned setup.
>>>>>>
>>>>>> Good question.
>>>>>>
>>>>>>> Ideally the tuning takes into account the characteristics of the hardware of the link layer. If it's IP encapsulated in something else (ATM, PPPoE, VPN, VLAN tagging, ethernet with jumbo packet support, for example), then you have overhead from the encapsulation that you would ideally take into account when tuning things.
>>>>>>>
>>>>>>> The question I'm talking about below is how much do you lose compared to the ideal if you ignore this sort of thing and just assume that the wire is dumb and puts the bits on it as you send them? By dumb I mean don't even allow for inter-packet gaps, don't measure the bandwidth, don't try to pace inbound connections by the timing of your acks, etc. Just run BQL and fq_codel and start the BQL sizes based on the wire speed of your link (Gig-E on the 3800) and shrink them based on long-term passive observation of the sender.
>>>>>>
>>>>>> As data talks, I just did a quick experiment with my ADSL2+ line at home.
>>>>>> The solid lines in the attached plot show the results for proper shaping with SQM (shaping to 95% of the link rates of downstream and upstream while taking the link layer properties, that is ATM encapsulation and per-packet overhead, into account); the broken lines show the same system with the link layer and per-packet overhead adjustments disabled, but still shaping to 95% of link rate (this is roughly equivalent to a 15% underestimation of the packet size). The actual test is netperf-wrapper's RRUL (4 TCP streams up, 4 TCP streams down, while measuring latency with ping and UDP probes). As you can see from the plot, just getting the link layer encapsulation wrong destroys latency under load badly. The host is ~52ms RTT away, and with fq_codel the ping time per leg is increased by just one codel target of 5ms each, resulting in a modest latency increase of ~10ms with proper shaping for a total of ~65ms; with improper shaping RTTs increase to ~95ms (they almost double), so RTT increases by ~43ms. Also note how the extremes for the broken lines are much worse than for the solid lines. In short, I would estimate that a slight misjudgment (15%) results in an almost 80% increase of latency under load. In other words, getting the rates right matters a lot. (I should also note that in my setup there is a secondary router that limits RTT to max 300ms, otherwise the broken lines might look even worse...)
>>>>>
>>>>> what is the latency like without BQL and codel? the pre-bufferbloat version? (without any traffic shaping)
>>>>
>>>> So I just disabled SQM and the plot looks almost exactly like the broken-line plot I sent before (~95ms RTT up from 55ms unloaded, with single pings delayed for > 1000ms, just as with the broken lines; with proper shaping even extreme pings stay < 100ms). But as I said before, I need to run through my ISP-supplied primary router (not just a dumb modem) that also tries to bound the latencies under load to some degree. Actually, I just repeated the test connected directly to the primary router and get the same ~95ms average ping time with frequent extremes > 1000ms, so it looks like just getting the shaping wrong by 15% eradicates the buffer de-bloating efforts completely...
>>>
>>> just so I understand this completely
>>>
>>> you have
>>>
>>> debloated box <-> ISP router <-> ADSL <-> Internet <-> debloated server?
>>
>> Well, more like:
>>
>> Macbook with dubious bloat-state -> wifi to de-bloated cerowrt box that shapes the traffic -> ISP router -> ADSL -> internet -> server
>>
>> I assume that Dave de-bloated these servers well, but it should not really matter, as the problem is the buffers on both ends of the bottleneck ADSL link.
>
> right, I was forgetting that unless you are the bottleneck, you aren't buffering anything and so debloating makes no difference. In a case like yours where you can't debloat the actual bottleneck, the best that you can do is to artificially become the bottleneck by shaping the traffic. but on the download side it's much harder.

Actually, all RRUL plots that Dave collected show that ingress shaping does work quite well on average. It will fail with a severe DOS, but let's face it, these can only be mitigated by the ISP anyway...

> What are we aiming for? something that will show the problem clearly so that fixes can be put in the right place? or a work-around to use in the meantime?
Mmmh, I aim for decent internet connections for home users like myself. It would be great if ISPs could use their leverage on equipment manufacturers to implement the current state-of-the-art solutions in broadband gear; realistically, even if this started today we would still face a long transition time, so I am all for putting the smarts into home routers. At least the end user has enough incentive to put in the (small amount of) work required to mitigate bad buffer management...

> I think both need to be pursued, but we need to be clear on what is being done for each one.

I have no connections into telcos, ISPs, nor OEMs, so all I can help with is getting the "work-around" in good shape and ready for deployment. Arguably convincing ISPs might be more important.

> If having BQL+fq_codel with defaults would solve the problem if it was on the right routers, we need to show that.

I think Dave has pretty much shown this. Note though that it is rather traffic shaping and fq_codel; BQL would be needed in the DSL drivers on both sides of the link.

> Then, because we can't get the fixes on the right routers and need to work-around the problem by artificially becoming the bottleneck, we need to show that the 95% that we shape to is throwing away 5% of your capacity and make that clear to the users.

I think if you google for "router qos" you will find plenty of pages already describing the rationale and the bandwidth sacrifice required, so that knowledge might already be public.

> otherwise we will risk getting to the point where it will never get fixed because the ISPs will look at their routers and say that bufferbloat can't possibly be a problem as they never have large queues (because we are doing the workarounds).

Honestly, for an ISP the best solution is us shaping our connections, as that reduces the worst-case bandwidth use per user and might allow higher oversubscription. We need to find economic incentives for ISPs to implement BQL equivalents in the broadband gear. In theory it should give a competitive advantage to be able to advertise better gaming/voip suitability, but many users really have no real choice of ISP. I could imagine that with the big push away from circuit-switched telephony to voip even for carriers, ISPs might get more interested in improving voip resilience and usability under load...

>>> and are you measuring the latency impact when uploading or downloading?
>>
>> No, I measure the latency impact of saturating both up- and downlink, pretty much the worst-case scenario.
>
> I think we need to test this in each direction independently.

Rich Brown has made a nice script to test that, betterspeedtest.sh, at https://github.com/richb-hanover/CeroWrtScripts
For figuring out the required shaping point it is easier to work on both "legs" independently, but to assess worst-case behavior I think both directions need to be saturated.

There is a pretty good description of a quick bufferbloat test at http://www.bufferbloat.net/projects/cerowrt/wiki/Quick_Test_for_Bufferbloat

> Cerowrt can do a pretty good job of keeping the uplink from being saturated, but it can't do a lot for the downlink.

Well, except it does. Downlink shaping is less reliable than uplink shaping. Most traffic sources, TCP or UDP, actually need to deal with the variable bandwidth of the internet anyway and implement some congestion control that treats packet loss as a congestion signal. So the downlink shaping mostly works okay (even though I think Dave recommends shaping the downlink more aggressively than to 95% of link rate).

>>> I think a lot of people would be happy with 95ms average pings on a loaded connection, even with occasional outliers.
>>
>> No, that is too low an aim; this still is not usable for real-time applications, we should aim for base RTT plus 10ms. (For very slow links we need to cut some slack, but for > 3Mbps 10ms should be achievable.)
>
> perfect is the enemy of good enough.

Sure, but really, according to http://www.hh.se/download/18.70cf2e49129168da015800094780/7_7_delay.pdf we only have a 400ms budget for acceptable voip (I would love real psychophysics papers for that instead of cisco marketing material), or 200ms one-way delay. With ~170ms RTT to the west coast (from a university wired network, so no ADSL delay involved) almost half of the budget is used up in a way that can not be fixed easily. (It takes 66ms for light to travel the distance of half the earth's circumference, or 132ms RTT; or, assuming c(fiber) = 0.7 * c(vacuum), rather 95ms one-way or 190ms RTT.) With ~100ms RTT from each end there is barely enough time left for data processing and transcoding.
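To put rough numbers on that budget, here is a small back-of-the-envelope sketch in Python; the 400ms voip budget (200ms one-way), the ~20000km half-circumference and c(fiber) = 0.7 * c(vacuum) are the assumptions, the rest just follows:

    # Rough one-way delay budget estimate for voip over a long path.
    # Assumptions: 400 ms mouth-to-ear budget (~200 ms one-way),
    # light in fiber at ~0.7 * c, path ~ half the earth's circumference.

    C_VACUUM_KM_S = 299792.0            # speed of light in vacuum, km/s
    C_FIBER_KM_S = 0.7 * C_VACUUM_KM_S
    HALF_CIRCUMFERENCE_KM = 20000.0     # ~half of earth's ~40000 km circumference
    ONE_WAY_BUDGET_MS = 200.0           # half of the 400 ms budget

    vacuum_ms = HALF_CIRCUMFERENCE_KM / C_VACUUM_KM_S * 1000.0  # ~67 ms
    fiber_ms = HALF_CIRCUMFERENCE_KM / C_FIBER_KM_S * 1000.0    # ~95 ms

    print("one-way, vacuum: %.0f ms (RTT %.0f ms)" % (vacuum_ms, 2 * vacuum_ms))
    print("one-way, fiber : %.0f ms (RTT %.0f ms)" % (fiber_ms, 2 * fiber_ms))
    print("budget left for queueing/processing: %.0f ms one-way"
          % (ONE_WAY_BUDGET_MS - fiber_ms))

In other words, on intercontinental paths propagation alone eats roughly half of the budget, so every additional 10ms of queueing delay counts.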
> There's achievable if every router is tuned to exactly the right conditions and there's achievable for coarse settings that can be widely deployed. Get the second out while continuing to work on making the first easier.

Okay, that part is easy: if you massively overshape, latency will be great, but bandwidth is compromised...

> residential connections only come in a smallish number of sizes,

Except that with, say, DSL there is often a wide corridor of allowed sync speeds; e.g. the 50Mbps down / 10Mbps up VDSL2 package of DT will actually synchronize anywhere in a corridor of 50 to 27Mbps and 10 to 5.5Mbps (numbers are approximately right). That is almost a factor of 2, too much for a one-size-fits-all approach (say 90% of advertised speed).

> it shouldn't be too hard to do a few probes and guess which size is in use, then set the bandwidth to 90% of that standard size and you should be pretty good without further tuning.

No, with ATM carriers (ADSL, some VDSL) the encapsulation overhead ranges from ~10% to >50% depending on packet size, so to get the bottleneck queue reliably under our control we would need to shape to ~50% of link speed, obviously a very hard sell. (And it is not easy to figure out whether the bottleneck link uses ATM or not, so there is no one size fits all.) We currently have no easy and quick way of detecting ATM link layers from cerowrt...
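To make the "depends on packet size" part concrete, a quick Python sketch of the ATM cell math (48 payload bytes per 53-byte cell, last cell padded); the 40 bytes of per-packet overhead below is just an assumed value, the real number depends on the encapsulation (PPPoA, PPPoE, bridged, ...):

    def atm_wire_bytes(ip_bytes, per_packet_overhead=40):
        """Bytes actually sent on an ATM-based link for one IP packet.
        Each ATM cell carries 48 payload bytes in a 53-byte cell and
        the last cell is padded to a full cell."""
        payload = ip_bytes + per_packet_overhead
        cells = (payload + 47) // 48      # integer ceiling division
        return cells * 53

    for size in (64, 200, 1500):
        wire = atm_wire_bytes(size)
        print("IP packet %4d B -> %4d B on the wire (%.0f%% overhead)"
              % (size, wire, 100.0 * (wire - size) / size))

    print("best case (cells exactly filled): %.1f%% overhead"
          % (100.0 * (53.0 / 48.0 - 1.0)))

So a tiny voip or ACK packet can easily cost more than half again its size in cell padding and headers, while a full-sized packet sits closer to the 10-17% range; no single fixed shaping percentage covers both.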
>>> It's far better than sustained multi-second ping times which is what I've seen with stock setups.
>>
>> True, but compared to multi-second delays even <1000ms would be a really great improvement, but also not enough.
>>
>>> but if no estimate is this bad, how bad is it if you use as your estimate the 'rated' speed of your DSL (i.e. what the ISP claims they are providing you) instead of the fully accurate speed that includes accounting for ATM encapsulation?
>>
>> Well, ~95ms with outliers > 1000ms, just as bad as no estimate.

I shaped 5% below the rated speed as reported by the DSL modem, so disabling the ATM link layer adjustments (as shown in the broken lines in the plot) basically increased the effective shaped rate by ~13%, or to effectively 107% of line rate; your proposal would be line rate and no link layer adjustments, or effectively 110% of line rate. (I do not feel like repeating this experiment right now, as I think the data so far shows that even with less misjudgment the bloat effect is fully visible.) Not accounting for ATM framing carries a ~10% cost in link speed, as ATM packet size on the wire increases by >= ~10%.

> so what if you shape to 90% of rated speed (no allowance for ATM vs other transports)?

I have not done that, but the typical recommendation for ADSL links, when shaping without taking the link layer peculiarities into account, is 85% (which should work for large packets, but can easily melt down with lots of smallish packets, like voip calls). I repeat, there is no simple one-size-fits-all shaping that will solve the bufferbloat issue for most home users in an acceptable fashion. (And I am not talking perfect here, it simply is not good enough.) Note that 90% will just account for the 48-in-53 ATM transport cost; it will not take the increased per-packet header into account.
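Put as numbers (a rough sketch, reusing the assumed 40-byte per-packet overhead from the cell math above; the exact figures depend on the encapsulation):

    PER_PACKET_OVERHEAD = 40  # assumed; real value depends on the encapsulation

    def goodput_fraction(ip_bytes):
        """Fraction of the raw ATM link rate usable for IP packets of this
        size (53-byte cells carrying 48 payload bytes, last cell padded)."""
        cells = (ip_bytes + PER_PACKET_OVERHEAD + 47) // 48
        return ip_bytes / (cells * 53.0)

    for size in (1500, 200):
        frac = goodput_fraction(size)
        print("packet %4d B: real IP capacity %.1f%% of line rate"
              "  -> 85%% shaping %s, 90%% shaping %s"
              % (size, 100 * frac,
                 "holds" if frac > 0.85 else "overshoots",
                 "holds" if frac > 0.90 else "overshoots"))

Which is why the 85% rule of thumb more or less survives bulk transfers with large packets but falls apart for voip-sized packets, and why shaping to 95% with the ATM adjustment disabled ends up above what the line can actually carry.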
>>> It's also worth figuring out if this problem would remain in place if you didn't have to go through the ISP router and were running fq_codel on that router.
>>
>> If the DSL modem were debloated, at least on the upstream no shaping would be required any more; but that does not fix the need for downstream shaping (and bandwidth estimation) until the head-end gear is debloated...
>
> right, I was forgetting this earlier.
>
>>> As long as fixing bufferbloat involves esoteric measurements and tuning, it's not going to be solved, but if it could be solved by people flashing openwrt onto their DSL router and then using the defaults, it could gain traction fairly quickly.
>>
>> But as there are only very few DSL modems with open sources (especially for the DSL chips) this is just as esoteric ;) Really, if equipment manufacturers could be convinced to take these issues seriously and actually fix their gear, that would be best. But this does not look like it is happening on the fast track. (Even CableLabs, the DOCSIS developer, punted on requiring codel or fq_codel in DOCSIS modems, since they think that the required timestamps are too "expensive" on the device class they want to use for modems. They opted for PIE, much better than what we have right now but far away from my latency-under-load increase of 10ms...)
>>
>>>>> I agree that going from 65ms to 95ms seems significant, but if the stock version goes up above 1000ms, then I think we are talking about things that are 'close'
>>>>
>>>> Well, if we include outliers (and we should, as enough outliers will degrade the FPS and voip suitability of an otherwise responsive system quickly) stock and improper shaping are in the >1000ms worst-case range, while proper SQM bounds this to 100ms.
>>>>
>>>>> assuming that latency under load without the improvements got >1000ms
>>>>>
>>>>> fast-slow (in ms)
>>>>> ideal = 10
>>>>> untuned = 43
>>>>> bloated > 1000
>>>>
>>>> The sign seems off as fast < slow?

I like this best ;)

>>> yep, I reversed fast/slow in all of these
>>>
>>>>> fast/slow
>>>>> ideal = 1.25
>>>>> untuned = 1.83
>>>>> bloated > 19
>>>>
>>>> But Fast < Slow and hence this ratio should be <0?
>>>
>>> 1 not 0, but yes, this is really slow/fast
>>>
>>>>> slow/fast
>>>>> ideal = 0.8
>>>>> untuned = 0.55
>>>>> bloated = 0.05
>>>>
>>>> and this >0?
>>>
>>> and this is really fast/slow
>>
>> What about taking the latency difference and rescaling it with a reference time, like say the time a photon would take to travel once around the equator, or the earth's diameter?
>
> how about latency difference scaled by the time to send one 1500 byte packet at the measured throughput?

So you propose latency difference / time to send one full packet at the measured speed.

Not sure: think of two de-bloated setups, one fast, one slow: for the slow link we get 10ms/long, for the fast link we get 10ms/short, so assuming that both keep the 10ms average latency increase, why should both links show a different bloat measure?

I really think the raw latency difference is what we should convince the users to look at. All one-number measures are going to be too simplistic, but at least for the difference you can easily estimate the effect on RTTs for relevant traffic...

> This would factor out the data rate and would not be affected by long distance links.

I am not convinced that people on a slow link can afford latency increases any better than people on a fast link. I actually think that it is the other way round. During the tuning process your measure might be helpful to find a good tradeoff between bandwidth and latency increase, though.

Best Regards
	Sebastian

> David Lang
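P.S.: To illustrate why I distrust the packet-time scaling, a quick sketch; the 10ms latency increase is the assumed target from above, and the two link rates are just illustrative:

    # Same assumed 10 ms latency-under-load increase on two de-bloated links,
    # scored once as the raw difference and once divided by the time needed
    # to serialize one 1500-byte packet at the link rate.

    PACKET_BITS = 1500 * 8
    DELTA_MS = 10.0                      # assumed latency increase under load

    for name, mbps in (("slow ADSL", 2.0), ("fast VDSL", 100.0)):
        serialize_ms = PACKET_BITS / (mbps * 1e6) * 1000.0
        print("%9s: raw difference %4.1f ms, scaled score %6.1f"
              % (name, DELTA_MS, DELTA_MS / serialize_ms))

Same user experience, wildly different numbers, which is why I would rather show people the raw latency difference.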