* [Cerowrt-devel] SQM and PPPoE, more questions than answers...
@ 2014-10-11 23:12 Sebastian Moeller
2014-10-15 0:03 ` Sebastian Moeller
0 siblings, 1 reply; 16+ messages in thread
From: Sebastian Moeller @ 2014-10-11 23:12 UTC (permalink / raw)
To: cerowrt-devel
Hi,
just to document my current understanding of using SQM on a router that also terminates a PPPoE WAN connection. We basically have two options: set up SQM on the real interface (let’s call it ge00, as cerowrt does) or on the associated ppp device, pppoe-ge00. In theory both should produce the same results; in practice the current SQM produces significantly different results. Let me enumerate the main differences that show up when testing with netperf-wrapper’s RRUL test:
1) SQM on ge00 does not show a working egress classification in the RRUL test (no visible “banding”/stratification of the 4 different priority TCP flows), while SQM on pppoe-ge00 does show this stratification.
Now the reason for this is quite obvious once we take into account that on ge00 the kernel sees a packet that already contains a PPP header between the ethernet and IP headers and carries a different ether_type field; our diffserv filters currently ignore everything except straight IPv4 and IPv6 packets, so due to the unexpected/unhandled PPP header everything lands in the default priority class and hence there is no stratification. If we shape on pppoe-ge00 the kernel seems to do all processing before encapsulating the data with PPP, so all filters just work. In theory that should be relatively easy to fix (at least for the specific PPPoE case; I am unsure about a generic solution) by using offsets to access the TOS bits in PPP packets. Most likely we face the same issue, to some degree, with other encapsulations that pass through cerowrt (except that most of those will use an outer IP header from which we can extract DSCPs…, but I digress)
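The offset approach just described can be sketched with tc’s u32 matcher. This is only an illustrative sketch, not the exact sqm-scripts rules: the interface name, filter priority and class id are assumptions.

```shell
# Sketch (assumed names/handles): classify PPPoE session frames on the
# underlying ethernet device. With "protocol ppp_ses" the u32 offsets
# start at the PPPoE header: 6 bytes of PPPoE, then 2 bytes of PPP
# protocol id, so the IPv4 header begins at offset 8 and its TOS byte
# sits at offset 9.
# Match PPP protocol 0x0021 (IPv4) and DSCP EF (0x2e, left-shifted by 2
# within the TOS byte), steering such packets into priority class 1:11:
tc filter add dev ge00 parent 1:0 protocol ppp_ses prio 10 u32 \
    match u16 0x0021 0xffff at 6 \
    match u8 0xb8 0xfc at 9 \
    flowid 1:11
```

An analogous rule with `match u16 0x0057 0xffff at 6` (PPP protocol for IPv6) and the traffic-class bits at their IPv6 offset would be needed for the v6 case.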
2) SQM on ge00 shows better latency under load (LUL): the LUL increases by ~2× fq_codel’s target, i.e. 10ms, while SQM on pppoe-ge00 shows a LUL increase (LULI) roughly twice as large, around 20ms.
I have no idea why that is, if anybody has an idea please chime in.
3) SQM on pppoe-ge00 has a roughly 20% higher egress rate than SQM on ge00 (with ingress more or less identical between the two). Also, 2) and 3) do not seem to be coupled: artificially reducing the egress rate on pppoe-ge00 to match the egress rate seen on ge00 does not reduce the LULI to the ge00-typical 10ms; it stays at 20ms.
For this I also have no good hypothesis, any ideas?
So the current choice is either to accept a noticeable increase in LULI (though note that some years ago even an average of 20ms was most likely rare in real life) or an equally noticeable decrease in egress bandwidth…
Best Regards
Sebastian
P.S.: It turns out, at least on my link, that for shaping on pppoe-ge00 the kernel does not account for any header automatically, so I need to specify a per-packet overhead (PPOH) of 40 bytes (on an ADSL2+ link with ATM linklayer); when shaping on ge00 however (with the kernel still terminating the PPPoE link to my ISP) I only need to specify a PPOH of 26, as the kernel already adds the 14 bytes for the ethernet header…
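For reference, the two overhead values from the P.S. translate into tc roughly as follows. This is a hedged sketch: the htb shaper, handle, and default class are examples, and only the totals (40 and 26) come from the measurement; the per-field breakdown in the comments is an assumption consistent with LLC/SNAP encapsulation plus the AAL5 trailer.

```shell
# Sketch: the same ATM/PPPoE per-packet overhead, expressed once on the
# ppp device and once on the underlying ethernet device, via tc's stab
# linklayer accounting.

# On pppoe-ge00 the kernel accounts for no header at all, so all 40 bytes
# (ethernet 14 + PPPoE/PPP 8 + assumed remaining ATM overhead 18) must be
# given explicitly:
tc qdisc add dev pppoe-ge00 root handle 1: stab linklayer atm overhead 40 htb default 11

# On ge00 the kernel already counts the 14-byte ethernet header, so only
# the remaining 26 bytes are needed:
tc qdisc add dev ge00 root handle 1: stab linklayer atm overhead 26 htb default 11
```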
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Cerowrt-devel] SQM and PPPoE, more questions than answers...
2014-10-11 23:12 [Cerowrt-devel] SQM and PPPoE, more questions than answers Sebastian Moeller
@ 2014-10-15 0:03 ` Sebastian Moeller
2014-10-15 12:02 ` Török Edwin
2015-03-18 22:14 ` Alan Jenkins
0 siblings, 2 replies; 16+ messages in thread
From: Sebastian Moeller @ 2014-10-15 0:03 UTC (permalink / raw)
To: cerowrt-devel
Hi All,
some more testing:
On Oct 12, 2014, at 01:12 , Sebastian Moeller <moeller0@gmx.de> wrote:
> Hi,
>
> just to document my current understanding of using SQM on a router that also terminates a PPPoE WAN connection. We basically have two options: set up SQM on the real interface (let’s call it ge00, as cerowrt does) or on the associated ppp device, pppoe-ge00. In theory both should produce the same results; in practice the current SQM produces significantly different results. Let me enumerate the main differences that show up when testing with netperf-wrapper’s RRUL test:
>
> 1) SQM on ge00 does not show a working egress classification in the RRUL test (no visible “banding”/stratification of the 4 different priority TCP flows), while SQM on pppoe-ge00 does show this stratification.
>
> Now the reason for this is quite obvious once we take into account that on ge00 the kernel sees a packet that already contains a PPP header between the ethernet and IP headers and carries a different ether_type field; our diffserv filters currently ignore everything except straight IPv4 and IPv6 packets, so due to the unexpected/unhandled PPP header everything lands in the default priority class and hence there is no stratification. If we shape on pppoe-ge00 the kernel seems to do all processing before encapsulating the data with PPP, so all filters just work. In theory that should be relatively easy to fix (at least for the specific PPPoE case; I am unsure about a generic solution) by using offsets to access the TOS bits in PPP packets. Most likely we face the same issue, to some degree, with other encapsulations that pass through cerowrt (except that most of those will use an outer IP header from which we can extract DSCPs…, but I digress)
Using tc u32 filters makes it possible to actually dive into PPPoE-encapsulated IPv4 and IPv6 packets and perform classification on “pass-through” PPPoE packets (as encountered when starting SQM on ge00 instead of pppoe-ge00, if the latter actually handles the WAN connection), so that one is solved (but see below).
>
> 2) SQM on ge00 shows better latency under load (LUL): the LUL increases by ~2× fq_codel’s target, i.e. 10ms, while SQM on pppoe-ge00 shows a LUL increase (LULI) roughly twice as large, around 20ms.
>
> I have no idea why that is, if anybody has an idea please chime in.
Once SQM on ge00 actually dives into the PPPoE packets and applies/tests u32 filters, the LUL increase grows to be almost identical to pppoe-ge00’s, if both ingress and egress classification are active and working. So it looks like the u32 filters I naively set up are quite costly. Maybe there is a better way to set these up...
>
> 3) SQM on pppoe-ge00 has a roughly 20% higher egress rate than SQM on ge00 (with ingress more or less identical between the two). Also, 2) and 3) do not seem to be coupled: artificially reducing the egress rate on pppoe-ge00 to match the egress rate seen on ge00 does not reduce the LULI to the ge00-typical 10ms; it stays at 20ms.
>
> For this I also have no good hypothesis, any ideas?
With classification fixed, the difference in egress rate shrinks to ~10% instead of 20%, so this also seems partly related to the classification issue.
>
>
> So the current choice is either to accept a noticeable increase in LULI (though note that some years ago even an average of 20ms was most likely rare in real life) or an equally noticeable decrease in egress bandwidth…
I guess it is back to the drawing board to figure out how to speed up the classification… and then revisit the PPPoE question again…
Regards
Sebastian
>
> Best Regards
> Sebastian
>
> P.S.: It turns out, at least on my link, that for shaping on pppoe-ge00 the kernel does not account for any header automatically, so I need to specify a per-packet overhead (PPOH) of 40 bytes (on an ADSL2+ link with ATM linklayer); when shaping on ge00 however (with the kernel still terminating the PPPoE link to my ISP) I only need to specify a PPOH of 26, as the kernel already adds the 14 bytes for the ethernet header…
>
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Cerowrt-devel] SQM and PPPoE, more questions than answers...
2014-10-15 0:03 ` Sebastian Moeller
@ 2014-10-15 12:02 ` Török Edwin
2014-10-15 13:39 ` Sebastian Moeller
2015-03-18 22:14 ` Alan Jenkins
1 sibling, 1 reply; 16+ messages in thread
From: Török Edwin @ 2014-10-15 12:02 UTC (permalink / raw)
To: cerowrt-devel
On 10/15/2014 03:03 AM, Sebastian Moeller wrote:
> I guess it is back to the drawing board to figure out how to speed up the classification… and then revisit the PPPoE question again…
FWIW I had to add this to /etc/config/network (done via luci actually):
option keepalive '500 30'
Otherwise it uses the default values from /etc/ppp/options, and then I hit https://dev.openwrt.org/ticket/7793:
lcp-echo-failure 5
lcp-echo-interval 1
The symptoms are that if I start a large download, after half a minute or so pppd complains that it didn't receive a reply to 5 LCP echo packets, and it disconnects/reconnects.
Sounds like the LCP echo/reply packets should get prioritized, but I don't know if it is my router that is dropping them or my ISP.
When you tested PPPoE, did you notice pppd dropping the connection and restarting? Because that would affect the timings for sure...
Best regards,
--Edwin
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Cerowrt-devel] SQM and PPPoE, more questions than answers...
2014-10-15 12:02 ` Török Edwin
@ 2014-10-15 13:39 ` Sebastian Moeller
2014-10-15 17:28 ` Dave Taht
0 siblings, 1 reply; 16+ messages in thread
From: Sebastian Moeller @ 2014-10-15 13:39 UTC (permalink / raw)
To: Török Edwin; +Cc: cerowrt-devel
Hi Edwin,
On Oct 15, 2014, at 14:02 , Török Edwin <edwin+ml-cerowrt@etorok.net> wrote:
> On 10/15/2014 03:03 AM, Sebastian Moeller wrote:
>> I guess it is back to the drawing board to figure out how to speed up the classification… and then revisit the PPPoE question again…
>
> FWIW I had to add this to /etc/config/network (done via luci actually):
> option keepalive '500 30'
>
> Otherwise it uses these default values from /etc/ppp/options, and then I hit: https://dev.openwrt.org/ticket/7793:
> lcp-echo-failure 5
> lcp-echo-interval 1
>
> The symptoms are that if I start a large download, after half a minute or so pppd complains that it didn't receive a reply to 5 LCP echo packets, and it disconnects/reconnects.
I have not yet seen these in the logs, but I will keep my eyes open.
> Sounds like the LCP echo/reply packets should get prioritized, but I don't know if it is my router that is dropping them or my ISP.
I think that is something we should be able to teach SQM (as long as the shaper is running on the lower ethernet interface and not on the pppoe interface).
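A sketch of what such a rule could look like when shaping on the ethernet device (the filter priority and class id 1:10 are assumptions, not existing SQM configuration): LCP has PPP protocol id 0xc021, which sits right after the 6-byte PPPoE header, so a u32 match can lift those frames into the top class.

```shell
# Sketch: steer LCP frames (PPP protocol 0xc021, which carries the
# echo-request/echo-reply keepalives) into an assumed highest-priority
# class 1:10, so a saturating upload cannot starve them:
tc filter add dev ge00 parent 1:0 protocol ppp_ses prio 1 u32 \
    match u16 0xc021 0xffff at 6 \
    flowid 1:10
```

Note this can only work on the underlying ethernet device; on the pppoe device the shaper never sees LCP frames at all.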
>
> When you tested PPPoE, did you notice pppd dropping the connection and restarting? Because that would affect the timings for sure…
Nope, what I see is simply more variance in the bandwidth and latency numbers and a less steep slope on a right-shifted ICMP CDF… I assume that the disconnects/reconnects would show up as periods without any data transfer….
Mmmh, I will try to put the PPP service packets into the highest priority class and see whether that changes things, as well as test your PPP options.
Thanks for your help
Sebastian
>
> Best regards,
> --Edwin
>
>
> _______________________________________________
> Cerowrt-devel mailing list
> Cerowrt-devel@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/cerowrt-devel
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Cerowrt-devel] SQM and PPPoE, more questions than answers...
2014-10-15 13:39 ` Sebastian Moeller
@ 2014-10-15 17:28 ` Dave Taht
2014-10-15 19:55 ` Sebastian Moeller
0 siblings, 1 reply; 16+ messages in thread
From: Dave Taht @ 2014-10-15 17:28 UTC (permalink / raw)
To: Sebastian Moeller; +Cc: cerowrt-devel
hmm. The pppoe LCP packets are sparse and should already be optimized
by fq_codel, but I guess I'll go look at the construction of those
headers. Perhaps they need to be decoded better in the flow_dissector
code?
I also made some comments re the recent openwrt pull request.
https://github.com/dtaht/ceropackages-3.10/commit/b9e3bafdabb3c5aa47f8f63eae2ecfe34c361855
SQM need not require the advanced qdiscs package, if it checks for the
availability of the other qdiscs. And even then, nobody has proposed
putting the new nfq_codel stuff into openwrt, as it's still rather
inadequately tested; my hope is that cake simplifies matters
significantly when it's baked. I already have patches for sqm for it,
but it's just not baked enough...
Also, I think exploring policing at higher ingress bandwidths is warranted...
On Wed, Oct 15, 2014 at 6:39 AM, Sebastian Moeller <moeller0@gmx.de> wrote:
> Hi Edwin,
>
>
> On Oct 15, 2014, at 14:02 , Török Edwin <edwin+ml-cerowrt@etorok.net> wrote:
>
>> On 10/15/2014 03:03 AM, Sebastian Moeller wrote:
>>> I guess it is back to the drawing board to figure out how to speed up the classification… and then revisit the PPPoE question again…
>>
>> FWIW I had to add this to /etc/config/network (done via luci actually):
>> option keepalive '500 30'
>>
>> Otherwise it uses these default values from /etc/ppp/options, and then I hit: https://dev.openwrt.org/ticket/7793:
>> lcp-echo-failure 5
>> lcp-echo-interval 1
>>
>> The symptoms are that if I start a large download, after half a minute or so pppd complains that it didn't receive a reply to 5 LCP echo packets, and it disconnects/reconnects.
>
> I have not yet seen these in the logs, but I will keep my eyes open.
>
>> Sounds like the LCP echo/reply packets should get prioritized, but I don't know if it is my router that is dropping them or my ISP.
>
> I think that is something we should be able to teach SQM (as long as the shaper is running on the lower ethernet interface and not the pppoe interface).
>
>>
>> When you tested PPPoE, did you notice pppd dropping the connection and restarting? Because that would affect the timings for sure…
>
> Nope, what I see is simply more variance in the bandwidth and latency numbers and a less steep slope on a right-shifted ICMP CDF… I assume that the disconnects/reconnects would show up as periods without any data transfer….
>
> Mmmh, I will try to put the PPP service packets into the highest priority class and see whether that changes things, as well as testing your PPP options.
>
> Thanks for your help
>
> Sebastian
>
>>
>> Best regards,
>> --Edwin
>>
>>
--
Dave Täht
http://www.bufferbloat.net/projects/bloat/wiki/Upcoming_Talks
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Cerowrt-devel] SQM and PPPoE, more questions than answers...
2014-10-15 17:28 ` Dave Taht
@ 2014-10-15 19:55 ` Sebastian Moeller
0 siblings, 0 replies; 16+ messages in thread
From: Sebastian Moeller @ 2014-10-15 19:55 UTC (permalink / raw)
To: Dave Täht; +Cc: cerowrt-devel
Hi Dave,
On Oct 15, 2014, at 19:28 , Dave Taht <dave.taht@gmail.com> wrote:
> hmm. The pppoe LLC packets are sparse and should already be optimized
> by fq_codel, but I guess I'll go look at the construction of those
> headers. Perhaps they need to be decoded better in the flow_dissector
> code?
So when shaping on pppoe-ge00 one does not see the LCP packets at all (tested with tcpdump -i pppoe-ge00), since they are added after the shaping (tcpdump -i ge00 does see the LCP packets). I have no idea whether pppd issues these with higher priority or not.
>
> I also made some comments re the recent openwrt pull request.
>
> https://github.com/dtaht/ceropackages-3.10/commit/b9e3bafdabb3c5aa47f8f63eae2ecfe34c361855
>
> SQM need not require the advanced qdiscs package, if it checks for
> availability of the other qdiscs,
Well, but how to do this? I know of no safe way except testing for the availability of modules for a known set of qdiscs, but what if the qdiscs are built into a monolithic kernel? Does anyone here have a good idea of how to detect all qdiscs available to the running kernel?
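One hedged way to answer this is simply to try instantiating the qdisc on a scratch interface, which works for both modular and monolithic kernels. This is only a sketch of the idea, not existing SQM code; the dummy-device name is arbitrary and root privileges are assumed:

```shell
# Sketch: probe whether the running kernel can instantiate a given qdisc,
# regardless of whether it is a module or built in. Requires root and the
# dummy netdevice driver; "sqmtest0" is an arbitrary scratch device name.
qdisc_available() {
    qdisc="$1"
    ip link add dev sqmtest0 type dummy 2>/dev/null || return 1
    tc qdisc replace dev sqmtest0 root "$qdisc" 2>/dev/null
    rc=$?
    ip link del dev sqmtest0 2>/dev/null
    return $rc
}

# usage: qdisc_available fq_codel && echo "fq_codel is usable"
```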
Best Regards
Sebastian
> and even then nobody's proposed
> putting the new nfq_codel stuff into openwrt - as it's still rather
> inadequately tested, and it's my hope that cake simplifies matters
> significantly when it's baked. I already have patches for sqm for it,
> but it's just not baked enough...
>
> Also I think exploring policing at higher ingress bandwidths is warranted…
>
> On Wed, Oct 15, 2014 at 6:39 AM, Sebastian Moeller <moeller0@gmx.de> wrote:
>> Hi Edwin,
>>
>>
>> On Oct 15, 2014, at 14:02 , Török Edwin <edwin+ml-cerowrt@etorok.net> wrote:
>>
>>> On 10/15/2014 03:03 AM, Sebastian Moeller wrote:
>>>> I guess it is back to the drawing board to figure out how to speed up the classification… and then revisit the PPPoE question again…
>>>
>>> FWIW I had to add this to /etc/config/network (done via luci actually):
>>> option keepalive '500 30'
>>>
>>> Otherwise it uses these default values from /etc/ppp/options, and then I hit: https://dev.openwrt.org/ticket/7793:
>>> lcp-echo-failure 5
>>> lcp-echo-interval 1
>>>
>>> The symptoms are that if I start a large download, after half a minute or so pppd complains that it didn't receive a reply to 5 LCP echo packets, and it disconnects/reconnects.
>>
>> I have not yet seen these in the logs, but I will keep my eyes open.
>>
>>> Sounds like the LCP echo/reply packets should get prioritized, but I don't know if it is my router that is dropping them or my ISP.
>>
>> I think that is something we should be able to teach SQM (as long as the shaper is running on the lower ethernet interface and not the pppoe interface).
>>
>>>
>>> When you tested PPPoE, did you notice pppd dropping the connection and restarting? Because that would affect the timings for sure…
>>
>> Nope, what I see is simply more variance in the bandwidth and latency numbers and a less steep slope on a right-shifted ICMP CDF… I assume that the disconnects/reconnects would show up as periods without any data transfer….
>>
>> Mmmh, I will try to put the PPP service packets into the highest priority class and see whether that changes things, as well as testing your PPP options.
>>
>> Thanks for your help
>>
>> Sebastian
>>
>>>
>>> Best regards,
>>> --Edwin
>>>
>>>
>
>
>
> --
> Dave Täht
>
> http://www.bufferbloat.net/projects/bloat/wiki/Upcoming_Talks
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Cerowrt-devel] SQM and PPPoE, more questions than answers...
2014-10-15 0:03 ` Sebastian Moeller
2014-10-15 12:02 ` Török Edwin
@ 2015-03-18 22:14 ` Alan Jenkins
2015-03-19 2:43 ` David Lang
2015-03-19 8:29 ` Sebastian Moeller
1 sibling, 2 replies; 16+ messages in thread
From: Alan Jenkins @ 2015-03-18 22:14 UTC (permalink / raw)
To: Sebastian Moeller, cerowrt-devel
Hi Seb
I tested shaping on eth1 vs pppoe-wan, as it applies to ADSL (on
Barrier Breaker + sqm-scripts). Maybe this is going back a bit and is no
longer interesting to read, but it seemed suspicious and interesting
enough that I wanted to test it.
My conclusions were: 1) I should stick with pppoe-wan; 2) the question
really means “do you want to disable classification?”; 3) I personally
want to preserve the upload bandwidth and accept slightly higher latency.
On 15/10/14 01:03, Sebastian Moeller wrote:
> Hi All,
>
> some more testing: On Oct 12, 2014, at 01:12 , Sebastian Moeller
> <moeller0@gmx.de> wrote:
>> 1) SQM on ge00 does not show a working egress classification in the
>> RRUL test (no visible “banding”/stratification of the 4 different
>> priority TCP flows), while SQM on pppoe-ge00 does show this
>> stratification.
> Using tc u32 filters makes it possible to actually dive into
> PPPoE encapsulated ipv4 and ipv6 packets and perform classification
> on “pass-through” PPPoE packets (as encountered when starting SQM on
> ge00 instead of pppoe-ge00, if the latter actually handles the wan
> connection), so that one is solved (but see below).
>
>>
>> 2) SQM on ge00 shows better latency under load (LUL), the LUL
>> increases by ~2× fq_codel’s target, i.e. 10ms, while SQM on pppoe-ge00
>> shows a LUL-increase (LULI) roughly twice as large or around 20ms.
>>
>> I have no idea why that is, if anybody has an idea please chime
>> in.
I saw the same, though with a higher difference for the egress rate. See
the first three files here:
https://www.dropbox.com/sh/shwz0l7j4syp2ea/AAAxrhDkJ3TTy_Mq5KiFF3u2a?dl=0
[netperf-wrapper noob puzzle: most of the ping lines vanish part-way
through. Maybe I failed it somehow.]
> Once SQM on ge00 actually dives into the PPPoE packets and
> applies/tests u32 filters the LUL increases to be almost identical to
> pppoe-ge00’s if both ingress and egress classification are active and
> do work. So it looks like the u32 filters I naively set up are quite
> costly. Maybe there is a better way to set these up...
Later you mentioned testing for coupling with egress rate. But you
didn't test coupling with classification!
I switched from simple.qos to simplest.qos, and that achieved the lower
latency on pppoe-wan. So I think your naive u32 filter setup wasn't the
real problem.
I did think ECN wouldn't be applied on eth1, and that would be the cause
of the latency. But disabling ECN didn't affect it. See files 3 to 6:
https://www.dropbox.com/sh/shwz0l7j4syp2ea/AAAxrhDkJ3TTy_Mq5KiFF3u2a?dl=0
I also admit surprise at fq_codel working within 20%/10ms on eth1. I
thought it'd really hurt by breaking the FQ part; now I guess it
doesn't. I still wonder about ECN marking, though I didn't check whether
my endpoint is using ECN.
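Checking the endpoint side is cheap; on Linux the TCP ECN behaviour is exposed as a sysctl. A small sketch (the meaning of the values is the standard Linux one):

```shell
# Sketch: inspect whether this Linux endpoint negotiates ECN.
# 0 = never request ECN, 1 = request ECN on outgoing connections,
# 2 = only accept ECN when the peer requests it (a common default).
cat /proc/sys/net/ipv4/tcp_ecn

# To actively request ECN on outgoing connections (needs root):
# sysctl -w net.ipv4.tcp_ecn=1
```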
>>
>> 3) SQM on pppoe-ge00 has a rough 20% higher egress rate than SQM on
>> ge00 (with ingress more or less identical between the two). Also 2)
>> and 3) do not seem to be coupled, artificially reducing the egress
>> rate on pppoe-ge00 to yield the same egress rate as seen on ge00
>> does not reduce the LULI to the ge00 typical 10ms, but it stays at
>> 20ms.
>>
>> For this I also have no good hypothesis, any ideas?
>
> With classification fixed the difference in egress rate shrinks to
> ~10% instead of 20, so this partly seems related to the
> classification issue as well.
My tests look like simplest.qos gives a lower egress rate, but not as
low as eth1 (like 20% vs 40%). So that's also similar.
>> So the current choice is either to accept a noticeable increase in
>> LULI (but note some years ago even an average of 20ms most likely
>> was rare in real life) or an equally noticeable decrease in
>> egress bandwidth…
>
> I guess it is back to the drawing board to figure out how to speed up
> the classification… and then revisit the PPPoE question again…
so maybe the question is actually classification vs. not?
+ IMO slow asymmetric links don't want to lose more upload bandwidth
than necessary. And I'm losing a *lot* in this test.
+ As you say, having only 20ms excess would still be a big
improvement. We could ignore the bait of 10ms right now.
vs
- lowest latency I've seen testing my link. Almost suspicious. Looks
close to 10ms average, when the DSL rate puts a lower bound of 7ms on
the average.
- fq_codel honestly works miracles already. classification is the knob
people had to use previously, who had enough time to twiddle it.
- on netperf-runner plots the "banding" doesn't look brilliant on slow
links anyway
> Regards Sebastian
>
>>
>> Best Regards Sebastian
>>
>> P.S.: It turns out, at least on my link, that for shaping on
>> pppoe-ge00 the kernel does not account for any header
>> automatically, so I need to specify a per-packet-overhead (PPOH) of
>> 40 bytes (on an ADSL2+ link with ATM linklayer); when shaping on
>> ge00 however (with the kernel still terminating the PPPoE link to
>> my ISP) I only need to specify a PPOH of 26, as the kernel already
>> adds the 14 bytes for the ethernet header…
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Cerowrt-devel] SQM and PPPoE, more questions than answers...
2015-03-18 22:14 ` Alan Jenkins
@ 2015-03-19 2:43 ` David Lang
2015-03-19 3:11 ` Dave Taht
2015-03-19 8:37 ` Sebastian Moeller
2015-03-19 8:29 ` Sebastian Moeller
1 sibling, 2 replies; 16+ messages in thread
From: David Lang @ 2015-03-19 2:43 UTC (permalink / raw)
To: Alan Jenkins; +Cc: cerowrt-devel
On Wed, 18 Mar 2015, Alan Jenkins wrote:
>> Once SQM on ge00 actually dives into the PPPoE packets and
>> applies/tests u32 filters the LUL increases to be almost identical to
>> pppoe-ge00’s if both ingress and egress classification are active and
>> do work. So it looks like the u32 filters I naively set up are quite
>> costly. Maybe there is a better way to set these up...
>
> Later you mentioned testing for coupling with egress rate. But you didn't
> test coupling with classification!
>
> I switched from simple.qos to simplest.qos, and that achieved the lower
> latency on pppoe-wan. So I think your naive u32 filter setup wasn't the real
> problem.
>
> I did think ECN wouldn't be applied on eth1, and that would be the cause of
> the latency. But disabling ECN didn't affect it. See files 3 to 6:
>
> https://www.dropbox.com/sh/shwz0l7j4syp2ea/AAAxrhDkJ3TTy_Mq5KiFF3u2a?dl=0
>
> I also admit surprise at fq_codel working within 20%/10ms on eth1. I thought
> it'd really hurt, by breaking the FQ part. Now I guess it doesn't. I still
> wonder about ECN marking, though I didn't check my endpoint is using ECN.
ECN should never increase latency; if it has any effect, it should improve
latency, because the sender slows down when some hop along the path is
overloaded, rather than sending the packets anyway and having them sit in a
buffer for a while. This doesn't decrease actual throughput either (although
if you run a test that doesn't actually wait for all the packets to arrive
at the far end, it will look like it decreases throughput).
>>>
>>> 3) SQM on pppoe-ge00 has a rough 20% higher egress rate than SQM on
>>> ge00 (with ingress more or less identical between the two). Also 2)
>>> and 3) do not seem to be coupled, artificially reducing the egress
>>> rate on pppoe-ge00 to yield the same egress rate as seen on ge00
>>> does not reduce the LULI to the ge00 typical 10ms, but it stays at
>>> 20ms.
>>>
>>> For this I also have no good hypothesis, any ideas?
>>
>> With classification fixed the difference in egress rate shrinks to
>> ~10% instead of 20, so this partly seems related to the
>> classification issue as well.
>
> My tests look like simplest.qos gives a lower egress rate, but not as low as
> eth1. (Like 20% vs 40%). So that's also similar.
>
>>> So the current choice is either to accept a noticeable increase in
>>> LULI (but note some years ago even an average of 20ms most likely
>>> was rare in real life) or an equally noticeable decrease in
>>> egress bandwidth…
>>
>> I guess it is back to the drawing board to figure out how to speed up
>> the classification… and then revisit the PPPoE question again…
>
> so maybe the question is actually classification vs. not?
>
> + IMO slow asymmetric links don't want to lose more upload bandwidth than
> necessary. And I'm losing a *lot* in this test.
> + As you say, having only 20ms excess would still be a big improvement. We
> could ignore the bait of 10ms right now.
>
> vs
>
> - lowest latency I've seen testing my link. almost suspicious. looks close
> to 10ms average, when the dsl rate puts a lower bound of 7ms on the average.
> - fq_codel honestly works miracles already. classification is the knob
> people had to use previously, who had enough time to twiddle it.
That's what most people find when they try it. Classification doesn't result
in throughput-vs-latency tradeoffs so much as it gives absolute priority to
some types of traffic. But unless you are really up against your bandwidth
limit, this seldom matters in the real world. As long as latency is kept
low, everything works, so you don't need to give VoIP priority over other
traffic or things like that.
David Lang
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Cerowrt-devel] SQM and PPPoE, more questions than answers...
2015-03-19 2:43 ` David Lang
@ 2015-03-19 3:11 ` Dave Taht
2015-03-19 8:37 ` Sebastian Moeller
1 sibling, 0 replies; 16+ messages in thread
From: Dave Taht @ 2015-03-19 3:11 UTC (permalink / raw)
To: David Lang; +Cc: Alan Jenkins, cerowrt-devel
On Wed, Mar 18, 2015 at 7:43 PM, David Lang <david@lang.hm> wrote:
> On Wed, 18 Mar 2015, Alan Jenkins wrote:
>
>>> Once SQM on ge00 actually dives into the PPPoE packets and
>>> applies/tests u32 filters the LUL increases to be almost identical to
>>> pppoe-ge00’s if both ingress and egress classification are active and
>>> do work. So it looks like the u32 filters I naively set up are quite
>>> costly. Maybe there is a better way to set these up...
>>
>>
>> Later you mentioned testing for coupling with egress rate. But you didn't
>> test coupling with classification!
>>
>> I switched from simple.qos to simplest.qos, and that achieved the lower
>> latency on pppoe-wan. So I think your naive u32 filter setup wasn't the
>> real problem.
>>
>> I did think ECN wouldn't be applied on eth1, and that would be the cause
>> of the latency. But disabling ECN didn't affect it. See files 3 to 6:
>>
>> https://www.dropbox.com/sh/shwz0l7j4syp2ea/AAAxrhDkJ3TTy_Mq5KiFF3u2a?dl=0
>>
>> I also admit surprise at fq_codel working within 20%/10ms on eth1. I
>> thought it'd really hurt, by breaking the FQ part. Now I guess it doesn't.
>> I still wonder about ECN marking, though I didn't check my endpoint is using
>> ECN.
>
>
> ECN should never increase latency, if it has any effect it should improve
> latency because you slow down sending packets when some hop along the path
> is overloaded rather than sending the packets anyway and having them sit in
> a buffer for a while. This doesn't decrease actual throughput either
> (although if you are doing a test that doesn't actually wait for all the
> packets to arrive at the far end, it will look like it decreases throughput)
ECN does, provably, increase latency (and loss) for other, non-ECN-marked
flows. Not by a lot, but it does. In the case of a malignantly mis-marked
flow, the present codel AQM algorithm does pretty bad things to itself and
to other non-ECN-marked packets. (I have fixes for codel; fq_codel doesn't
have this problem, and pie somewhat has it.)
>>>>
>>>> 3) SQM on pppoe-ge00 has a rough 20% higher egress rate than SQM on
>>>> ge00 (with ingress more or less identical between the two). Also 2)
>>>> and 3) do not seem to be coupled, artificially reducing the egress
>>>> rate on pppoe-ge00 to yield the same egress rate as seen on ge00
>>>> does not reduce the LULI to the ge00 typical 10ms, but it stays at
>>>> 20ms.
>>>>
>>>> For this I also have no good hypothesis, any ideas?
>>>
>>>
>>> With classification fixed the difference in egress rate shrinks to
>>> ~10% instead of 20, so this partly seems related to the
>>> classification issue as well.
One of the things we really have to get around to doing is more
high-rate testing, and actually measuring how much latency the TCP
flows are experiencing.
>>
>> My tests look like simplest.qos gives a lower egress rate, but not as low
>> as eth1. (Like 20% vs 40%). So that's also similar.
>>
>>>> So the current choice is either to accept a noticeable increase in
>>>> LULI (but note some years ago even an average of 20ms most likely
>>>> was rare in real life) or an equally noticeable decrease in
>>>> egress bandwidth…
>>>
>>>
>>> I guess it is back to the drawing board to figure out how to speed up
>>> the classification… and then revisit the PPPoE question again…
>>
>>
>> so maybe the question is actually classification vs. not?
>>
>> + IMO slow asymmetric links don't want to lose more upload bandwidth than
>> necessary. And I'm losing a *lot* in this test.
>> + As you say, having only 20ms excess would still be a big improvement.
>> We could ignore the bait of 10ms right now.
>>
>> vs
>>
>> - lowest latency I've seen testing my link. almost suspicious. looks close
>> to 10ms average, when the dsl rate puts a lower bound of 7ms on the average.
>> - fq_codel honestly works miracles already. classification is the knob
>> people had to use previously, who had enough time to twiddle it.
>
>
> That's what most people find when they try it. Classification doesn't result
> in throughput vs latency tradeoffs as much as it gives absolute priority to
> some types of traffic. But unless you are really up against your bandwidth
> limit, this seldom matters in the real world. As long as latency is kept
> low, everything works so you don't need to give VoIP priority over other
> traffic or things like that.
+10.
>
> David Lang
> _______________________________________________
> Cerowrt-devel mailing list
> Cerowrt-devel@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/cerowrt-devel
>
--
Dave Täht
Let's make wifi fast, less jittery and reliable again!
https://plus.google.com/u/0/107942175615993706558/posts/TVX3o84jjmb
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Cerowrt-devel] SQM and PPPoE, more questions than answers...
2015-03-18 22:14 ` Alan Jenkins
2015-03-19 2:43 ` David Lang
@ 2015-03-19 8:29 ` Sebastian Moeller
2015-03-19 9:42 ` Alan Jenkins
2015-03-19 13:49 ` Alan Jenkins
1 sibling, 2 replies; 16+ messages in thread
From: Sebastian Moeller @ 2015-03-19 8:29 UTC (permalink / raw)
To: Alan Jenkins; +Cc: cerowrt-devel
Hi Alan,
On Mar 18, 2015, at 23:14 , Alan Jenkins <alan.christopher.jenkins@gmail.com> wrote:
> Hi Seb
>
> I tested shaping on eth1 vs pppoe-wan, as it applies to ADSL. (On Barrier Breaker + sqm-scripts). Maybe this is going back a bit & no longer interesting to read. But it seemed suspicious & interesting enough that I wanted to test it.
>
> My conclusion was 1) I should stick with pppoe-wan,
Not a bad decision, especially given the recent changes to SQM to make it survive transient pppoe-interface disappearances. Before those changes the beauty of shaping on the ethernet device was that pppoe could come and go, but SQM stayed active and working. But due to your help this problem seems fixed now.
> 2) the question really means do you want to disable classification
> 3) I personally want to preserve the upload bandwidth and accept slightly higher latency.
My question still is, is the bandwidth sacrifice really necessary or is this test just showing a corner case in simple.qos that can be fixed. I currently lack enough time to tackle this effectively.
>
>
> On 15/10/14 01:03, Sebastian Moeller wrote:
>> Hi All,
>>
>> some more testing: On Oct 12, 2014, at 01:12 , Sebastian Moeller
>> <moeller0@gmx.de> wrote:
>
>>> 1) SQM on ge00 does not show a working egress classification in the
>>> RRUL test (no visible “banding”/stratification of the 4 different
>>> priority TCP flows), while SQM on pppoe-ge00 does show this
>>> stratification.
>
>> Using tc u32 filters makes it possible to actually dive into
>> PPPoE encapsulated ipv4 and ipv6 packets and perform classification
>> on “pass-through” PPPoE packets (as encountered when starting SQM on
>> ge00 instead of pppoe-ge00, if the latter actually handles the wan
>> connection), so that one is solved (but see below).
>>
>>>
>>> 2) SQM on ge00 shows better latency under load (LUL), the LUL
>>> increases by ~2*fq_codel's target, so 10ms, while SQM on pppoe-ge00
>>> shows a LUL-increase (LULI) roughly twice as large or around 20ms.
>>>
>>> I have no idea why that is, if anybody has an idea please chime
>>> in.
>
> I saw the same, though with higher difference for egress rate. See first three files here:
>
> https://www.dropbox.com/sh/shwz0l7j4syp2ea/AAAxrhDkJ3TTy_Mq5KiFF3u2a?dl=0
>
> [netperf-wrapper noob puzzle: most of the ping lines vanish part-way through. Maybe I failed it somehow.]
This is not your fault: the UDP probe streams netperf-wrapper uses do not tolerate packet loss; once a packet is lost (I believe) the stream stops. This is not ideal, but it gives a good quick indicator of packet loss for sparse streams ;)
>
>> Once SQM on ge00 actually dives into the PPPoE packets and
>> applies/tests u32 filters the LUL increases to be almost identical to
>> pppoe-ge00’s if both ingress and egress classification are active and
>> do work. So it looks like the u32 filters I naively set up are quite
>> costly. Maybe there is a better way to set these up...
>
> Later you mentioned testing for coupling with egress rate. But you didn't test coupling with classification!
True, I was interested in getting the 3-tier shaper to behave sanely, so I did not look at the 1-tier simplest.qos.
>
> I switched from simple.qos to simplest.qos, and that achieved the lower latency on pppoe-wan. So I think your naive u32 filter setup wasn't the real problem.
Erm, but simplest.qos is not using the relevant tc filters, so these could still account for the issue; that, or some loss due to the 3 HTB shapers...
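For the record, the "dive into PPPoE" classification discussed in this thread can be sketched roughly as follows. The offsets assume plain PPPoE session encapsulation (6-byte PPPoE header plus 2-byte PPP protocol ID before the IP header); the device name, priorities, and class ids are illustrative, not sqm-scripts' actual filter set:

```shell
# Hedged sketch only. In PPPoE session frames (ethertype 0x8864,
# named "ppp_ses" in iproute2's ethertypes file) the PPP protocol ID
# sits at offset 6 (0x0021 = IPv4) and the IPv4 TOS byte at offset 9.
TC_FILTER="tc filter add dev ge00 parent 1:0 protocol ppp_ses prio 400 u32"
# EF (DSCP 46, TOS 0xb8) inside PPPoE-encapsulated IPv4 -> priority class
$TC_FILTER match u16 0x0021 0xffff at 6 match u8 0xb8 0xfc at 9 flowid 1:11
# CS1 (DSCP 8, TOS 0x20) -> background/bulk class
$TC_FILTER match u16 0x0021 0xffff at 6 match u8 0x20 0xfc at 9 flowid 1:13
```

IPv6 would need a parallel set of matches on PPP protocol 0x0057 with the traffic-class bits straddling a different offset, which is part of why these filters get costly.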
>
> I did think ECN wouldn't be applied on eth1, and that would be the cause of the latency. But disabling ECN didn't affect it. See files 3 to 6:
>
> https://www.dropbox.com/sh/shwz0l7j4syp2ea/AAAxrhDkJ3TTy_Mq5KiFF3u2a?dl=0
We typically only enable ECN on the downlink so far, under the assumption that marking is a faster congestion signal to the receiver than dropping the packet and waiting for the next one to create dupACKs; the router is typically close to the end hosts and the packets have already cleared the real bottleneck, so dropping them will not improve effective bandwidth use. On the uplink the reasoning reverses: dropping instead of marking saves bandwidth for other packets (and uplink bandwidth is often the more precious), and since the packets have basically just started their journey, the control loop can still take a long time to complete and other hops can drop them anyway. (I guess my current link is fast enough to activate ECN on the uplink as well to see how that behaves, so I will try that for a bit...)
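That asymmetric policy boils down to two fq_codel leaf settings; a minimal sketch, assuming the usual sqm layout of an HTB leaf 1:11 on the pppoe device for egress and an ifb device for ingress (device names, handles, and the layout itself are illustrative, not taken from a real sqm run):

```shell
# Uplink leaf: fq_codel with ECN disabled -- drop rather than mark,
# freeing scarce upload bandwidth for other packets.
tc qdisc replace dev pppoe-ge00 parent 1:11 fq_codel noecn
# Downlink leaf (shaped on an ifb): fq_codel with ECN enabled -- the
# packet has already crossed the real bottleneck, so mark instead of drop.
tc qdisc replace dev ifb4pppoe-ge00 parent 1:11 fq_codel ecn
```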
>
> I also admit surprise at fq_codel working within 20%/10ms on eth1. I thought it'd really hurt, by breaking the FQ part. Now I guess it doesn't. I still wonder about ECN marking, though I didn't check my endpoint is using ECN.
>
>>>
>>> 3) SQM on pppoe-ge00 has a rough 20% higher egress rate than SQM on
>>> ge00 (with ingress more or less identical between the two). Also 2)
>>> and 3) do not seem to be coupled, artificially reducing the egress
>>> rate on pppoe-ge00 to yield the same egress rate as seen on ge00
>>> does not reduce the LULI to the ge00 typical 10ms, but it stays at
>>> 20ms.
>>>
>>> For this I also have no good hypothesis, any ideas?
>>
>> With classification fixed the difference in egress rate shrinks to
>> ~10% instead of 20, so this partly seems related to the
>> classification issue as well.
>
> My tests look like simplest.qos gives a lower egress rate, but not as low as eth1. (Like 20% vs 40%). So that's also similar.
>
>>> So the current choice is either to accept a noticeable increase in
>>> LULI (but note some years ago even an average of 20ms most likely
>>> was rare in the real life) or a equally noticeable decrease in
>>> egress bandwidth…
>>
>> I guess it is back to the drawing board to figure out how to speed up
>> the classification… and then revisit the PPPoE question again…
>
> so maybe the question is actually classification v.s. not?
>
> + IMO slow asymmetric links don't want to lose more upload bandwidth than necessary. And I'm losing a *lot* in this test.
> + As you say, having only 20ms excess would still be a big improvement. We could ignore the bait of 10ms right now.
>
> vs
>
> - lowest latency I've seen testing my link. almost suspicious. looks close to 10ms average, when the dsl rate puts a lower bound of 7ms on the average.
Curious: what is your link speed?
> - fq_codel honestly works miracles already. classification is the knob people had to use previously, who had enough time to twiddle it.
> - on netperf-runner plots the "banding" doesn't look brilliant on slow links anyway
On slow links I always used to add “-s 0.8”, with higher numbers for slower links, to increase the temporal averaging window; this reduces the accuracy of the display for the downlink, but at least allows a better understanding of the uplink. I always wanted to see whether I could teach netperf-wrapper to allow larger averaging windows after measurement, just for display purposes, but I am a total beginner with python...
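A post-measurement averaging pass of the kind wished for here is simple in principle; a sketch in plain Python (the sample layout is hypothetical, not netperf-wrapper's actual data format):

```python
def smooth(samples, window_s, step_s):
    """Trailing moving average over a wider window, applied after the
    measurement, purely for display (similar in spirit to a larger -s)."""
    n = max(1, int(round(window_s / step_s)))
    out = []
    for i in range(len(samples)):
        lo = max(0, i - n + 1)  # the window never reaches before the run start
        window = samples[lo:i + 1]
        out.append(sum(window) / len(window))
    return out

# e.g. 0.2 s samples re-averaged over an 0.8 s display window
raw = [1.0, 3.0, 1.0, 3.0, 1.0, 3.0, 1.0, 3.0]
print(smooth(raw, 0.8, 0.2))
```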
>
>
>> Regards Sebastian
>>
>>>
>>> Best Regards Sebastian
>>>
>>> P.S.: It turns out, at least on my link, that for shaping on
>>> pppoe-ge00 the kernel does not account for any header
>>> automatically, so I need to specify a per-packet-overhead (PPOH) of
>>> 40 bytes (on an ADSL2+ link with ATM linklayer); when shaping on
>>> ge00 however (with the kernel still terminating the PPPoE link to
>>> my ISP) I only need to specify a PPOH of 26 as the kernel already
>>> adds the 14 bytes for the ethernet header…
Please disregard this part, I need to implement better tests for this instead of only relying on netperf-wrapper results ;)
* Re: [Cerowrt-devel] SQM and PPPoE, more questions than answers...
2015-03-19 2:43 ` David Lang
2015-03-19 3:11 ` Dave Taht
@ 2015-03-19 8:37 ` Sebastian Moeller
1 sibling, 0 replies; 16+ messages in thread
From: Sebastian Moeller @ 2015-03-19 8:37 UTC (permalink / raw)
To: David Lang; +Cc: Alan Jenkins, cerowrt-devel
Hi David,
On Mar 19, 2015, at 03:43 , David Lang <david@lang.hm> wrote:
> On Wed, 18 Mar 2015, Alan Jenkins wrote:
>
>>> Once SQM on ge00 actually dives into the PPPoE packets and
>>> applies/tests u32 filters the LUL increases to be almost identical to
>>> pppoe-ge00’s if both ingress and egress classification are active and
>>> do work. So it looks like the u32 filters I naively set up are quite
>>> costly. Maybe there is a better way to set these up...
>>
>> Later you mentioned testing for coupling with egress rate. But you didn't test coupling with classification!
>>
>> I switched from simple.qos to simplest.qos, and that achieved the lower latency on pppoe-wan. So I think your naive u32 filter setup wasn't the real problem.
>>
>> I did think ECN wouldn't be applied on eth1, and that would be the cause of the latency. But disabling ECN didn't affect it. See files 3 to 6:
>>
>> https://www.dropbox.com/sh/shwz0l7j4syp2ea/AAAxrhDkJ3TTy_Mq5KiFF3u2a?dl=0
>>
>> I also admit surprise at fq_codel working within 20%/10ms on eth1. I thought it'd really hurt, by breaking the FQ part. Now I guess it doesn't. I still wonder about ECN marking, though I didn't check my endpoint is using ECN.
>
> ECN should never increase latency, if it has any effect it should improve latency because you slow down sending packets when some hop along the path is overloaded rather than sending the packets anyway and having them sit in a buffer for a while. This doesn't decrease actual throughput either (although if you are doing a test that doesn't actually wait for all the packets to arrive at the far end, it will look like it decreases throughput)
>
>>>> 3) SQM on pppoe-ge00 has a rough 20% higher egress rate than SQM on
>>>> ge00 (with ingress more or less identical between the two). Also 2)
>>>> and 3) do not seem to be coupled, artificially reducing the egress
>>>> rate on pppoe-ge00 to yield the same egress rate as seen on ge00
>>>> does not reduce the LULI to the ge00 typical 10ms, but it stays at
>>>> 20ms.
>>>> For this I also have no good hypothesis, any ideas?
>>> With classification fixed the difference in egress rate shrinks to
>>> ~10% instead of 20, so this partly seems related to the
>>> classification issue as well.
>>
>> My tests look like simplest.qos gives a lower egress rate, but not as low as eth1. (Like 20% vs 40%). So that's also similar.
>>
>>>> So the current choice is either to accept a noticeable increase in
>>>> LULI (but note some years ago even an average of 20ms most likely
>>>> was rare in the real life) or a equally noticeable decrease in
>>>> egress bandwidth…
>>> I guess it is back to the drawing board to figure out how to speed up
>>> the classification… and then revisit the PPPoE question again…
>>
>> so maybe the question is actually classification v.s. not?
>>
>> + IMO slow asymmetric links don't want to lose more upload bandwidth than necessary. And I'm losing a *lot* in this test.
>> + As you say, having only 20ms excess would still be a big improvement. We could ignore the bait of 10ms right now.
>>
>> vs
>>
>> - lowest latency I've seen testing my link. almost suspicious. looks close to 10ms average, when the dsl rate puts a lower bound of 7ms on the average.
>> - fq_codel honestly works miracles already. classification is the knob people had to use previously, who had enough time to twiddle it.
>
> That's what most people find when they try it. Classification doesn't result in throughput vs latency tradeoffs as much as it gives absolute priority to some types of traffic. But unless you are really up against your bandwidth limit, this seldom matters in the real world. As long as latency is kept low, everything works so you don't need to give VoIP priority over other traffic or things like that.
But note, not all traffic is equal ;) Take the example from the mail Alan was quoting, shaping on an ethernet interface that carries pppoe traffic: the shaper sees all packets, including the packets PPP uses to establish and maintain the link. I would argue that these actually need guaranteed delivery, as dropping them can take out the ppp link and hence the internet connection. I admit it is rare for home users to actually encounter such drop-averse packets, but they at least justify the use of classification/priorities. Whether VoIP makes the cut really depends on its drop probability on each end link (I just want to note that commercial VoIP systems at least use precedence and EF markings on their packets, so classification of these is a) easy and b) actually performed by many ISPs' home router offerings for that ISP's brand of VoIP).
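Lifting those drop-averse PPP control frames into the highest-priority class could look roughly like this (a hypothetical sketch; the class id and priorities are illustrative):

```shell
# LCP rides inside PPPoE session frames (ethertype 0x8864, "ppp_ses")
# with PPP protocol 0xc021 at offset 6; PPPoE discovery (PADI/PADO/...)
# uses its own ethertype 0x8863 ("ppp_disc"). Class 1:10 stands in for
# whatever the highest-priority class of the shaper is.
tc filter add dev ge00 parent 1:0 protocol ppp_ses prio 100 u32 \
    match u16 0xc021 0xffff at 6 flowid 1:10
tc filter add dev ge00 parent 1:0 protocol ppp_disc prio 100 u32 \
    match u32 0 0 at 0 flowid 1:10
```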
Best Regards
Sebastian
>
> David Lang
* Re: [Cerowrt-devel] SQM and PPPoE, more questions than answers...
2015-03-19 8:29 ` Sebastian Moeller
@ 2015-03-19 9:42 ` Alan Jenkins
2015-03-19 9:58 ` Sebastian Moeller
2015-03-19 13:49 ` Alan Jenkins
1 sibling, 1 reply; 16+ messages in thread
From: Alan Jenkins @ 2015-03-19 9:42 UTC (permalink / raw)
To: Sebastian Moeller; +Cc: cerowrt-devel
On 19/03/15 08:29, Sebastian Moeller wrote:
> Hi Alan,
>
>
> On Mar 18, 2015, at 23:14 , Alan Jenkins <alan.christopher.jenkins@gmail.com> wrote:
>
>> Hi Seb
>>
>> I tested shaping on eth1 vs pppoe-wan, as it applies to ADSL. (On Barrier Breaker + sqm-scripts). Maybe this is going back a bit & no longer interesting to read. But it seemed suspicious & interesting enough that I wanted to test it.
>>
>> My conclusion was 1) I should stick with pppoe-wan,
> Not a bad decision, especially given the recent changes to SQM to make it survive transient pppoe-interface disappearances. Before those changes the beauty of shaping on the ethernet device was that pppoe could come and go, but SQM stayed active and working. But due to your help this problem seems fixed now.
I'd say your help and my selfish prodding :).
>> 2) the question really means do you want to disable classification
>> 3) I personally want to preserve the upload bandwidth and accept slightly higher latency.
> My question still is, is the bandwidth sacrifice really necessary or is this test just showing a corner case in simple.qos that can be fixed. I currently lack enough time to tackle this effectively.
Yep ok (no complaint).
>> [netperf-wrapper noob puzzle: most of the ping lines vanish part-way through. Maybe I failed it somehow.]
> This is not your fault, the UDP probes net-perf wrapper uses do not accept packet loss, once a packet (I believe) is lost the stream stops. This is not ideal, but it gives a good quick indicator of packet loss for sparse streams ;)
Heh, thanks.
>> My tests look like simplest.qos gives a lower egress rate, but not as low as eth1. (Like 20% vs 40%). So that's also similar.
>>
>>>> So the current choice is either to accept a noticeable increase in
>>>> LULI (but note some years ago even an average of 20ms most likely
>>>> was rare in the real life) or a equally noticeable decrease in
>>>> egress bandwidth…
>>> I guess it is back to the drawing board to figure out how to speed up
>>> the classification… and then revisit the PPPoE question again…
>> so maybe the question is actually classification v.s. not?
>>
>> + IMO slow asymmetric links don't want to lose more upload bandwidth than necessary. And I'm losing a *lot* in this test.
>> + As you say, having only 20ms excess would still be a big improvement. We could ignore the bait of 10ms right now.
>>
>> vs
>>
>> - lowest latency I've seen testing my link. almost suspicious. looks close to 10ms average, when the dsl rate puts a lower bound of 7ms on the average.
> Curious: what is your link speed?
dsl sync 912k up
shaped at 850
fq_codel auto target says => 14.5ms <=
MTU time is
(1500*8)b / 912kbps = 0.0132s
so if the link is filled with MTU packets, there's a hard 7ms lower
bound on the average icmp ping increase vs. an empty link,
and the same logic says on achieving that average, you have >= 7ms jitter
(or 6.6ms, but since my download rate is about 10x better, 6.6 + 0.66 ~= 7).
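The arithmetic above, spelled out (the MTU transmit time is packet bits divided by link rate, and a new arrival waits on average for half a packet in flight):

```python
up_bps = 912_000        # DSL sync rate, upstream
mtu_bits = 1500 * 8     # one full-MTU packet on the wire

# Serialization delay of one MTU packet, about 13.2 ms at 912 kbit/s
mtu_time_ms = 1000 * mtu_bits / up_bps
# Average head-of-line wait behind the packet currently being sent
avg_wait_ms = mtu_time_ms / 2

print(f"MTU time {mtu_time_ms:.1f} ms, average wait {avg_wait_ms:.1f} ms")
```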
>> - fq_codel honestly works miracles already. classification is the knob people had to use previously, who had enough time to twiddle it.
>> - on netperf-runner plots the "banding" doesn't look brilliant on slow links anyway
> On slow links I always used to add “-s 0.8” with higher numbers the slower the link to increase the temporal averaging window, this reduces accuracy of the display for the downlink, but at least allows better understanding of the uplink. I always wanted to see whether I could teach netperf-wrapper to allow larger averaging windows after measurements, just for display purposes, but I am a total beginner with python...
>
>>>> P.S.: It turns out, at least on my link, that for shaping on
>>>> pppoe-ge00 the kernel does not account for any header
>>>> automatically, so I need to specify a per-packet-overhead (PPOH) of
>>>> 40 bytes (on an ADSL2+ link with ATM linklayer); when shaping on
>>>> ge00 however (with the kernel still terminating the PPPoE link to
>>>> my ISP) I only need to specify a PPOH of 26 as the kernel already
>>>> adds the 14 bytes for the ethernet header…
> Please disregard this part, I need to implement better tests for this instead of only relying on netperf-wrapper results ;)
</troll-for-information>. Apart from kernel code, I did wonder how this
was tested :).
Thanks again
Alan
* Re: [Cerowrt-devel] SQM and PPPoE, more questions than answers...
2015-03-19 9:42 ` Alan Jenkins
@ 2015-03-19 9:58 ` Sebastian Moeller
0 siblings, 0 replies; 16+ messages in thread
From: Sebastian Moeller @ 2015-03-19 9:58 UTC (permalink / raw)
To: Alan Jenkins; +Cc: cerowrt-devel
HI Alan,
On Mar 19, 2015, at 10:42 , Alan Jenkins <alan.christopher.jenkins@gmail.com> wrote:
> On 19/03/15 08:29, Sebastian Moeller wrote:
>> Hi Alan,
>>
>>
>> On Mar 18, 2015, at 23:14 , Alan Jenkins <alan.christopher.jenkins@gmail.com> wrote:
>>
>>> Hi Seb
>>>
>>> I tested shaping on eth1 vs pppoe-wan, as it applies to ADSL. (On Barrier Breaker + sqm-scripts). Maybe this is going back a bit & no longer interesting to read. But it seemed suspicious & interesting enough that I wanted to test it.
>>>
>>> My conclusion was 1) I should stick with pppoe-wan,
>> Not a bad decision, especially given the recent changes to SQM to make it survive transient pppoe-interface disappearances. Before those changes the beauty of shaping on the ethernet device was that pppoe could come and go, but SQM stayed active and working. But due to your help this problem seems fixed now.
> I'd say your help and my selfish prodding :).
>
>>> 2) the question really means do you want to disable classification
>>> 3) I personally want to preserve the upload bandwidth and accept slightly higher latency.
>> My question still is, is the bandwidth sacrifice really necessary or is this test just showing a corner case in simple.qos that can be fixed. I currently lack enough time to tackle this effectively.
> Yep ok (no complaint).
>
>>> [netperf-wrapper noob puzzle: most of the ping lines vanish part-way through. Maybe I failed it somehow.]
>> This is not your fault, the UDP probes net-perf wrapper uses do not accept packet loss, once a packet (I believe) is lost the stream stops. This is not ideal, but it gives a good quick indicator of packet loss for sparse streams ;)
> Heh, thanks.
>
>>> My tests look like simplest.qos gives a lower egress rate, but not as low as eth1. (Like 20% vs 40%). So that's also similar.
>>>
>>>>> So the current choice is either to accept a noticeable increase in
>>>>> LULI (but note some years ago even an average of 20ms most likely
>>>>> was rare in the real life) or a equally noticeable decrease in
>>>>> egress bandwidth…
>>>> I guess it is back to the drawing board to figure out how to speed up
>>>> the classification… and then revisit the PPPoE question again…
>>> so maybe the question is actually classification v.s. not?
>>>
>>> + IMO slow asymmetric links don't want to lose more upload bandwidth than necessary. And I'm losing a *lot* in this test.
>>> + As you say, having only 20ms excess would still be a big improvement. We could ignore the bait of 10ms right now.
>>>
>>> vs
>>>
>>> - lowest latency I've seen testing my link. almost suspicious. looks close to 10ms average, when the dsl rate puts a lower bound of 7ms on the average.
>> Curious: what is your link speed?
>
> dsl sync 912k up
> shaped at 850
> fq_codel auto target says => 14.5ms <=
>
> MTU time is
> (1500*8)b / 912kbps = 0.0132s
> so if the link is filled with MTU packets, there's a hard 7ms lower bound, on average icmp ping increase v.s. an empty link
> and the same logic says on achieving that average, you have >= 7ms jitter
Ah I see, 50% chance of getting the link immediately versus having to wait for a full packet transmit time.
>
>
> (or 6.5ms, but since my download rate is about 10x better, 6.5 + 0.65 ~= 7).
>
>>> - fq_codel honestly works miracles already. classification is the knob people had to use previously, who had enough time to twiddle it.
>>> - on netperf-runner plots the "banding" doesn't look brilliant on slow links anyway
>> On slow links I always used to add “-s 0.8” with higher numbers the slower the link to increase the temporal averaging window, this reduces accuracy of the display for the downlink, but at least allows better understanding of the uplink. I always wanted to see whether I could teach netperf-wrapper to allow larger averaging windows after measurements, just for display purposes, but I am a total beginner with python...
>>
>>>>> P.S.: It turns out, at least on my link, that for shaping on
>>>>> pppoe-ge00 the kernel does not account for any header
>>>>> automatically, so I need to specify a per-packet-overhead (PPOH) of
>>>>> 40 bytes (on an ADSL2+ link with ATM linklayer); when shaping on
>>>>> ge00 however (with the kernel still terminating the PPPoE link to
>>>>> my ISP) I only need to specify a PPOH of 26 as the kernel already
>>>>> adds the 14 bytes for the ethernet header…
>> Please disregard this part, I need to implement better tests for this instead of only relying on netperf-wrapper results ;)
> </troll-for-information>. Apart from kernel code, I did wonder how this was tested :).
Oh, quite roughly… At that time I was only limited by my DSLAM (now I have a lower throttle in the BRAS that is somewhat hard to measure). I realized I could get decent RRUL results with egress shaping at 100% if the encapsulation and per-packet overhead were set correctly. Increasing the per-packet overhead above the theoretical value did not affect latency or bandwidth (it should have affected bandwidth, but the change was too small to measure); decreasing the per-packet overhead below the correct value noticeably increased the LULI during RRUL runs. The issue is I did not collect enough runs to be certain about the LULI I measured, even though my current hypothesis is that the kernel does not account for the ethernet header on a pppoe interface… Also, this can partly be tested on the router itself with a bit of tc magic that someone once used to show me that the kernel does account for the 14 bytes on ethernet interfaces; I just need to find my notes from that experiment again (I fear they were lost when my btrfs raid5 disintegrated… they call btrfs raid5 experimental for a reason ;) )
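A sketch of why the overhead setting matters so much on an ATM link: the true wire cost is the packet plus the per-packet overhead, padded up to whole 53-byte ATM cells (48 payload bytes each). The 40- and 26-byte figures below just mirror the numbers quoted earlier in the thread and should be treated as assumptions, not measured values:

```python
import math

def atm_wire_bytes(ip_len, overhead):
    """Bytes actually occupied on the ATM link: packet plus per-packet
    overhead, rounded up to whole 53-byte cells of 48 payload bytes."""
    cells = math.ceil((ip_len + overhead) / 48)
    return cells * 53

# If the shaper under-accounts the overhead, it believes packets are
# smaller than their true wire size, admits too many, and a queue
# builds in the modem -- which is where the extra LULI would come from.
print(atm_wire_bytes(1500, 40))  # assumed-correct PPOH
print(atm_wire_bytes(1500, 26))  # under-accounted PPOH
```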
Best Regards
Sebastian
>
> Thanks again
> Alan
* Re: [Cerowrt-devel] SQM and PPPoE, more questions than answers...
2015-03-19 8:29 ` Sebastian Moeller
2015-03-19 9:42 ` Alan Jenkins
@ 2015-03-19 13:49 ` Alan Jenkins
2015-03-19 13:59 ` Toke Høiland-Jørgensen
1 sibling, 1 reply; 16+ messages in thread
From: Alan Jenkins @ 2015-03-19 13:49 UTC (permalink / raw)
To: Sebastian Moeller; +Cc: cerowrt-devel
Hi Seb, I have one last suspicion on this topic
On 19/03/15 08:29, Sebastian Moeller wrote:
> My question still is, is the bandwidth sacrifice really necessary or is this test just showing a corner case in simple.qos that can be fixed. I currently lack enough time to tackle this effectively.
>>>> 2) SQM on ge00 shows better latency under load (LUL), the LUL
>>>> increases by ~2*fq_codel's target, so 10ms, while SQM on pppoe-ge00
>>>> shows a LUL-increase (LULI) roughly twice as large or around 20ms.
>>>>
>>>> I have no idea why that is, if anybody has an idea please chime
>>>> in.
>> I saw the same, though with higher difference for egress rate. See first three files here:
>>
>> https://www.dropbox.com/sh/shwz0l7j4syp2ea/AAAxrhDkJ3TTy_Mq5KiFF3u2a?dl=0
>>
>> [netperf-wrapper noob puzzle: most of the ping lines vanish part-way through. Maybe I failed it somehow.]
> This is not your fault, the UDP probes net-perf wrapper uses do not accept packet loss, once a packet (I believe) is lost the stream stops. This is not ideal, but it gives a good quick indicator of packet loss for sparse streams ;)
Thinking about this, I remembered the issue that sqm de-prioritises ICMP
ping. (Back when I used betterspeedtest and netperf-runner, I did assume
this would be an issue.)
I also notice that my test with eth1 (disabling classification) is the
only one where UDP ping (including UDP EF) is visible for any time at
all. (OK, pppoe-wan shows UDP BK, and it very clearly gets higher
latency, as I would expect.)
So I don't know if your results were clearer, but the results I showed so
far should be treated as a measurement problem.
>>> Once SQM on ge00 actually dives into the PPPoE packets and
>>> applies/tests u32 filters the LUL increases to be almost identical to
>>> pppoe-ge00’s if both ingress and egress classification are active and
>>> do work. So it looks like the u32 filters I naively set up are quite
>>> costly. Maybe there is a better way to set these up...
>> Later you mentioned testing for coupling with egress rate. But you didn't test coupling with classification!
> True, I was interested in getting the 3-tier shaper to behave sanely, so I did not look at the 1-tier simplest.qos.
>
>> I switched from simple.qos to simplest.qos, and that achieved the lower latency on pppoe-wan. So I think your naive u32 filter setup wasn't the real problem.
> Erm, but simplest.qos is not using the relevant tc filters, so these could still account for the issue; that, or some loss due to the 3 HTB shapers...
* Re: [Cerowrt-devel] SQM and PPPoE, more questions than answers...
2015-03-19 13:49 ` Alan Jenkins
@ 2015-03-19 13:59 ` Toke Høiland-Jørgensen
2015-03-19 14:01 ` Dave Taht
0 siblings, 1 reply; 16+ messages in thread
From: Toke Høiland-Jørgensen @ 2015-03-19 13:59 UTC (permalink / raw)
To: Alan Jenkins; +Cc: cerowrt-devel
Alan Jenkins <alan.christopher.jenkins@gmail.com> writes:
> I also notice that my test with eth1 (disabling classification) is the
> only one where UDP ping (including UDP EF) is visible for any time at
> all. (ok, pppoe-wan shows UDP BK, and it very clearly gets higher
> latency as I would expect).
FYI the svn version of netperf has a feature to restart the UDP
measurement flows after a timeout. If you build that (don't forget the
--enable-demo switch to ./configure) and stick it in your $PATH,
netperf-wrapper should pick it up automatically and use the option. This
might get you better results on the UDP flows...
-Toke
* Re: [Cerowrt-devel] SQM and PPPoE, more questions than answers...
2015-03-19 13:59 ` Toke Høiland-Jørgensen
@ 2015-03-19 14:01 ` Dave Taht
0 siblings, 0 replies; 16+ messages in thread
From: Dave Taht @ 2015-03-19 14:01 UTC (permalink / raw)
To: Toke Høiland-Jørgensen; +Cc: Alan Jenkins, cerowrt-devel
On Thu, Mar 19, 2015 at 6:59 AM, Toke Høiland-Jørgensen <toke@toke.dk> wrote:
> Alan Jenkins <alan.christopher.jenkins@gmail.com> writes:
>
>> I also notice that my test with eth1 (disabling classification) is the
>> only one where UDP ping (including UDP EF) is visible for any time at
>> all. (ok, pppoe-wan shows UDP BK, and it very clearly gets higher
>> latency as I would expect).
>
> FYI the svn version of netperf has a feature to restart the UDP
> measurement flows after a timeout. If you build that (don't forget the
> --enable-demo switch to ./configure) and stick it in your $PATH,
> netperf-wrapper should pick it up automatically and use the option. This
> might get you better results on the UDP flows...
I note that time of first loss is a valuable statistic in itself, and I
would like to see it called out more on the measurement flows on the graph.
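Extracting such a statistic from a ping series is straightforward; a sketch, using a hypothetical (timestamp, rtt-or-None) sample layout rather than netperf-wrapper's real data format:

```python
def time_of_first_loss(samples):
    """Return the timestamp of the first lost probe (rtt is None),
    or None if no probe was lost during the run."""
    for t, rtt in samples:
        if rtt is None:
            return t
    return None

# illustrative series: the probe at t=0.4 s was lost
series = [(0.0, 12.1), (0.2, 13.0), (0.4, None), (0.6, 14.2)]
print(time_of_first_loss(series))
```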
> -Toke
--
Dave Täht
Let's make wifi fast, less jittery and reliable again!
https://plus.google.com/u/0/107942175615993706558/posts/TVX3o84jjmb
end of thread, other threads:[~2015-03-19 14:01 UTC | newest]
Thread overview: 16+ messages
2014-10-11 23:12 [Cerowrt-devel] SQM and PPPoE, more questions than answers Sebastian Moeller
2014-10-15 0:03 ` Sebastian Moeller
2014-10-15 12:02 ` Török Edwin
2014-10-15 13:39 ` Sebastian Moeller
2014-10-15 17:28 ` Dave Taht
2014-10-15 19:55 ` Sebastian Moeller
2015-03-18 22:14 ` Alan Jenkins
2015-03-19 2:43 ` David Lang
2015-03-19 3:11 ` Dave Taht
2015-03-19 8:37 ` Sebastian Moeller
2015-03-19 8:29 ` Sebastian Moeller
2015-03-19 9:42 ` Alan Jenkins
2015-03-19 9:58 ` Sebastian Moeller
2015-03-19 13:49 ` Alan Jenkins
2015-03-19 13:59 ` Toke Høiland-Jørgensen
2015-03-19 14:01 ` Dave Taht