From: Pete Heist
To: Jonathan Morton
Cc: Dave Taht, cake@lists.bufferbloat.net
Date: Sun, 12 Feb 2017 13:43:21 +0100
Subject: Re: [Cake] Cake latency update

As an update on this: I now suspect a problem with either the Ethernet hardware or (more likely) the sky2 driver on ‘mbp’, my 2007 MBP that acts as the Flent server and where I’m often running a qdisc. I should have looked at dmesg earlier, as there are log entries like this:

-----
[  221.478753] eth0: hw csum failure
[  221.478756] CPU: 1 PID: 1890 Comm: netserver Tainted: G        W       4.8.0-37-generic #39-Ubuntu
[  221.478757] Hardware name: Apple Inc. MacBookPro4,1/Mac-F42C89C8, BIOS    MBP41.88Z.00C1.B03.0802271651 02/27/08
[  221.478762]  0000000000000286 000000003844a735 ffff9c293fd03ba8 ffffffffb5c30e12
[  221.478765]  ffff9c293a505000 ffffffffb66fb5c0 ffff9c293fd03bc0 ffffffffb5f7c028
[  221.478769]  ffff9c29399ea800 ffff9c293fd03be0 ffffffffb5f71f26 af75267500000000
[  221.478770] Call Trace:
[  221.478775]  <IRQ>  [<ffffffffb5c30e12>] dump_stack+0x63/0x81
[  221.478778]  [<ffffffffb5f7c028>] netdev_rx_csum_fault+0x38/0x40
[  221.478781]  [<ffffffffb5f71f26>] __skb_checksum_complete+0xb6/0xc0
…
[  226.478373] net_ratelimit: 386 callbacks suppressed
[  226.478378] eth0: hw csum failure
[  226.479523] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G        W       4.8.0-37-generic #39-Ubuntu
[  226.479527] Hardware name: Apple Inc. MacBookPro4,1/Mac-F42C89C8, BIOS    MBP41.88Z.00C1.B03.0802271651 02/27/08
[  226.479533]  0000000000000286 f78e43dca42a09d0 ffff9c293fd03b88 ffffffffb5c30e12
[  226.479542]  ffff9c293a505000 ffffffffb66fb5c0 ffff9c293fd03ba0 ffffffffb5f7c028
[  226.479549]  ffff9c2932093b00 ffff9c293fd03bc0 ffffffffb5f71f26 46898f6100000000
[  226.479557] Call Trace:
[  226.479560]  <IRQ>  [<ffffffffb5c30e12>] dump_stack+0x63/0x81
[  226.479581]  [<ffffffffb5f7c028>] netdev_rx_csum_fault+0x38/0x40
-----

What’s interesting is that they occur only during testing, and only when QoS with rate limiting is applied (Cake, and also HTB+X). It’s also interesting that they fall on exact 5-second boundaries: not every 5 seconds, but sometimes 10 or 15 seconds apart, always a multiple of 5 seconds. Going back over my results, I realized that a very large number of the latency and throughput shifts I saw are also quantized to 5-second intervals. I don’t think that’s a coincidence.

I saw that Dave posted about a similar 'hw csum failure' on a Raspberry Pi earlier in 2016:

https://github.com/raspberrypi/linux/issues/1371

and I’ve also seen other reports of this over the years, with no clear solution.
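A minimal sketch for checking the 5-second pattern mechanically against a saved dmesg log, assuming Python 3 and the log format shown above. The helper names, the 5-second period and the 50 ms tolerance are illustrative choices, not part of the original test setup. It extracts the timestamps of `hw csum failure` lines for one device, computes the gaps between them, and tests whether each gap is a whole multiple of the period. One caveat when interpreting the result: in this kernel, netdev_rx_csum_fault() only prints when net_ratelimit() allows it, and the networking rate limiter's defaults (the net.core.message_cost and net.core.message_burst sysctls) limit such warnings to roughly one every five seconds, as the "net_ratelimit: 386 callbacks suppressed" line hints. The spacing of the logged messages may therefore partly reflect the rate limiter rather than the failures themselves, which is worth keeping in mind before lining it up against latency shifts.

```python
import re
from typing import List

# "[  221.478753] eth0: hw csum failure" -> timestamp 221.478753, device "eth0"
FAILURE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(\S+): hw csum failure")
# "[  226.478373] net_ratelimit: 386 callbacks suppressed" -> 386
SUPPRESSED_RE = re.compile(r"net_ratelimit: (\d+) callbacks suppressed")


def failure_times(log: str, device: str = "eth0") -> List[float]:
    """Kernel timestamps (seconds since boot) of 'hw csum failure' lines for device."""
    times = []
    for line in log.splitlines():
        m = FAILURE_RE.match(line.strip())
        if m and m.group(2) == device:
            times.append(float(m.group(1)))
    return times


def gaps(times: List[float]) -> List[float]:
    """Seconds between consecutive failures, rounded to the millisecond."""
    return [round(b - a, 3) for a, b in zip(times, times[1:])]


def on_period_grid(intervals: List[float], period: float = 5.0, tol: float = 0.05) -> bool:
    """True if every interval is within tol of a nonzero whole multiple of period."""
    for gap in intervals:
        multiple = round(gap / period)
        if multiple < 1 or abs(gap - multiple * period) > tol:
            return False
    return True


def suppressed_total(log: str) -> int:
    """Total number of messages that net_ratelimit() reports as dropped."""
    return sum(int(n) for n in SUPPRESSED_RE.findall(log))


# Excerpt of the log quoted above (stack dumps omitted).
sample = """\
[  221.478753] eth0: hw csum failure
[  221.478756] CPU: 1 PID: 1890 Comm: netserver Tainted: G        W       4.8.0-37-generic #39-Ubuntu
[  226.478373] net_ratelimit: 386 callbacks suppressed
[  226.478378] eth0: hw csum failure
"""

if __name__ == "__main__":
    times = failure_times(sample)
    print("failures at:", times)
    print("gaps:", gaps(times))
    print("on 5 s grid:", on_period_grid(gaps(times)))
    print("suppressed:", suppressed_total(sample))
```

Fed a full capture instead of the excerpt, the same functions list every gap, which makes it easier to compare them with the timestamps of the throughput and latency shifts in the Flent results.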
Why I saw it more with Cake than with other qdiscs, I don’t know, but I think it’s safe to say there’s no point in you trying to reproduce this until I can get past it with my hardware. I’m also likely going to have to re-run all of my tests once this is sorted out.

Pete

> On Feb 10, 2017, at 1:21 PM, Pete Heist wrote:
>
>> On Feb 10, 2017, at 12:35 PM, Sebastian Moeller wrote:
>>
>> Hi Pete,
>>
>>> On Feb 10, 2017, at 12:08, Pete Heist wrote:
>>>
>>> Not a problem. I’ll run a spread of Cake and fq_codel over Ethernet at various bandwidths. It will be through their Apple USB Ethernet adapters (used now for management), which are also connected through a switch, but I think that setup should be fine for this purpose. Should be done in an hour or so and we’ll see…
>>
>> I believe the Apple USB dongles are Fast Ethernet only, at least the USB2 types I have available here. That would work for your tested bandwidth, but it will not let you test at what shaper rate things go pear-shaped… Also, since Wi-Fi creates a bit more CPU load than wired Ethernet, it _might_ make sense to concurrently exercise the Wi-Fi cards just to re-create the SIRQ load (but probably not as the first experiment ;) ).
>>
>> Best Regards
>> Sebastian
>
> Hi Sebastian, yes, they’re only 100 Mbit, but that’s enough to cover the rates where I was seeing the problem with Wi-Fi.
> Also in my test setup there are four nodes connected as described under Configuration #1:
>
> http://www.drhleny.cz/bufferbloat/wifi_bufferbloat.html
>
> I’m running Cake on ‘mini’ and ‘mbp’, and the Wi-Fi radios are only on ‘om1’ and ‘om2’, so the CPU load shouldn’t be different for mini and mbp when connected directly via Ethernet, instead of via Ethernet and a Wi-Fi link, I suppose.
>
> I think we just wanted to see if the throughput shifting would reproduce over Ethernet at the same rates, but so far it didn’t for me, although there are other anomalies that don’t look like the throughput shifts I sent before (there’s a throughput anomaly for Cake 20Mbit and latency anomalies for fq_codel 60Mbit and 90Mbit):
>
> http://www.drhleny.cz/bufferbloat/cake_hd-eth_10mbit/index.html
> http://www.drhleny.cz/bufferbloat/cake_hd-eth_20mbit/index.html
> http://www.drhleny.cz/bufferbloat/cake_hd-eth_30mbit/index.html
> http://www.drhleny.cz/bufferbloat/cake_hd-eth_40mbit/index.html
> http://www.drhleny.cz/bufferbloat/cake_hd-eth_50mbit/index.html
> http://www.drhleny.cz/bufferbloat/cake_hd-eth_60mbit/index.html
> http://www.drhleny.cz/bufferbloat/cake_hd-eth_70mbit/index.html
> http://www.drhleny.cz/bufferbloat/cake_hd-eth_75mbit/index.html
> http://www.drhleny.cz/bufferbloat/cake_hd-eth_80mbit/index.html
> http://www.drhleny.cz/bufferbloat/cake_hd-eth_85mbit/index.html
> http://www.drhleny.cz/bufferbloat/cake_hd-eth_90mbit/index.html
> http://www.drhleny.cz/bufferbloat/cake_hd-eth_100mbit/index.html
>
> fq_codel:
>
> http://www.drhleny.cz/bufferbloat/fq_codel_hd-eth_10mbit/index.html
> http://www.drhleny.cz/bufferbloat/fq_codel_hd-eth_20mbit/index.html
> http://www.drhleny.cz/bufferbloat/fq_codel_hd-eth_30mbit/index.html
> http://www.drhleny.cz/bufferbloat/fq_codel_hd-eth_40mbit/index.html
> http://www.drhleny.cz/bufferbloat/fq_codel_hd-eth_50mbit/index.html
> http://www.drhleny.cz/bufferbloat/fq_codel_hd-eth_60mbit/index.html
> http://www.drhleny.cz/bufferbloat/fq_codel_hd-eth_70mbit/index.html
> http://www.drhleny.cz/bufferbloat/fq_codel_hd-eth_75mbit/index.html
> http://www.drhleny.cz/bufferbloat/fq_codel_hd-eth_80mbit/index.html
> http://www.drhleny.cz/bufferbloat/fq_codel_hd-eth_85mbit/index.html
> http://www.drhleny.cz/bufferbloat/fq_codel_hd-eth_90mbit/index.html
> http://www.drhleny.cz/bufferbloat/fq_codel_hd-eth_100mbit/index.html
>
> So that suggests that the throughput shifting problem may also be somehow related to Wi-Fi. I’m still going to be testing Chaos Calmer, as well as two Ubiquiti NanoStation M5s, though this will take some more time. We might learn more from this, or if you can reproduce it with ath9k hardware, that would be good too...
>
> Thanks,
> Pete
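Sebastian’s point about the Fast Ethernet dongles can be put in rough numbers. The sketch below assumes standard Ethernet framing (8 bytes of preamble and start-of-frame delimiter, a 14-byte header, a 4-byte FCS and a 12-byte interframe gap), full-sized 1500-byte IPv4 packets with TCP timestamps, and a shaper that charges each packet for its IP bytes plus the 14-byte Ethernet header but not for the remaining framing overhead. None of these assumptions come from the thread, and which bytes a shaper actually charges depends on its configuration (Cake, for example, has overhead-compensation options), so treat the thresholds as approximate. It computes the TCP goodput ceiling of a 100 Mbit/s link and checks, for each shaper rate in the Ethernet result pages, whether the shaper or the link itself would be the bottleneck.

```python
# Back-of-the-envelope check of shaper rates against a 100 Mbit/s Fast
# Ethernet link. Assumptions (not from the thread): full-sized IPv4 packets,
# TCP timestamps enabled, and a shaper that charges the 14-byte Ethernet
# header but not the preamble, FCS or interframe gap.

MTU = 1500                      # IP packet size, bytes
ETH_HEADER = 14                 # destination MAC + source MAC + EtherType
FCS = 4                         # frame check sequence
PREAMBLE_SFD = 8                # preamble + start-of-frame delimiter
INTERFRAME_GAP = 12             # minimum idle time between frames, in byte times
IP_TCP_HEADERS = 20 + 20 + 12   # IPv4 + TCP + timestamp option

WIRE_BYTES = MTU + ETH_HEADER + FCS + PREAMBLE_SFD + INTERFRAME_GAP  # 1538
COUNTED_BYTES = MTU + ETH_HEADER  # bytes the shaper is assumed to charge: 1514


def tcp_goodput_ceiling(line_rate_mbit: float) -> float:
    """Maximum TCP payload rate the link can carry with full-sized frames."""
    return line_rate_mbit * (MTU - IP_TCP_HEADERS) / WIRE_BYTES


def wire_load(shaper_mbit: float) -> float:
    """Rate actually occupied on the wire when the shaper runs at shaper_mbit."""
    return shaper_mbit * WIRE_BYTES / COUNTED_BYTES


def shaper_is_bottleneck(shaper_mbit: float, line_rate_mbit: float = 100.0) -> bool:
    """True if the shaper, rather than the link itself, limits throughput."""
    return wire_load(shaper_mbit) < line_rate_mbit


# Shaper rates taken from the result page names in the thread.
tested_rates = [10, 20, 30, 40, 50, 60, 70, 75, 80, 85, 90, 100]

if __name__ == "__main__":
    print(f"TCP goodput ceiling: {tcp_goodput_ceiling(100):.1f} Mbit/s")
    for rate in tested_rates:
        status = "shaper-limited" if shaper_is_bottleneck(rate) else "link-limited"
        print(f"{rate:>3} Mbit/s shaper -> {wire_load(rate):6.2f} Mbit/s on the wire ({status})")
```

Under these assumptions the link tops out at about 94 Mbit/s of TCP payload, and only the 100 Mbit setting would push the wire past its line rate, so in that run the adapter, rather than the qdisc, would likely be the bottleneck.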