Date: Tue, 8 Apr 2014 12:50:23 -0700
From: Dave Taht
To: Neil Shepperd
Cc: cerowrt@lists.bufferbloat.net, cerowrt-devel@lists.bufferbloat.net
Subject: Re: [Cerowrt-devel] [Bug #442] smoking gun found for wifi hang

Finally found the smoke, from a gun still offstage. The background wifi queue (class 1:4, qdisc 40:) gets wedged.

This explains why this only seemed to happen on Comcast (which re-marks a LOT of traffic as background that it shouldn't; and yes, we should start offering an option in sqm to mangle packets back to "be"), and why local traffic seemed to mostly work when stuff coming back from the internet didn't.

As to *why* it happens, I don't know. I'm sitting in the #bufferbloat channel scratching my head over how to explore the problem without unwedging the interface.

It seems plausible that we can now reproduce this MUCH more easily by flooding the background queues with traffic (netperf can do this). It's not clear, however, whether you can trigger it with just TCP, or whether multiple hops are required, etc.

root@cerowrt:/mnt/disk1# tc -s qdisc show dev sw00
qdisc mq 1: root
 Sent 3926131082 bytes 2998293 pkt (dropped 91657, overlimits 0 requeues 70095)
 backlog 77608b 1000p requeues 70095
qdisc fq_codel 10: parent 1:1 limit 800p flows 1024 quantum 500 target 10.0ms interval 100.0ms
 Sent 110555 bytes 771 pkt (dropped 0, overlimits 0 requeues 5)
 backlog 0b 0p requeues 5
  maxpacket 256 drop_overlimit 0 new_flow_count 2 ecn_mark 0
  new_flows_len 0 old_flows_len 0
qdisc fq_codel 20: parent 1:2 limit 800p flows 1024 quantum 300 target 5.0ms interval 100.0ms ecn
 Sent 2526448 bytes 17982 pkt (dropped 1, overlimits 0 requeues 31)
 backlog 0b 0p requeues 31
  maxpacket 929 drop_overlimit 0 new_flow_count 71 ecn_mark 0
  new_flows_len 0 old_flows_len 0
qdisc fq_codel 30: parent 1:3 limit 1000p flows 1024 quantum 300 target 5.0ms interval 100.0ms ecn
 Sent 15145657 bytes 106290 pkt (dropped 0, overlimits 0 requeues 179)
 backlog 0b 0p requeues 179
  maxpacket 256 drop_overlimit 0 new_flow_count 0 ecn_mark 0
  new_flows_len 0 old_flows_len 0
qdisc fq_codel 40: parent 1:4 limit 1000p flows 1024 quantum 300 target 5.0ms interval 100.0ms
 Sent 3908348422 bytes 2873250 pkt (dropped 91656, overlimits 0 requeues 69880)
 backlog 77608b 1000p requeues 69880
                ^^^^^!!!!!
  maxpacket 1514 drop_overlimit 72128 new_flow_count 85727 ecn_mark 0
  new_flows_len 238 old_flows_len 1

I got the "wedged" interface to work again by re-marking all TCP traffic as best effort:

iptables -A FORWARD -o sw00 -t mangle -p tcp -m tcp -j DSCP --set-dscp-class be

thus moving traffic into 1:3 above.

(This iptables rule can probably be improved, but it's just a workaround, and for all I know we can also trigger this on the be queue.)

ICMP replies, however, seem to always want to go into the background queue for some reason. (?)

We did have this happen earlier on this run:

[31325.589843] ath: phy0: Failed to stop TX DMA, queues=0x008!
[32380.960937] ath: phy0: Failed to stop TX DMA, queues=0x008!
[32381.035156] ath: phy0: Failed to stop TX DMA, queues=0x008!
[32381.140625] ath: phy0: Failed to stop TX DMA, queues=0x008!
[32381.242187] ath: phy0: Failed to stop TX DMA, queues=0x008!
[32381.343750] ath: phy0: Failed to stop TX DMA, queues=0x008!
[32418.824218] ath: phy0: Failed to stop TX DMA, queues=0x008!
[32445.863281] ath: phy0: Failed to stop TX DMA, queues=0x108!
[32445.960937] ath: phy0: Failed to stop TX DMA, queues=0x008!
[32446.062500] ath: phy0: Failed to stop TX DMA, queues=0x008!
[32446.164062] ath: phy0: Failed to stop TX DMA, queues=0x008!
[32446.265625] ath: phy0: Failed to stop TX DMA, queues=0x008!
[32446.367187] ath: phy0: Failed to stop TX DMA, queues=0x008!
[32446.472656] ath: phy0: Failed to stop TX DMA, queues=0x008!
[32446.574218] ath: phy0: Failed to stop TX DMA, queues=0x008!
[32446.683593] ath: phy0: Failed to stop TX DMA, queues=0x00c!
[32446.777343] ath: phy0: Failed to stop TX DMA, queues=0x008!
[32446.886718] ath: phy0: Failed to stop TX DMA, queues=0x009!
[34701.062500] ath: phy0: Failed to stop TX DMA, queues=0x008!
[34701.140625] ath: phy0: Failed to stop TX DMA, queues=0x008!
[34701.242187] ath: phy0: Failed to stop TX DMA, queues=0x008!
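
The queues= values in the ath messages above are easier to compare once split into individual bits. Below is a rough Python sketch that does only that. It assumes the value is a bitmask with one bit per hardware TX queue; that is my reading of the message, not something confirmed in this thread. It also makes no claim about which bit corresponds to which wifi access class. The names stuck_queues and tally are made up for illustration.

```python
import re

# Matches log lines such as:
#   [32445.863281] ath: phy0: Failed to stop TX DMA, queues=0x108!
LINE_RE = re.compile(r"Failed to stop TX DMA, queues=0x([0-9a-fA-F]+)!")


def stuck_queues(line):
    """Return the bit positions set in the queues= mask of one log line.

    Assumption: each set bit stands for one hardware TX queue.
    Returns an empty list when the line is not a matching ath message.
    """
    m = LINE_RE.search(line)
    if m is None:
        return []
    mask = int(m.group(1), 16)
    return [bit for bit in range(mask.bit_length()) if mask & (1 << bit)]


def tally(lines):
    """Count how often each bit position appears across many log lines."""
    counts = {}
    for line in lines:
        for bit in stuck_queues(line):
            counts[bit] = counts.get(bit, 0) + 1
    return counts


if __name__ == "__main__":
    sample = "[32445.863281] ath: phy0: Failed to stop TX DMA, queues=0x108!"
    print(stuck_queues(sample))  # [3, 8]
```

Decoded this way, the four distinct masks in the log (0x008, 0x108, 0x00c, 0x009) all have bit 3 set, with bits 0, 2 and 8 each appearing once alongside it.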
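
To spot the condition in the tc statistics earlier in this message without reading them by eye, something like the following Python sketch could scan saved `tc -s qdisc show` output and list every qdisc whose packet backlog has reached its configured limit. The function name full_queues is hypothetical, and the parsing assumes the text layout shown in this message (a "limit Np" on the qdisc line and a "backlog ... Np" line after it). A full backlog is only a hint: a queue that is briefly saturated under load looks the same, so comparing two samples to see whether the Sent counter still advances would be a stronger check.

```python
import re

# "qdisc fq_codel 40: parent 1:4 limit 1000p flows ..." -> kind, handle, limit
QDISC_RE = re.compile(r"^qdisc (\S+) (\S+) .*?\blimit (\d+)p")
# "backlog 77608b 1000p requeues 69880" -> packet count
BACKLOG_RE = re.compile(r"\bbacklog \S+ (\d+)p")


def full_queues(tc_output):
    """Return [(handle, backlog_pkts, limit_pkts)] for qdiscs at their limit.

    Qdiscs that print no packet limit (such as the mq root) are skipped.
    """
    full = []
    handle = limit = None
    for raw in tc_output.splitlines():
        line = raw.strip()
        if line.startswith("qdisc "):
            # A new qdisc section starts; remember its handle and limit, if any.
            m = QDISC_RE.match(line)
            handle, limit = (m.group(2), int(m.group(3))) if m else (None, None)
            continue
        b = BACKLOG_RE.search(line)
        if b and handle is not None and int(b.group(1)) >= limit:
            full.append((handle, int(b.group(1)), limit))
    return full
```

Fed the statistics quoted in this message, this should report only qdisc 40: (backlog 1000p against limit 1000p); the mq root shows the same backlog but is skipped because it has no limit of its own.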