some notes re wireless classification and queue length

David Täht dave.taht at gmail.com
Wed Nov 2 07:04:42 EDT 2011


This message starts off talking about queueing, then diverges into
classification.

CeroWrt currently defaults to a txqueuelen of 40. With this driver set 
(ar71xx and ath9k) that appears to be close to optimal (anything less 
interferes with packet aggregation significantly), although I may 
increase the DMA TX queue somewhat on the ethernet side (from 4 to at 
least 8). I'm also running various 802.11n-capable stations with the 
same settings.

With the pfifo_fast queue discipline, a txqueuelen of 40 appears close 
to optimal over a 150Mbit wireless link, bounding ping RTT between 7 
and 15ms during a TCP elephant flow in a sta->AP test under good 
conditions....

(please note ALL the qualifications above. As Jim stresses, and I feel 
I must too, there IS NO RIGHT ANSWER FOR BUFFERING. Over poorer links 
we'll get bloated again, but you know, you have to take factor-of-10-to-20 
improvements in latency and in the user-friendliness of multiple TCP 
streams one step at a time...)

However, the work so far strongly suggests that there is a useful upper 
bound for wireless txqueue buffering that is far, far below the current 
Linux default, one that can reasonably be derived from the core 
technology of the connection (802.11b, g, n, n + HT40, and the number 
of MIMO streams) and then managed by some other queue management 
technology where available.

(have I added enough qualifications to what I'm saying yet?)

40 is probably about right for an 802.11n station; 120 is probably too 
big for a 150Mbit AP (again, using pfifo_fast, which I'm trying to get 
rid of). AQM, or fair queuing along the lines of QFQ + RED, would 
probably do better than tail drop at 120; byte limits make more sense 
on ethernet, and adjusting txqueuelen on a per-connection basis would 
help -

and what I think would work 'right' is a per-destination (post-routing) 
queue living closer to the mac80211 layer, one that would FQ and 
aggregate based on its perceived completion rate (keeping a backlog of, 
say, 3 'bundles'), plus time-based queueing, especially for VO and VI 
traffic...
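
None of that exists today, but purely as a thought experiment, the 
shape of such a queue might be something like this (hypothetical names, 
nothing from mac80211):

    /* Purely hypothetical sketch of the per-destination queue idea
     * above; none of these names exist in mac80211. */
    #include <linux/skbuff.h>   /* struct sk_buff_head */
    #include <linux/types.h>    /* u64 */

    struct per_sta_txq {
        struct sk_buff_head flows[4];  /* FQ buckets for this destination */
        unsigned int bundles;          /* keep a backlog of ~3 'bundles' */
        u64 est_completion_ns;         /* perceived aggregate service rate */
        u64 vo_vi_deadline_ns;         /* time bound for VO/VI traffic */
    };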

(Seriously, have I added enough qualifications to what I'm saying yet?)

But like I said, I'll take 10-20x improvements in latency at a small 
sacrifice in bandwidth at this point and move on. Do try reducing 
txqueuelen. NetworkManager on a Linux desktop would have enough 
information to decide on an outer bound for txqueuelen, and people 
would be happier....
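
For those who want to try it: "ifconfig wlan0 txqueuelen 40" does it 
from a shell, and programmatically it's one ioctl (needs root). A 
minimal sketch, with "wlan0" as a placeholder interface:

    /* Minimal sketch: set txqueuelen via the SIOCSIFTXQLEN ioctl. */
    #include <linux/sockios.h>   /* SIOCSIFTXQLEN */
    #include <net/if.h>          /* struct ifreq, ifr_qlen */
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        struct ifreq ifr;
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        if (fd < 0) { perror("socket"); return 1; }

        memset(&ifr, 0, sizeof(ifr));
        strncpy(ifr.ifr_name, "wlan0", IFNAMSIZ - 1);
        ifr.ifr_qlen = 40;               /* the bound discussed above */

        if (ioctl(fd, SIOCSIFTXQLEN, &ifr) < 0)
            perror("SIOCSIFTXQLEN");
        close(fd);
        return 0;
    }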

To me, the next step is to add FQ into the mix, which is turning out to 
be hard. I'm burning brain cells on it now... more brain cells welcomed.

One thing that bugs me is that all the packet scheduling algorithms 
(none of which quite do what I want) trumpet their CPU efficiency, as 
they are targeted (I suppose) at core routers. Home routers and 
stations *have CPU to burn*. And at least the WNDR3700 and 3800 have 
memory to burn... so a CPU- and memory-inefficient but mo' betta 
algorithm for queue management is, like, totally fine. The QFQ paper, 
for example, shows flat scaling up to 32k flows, where I have a hard 
time envisioning a typical home network getting much past 2k flows, 
and that's only with BitTorrent usage.

As for how to do nice things with QFQ, my brain crashed while 
repeatedly crashing the router last night.

So, I digress into the glorious, simple benefits of classification. I 
tossed a not-quite-complete implementation of diffserv classification 
into CeroWrt's mac80211 queue classifier, as it seems inevitable that, 
to get the best latencies from wireless (regardless of what other 
queuing mechanisms are needed), we need to use the native 802.11e 
classification schemes intelligently - and I also fixed IPv6 support 
for that in mac80211.

Mostly implemented is the following: 
https://github.com/dtaht/Diffserv/issues/5#issuecomment-2133541
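
The gist of the mapping, as a rough sketch - this is not the code at 
that link, and mac80211's real classifier differs in its details:

    /* Rough sketch of the DSCP -> 802.11e/WMM access-category mapping
     * described in this post; illustrative only. */
    enum wmm_ac { AC_BK, AC_BE, AC_VI, AC_VO };

    static enum wmm_ac classify_dscp(unsigned char dscp)
    {
        switch (dscp >> 3) {       /* top 3 bits: the old IP precedence */
        case 1:          return AC_BK;  /* CS1: bulk */
        case 2:          return AC_VI;  /* CS2 'immediate': interactive */
        case 4: case 5:  return AC_VI;  /* video-ish */
        case 6:          return AC_VI;  /* CS6: babel/AHCP, per below */
        case 7:          return AC_VO;  /* network control */
        default:         return AC_BE;  /* everything else: best effort */
        }
    }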

and I'm pleased with the results thus far. One totally unanticipated 
side effect: now that I'm tossing stuff with the immediate precedence 
set into the wireless VI queue, OH BOY, does that help interactive ssh 
sessions when the wireless network is under load. Latency and jitter 
drop enormously....

The openssh and dropbear head source trees now do mostly the right 
thing for interactive ssh.

Babel and AHCP now use CS6 by default. I toss these into the VI queue.

Of course, adding DSCP marking - which is as simple as two to four 
basic setsockopt calls (SO_PRIORITY helps too) - to the other processes 
that emit 'ANTs' turned out to be mildly more difficult than I wanted.
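
For reference, the calls in question look something like this; the 
helper name is mine, and 6 == TC_PRIO_INTERACTIVE on Linux:

    /* Minimal sketch of marking a socket's traffic with a DSCP plus
     * a Linux-internal skb priority. Not code from any of the daemons
     * discussed here. */
    #include <netinet/in.h>
    #include <sys/socket.h>

    static int mark_socket(int fd, int family, int dscp)
    {
        int tos  = dscp << 2;   /* DSCP sits in the top 6 bits of TOS */
        int prio = 6;           /* TC_PRIO_INTERACTIVE */

        if (setsockopt(fd, SOL_SOCKET, SO_PRIORITY, &prio, sizeof(prio)) < 0)
            return -1;
        if (family == AF_INET6)  /* the ipv6 case mac80211 now honors */
            return setsockopt(fd, IPPROTO_IPV6, IPV6_TCLASS, &tos, sizeof(tos));
        return setsockopt(fd, IPPROTO_IP, IP_TOS, &tos, sizeof(tos));
    }

    /* e.g. mark_socket(fd, AF_INET, 48) for CS6, as babel and AHCP do */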

Ahh... the Aristotelian rathole... (a sketch of the marking table this 
list implies follows it, below)

dnsmasq - uses the same routine for all socket generation. Arguably 
DHCP, TFTP, and DNS should each have different DSCP bits set.

bind - TCP zone transfers should be bulk (CS1); UDP DNS should be 
something like 'immediate' instead of OAM (I think). I dislike the OAM 
class intensely. DNS should end up in the VI queue regardless.

ARP - ARP is not an IP-based packet, so you could set its priority when 
it is generated inside the kernel - AND match against it in the same 
mac80211 layer where I am doing DSCP now... but I would argue against 
raising ARP's priority/classification on normal PCs, as that makes a 
DoS attack easier....

IPv6 ICMP messages (except ping/pong) - these should universally end up 
in the VI queue.
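
Pulling that list together, the implied marking table looks something 
like this (hypothetical helper; I haven't picked values for dhcp and 
tftp yet):

    /* Hypothetical marking table implied by the list above; not code
     * from dnsmasq, bind, or the kernel. */
    enum ant_service { ANT_DNS, ANT_DHCP, ANT_TFTP,
                       ANT_ZONE_XFER, ANT_ROUTING };

    static int dscp_for(enum ant_service s)
    {
        switch (s) {
        case ANT_ZONE_XFER: return 8;   /* CS1: bulk */
        case ANT_DNS:       return 16;  /* CS2 'immediate' -> VI queue */
        case ANT_ROUTING:   return 48;  /* CS6: babel, AHCP */
        default:            return 0;   /* dhcp/tftp: still undecided */
        }
    }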

There are numerous other processes that might benefit from 
classification - polipo, for example, could use AF21-AF23 - but I 
mostly care about fixing the ANTs above.

Perhaps those more familiar with those source bases can take a shot at 
it. I'm off to wrestle with QFQ some more. With good fair queuing, most 
classification of the ANTs would be rendered unnecessary, but you gotta 
start somewhere.

-- 
Dave Täht
