* [Bloat] some (very good) preliminary results from fiddling with byte queue limits on 100Mbit ethernet
@ 2011-11-19 20:33 Dave Taht
2011-11-19 21:53 ` Tom Herbert
2011-11-21 15:08 ` John W. Linville
0 siblings, 2 replies; 5+ messages in thread
From: Dave Taht @ 2011-11-19 20:33 UTC (permalink / raw)
To: Tom Herbert, bloat
[-- Attachment #1: Type: text/plain, Size: 6496 bytes --]
Dear Tom (author of the byte queue limits patch)
I have finally got out of 'embedded computing' mode and more into a
place where I can hack on the kernel.
(Not that I'm any good at it)
So I extracted the current set of BQL related patches from the
debloat-testing kernel and applied them to a recent linus-head
(3.2-rc2 + a little)
(they are at: http://www.teklibre.com/~d/tnq )
Now, the behavior that I had hoped for was that the tx rate would be
closely tied to the completion rate, and that the buffers on the device
driver would fill more rarely.
(It's an e1000e in my case - tx ring set to 256 by default, only
reducible to 64 via ethtool.)
This is my standard latency-under-load test for a GigE ethernet card
stepped down to 100Mbit:
ethtool -s eth0 advertise 0x008 # force the device to 100Mbit
ethtool -G eth0 tx 64 # Knock the ring buffer down as far as it can go
# Plug the device in (which runs the attached script to configure the
# interface to ~'good' values)
netperf -l 60 -H some_other_server
# (in my case cerowrt - which at the moment has a 4-buffer tx ring and
# an 8-packet txqueuelen set - far too low)
and ping some_other_server in another window.
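(Rolled into one place, the whole test looks something like the sketch
below - eth0 and the server name are assumptions, substitute your own.)
#!/bin/sh
# Latency-under-load test sketch. Assumes eth0 and a netperf server
# reachable as $SERVER; adjust both for your setup.
IFACE=eth0
SERVER=some_other_server
ethtool -s $IFACE advertise 0x008  # force the link down to 100Mbit
ethtool -G $IFACE tx 64            # shrink the tx ring as far as it goes
ping $SERVER > ping.log &          # watch latency in the background
PINGPID=$!
netperf -l 60 -H $SERVER           # 60 seconds of bulk TCP load
kill $PINGPID
# ping.log now holds the latencies seen while the link was saturated.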
AND - YES! YES! YES!
SUB-10ms inter-stream latencies, ranging from 1.3ms to about 6ms, with
a median around 4ms.
I haven't seen latencies under load this good on 100Mbit ethernet
since the DECchip tulip 21140!
This is within the range you'd expect for SFQ's 'typical' bunching of
packets. And only a tiny fraction of TCP's speed is lost in the general
case - I mean, it's so close to what I'd get without the script as to
be statistically insignificant. CPU load: hardly measurable...
Now, look at the script. When a link speed of < 101Mbit is detected,
I set the byte queue limit to 3*MTU (any lower and latencies get mildly
lower but also more unstable). Without BQL, I instead tried using CBQ
to set a bandwidth limit at 92Mbit, with SFQ added on top of that (the
documentation for which is now wrong - there's no way to set a packet
limit).
Without byte queue limits, latency under load goes to 130ms and stays
there - i.e., the default buffering in the ethernet driver entirely
defeats my attempt at controlling bandwidth with CBQ + SFQ.
With byte queue limits alone and the default pfifo_fast qdisc...
... at mtu*3, we still end up with 130ms latency under load. :(
With byte queue limits at mtu*3 + the SFQ qdisc, latency under load
can be hammered
down below 6ms when running at a 100Mbit line rate. No CBQ needed.
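(Distilled down, that BQL + SFQ configuration is just this - a sketch
assuming eth0, a 1500-byte MTU, and the byte_queue_limits sysfs layout
from the patches:)
# Cap BQL at 3*MTU on every tx queue, then put SFQ on top.
for q in /sys/class/net/eth0/queues/tx-*/byte_queue_limits
do
    echo 4500 > $q/limit_max   # 3 * 1500-byte MTU
done
tc qdisc del dev eth0 root         # clear whatever qdisc was there
tc qdisc add dev eth0 handle 1 root sfq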
When doing a reverse test (mostly data flowing the other way), with
cerowrt set to the above (insanely low) values, I see similar response
times:
netperf -l 60 -H 172.30.42.1 -t TCP_MAERTS
Anyway, script could use improvement, and I'm busily patching BQL into
the ag71xx driver as I write.
Sorry it's taken me so long to get to this since your bufferbloat
talks at linux plumbers. APPLAUSE.
It's looking like BQL + SFQ is an effective means of improving
fairness and reducing latency on drivers
that can support it. Even if they have large tx rings that the
hardware demands.
More testing on more stuff is needed of course... I'd like to convince
QFQ to work...
#!/bin/sh
# Starving the beast on ethernet v.000001
# Place this file in /etc/network/if-up.d; NetworkManager will call it
# for you automagically when the interface is brought up.
# Today's ethernet device drivers are over-optimized for 1000Mbit
# If you are unfortunate enough to run at less than that
# you are going to lose on latency. As one example you will
# have over 130ms latency under load with the default settings in the e1000e
# driver - common to many laptops.
# To force your network device to 100Mbit
# (so you can test and then bitch about bloat in your driver)
# ethtool -s your_device advertise 0x008
# It will then stay stuck at 100Mbit until you change it back.
# It also helps to lower your ring buffer as far as it will go
# ethtool -G your_device tx 64 # or lower if you can
# And after doing all that you will be lucky to get 120ms latency under load.
# So I have also built byte queue limits into my kernels at
# http://www.teklibre.com/~d/tnq
# Adding in the below, without byte queue limits enabled, and cbq, gets you to
# around 12ms. With byte queue limits, I can get to ~4-6 ms latency under load.
# However, (less often of late), I sometimes end up at 130ms.
# It would be my hope, with some more tuning (QFQ? a better SFQ setup?),
# to get below 1 ms.
debloat_ethernet() {
percent=92
txqueuelen=100
bytelimit=64000
speed=`cat /sys/class/net/$IFACE/speed`
mtu=`ip -o link show dev $IFACE | awk '{print $5;}'`
bytelimit=`expr $mtu '*' 3`
[ $speed -lt 1001 ] && { percent=94; txqueuelen=100; }
if [ $speed -lt 101 ]
then
percent=92;
txqueuelen=50;
fi
#[ $speed -lt 11 ] && { percent=90; txqueuelen=20; }
newspeed=`expr $speed \* $percent / 100`
modprobe sch_cbq
modprobe sch_sfq
modprobe sch_qfq # I can't get QFQ to work
# Doing this twice kicks the driver harder. Sometimes it gets stuck otherwise
ifconfig $IFACE txqueuelen $txqueuelen
tc qdisc del dev $IFACE root
ifconfig $IFACE txqueuelen $txqueuelen
tc qdisc del dev $IFACE root
#tc qdisc add dev $IFACE root handle 1 cbq bandwidth ${newspeed}mbit avpkt 1524
#tc qdisc add dev $IFACE parent 1: handle 10 sfq
if [ -e /sys/class/net/$IFACE/queues/tx-0/byte_queue_limits ]
then
for i in /sys/class/net/$IFACE/queues/tx-*/byte_queue_limits
do
echo $bytelimit > $i/limit_max
done
tc qdisc add dev $IFACE handle 1 root sfq
else
tc qdisc add dev $IFACE root handle 1 cbq bandwidth ${newspeed}mbit avpkt 1524
tc qdisc add dev $IFACE parent 1: handle 10 sfq
fi
}
debloat_wireless() {
# HAH. Like any of this helps wireless
exit
percent=92
txqueuelen=100
speed=`cat /sys/class/net/$IFACE/speed`
[ $speed -lt 1001 ] && { percent=94; txqueuelen=100; }
[ $speed -lt 101 ] && { percent=93; txqueuelen=50; }
[ $speed -lt 11 ] && { percent=90; txqueuelen=20; }
newspeed=`expr $speed \* $percent / 100`
#echo $newspeed
modprobe sch_cbq
modprobe sch_sfq
modprobe sch_qfq
# Just this much would help. If wireless had a 'speed'
ifconfig $IFACE txqueuelen $txqueuelen
}
if [ -h /sys/class/net/$IFACE/phy80211 ]
then
debloat_wireless
else
debloat_ethernet
fi
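(The if-up.d hook gets $IFACE from the environment; to run the script
by hand rather than waiting for a hotplug event, supply it yourself -
eth0 assumed:)
IFACE=eth0 sh /etc/network/if-up.d/debloat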
--
Dave Täht
SKYPE: davetaht
US Tel: 1-239-829-5608
FR Tel: 0638645374
http://www.bufferbloat.net
[-- Attachment #2: debloat --]
[-- Type: application/octet-stream, Size: 3067 bytes --]
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [Bloat] some (very good) preliminary results from fiddling with byte queue limits on 100Mbit ethernet
2011-11-19 20:33 [Bloat] some (very good) preliminary results from fiddling with byte queue limits on 100Mbit ethernet Dave Taht
@ 2011-11-19 21:53 ` Tom Herbert
2011-11-19 22:47 ` Dave Taht
2011-11-21 15:08 ` John W. Linville
1 sibling, 1 reply; 5+ messages in thread
From: Tom Herbert @ 2011-11-19 21:53 UTC (permalink / raw)
To: Dave Taht; +Cc: bloat
Thanks for trying this out Dave!
> With byte queue limits at mtu*3 + the SFQ qdisc, latency under load
> can be hammered
> down below 6ms when running at a 100Mbit line rate. No CBQ needed.
>
I'm hoping that we didn't have to set the BQL max_limit. I would
guess that this might indicate some periodic spikes in interrupt
latency (BQL will increase limit aggressively in that case). You
might want to try adjusting the hold_time to a lower value. Also,
disabling TSO might lower the limit.
Without lowering the max_limit, what values do you see for limit and
inflight? If you set min_limit to a really big number (effectively
turning off BQL), what does inflight grow to?
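(For reference, those values live in per-queue sysfs files; a minimal
sketch for watching them, assuming eth0 and the byte_queue_limits
layout from the patches - hold_time sits in the same directory:)
# Dump the BQL state of each tx queue on eth0 once a second.
while sleep 1
do
    for q in /sys/class/net/eth0/queues/tx-*/byte_queue_limits
    do
        echo "$q: limit=`cat $q/limit` inflight=`cat $q/inflight`"
    done
done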
> Anyway, script could use improvement, and I'm busily patching BQL into
> the ag71xx driver as I write.
>
Cool, I look forward to those results!
> Sorry it's taken me so long to get to this since your bufferbloat
> talks at linux plumbers. APPLAUSE.
> It's looking like BQL + SFQ is an effective means of improving
> fairness and reducing latency on drivers
> that can support it. Even if they have large tx rings that the
> hardware demands.
>
Great. I actually got back to looking at this a little last week.
AFAICT the overhead of BQL is < 1% CPU and throughput (still need more
testing to verify that). There are some (very) minor performance
improvements that might be possible, but I don't have any major
modifications pending at this point.
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [Bloat] some (very good) preliminary results from fiddling with byte queue limits on 100Mbit ethernet
2011-11-19 21:53 ` Tom Herbert
@ 2011-11-19 22:47 ` Dave Taht
0 siblings, 0 replies; 5+ messages in thread
From: Dave Taht @ 2011-11-19 22:47 UTC (permalink / raw)
To: Tom Herbert; +Cc: bloat
On Sat, Nov 19, 2011 at 10:53 PM, Tom Herbert <therbert@google.com> wrote:
> Thanks for trying this out Dave!
I note that there was MAJOR churn in the 3.2 directory layouts, and if
you could rebase that patchset on 3.2 it would be good.
>
>> With byte queue limits at mtu*3 + the SFQ qdisc, latency under load
>> can be hammered
>> down below 6ms when running at a 100Mbit line rate. No CBQ needed.
>>
> I'm hoping that we didn't have to set the BQL max_limit. I would
> guess that this might indicate some periodic spikes in interrupt
> latency (BQL will increase limit aggressively in that case). You
> might want to try adjusting the hold_time to a lower value. Also,
> disabling TSO might lower the limit.
You will find it helpful in debugging (and the results more pleasing)
to artificially lower your line rate to 100Mbit, as per the ethtool
trick noted in the prior email.
This also disables TSO, at least on the e1000e.
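(To rule TSO out explicitly rather than as a side effect of the speed
change - eth0 assumed:)
ethtool -K eth0 tso off   # turn off TCP segmentation offload
ethtool -k eth0           # and check that it took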
> Without lowering the max_limit, what values do you see for limit and
> inflight? If you set min_limit to a really big number (effectively
> turning off BQL), what does inflight grow to?
It is very late in Paris right now. I'll apply your suggestions in the morning.
>> Anyway, script could use improvement, and I'm busily patching BQL into
>> the ag71xx driver as I write.
I wish I could make QFQ work without a CBQ. So far no luck. It should be
better than SFQ, with the right classifier. SFQ might be better with a
different classifier... finally, to have options higher in the stack!
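(A rough sketch of one way QFQ is usually set up standalone - a handful
of pre-created classes plus the flow classifier to hash flows into
them. eth0 and the class count are assumptions, and no promises it
behaves any better than SFQ:)
tc qdisc add dev eth0 root handle 1: qfq
for i in 1 2 3 4 5 6 7 8
do
    tc class add dev eth0 parent 1: classid 1:$i qfq weight 1
done
tc filter add dev eth0 parent 1: protocol ip prio 1 \
    flow hash keys src,dst,proto,proto-src,proto-dst divisor 8 baseclass 1:1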
>>
> Cool, I look forward to those results!
>
>> Sorry it's taken me so long to get to this since your bufferbloat
>> talks at linux plumbers. APPLAUSE.
>> It's looking like BQL + SFQ is an effective means of improving
>> fairness and reducing latency on drivers
>> that can support it. Even if they have large tx rings that the
>> hardware demands.
>>
> Great. I actually got back to looking at this a little last week.
> AFAICT the overhead of BQL is < 1% CPU and throughput (still need more
> testing to verify that).
Seeing it work well at 100Mbit (which much of the world still runs at -
notably, most ADSL and cable modems run at that or less, as do all 3 of
my laptops) *really* made my night. I've been fighting a losing battle
with the wireless stack architecture of late...
You don't get a factor of ~50 improvement in something every day at
nearly zero cost! I mean, with BQL, a saturated 100Mbit system will
start new TCP connections ~50x faster, do local DNS lookups in roughly
22ms rather than 140ms, and so on, and so on.
At the moment I don't care if it eats 10% of CPU, so long as it saves
the most important component of the network - the user - time. :)
(Thus my interest in QFQ now - particularly as I assume your < 1% of
CPU is for GigE speeds?)
And being self-clocked, BQL can handle scary hardware things like
pause frames better, too.
A win all the way across the board.
Effectively tying the driver to the line rate, as BQL seems to do,
moves the need for more intelligence in queue management back up into
the qdisc layer.
I recently learned that with multiple cores it's actually possible to
have more than one packet in the qdisc even at GigE speeds, so a better
qdisc up there may help even at that speed, assuming BQL scales up right.
>There are some (very) minor performance
> improvements that might be possible, but I don't have any major
> modifications pending at this point.
My major thought is that bytes on the wire are a proxy for 'time'. If
you did a smoothed EWMA based on bytes per time interval, you might be
able to hold latencies down even better, and still use a lightweight
timesource like jiffies for the calculation.
All the same, the BQL API is wonderfully clean, and you can fiddle as
much as you want with the core algorithm without exposing the actual
scheme elsewhere in the stack.
My hat is off to you. I HATED the artificially low tx queue rings I
was using in cerowrt...
>
--
Dave Täht
SKYPE: davetaht
US Tel: 1-239-829-5608
FR Tel: 0638645374
http://www.bufferbloat.net
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [Bloat] some (very good) preliminary results from fiddling with byte queue limits on 100Mbit ethernet
2011-11-19 20:33 [Bloat] some (very good) preliminary results from fiddling with byte queue limits on 100Mbit ethernet Dave Taht
2011-11-19 21:53 ` Tom Herbert
@ 2011-11-21 15:08 ` John W. Linville
2011-11-22 5:36 ` Simon Barber
1 sibling, 1 reply; 5+ messages in thread
From: John W. Linville @ 2011-11-21 15:08 UTC (permalink / raw)
To: Dave Taht; +Cc: bloat, Tom Herbert
On Sat, Nov 19, 2011 at 09:33:49PM +0100, Dave Taht wrote:
> So I extracted the current set of BQL related patches from the
> debloat-testing kernel and applied them to a recent linus-head
> (3.2-rc2 + a little)
> (they are at: http://www.teklibre.com/~d/tnq )
FWIW, debloat-testing literally is 3.2-rc2 + the BQL patches. I hope
you didn't spend too much time recreating that!
I'm coming to the conclusion that maintaining debloat-testing is
largely a waste of time. Does anyone (besides maybe jg) actually
run it?
John
--
John W. Linville <linville@tuxdriver.com>
Someday the world will need a hero, and you might be all we have. Be ready.
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [Bloat] some (very good) preliminary results from fiddling with byte queue limits on 100Mbit ethernet
2011-11-21 15:08 ` John W. Linville
@ 2011-11-22 5:36 ` Simon Barber
0 siblings, 0 replies; 5+ messages in thread
From: Simon Barber @ 2011-11-22 5:36 UTC (permalink / raw)
To: bloat
I'm running something called debloat-testing from the Ubuntu kernel PPA
on my laptop.
Simon
On 11/21/2011 07:08 AM, John W. Linville wrote:
> On Sat, Nov 19, 2011 at 09:33:49PM +0100, Dave Taht wrote:
>
>> So I extracted the current set of BQL related patches from the
>> debloat-testing kernel and applied them to a recent linus-head
>> (3.2-rc2 + a little)
>> (they are at: http://www.teklibre.com/~d/tnq )
>
> FWIW, debloat-testing literally is 3.2-rc2 + the BQL patches. I hope
> you didn't spend too much time recreating that!
>
> I'm coming to the conclusion that maintaining debloat-testing is
> largely a waste of time. Does anyone (besides maybe jg) actually
> run it?
>
> John
^ permalink raw reply [flat|nested] 5+ messages in thread