* Re: [RFC] mac80211: implement eBDP algorithm to fight bufferbloat
2011-02-17 1:49 ` [RFC] mac80211: implement eBDP algorithm to fight bufferbloat John W. Linville
@ 2011-02-17 3:31 ` Ben Greear
2011-02-17 4:26 ` Nathaniel Smith
` (2 subsequent siblings)
3 siblings, 0 replies; 23+ messages in thread
From: Ben Greear @ 2011-02-17 3:31 UTC (permalink / raw)
To: John W. Linville
Cc: bloat-devel, johannes, linux-wireless, Nathaniel J. Smith
On 02/16/2011 05:49 PM, John W. Linville wrote:
> This is an implementation of the eBDP algorithm as documented in
> Section IV of "Buffer Sizing for 802.11 Based Networks" by Tianji Li,
> et al.
>
> http://www.hamilton.ie/tianji_li/buffersizing.pdf
>
> This implementation timestamps an skb before handing it to the hardware
> driver, then computes the service time when the transmit status is
> reported back from the driver. An exponentially weighted moving
> average of per packet service times is used to restrict queueing
> delays in hopes of achieving a target packet transmission latency.
>
> Signed-off-by: John W. Linville <linville@tuxdriver.com>
> ---
> This is preliminary, but it seems to put some limits on latencies
> for me. I haven't even really done much testing, so YMMV...
>
> I'm sure this isn't ideal. This should be combined with the ALT
> algorithm to yield the A* algorithm. There are parameters that
> probably should be tunable (or at least better researched). This may
> not be ideal for 802.11n -- it may even be detrimental to it.
>
> Still, it is an attempt at addressing buffer bloat. Plus, it should
> pertain to all mac80211-based drivers. So, feel free to test it,
> suggest different parameters, report real numbers, post patches,
> etc... :-)
> + /* test for queue admission qualifications */
> + tserv_ns_avg = ewma_read(&sdata->tserv_ns_avg);
> + /* constants 2 msec and offset 5 should be tunable? */
> + max_enqueued = 2 * NSEC_PER_MSEC / tserv_ns_avg + 5;
> + if (atomic_read(&sdata->enqueued) > max_enqueued) {
> + /* silently drop */
> + dev_kfree_skb(skb);
> + return IEEE80211_TX_OK;
> + }
We should have some visible stat to increment when you drop on xmit,
tx-fifo error or similar?
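Something like this, maybe -- just a sketch, since the patch as
posted has no such counter (dev->stats.tx_dropped is the generic
net_device drop count):

	if (atomic_read(&sdata->enqueued) > max_enqueued) {
		/* make the drop visible in interface stats */
		sdata->dev->stats.tx_dropped++;
		dev_kfree_skb(skb);
		return IEEE80211_TX_OK;
	}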
Thanks,
Ben
--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
* Re: [RFC] mac80211: implement eBDP algorithm to fight bufferbloat
2011-02-17 1:49 ` [RFC] mac80211: implement eBDP algorithm to fight bufferbloat John W. Linville
2011-02-17 3:31 ` Ben Greear
@ 2011-02-17 4:26 ` Nathaniel Smith
2011-02-17 8:31 ` Johannes Berg
2011-02-18 21:21 ` [RFC v2] " John W. Linville
3 siblings, 0 replies; 23+ messages in thread
From: Nathaniel Smith @ 2011-02-17 4:26 UTC (permalink / raw)
To: John W. Linville; +Cc: bloat-devel, johannes, linux-wireless
On Wed, Feb 16, 2011 at 5:49 PM, John W. Linville
<linville@tuxdriver.com> wrote:
> I'm sure this isn't ideal. This should be combined with the ALT
> algorithm to yield the A* algorithm. There are parameters that
> probably should be tunable (or at least better researched). This may
> not be ideal for 802.11n -- it may even be detrimental to it.
Excellent!
General thoughts:
I think you're wiping out any interrupt mitigation any drivers are
doing, because they'll never get filled up to even their low water
mark? (http://article.gmane.org/gmane.linux.kernel.wireless.general/64843
has a scheme for adapting the eBDP idea to interrupt mitigation)
It's important to keep in mind the distinction between:
-- a host's total tx buffer
-- the individual queues that make up that buffer
In Linux we have two queues in serial, the net subsystem's Qdisc
thing, which feeds the driver's tx queue. It's this distinction that
makes it reasonable to shrink the tx queue down to really tiny sizes
(a few ms), because while a router needs a few hundred milliseconds
(~a RTT) of *total* buffering to absorb bursty packet arrivals, we
want as much of that buffering as possible to be happening in the
Qdisc, where AQM and QoS can be applied.
A* is an algorithm for estimating the right total host buffer size.
(Of course, we might want to just disable the Qdisc buffering and move
everything inside the driver -- Felix Fietkau is considering this[1],
because when aggregating you really want a separate buffer per
destination STA, and there's no easy way to do this with the current
system -- but obviously that raises its own issues...)
[1] https://lists.bufferbloat.net/pipermail/bloat-devel/2011-February/000013.html
> @@ -212,6 +216,11 @@ void ieee80211_tx_status(struct ieee80211_hw *hw, struct sk_buff *skb)
> if (memcmp(hdr->addr2, sta->sdata->vif.addr, ETH_ALEN))
> continue;
>
> + atomic_dec(&sta->sdata->enqueued);
> +
> + /* factor current tserv into service time estimate */
> + ewma_add(&sta->sdata->tserv_ns_avg, ktime_to_ns(tserv));
> +
I think you're calculating the total time that the packet was resident
in the queue, and treating it like the time to service a single
packet. In my patch I also stored the current queue length at the time
the packet was enqueued, and then divided the time delta by the number
of packets that were serviced in that time.
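Concretely, something like this -- a sketch only, where enqueued_at
is a hypothetical per-skb value saved when the frame was handed to
the driver (the patch keeps no such record):

	/* at enqueue: remember how many frames are ahead of this one */
	enqueued_at = atomic_read(&sdata->enqueued);

	/* at tx status: we measured (K+1)*T, so divide by K+1 */
	u64 tserv_ns = ktime_to_ns(ktime_sub(ktime_get(), skb->tstamp));
	do_div(tserv_ns, enqueued_at + 1);
	ewma_add(&sdata->tserv_ns_avg, tserv_ns);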
> @@ -1323,6 +1325,20 @@ static int __ieee80211_tx(struct ieee80211_local *local,
>
> sdata = vif_to_sdata(info->control.vif);
>
> + /* test for queue admission qualifications */
> + tserv_ns_avg = ewma_read(&sdata->tserv_ns_avg);
> + /* constants 2 msec and offset 5 should be tunable? */
> + max_enqueued = 2 * NSEC_PER_MSEC / tserv_ns_avg + 5;
5 packets worth of fudge factor seems high. I can measure 15-20 ms
single packet service times here just by turning down to 1 Mbps on an
uncontended network; the Li et al paper you link has a graph
suggesting that on contended networks, 50-100 ms/packet is not
uncommon. (And even if I don't force my card to 1 Mbps, with my patch
I'm still seeing appropriate buffer sizes in the 1-2 packet range.) So
you may be unconditionally adding a few hundred milliseconds of
latency here.
They make a good point that you might want some extra space to absorb
short-term fluctuations in packet service times, but I think it'd make
more sense to do that by clamping the queue length to some minimum
value (2 packets?), and possibly bumping up the magic "2 ms" constant.
Or be even more clever and estimate the standard deviation of the
single packet service times...
(http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#On-line_algorithm)
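For reference, the on-line algorithm from that page looks roughly
like this (plain C with floating point for clarity; a kernel version
would want fixed-point arithmetic):

	struct online_var {
		long n;
		double mean;
		double m2;	/* running sum of squared deviations */
	};

	static void online_var_add(struct online_var *v, double x)
	{
		double delta = x - v->mean;

		v->n++;
		v->mean += delta / v->n;
		v->m2 += delta * (x - v->mean);
	}

	/* sample variance is v->m2 / (v->n - 1), for v->n > 1 */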
> + if (atomic_read(&sdata->enqueued) > max_enqueued) {
> + /* silently drop */
> + dev_kfree_skb(skb);
> + return IEEE80211_TX_OK;
> + }
Shouldn't you be stopping the queue, too?
I think by leaving the queue open and discarding everything, you
effectively disable the net layer's Qdisc buffering, which might also
be a factor in your observations of reduced bufferbloat :-).
-- Nathaniel
* Re: [RFC] mac80211: implement eBDP algorithm to fight bufferbloat
2011-02-17 1:49 ` [RFC] mac80211: implement eBDP algorithm to fight bufferbloat John W. Linville
2011-02-17 3:31 ` Ben Greear
2011-02-17 4:26 ` Nathaniel Smith
@ 2011-02-17 8:31 ` Johannes Berg
2011-02-18 21:21 ` [RFC v2] " John W. Linville
3 siblings, 0 replies; 23+ messages in thread
From: Johannes Berg @ 2011-02-17 8:31 UTC (permalink / raw)
To: John W. Linville; +Cc: bloat-devel, linux-wireless, Nathaniel J. Smith
On Wed, 2011-02-16 at 20:49 -0500, John W. Linville wrote:
> @@ -212,6 +216,11 @@ void ieee80211_tx_status(struct ieee80211_hw *hw, struct sk_buff *skb)
> if (memcmp(hdr->addr2, sta->sdata->vif.addr, ETH_ALEN))
> continue;
>
> + atomic_dec(&sta->sdata->enqueued);
Note that TX status isn't reliable under any circumstances. Many drivers
will reliably return the SKB if it was enqueued correctly to the device,
but that may already fail (failure to allocate URBs, DMA memory etc.) In
this case, the counter will continually be too high, leading the code
below to drop many more packets than necessary. This is what I was
referring to earlier :-)
> @@ -1323,6 +1325,20 @@ static int __ieee80211_tx(struct ieee80211_local *local,
>
> sdata = vif_to_sdata(info->control.vif);
>
> + /* test for queue admission qualifications */
> + tserv_ns_avg = ewma_read(&sdata->tserv_ns_avg);
> + /* constants 2 msec and offset 5 should be tunable? */
> + max_enqueued = 2 * NSEC_PER_MSEC / tserv_ns_avg + 5;
> + if (atomic_read(&sdata->enqueued) > max_enqueued) {
> + /* silently drop */
> + dev_kfree_skb(skb);
> + return IEEE80211_TX_OK;
> + }
As Nathaniel pointed out, this should instead stop the queue to give
normal QoS mechanisms a chance (this one packet can be kept for later,
or transmitted anyway, or the code to test all this can be after the
packet is sent).
However, since "sdata->enqueued" will drift as explained above, this
isn't feasible anyhow.
> + /* timestamp queue entry and increment queue length */
> + skb->tstamp = ktime_get();
> + atomic_inc(&sdata->enqueued);
Additionally, this assumes that tx() cannot fail -- it can until my void
patch.
Sorry, this approach just won't work because we don't have perfect skb
accounting across the stack.
johannes
* [RFC v2] mac80211: implement eBDP algorithm to fight bufferbloat
2011-02-17 1:49 ` [RFC] mac80211: implement eBDP algorithm to fight bufferbloat John W. Linville
` (2 preceding siblings ...)
2011-02-17 8:31 ` Johannes Berg
@ 2011-02-18 21:21 ` John W. Linville
2011-02-19 3:44 ` Nathaniel Smith
` (2 more replies)
3 siblings, 3 replies; 23+ messages in thread
From: John W. Linville @ 2011-02-18 21:21 UTC (permalink / raw)
To: linux-wireless; +Cc: johannes, bloat-devel, Nathaniel J. Smith
This is an implementation of the eBDP algorithm as documented in
Section IV of "Buffer Sizing for 802.11 Based Networks" by Tianji Li,
et al.
http://www.hamilton.ie/tianji_li/buffersizing.pdf
This implementation timestamps an skb before handing it to the
hardware driver, then computes the service time when the frame is
freed by the driver. An exponentially weighted moving average of per
fragment service times is used to restrict queueing delays in hopes
of achieving a target fragment transmission latency.
Signed-off-by: John W. Linville <linville@tuxdriver.com>
---
v1 -> v2:
- execute algorithm separately for each WMM queue
- change ewma scaling parameters
- calculate max queue len only when new latency data is received
- stop queues when occupancy limit is reached rather than dropping
- use skb->destructor for tracking queue occupancy
Johannes' comment about tx status reporting being unreliable (and what
he was really saying) finally sunk-in. So, this version uses
skb->destructor to track in-flight fragments. That should handle
fragments that get silently dropped in the driver for whatever reason
without leaking queue capacity. Correct me if I'm wrong!
This version runs separate instances of the algorithm for each WMM
queue -- I reckon that makes sense. I still ignore HT aggregation, as
I'm pretty sure I don't understand how it really affects the issue.
I still haven't done any objective testing, so still no numbers -- feel
free to provide your own! ;-) But for me, this version "feels" pretty
good where I meet the chair.
There are still bits that I would like to be adjustable. But since I'm
still not sure what this ultimately becomes, I'm in no hurry to finalize
a UI. I would be happy to consider a contribution from someone else in
that regard.
No doubt that this is all still quite preliminary! I'm pretty sure
it still eats puppies. So, don't get it wet! And never, ever feed
it after midnight...
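To give a feel for the numbers (an illustration, not part of the
patch): the status.c hunk below computes

	max_enqueued = max_t(int, 2, 2 * NSEC_PER_MSEC / tserv_ns_avg)

so an average service time of 100 usec allows 20 fragments in flight,
500 usec allows 4, and anything slower than 1 msec hits the floor
of 2.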
net/mac80211/ieee80211_i.h | 12 ++++++++++++
net/mac80211/iface.c | 12 ++++++++++++
net/mac80211/status.c | 23 +++++++++++++++++++++++
net/mac80211/tx.c | 26 ++++++++++++++++++++++++++
4 files changed, 73 insertions(+), 0 deletions(-)
diff --git a/net/mac80211/ieee80211_i.h b/net/mac80211/ieee80211_i.h
index f2ef15d..47e65bd 100644
--- a/net/mac80211/ieee80211_i.h
+++ b/net/mac80211/ieee80211_i.h
@@ -604,6 +604,18 @@ struct ieee80211_sub_if_data {
struct dentry *default_mgmt_key;
} debugfs;
#endif
+
+ struct {
+ /* number of packets currently in flight */
+ atomic_t enqueued;
+
+ /* computed maximum packets in flight */
+ int max_enqueued;
+
+ /* average packet service time */
+ struct ewma tserv_ns_avg;
+ } qdata[4];
+
/* must be last, dynamically sized area in this! */
struct ieee80211_vif vif;
};
diff --git a/net/mac80211/iface.c b/net/mac80211/iface.c
index 5a4e19b..3e04ddb 100644
--- a/net/mac80211/iface.c
+++ b/net/mac80211/iface.c
@@ -839,6 +839,8 @@ static void ieee80211_iface_work(struct work_struct *work)
static void ieee80211_setup_sdata(struct ieee80211_sub_if_data *sdata,
enum nl80211_iftype type)
{
+ int i;
+
/* clear type-dependent union */
memset(&sdata->u, 0, sizeof(sdata->u));
@@ -857,6 +859,16 @@ static void ieee80211_setup_sdata(struct ieee80211_sub_if_data *sdata,
skb_queue_head_init(&sdata->skb_queue);
INIT_WORK(&sdata->work, ieee80211_iface_work);
+ for (i = 0; i < 4; i++) {
+ /* initialize service time estimate for buffer control */
+ ewma_init(&sdata->qdata[i].tserv_ns_avg, 1, 64);
+
+ atomic_set(&sdata->qdata[i].enqueued, 0);
+
+ /* default max should be ??? */
+ sdata->qdata[i].max_enqueued = 2;
+ }
+
switch (type) {
case NL80211_IFTYPE_P2P_GO:
type = NL80211_IFTYPE_AP;
diff --git a/net/mac80211/status.c b/net/mac80211/status.c
index 010a559..8cb6539 100644
--- a/net/mac80211/status.c
+++ b/net/mac80211/status.c
@@ -172,6 +172,7 @@ static void ieee80211_frame_acked(struct sta_info *sta, struct sk_buff *skb)
void ieee80211_tx_status(struct ieee80211_hw *hw, struct sk_buff *skb)
{
+ ktime_t tserv;
struct sk_buff *skb2;
struct ieee80211_hdr *hdr = (struct ieee80211_hdr *) skb->data;
struct ieee80211_local *local = hw_to_local(hw);
@@ -188,6 +189,9 @@ void ieee80211_tx_status(struct ieee80211_hw *hw, struct sk_buff *skb)
bool send_to_cooked;
bool acked;
+ /* grab timestamp info for buffer control estimates */
+ tserv = ktime_sub(ktime_get(), skb->tstamp);
+
for (i = 0; i < IEEE80211_TX_MAX_RATES; i++) {
/* the HW cannot have attempted that rate */
if (i >= hw->max_report_rates) {
@@ -208,10 +212,29 @@ void ieee80211_tx_status(struct ieee80211_hw *hw, struct sk_buff *skb)
fc = hdr->frame_control;
for_each_sta_info(local, hdr->addr1, sta, tmp) {
+ int q = skb_get_queue_mapping(skb);
+ unsigned long tserv_ns_avg;
+
/* skip wrong virtual interface */
if (memcmp(hdr->addr2, sta->sdata->vif.addr, ETH_ALEN))
continue;
+ /* factor current tserv into service time estimate */
+ ewma_add(&sta->sdata->qdata[q].tserv_ns_avg,
+ ktime_to_ns(tserv));
+
+ /* determine queue admission qualifications */
+ tserv_ns_avg = ewma_read(&sta->sdata->qdata[q].tserv_ns_avg);
+ if (tserv_ns_avg) {
+ /* latency target and min limit should be tunable? */
+ sta->sdata->qdata[q].max_enqueued =
+ max_t(int, 2, 2 * NSEC_PER_MSEC / tserv_ns_avg);
+ }
+ if (atomic_read(&sta->sdata->qdata[q].enqueued) >=
+ sta->sdata->qdata[q].max_enqueued &&
+ !ieee80211_queue_stopped(&local->hw, q))
+ ieee80211_stop_queue(&local->hw, q);
+
acked = !!(info->flags & IEEE80211_TX_STAT_ACK);
if (!acked && test_sta_flags(sta, WLAN_STA_PS_STA)) {
/*
diff --git a/net/mac80211/tx.c b/net/mac80211/tx.c
index 17ef4f4..cba9060 100644
--- a/net/mac80211/tx.c
+++ b/net/mac80211/tx.c
@@ -1284,6 +1284,20 @@ ieee80211_tx_prepare(struct ieee80211_sub_if_data *sdata,
return TX_CONTINUE;
}
+static void ieee80211_tx_skb_free(struct sk_buff *skb)
+{
+ struct ieee80211_sub_if_data *sdata = IEEE80211_DEV_TO_SUB_IF(skb->dev);
+ struct ieee80211_local *local = sdata->local;
+ int q = skb_get_queue_mapping(skb);
+
+ /* decrement enqueued count */
+ atomic_dec(&sdata->qdata[q].enqueued);
+
+ /* if queue stopped, wake it */
+ if (ieee80211_queue_stopped(&local->hw, q))
+ ieee80211_wake_queue(&local->hw, q);
+}
+
static int __ieee80211_tx(struct ieee80211_local *local,
struct sk_buff **skbp,
struct sta_info *sta,
@@ -1323,6 +1337,18 @@ static int __ieee80211_tx(struct ieee80211_local *local,
sdata = vif_to_sdata(info->control.vif);
+ /* timestamp queue entry and increment queue length */
+ skb->tstamp = ktime_get();
+ atomic_inc(&sdata->qdata[q].enqueued);
+
+ /* take ownership of skb */
+ skb_orphan(skb);
+ skb->dev = sdata->dev;
+ skb->destructor = ieee80211_tx_skb_free;
+ if (atomic_read(&sdata->qdata[q].enqueued) >=
+ sdata->qdata[q].max_enqueued)
+ ieee80211_stop_queue(&local->hw, q);
+
switch (sdata->vif.type) {
case NL80211_IFTYPE_MONITOR:
info->control.vif = NULL;
--
1.7.4
* Re: [RFC v2] mac80211: implement eBDP algorithm to fight bufferbloat
2011-02-18 21:21 ` [RFC v2] " John W. Linville
@ 2011-02-19 3:44 ` Nathaniel Smith
2011-02-21 18:47 ` John W. Linville
2011-02-20 0:37 ` Nathaniel Smith
2011-02-21 15:28 ` Johannes Berg
2 siblings, 1 reply; 23+ messages in thread
From: Nathaniel Smith @ 2011-02-19 3:44 UTC (permalink / raw)
To: John W. Linville; +Cc: bloat-devel, johannes, linux-wireless
On Fri, Feb 18, 2011 at 1:21 PM, John W. Linville
<linville@tuxdriver.com> wrote:
> + /* grab timestamp info for buffer control estimates */
> + tserv = ktime_sub(ktime_get(), skb->tstamp);
[...]
> + ewma_add(&sta->sdata->qdata[q].tserv_ns_avg,
> + ktime_to_ns(tserv));
I think you're still measuring how long it takes one packet to get
from the end of the queue to the beginning, rather than measuring how
long it takes each packet to go out?
-- Nathaniel
* Re: [RFC v2] mac80211: implement eBDP algorithm to fight bufferbloat
2011-02-19 3:44 ` Nathaniel Smith
@ 2011-02-21 18:47 ` John W. Linville
2011-02-21 23:26 ` Nathaniel Smith
0 siblings, 1 reply; 23+ messages in thread
From: John W. Linville @ 2011-02-21 18:47 UTC (permalink / raw)
To: Nathaniel Smith; +Cc: bloat-devel, johannes, linux-wireless
On Fri, Feb 18, 2011 at 07:44:30PM -0800, Nathaniel Smith wrote:
> On Fri, Feb 18, 2011 at 1:21 PM, John W. Linville
> <linville@tuxdriver.com> wrote:
> > + /* grab timestamp info for buffer control estimates */
> > + tserv = ktime_sub(ktime_get(), skb->tstamp);
> [...]
> > + ewma_add(&sta->sdata->qdata[q].tserv_ns_avg,
> > + ktime_to_ns(tserv));
>
> I think you're still measuring how long it takes one packet to get
> from the end of the queue to the beginning, rather than measuring how
> long it takes each packet to go out?
Yes, I am measuring how long the driver+device takes to release each
skb back to me (using that as a proxy for how long it takes to get
the fragment to the next hop). Actually, FWIW I'm only measuring
that time for those skb's that result in a tx status report.
I tried to see how your measurement would be useful, but I just don't
see how the number of frames ahead of me in the queue is relevant to
the measured link latency? I mean, I realize that having more packets
ahead of me in the queue is likely to increase the latency for this
frame, but I don't understand why I should use that information to
discount the measured latency...?
John
--
John W. Linville Someday the world will need a hero, and you
linville@tuxdriver.com might be all we have. Be ready.
* Re: [RFC v2] mac80211: implement eBDP algorithm to fight bufferbloat
2011-02-21 18:47 ` John W. Linville
@ 2011-02-21 23:26 ` Nathaniel Smith
2011-02-23 22:28 ` John W. Linville
0 siblings, 1 reply; 23+ messages in thread
From: Nathaniel Smith @ 2011-02-21 23:26 UTC (permalink / raw)
To: John W. Linville; +Cc: bloat-devel, johannes, linux-wireless
On Mon, Feb 21, 2011 at 10:47 AM, John W. Linville
<linville@tuxdriver.com> wrote:
> On Fri, Feb 18, 2011 at 07:44:30PM -0800, Nathaniel Smith wrote:
>> On Fri, Feb 18, 2011 at 1:21 PM, John W. Linville
>> <linville@tuxdriver.com> wrote:
>> > + /* grab timestamp info for buffer control estimates */
>> > + tserv = ktime_sub(ktime_get(), skb->tstamp);
>> [...]
>> > + ewma_add(&sta->sdata->qdata[q].tserv_ns_avg,
>> > + ktime_to_ns(tserv));
>>
>> I think you're still measuring how long it takes one packet to get
>> from the end of the queue to the beginning, rather than measuring how
>> long it takes each packet to go out?
>
> Yes, I am measuring how long the driver+device takes to release each
> skb back to me (using that as a proxy for how long it takes to get
> the fragment to the next hop). Actually, FWIW I'm only measuring
> that time for those skb's that result in a tx status report.
>
> I tried to see how your measurement would be useful, but I just don't
> see how the number of frames ahead of me in the queue is relevant to
> the measured link latency? I mean, I realize that having more packets
> ahead of me in the queue is likely to increase the latency for this
> frame, but I don't understand why I should use that information to
> discount the measured latency...?
It depends on which latency you want to measure. The way that I
reasoned was, suppose that at some given time, the card is able to
transmit 1 fragment every T nanoseconds. Then it can transmit n
fragments in n*T nanoseconds, so if we want the queue depth to be 2
ms, then we have
n * T = 2 * NSEC_PER_MSEC
n = 2 * NSEC_PER_MSEC / T
Which is the calculation that you're doing:
+ sta->sdata->qdata[q].max_enqueued =
+ max_t(int, 2, 2 * NSEC_PER_MSEC / tserv_ns_avg);
But for this calculation to make sense, we need T to be the time it
takes the card to transmit 1 fragment. In your patch, you're not
measuring that. You're measuring the total time between when a packet
is enqueued and when it is transmitted; if there were K packets in the
queue ahead of it, then this is the time to send *all* of them --
you're measuring (K+1)*T. That's why in my patch, I recorded the
current size of the queue when each packet is enqueued, so I could
compute T = total_time / (K+1).
Under saturation conditions, K+1 will always equal max_enqueued, so I
guess in your algorithm, at the steady state we have
max_enqueued = K+1 = 2 * NSEC_PER_MSEC / ((K+1) * T)
(K+1)^2 = 2 * NSEC_PER_MSEC / T
K+1 = sqrt(2 * NSEC_PER_MSEC / T)
So I think under saturation, you converge to setting the queue to the
square root of the appropriate size?
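(To put numbers on that: if the card sends a fragment every T = 20
usec, the intended limit is 2 msec / 20 usec = 100 fragments, but the
saturation fixed point is sqrt(2,000,000 / 20,000) = sqrt(100) = 10
fragments.)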
-- Nathaniel
* Re: [RFC v2] mac80211: implement eBDP algorithm to fight bufferbloat
2011-02-21 23:26 ` Nathaniel Smith
@ 2011-02-23 22:28 ` John W. Linville
2011-02-25 18:21 ` Nathaniel Smith
0 siblings, 1 reply; 23+ messages in thread
From: John W. Linville @ 2011-02-23 22:28 UTC (permalink / raw)
To: Nathaniel Smith; +Cc: bloat-devel, johannes, linux-wireless
On Mon, Feb 21, 2011 at 03:26:32PM -0800, Nathaniel Smith wrote:
> On Mon, Feb 21, 2011 at 10:47 AM, John W. Linville
> <linville@tuxdriver.com> wrote:
> > I tried to see how your measurement would be useful, but I just don't
> > see how the number of frames ahead of me in the queue is relevant to
> > the measured link latency? I mean, I realize that having more packets
> > ahead of me in the queue is likely to increase the latency for this
> > frame, but I don't understand why I should use that information to
> > discount the measured latency...?
>
> It depends on which latency you want to measure. The way that I
> reasoned was, suppose that at some given time, the card is able to
> transmit 1 fragment every T nanoseconds. Then it can transmit n
> fragments in n*T nanoseconds, so if we want the queue depth to be 2
> ms, then we have
> n * T = 2 * NSEC_PER_MSEC
> n = 2 * NSEC_PER_MSEC / T
>
> Which is the calculation that you're doing:
>
> + sta->sdata->qdata[q].max_enqueued =
> + max_t(int, 2, 2 * NSEC_PER_MSEC / tserv_ns_avg);
>
> But for this calculation to make sense, we need T to be the time it
> takes the card to transmit 1 fragment. In your patch, you're not
> measuring that. You're measuring the total time between when a packet
> is enqueued and when it is transmitted; if there were K packets in the
> queue ahead of it, then this is the time to send *all* of them --
> you're measuring (K+1)*T. That's why in my patch, I recorded the
> current size of the queue when each packet is enqueued, so I could
> compute T = total_time / (K+1).
Thanks for the math! I think I see what you are saying now. Since the
measured time is being used to determine the queue size, we need to
factor in the length of the queue that resulted in that measurement.
Unfortunately, I'm not sure how to apply this with the technique I
am using for the timing measurements. :-( I'll have to think about
this some more...
John
--
John W. Linville Someday the world will need a hero, and you
linville@tuxdriver.com might be all we have. Be ready.
* Re: [RFC v2] mac80211: implement eBDP algorithm to fight bufferbloat
2011-02-23 22:28 ` John W. Linville
@ 2011-02-25 18:21 ` Nathaniel Smith
2011-02-25 18:27 ` Nathaniel Smith
0 siblings, 1 reply; 23+ messages in thread
From: Nathaniel Smith @ 2011-02-25 18:21 UTC (permalink / raw)
To: John W. Linville; +Cc: bloat-devel, johannes, linux-wireless
On Wed, Feb 23, 2011 at 2:28 PM, John W. Linville
<linville@tuxdriver.com> wrote:
> On Mon, Feb 21, 2011 at 03:26:32PM -0800, Nathaniel Smith wrote:
>> But for this calculation to make sense, we need T to be the time it
>> takes the card to transmit 1 fragment. In your patch, you're not
>> measuring that. You're measuring the total time between when a packet
>> is enqueued and when it is transmitted; if there were K packets in the
>> queue ahead of it, then this is the time to send *all* of them --
>> you're measuring (K+1)*T. That's why in my patch, I recorded the
>> current size of the queue when each packet is enqueued, so I could
>> compute T = total_time / (K+1).
>
> Thanks for the math! I think I see what you are saying now. Since the
> measured time is being used to determine the queue size, we need to
> factor in the length of the queue that resulted in that measurement.
>
> Unfortunately, I'm not sure how to apply this with the technique I
> am using for the timing measurements. :-( I'll have to think about
> this some more...
I guess an ugly but effective approach would be to steal some bits
from the timestamp... something like (untested, and probably needs
some sign-extension bugs fixed):
#define LATENCY_BITS 8
#define ENQUEUED_BITS 24
BUILD_BUG_ON(LATENCY_BITS + ENQUEUED_BITS != 32);
static ktime_t _ktime_get_and_stash(int enqueued) {
	struct timespec now = ktime_to_timespec(ktime_get());
	now.tv_sec &= (1 << LATENCY_BITS) - 1;
	BUG_ON(enqueued > (1 << ENQUEUED_BITS));
	now.tv_sec |= enqueued << LATENCY_BITS;
	return timespec_to_ktime(now);
}
static u64 _ktime_diff_to_now_and_unstash(ktime_t then, int *enqueued) {
	struct timespec ts_then = ktime_to_timespec(then);
	struct timespec ts_now = ktime_to_timespec(ktime_get());
	*enqueued = ts_then.tv_sec >> LATENCY_BITS;
	ts_then.tv_sec &= (1 << LATENCY_BITS) - 1;
	ts_now.tv_sec &= (1 << LATENCY_BITS) - 1;
	if (ts_now.tv_sec < ts_then.tv_sec)
		ts_now.tv_sec += (1 << LATENCY_BITS);
	timespec_sub(ts_now, ts_then);
}
* Re: [RFC v2] mac80211: implement eBDP algorithm to fight bufferbloat
2011-02-25 18:21 ` Nathaniel Smith
@ 2011-02-25 18:27 ` Nathaniel Smith
0 siblings, 0 replies; 23+ messages in thread
From: Nathaniel Smith @ 2011-02-25 18:27 UTC (permalink / raw)
To: John W. Linville; +Cc: bloat-devel, johannes, linux-wireless
On Fri, Feb 25, 2011 at 10:21 AM, Nathaniel Smith <njs@pobox.com> wrote:
> static u64 _ktime_diff_to_now_and_unstash(ktime_t then, int *enqueued) {
>
> 	struct timespec ts_then = ktime_to_timespec(then);
> 	struct timespec ts_now = ktime_to_timespec(ktime_get());
> 	*enqueued = ts_then.tv_sec >> LATENCY_BITS;
> 	ts_then.tv_sec &= (1 << LATENCY_BITS) - 1;
> 	ts_now.tv_sec &= (1 << LATENCY_BITS) - 1;
> 	if (ts_now.tv_sec < ts_then.tv_sec)
> 		ts_now.tv_sec += (1 << LATENCY_BITS);
> 	timespec_sub(ts_now, ts_then);
> }
Err, plus the 'return timespec_to_ns(...)' on the last line, that I
was trying to add when my computer suddenly decided I wanted to send
the message. How embarrassing.
Anyway, not sure this is a *good* idea, but it should work. Hopefully
we don't actually need to measure latencies > 256 seconds in this
context...
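Spelled out, that last line would become something like (note that
timespec_to_ns() takes a pointer, so it needs a temporary):

	struct timespec diff = timespec_sub(ts_now, ts_then);
	return timespec_to_ns(&diff);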
-- Nathaniel
* Re: [RFC v2] mac80211: implement eBDP algorithm to fight bufferbloat
2011-02-18 21:21 ` [RFC v2] " John W. Linville
2011-02-19 3:44 ` Nathaniel Smith
@ 2011-02-20 0:37 ` Nathaniel Smith
2011-02-20 0:51 ` Jim Gettys
2011-02-21 18:52 ` John W. Linville
2011-02-21 15:28 ` Johannes Berg
2 siblings, 2 replies; 23+ messages in thread
From: Nathaniel Smith @ 2011-02-20 0:37 UTC (permalink / raw)
To: John W. Linville; +Cc: bloat-devel, johannes, linux-wireless
Actually, a few more comments just occurred to me...
On Fri, Feb 18, 2011 at 1:21 PM, John W. Linville
<linville@tuxdriver.com> wrote:
> Johannes' comment about tx status reporting being unreliable (and what
> he was really saying) finally sunk-in. So, this version uses
> skb->destructor to track in-flight fragments. That should handle
> fragments that get silently dropped in the driver for whatever reason
> without leaking queue capacity. Correct me if I'm wrong!
Should we be somehow filtering out and ignoring the packets that get
dropped, when we're calculating the average packet transmission rate?
Presumably they're getting dropped almost instantly, so they don't
really take up queue space and they have abnormally fast transmission
times, which will tend to cause us to overestimate max_enqueued? They
should be rare, though, at least. (And presumably we *should* include
packets that get dropped because their retry timer ran out, since they
were sitting in the queue for that long.) Possibly we should just
ignore any packet that was handled in less than, oh, say, a few
microseconds?
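As a strawman (sketch only; the 10 usec cutoff is a made-up number):

	/* skip the EWMA update for frames freed "instantly" --
	 * they were almost certainly dropped, not transmitted */
	if (ktime_to_ns(tserv) < 10 * NSEC_PER_USEC)
		return;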
Alternatively, if we aren't worried about those packets, then is there
any reason this patch should be mac80211 specific?
> +static void ieee80211_tx_skb_free(struct sk_buff *skb)
> +{
> + struct ieee80211_sub_if_data *sdata = IEEE80211_DEV_TO_SUB_IF(skb->dev);
> + struct ieee80211_local *local = sdata->local;
> + int q = skb_get_queue_mapping(skb);
> +
> + /* decrement enqueued count */
> + atomic_dec(&sdata->qdata[q].enqueued);
> +
> + /* if queue stopped, wake it */
> + if (ieee80211_queue_stopped(&local->hw, q))
> + ieee80211_wake_queue(&local->hw, q);
> +}
I think you need to check that .enqueued is < max_enqueued here,
instead of waking the queue unconditionally.
Suppose the data rate drops while there's a steady flow -- our
max_enqueued value will drop below the current queue size, but because
we wake the queue unconditionally after each packet goes out, and then
only stop it again after we've queued at least one new packet, we
might get 'stuck' with an over-large queue.
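Something like this in the destructor, perhaps:

	/* decrement enqueued count */
	atomic_dec(&sdata->qdata[q].enqueued);

	/* wake the queue only once we are back under the limit */
	if (ieee80211_queue_stopped(&local->hw, q) &&
	    atomic_read(&sdata->qdata[q].enqueued) <
					sdata->qdata[q].max_enqueued)
		ieee80211_wake_queue(&local->hw, q);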
-- Nathaniel
* Re: [RFC v2] mac80211: implement eBDP algorithm to fight bufferbloat
2011-02-20 0:37 ` Nathaniel Smith
@ 2011-02-20 0:51 ` Jim Gettys
2011-02-20 15:24 ` Dave Täht
2011-02-21 18:52 ` John W. Linville
1 sibling, 1 reply; 23+ messages in thread
From: Jim Gettys @ 2011-02-20 0:51 UTC (permalink / raw)
To: bloat-devel
On 02/19/2011 07:37 PM, Nathaniel Smith wrote:
> Actually, a few more comments just occurred to me...
>
> On Fri, Feb 18, 2011 at 1:21 PM, John W. Linville
> <linville@tuxdriver.com> wrote:
>> Johannes' comment about tx status reporting being unreliable (and what
>> he was really saying) finally sunk-in. So, this version uses
>> skb->destructor to track in-flight fragments. That should handle
>> fragments that get silently dropped in the driver for whatever reason
>> without leaking queue capacity. Correct me if I'm wrong!
>
> Should we be somehow filtering out and ignoring the packets that get
> dropped, when we're calculating the average packet transmission rate?
> Presumably they're getting dropped almost instantly, so they don't
> really take up queue space and they have abnormally fast transmission
> times, which will tend to cause us to overestimate max_enqueued? They
> should be rare, though, at least. (And presumably we *should* include
> packets that get dropped because their retry timer ran out, since they
> were sitting in the queue for that long.) Possibly we should just
> ignore any packet that was handled in less than, oh, say, a few
> microseconds?
I will note that AQM algorithms that are likely to work will need to
know the actual "goodput" of the link. (This from talking to Van
Jacobson about the problem we face.)
- Jim
* Re: [RFC v2] mac80211: implement eBDP algorithm to fight bufferbloat
2011-02-20 0:51 ` Jim Gettys
@ 2011-02-20 15:24 ` Dave Täht
0 siblings, 0 replies; 23+ messages in thread
From: Dave Täht @ 2011-02-20 15:24 UTC (permalink / raw)
To: Jim Gettys; +Cc: bloat-devel
Jim Gettys <jg@freedesktop.org> writes:
> On 02/19/2011 07:37 PM, Nathaniel Smith wrote:
>> Actually, a few more comments just occurred to me...
>>
>> On Fri, Feb 18, 2011 at 1:21 PM, John W. Linville
>> <linville@tuxdriver.com> wrote:
>>> Johannes' comment about tx status reporting being unreliable (and what
>>> he was really saying) finally sunk-in. So, this version uses
>>> skb->destructor to track in-flight fragments. That should handle
>>> fragments that get silently dropped in the driver for whatever reason
>>> without leaking queue capacity. Correct me if I'm wrong!
>>
>> Should we be somehow filtering out and ignoring the packets that get
>> dropped, when we're calculating the average packet transmission rate?
>> Presumably they're getting dropped almost instantly, so they don't
>> really take up queue space and they have abnormally fast transmission
>> times, which will tend to cause us to overestimate max_enqueued? They
>> should be rare, though, at least. (And presumably we *should* include
>> packets that get dropped because their retry timer ran out, since they
>> were sitting in the queue for that long.) Possibly we should just
>> ignore any packet that was handled in less than, oh, say, a few
>> microseconds?
The perfect is the enemy of the good. POGE[1] applies. After getting 2
orders of magnitude improvement with the patches thus far...
My reaction to the above is that I'd like to get what is being
implemented tested by more folk, across more drivers. Me being one of
them! I hope to get builds done today for several openwrt devices and my
laggy-lagn laptop.
> I will note that AQM algorithms that are likely to work will need to
> know the actual "goodput" of the link. (This from talking to Van
> Jacobson about the problem we face.)
I care about AQM, I really do[2]... Certainly exposing more hooks and data
to the next layer in the stack from the driver would be good... But I'd
like mostly simply to get:
A) People using and loving debloated drivers, & more people aware of and
fixing the problem...
B) The dynamic range available via ethtool to userspace to be vastly
increased
C) Buffer sizes defaulting to much less, across the board.
These are the simplest, most effective, low hanging fruit/fixes. [1]
We're making great progress on wireless, how do we make progress on wired?
--
Dave Taht
http://nex-6.taht.net
1: http://en.wikipedia.org/wiki/Principle_of_good_enough
2: I'll patch SFB into my build and any other of the new AQMs
anyone can suggest
* Re: [RFC v2] mac80211: implement eBDP algorithm to fight bufferbloat
2011-02-20 0:37 ` Nathaniel Smith
2011-02-20 0:51 ` Jim Gettys
@ 2011-02-21 18:52 ` John W. Linville
1 sibling, 0 replies; 23+ messages in thread
From: John W. Linville @ 2011-02-21 18:52 UTC (permalink / raw)
To: Nathaniel Smith; +Cc: bloat-devel, johannes, linux-wireless
On Sat, Feb 19, 2011 at 04:37:00PM -0800, Nathaniel Smith wrote:
> Actually, a few more comments just occurred to me...
>
> On Fri, Feb 18, 2011 at 1:21 PM, John W. Linville
> <linville@tuxdriver.com> wrote:
> > Johannes' comment about tx status reporting being unreliable (and what
> > he was really saying) finally sunk-in. So, this version uses
> > skb->destructor to track in-flight fragments. That should handle
> > fragments that get silently dropped in the driver for whatever reason
> > without leaking queue capacity. Correct me if I'm wrong!
>
> Should we be somehow filtering out and ignoring the packets that get
> dropped, when we're calculating the average packet transmission rate?
> Presumably they're getting dropped almost instantly, so they don't
> really take up queue space and they have abnormally fast transmission
> times, which will tend to cause us to overestimate max_enqueued? They
> should be rare, though, at least. (And presumably we *should* include
> packets that get dropped because their retry timer ran out, since they
> were sitting in the queue for that long.) Possibly we should just
> ignore any packet that was handled in less than, oh, say, a few
> microseconds?
If you look, I only do the timing measurement for frames that
result in a tx status report. Frames that are dropped will only hit
ieee80211_tx_skb_free (which reduces the enqueued count but doesn't
recalculate max_enqueued).
> Alternatively, if we aren't worried about those packets, then is there
> any reason this patch should be mac80211 specific?
No, in fact I was thinking the same thing. Some thought needs to be
put to whether or not we can ignore the effects of letting dropped
packets affect the latency estimate...
> > +static void ieee80211_tx_skb_free(struct sk_buff *skb)
> > +{
> > + struct ieee80211_sub_if_data *sdata = IEEE80211_DEV_TO_SUB_IF(skb->dev);
> > + struct ieee80211_local *local = sdata->local;
> > + int q = skb_get_queue_mapping(skb);
> > +
> > + /* decrement enqueued count */
> > + atomic_dec(&sdata->qdata[q].enqueued);
> > +
> > + /* if queue stopped, wake it */
> > + if (ieee80211_queue_stopped(&local->hw, q))
> > + ieee80211_wake_queue(&local->hw, q);
> > +}
>
> I think you need to check that .enqueued is < max_enqueued here,
> instead of waking the queue unconditionally.
>
> Suppose the data rate drops while there's a steady flow -- our
> max_enqueued value will drop below the current queue size, but because
> we wake the queue unconditionally after each packet goes out, and then
> only stop it again after we've queued at least one new packet, we
> might get 'stuck' with an over-large queue.
Yes, thanks for pointing that out. My non-thorough tests seem to do
a better job at holding the latency lower with that change.
Thanks for taking a look!
John
--
John W. Linville Someday the world will need a hero, and you
linville@tuxdriver.com might be all we have. Be ready.
* Re: [RFC v2] mac80211: implement eBDP algorithm to fight bufferbloat
2011-02-18 21:21 ` [RFC v2] " John W. Linville
2011-02-19 3:44 ` Nathaniel Smith
2011-02-20 0:37 ` Nathaniel Smith
@ 2011-02-21 15:28 ` Johannes Berg
2011-02-21 16:12 ` Jim Gettys
2011-02-21 19:06 ` John W. Linville
2 siblings, 2 replies; 23+ messages in thread
From: Johannes Berg @ 2011-02-21 15:28 UTC (permalink / raw)
To: John W. Linville; +Cc: bloat-devel, linux-wireless, Nathaniel J. Smith
On Fri, 2011-02-18 at 16:21 -0500, John W. Linville wrote:
> This is an implementation of the eBDP algorithm as documented in
> Section IV of "Buffer Sizing for 802.11 Based Networks" by Tianji Li,
> et al.
>
> http://www.hamilton.ie/tianji_li/buffersizing.pdf
>
> This implementation timestamps an skb before handing it to the
> hardware driver, then computes the service time when the frame is
> freed by the driver. An exponentially weighted moving average of per
> fragment service times is used to restrict queueing delays in hopes
> of achieving a target fragment transmission latency.
>
> Signed-off-by: John W. Linville <linville@tuxdriver.com>
> ---
> v1 -> v2:
> - execute algorithm separately for each WMM queue
> - change ewma scaling parameters
> - calculate max queue len only when new latency data is received
> - stop queues when occupancy limit is reached rather than dropping
> - use skb->destructor for tracking queue occupancy
>
> Johannes' comment about tx status reporting being unreliable (and what
> he was really saying) finally sunk-in. So, this version uses
> skb->destructor to track in-flight fragments. That should handle
> fragments that get silently dropped in the driver for whatever reason
> without leaking queue capacity. Correct me if I'm wrong!
Yeah, I had that idea as well. Could unify the existing skb_orphan()
call though :-)
However, Nathaniel is right -- if the skb is freed right away during
tx() you kinda estimate its queue time to be virtually zero. That
doesn't make a lot of sense and might in certain conditions exacerbate
the problem, for example if the system is out of memory more packets
might be allowed through than in normal operation etc.
Also, for some USB drivers I believe SKB lifetime has no relation to
queue size at all because the data is just shuffled into an URB. I'm not
sure we can solve this generically. I'm not really sure how this works
for USB drivers, I think they queue up frames with the HCI controller
rather than directly with the device.
Finally, this isn't taking into account any of the issues about
aggregation and AP mode. Remember that both with multiple streams (on
different ACs) and even more so going to different stations
(AP/IBSS/mesh modes, and likely soon even in STA mode with (T)DLS, and
let's not forget 11ac/ad) there may be vast differences in the time
different frames spend on a queue which are not just due to bloated
queues. I'm concerned about this since none of it has been taken into
account in the paper you're basing this on, all evaluations seem to be
pretty much based on a single traffic stream.
Overall, I think there should be some more research first. This might
help in some cases, but do we know it won't completely break throughput
in other cases?
johannes
* Re: [RFC v2] mac80211: implement eBDP algorithm to fight bufferbloat
2011-02-21 15:28 ` Johannes Berg
@ 2011-02-21 16:12 ` Jim Gettys
2011-02-21 19:15 ` John W. Linville
2011-02-21 19:06 ` John W. Linville
1 sibling, 1 reply; 23+ messages in thread
From: Jim Gettys @ 2011-02-21 16:12 UTC (permalink / raw)
To: bloat-devel; +Cc: Dan Williams, Javier Cardona, Dave Woodhouse
On 02/21/2011 10:28 AM, Johannes Berg wrote:
> On Fri, 2011-02-18 at 16:21 -0500, John W. Linville wrote:
>> This is an implementation of the eBDP algorithm as documented in
>> Section IV of "Buffer Sizing for 802.11 Based Networks" by Tianji Li,
>> et al.
>>
>> http://www.hamilton.ie/tianji_li/buffersizing.pdf
>>
>> This implementation timestamps an skb before handing it to the
>> hardware driver, then computes the service time when the frame is
>> freed by the driver. An exponentially weighted moving average of per
>> fragment service times is used to restrict queueing delays in hopes
>> of achieving a target fragment transmission latency.
>>
>> Signed-off-by: John W. Linville <linville@tuxdriver.com>
>> ---
>> v1 -> v2:
>> - execute algorithm separately for each WMM queue
>> - change ewma scaling parameters
>> - calculate max queue len only when new latency data is received
>> - stop queues when occupancy limit is reached rather than dropping
>> - use skb->destructor for tracking queue occupancy
>>
>> Johannes' comment about tx status reporting being unreliable (and what
>> he was really saying) finally sunk-in. So, this version uses
>> skb->destructor to track in-flight fragments. That should handle
>> fragments that get silently dropped in the driver for whatever reason
>> without leaking queue capacity. Correct me if I'm wrong!
>
> Yeah, I had that idea as well. Could unify the existing skb_orphan()
> call though :-)
>
> However, Nathaniel is right -- if the skb is freed right away during
> tx() you kinda estimate its queue time to be virtually zero. That
> doesn't make a lot of sense and might in certain conditions exacerbate
> the problem, for example if the system is out of memory more packets
> might be allowed through than in normal operation etc.
>
> Also, for some USB drivers I believe SKB lifetime has no relation to
> queue size at all because the data is just shuffled into an URB. I'm not
> sure we can solve this generically. I'm not really sure how this works
> for USB drivers, I think they queue up frames with the HCI controller
> rather than directly with the device.
Let me give a concrete example:
I checked with Javier Cardona about the Marvell module (libertas driver)
used on OLPC a couple months ago.
It turns out there are 4 packets of buffering out in the wireless module
itself (clearly needed for autonomous forwarding).
There is also one packet buffer in the device driver itself; Dave
Woodhouse says it simplified the driver greatly.
I don't know if anyone has been thinking about how to manage the
buffering from top to bottom, with devices that may do internal
buffering in various places.
>
> Finally, this isn't taking into account any of the issues about
> aggregation and AP mode. Remember that both with multiple streams (on
> different ACs) and even more so going to different stations
> (AP/IBSS/mesh modes, and likely soon even in STA mode with (T)DLS, and
> let's not forget 11ac/ad) there may be vast differences in the time
> different frames spend on a queue which are not just due to bloated
> queues. I'm concerned about this since none of it has been taken into
> account in the paper you're basing this on, all evaluations seem to be
> pretty much based on a single traffic stream.
>
> Overall, I think there should be some more research first. This might
> help in some cases, but do we know it won't completely break throughput
> in other cases?
>
- Jim
* Re: [RFC v2] mac80211: implement eBDP algorithm to fight bufferbloat
2011-02-21 16:12 ` Jim Gettys
@ 2011-02-21 19:15 ` John W. Linville
0 siblings, 0 replies; 23+ messages in thread
From: John W. Linville @ 2011-02-21 19:15 UTC (permalink / raw)
To: Jim Gettys; +Cc: Dan Williams, Javier Cardona, bloat-devel, Dave Woodhouse
On Mon, Feb 21, 2011 at 11:12:14AM -0500, Jim Gettys wrote:
> On 02/21/2011 10:28 AM, Johannes Berg wrote:
> >However, Nathaniel is right -- if the skb is freed right away during
> >tx() you kinda estimate its queue time to be virtually zero. That
> >doesn't make a lot of sense and might in certain conditions exacerbate
> >the problem, for example if the system is out of memory more packets
> >might be allowed through than in normal operation etc.
For those only reading the bloat lists, I replied elsewhere to clarify
that the present patch is only calculating max_enqueued for frames
that result in a tx status report (i.e. not for dropped frames) --
just in case that detail is somehow relevant...
> >Also, for some USB drivers I believe SKB lifetime has no relation to
> >queue size at all because the data is just shuffled into an URB. I'm not
> >sure we can solve this generically. I'm not really sure how this works
> >for USB drivers, I think they queue up frames with the HCI controller
> >rather than directly with the device.
>
> Let me give a concrete example:
>
> I checked with Javier Cardona about the Marvell module (libertas
> driver) used on OLPC a couple months ago.
>
> It turns out there are 4 packets of buffering out in the wireless
> module itself (clearly needed for autonomous forwarding).
>
> There is also one packet buffer in the device driver itself; Dave
> Woodhouse says it simplified the driver greatly.
>
> I don't know if anyone has been thinking about how to manage the
> buffering from top to bottom, with devices that may do internal
> buffering in various places.
(FWIW, my current patch won't affect the libertas driver...)
The role I see my patch playing is to evaluate how good a job the
driver/device is doing with it's own buffers. If it is keeping its
latency low, then I will allow it more fragments to buffer. If its
latency grows too large, I throttle the number of fragments I allow
the driver to see. To a large degree, it doesn't matter how much
buffering the driver/device is doing so long as it is moving the
frames along quickly.
John
--
John W. Linville Someday the world will need a hero, and you
linville@tuxdriver.com might be all we have. Be ready.
* Re: [RFC v2] mac80211: implement eBDP algorithm to fight bufferbloat
2011-02-21 15:28 ` Johannes Berg
2011-02-21 16:12 ` Jim Gettys
@ 2011-02-21 19:06 ` John W. Linville
2011-02-21 19:29 ` [RFC v2] mac80211: implement eBDP algorithm to fight bufferbloat - AQM on hosts Jim Gettys
` (2 more replies)
1 sibling, 3 replies; 23+ messages in thread
From: John W. Linville @ 2011-02-21 19:06 UTC (permalink / raw)
To: Johannes Berg; +Cc: bloat-devel, linux-wireless, Nathaniel J. Smith
On Mon, Feb 21, 2011 at 04:28:06PM +0100, Johannes Berg wrote:
> On Fri, 2011-02-18 at 16:21 -0500, John W. Linville wrote:
> > This is an implementation of the eBDP algorithm as documented in
> > Section IV of "Buffer Sizing for 802.11 Based Networks" by Tianji Li,
> > et al.
> >
> > http://www.hamilton.ie/tianji_li/buffersizing.pdf
> >
> > This implementation timestamps an skb before handing it to the
> > hardware driver, then computes the service time when the frame is
> > freed by the driver. An exponentially weighted moving average of per
> > fragment service times is used to restrict queueing delays in hopes
> > of achieving a target fragment transmission latency.
> >
> > Signed-off-by: John W. Linville <linville@tuxdriver.com>
> > ---
> > v1 -> v2:
> > - execute algorithm separately for each WMM queue
> > - change ewma scaling parameters
> > - calculate max queue len only when new latency data is received
> > - stop queues when occupancy limit is reached rather than dropping
> > - use skb->destructor for tracking queue occupancy
> >
> > Johannes' comment about tx status reporting being unreliable (and what
> > he was really saying) finally sunk-in. So, this version uses
> > skb->destructor to track in-flight fragments. That should handle
> > fragments that get silently dropped in the driver for whatever reason
> > without leaking queue capacity. Correct me if I'm wrong!
>
> Yeah, I had that idea as well. Could unify the existing skb_orphan()
> call though :-)
The one in ieee80211_skb_resize? Any idea how that would look?
> However, Nathaniel is right -- if the skb is freed right away during
> tx() you kinda estimate its queue time to be virtually zero. That
> doesn't make a lot of sense and might in certain conditions exacerbate
> the problem, for example if the system is out of memory more packets
> might be allowed through than in normal operation etc.
As in my reply to Nathaniel, please notice that the timing estimate
(and the max_enqueued calculation) only happens for frames that result
in a tx status report -- at least for now...
However, if this were generalized beyond mac80211 then we wouldn't
be able to rely on tx status reports. I can see that dropping frames
in the driver would lead to timing estimates that would cascade into
a wide-open queue size. But I'm not sure that would be a big deal,
since in the long run those dropped frames should still result in IP
cwnd reductions, etc...?
> Also, for some USB drivers I believe SKB lifetime has no relation to
> queue size at all because the data is just shuffled into an URB. I'm not
> sure we can solve this generically. I'm not really sure how this works
> for USB drivers, I think they queue up frames with the HCI controller
> rather than directly with the device.
How do you think the time spent handling URBs in the USB stack relates
to the time spent transmitting frames? At what point do those SKBs
get freed?
> Finally, this isn't taking into account any of the issues about
> aggregation and AP mode. Remember that both with multiple streams (on
> different ACs) and even more so going to different stations
> (AP/IBSS/mesh modes, and likely soon even in STA mode with (T)DLS, and
> let's not forget 11ac/ad) there may be vast differences in the time
> different frames spend on a queue which are not just due to bloated
> queues. I'm concerned about this since none of it has been taken into
> account in the paper you're basing this on, all evaluations seem to be
> pretty much based on a single traffic stream.
Yeah, I'm still not sure we all have our heads around these issues.
I mean, on the one hand it seems wrong to limit queueing for one
stream or station just because some other stream or station has
higher latency. But on the other hand, it seems to me that those
streams/stations still have to share the same link and that higher
real latency for one stream/station could still result in a higher
perceived latency for another stream/station sharing the same link,
since they still have to share the same air...no?
> Overall, I think there should be some more research first. This might
> help in some cases, but do we know it won't completely break throughput
> in other cases?
That's why it is posted RFC, of course. :-)
John
--
John W. Linville Someday the world will need a hero, and you
linville@tuxdriver.com might be all we have. Be ready.
* Re: [RFC v2] mac80211: implement eBDP algorithm to fight bufferbloat - AQM on hosts.
2011-02-21 19:06 ` John W. Linville
@ 2011-02-21 19:29 ` Jim Gettys
2011-02-21 20:26 ` [RFC v2] mac80211: implement eBDP algorithm to fight bufferbloat Tianji Li
2011-02-28 13:07 ` Johannes Berg
2 siblings, 0 replies; 23+ messages in thread
From: Jim Gettys @ 2011-02-21 19:29 UTC (permalink / raw)
To: bloat-devel, bloat
On 02/21/2011 02:06 PM, John W. Linville wrote:
> On Mon, Feb 21, 2011 at 04:28:06PM +0100, Johannes Berg wrote:
>> On Fri, 2011-02-18 at 16:21 -0500, John W. Linville wrote:
>>> This is an implementation of the eBDP algorithm as documented in
>>> Section IV of "Buffer Sizing for 802.11 Based Networks" by Tianji Li,
>>> et al.
>>>
>>> http://www.hamilton.ie/tianji_li/buffersizing.pdf
>>>
>>> This implementation timestamps an skb before handing it to the
>>> hardware driver, then computes the service time when the frame is
>>> freed by the driver. An exponentially weighted moving average of per
>>> fragment service times is used to restrict queueing delays in hopes
>>> of achieving a target fragment transmission latency.
>>>
>>> Signed-off-by: John W. Linville <linville@tuxdriver.com>
>>> ---
>>> v1 -> v2:
>>> - execute algorithm separately for each WMM queue
>>> - change ewma scaling parameters
>>> - calculate max queue len only when new latency data is received
>>> - stop queues when occupancy limit is reached rather than dropping
>>> - use skb->destructor for tracking queue occupancy
>>>
>>> Johannes' comment about tx status reporting being unreliable (and what
>>> he was really saying) finally sunk-in. So, this version uses
>>> skb->destructor to track in-flight fragments. That should handle
>>> fragments that get silently dropped in the driver for whatever reason
>>> without leaking queue capacity. Correct me if I'm wrong!
>>
>> Yeah, I had that idea as well. Could unify the existing skb_orphan()
>> call though :-)
>
> The one in ieee80211_skb_resize? Any idea how that would look?
>
>> However, Nathaniel is right -- if the skb is freed right away during
>> tx() you kinda estimate its queue time to be virtually zero. That
>> doesn't make a lot of sense and might in certain conditions exacerbate
>> the problem, for example if the system is out of memory more packets
>> might be allowed through than in normal operation etc.
>
> As in my reply to Nathaniel, please notice that the timing estimate
> (and the max_enqueued calculation) only happens for frames that result
> in a tx status report -- at least for now...
>
> However, if this were generalized beyond mac80211 then we wouldn't
> be able to rely on tx status reports. I can see that dropping frames
> in the driver would lead to timing estimates that would cascade into
> a wide-open queue size. But I'm not sure that would be a big deal,
> since in the long run those dropped frames should still result in IP
> cwnd reductions, etc...?
>
>> Also, for some USB drivers I believe SKB lifetime has no relation to
>> queue size at all because the data is just shuffled into an URB. I'm not
>> sure we can solve this generically. I'm not really sure how this works
>> for USB drivers, I think they queue up frames with the HCI controller
>> rather than directly with the device.
>
> How do you think the time spent handling URBs in the USB stack relates
> to the time spent transmitting frames? At what point do those SKBs
> get freed?
>
>> Finally, this isn't taking into account any of the issues about
>> aggregation and AP mode. Remember that both with multiple streams (on
>> different ACs) and even more so going to different stations
>> (AP/IBSS/mesh modes, and likely soon even in STA mode with (T)DLS, and
>> let's not forget 11ac/ad) there may be vast differences in the time
>> different frames spend on a queue which are not just due to bloated
>> queues. I'm concerned about this since none of it has been taken into
>> account in the paper you're basing this on, all evaluations seem to be
>> pretty much based on a single traffic stream.
>
> Yeah, I'm still not sure we all have our heads around these issues.
> I mean, on the one hand it seems wrong to limit queueing for one
> stream or station just because some other stream or station is
> higher latency. But on the other hand, it seems to me that those
> streams/stations still have to share the same link and that higher
> real latency for one stream/station could still result in a higher
> perceived latency for another stream/station sharing the same link,
> since they still have to share the same air...no?
>
I've been thinking for a while that some sort of AQM may in fact be
necessary even on hosts. If a particular flow backs up on the host,
it will have more and more packets queued, and RED or other AQM
algorithms will start to kick in and push back on the flows involved
so they don't constipate the other flows to other hosts.
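As a minimal illustration of that push-back, here is a userspace
sketch of the classic RED drop-probability ramp; the thresholds and
max_p here are made-up parameters, and real RED additionally smooths
the queue length with a moving average and spaces drops apart:

#include <stdbool.h>
#include <stdlib.h>

/* RED-style early drop: as the (smoothed) queue depth climbs between
 * min_th and max_th, drop or mark packets with linearly increasing
 * probability, so elastic flows back off before the queue fills. */
static bool red_should_drop(double avg_qlen, double min_th,
                            double max_th, double max_p)
{
        if (avg_qlen < min_th)
                return false;           /* below threshold: never drop */
        if (avg_qlen >= max_th)
                return true;            /* above threshold: always drop */
        /* linear probability ramp between the two thresholds */
        double p = max_p * (avg_qlen - min_th) / (max_th - min_th);
        return ((double)rand() / RAND_MAX) < p;
}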
I'm widening this thread to the main bloat list to ensure that the
widest audience sees it.
- Jim
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [RFC v2] mac80211: implement eBDP algorithm to fight bufferbloat
2011-02-21 19:06 ` John W. Linville
2011-02-21 19:29 ` [RFC v2] mac80211: implement eBDP algorithm to fight bufferbloat - AQM on hosts Jim Gettys
@ 2011-02-21 20:26 ` Tianji Li
2011-02-28 13:07 ` Johannes Berg
2 siblings, 0 replies; 23+ messages in thread
From: Tianji Li @ 2011-02-21 20:26 UTC (permalink / raw)
To: John W. Linville, Johannes Berg, Nathaniel J. Smith
Cc: linux-wireless, bloat-devel
On 02/21/2011 01:06 PM, John W. Linville wrote:
> On Mon, Feb 21, 2011 at 04:28:06PM +0100, Johannes Berg wrote:
>> On Fri, 2011-02-18 at 16:21 -0500, John W. Linville wrote:
>>> This is an implementation of the eBDP algorithm as documented in
>>> Section IV of "Buffer Sizing for 802.11 Based Networks" by Tianji Li,
>>> et al.
>>>
>>> http://www.hamilton.ie/tianji_li/buffersizing.pdf
>>>
>>> This implementation timestamps an skb before handing it to the
>>> hardware driver, then computes the service time when the frame is
>>> freed by the driver. An exponentially weighted moving average of per
>>> fragment service times is used to restrict queueing delays in hopes
>>> of achieving a target fragment transmission latency.
>>>
>>> Signed-off-by: John W. Linville<linville@tuxdriver.com>
>>> ---
>>> v1 -> v2:
>>> - execute algorithm separately for each WMM queue
>>> - change ewma scaling parameters
>>> - calculate max queue len only when new latency data is received
>>> - stop queues when occupancy limit is reached rather than dropping
>>> - use skb->destructor for tracking queue occupancy
>>>
>>> Johannes' comment about tx status reporting being unreliable (and what
>>> he was really saying) finally sunk in. So, this version uses
>>> skb->destructor to track in-flight fragments. That should handle
>>> fragments that get silently dropped in the driver for whatever reason
>>> without leaking queue capacity. Correct me if I'm wrong!
>>
>> Yeah, I had that idea as well. Could unify the existing skb_orphan()
>> call though :-)
>
> The one in ieee80211_skb_resize? Any idea how that would look?
>
>> However, Nathaniel is right -- if the skb is freed right away during
>> tx() you kinda estimate its queue time to be virtually zero. That
>> doesn't make a lot of sense and might in certain conditions exacerbate
>> the problem, for example if the system is out of memory more packets
>> might be allowed through than in normal operation etc.
>
> As in my reply to Nathaniel, please notice that the timing estimate
> (and the max_enqueued calculation) only happens for frames that result
> in a tx status report -- at least for now...
>
> However, if this were generalized beyond mac80211 then we wouldn't
> be able to rely on tx status reports. I can see that dropping frames
> in the driver would lead to timing estimates that would cascade into
> a wide-open queue size. But I'm not sure that would be a big deal,
> since in the long run those dropped frames should still result in IP
> cwnd reductions, etc...?
>
>> Also, for some USB drivers I believe SKB lifetime has no relation to
>> queue size at all because the data is just shuffled into an URB. I'm not
>> sure we can solve this generically. I'm not really sure how this works
>> for USB drivers, I think they queue up frames with the HCI controller
>> rather than directly with the device.
>
> How do you think the time spent handling URBs in the USB stack relates
> to the time spent transmitting frames? At what point do those SKBs
> get freed?
>
>> Finally, this isn't taking into account any of the issues about
>> aggregation and AP mode. Remember that both with multiple streams (on
>> different ACs) and even more so going to different stations
>> (AP/IBSS/mesh modes, and likely soon even in STA mode with (T)DLS, and
>> let's not forget 11ac/ad) there may be vast differences in the time
>> different frames spend on a queue which are not just due to bloated
>> queues. I'm concerned about this since none of it has been taken into
>> account in the paper you're basing this on, all evaluations seem to be
>> pretty much based on a single traffic stream.
>
> Yeah, I'm still not sure we all have our heads around these issues.
> I mean, on the one hand it seems wrong to limit queueing for one
> stream or station just because some other stream or station is
> higher latency. But on the other hand, it seems to me that those
> streams/stations still have to share the same link and that higher
> real latency for one stream/station could still result in a higher
> perceived latency for another stream/station sharing the same link,
> since they still have to share the same air...no?
This is a good point.
A buffer builds up when there are long-lived TCP flows. They can clog
the buffer because they are elastic, in the sense that they send more
packets as previous ones are acknowledged, which means that if a flow
is long, lots of packets arrive at the buffer almost at the same time.
If the buffer is large, the wait to be serviced can be long. This is
fine for long-lived flows: when we download a large file, we do not
much care whether it finishes in 2 minutes or 3. However, for a couple
of quick email checks, no one can tolerate a 'fresh' click taking the
extra 3-2=1 minute.
To mitigate this, we shorten the buffers (by dropping) so that the
wait can be shorter. Since the long-lived flows dominate, drops are
much more likely to hit them, though some packets from short-lived
flows can be dropped too if they are unlucky. Still, thanks to the
elastic (it is both bad and good :-)) nature of TCP, drops on the
long-lived flows make them back off, which gives time to the short
ones.
If a buffer is not used by elastic traffic, there is no need for this
kind of buffering. (Note that UDP can be elastic as well: the
application layer above UDP normally has some logic to back off if the
wait grows too long.)
In the base 802.11 standard there is only one queue, while in
802.11e/n there are four, and by default only one of them is used for
TCP (though this can be changed). There are some other queues in the
driver for control purposes, but they do not count here.
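For concreteness, the four access categories roughly as mac80211
numbers them (ordinary TCP data normally lands on the best-effort
queue unless DSCP marking maps it elsewhere):

/* WMM access categories as numbered in mac80211; plain TCP traffic
 * is normally queued on IEEE80211_AC_BE. */
enum ieee80211_ac_numbers {
        IEEE80211_AC_VO = 0,    /* voice */
        IEEE80211_AC_VI = 1,    /* video */
        IEEE80211_AC_BE = 2,    /* best effort -- default for TCP */
        IEEE80211_AC_BK = 3,    /* background */
};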
In our paper, we did buffer sizing on the 802.11e/n TCP queue only.
For plain 802.11, we would need a few buffers on top of the standard
one to mimic those of 802.11e, and then apply sizing to the TCP buffer
only. The scheduling of which queues should be active at the MAC layer
is a separate issue, which cannot be solved with the sizing logic.
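To make the sizing step concrete, here is a rough sketch of the eBDP
idea as discussed in this thread: an EWMA of observed per-fragment
service times is turned into an occupancy limit per WMM queue. The
2 ms target and the +5 headroom are the tunables under discussion, and
the struct and function names here are illustrative, not the patch's:

#include <linux/average.h>
#include <linux/ktime.h>

struct ebdp_queue {
        struct ewma tserv_ns_avg;       /* smoothed service time, ns;
                                           assumes ewma_init() was called
                                           with suitable scaling factors */
        int max_enqueued;               /* current occupancy limit */
};

/* Called only when a tx status report yields a fresh service time,
 * so the limit is recalculated only on new latency data. */
static void ebdp_update_limit(struct ebdp_queue *q, unsigned long tserv_ns)
{
        unsigned long avg;

        ewma_add(&q->tserv_ns_avg, tserv_ns);
        avg = ewma_read(&q->tserv_ns_avg);
        if (avg)        /* target latency / avg service time + headroom */
                q->max_enqueued = 2 * NSEC_PER_MSEC / avg + 5;
}

When the tracked occupancy exceeds q->max_enqueued, v2 stops the queue
rather than dropping, per the changelog quoted above.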
AQM may or may not be the better approach, but the real issue is that
it is not enabled even though it is so well known.
My 2 cents,
Tianji
>
>> Overall, I think there should be some more research first. This might
>> help in some cases, but do we know it won't completely break throughput
>> in other cases?
>
> That's why it is posted RFC, of course. :-)
>
> John
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [RFC v2] mac80211: implement eBDP algorithm to fight bufferbloat
2011-02-21 19:06 ` John W. Linville
2011-02-21 19:29 ` [RFC v2] mac80211: implement eBDP algorithm to fight bufferbloat - AQM on hosts Jim Gettys
2011-02-21 20:26 ` [RFC v2] mac80211: implement eBDP algorithm to fight bufferbloat Tianji Li
@ 2011-02-28 13:07 ` Johannes Berg
2 siblings, 0 replies; 23+ messages in thread
From: Johannes Berg @ 2011-02-28 13:07 UTC (permalink / raw)
To: John W. Linville; +Cc: bloat-devel, linux-wireless, Nathaniel J. Smith
On Mon, 2011-02-21 at 14:06 -0500, John W. Linville wrote:
> > Yeah, I had that idea as well. Could unify the existing skb_orphan()
> > call though :-)
>
> The one in ieee80211_skb_resize? Any idea how that would look?
Yeah. I think it'd have to be moved out of _skb_resize and made
unconditional in that path, since eventually with this patch you'd do it
anyway.
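A sketch of what that unconditional orphan-plus-destructor tracking
might look like (some_counter_for() is purely illustrative; the v2
patch presumably keeps similar per-queue counters in mac80211's
per-interface data):

#include <linux/skbuff.h>

/* Illustrative: returns the per-queue in-flight counter for this skb. */
static atomic_t *some_counter_for(struct sk_buff *skb);

/* Runs whenever the driver frees the frame -- even on a silent drop --
 * so the in-flight accounting cannot leak. */
static void ebdp_skb_destructor(struct sk_buff *skb)
{
        atomic_dec(some_counter_for(skb));
        /* wake the matching queue here if occupancy fell below the limit */
}

static void ebdp_track(struct sk_buff *skb)
{
        skb_orphan(skb);        /* detach from socket accounting first */
        skb->destructor = ebdp_skb_destructor;
        atomic_inc(some_counter_for(skb));
}

Note the skb_orphan() step is exactly what raises the socket-buffer-
limits concern further down in this message.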
> As in my reply to Nathaniel, please notice that the timing estimate
> (and the max_enqueued calculation) only happens for frames that result
> in a tx status report -- at least for now...
Oops, right.
> However, if this were generalized beyond mac80211 then we wouldn't
> be able to rely on tx status reports. I can see that dropping frames
> in the driver would lead to timing estimates that would cascade into
> a wide-open queue size. But I'm not sure that would be a big deal,
> since in the long run those dropped frames should still result in IP
> cwnd reductions, etc...?
I don't think we can generically rely on skb_orphan() in the network
stack since that will make socket buffer limits meaningless. In fact, it
pains me a bit that we had to do this in wireless before buffering the
skb, and doing it unconditionally may be worse?
> How do you think the time spent handling URBs in the USB stack relates
> to the time spent transmitting frames? At what point do those SKBs
> get freed?
I honestly don't know. I would hope they are only freed when the URB was
processed (i.e. at least DMA'd to the target device) but I suppose a
driver might also copy the TX frame completely.
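For reference, the usual transmit pattern in USB network drivers looks
roughly like this (simplified, and details vary per driver), which is
why the skb normally lives until the URB completes, i.e. until the
data has at least been handed off by the host controller:

#include <linux/usb.h>
#include <linux/skbuff.h>

static void tx_complete(struct urb *urb)
{
        struct sk_buff *skb = urb->context;

        dev_kfree_skb_any(skb);         /* skb freed only on URB completion */
        usb_free_urb(urb);
}

static int tx_submit(struct usb_device *udev, unsigned int pipe,
                     struct sk_buff *skb)
{
        struct urb *urb = usb_alloc_urb(0, GFP_ATOMIC);
        int ret;

        if (!urb)
                return -ENOMEM;
        usb_fill_bulk_urb(urb, udev, pipe, skb->data, skb->len,
                          tx_complete, skb);
        ret = usb_submit_urb(urb, GFP_ATOMIC);
        if (ret)
                usb_free_urb(urb);
        return ret;
}

A driver that instead copies skb->data into its own buffer and frees
the skb immediately would defeat the timing estimate entirely.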
> Yeah, I'm still not sure we all have our heads around these issues.
> I mean, on the one hand it seems wrong to limit queueing for one
> stream or station just because some other stream or station is
> higher latency. But on the other hand, it seems to me that those
> streams/stations still have to share the same link and that higher
> real latency for one stream/station could still result in a higher
> perceived latency for another stream/station sharing the same link,
> since they still have to share the same air...no?
Yeah, but retries (robustness) and aggregation (throughput) will
invariably affect latency for everybody else using the shared medium. I
suppose it would be better if queueing were limited to a certain
amount of air time used *per peer station*, so that each connection
could have fairly low latency. However, this seems much harder to do.
But what could happen here is that bursty traffic to a far-away (slow)
station severely affects latency for a closer station, and does so
precisely because the high traffic to that closer station is what
caused the buffering to increase in the first place.
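A purely hypothetical sketch of that per-station airtime limiting
(nothing like this exists in mac80211 at the time of writing; the
names and the scheme are illustrative):

#include <linux/types.h>

/* Charge each station for the estimated on-air time of its queued
 * frames and refuse further queueing past a per-station cap, so a
 * slow peer cannot inflate buffering and latency for a fast one. */
struct sta_airtime {
        u64 pending_ns;         /* estimated airtime of frames in flight */
        u64 limit_ns;           /* per-station cap, e.g. a few ms */
};

static bool sta_may_enqueue(struct sta_airtime *sa, u64 frame_airtime_ns)
{
        if (sa->pending_ns + frame_airtime_ns > sa->limit_ns)
                return false;   /* throttle this station only */
        sa->pending_ns += frame_airtime_ns;
        return true;
}

The airtime estimate would itself have to come from the rate control
state, which is part of why this is "much harder to do".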
johannes
^ permalink raw reply [flat|nested] 23+ messages in thread