From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <kyan@google.com>
Received: from mail-lj1-x241.google.com (mail-lj1-x241.google.com
 [IPv6:2a00:1450:4864:20::241])
 (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
 (No client certificate requested)
 by lists.bufferbloat.net (Postfix) with ESMTPS id 9999E3B29E
 for <make-wifi-fast@lists.bufferbloat.net>;
 Fri, 20 Sep 2019 18:38:24 -0400 (EDT)
Received: by mail-lj1-x241.google.com with SMTP id n14so3475729ljj.10
 for <make-wifi-fast@lists.bufferbloat.net>;
 Fri, 20 Sep 2019 15:38:24 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025;
 h=mime-version:references:in-reply-to:from:date:message-id:subject:to
 :cc; bh=6064d4ukzWinn4d0IqFHYJ5AtGV54DjV7e5iAGFbDR0=;
 b=QOD0arVUg5X7pKTgMqqmn5IFUrvrtMp3GLcf8R1sz4VGMRIe5fV3XIQwthxCyG6IVT
 FMWIYCBlgzfRb4Q49m/oVVC1cN9fOlwZc8A3OM/YrsBEDGBnSV063//1qLj0/3KwAcz4
 vWpmNTLs5f2BOFsEOwqAw68rOS2XeX0mz7AGzg8flbVmfPlqgWSGuone4CZdbekx+1tx
 h386HQNovVDiSqtYcQVNK2BCWkzRI0yVd8QVjBOhkVwvmYNg9MI9nPw3l4Ni0HfHBNge
 8wI0UxmV13SLJaY+RJPkJnBRp/ysv3soWs7Ui96rhEoVhAmnCiukJVjruBIt3nj1mouL
 bjkA==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:mime-version:references:in-reply-to:from:date
 :message-id:subject:to:cc;
 bh=6064d4ukzWinn4d0IqFHYJ5AtGV54DjV7e5iAGFbDR0=;
 b=X6exTod24gf2ak6AUrjY3+Li7wOqCKWtgYjek+g5UX7DAreE4VQKU8/BNA5oeq5wGG
 2ck86Ik3QgMVP4GyxicV6M8b0AchvwLm63sdxHjTIFfdZw6hgAZN9LYSkvxFFjIY7LaQ
 PVA9+YgjnKXkraxUmxRM2s3i/eOJ8ucF4w4xr0vMrSGQmKSVdFRLBn1i4vwcKNf2125O
 6usl1w2zMiJczx7f17/x/BAyV0WsSOh8Pp20+JdRR605+sh0LyP1HWphrdK4FqK2rtgQ
 jsI0jXorG59ZM9aU1YHWfJqsXKH/dto4/7A7gskH03qTdhNnui7vE6aNMlnZ7XdOSi8E
 YZaQ==
X-Gm-Message-State: APjAAAXem5fZ6T5ZTpPEENoZLPmtXSkG1smVdoHszR42H+6Crz12Lxq1
 KZnk3ykrAtQOE8sQUf5fq/tLkIv0eHiH9be1qShDJA==
X-Google-Smtp-Source: APXvYqy7Gtj//CtPaOsh8olChwmX7bOXJ6SgWIlerlEqbO1/VTs1+KUnR93odfSLvd85QwHezwMOV+GUKF88M1SF2Hs=
X-Received: by 2002:a2e:5456:: with SMTP id y22mr6831526ljd.60.1569019103067; 
 Fri, 20 Sep 2019 15:38:23 -0700 (PDT)
MIME-Version: 1.0
References: <156889576422.191202.5906619710809654631.stgit@alrua-x1>
 <156889576869.191202.510507546538322707.stgit@alrua-x1>
 <20190920120639.GA6456@localhost.localdomain>
 <87k1a39lgt.fsf@toke.dk> <20190920130604.GB6456@localhost.localdomain>
 <87h8579jpj.fsf@toke.dk>
In-Reply-To: <87h8579jpj.fsf@toke.dk>
From: Kan Yan <kyan@google.com>
Date: Fri, 20 Sep 2019 15:38:11 -0700
Message-ID: <CA+iem5uFmOkgQriSjra05pzigXb_Akz0Wy4B3C8aFxD30S5q=g@mail.gmail.com>
To: =?UTF-8?B?VG9rZSBIw7hpbGFuZC1Kw7hyZ2Vuc2Vu?= <toke@redhat.com>
Cc: Lorenzo Bianconi <lorenzo@kernel.org>,
 Johannes Berg <johannes@sipsolutions.net>, 
 linux-wireless@vger.kernel.org, make-wifi-fast@lists.bufferbloat.net, 
 John Crispin <john@phrozen.org>, Felix Fietkau <nbd@nbd.name>
Content-Type: multipart/alternative; boundary="00000000000049266d059303ba43"
Subject: Re: [Make-wifi-fast] [PATCH RFC/RFT 4/4] mac80211: Apply
 Airtime-based Queue Limit (AQL) on packet dequeue
X-BeenThere: make-wifi-fast@lists.bufferbloat.net
X-Mailman-Version: 2.1.20
Precedence: list
List-Id: <make-wifi-fast.lists.bufferbloat.net>
List-Unsubscribe: <https://lists.bufferbloat.net/options/make-wifi-fast>,
 <mailto:make-wifi-fast-request@lists.bufferbloat.net?subject=unsubscribe>
List-Archive: <https://lists.bufferbloat.net/pipermail/make-wifi-fast>
List-Post: <mailto:make-wifi-fast@lists.bufferbloat.net>
List-Help: <mailto:make-wifi-fast-request@lists.bufferbloat.net?subject=help>
List-Subscribe: <https://lists.bufferbloat.net/listinfo/make-wifi-fast>,
 <mailto:make-wifi-fast-request@lists.bufferbloat.net?subject=subscribe>
X-List-Received-Date: Fri, 20 Sep 2019 22:38:25 -0000

--00000000000049266d059303ba43
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Hi Toke,

There is an updated version of AQL in the chromiumos tree. It is implemented
in mac80211 rather than in the ath10k driver, as the original version was:

https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/=
1703105/7

https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/=
1703106/6

It is based on a more recent kernel (4.14) and is integrated with the
airtime fairness TX scheduler in mac80211. This version has been tested
rather extensively. I intended to use it as the basis for my effort to bring
AQL upstream, but got sidetracked by other things. I can clean it up and
send a patchset next week if you think that is the right path. Sorry for the
long delay and for slacking off on the upstream effort.

There are some concerns in this thread regarding the accuracy of the
airtime estimated from the last reported TX rate. It is indeed a rather
crude method, and it does not include retries in the calculation. Besides,
there is a lag between the firmware changing the rate and the host driver
getting the rate update. The 16 us IFS overhead is only correct for 5 GHz;
it is actually 10 us for 2.4 GHz. However, that hardly matters. The goal of
AQL is to prevent the firmware/hardware queue from getting bloated or
starved. AQL doesn't control the fine-grained TX packet scheduling; that is
handled by the airtime fairness scheduler and ultimately by the firmware.
There is a lot of headroom in the queue length limit (8-10 ms) to tolerate
inaccuracy in the estimated airtime.
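
To put rough numbers on that, here is a minimal sketch of the kind of
estimate being discussed. It is not the code from the chromiumos patches:
the function name, the constant names and the rate unit (kbit/s) are
assumptions made up for illustration, and retries and aggregation are
ignored on purpose.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical constants for illustration only. */
#define AQL_IFS_US_5GHZ 16u /* IFS overhead on 5 GHz */
#define AQL_IFS_US_2GHZ 10u /* IFS overhead on 2.4 GHz */

/*
 * Estimate the airtime of one frame, in microseconds, from the last
 * reported TX rate. Retries and aggregation are deliberately ignored.
 */
static uint32_t aql_estimate_airtime_us(uint32_t len_bytes,
                                        uint32_t rate_kbps,
                                        int is_5ghz)
{
    uint64_t bits = (uint64_t)len_bytes * 8;
    uint32_t ifs = is_5ghz ? AQL_IFS_US_5GHZ : AQL_IFS_US_2GHZ;

    assert(rate_kbps > 0); /* caller must have a valid rate */

    /* bits / (kbit/s) gives ms; scale to us first, rounding up */
    return (uint32_t)((bits * 1000 + rate_kbps - 1) / rate_kbps) + ifs;
}
```

Under these assumptions a 1500-byte frame at 100 Mbit/s comes out at 136 us
on 5 GHz and 130 us on 2.4 GHz. The 6 us difference is negligible next to a
limit of 8 ms or more.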

There are two TX airtimes in the newer version (chromiumos 4.14 kernel):
the estimated airtime for frames pending in the queue, and the airtime
reported by the firmware for frames already transmitted, which should be
accurate as the firmware is supposed to take retries and aggregation into
account. The airtime fairness scheduler, which does the fine-grained packet
scheduling, should use the "accurate" airtime reported by the firmware.
That's the reason why the original implementation in the ChromiumOS tree
tries to take aggregation size into account when estimating the airtime
overhead, and the later version doesn't even bother with that.
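
The split between the two airtimes can be sketched roughly as below. Again
this is only an illustration, not the chromiumos code: the struct, the
field names, the helper names and the 8 ms constant are all hypothetical.
The estimate only gates dequeueing, while the firmware-reported value is
what gets charged to the fairness scheduler.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define AQL_PENDING_LIMIT_US 8000u /* assumed limit, low end of 8-10 ms */

/* Hypothetical per-station state; names are illustrative only. */
struct sta_airtime {
    uint32_t pending_est_us; /* estimated airtime queued to firmware */
    int64_t deficit_us;      /* fairness budget, uses reported airtime */
};

/* AQL check on dequeue: looks only at the estimated pending airtime. */
static bool aql_can_dequeue(const struct sta_airtime *s)
{
    return s->pending_est_us < AQL_PENDING_LIMIT_US;
}

/* Account the estimate when a frame is handed to the driver. */
static void aql_on_dequeue(struct sta_airtime *s, uint32_t est_us)
{
    s->pending_est_us += est_us;
}

/*
 * TX completion: release the same estimate that was added on dequeue,
 * and charge the firmware-reported airtime to the fairness scheduler.
 */
static void aql_on_tx_complete(struct sta_airtime *s, uint32_t est_us,
                               uint32_t reported_us)
{
    assert(est_us <= s->pending_est_us); /* accounting must be symmetric */
    s->pending_est_us -= est_us;
    s->deficit_us -= reported_us;
}
```

An error in the estimate only shifts the point at which the queue is
throttled; it never reaches the scheduler's deficit, which is why the
estimate can stay crude.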

Regards,

Kan


On Fri, Sep 20, 2019 at 6:32 AM Toke H=C3=B8iland-J=C3=B8rgensen <toke@redh=
at.com>
wrote:

> Lorenzo Bianconi <lorenzo@kernel.org> writes:
>
> >> Lorenzo Bianconi <lorenzo@kernel.org> writes:
> >>
> >> >> From: Toke H=C3=B8iland-J=C3=B8rgensen <toke@redhat.com>
> >> >>
> >> >> Some devices have deep buffers in firmware and/or hardware which
> prevents
> >> >> the FQ structure in mac80211 from effectively limiting bufferbloat
> on the
> >> >> link. For Ethernet devices we have BQL to limit the lower-level
> queues, but
> >> >> this cannot be applied to mac80211 because transmit rates can vary
> wildly
> >> >> between packets depending on which station we are transmitting it t=
o.
> >> >>
> >> >> To overcome this, we can use airtime-based queue limiting (AQL),
> where we
> >> >> estimate the transmission time for each packet before dequeueing it=
,
> and
> >> >> use that to limit the amount of data in-flight to the hardware. Thi=
s
> idea
> >> >> was originally implemented as part of the out-of-tree airtime
> fairness
> >> >> patch to ath10k[0] in chromiumos.
> >> >>
> >> >> This patch ports that idea over to mac80211. The basic idea is simp=
le
> >> >> enough: Whenever we dequeue a packet from the TXQs and send it to t=
he
> >> >> driver, we estimate its airtime usage, based on the last recorded T=
X
> rate
> >> >> of the station that packet is destined for. We keep a running per-A=
C
> total
> >> >> of airtime queued for the whole device, and when that total climbs
> above 8
> >> >> ms' worth of data (corresponding to two maximum-sized aggregates), =
we
> >> >> simply throttle the queues until it drops down again.
> >> >>
> >> >> The estimated airtime for each skb is stored in the tx_info, so we
> can
> >> >> subtract the same amount from the running total when the skb is
> freed or
> >> >> recycled. The throttling mechanism relies on this accounting to be
> >> >> accurate (i.e., that we are not freeing skbs without subtracting an=
y
> >> >> airtime they were accounted for), so we put the subtraction into
> >> >> ieee80211_report_used_skb().
> >> >>
> >> >> This patch does *not* include any mechanism to wake a throttled TXQ
> again,
> >> >> on the assumption that this will happen anyway as a side effect of
> whatever
> >> >> freed the skb (most commonly a TX completion).
> >> >>
> >> >> The throttling mechanism only kicks in if the queued airtime total
> goes
> >> >> above the limit. Since mac80211 calculates the time based on the
> reported
> >> >> last_tx_time from the driver, the whole throttling mechanism only
> kicks in
> >> >> for drivers that actually report this value. With the exception of
> >> >> multicast, where we always calculate an estimated tx time on the
> assumption
> >> >> that multicast is transmitted at the lowest (6 Mbps) rate.
> >> >>
> >> >> The throttling added in this patch is in addition to any throttling
> already
> >> >> performed by the airtime fairness mechanism, and in principle the t=
wo
> >> >> mechanisms are orthogonal (and currently also uses two different
> sources of
> >> >> airtime). In the future, we could amend this, using the airtime
> estimates
> >> >> calculated by this mechanism as a fallback input to the airtime
> fairness
> >> >> scheduler, to enable airtime fairness even on drivers that don't
> have a
> >> >> hardware source of airtime usage for each station.
> >> >>
> >> >> [0]
> https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/=
+/588190/13/drivers/net/wireless-4.2/ath/ath10k/mac.c#3845
> >> >>
> >> >> Signed-off-by: Toke H=C3=B8iland-J=C3=B8rgensen <toke@redhat.com>
> >> >> ---
> >> >>  net/mac80211/debugfs.c     |   24 ++++++++++++++++++++++++
> >> >>  net/mac80211/ieee80211_i.h |    7 +++++++
> >> >>  net/mac80211/status.c      |   22 ++++++++++++++++++++++
> >> >>  net/mac80211/tx.c          |   38
> +++++++++++++++++++++++++++++++++++++-
> >> >>  4 files changed, 90 insertions(+), 1 deletion(-)
> >> >
> >> > Hi Toke,
> >> >
> >> > Thx a lot for working on this. Few comments inline.
> >> >
> >> > Regards,
> >> > Lorenzo
> >> >
> >> >>
> >> >> diff --git a/net/mac80211/debugfs.c b/net/mac80211/debugfs.c
> >> >> index 568b3b276931..c846c6e7f3e3 100644
> >> >> --- a/net/mac80211/debugfs.c
> >> >> +++ b/net/mac80211/debugfs.c
> >> >> @@ -148,6 +148,29 @@ static const struct file_operations aqm_ops =
=3D {
> >> >>   .llseek =3D default_llseek,
> >> >>  };
> >> >>
> >> >
> >> > [...]
> >> >
> >> >> @@ -3581,8 +3591,19 @@ struct sk_buff *ieee80211_tx_dequeue(struct
> ieee80211_hw *hw,
> >> >>   tx.skb =3D skb;
> >> >>   tx.sdata =3D vif_to_sdata(info->control.vif);
> >> >>
> >> >> - if (txq->sta)
> >> >> + pktlen =3D skb->len + 38;
> >> >> + if (txq->sta) {
> >> >>           tx.sta =3D container_of(txq->sta, struct sta_info, sta);
> >> >> +         if (tx.sta->last_tx_bitrate) {
> >> >> +                 airtime =3D (pktlen * 8 * 1000 *
> >> >> +                            tx.sta->last_tx_bitrate_reciprocal) >>
> IEEE80211_RECIPROCAL_SHIFT;
> >> >> +                 airtime +=3D IEEE80211_AIRTIME_OVERHEAD;
> >> >
> >> > Here we are not taking into account aggregation burst size (it is do=
ne
> >> > in a rough way in chromeos implementation) and tx retries. I have no=
t
> >> > carried out any tests so far but I think IEEE80211_AIRTIME_OVERHEAD
> >> > will led to a significant airtime overestimation. Do you think this
> >> > can be improved? (..I agree this is not a perfect world, but .. :))
> >>
> >> Hmm, yeah, looking at this again, the way I'm going this now, I should
> >> probably have used the low 16-us IFS overhead for every packet.
> >>
> >> I guess we could do something similar to what the chromeos thing is
> >> doing. I.e., adding a single "large" overhead value when we think the
> >> packet is the first of a burst, and using the smaller value for the
> >> rest.
> >>
> >> One approach could be to couple the switch to the "scheduling rounds" =
we
> >> already have. I.e., first packet after a call to
> >> ieee8021_txq_schedule_start() will get the 100-us overhead, and every
> >> subsequent one will get the low one. Not sure how this will fit with
> >> what the driver actually does, though, so I guess some experimentation
> >> is in order.
> >>
> >> Ultimately,  I'm not sure it matters that much whether occasionally ad=
d
> >> 80 us extra to the estimate. But as you say, adding 100 us to every
> >> packet is probably a bit much ;)
> >
> > Would it be possible to use the previous tx airtime reported by the
> > driver? (not sure if it is feasible). Some drivers can report airtime
> > compute in hw, the issue is it can be no not linked to the given skb
> > or aggregation burst, so we should take into account burst size
>
> That's what we do for the fairness scheduler. And yeah, if the HW can
> report after-the-fact airtime usage that is bound to be more accurate,
> so I think we should keep using that for fairness.
>
> But for this AQL thing, we really need it ahead of time. However, I
> don't think it's as important that it is super accurate. As long as we
> have a reasonable estimate I think we'll be fine. We can solve any
> inaccuracies by fiddling with the limit, I think. Similar to what BQL
> does; dynamically adjusting it up and down.
>
> So for a first pass, we can just err on the side of having the limit
> higher, and then iterate from there.
>
> >> > Moreover, can this approach be affected by some interrupt coalescing
> >> > implemented by the chipset?
> >>
> >> Probably? Ultimately we don't really know what exactly the chipset is
> >> doing, so we're guessing here, no?
> >
> > Here I mean if the hw relies on a 1:n tx interrupt/packet ratio (I
> > guess most driver do), it would probably affect throughput, right?
> > (e.g TCP)
>
> Yeah, this is what I alluded to above: If we set the limit too low, were
> are going to kill TCP throughput. Ideally, we want the limit to be as
> low as we can get it without hurting TCP (too much), but no lower. Just
> doing the conversion to airtime is a way to achieve this: This will
> scale the actual queue length with the achievable throughput as long as
> the tx rate estimate is reasonably accurate. If needed, we can add
> another layer of dynamic tuning on top using the existing BQL logic; but
> I'd like to get the basic case working first...
>
> -Toke
>
>

--00000000000049266d059303ba43--