[Cake] Advantages to tightly tuning latency
toke at redhat.com
Thu Apr 23 12:42:11 EDT 2020
Maxime Bizon <mbizon at freebox.fr> writes:
> On Thursday 23 Apr 2020 at 13:57:25 (+0200), Toke Høiland-Jørgensen wrote:
> Hello Toke,
>> That is awesome! Please make sure you include the AQL patch for ath10k,
>> it really works wonders, as Dave demonstrated:
> Was it in 5.4? We try to stick to LTS kernels.
Didn't make it in until 5.5, unfortunately... :(
I can try to produce a patch that you can manually apply on top of 5.4
if you're interested?
>> We're working on that in kernel land - ever heard of XDP? On big-iron
>> servers we have no issues pushing 10s and 100s of Gbps in software
>> (well, the latter only given enough cores to throw at the problem :)).
>> There's not a lot of embedded platform support as of yet, but we do
>> have some people in the ARM world working on that.
>> Personally, I do see embedded platforms as an important (future) use
>> case for XDP, though, in particular for CPEs. So I would be very
>> interested in hearing details about your particular platform, and your
>> DPDK solution, so we can think about what it will take to achieve the
>> same with XDP. If you're interested in this, please feel free to reach
>> out :)
> Last time I looked at XDP, its primary use cases were "early drop" /
> "anti ddos".
Yeah, that's the obvious use case (i.e., the easiest to implement). But we
really want it to be a general-purpose acceleration layer where you can
selectively use only the kernel facilities you need - or even skip some of
them entirely and reimplement an optimised subset fitting your use case.
> In our case, each packet has to be routed+NATed, we have VLAN tags, and we
> also have MAP-E for IPv4 traffic. So in the vanilla forwarding path,
> this does multiple rounds of RX/TX because of tunneling.
> TBH, the hard work in our optimized forwarding code is figuring out
> what modifications to apply to each packet. Now, whether the modifications
> and TX are done by XDP or by hand-written C code in the kernel is
> more of a detail, even though using XDP is much cleaner, of course.
> What the kernel always lacked is what DaveM once called the "grand
> unified flow cache": the ability to do a single lookup and decide
> what to do with the packet. Instead we have the bridge forwarding
> table, the IP routing table (it used to be a cache), the netfilter
> conntrack lookup, and multiple rounds of those if you do tunneling.
> Once you have this "flow table" infrastructure, it becomes easy to
> offload forwarding, either to real hardware or to software (for
> example, dedicating a CPU core in polling mode).
> The good news is that it seems nftables is building this:
> I'm still using iptables, but it seems that the features I was missing,
> like TCPMSS, are now in nft as well, so I will have a look.
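For reference, a rough sketch of what this looks like in nft syntax - both the flowtable shortcut and the MSS clamping that replaces iptables' TCPMSS target (table and interface names are illustrative):

```
table inet filter {
    flowtable ft {
        hook ingress priority filter
        devices = { eth0, wan0 }        # illustrative interface names
    }
    chain forward {
        type filter hook forward priority filter; policy accept;

        # nft counterpart of iptables TCPMSS --clamp-mss-to-pmtu
        tcp flags syn tcp option maxseg size set rt mtu

        # shortcut established flows through the flowtable
        ip protocol { tcp, udp } flow add @ft
    }
}
```

Once a flow is added to the flowtable, subsequent packets bypass the classic forwarding path; with supported hardware the same flowtable can be offloaded to the NIC.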
I find it useful to think of XDP as a 'software offload' - i.e. a fast
path where you implement the most common functionality as efficiently as
possible and dynamically fall back to the full stack for the edge cases.
Enabling lookups in the flow table from XDP would be an obvious thing to
do, for instance. There were some patches going by to enable some kind
of lookup into conntrack at some point, but I don't recall the details.
Anyhow, my larger point was that we really do want to enable such use
cases for XDP; but we are lacking the details of what exactly is missing
before we can get to something that's useful / deployable. So any
details you could share about what feature set you are supporting in
your own 'fast path' implementation would be really helpful. As would
details about the hardware platform you are using. You can send them
off-list if you don't want to make it public, of course :)
>> Setting aside the fact that those single-stream tests ought to die a
>> horrible death, I do wonder if it would be feasible to do a bit of
>> 'optimising for the test'? With XDP we do have the ability to steer
>> packets between CPUs based on arbitrary criteria, and while it is not as
>> efficient as hardware-based RSS it may be enough to achieve line rate
>> for a single TCP flow?
> You cannot do steering for a single TCP flow at those rates because
> you will get out-of-order packets and kill TCP performance.
Depends on the TCP stack (I think).
> I do not consider those single-stream tests to be unrealistic; this is
> exactly what happens if, say, you buy a game on Steam and download it.
Steam is perhaps a bad example as that is doing something very much like
bittorrent AFAIK; but point taken, people do occasionally run
single-stream downloads and want them to be fast. I'm just annoyed that
this becomes the *one* benchmark people run, to the exclusion of
everything else that has a much larger impact on the overall user
experience.