Toke,

Thank you very much for pointing me in the right direction.
I am having some fun in the lab tinkering with the 'mq' qdisc and Jesper's xdp-cpumap-tc.
It seems I will need to use iptables or nftables to steer packets to the corresponding queues, since mq apparently cannot have u32 filters attached at its root.
I will try to familiarize myself with iptables and nftables, and hopefully I can get this working soon and report back. Thank you!
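
For reference, here is a rough sketch (in Python, shelling out to tc) of the per-queue layout I am hoping to end up with: mq at the root, one HTB per hardware TX queue, fq_codel under each customer class, and u32 filters attached to each per-queue HTB rather than to mq itself. This is not LibreQoS code and I have not tested it on real hardware yet; the interface name, queue count, rates and example IP are all placeholders.

#!/usr/bin/env python3
# Untested sketch: per-queue HTB + fq_codel + u32 filters under an mq root.
import subprocess

IFACE = "eth4"   # placeholder shaping interface
QUEUES = 4       # number of hardware TX queues on the NIC

def tc(args: str) -> None:
    """Run a tc command, echoing it first so the setup is easy to audit."""
    print("tc " + args)
    subprocess.run(["tc"] + args.split(), check=True)

# Root mq qdisc; its classes 7fff:1 .. 7fff:N correspond to the TX queues.
tc(f"qdisc replace dev {IFACE} root handle 7fff: mq")

for q in range(1, QUEUES + 1):
    # One HTB per hardware queue, hanging off the matching mq class.
    tc(f"qdisc add dev {IFACE} parent 7fff:{q:x} handle {q:x}: htb default 2")
    tc(f"class add dev {IFACE} parent {q:x}: classid {q:x}:1 htb rate 1gbit")
    # Default class for traffic that matches no customer filter.
    tc(f"class add dev {IFACE} parent {q:x}:1 classid {q:x}:2 htb rate 1gbit")

# One example customer pinned to queue 1 (placeholder plan rate and IP).
tc(f"class add dev {IFACE} parent 1:1 classid 1:3 htb rate 10mbit ceil 25mbit")
tc(f"qdisc add dev {IFACE} parent 1:3 fq_codel")
tc(f"filter add dev {IFACE} parent 1: protocol ip u32 "
   f"match ip dst 198.51.100.7/32 flowid 1:3")

The piece I still need to work out is getting each customer's traffic onto the right hardware queue in the first place, which is where xdp-cpumap-tc (or firewall marking) would come in.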

On Fri, Jan 15, 2021 at 5:30 AM Toke Høiland-Jørgensen <toke@toke.dk> wrote:
Robert Chacon <robert.chacon@jackrabbitwireless.com> writes:

>> Cool! What kind of performance are you seeing? The README mentions being
>> limited by the BPF hash table size, but can you actually shape 2000
>> customers on one machine? On what kind of hardware and at what rate(s)?
>
> On our production network our peak throughput is 1.5Gbps from 200 clients,
> and it works very well.
> We use a simple consumer-class AMD 2700X CPU in production because
> utilization of the shaper VM is ~15% at 1.5Gbps load.
> Customers get reliably capped within ±2Mbps of their allocated htb/fq_codel
> bandwidth, which is very helpful for controlling network congestion.
>
> Here are some graphs from RRUL performed on our test bench hypervisor:
> https://raw.githubusercontent.com/rchac/LibreQoS/main/docs/fq_codel_1000_subs_4G.png
> In that example, bandwidth for the "subscriber" client VM was set to 4Gbps.
> 1000 IPv4 IPs and 1000 IPv6 IPs were in the filter hash table of LibreQoS.
> The test bench server has an AMD 3900X running Ubuntu in Proxmox. 4Gbps
> utilizes 10% of the VM's 12 cores. Paravirtualized VirtIO network drivers
> are used and most offloading types are enabled.
> In our setup, VM networking multiqueue isn't enabled (it kept disrupting
> traffic flow), so 6Gbps is probably the most it can achieve like this. Our
> qdiscs in this VM may be limited to one core because of that.

I suspect the issue you had with multiqueue is that it requires per-CPU
partitioning on a per-customer basis to work well. This is possible to do
with XDP, as Jesper demonstrates here:

https://github.com/netoptimizer/xdp-cpumap-tc

With this it should be possible to scale the hardware queues across
multiple CPUs properly, and you should be able to go to much higher
rates by just throwing more CPU cores at it. At least on bare metal; not
sure if the VM virt-drivers have the needed support yet...

-Toke


--

Robert Chacón
Owner

M (915) 730-1472
E robert.chacon@jackrabbitwireless.com
JackRabbit Wireless LLC
P.O. Box 222111
El Paso, TX 79913
jackrabbitwireless.com