[Cake] faster scheduling, maybe

Mon Jun 6 15:25:05 EDT 2016

On Mon, 6 Jun 2016, Dave Taht wrote:

> On Mon, Jun 6, 2016 at 11:48 AM, David Lang <david at lang.hm> wrote:
>> On Mon, 6 Jun 2016, Dave Taht wrote:
>>
>>> http://info.iet.unipi.it/~luigi/papers/20160511-mysched-preprint.pdf
>>
>>
>> I don't think so.
>>
>> They don't even try for fairness between flows, they are just looking at
>> fairness between different VMs. They tell a VM that it has complete access
>> to the NIC for a time, then give another VM complete access to the NIC. At
>> best they put each VM's traffic into a different hardware queue in the NIC.
>>
>> This avoids all AQM decisions on the part of the host OS, because the
>> packets never get to the host OS.
>>
>> The speed improvement is by bypassing the host OS and just having the VMs
>> deliver packets directly to the NIC. This speeds things up, but at the cost
>> of any coordination across VMs. Each VM can run fq_codel but it's much
>> coarser timeslicing between VMs.
>
>
> Well, the principal things bugging me are:
>
> * that we have multi-core on nearly all the new routers.
> * Nearly all the ethernet devices themselves support hardware multiqueue.
> * we take 6 locks on the qdiscs
> * rx and tx ring cleanup are often combined in existing drivers in a
> single thread.

These are valid concerns. But this paper just arbitrated between multiple VMs 
accessing one hardware NIC. If you don't have multiple VMs in play, their 
approach has nothing to work with.

> The chance to rework the mac80211 layer on make-wifi-fast (where
> manufacturers are also busy adding hardware mq), gives us a chance to
> rethink how we process access to these queues.

Watching the discussion I see a few things.

1. If we can figure out exactly how much data the system is going to handle, we
can fill each queue with what we want in the next aggregate to that destination.

2. With multiple queues/cores, when we have data from different sources, we can
route it to different cores and split the work of sorting the data into
different queues between the cores (each core working on a different subset of
queues, so it does not have to lock against other cores).

2a. Balancing traffic across the cores/queuesets (or at least the output from
each of them) would be tricky, but that's where thinking like this paper could
possibly help.

3. As we are seeing in the mac80211 work, qdiscs operate way too early in the
process, so we need to eliminate them for wifi and do the queueing closer to
the hardware. On a network where packets to different destinations can be freely
mixed, qdiscs can continue to operate.
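
To make point 2 a bit more concrete, here is a rough sketch in Python. It is
not kernel code and not what the paper does. It only simulates the routing
decision sequentially: packets are assigned to a core by a stable hash of their
source, and each core owns its own per-destination queues, so no queue is ever
touched by two cores. The core count, the helper names and the choice of CRC32
as the hash are all assumptions made up for illustration.

```python
import zlib
from collections import defaultdict, deque

NUM_CORES = 4  # assumed core count, purely illustrative


def core_for_source(src: str, num_cores: int = NUM_CORES) -> int:
    """Pick a core from a stable hash of the packet's source (hypothetical helper)."""
    return zlib.crc32(src.encode()) % num_cores


class CoreQueueSet:
    """Per-destination queues owned by exactly one core.

    Because only the owning core ever touches these queues, sorting packets
    into them needs no lock against the other cores.
    """

    def __init__(self) -> None:
        self.queues = defaultdict(deque)

    def enqueue(self, dst: str, packet: bytes) -> None:
        self.queues[dst].append(packet)


def classify(packets, num_cores: int = NUM_CORES):
    """Route (src, dst, payload) tuples to per-core queue sets."""
    cores = [CoreQueueSet() for _ in range(num_cores)]
    for src, dst, payload in packets:
        cores[core_for_source(src, num_cores)].enqueue(dst, payload)
    return cores
```

Note that the same destination can end up with a queue on more than one core,
since the split is by source. Merging or balancing the output of those queue
sets is exactly the tricky part from 2a, and this sketch does not attempt it.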

For drivers where the rx and tx ring cleanup is combined, are you talking about
ones that already support BQL, or are these drivers that need BQL added as
well? It may be that splitting this when BQL is added is the right thing to do.

David Lang