[Cake] WireGuard Queuing, Bufferbloat, Performance, Latency, and related issues

Wed Oct 5 08:39:14 EDT 2016

On Sun, Oct 2, 2016 at 1:31 PM, Toke Høiland-Jørgensen <toke at toke.dk> wrote:
> You don't need a timer. You already have a signal for when more queue
> space is available in the encryption step: When a packet finishes
> encryption. All you need to do is try to enqueue another one at this
> point.

Oh, silly me. Yes of course. Voila:
https://git.zx2c4.com/WireGuard/commit/?id=a0ad61c1a0e25a376e145f07ca27c605d3852bc4

> Also, I've completely ignored your existing bunching of packets after
> encryption. I think you could still do that (i.e. wait to send things in
> encryption_done() until you have a bunch of packets), as long as you
> call schedule() for each packet that finishes encryption.

But, in fact, I wait until all packets are done and sent to call schedule().
The reason is that sometimes schedule() does not queue, but rather immediately
encrypts and sends the packet if it's small and if there's only one. Waiting
until they're all sent out ensures packet ordering. It also allows the next
batch of packets to be bundled up better, which provides much better
throughput. While it seems like this would contribute unneccessary latency,
the only times that packets are actually bundled in large quantities is when
doing some sort of bulk TCP data transfer, where the kernel has access to a
huge buffer and can form GSO superpackets. In these cases, latency is rarely
the concern, since it's bulk transfer.

> Almost: We're not adding delay, we're trying to get rid of it ;)
>
> Basically, what CoDel does is look at the time the packet spent in the
> queue. If this time grows above 5 ms for an extended period of time,
> CoDel will drop a packet. This signals the TCP sender to slow down,
> which will hopefully drain the queue. If it doesn't help for long
> enough, CoDel will drop another packet; and so on, dropping at an
> increasing rate until the queue is drained.
>
> The basic mechanism of occasionally dropping a packet has been known for
> decades; it's the same good old RED does. What CoDel brings to the table
> is basing the drop on the actual measured time the packets spend in the
> queue. This nicely avoids having to tell the algorithm how many packets
> are too many.

Ahh, of course. That makes sense.

>
> Basically, putting all of the above together, I'd suggest a flow like
> this (expressed in C-like pseudocode):
>
> function enqueue(pkt) { /* this is basically your existing xmit() */
>   peer = get_peer(pkt);
>   timestamp(pkt);
>   fq_enqueue(peer, pkt);
>   list_add_end(peer, active_peers);
>   schedule();
> }
>
> function schedule() {
>   while (!full(encryption queue)) {
>     peer = list_first(active_peers);
>     if (!peer) /* no active peers */
>       break;
>
>     pkt = fq_dequeue(peer); /* Option 1: This runs CoDel */
>     if (!pkt) { /* peer queue empty */
>       list_remove(peer, active_peers);
>       continue;
>     }
>
>     start_encryption(pkt);
>     list_move_end(peer, active_peers); /* rotate list */
>   }
> }
>
> function encryption_done(pkt) {
>   codel(pkt); /* Option 2: Run CoDel here; may drop the packet */
>   if (pkt)
>     send(pkt);
>   schedule();
> }
>
>
> This basically gets you a round-robin scheduler between stations (in
> schedule()), and an FQ structure for each peer that will do per-flow
> fairness and sparse flow prioritisation. You could also do a 'sparse
> peer' optimisation in schedule(), but it's not strictly necessary.
>
> I'd suggest you use the dynamic queue length stuff to check whether the
> encryption queue is full. I.e. the function I called full() would
> actually be a call to dql_avail() - and you'd need to add the companion
> calls to dql_queued() and dql_completed() in appropriate places, of
> course (for which I'd suggest you use bytes rather than number of
> packets). This should keep just the right number of packets in the
> encryption queue to keep things busy, but not more (which would just add
> latency).
>
> As you can see I marked to possible places you could apply CoDel in the
> above. I'm not actually sure which one works best. On the one hand,
> option 2 means CoDel will get a chance to react to the encryption
> latency, which is good. But on the other hand, because there is fairness
> queueing, there can be reordering, so you'll get some packets with very
> low latency intermixed with those that have been queued for longer.
> Which will basically turn CoDel off. So for this reason, I think having
> CoDel in the fq_dequeue() step would be better; in this case, there are
> a separate set of state vars for each flow, so the TCP elephant flows
> can get packets dropped while the others won't. This is what we do in
> mac80211.
>
> Actually with this setup, things are analogous to a regular network
> interface, where the encryption queue takes the place of the hardware.
> So we want to keep the hardware (encryption) busy, but not have more
> packets queued up there than is necessary to achieve this.

That's a nice design you've outlined. One issue, however, with option 2 is
that now I'm dropping packets after they're already been encrypted. This seems
like it's a bit of a waste. Option 1 avoids this, but then it doesn't take
into account the encryption latency, which is the main source of bufferbloat.
Is there a hybrid option? Specifically, is there an API or a pattern to the
tune of:

struct fq_codel_stats fqc;

xmit(struct sk_buff *skb) {
    fq_hash_t flow_hash = fq_compute_flow_hash(iphdr(skb), ...);
    codel_time_t time = codel_get_time_or_drop(&fqc, flow_hash);
    if (!time) {
        kfree_skb(skb);
        return;
    }

    skb->cb.time = time;
    skb->cb.flow_hash = flow_hash;

    put_packet_in_encryption_queue(skb);
}

encryption_finished(struct sk_buff *skb) {
    fq_codel_register_packet_dequeued(&fqc, skb->cb.time, skb->cb.flow_hash);
    send_to_udp_socket(skb);
}

The idea here would be that the fqc structure would keep track of the various
statistics required for fq_codel to do its things. Packets would be registered
coming into the queue and registered coming out of the queue. However, *before*
being added to the queue, codel would say whether or not to drop the packet
based on its flow. It seems like this setup would be much simpler, and fulfill
all the requirements, including sparse flows. The hash computation function
would be in the place with the most information, and could work per-peer,
per-flow.

Have I misunderstood something, or is this kind of simplification a real
possibility?

Jason