From: Dave Taht
Date: Fri, 30 Sep 2016 12:18:11 -0700
To: "Jason A. Donenfeld", cake@lists.bufferbloat.net, make-wifi-fast@lists.bufferbloat.net
Cc: WireGuard mailing list
Subject: Re: [Make-wifi-fast] WireGuard Queuing, Bufferbloat, Performance, Latency, and related issues

Dear Jason:

Let me cross-post, with a little background, for those not paying attention on the other lists.

All: I've always dreamed of a VPN that could fq and - when it was bottlenecking on CPU - throw away packets intelligently. WireGuard, which is what Jason & co are working on, is a really simple, elegant set of newer VPN ideas; it currently has a queuing model designed to optimize for multi-CPU encryption, and not so much for managing worst-case network behaviors, fairness, or lower-end hardware. There's a LEDE port for it that topped out at (I think) about 16 Mbit/s on weak hardware.

http://wireguard.io/ is really easy to compile and set up. I wrote a bit about it on my blog as well ( http://blog.cerowrt.org/post/wireguard/ ) - and the fact that I spent any time on it at all is symptomatic of my overall ADHD (and at the time I was about to add a few more servers to the flent network and didn't want to use tinc anymore).
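(Purely to illustrate what "fq plus throwing away packets intelligently" might mean - this is not WireGuard, mac80211, or fq_codel source, just a minimal self-contained userspace sketch with made-up names and sizes - the basic move is to hash each flow into its own bucket and, when the total backlog goes over budget, drop from the head of the fattest bucket instead of tail-dropping whatever arrived last:)

/*
 * Minimal userspace sketch (not WireGuard or mac80211 code): hash flows
 * into per-flow buckets; when the total backlog exceeds a budget, drop
 * from the HEAD of the longest bucket, so the heaviest flow pays for the
 * overload. All names and sizes are illustrative.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define NBUCKETS    64    /* per-flow queues */
#define BACKLOG_MAX 128   /* total packets we are willing to hold */

struct pkt {
    struct pkt *next;
    uint32_t flow_hash;
};

struct bucket {
    struct pkt *head, *tail;
    int backlog;
};

static struct bucket buckets[NBUCKETS];
static int total_backlog;

static void enqueue(struct pkt *p)
{
    struct bucket *b = &buckets[p->flow_hash % NBUCKETS];

    p->next = NULL;
    if (b->tail)
        b->tail->next = p;
    else
        b->head = p;
    b->tail = p;
    b->backlog++;
    total_backlog++;

    /* Over budget: find the fattest bucket and drop its oldest packet. */
    while (total_backlog > BACKLOG_MAX) {
        struct bucket *fat = &buckets[0];
        for (int i = 1; i < NBUCKETS; i++)
            if (buckets[i].backlog > fat->backlog)
                fat = &buckets[i];

        struct pkt *victim = fat->head;
        fat->head = victim->next;
        if (!fat->head)
            fat->tail = NULL;
        fat->backlog--;
        total_backlog--;
        free(victim);   /* "throw away packets intelligently" */
    }
}

int main(void)
{
    /* Two flows; flow 1 sends 3x as much, so it absorbs all the drops. */
    for (int i = 0; i < 200; i++) {
        struct pkt *p = calloc(1, sizeof(*p));
        p->flow_hash = (i % 4) ? 1 : 2;
        enqueue(p);
    }
    for (int i = 0; i < NBUCKETS; i++)
        if (buckets[i].backlog)
            printf("bucket %d: backlog %d\n", i, buckets[i].backlog);
    return 0;
}

fq_codel layers CoDel's delay-based dropping on top of this skeleton, but the bucket-plus-budget part is what maps most directly onto per-peer (or per-station) queues.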
But, as it turns out, the structure and basic concepts in the mac80211 implementation - the retry queue, the global fq_codel queue with per-station hash collision detection - seemed to match much of WireGuard's internal model, and I'd piqued Jason's interest. Do a git clone of the code and take a look... somewhere on the WireGuard list, or privately, Jason pointed me at the relevant bits of the queuing model.

On Fri, Sep 30, 2016 at 11:41 AM, Jason A. Donenfeld wrote:
> Hey Dave,
>
> I've been comparing graphs and bandwidth and so forth with flent's rrul and iperf3, trying to figure out what's going on.

A quick note on iperf3 - please see http://burntchrome.blogspot.com/2016/09/iperf3-and-microbursts.html

There's a lesson in this, and in pacing in general: sending a giant burst out of your retry queue after you finish negotiating the link is a bad idea, and some sort of pacing mechanism might help.

And rather than pre-commenting here, I'll just include your last mail to these new lists:

> Here's my present understanding of the queuing and buffering issues. I sort of suspect these are issues that might not translate entirely well to the work you've been doing, but maybe I'm wrong. Here goes...
>
> 1. For each peer, there is a separate queue, called peer_queue. Each peer corresponds to a specific UDP endpoint, which means that a peer is a "flow".
>
> 2. When certain crypto handshake requirements haven't yet been met, packets pile up in peer_queue. Then, when a handshake completes, all the packets that piled up are released. Because handshakes might take a while, peer_queue is quite big -- 1024 packets (dropping the oldest packets when full). In this context, that's not huge bufferbloat, but rather just a queue of packets held while the setup operation is occurring.
>
> 3. WireGuard is a net_device interface, which means it transmits packets from userspace in softirq. It's advertised as accepting GSO "super packets", so sometimes it is asked to transmit a packet that is 65k in length. When this happens, it splits those packets up into MTU-sized packets, puts them in the queue, and then processes the entire queue at once, immediately after.
>
> If that were the totality of things, I believe it would work quite well. If the description stopped there, packets would be encrypted and sent immediately in the softirq device transmit handler, just like how the mac80211 stack does things. The existence of peer_queue wouldn't equate to any form of bufferbloat or latency issues, because it would just act as a simple data structure for immediately transmitting packets. Similarly, when receiving a packet from the UDP socket, we _could_ simply decrypt in softirq, again like mac80211, as the packet comes in. This makes all the expensive crypto operations blocking to the initiator of the operation -- the userspace application calling send() or the UDP socket receiving an encrypted packet. All is well.
>
> However, things get complicated and ugly when we add multi-core encryption and decryption. We add on to the above as follows:
>
> 4. The kernel has a library called padata (kernel/padata.c). You submit asynchronous jobs, which are then sent off to various CPUs in parallel, and then you're notified when the jobs are done, with the nice feature that you get these notifications in the same order that you submitted the jobs, so that packets don't get reordered.
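(A purely illustrative aside - this is not the kernel's padata API or WireGuard code, just a little self-contained userspace model with invented names and a tiny in-flight limit - the "completions come back in submission order" property usually amounts to a sequence-numbered release stage: workers can finish in any order, but nothing is handed downstream until everything submitted before it has been released:)

/*
 * Userspace model (not kernel padata) of "complete in submission order":
 * jobs may finish out of order, but complete() only releases results once
 * every earlier sequence number has already been released.
 */
#include <stdbool.h>
#include <stdio.h>

#define MAX_INFLIGHT 8          /* stand-in for padata's in-flight limit */

struct job {
    unsigned seq;               /* submission order */
    bool done;                  /* finished, possibly out of order */
};

static struct job ring[MAX_INFLIGHT];
static unsigned next_seq;       /* next sequence number to hand out */
static unsigned next_release;   /* next sequence number we may release */

/* Returns -1 when MAX_INFLIGHT jobs are outstanding (think -EBUSY). */
static int submit(void)
{
    if (next_seq - next_release >= MAX_INFLIGHT)
        return -1;
    struct job *j = &ring[next_seq % MAX_INFLIGHT];
    j->seq = next_seq++;
    j->done = false;
    return (int)j->seq;
}

/* A worker finished job 'seq'; release everything that is now in order. */
static void complete(unsigned seq)
{
    ring[seq % MAX_INFLIGHT].done = true;
    while (next_release < next_seq && ring[next_release % MAX_INFLIGHT].done) {
        printf("transmit packet %u\n", next_release);   /* always in order */
        next_release++;
    }
}

int main(void)
{
    /* Submit four jobs; finish them out of order (2, 0, 1, 3). */
    for (int i = 0; i < 4; i++)
        submit();
    complete(2);    /* nothing released yet */
    complete(0);    /* releases 0 */
    complete(1);    /* releases 1, then 2 */
    complete(3);    /* releases 3 */
    return 0;
}

The in-flight window (MAX_INFLIGHT here, 1000 in padata) is the part that turns into standing queueing delay once the CPUs are saturated, which is presumably why shrinking it changes latency so dramatically in the experiments described below.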
> padata has a hard-coded maximum of 1000 in-progress operations. We can artificially make this lower, if we want (currently we don't), but we can't make it higher.
>
> 5. We continue from the above-described peer_queue, only this time, instead of encrypting immediately in softirq, we simply send all of peer_queue off to padata. Since the actual work happens asynchronously, we return immediately, not spending cycles in softirq. When that batch of encryption jobs completes, we transmit the resultant encrypted packets. When we send those jobs off, it's possible padata already has 1000 operations in progress, in which case we get -EBUSY and can take one of two options: (a) put that packet back at the top of peer_queue, return from sending, and try again to send all of peer_queue the next time the user submits a packet, or (b) discard that packet, and keep trying to queue up the ones after it. Currently we go with behavior (a).
>
> 6. Likewise, when receiving an encrypted packet from a UDP socket, we decrypt it asynchronously using padata. If there are already 1000 operations in flight, we drop the packet.
>
> If I change the length of peer_queue from 1024 to something small like 16, it has some effect when combined with choice (a) as opposed to choice (b), but I think this knob isn't so important, and I can leave it at 1024. However, if I change padata's maximum from 1000 to something small like 16, I immediately get much lower latency, but bandwidth suffers greatly, no matter choice (a) or choice (b). Padata's maximum seems to be the relevant knob. But I'm not sure of the best way to tune it, nor am I sure of the best way to interact with everything else here.
>
> I'm open to all suggestions, as at the moment I'm a bit in the dark on how to proceed. Simply saying "just throw fq_codel at it!" or "change your buffer lengths!" doesn't really help me much, as I believe the design is a bit more nuanced.
>
> Thanks,
> Jason

--
Dave Täht
Let's go make home routers and wifi faster! With better software!
http://blog.cerowrt.org
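(To make the (a)-versus-(b) trade-off above concrete - again, this is not WireGuard's actual code, just a self-contained userspace sketch with invented names and a tiny in-flight limit standing in for padata's 1000 - the two reactions to crypto backpressure look roughly like this:)

/*
 * Sketch (not WireGuard code) of the two reactions to backpressure from the
 * parallel-crypto stage. crypto_submit() is a stub that refuses work once
 * INFLIGHT_MAX jobs are outstanding, much like padata returning -EBUSY.
 */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define INFLIGHT_MAX 4          /* stand-in for padata's 1000 */

struct pkt { struct pkt *next; int id; };

static int inflight;

static bool crypto_submit(struct pkt *p)
{
    if (inflight >= INFLIGHT_MAX)
        return false;           /* analogous to -EBUSY */
    inflight++;
    printf("submitted %d\n", p->id);
    return true;
}

/* Option (a): stop on backpressure, keep the packet, retry on the next send. */
static struct pkt *flush_requeue(struct pkt *head)
{
    while (head) {
        if (!crypto_submit(head))
            return head;        /* head (and everything after it) stays queued */
        struct pkt *next = head->next;
        free(head);             /* the stub does not keep the packet */
        head = next;
    }
    return NULL;
}

/* Option (b): never hold a backlog; shed whatever cannot be submitted. */
static void flush_drop(struct pkt *head)
{
    while (head) {
        struct pkt *next = head->next;
        if (!crypto_submit(head))
            printf("dropped %d\n", head->id);
        free(head);
        head = next;
    }
}

static struct pkt *make_queue(int n)
{
    struct pkt *head = NULL, **tail = &head;
    for (int i = 0; i < n; i++) {
        struct pkt *p = calloc(1, sizeof(*p));
        p->id = i;
        *tail = p;
        tail = &p->next;
    }
    return head;
}

int main(void)
{
    /* With room for only 4 jobs: option (a) holds packets 4 and 5 back;
     * handing the leftovers to option (b) sheds them instead. */
    struct pkt *rest = flush_requeue(make_queue(6));
    flush_drop(rest);
    return 0;
}

Option (a) never loses a packet but lets a standing backlog (and therefore latency) build whenever crypto is the bottleneck; option (b) keeps the backlog bounded at the cost of loss, which is where smarter, fq_codel-style drop decisions could plug in.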