[Bloat] Comcast & L4S

Jonathan Morton chromatix99 at gmail.com
Mon Feb 3 09:04:58 EST 2025


>> Actually, the 5ms target is already too tight for efficient TCP operation on typical Internet paths - unless there is significant statistical multiplexing on the bottleneck link, which is rarely the case in a domestic context.  
> 
> I respectfully disagree; even at 1 Gbps we only go down to 85% utilisation with a single flow, I assume. That is a trade-off I am happy to make...

At 100ms RTT, yes.  But you can see that Codel has disproportionately more trouble once the RTT grows even a little beyond that, and such paths are not uncommon outside our usual stomping grounds of Europe and North America.  This happens because each congestion episode starts to trigger more than one Multiplicative Decrease from the congestion signalling, so the average cwnd falls below what would normally be expected.  That would not typically occur under high statistical multiplexing.
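
To put rough numbers on that - this is my own back-of-envelope model, not taken from the attached graphs - each extra Multiplicative Decrease within one episode deflates the cwnd trough geometrically:

    # Illustrative back-of-envelope model (not from the attached graphs):
    # each extra Multiplicative Decrease within one congestion episode
    # shrinks the cwnd trough geometrically.
    BETA = 0.7  # CUBIC's multiplicative-decrease factor

    def trough_fraction(md_count: int) -> float:
        """Fraction of the pre-episode cwnd remaining after md_count MDs."""
        return BETA ** md_count

    for k in (1, 2, 3):
        print(f"{k} MD(s): trough at {trough_fraction(k):.0%} of peak")
    # 1 MD -> 70%, 2 MDs -> 49%, 3 MDs -> 34% of peak.  Once the trough
    # falls below the path BDP, the link idles until cwnd recovers, and
    # average throughput drops accordingly.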

>> Short RTTs on a LAN allow for achieving full throughput with the queue held this small, but remember that the concept of "LAN" also includes WiFi links whose median latency is orders of magnitude greater than that of switched Ethernet.  That's why I don't want to encourage going below 5ms too much.
> 
> Not wanting to be contrarian, but here I believe fixing WiFi is the better path forward.

Perhaps, but you'll need to change the fundamental collision-avoidance MAC design of WiFi to do that.  Until someone (and it will take more than mere *individual* contributions) gets around to that and existing WiFi hardware mostly drops out of use, we have to design for its current behaviour.

I'm not talking about the bufferbloat of some specific WiFi hardware here - we've already done all the technical work we can to fix that.  It's the fundamental link protocol.

>> DelTiC actually reverts to the 25ms queue target that has historically been typical for AQMs targeting conventional TCP.
> 
> Not doubting one bit that 25ms makes a ton of sense for DelTiC, but where does this historical 25ms figure come from, and how was it selected?

Perhaps "historical" is putting it too strongly - it's only quite recently that AQM has used a time-based delay target at all.  It is, however, the delay target that PIE uses.

The graphs I attached arise from an effort to decide what "rightsize" actually means for a dumb FIFO buffer, in which it proved convenient to also test some AQMs.  The classical rule is based on Reno behaviour, and in the absence of statistical multiplexing reduces to "buffer depth equal to baseline path length" to obtain 100% throughput.  Updating this for CUBIC yields a rule of "buffer depth 3/7ths of baseline path length", which for a 100ms path would be around 40ms buffer.  This is, again, for 100% throughput at steady state.
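
For concreteness, here is the arithmetic behind those two rules, under the usual single-flow model in which the buffer must absorb the post-MD drop in cwnd (a sketch of the reasoning, not anyone's production code):

    # Single flow at steady state: cwnd peaks at BDP + B (B = buffer).
    # A multiplicative decrease by factor beta leaves a trough of
    # beta * (BDP + B); full throughput requires that trough >= BDP,
    # hence B >= BDP * (1 - beta) / beta.
    def min_buffer_ms(base_rtt_ms: float, beta: float) -> float:
        """Smallest buffer (as ms of delay) that keeps the link busy."""
        return base_rtt_ms * (1 - beta) / beta

    print(min_buffer_ms(100, 0.5))  # Reno  (beta 0.5): 100 ms = path length
    print(min_buffer_ms(100, 0.7))  # CUBIC (beta 0.7): ~42.9 ms = 3/7 of path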

Examining the detailed behaviour of CUBIC, we realised that approximately halving this would still yield reasonably good throughput, due to CUBIC's designed-in decelerating approach to the previous highest cwnd and, particularly, its intermittent use of "fast convergence" cycles in which the inflection point is placed halfway between the peak and trough of the sawtooth.  That yields a buffer size of 3/14ths of the baseline RTT.  On a 100ms path, 25ms gives a reasonable engineering margin on top of this rule, and is also small enough for VoIP to easily accommodate the jitter induced by a competing traffic load.
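
In numbers, continuing the same sketch as above:

    # Halving the 3/7 rule, for CUBIC's concave approach to the old peak
    # and its "fast convergence" cycles, gives the 3/14 rule.
    def target_ms(base_rtt_ms: float) -> float:
        return base_rtt_ms * 3 / 14

    print(target_ms(100))  # ~21.4 ms; the 25 ms target adds margin on top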

Thus, in the graphs, you can see DelTiC staying consistently above 95% throughput at 100ms, and falling off relatively gracefully above that.  Codel requires a path of 32ms or shorter to achieve that.  Even PIE, with the same delay target as DelTiC, doesn't do as well - but that is due to its incorrect marking behaviour, which we have discussed at length before.

>> As for CPU efficiency, that is indeed something to keep in mind.  The scheduling logic in Cake got very complex in the end, and there are undoubtedly ways to avoid that with a fresh design.
> 
> Ah, that was not my main focus here. With 1600 Gbps Ethernet already on the horizon, I assume a shaper running out of CPU is not really avoidable; I am more interested in that shaper having a graceful, latency-conserving failure mode when it runs out of timely CPU access. Making scheduling more efficient is something I am fully behind, but I consider these two mostly orthogonal issues.

I suppose there are two distinct meanings of "scheduling".  One is deciding which packet to send next.  The other is deciding WHEN the next packet can be sent.  It's the latter that might be more complicated than necessary in Cake, and that complexity could easily result in exercising the kernel timer infrastructure more than required.
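
To make the distinction concrete, here is a toy sketch of the second kind of scheduler - my own illustration, not Cake's actual algorithm, which is considerably more involved:

    import time

    class ToyShaper:
        """Decides WHEN the next packet may leave, given a shaped rate.
        Illustration only - not Cake's actual implementation."""
        def __init__(self, rate_bytes_per_sec: float):
            self.rate = rate_bytes_per_sec
            self.next_send = time.monotonic()  # virtual clock

        def release_time(self, pkt_len: int) -> float:
            """Earliest departure time for this packet; advances the
            virtual clock by the packet's serialisation delay."""
            send_at = max(self.next_send, time.monotonic())
            self.next_send = send_at + pkt_len / self.rate
            return send_at

The first kind of scheduling - choosing WHICH packet goes next - is a separate loop (flow selection via DRR in Cake's case) that merely consults a clock like this to learn whether it may dequeue yet.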

However, I would also note that this behaviour is only seen on certain specific classes of hardware, and on that particular hardware I think there is another mechanism contributing to poor throughput.  Cake's shaper architecture quite deliberately "pushes harder" when the throughput goes below the configured rate, and that manifests as higher CPU utilisation.  HTB isn't as good at that.  But the underlying reason may be a bottleneck in the I/O infrastructure between the CPU and the network.  When a qdisc is not attached, this I/O bottleneck is bypassed.

 - Jonathan Morton


