thanks David - I really like your clear distinction between avoidance and optimized congestion.

v

On Tue, Sep 28, 2021 at 6:15 PM David P. Reed wrote:

> Upon thinking about this, here's a radical idea:
>
> the expected time until a bottleneck link clears, that is, 0 packets are in the queue to be sent on it, must be < t, where t is an Internet-wide constant corresponding to the time it takes light to circle the earth.
>
> This is a local constraint, one that is required of a router. It can be achieved in any of a variety of ways (for example choosing to route different flows on different paths that don't include the bottleneck link).
>
> It need not be true at all times - but when I say "expected time", I mean that the queue's behavior is monitored so that this situation is quite rare over any interval of ten minutes or more.
>
> If a bottleneck link is continuously full for more than the time it takes for packets on a fiber (< light speed) to circle the earth, it is in REALLY bad shape. That must never happen.
>
> Why is this important?
>
> It's a matter of control theory - if the control loop delay gets longer than its minimum, instability tends to take over no matter what control discipline is used to manage the system.
>
> Now, it is important as hell to avoid bullshit research programs that try to "optimize" utilization of link capacity at 100%. Those research programs focus on the absolute wrong measure - a proxy for "network capital cost" that is in fact the wrong measure of any real network operator's cost structure. The cost of media (wires, airtime, ...) is a tiny fraction of most network operations' cost in any real business or institution. We don't optimize highways by maximizing the number of cars on every stretch of highway, for obvious reasons, but also for non-obvious reasons.
>
> Latency and lack of flexibility or reconfigurability impose real costs on a system that are far more significant to end-user value than the cost of the media.
>
> A sustained congestion of a bottleneck link is not a feature, but a very serious operational engineering error. People should be fired if they don't prevent that from ever happening, or allow it to persist.
>
> This is why telcos, for example, design networks to handle the expected maximum traffic with some excess capacity. This is why networks are constantly being upgraded as load increases, *before* overloads occur.
>
> It's an incredibly dangerous and arrogant assumption that operation in a congested mode is acceptable.
>
> That's the rationale for the "radical proposal".
>
> Sadly, academic thinkers (even ones who have worked in industry research labs on minor aspects) get drawn into solving the wrong problem - optimizing the case that should never happen.
>
> Sure that's helpful - but only in the same sense that, when designing systems where accidents need to have fallbacks, one needs to design the fallback system to work.
>
> Operating at a fully congested state - or designing TCP to essentially come close to DDoS behavior on a bottleneck to get a publishable paper - is missing the point.
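For a rough sense of scale (my numbers, not David's): the Earth's circumference is about 40,000 km, so light in vacuum needs roughly 40,000 / 300,000 = ~133 ms to circle it, and in fibre (propagation around 200,000 km/s) closer to 200 ms. So the constraint amounts to: the expected time for a bottleneck queue to drain to empty should stay under a couple of hundred milliseconds. Below is a purely hypothetical sketch of the kind of per-queue monitoring David describes; every name, constant and threshold in it is invented for illustration, and it is not from his mail or from any real router code.

/* Illustrative only: a per-queue check of the constraint above. All
 * names and constants are invented for this sketch. t is taken as
 * ~200 ms, roughly the time light in fibre (~200,000 km/s) needs to
 * circle the earth (~40,000 km). */
#include <stdbool.h>
#include <stdint.h>

#define T_CLEAR_NS  (200ULL * 1000 * 1000)    /* ~200 ms in nanoseconds */

struct clear_monitor {
    uint64_t busy_since_ns;   /* when the queue last became non-empty; 0 = currently empty */
    uint64_t busy_periods;    /* busy periods observed */
    uint64_t violations;      /* busy periods that lasted longer than T_CLEAR_NS */
};

/* Call on every enqueue/dequeue with the queue length and a monotonic clock. */
static void clear_monitor_update(struct clear_monitor *m,
                                 unsigned int qlen, uint64_t now_ns)
{
    if (qlen == 0 && m->busy_since_ns) {          /* queue just cleared */
        m->busy_periods++;
        if (now_ns - m->busy_since_ns > T_CLEAR_NS)
            m->violations++;
        m->busy_since_ns = 0;
    } else if (qlen > 0 && !m->busy_since_ns) {   /* queue just became non-empty */
        m->busy_since_ns = now_ns;
    }
}

/* "Quite rare over any interval of ten minutes or more": a real
 * implementation would age these counters over ~10-minute windows;
 * the 1% threshold here is arbitrary. */
static bool clear_monitor_healthy(const struct clear_monitor *m)
{
    return m->busy_periods == 0 ||
           (double)m->violations / (double)m->busy_periods < 0.01;
}

The point is only that "the queue clears within t, almost always" is cheap to measure at the queue itself, so a router can tell locally whether it is meeting the constraint.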
> On Monday, September 27, 2021 10:50am, "Bob Briscoe" <research@bobbriscoe.net> said:
>
> > Dave,
> >
> > On 26/09/2021 21:08, Dave Taht wrote:
> > > ... an exploration of smaller mss sizes in response to persistent congestion
> > >
> > > This is in response to two declarative statements in here that I've long disagreed with, involving NOT shrinking the mss, and not trying to do pacing...
> >
> > I would still avoid shrinking the MSS, 'cos you don't know if the congestion constraint is the CPU, in which case you'll make congestion worse. But we'll have to differ on that if you disagree.
> >
> > I don't think that paper said don't do pacing. In fact, it says "...pace the segments at less than one per round trip..."
> >
> > Whatever, that paper was the problem statement, with just some ideas on how we were going to solve it. After that, Asad (added to the distro) did his whole Masters thesis on this - I suggest you look at his thesis and code (pointers below).
> >
> > Also, soon after he'd finished, changes to BBRv2 were introduced to reduce queuing delay with large numbers of flows. You might want to take a look at that too:
> > https://datatracker.ietf.org/meeting/106/materials/slides-106-iccrg-update-on-bbrv2#page=10
> >
> > > https://www.bobbriscoe.net/projects/latency/sub-mss-w.pdf
> > >
> > > Otherwise, for a change, I largely agree with Bob.
> > >
> > > "No amount of AQM twiddling can fix this. The solution has to fix TCP."
> > >
> > > "nearly all TCP implementations cannot operate at less than two packets per RTT"
> >
> > Back to Asad's Master's thesis, we found that just pacing out the packets wasn't enough. There's a very brief summary of the 4 things we found we had to do, in 4 bullets, in this section of our write-up for netdev:
> > https://bobbriscoe.net/projects/latency/tcp-prague-netdev0x13.pdf#subsubsection.3.1.6
> > And I've highlighted a couple of unexpected things that cropped up below.
> >
> > Asad's full thesis:
> > Ahmed, A., "Extending TCP for Low Round Trip Delay", Masters Thesis, Uni Oslo, August 2019.
> >
> > Asad's thesis presentation:
> > https://bobbriscoe.net/presents/1909submss/present_asadsa.pdf
> >
> > Code:
> > https://bitbucket.org/asadsa/kernel420/src/submss/
> >
> > Despite significant changes to basic TCP design principles, the diffs were not that great.
> >
> > A number of tricky problems came up.
> >
> > * For instance, simple pacing when <1 ACK per RTT wasn't that simple. Whenever there were bursts from cross-traffic, the consequent burst in your own flow kept repeating in subsequent rounds. We realized this was because you never have a real ACK clock (you always set the next send time based on previous send times). So we set up the next send time but then re-adjusted it if/when the next ACK did actually arrive.
> >
> > * The additive increase of one segment was the other main problem. When you have such a small window, multiplicative decrease scales fine, but an additive increase of 1 segment is a huge jump in comparison when cwnd is a fraction of a segment. "Logarithmically scaled additive increase" was our solution to that (basically, every time you set ssthresh, alter the additive increase constant using a formula that scales logarithmically with ssthresh, so it's still roughly 1 for the current Internet scale).
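On the first of those two bullets (pacing with less than one ACK per round trip), here is a minimal sketch of the re-adjustment idea as I read it. The structure, the names and the exact adjustment rule are my assumptions, not Asad's code (his actual patches are at the bitbucket link above).

#include <stdint.h>

/* When cwnd < 1 MSS, one segment is sent roughly every
 * srtt * mss / cwnd_bytes, i.e. the send interval exceeds one RTT. */
struct sub_mss_pacer {
    uint64_t interval_ns;     /* current inter-segment send interval (> srtt) */
    uint64_t srtt_ns;         /* smoothed round-trip time estimate */
    uint64_t next_send_ns;    /* when the next segment may be sent */
};

/* Provisional clock: with no real ACK clock, schedule the next send
 * from the previous send time alone. On its own, this replays any
 * timing error caused by a cross-traffic burst in every later round. */
static void on_segment_sent(struct sub_mss_pacer *p, uint64_t now_ns)
{
    p->next_send_ns = now_ns + p->interval_ns;
}

/* If an ACK does arrive, re-anchor the schedule on real feedback
 * instead of on our own previous send, so a one-off burst is not
 * echoed round after round. Subtracting srtt keeps the long-run
 * spacing at roughly interval_ns. (The precise rule in the thesis
 * may differ; this is just the shape of the idea.) */
static void on_ack_received(struct sub_mss_pacer *p, uint64_t now_ns)
{
    p->next_send_ns = now_ns + p->interval_ns - p->srtt_ns;
}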
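And on the second bullet, a sketch of what "logarithmically scaled additive increase" could look like. The mail only gives the shape of the formula (it scales logarithmically with ssthresh and is roughly 1 segment at today's typical window sizes), so the exact expression, the reference scale and the fixed-point units below are all my assumptions, not the formula from the thesis.

#include <math.h>

#define UNIT 1024   /* fixed point: 1024 units = one MSS-sized segment */

/* Recomputed whenever ssthresh is set (i.e. on each congestion event).
 * Returns the additive-increase step, in UNITs of a segment, to add to
 * cwnd per round trip during congestion avoidance. */
static double ai_step_units(double ssthresh_units)
{
    const double REF_SEGS = 64.0;          /* "current Internet scale" (assumed) */
    double ssthresh_segs = ssthresh_units / UNIT;

    /* Equals 1 segment when ssthresh == REF_SEGS, and shrinks
     * logarithmically as ssthresh falls toward (or below) one segment. */
    double step_segs = log2(1.0 + ssthresh_segs) / log2(1.0 + REF_SEGS);
    return step_segs * UNIT;
}

/* Congestion avoidance then does, per RTT:
 *     cwnd_units += ai_step_units(ssthresh_units);
 * instead of the classic cwnd += 1 segment. */

The point is only that the per-RTT increase shrinks in proportion to how small ssthresh is on a log scale, rather than jumping by a whole segment while cwnd is a fraction of a segment.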
> > What became of Asad's work?
> >
> > Altho the code finally worked pretty well {1}, we decided not to pursue it further 'cos a minimum cwnd actually gives a trickle of throughput protection against unresponsive flows (with the downside that it increases queuing delay). That's not to say this isn't worth working on further, but there was more to do to make it bullet-proof, and we were in two minds how important it was, so it worked its way down our priority list.
> >
> > {Note 1: From memory, there was an outstanding problem with one flow remaining dominant if you had step-ECN marking, which we worked out was due to the logarithmically scaled additive increase, but we didn't work on it further to fix it.}
> >
> > Bob
> >
> > --
> > ________________________________________________________________
> > Bob Briscoe                               http://bobbriscoe.net/
>
> _______________________________________________
> Ecn-sane mailing list
> Ecn-sane@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/ecn-sane

--
Please send any postal/overnight deliveries to:
Vint Cerf
1435 Woodhurst Blvd
McLean, VA 22102
703-448-0965 until further notice