[Ecn-sane] paper idea: praising smaller packets

David P. Reed dpreed at deepplum.com
Tue Sep 28 18:15:55 EDT 2021


Upon thinking about this, here's a radical idea:
 
the expected time until a bottleneck link clears, that is, until 0 packets are in the queue to be sent on it, must be < t, where t is an Internet-wide constant corresponding to the time it takes light to circle the earth.
 
This is a local constraint, one that is required of each router. It can be achieved in any of a variety of ways (for example, by choosing to route different flows on different paths that don't include the bottleneck link).
 
It need not be true at all times - but when I say "expected time", I mean that the queue's behavior is monitored so that violations of this bound are quite rare over any interval of ten minutes or more.
 
If a bottleneck link is continuously full for more than the time it takes for packets on a fiber (< light speed) to circle the earth, it is in REALLY bad shape. That must never happen.
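
As a sketch of what that monitoring could look like at a single router, here is a toy Python monitor. Everything in it (the value of t, the ten-minute window, the names) is a placeholder assumption for illustration, not something specified by the proposal:

    import time

    T_BUDGET = 0.2     # seconds; placeholder for t (roughly a light/fiber trip around the earth)
    WINDOW   = 600.0   # seconds; the "ten minutes or more" monitoring interval

    class QueueClearMonitor:
        """Tracks how long a bottleneck queue stays non-empty, and whether any
        busy period within the last WINDOW seconds exceeded T_BUDGET."""

        def __init__(self):
            self.busy_since = None   # when the queue last became non-empty
            self.violations = []     # end times of busy periods longer than T_BUDGET

        def on_queue_depth(self, depth, now=None):
            now = time.monotonic() if now is None else now
            if depth > 0 and self.busy_since is None:
                self.busy_since = now                    # a busy period starts
            elif depth == 0 and self.busy_since is not None:
                if now - self.busy_since > T_BUDGET:     # the bound was violated
                    self.violations.append(now)
                self.busy_since = None                   # the queue has cleared
            # forget violations that have aged out of the monitoring window
            self.violations = [t for t in self.violations if now - t <= WINDOW]

        def healthy(self, now=None):
            now = time.monotonic() if now is None else now
            stuck = self.busy_since is not None and now - self.busy_since > T_BUDGET
            return not stuck and not self.violations

A real router would feed this from its interface queue-occupancy counters and use the result to trigger rerouting or a capacity upgrade, not merely report a boolean.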
 
Why is this important?
 
It's a matter of control theory - if the control loop delay gets longer than its minimum, instability tends to take over no matter what control discipline is used to manage the system.
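
A toy numerical illustration of that point (my own, not from any particular congestion controller): a proportional controller steers a queue toward a target using a measurement that is several ticks stale. The gain, units, and delays below are arbitrary assumptions; the only point is how the same gain behaves as the feedback gets older.

    def simulate(delay, gain=0.5, steps=400, target=5.0):
        # the queue evolves as q[n+1] = max(0, q[n] + gain*(target - q[n-delay])):
        # the sender nudges its rate using a queue measurement that is `delay` ticks old
        q = [0.0] * (delay + 1)          # q[-1] is the current queue length
        peak = 0.0
        for _ in range(steps):
            observed = q[-(delay + 1)]   # stale measurement from `delay` ticks ago
            q.append(max(0.0, q[-1] + gain * (target - observed)))
            peak = max(peak, q[-1])
        swing = max(q[-50:]) - min(q[-50:])   # residual oscillation at the end of the run
        return peak, swing

    for d in (0, 2, 6):
        peak, swing = simulate(d)
        print(f"feedback delay {d} ticks: peak queue {peak:.1f}, residual swing {swing:.1f}")

With fresh feedback the queue settles at the target; with a modest delay it overshoots but still settles; with the measurement six ticks stale the same gain overshoots to several times the target and never stops oscillating.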
 
Now, it is important as hell to avoid bullshit research programs that try to "optimize" utilization of link capacity at 100%. Those research programs focus on the absolute wrong measure - a proxy for "network capital cost" that is in fact the wrong measure of any real network operator's cost structure. The cost of media (wires, airtime, ...) is a tiny fraction of most network operations' cost in any real business or institution. We don't optimize highways by maximizing the number of cars on every stretch of highway, for obvious reasons, but also for non-obvious reasons.
 
Latency and lack of flexibility or reconfigurability impose real costs on a system that are far more significant to end-user value than the cost of the media.
 
A sustained congestion of a bottleneck link is not a feature, but a very serious operational engineering error. People should be fired if they don't prevent that from ever happening, or allow it to persist.
 
This is why telcos, for example, design networks to handle the expected maximum traffic with some excess capacity. This is why networks are constantly being upgraded as load increases, *before* overloads occur.
 
It's an incredibly dangerous and arrogant assumption that operation in a congested mode is acceptable.
 
That's the rationale for the "radical proposal".
 
Sadly, academic thinkers (even ones who have worked in industry research labs on minor aspects) get drawn into solving the wrong problem - optimizing the case that should never happen.
 
Sure, that's helpful - but only in the same sense that, when designing systems where accidents need fallbacks, one needs to design the fallback system to actually work.
 
Operating in a fully congested state - or designing TCP to essentially come close to DDoS behavior on a bottleneck in order to get a publishable paper - is missing the point.
 
 
On Monday, September 27, 2021 10:50am, "Bob Briscoe" <research at bobbriscoe.net> said:



> Dave,
> 
> On 26/09/2021 21:08, Dave Taht wrote:
> > ... an exploration of smaller mss sizes in response to persistent congestion
> >
> > This is in response to two declarative statements in here that I've
> > long disagreed with,
> > involving NOT shrinking the mss, and not trying to do pacing...
> 
> I would still avoid shrinking the MSS, 'cos you don't know if the
> congestion constraint is the CPU, in which case you'll make congestion
> worse. But we'll have to differ on that if you disagree.
> 
> I don't think that paper said don't do pacing. In fact, it says "...pace
> the segments at less than one per round trip..."
> 
> Whatever, that paper was the problem statement, with just some ideas on
> how we were going to solve it.
> After that, Asad (added to the distro) did his whole Master's thesis on
> this - I suggest you look at his thesis and code (pointers below).
> 
> Also soon after he'd finished, changes to BBRv2 were introduced to
> reduce queuing delay with large numbers of flows. You might want to take
> a look at that too:
> https://datatracker.ietf.org/meeting/106/materials/slides-106-iccrg-update-on-bbrv2#page=10
> 
> >
> > https://www.bobbriscoe.net/projects/latency/sub-mss-w.pdf
> >
> > Otherwise, for a change, I largely agree with Bob.
> >
> > "No amount of AQM twiddling can fix this. The solution has to fix TCP."
> >
> > "nearly all TCP implementations cannot operate at less than two packets per
> RTT"
> 
> Back to Asad's Master's thesis, we found that just pacing out the
> packets wasn't enough. There's a very brief summary of the 4 things we
> found we had to do in 4 bullets in this section of our write-up for netdev:
> https://bobbriscoe.net/projects/latency/tcp-prague-netdev0x13.pdf#subsubsection.3.1.6
> And I've highlighted a couple of unexpected things that cropped up below.
> 
> Asad's full thesis:
>     Ahmed, A., "Extending TCP for Low Round Trip Delay",
>     Masters Thesis, Uni Oslo, August 2019,
>     <https://www.duo.uio.no/handle/10852/70966>.
> Asad's thesis presentation:
>     https://bobbriscoe.net/presents/1909submss/present_asadsa.pdf
> 
> Code:
>     https://bitbucket.org/asadsa/kernel420/src/submss/
> Despite significant changes to basic TCP design principles, the diffs
> were not that great.
> 
> A number of tricky problems came up.
> 
> * For instance, simple pacing when <1 ACK per RTT wasn't that simple.
> Whenever there were bursts from cross-traffic, the consequent burst in
> your own flow kept repeating in subsequent rounds. We realized this was
> because you never have a real ACK clock (you always set the next send
> time based on previous send times). So we set up the next send time
> but then re-adjusted it if/when the next ACK did actually arrive.
> 
> * The additive increase of one segment was the other main problem. When
> you have such a small window, multiplicative decrease scales fine, but
> an additive increase of 1 segment is a huge jump in comparison, when
> cwnd is a fraction of a segment. "Logarithmically scaled additive
> increase" was our solution to that (basically, every time you set
> ssthresh, alter the additive increase constant using a formula that
> scales logarithmically with ssthresh, so it's still roughly 1 for the
> current Internet scale).
> 
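
To make those two fixes concrete, here is a very rough Python sketch of both ideas as described above. It is not Asad's code; the class, the re-anchoring rule, and the exact logarithmic formula are plausible guesses for illustration only (the thesis and the bitbucket tree linked above contain the real versions):

    import math

    class SubMssSketch:
        """Toy model of a sender whose congestion window can fall below one segment."""

        REF_SSTHRESH = 64.0   # segments; assumed "current Internet scale" where AI should stay ~1

        def __init__(self, srtt, cwnd):
            self.srtt = srtt         # smoothed RTT estimate, in seconds
            self.cwnd = cwnd         # congestion window, in segments (may be < 1)
            self.ai = 1.0            # additive increase per RTT, in segments
            self.next_send = 0.0     # pacing: earliest time the next segment may go out

        def pacing_interval(self):
            # one segment per (RTT / cwnd); e.g. cwnd = 0.25 -> one segment every 4 RTTs
            return self.srtt / self.cwnd

        def on_segment_sent(self, now):
            # provisional schedule, derived only from previous send times
            self.next_send = now + self.pacing_interval()

        def on_ack(self, now):
            # re-anchor the schedule to the real ACK clock when an ACK does arrive,
            # so a burst caused by cross-traffic is not replayed in later rounds
            self.next_send = now + self.pacing_interval()
            # per-ACK form of "+ai segments per RTT"; works for fractional cwnd too
            self.cwnd += self.ai / max(self.cwnd, 1e-3)

        def on_congestion(self):
            # multiplicative decrease scales fine even for tiny windows...
            ssthresh = self.cwnd / 2.0
            # ...but the additive-increase constant is rescaled logarithmically with
            # ssthresh, so it stays ~1 at today's window sizes and shrinks when
            # ssthresh is a fraction of a segment
            self.ai = max(0.01, math.log2(1.0 + ssthresh) / math.log2(1.0 + self.REF_SSTHRESH))
            self.cwnd = max(ssthresh, 0.05)   # arbitrary floor well below one segment

The logarithmic rescaling is the key point: multiplicative decrease already scales with the window, but a fixed +1 segment per RTT does not, so the increase constant has to shrink as the operating point shrinks.
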
> What became of Asad's work?
> Altho the code finally worked pretty well {1}, we decided not to pursue
> it further 'cos a minimum cwnd actually gives a trickle of throughput
> protection against unresponsive flows (with the downside that it
> increases queuing delay). That's not to say this isn't worth working on
> further, but there was more to do to make it bullet proof, and we were
> in two minds how important it was, so it worked its way down our
> priority list.
> 
> {Note 1: From memory, there was an outstanding problem with one flow
> remaining dominant if you had step-ECN marking, which we worked out was
> due to the logarithmically scaled additive increase, but we didn't work
> on it further to fix it.}
> 
> 
> 
> Bob
> 
> 
> --
> ________________________________________________________________
> Bob Briscoe http://bobbriscoe.net/
> 

