[Ecn-sane] [tsvwg] per-flow scheduling

David P. Reed dpreed at deepplum.com
Wed Jun 26 12:31:46 EDT 2019


It's the limiting case, but also the optimal state given "perfect knowledge".
 
Yes, it requires that the source-destination pairs sharing the link in question coordinate their packet admission times so they don't "collide" at the link. Ideally the next packet would arrive during the previous packet's transmission, so it is ready-to-go when that packet's transmission ends.
 
Such exquisite coordination is feasible when future behavior by source and destination at the interface is known, which requires an Oracle.
That's the same kind of condition most information theoretic and queueing theoretic optimality requires.
 
But this is worth keeping in mind as the overall joint goal of all users.
 
In particular,  "link utilization" isn't a user goal at all. The link is there and is being paid for whether it is used or not (looking from the network structure as a whole). Its capacity exists to move packets out of the way. An ideal link satisfies the requirement that it never creates a queue because of anything other than imperfect coordination of the end-to-end flows mapped onto it. That's why the router should not be measured by "link utilization" anymore than a tunnel in a city during commuting hours should be measured by cars moved per hour. Clearly a tunnel can be VERY congested and moving many cars if they are attached to each other bumper to bumper - the latency through the tunnel would then be huge. If the cars were tipped on their ends and stacked, even more throughput would be achieved through the tunnel, and the delay of rotating them and packing them would add even more delay.
 
The idea that "link utilization" of 100% must be achieved is why we got bufferbloat designed into routers. It's a worm's eye perspective. To this day, Arista Networks brags about how its bufferbloated feature design optimizes switch utilization ([ https://packetpushers.net/aristas-big-buffer-b-s/ ]( https://packetpushers.net/aristas-big-buffer-b-s/ )). And it selects benchmarks to "prove" it. Andy Bechtolsheim apparently is such a big name that he can sell defective gear at a premium price, letting the datacenters who buy it discover that those switches get "clogged up" by TCP traffic when they are the "bottleneck link". Fortunately, they are fast, so they are less frequently the bottleneck in datacenter daily use.
 
In trying to understand what is going on with congestion signalling, any buffering at the entry to the link should be due only to imperfect information being fed back to the endpoints generating traffic. Because a misbehaving endpoint generates Denial of Service for all other users.
 
Priority mechanisms focused on protecting high-paying users from low-paying ones don't help much - they only help at overloaded states of the network. Which isn't to say that priority does nothing - it's just that stable assignment of a sharing level to priority levels isn't easy.  (See Paris Metro Pricing, where there are only two classes, and the problem of deciding how to manage the access to the "first class" section - the idea that 15 classes with different metrics can be handled simply and interoperably between differently managed autonomous systems seems to be an incredibly impractical goal).
Even in the priority case, buffering is NOT a desirable end user thing.
 
My personal view is that the manager of a network needs to configure the network so that no link ever gets overloaded, if possible. The response to overload should be to tell the relevant flows to all slow down (not just one, because if there are 100 flows that start up roughly at the same time, causing MD on one does very little. This is an example of something where per-flow stuff in the router actually makes the router helpful in the large scheme of things. Maybe all flows should be equally informed, as flows. Which means the router needs to know how to signal multiple flows, while not just hammering all the packets of a single flow.  This case is very real, but not as frequently on the client side as on the "server side" in "load balancers" and such like.
 
My point here is simple:
 
1) the endpoints tell the routers what flows are going through a link already. That's just the address information. So that information can be used for fairness pretty well, especially if short term memory (a bloom filter, perhaps) can track a sufficiently large number of flows.
 
2) The per-flow decisions related to congestion control within a flow are necessarily end-to-end in nature - the router can only tell the ends what is going on, but the ends (together - their admissions rates and consumption rates are coupled to the use being made) must be informed and decide. The congestion management must combine information about the source and the destination future behavior (even if it is just taking recent history and projecting it as an estimate of future behavior at source and destination). Which is why it is quite natural to have routers signal the destination, which then signals the source, which changes its behavior.
 
3) there are definitely other ways to improve latency for IP and protocols built on top of it  - routing some flows over different paths under congestion is one. call the per-flow routing. Another is scattering a flow over several paths (but that seems problematic for today's TcP which assumes all packets take the same path).
 
4) A different, but very coupled view of IP is that any application-relevant buffering shoujld be driven into the endpoints - at the source, buffering is useful to deal with variability in the rate of production of data to be sent. At the destination, buffering is useful to minimize jitter, matching to the consumption behavior of the application.  But these buffers should not be pushed into the network where they cause congestion for other flows sharing resources.
So buffering in the network should ONLY deal with the uncertainty in resource competition.
 
This tripartite breakdown of buffering is protocol independent. It applies to TCP, NTP, RTP, QUIC/UDP, ...  It's what we (that is me) had in mind when we split UDP out of TCP, allowing UDP based protocols to manage source and destination buffering in the application for all the things we thought UDP would be used for - packet speech, computer-computer remote procedure calls (what would be QUIC today), SATNET/interplanetary Internet connections , ...).
 
Sadly, in the many years since the late 1970's the tendency to think file transfers between infinite speed storage devices over TCP are the only relevant use of the Internet has penetrated the router design community. I can't seem to get anyone to recognize how far we are from that.  No one runs benchmarks for such behavior, no one even measures anything other than the "hot rod" maximum throughput cases.
 
And many egos seem to think that working on the hot rod cases is going to make their career or sell product.  (e.g. the sad case of Arista).
 
 
On Wednesday, June 26, 2019 8:48am, "Sebastian Moeller" <moeller0 at gmx.de> said:



> 
> 
> > On Jun 23, 2019, at 00:09, David P. Reed <dpreed at deepplum.com> wrote:
> >
> > [...]
> >
> > per-flow scheduling is appropriate on a shared link. However, the end-to-end
> argument would suggest that the network not try to divine which flows get
> preferred.
> > And beyond the end-to-end argument, there's a practical problem - since the
> ideal state of a shared link means that it ought to have no local backlog in the
> queue, the information needed to schedule "fairly" isn't in the queue backlog
> itself. If there is only one packet, what's to schedule?
> >
> [...]
> 
> Excuse my stupidity, but the "only one single packet" case is the theoretical
> limiting case, no?
> Because even on a link not running at capacity this effectively requires a
> mechanism to "synchronize" all senders (whose packets traverse the hop we are
> looking at), as no other packet is allowed to reach the hop unless the "current"
> one has been passed to the PHY otherwise we transiently queue 2 packets (I note
> that this rationale should hold for any small N). The more packets per second a
> hop handles the less likely it will be to avoid for any newcomer to run into an
> already existing packet(s), that is to transiently grow the queue.
> Not having a CS background, I fail to see how this required synchronized state can
> exist outside of a few steady state configurations where things change slowly
> enough that the seemingly required synchronization can actually happen (given
> that the feedback loop e.g. through ACKs, seems somewhat jittery). Since packets
> never know which path they take and which hop is going to be critical there seems
> to be no a priori way to synchronize all senders, heck I fail to see whether it
> would be possible at all to guarantee synchronized behavior on more than one hop
> (unless all hops are extremely uniform).
> I happen to believe that L4S suffers from the same conceptual issue (plus overly
> generic promises, from the RITE website:
> "We are so used to the unpredictability of queuing delay, we don’t know how
> good the Internet would feel without it. The RITE project has developed simple
> technology to make queuing delay a thing of the past—not just for a select
> few apps, but for all." this seems missing a conditions apply statement)
> 
> Best Regards
> Sebastian
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.bufferbloat.net/pipermail/ecn-sane/attachments/20190626/e08648be/attachment.html>


More information about the Ecn-sane mailing list