[Ecn-sane] [tsvwg] per-flow scheduling

Jonathan Morton chromatix99 at gmail.com
Sat Jun 22 19:07:08 EDT 2019


> On 23 Jun, 2019, at 1:09 am, David P. Reed <dpreed at deepplum.com> wrote:
> 
> per-flow scheduling is appropriate on a shared link. However, the end-to-end argument would suggest that the network not try to divine which flows get preferred.
> And beyond the end-to-end argument, there's a practical problem - since the ideal state of a shared link means that it ought to have no local backlog in the queue, the information needed to schedule "fairly" isn't in the queue backlog itself.  If there is only one packet, what's to schedule?

This is a great straw-man.  Allow me to deconstruct it.

What DRR++ has empirically demonstrated is that flows can be classified into two categories - sparse and saturating - very easily, using the heuristic that a saturating flow's arrival rate exceeds its available delivery rate, while a sparse flow's arrival rate stays below it.

An excessive arrival rate results in a standing queue; with Reno, the excess arrival rate after capacity is reached is precisely one segment per RTT, which is tiny compared with modern link capacities.  If there is no overall standing queue, then by definition all of the flows passing through are currently sparse.  DRR++ (as implemented in fq_codel and Cake) ensures that all sparse traffic is processed with minimum delay and no AQM activity, while saturating traffic is metered out fairly and given appropriate AQM signals.
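In sketch form, the dequeue logic looks roughly like this - simplified from the real fq_codel code; the list helpers, names and the omitted per-flow queues are mine, purely illustrative:

    #include <stdbool.h>
    #include <stddef.h>

    struct flow {
        struct flow *next;     /* linkage in new_flows or old_flows */
        int          deficit;  /* DRR byte credit for this round */
        /* per-flow packet queue omitted for brevity */
    };

    /* Illustrative primitives; a real qdisc uses the kernel's list
       and queue machinery.  Assume the obvious FIFO semantics. */
    extern struct flow *new_flows, *old_flows;
    extern int quantum;                      /* typically one MTU */
    extern struct flow *list_pop(struct flow **head);
    extern void list_push_head(struct flow **head, struct flow *f);
    extern void list_push_tail(struct flow **head, struct flow *f);
    extern void *flow_dequeue_packet(struct flow *f, int *len);

    /* Sparse flows live on new_flows and are always served first;
       a flow that keeps packets queued migrates to old_flows, where
       saturating flows share capacity by deficit round robin. */
    void *drrpp_dequeue(void)
    {
        for (;;) {
            bool from_new = (new_flows != NULL);
            struct flow *f = list_pop(from_new ? &new_flows : &old_flows);

            if (!f)
                return NULL;                 /* nothing queued at all */

            if (f->deficit <= 0) {
                /* Credit exhausted: top up and rotate to the back of
                   old_flows - this is what demotes a flow from sparse
                   to saturating once it exceeds one quantum. */
                f->deficit += quantum;
                list_push_tail(&old_flows, f);
                continue;
            }

            int len;
            void *pkt = flow_dequeue_packet(f, &len);
            if (!pkt) {
                /* Flow went empty: give a new flow one extra round on
                   old_flows, otherwise forget it entirely. */
                if (from_new)
                    list_push_tail(&old_flows, f);
                continue;
            }

            f->deficit -= len;
            /* The flow keeps the head of its list until its deficit
               runs out, as in fq_codel. */
            list_push_head(from_new ? &new_flows : &old_flows, f);
            return pkt;
        }
    }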

> In fact, what the ideal queueing discipline would do is send signals to the endpoints that provide information as to what each flow's appropriate share is, and/or how far its current share is from what's fair.

The definition of which flows are sparse and which are saturating shifts dynamically in response to endpoint behaviour.

> Well, presumably the flows have definable average rates.

Today's TCP traffic exhibits the classic sawtooth behaviour - the sawtooth has a different shape and period under CUBIC than under Reno, but is fundamentally similar.  The sender probes for capacity by increasing its send rate until a congestion signal is fed back to it, at which point it backs off sharply.  With efficient AQM action, a TCP flow therefore spends most of its time "sparse", using less than the available path capacity, with occasional excursions into "saturating" territory which are fairly promptly terminated by AQM signals.

So TCP does *not* have a definable "average rate".  It grows to fill available capacity, just like the number of cars on a motorway network.
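To put rough numbers on the sawtooth - the link rate and RTT here are assumptions for illustration, not measurements - take Reno on a 20Mbit/s link at 100ms RTT:

    #include <stdio.h>

    int main(void)
    {
        const double link_bps = 20e6;      /* assumed bottleneck rate */
        const double rtt_s    = 0.100;     /* assumed path RTT */
        const double mss_bits = 1500 * 8;  /* full-size segment */

        /* Congestion window (in segments) that exactly fills the pipe. */
        double w = link_bps * rtt_s / mss_bits;   /* ~167 segments */

        /* Reno halves cwnd on a congestion signal, then climbs back at
           one segment per RTT, so one sawtooth period is W/2 RTTs, and
           with the queue kept empty the flow averages roughly 75% of
           whatever capacity happens to be available. */
        double period_s = (w / 2.0) * rtt_s;      /* ~8.3 seconds */

        printf("W = %.0f segments, sawtooth period = %.1f s\n",
               w, period_s);
        return 0;
    }

Change the available capacity and both W and the "average rate" change with it; there is no fixed parameter to schedule against.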

The recent work on high-fidelity ECN (including SCE) aims to eliminate the sawtooth, so that a flow drops out of "saturating" mode faster and by only a small margin, wasting less capacity and reducing peak delays - very close to the ideal control you describe.  But it's still necessary to avoid giving these signals to "sparse" flows - which would cause them to back off and thus waste capacity - and to apply them only to "saturating" flows that have just begun to build a queue.  It's also necessary to protect these well-behaved "modern" flows from "legacy" endpoint behaviour, and vice versa.  DRR++ does all of that very naturally.
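In sketch form, the per-flow signalling decision is something like this - the thresholds and names are mine, purely illustrative; a real implementation uses CoDel's sojourn-time machinery or an SCE marking ramp:

    enum signal { SIG_NONE, SIG_SCE, SIG_CE };

    struct flow_state {
        double sojourn_ms;  /* queue delay seen by this flow's head packet */
    };

    /* Sparse flows stay below the ramp and are left alone; a flow that
       starts building a standing queue first gets the gentle
       high-fidelity signal, and only a badly overshooting (or legacy,
       non-SCE) flow gets the full CE-mark or drop. */
    enum signal aqm_signal(const struct flow_state *f)
    {
        const double sce_thresh_ms = 1.0;  /* assumed ramp start */
        const double ce_thresh_ms  = 5.0;  /* assumed legacy threshold */

        if (f->sojourn_ms < sce_thresh_ms)
            return SIG_NONE;
        if (f->sojourn_ms < ce_thresh_ms)
            return SIG_SCE;
        return SIG_CE;
    }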

> Merely re-ordering the packets on a link is just not very effective at achieving fairness.

I'm afraid this assertion is simply false.  DRR++ does precisely that, and achieves near-perfect fairness.

It is important, however, to define "flow" correctly relative to the measure of fairness you want to achieve.  Traditionally the unique 5-tuple is used to define a "flow", but this means applications can game the system by opening multiple flows.  For an ISP, a better definition might be that each subscriber's traffic is one "flow".  Or there is a tweak to DRR++ which allows a two-layer fairness definition, implemented successfully in Cake.
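In implementation terms it's just a choice of hash key, roughly like this - the field names and hash function are illustrative, not Cake's actual code:

    #include <stdint.h>
    #include <stddef.h>

    struct pkt_keys {
        uint32_t src_ip, dst_ip;      /* IPv4 for brevity */
        uint16_t src_port, dst_port;
        uint8_t  proto;
    };

    extern uint32_t hash_bytes(const void *data, size_t len);

    /* Per-flow fairness: each 5-tuple gets its own queue, which an
       application can game by opening many connections.  (A real
       implementation must avoid hashing struct padding bytes.) */
    uint32_t flow_key_5tuple(const struct pkt_keys *k)
    {
        return hash_bytes(k, sizeof *k);
    }

    /* Per-subscriber fairness: all of one customer's traffic shares a
       single queue, however many connections it comprises. */
    uint32_t flow_key_subscriber(const struct pkt_keys *k)
    {
        return hash_bytes(&k->src_ip, sizeof k->src_ip);
    }

Cake's two-layer mode effectively nests the first key inside the second, so hosts are isolated from each other and flows are isolated within each host.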

> So the end-to-end approach would suggest moving most of the scheduling back to the endpoints of each flow, with the role of the routers being to extract information about the competing flows that are congesting the network, and forwarding those signals (via drops or marking) to the endpoints. That's because, in the end-to-end argument that applies here - the router cannot do the entire function of managing congestion or priority.

It must be remembered that congestion signals require one RTT to circulate from the bottleneck, via the receiver, back to the sender, and for their effects then to be felt at the bottleneck.  That's typically a much longer response time (say 100ms for a general Internet path) than packet scheduling can achieve (sub-millisecond for a 20Mbps link), so by fundamental control theory it exerts only a looser control.  Both mechanisms are useful, and they complement each other.
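To put numbers on that, using the same assumed figures as above: a 1500-byte packet serialises in (1500 * 8) / 20e6 = 0.6ms at 20Mbps, so the scheduler gets to act roughly 170 times for every one adjustment the end-to-end feedback loop can make over a 100ms path.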

My personal interpretation of the end-to-end principle is that endpoints generally do not, cannot, and *should not* be aware of the topology of the network between them, nor of any other traffic that might be sharing that network.  The network itself takes care of those details, and may send standardised control-feedback signals to the endpoints to inform them about actions they need to take.  These currently take the form of ICMP error packets and the ECN field, the latter substituted by packet drops on Not-ECT flows.

 - Jonathan Morton

