Date: Wed, 26 Jun 2019 12:53:02 -0400 (EDT)
From: "David P. Reed" <dpreed@deepplum.com>
To: "David P. Reed" <dpreed@deepplum.com>
Cc: "Sebastian Moeller" <moeller0@gmx.de>, "ecn-sane@lists.bufferbloat.net", "Brian E Carpenter", "tsvwg IETF list"
Subject: Re: [Ecn-sane] [tsvwg] per-flow scheduling

A further minor thought, maybe one that need not be said:

Flows aren't "connections". Routers are not involved in connection state management, which is purely part of the end-to-end protocol. Anything about "connections" that a router might need to know to handle a packet should be packaged into the IP header of each packet in a standard form. Routers can "store" this information, associated with the (source, destination) pair, if they want, for a short time, subject to well-understood semantics for when they run out of storage. This fits into an end-to-end argument as an optimization of a kind, as long as the function of such information is very narrowly and generally defined, to benefit all users of IP-based protocols.

For example, remembering, for a short time, when the last packet of a particular flow was forwarded, in order to calculate fairness: that seems like a very useful idea, as long as forgetting that timestamp is not unfair.

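To make that concrete, here is a toy sketch (Python, with made-up names and an illustrative expiry window - not a design for a real forwarding path) of such short-term, deliberately forgettable per-flow memory:

    import time

    FLOW_MEMORY_WINDOW = 0.5  # seconds of history to keep; illustrative only

    # (src, dst) -> time the last packet of that flow was forwarded
    last_seen = {}

    def note_packet(src, dst, now=None):
        """Record a forwarded packet and return the inter-packet gap.
        A small gap means the flow is currently sending fast - exactly
        the kind of hint a fairness decision could use."""
        now = time.monotonic() if now is None else now
        gap = now - last_seen.get((src, dst), now)
        last_seen[(src, dst)] = now
        return gap

    def expire_old_flows(now=None):
        """Forget flows not seen within the window. Running out of memory
        then just means treating a flow as new - which is not unfair."""
        now = time.monotonic() if now is None else now
        stale = [k for k, t in last_seen.items() if now - t > FLOW_MEMORY_WINDOW]
        for k in stale:
            del last_seen[k]
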
This use of the flow's IP headers to carry information into router queueing and routing decisions is analogous to the "Fate Sharing" principle of protocol design that DDC (David D. Clark) describes. Instead of having an independent control-plane protocol, which has all kinds of synchronization problems and combinatorial packet-loss problems, "Fate Sharing" of protocol information is very elegant.

On Wednesday, June 26, 2019 12:31pm, "David P. Reed" <dpreed@deepplum.com> said:

It's the limiting case, but also the optimal state given "perfect knowledge".

Yes, it requires that the source-destination pairs sharing the link in question coordinate their packet admission times so they don't "collide" at the link. Ideally the next packet would arrive during the previous packet's transmission, so it is ready to go when that packet's transmission ends.

Such exquisite coordination is feasible only when the future behavior of source and destination at the interface is known, which requires an oracle.

That's the same kind of condition most information-theoretic and queueing-theoretic optimality results require.

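A toy single-link simulation (my own construction, purely to illustrate the point) shows how knife-edge that ideal is: with oracle pacing the queue never forms even at 100% load, while a little arrival jitter at the same load immediately creates transient backlog:

    import random

    def worst_wait(arrival_times, service_time):
        """FIFO link: the worst time any packet waits behind others
        before its own transmission starts."""
        link_free_at, worst = 0.0, 0.0
        for t in sorted(arrival_times):
            worst = max(worst, link_free_at - t)  # time spent queued, if any
            link_free_at = max(link_free_at, t) + service_time
        return worst

    S = 1.0                                # transmission time per packet
    oracle = [i * S for i in range(1000)]  # each packet lands exactly as the
                                           # previous transmission ends
    jittered = [t + random.uniform(-0.2, 0.2) * S for t in oracle]

    print(worst_wait(oracle, S))    # 0.0: full utilization, yet no queue
    print(worst_wait(jittered, S))  # > 0: same load, transient queues appear
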
But this is worth keeping in mind as the overall joint goal of all users.

In particular, "link utilization" isn't a user goal at all. The link is there and is being paid for whether it is used or not (looking at the network structure as a whole). Its capacity exists to move packets out of the way. An ideal link never creates a queue because of anything other than imperfect coordination of the end-to-end flows mapped onto it. That's why a router should no more be measured by "link utilization" than a tunnel in a city during commuting hours should be measured by cars moved per hour. Clearly a tunnel can be VERY congested and still moving many cars if they are attached to each other bumper to bumper - the latency through the tunnel is then huge. If the cars were tipped on their ends and stacked, even more throughput would be achieved, and the rotating and packing would add even more delay.

The idea that "link utilization" of 100% must be achieved= is why we got bufferbloat designed into routers. It's a worm's eye perspec= tive. To this day, Arista Networks brags about how its bufferbloated featur= e design optimizes switch utilization (https://packetpushers.net/aristas-big-buffer-b-s/= ). And it selects benchmarks to "prove" it. Andy Bechtolsheim apparentl= y is such a big name that he can sell defective gear at a premium price, le= tting the datacenters who buy it discover that those switches get "clogged = up" by TCP traffic when they are the "bottleneck link". Fortunately, they a= re fast, so they are less frequently the bottleneck in datacenter daily use= .

=0A

 

=0A

In trying to understand what is going on with congestion signalling, any buffering at the entry to a link should be due only to imperfect information being fed back to the endpoints generating traffic - because a misbehaving endpoint creates denial of service for all other users.

Priority mechanisms focused on protecting high-paying users from low-paying ones don't help much - they only matter in overloaded states of the network. Which isn't to say that priority does nothing; it's just that stable assignment of a sharing level to priority levels isn't easy. (See Paris Metro Pricing, where there are only two classes, and even deciding how to manage access to the "first class" section is a problem - the idea that 15 classes with different metrics can be handled simply and interoperably between differently managed autonomous systems seems an incredibly impractical goal.)

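A back-of-the-envelope fluid model (my own illustration, not anyone's product) makes the "only matters at overload" point concrete: strict priority merely decides who absorbs the overload; it does not remove it.

    def shares(rate_first, rate_second, capacity):
        """Two-class, Paris-Metro-style link under strict priority.
        Returns the served rate of each class plus the excess traffic
        that must queue or be dropped."""
        first = min(rate_first, capacity)
        second = min(rate_second, capacity - first)
        return first, second, (rate_first + rate_second) - (first + second)

    # Underloaded: priority is irrelevant; everything gets through.
    print(shares(0.3, 0.4, 1.0))  # (0.3, 0.4, 0.0)

    # Overloaded: first class is protected, but the excess (0.3 here)
    # still piles up somewhere - priority didn't make it disappear.
    print(shares(0.6, 0.7, 1.0))  # (0.6, 0.4, 0.3)
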
Even in the priority case, buffering is NOT a desirable end-user thing.

My personal view is that the manager of a network needs to configure the network so that no link ever gets overloaded, if possible. The response to overload should be to tell the relevant flows to all slow down - not just one, because if there are 100 flows that started at roughly the same time, causing multiplicative decrease on just one of them does very little. This is an example of something where per-flow state in the router actually makes the router helpful in the large scheme of things. Maybe all flows should be equally informed, as flows - which means the router needs to know how to signal multiple flows, while not just hammering all the packets of a single flow. This case is very real, though less frequent on the client side than on the "server side", in "load balancers" and such like.

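As a sketch of what "equally informing all the flows" might look like (my illustration, not a deployed AQM), a router could spread congestion signals across distinct flows - at most one mark per flow per round of overload - with the flow identity taken straight from the packet's source/destination addresses:

    from collections import OrderedDict

    recently_signaled = OrderedDict()  # flow-id -> True, oldest first
    FLOWS_REMEMBERED = 64              # short-term memory size; illustrative

    def should_mark(flow_id, overloaded):
        """True if this packet should carry a congestion signal (for
        example an ECN CE mark): at most one per flow per memory window."""
        if not overloaded or flow_id in recently_signaled:
            return False
        recently_signaled[flow_id] = True
        if len(recently_signaled) > FLOWS_REMEMBERED:
            recently_signaled.popitem(last=False)  # forget the oldest flow
        return True

With 100 simultaneous flows, each one gets told to slow down once, so all of them back off a little, instead of one unlucky flow absorbing all the multiplicative decrease.
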
My point here is simple:

1) The endpoints already tell the routers what flows are going through a link - that's just the address information. So that information can be used for fairness pretty well, especially if a short-term memory (a Bloom filter, perhaps - see the sketch after this list) can track a sufficiently large number of flows.

2) The per-flow decisions related to congestion control within a flow are necessarily end-to-end in nature - the router can only tell the ends what is going on, but the ends (together - their admission rates and consumption rates are coupled to the use being made) must be informed and decide. The congestion management must combine information about future source and destination behavior (even if that is just taking recent history and projecting it forward as an estimate). Which is why it is quite natural to have routers signal the destination, which then signals the source, which changes its behavior.

3) There are definitely other ways to improve latency for IP and the protocols built on top of it. Routing some flows over different paths under congestion is one - call it per-flow routing. Another is scattering a flow over several paths, though that seems problematic for today's TCP, which assumes all packets of a connection take the same path.

4) A different, but closely coupled, view of IP is that any application-relevant buffering should be driven into the endpoints. At the source, buffering is useful to absorb variability in the rate at which the data to be sent is produced. At the destination, buffering is useful to minimize jitter, matching delivery to the consumption behavior of the application. But these buffers should not be pushed into the network, where they cause congestion for other flows sharing resources.

=0A<= p style=3D"margin:0;padding:0;margin: 0; padding: 0; font-family: arial; fo= nt-size: 12pt; overflow-wrap: break-word;">So buffering in the network shou= ld ONLY deal with the uncertainty in resource competition.

=0A

 

=0A

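Returning to point 1: a minimal sketch of the Bloom-filter idea (illustrative parameters; a real design would rotate two filters so that old flows are forgotten, per the short-term-memory semantics above):

    import hashlib

    class FlowFilter:
        """Tiny Bloom filter over (src, dst) pairs: fixed memory however
        many flows pass through; false positives are the only error."""

        def __init__(self, bits=1 << 16, hashes=4):
            self.bits, self.hashes = bits, hashes
            self.bitmap = bytearray(bits // 8)

        def _positions(self, src, dst):
            for i in range(self.hashes):
                h = hashlib.blake2b(f"{src}|{dst}|{i}".encode(), digest_size=8)
                yield int.from_bytes(h.digest(), "big") % self.bits

        def add(self, src, dst):
            for p in self._positions(src, dst):
                self.bitmap[p // 8] |= 1 << (p % 8)

        def seen(self, src, dst):
            return all(self.bitmap[p // 8] & (1 << (p % 8))
                       for p in self._positions(src, dst))

    flows = FlowFilter()
    flows.add("10.0.0.1", "10.0.0.2")
    print(flows.seen("10.0.0.1", "10.0.0.2"))  # True
    print(flows.seen("10.0.0.3", "10.0.0.4"))  # False (almost certainly)
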
This tripartite breakdown of buffering is protocol-independent. It applies to TCP, NTP, RTP, QUIC/UDP, ... It's what we (that is, me) had in mind when we split UDP out of TCP, allowing UDP-based protocols to manage source and destination buffering in the application, for all the things we thought UDP would be used for: packet speech, computer-to-computer remote procedure calls (what would be QUIC today), SATNET/interplanetary Internet connections, ...

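For the destination side of that breakdown - packet speech, say - the classic form is a playout buffer. A bare-bones sketch (fixed playout delay and made-up numbers, just to show the shape of it):

    import heapq

    class PlayoutBuffer:
        """Destination-side jitter buffer: hold each packet just long
        enough that the application sees a smooth, in-order stream."""

        def __init__(self, playout_delay):
            self.playout_delay = playout_delay  # jitter budget, in seconds
            self.pending = []  # heap of (seq, due_time, payload)

        def arrive(self, seq, payload, now):
            # Simplification: schedule playout relative to arrival time.
            heapq.heappush(self.pending, (seq, now + self.playout_delay, payload))

        def playable(self, now):
            """Release, in sequence order, packets whose playout time has come."""
            out = []
            while self.pending and self.pending[0][1] <= now:
                seq, _, payload = heapq.heappop(self.pending)
                out.append((seq, payload))
            return out

    buf = PlayoutBuffer(playout_delay=0.05)  # 50 ms jitter budget
    buf.arrive(1, "frame-1", now=0.000)
    buf.arrive(2, "frame-2", now=0.031)      # jittered arrival
    print(buf.playable(now=0.060))           # [(1, 'frame-1')]
    print(buf.playable(now=0.090))           # [(2, 'frame-2')]
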
Sadly, in the many years since the late 1970s, the tendency to think that file transfers between infinite-speed storage devices over TCP are the only relevant use of the Internet has penetrated the router design community. I can't seem to get anyone to recognize how far we are from that. No one runs benchmarks for such behavior; no one even measures anything other than the "hot rod" maximum-throughput cases.

And many egos seem to think that working on the hot-rod cases is going to make their career or sell product (e.g., the sad case of Arista).

On Wednesday, June 26, 2019 8:48am, "Sebastian Moeller" <moeller0@gmx.de> said:

> > On Jun 23, 2019, at 00:09, David P. Reed <dpreed@deepplum.com> wrote:
> >
> > [...]
> >
> > per-flow scheduling is appropriate on a shared link. However, the end-to-end
> > argument would suggest that the network not try to divine which flows get
> > preferred.
> > And beyond the end-to-end argument, there's a practical problem - since the
> > ideal state of a shared link means that it ought to have no local backlog in
> > the queue, the information needed to schedule "fairly" isn't in the queue
> > backlog itself. If there is only one packet, what's to schedule?
> >
> [...]
>
> Excuse my stupidity, but the "only one single packet" case is the theoretical
> limiting case, no?
> Because even on a link not running at capacity, this effectively requires a
> mechanism to "synchronize" all senders (whose packets traverse the hop we are
> looking at), as no other packet is allowed to reach the hop unless the "current"
> one has been passed to the PHY; otherwise we transiently queue 2 packets (I note
> that this rationale should hold for any small N). The more packets per second a
> hop handles, the less likely any newcomer is to avoid running into already
> existing packet(s), that is, to transiently grow the queue.
> Not having a CS background, I fail to see how this required synchronized state
> can exist outside of a few steady-state configurations where things change
> slowly enough that the seemingly required synchronization can actually happen
> (given that the feedback loop, e.g. through ACKs, seems somewhat jittery). Since
> packets never know which path they take and which hop is going to be critical,
> there seems to be no a priori way to synchronize all senders; heck, I fail to
> see whether it would be possible at all to guarantee synchronized behavior on
> more than one hop (unless all hops are extremely uniform).
> I happen to believe that L4S suffers from the same conceptual issue (plus overly
> generic promises; from the RITE website: "We are so used to the unpredictability
> of queuing delay, we don't know how good the Internet would feel without it. The
> RITE project has developed simple technology to make queuing delay a thing of
> the past—not just for a select few apps, but for all." This seems to be missing
> a "conditions apply" statement.)
>
> Best Regards
> Sebastian