From: "David P. Reed" <dpreed@deepplum.com>
To: "Sebastian Moeller" <moeller0@gmx.de>
Cc: "Jonathan Morton", ecn-sane@lists.bufferbloat.net, "Brian E Carpenter", "tsvwg IETF list"
Date: Wed, 26 Jun 2019 12:31:46 -0400 (EDT)
Message-ID: <1561566706.778820831@apps.rackspace.com>
In-Reply-To: <4E863FC5-D30E-4F76-BDF7-6A787958C628@gmx.de>
Subject: Re: [Ecn-sane] [tsvwg] per-flow scheduling

It's the limiting case, but also the optimal state given "perfect knowledge".

Yes, it requires that the source-destination pairs sharing the link in question coordinate their packet admission times so they don't "collide" at the link. Ideally the next packet would arrive during the previous packet's transmission, so it is ready to go when that packet's transmission ends.

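(A toy illustration of this, mine and not from anyone's router: a few lines of Python simulating a single link. With arrivals paced exactly at the service rate the backlog never exceeds one packet; add jitter and packets transiently queue, which is exactly Sebastian's point quoted below.)

    # Toy single-link simulation (illustrative sketch, not a real router model).
    import random

    def max_backlog(arrivals, service=1.0):
        # Track departure times of packets still in the system; the backlog
        # at each arrival is how many earlier packets have not yet left.
        in_system = []
        last_departure = 0.0
        worst = 0
        for t in sorted(arrivals):
            in_system = [d for d in in_system if d > t]  # drop packets already sent
            start = max(t, last_departure)               # wait if link is busy
            last_departure = start + service
            in_system.append(last_departure)
            worst = max(worst, len(in_system))
        return worst

    random.seed(1)
    paced = [i * 1.0 for i in range(100)]  # each packet arrives exactly as the previous one ends
    jittered = [i * 1.0 + random.uniform(-0.4, 0.4) for i in range(100)]
    print(max_backlog(paced))     # 1 -- perfect coordination, no queue ever forms
    print(max_backlog(jittered))  # >= 2 -- imperfect timing transiently queues packets
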
Such exquisite coordination is feasible only when the future behavior of source and destination at the interface is known, which requires an oracle.

That's the same kind of condition most information-theoretic and queueing-theoretic optimality results require.

But this is worth keeping in mind as the overall joint goal of all users.

In particular, "link utilization" isn't a user goal at all. The link is there and is being paid for whether it is used or not (looking from the network structure as a whole). Its capacity exists to move packets out of the way. An ideal link satisfies the requirement that it never creates a queue because of anything other than imperfect coordination of the end-to-end flows mapped onto it. That's why a router should not be measured by "link utilization" any more than a tunnel in a city during commuting hours should be measured by cars moved per hour. Clearly a tunnel can be VERY congested and still moving many cars if they are attached to each other bumper to bumper - the latency through the tunnel would then be huge. If the cars were tipped on their ends and stacked, even more throughput would be achieved through the tunnel, and the rotating and packing would add even more delay.

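(A back-of-envelope queueing calculation makes the same point numerically - this is my illustration using the textbook M/M/1 formula, and the link rate is an assumed figure: pushing utilization toward 100% buys almost no extra throughput while the expected delay diverges.)

    # Expected time in system for an M/M/1 queue: W = 1 / (mu - lambda).
    # As utilization rho = lambda/mu approaches 1, delay diverges while
    # throughput gains become negligible -- the tunnel analogy in numbers.
    MU = 1000.0  # link service rate, packets per second (assumed figure)

    for rho in (0.5, 0.9, 0.99, 0.999):
        lam = rho * MU
        w_ms = 1000.0 / (MU - lam)  # mean sojourn time in milliseconds
        print(f"utilization {rho:6.1%}: throughput {lam:7.1f} pkt/s, mean delay {w_ms:8.2f} ms")
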
The idea that "link utilization" o= f 100% must be achieved is why we got bufferbloat designed into routers. It= 's a worm's eye perspective. To this day, Arista Networks brags about how i= ts bufferbloated feature design optimizes switch utilization (https://packetpushers.net/= aristas-big-buffer-b-s/). And it selects benchmarks to "prove" it. Andy= Bechtolsheim apparently is such a big name that he can sell defective gear= at a premium price, letting the datacenters who buy it discover that those= switches get "clogged up" by TCP traffic when they are the "bottleneck lin= k". Fortunately, they are fast, so they are less frequently the bottleneck = in datacenter daily use.

=0A

 

=0A

In trying to understand what is going on with congestion signalling, any buffering at the entry to the link should be due only to imperfect information being fed back to the endpoints generating traffic, because a misbehaving endpoint creates denial of service for all other users.

Priority mechanisms focused on protecting high-paying users from low-paying ones don't help much - they only help in overloaded states of the network. Which isn't to say that priority does nothing - it's just that stable assignment of a sharing level to priority levels isn't easy. (See Paris Metro Pricing, where there are only two classes, and even deciding how to manage access to the "first class" section is a problem - the idea that 15 classes with different metrics can be handled simply and interoperably between differently managed autonomous systems seems an incredibly impractical goal.)

Even in the priority case, buffering is NOT something end users want.

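(To make the two-class point concrete, here is a minimal sketch - mine, with made-up names - of a Paris-Metro-style strict-priority link. Note what it does not do: it never removes queueing, it only reorders who waits.)

    # Minimal two-class strict-priority scheduler (illustrative sketch).
    # "first class" is served whenever it has a packet; "economy" otherwise.
    # Priority only reorders the waiting -- the total backlog is unchanged.
    from collections import deque

    class ParisMetroLink:
        def __init__(self):
            self.first = deque()
            self.economy = deque()

        def enqueue(self, packet, premium=False):
            (self.first if premium else self.economy).append(packet)

        def dequeue(self):
            if self.first:
                return self.first.popleft()
            if self.economy:
                return self.economy.popleft()
            return None  # link idle: the ideal state

    link = ParisMetroLink()
    link.enqueue("bulk-1"); link.enqueue("voice-1", premium=True); link.enqueue("bulk-2")
    print([link.dequeue() for _ in range(3)])  # ['voice-1', 'bulk-1', 'bulk-2']
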
My personal view is that the manager of a network needs to configure the network so that no link ever gets overloaded, if possible. The response to overload should be to tell the relevant flows to all slow down (not just one, because if there are 100 flows that start up at roughly the same time, causing multiplicative decrease on just one does very little). This is an example of something where per-flow state in the router actually makes the router helpful in the larger scheme of things. Maybe all flows should be equally informed, as flows, which means the router needs to know how to signal multiple flows, while not just hammering all the packets of a single flow. This case is very real, though not as frequent on the client side as on the "server side", in "load balancers" and the like.

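(What "equally informing all flows" might look like, as a sketch only - the CE mark is the standard ECN congestion signal, but this one-mark-per-flow policy is my hypothetical, not a deployed algorithm:)

    # Sketch: spread congestion signals across flows rather than hammering one.
    # A packet here is (flow_id, seq, ce); ce models the ECN CE codepoint.
    # Hypothetical policy: on overload, mark at most one packet per flow.
    def mark_fairly(queue):
        marked_flows = set()
        signalled = []
        for flow_id, seq, ce in queue:
            if flow_id not in marked_flows:
                ce = True  # signal this flow exactly once
                marked_flows.add(flow_id)
            signalled.append((flow_id, seq, ce))
        return signalled

    backlog = [("A", 1, False), ("A", 2, False), ("B", 1, False), ("C", 1, False)]
    for pkt in mark_fairly(backlog):
        print(pkt)  # every flow gets one CE mark; flow A is not hit twice
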
My point here is simple:

1) The endpoints already tell the routers what flows are going through a link - that's just the address information. So that information can be used for fairness pretty well, especially if short-term memory (a Bloom filter, perhaps; a sketch follows after this list) can track a sufficiently large number of flows.

2) The per-flow decisions related to congestion control within a flow are necessarily end-to-end in nature - the router can only tell the ends what is going on, but the ends (together - their admission rates and consumption rates are coupled to the use being made) must be informed and decide. The congestion management must combine information about future source and destination behavior (even if that is just taking recent history and projecting it as an estimate of future behavior at source and destination). Which is why it is quite natural to have routers signal the destination, which then signals the source, which changes its behavior.

3) There are definitely other ways to improve latency for IP and the protocols built on top of it - routing some flows over different paths under congestion is one; call it per-flow routing. Another is scattering a flow over several paths (but that seems problematic for today's TCP, which assumes all packets take the same path).

4) A different, but closely coupled, view of IP is that any application-relevant buffering should be driven into the endpoints. At the source, buffering is useful to deal with variability in the rate of production of data to be sent. At the destination, buffering is useful to minimize jitter, matching the consumption behavior of the application (a destination-side sketch also follows below). But these buffers should not be pushed into the network, where they cause congestion for other flows sharing resources.

So buffering in the network should ONLY deal with the uncertainty in resource competition.

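(Here is the sketch promised in point 1: short-term flow memory as a standard Bloom filter over the flow identifiers the endpoints already put in every header. The bitmap size and hash count are arbitrary assumptions for illustration.)

    # Short-term flow memory: a Bloom filter over flow identifiers (the
    # 5-tuple already exposed in every packet header). False positives are
    # possible; false negatives are not. Sizes here are arbitrary.
    import hashlib

    class FlowFilter:
        def __init__(self, bits=1 << 16, hashes=4):
            self.bits = bits
            self.hashes = hashes
            self.bitmap = bytearray(bits // 8)

        def _positions(self, flow_id):
            for salt in range(self.hashes):
                digest = hashlib.sha256(f"{salt}:{flow_id}".encode()).digest()
                yield int.from_bytes(digest[:8], "big") % self.bits

        def add(self, flow_id):
            for pos in self._positions(flow_id):
                self.bitmap[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, flow_id):
            return all(self.bitmap[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(flow_id))

    seen = FlowFilter()
    seen.add(("10.0.0.1", 443, "10.0.0.2", 51234, "tcp"))
    print(("10.0.0.1", 443, "10.0.0.2", 51234, "tcp") in seen)  # True
    print(("10.0.0.9", 80, "10.0.0.2", 40000, "tcp") in seen)   # False (almost surely)
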
This tripartite breakdown of buffering is protocol-independent. It applies to TCP, NTP, RTP, QUIC/UDP, ... It's what we (that is, me) had in mind when we split UDP out of TCP, allowing UDP-based protocols to manage source and destination buffering in the application for all the things we thought UDP would be used for - packet speech, computer-to-computer remote procedure calls (what would be QUIC today), SATNET/interplanetary Internet connections, and so on.

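(And the destination-side sketch promised in point 4: a minimal playout, or jitter, buffer of the kind packet speech needs, kept in the endpoint rather than the network. The 120 ms playout delay is an arbitrary assumption; a real application would tune it.)

    # Destination playout buffer (point 4): absorb network jitter at the
    # endpoint by delaying playout a fixed amount, then releasing packets
    # in timestamp order. PLAYOUT_DELAY is an assumed constant.
    import heapq

    PLAYOUT_DELAY = 0.120  # seconds of jitter absorbed; tune per application

    class JitterBuffer:
        def __init__(self):
            self.heap = []    # (media timestamp, payload)
            self.base = None  # maps media time to local playout time

        def arrive(self, media_ts, payload, now):
            if self.base is None:
                self.base = now + PLAYOUT_DELAY - media_ts
            heapq.heappush(self.heap, (media_ts, payload))

        def playable(self, now):
            out = []
            while self.heap and self.heap[0][0] + self.base <= now:
                out.append(heapq.heappop(self.heap)[1])
            return out

    jb = JitterBuffer()
    jb.arrive(0.00, "frame0", now=10.000)
    jb.arrive(0.04, "frame2", now=10.095)  # arrives out of order
    jb.arrive(0.02, "frame1", now=10.090)
    print(jb.playable(now=10.200))  # ['frame0', 'frame1', 'frame2'], in order
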
Sadly, in the many years since the late 1970s, the tendency to think that file transfers between infinite-speed storage devices over TCP are the only relevant use of the Internet has penetrated the router design community. I can't seem to get anyone to recognize how far we are from that. No one runs benchmarks for such behavior; no one even measures anything other than the "hot rod" maximum-throughput cases.

And many egos seem to think that working on the hot rod cases is going to make their career or sell product (e.g., the sad case of Arista).

On Wednesday, June 26, 2019 8:48am, "Sebastian Moeller" <moeller0@gmx.de> said:


>
> > On Jun 23, 2019, at 00:09, David P. Reed <dpreed@deepplum.com> wrote:
> >
> > [...]
> >
> > per-flow scheduling is appropriate on a shared link. However, the end-to-end
> > argument would suggest that the network not try to divine which flows get
> > preferred.
> > And beyond the end-to-end argument, there's a practical problem - since the
> > ideal state of a shared link means that it ought to have no local backlog in
> > the queue, the information needed to schedule "fairly" isn't in the queue
> > backlog itself. If there is only one packet, what's to schedule?
> >
> [...]
>
> Excuse my stupidity, but the "only one single packet" case is the theoretical
> limiting case, no?
> Because even on a link not running at capacity this effectively requires a
> mechanism to "synchronize" all senders (whose packets traverse the hop we are
> looking at), as no other packet is allowed to reach the hop unless the "current"
> one has been passed to the PHY; otherwise we transiently queue 2 packets (I note
> that this rationale should hold for any small N). The more packets per second a
> hop handles, the less likely a newcomer is to avoid running into already
> existing packet(s), that is, to avoid transiently growing the queue.
> Not having a CS background, I fail to see how this required synchronized state
> can exist outside of a few steady-state configurations where things change
> slowly enough that the seemingly required synchronization can actually happen
> (given that the feedback loop, e.g. through ACKs, seems somewhat jittery). Since
> packets never know which path they take and which hop is going to be critical,
> there seems to be no a priori way to synchronize all senders; heck, I fail to
> see whether it would be possible at all to guarantee synchronized behavior on
> more than one hop (unless all hops are extremely uniform).
> I happen to believe that L4S suffers from the same conceptual issue (plus
> overly generic promises; from the RITE website: "We are so used to the
> unpredictability of queuing delay, we don't know how good the Internet would
> feel without it. The RITE project has developed simple technology to make
> queuing delay a thing of the past—not just for a select few apps, but for
> all." This seems to be missing a "conditions apply" statement.)
>
> Best Regards
> Sebastian
