From: "David P. Reed" <dpreed@deepplum.com>
To: "Dave Taht"
Cc: ecn-sane@lists.bufferbloat.net, "Bob Briscoe", tsvwg IETF list
Date: Thu, 18 Jul 2019 11:02:12 -0400 (EDT)
Subject: Re: [Ecn-sane] per-flow scheduling

Dave -

The context of my remarks was the end-to-end arguments for placing function in the Internet.

To that end, the fact that "you do not mind putting storage for low priority packets in the routers" doesn't matter, for two important reasons:

1) The idea that one should "throw in a feature" because people "don't mind" is exactly what leads to feature creep of the worst kind - features that serve absolutely no real purpose. That's what we rigorously objected to in the late 1970s. No, we would NOT throw in features merely because they were "requested" and we didn't mind.

2) You have made no argument that the function cannot be done properly at the ends, and no argument that putting it in the network is necessary for the ends to achieve storage.

On Wednesday, July 17, 2019 7:23pm, "Dave Taht" said:

> On Wed, Jul 17, 2019 at 3:34 PM David P. Reed wrote:
>>
>> A follow-up point that I think needs to be made is one more end-to-end argument:
>>
>> It is NOT the job of the IP transport layer to provide free storage for low-priority packets.
>> The end-to-end argument here says: the ends can and must hold packets until they are either delivered or no longer relevant (in RTP, they become irrelevant when they get older than their desired delivery time, if you want an example of the latter). SO, the network should not provide the function of storage beyond the minimum needed to deal with transients.
>>
>> That means, unfortunately, that the dream of some kind of "background" path that stores "low priority" packets in the network fails the end-to-end argument test.
>
> I do not mind reserving a tiny portion of the network for "background"
> traffic. This is different (I think?) from storing low-priority packets in the
> network. A background traffic "queue" of 1 packet would be fine....
>
>> If you think about this, it even applies to some imaginary interplanetary IP-layer network. Queueing delay is not a feature of any end-to-end requirement.
>>
>> What may be desired at the router/link level in an interplanetary IP layer is holding packets because a link is actually down, or using link-level error-correction coding or retransmission to bring the error rate down to an acceptable level before declaring it down. But that's quite different - it's the link-level protocol, which aims to deliver minimum queueing delay under tough conditions, without buffering more than needed for that (the number of bits that fit in the light-speed transmission at the transmission rate).
>
> As I outlined in my MIT wifi talk, one layer of retry at the wifi mac layer
> made it work, in 1998, and that seemed a very acceptable compromise at the
> time.
> Present-day retries at that layer, not congestion controlled, are totally out of hand.
>
> In thinking about starlink's mac, and mobility, I gradually came to the
> conclusion that one retry from satellites 550km up (3.6ms rtt) was needed,
> as much as I disliked the idea.
>
> I still dislike retries at layer 2, even for nearby sats. It really
> complicates things. So for all I know I'll be advocating ripping 'em
> out in starlink, if they are indeed in there, next week.
>
>> So, the main reason I'm saying this is because, again, there are those who want to implement the TCP function of reliable delivery of each packet in the links. That's a very bad idea.
>
> It was tried in the arpanet, and didn't work well there. There's a
> good story about many of the flaws of the Arpanet's design, including that
> problem, in the latter half of Kleinrock's second book on queueing theory,
> at least the first edition...
>
> Wifi (and 3G/4G/5G) re-introduced the same problem with retransmits and
> block acks at layer 2.
>
> And after dissecting my ecn battlemesh data and observing what the
> retries at the mac layer STILL do on wifi with the current default
> wifi codel target (20ms AFTER two txops are in the hardware) currently
> achieve (50ms, which is 10x worse than what we could do, and still
> better performance under load than any other shipping physical layer
> we have with fifos)...
> and after thinking hard about Nagle's thought
> that "every application has a right to one packet in the network", and
> this very long thread reworking the end-to-end argument in a similar,
> but not quite identical, direction, I'm coming to a couple of conclusions
> I'd possibly not quite expressed well before.
>
> 1) Transports should treat an RFC 3168 CE coupled with loss (drop and
> mark) as an even stronger signal of congestion than either alone, and
> this bit of the codel algorithm, when ecn is in use, is wrong, and has
> always been wrong:
>
> https://github.com/dtaht/fq_codel_fast/blob/master/codel_impl.h#L178
>
> (We added this arbitrarily to codel on the 5th day of development in
> 2012. Using FQ masked its effects on light traffic.)
>
> What it should do instead is peek the queue and drop until it hits a
> markable packet, at the very least.
>
> Pie has an arbitrary drop-at-10% figure, which does lighten the load
> some... cake used to have drop-and-mark also, until a year or two
> back...
>
> 2) At low rates and high contention, we really need pacing and fractional cwnd.
>
> (While I would very much like to see a dynamic reduction of MSS tried,
> that too has a bottom limit.)
>
> Even then, drop as per bullet 1.
>
> 3) In the end, I could see a world with SCE marks, and CE being
> obsoleted in favor of drop, or CE only being exerted on really light
> loads similar to (or less than!) the arbitrary 10% figure pie uses.
>
> 4) In all cases, I vastly prefer somehow ultimately shifting greedy
> transports to RTT rather than drop or CE as their primary congestion-control
> indicator. FQ makes that feasible today.
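A minimal sketch of the "peek the queue and drop until it hits a markable packet" behaviour proposed in point 1 above, in Python rather than the actual fq_codel_fast C code; the `Packet` type, its `ect`/`ce` fields, and the deque-based queue are hypothetical stand-ins for illustration only:

```python
from collections import deque

class Packet:
    """Hypothetical packet record; 'ect' means the flow negotiated RFC 3168 ECN."""
    def __init__(self, data, ect=False):
        self.data = data
        self.ect = ect     # eligible for CE marking
        self.ce = False    # CE mark applied by the AQM

def drop_until_markable(queue):
    """On a codel congestion event, drop packets from the head of the
    queue until a markable (ECT) packet is found, then CE-mark and
    deliver that packet.  Returns None if nothing was markable."""
    while queue:
        pkt = queue.popleft()
        if pkt.ect:
            pkt.ce = True   # mark instead of dropping
            return pkt
        # non-ECT packet: dropped; the loss itself is the congestion signal
    return None
```

For example, with a queue holding two non-ECT packets ahead of an ECT one, the two head packets are dropped and the ECT packet comes out CE-marked, so the sender sees both loss and a mark, the stronger combined signal described in point 1.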
> With enough FQ deployed for enough congestive scenarios and hardware, and RTT
> becoming the core indicator for more transports, single-queued designs
> become possible in the distant future.
>
>
>> On Wednesday, July 17, 2019 6:18pm, "David P. Reed" said:
>>
>> > I do want to toss in my personal observations about the "end-to-end argument" related to per-flow scheduling. (Such arguments are, of course, a class of arguments to which my name is attached. Not that I am a judge/jury of such questions...)
>> >
>> > A core principle of the Internet design is to move function out of the network, including routers and middleboxes, if those functions
>> >
>> > a) can be properly accomplished by the endpoints, and
>> > b) are not relevant to all uses of the Internet transport fabric being used by the ends.
>> >
>> > The rationale here has always seemed obvious to me. Like Bob Briscoe suggests, we were very wary of throwing features into the network that would preclude unanticipated future interoperability needs, new applications, and new technology in the infrastructure of the Internet as a whole.
>> >
>> > So what are we talking about here (ignoring the fine points of SCE, some of which I think are debatable - especially the focus on TCP alone, since much traffic will likely move away from TCP in the near future)?
>> >
>> > A second technical requirement (necessary invariant) of the Internet's transport is that the entire Internet depends on rigorously stopping queueing delay from building up anywhere except at the endpoints, where the ends can manage it. This is absolutely critical, though it is peculiar in that many engineers, especially those who work at the IP layer and below, have a mental model of routing as essentially being about
>> > building up queueing delay (in order to manage priority in some trivial way by building up the queue on purpose, apparently).
>> >
>> > This second technical requirement cannot be satisfied merely by the endpoints. The reason is that the endpoints cannot know accurately which host-to-host paths share common queues.
>> >
>> > This lack of a way to "cooperate" among independent users of a queue cannot be solved by a purely end-to-end solution. (Well, I suppose some genius might invent a way, but I have not seen one in my 36 years closely watching the Internet in operation since it went live in 1983.)
>> >
>> > So, what the end-to-end argument would tend to do here, in my opinion, is to provide the most minimal mechanism in the devices that are capable of building up a queue, in order to allow all the ends sharing that queue to do their job - which is to stop filling up the queue!
>> >
>> > Only the endpoints can prevent filling up queues. And depending on the protocol, they may need to make very different, yet compatible, choices.
>> >
>> > This is a question of design at the architectural level. And the future matters.
>> >
>> > So there is an end-to-end argument to be made here, but it is a subtle one.
>> >
>> > The basic mechanism for controlling queue depth has been, and remains, quite simple: dropping packets. This has two impacts: 1) immediately reducing queueing delay, and 2) signalling to endpoints that are paying attention that they have contributed to an overfull queue.
>> >
>> > The optimum queueing delay in a steady state would always be one packet or less. Kleinrock has shown this in the last few years. Of course there aren't steady states.
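The drop mechanism Reed describes has an endpoint half: a sender that is "paying attention" reacts to the loss signal by shrinking its window. The textbook additive-increase/multiplicative-decrease response can be sketched as follows (illustrative only, not any specific TCP stack; the function name and constants are hypothetical):

```python
def aimd_update(cwnd, loss_detected, add=1.0, mult=0.5):
    """One round's worth of a textbook AIMD reaction to the drop
    signal: on loss, back off multiplicatively (never below one
    segment); otherwise, probe gently with additive increase."""
    if loss_detected:
        return max(1.0, cwnd * mult)   # multiplicative decrease
    return cwnd + add                  # additive increase
```

The point of the sketch is the division of labor: the router's only job is to drop (or mark) when the queue builds, and this per-endpoint rule is what actually drains the queue back toward the one-packet steady state.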
>> > But we don't want a mechanism that can't converge to that steady state *quickly*, for all queues in the network.
>> >
>> > Another issue is that endpoints are not aware of the fact that packets can take multiple paths to any destination. In the future, alternate path choices can be made by routers (when we get smarter routing algorithms based on traffic engineering).
>> >
>> > So again, some minimal kind of information must be exposed to endpoints that will continue to communicate. Again, the routers must be able to help a wide variety of endpoints with different use cases to decide how to move queue buildup out of the network itself.
>> >
>> > Now the decision made by the endpoints must be made in the context of information about fairness. Maybe this is what is not obvious.
>> >
>> > The most obvious notion of fairness is equal shares among (source host, destination host) pairs. There are drawbacks to that, but the benefit is that it affects the IP layer alone, and deals with lots of boundary cases, like the case where a single host opens a zillion TCP connections, or uses lots of UDP source ports or destinations, to somehow "cheat" by appearing to have "lots of flows".
>> >
>> > Another way to deal with dividing up flows is to ignore higher-level protocol information entirely, and put the flow identification in the IP layer.
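The host-pair fairness notion above amounts to classifying packets by (source host, destination host) only, never by port. A minimal sketch of such a classifier, with a hypothetical function name and queue count:

```python
import hashlib

def host_pair_queue(src_ip, dst_ip, n_queues=1024):
    """Pick a queue keyed only on the (source host, destination host)
    pair.  Ports are deliberately not part of the key, so a host that
    opens a zillion TCP connections (or sprays UDP source ports) toward
    one destination still shares a single queue and gains nothing."""
    key = ("%s|%s" % (src_ip, dst_ip)).encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % n_queues
```

Because the key ignores transport headers, this can live entirely at the IP layer, which is exactly the benefit Reed cites for this fairness definition.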
>> > A 32-bit or 64-bit random number could be added as an "option" to IP to somehow extend the flow space.
>> >
>> > But that is not the most important thing today.
>> >
>> > I write this to say:
>> > 1) Some kind of per-flow queueing, during the transient state where a queue is overloaded before packets are dropped, would provide much-needed information to the ends of every flow sharing a common queue.
>> > 2) Per-flow queueing, minimized to a very low level, using IP envelope address information (plus maybe UDP and TCP addresses, for those protocols, in an extended address-based flow definition) is totally compatible with end-to-end arguments, but ONLY if the decisions made are certain to drive queueing delay out of the router to the endpoints.
>> >
>> >
>> >
>> > On Wednesday, July 17, 2019 5:33pm, "Sebastian Moeller" said:
>> >
>> >> Dear Bob, dear IETF team,
>> >>
>> >>
>> >>> On Jun 19, 2019, at 16:12, Bob Briscoe wrote:
>> >>>
>> >>> Jake, all,
>> >>>
>> >>> You may not be aware of my long history of concern about how per-flow scheduling within endpoints and networks will limit the Internet in future. I find per-flow scheduling a violation of the e2e principle in such a profound way - the dynamic choice of the spacing between packets - that most people don't even associate it with the e2e principle.
>> >>
>> >> This does not rhyme well with the L4S stated advantage of allowing packet reordering (due to mandating RACK for all L4S TCP endpoints). Because surely changing the order of packets messes up "the dynamic choice of the spacing between packets" in a significant way.
>> >> IMHO, either L4S is great because it will give intermediate hops more leeway to re-order packets, or "a sender's packet spacing" is sacred; please make up your mind which it is.
>> >>
>> >>>
>> >>> I detected that you were talking about FQ in a way that might have assumed my concern with it was just about implementation complexity. If you (or anyone watching) is not aware of the architectural concerns with per-flow scheduling, I can enumerate them.
>> >>
>> >> Please do not hesitate to do so after your deserved holiday, and please state a superior alternative.
>> >>
>> >> Best Regards
>> >> Sebastian
>> >>
>> >>
>> >>>
>> >>> I originally started working on what became L4S to prove that it was possible to separate out reducing queuing delay from throughput scheduling. When Koen and I started working together on this, we discovered we had identical concerns on this.
>> >>>
>> >>>
>> >>> Bob
>> >>>
>> >>> --
>> >>> ________________________________________________________________
>> >>> Bob Briscoe                               http://bobbriscoe.net/
>> >>>
>> >>> _______________________________________________
>> >>> Ecn-sane mailing list
>> >>> Ecn-sane@lists.bufferbloat.net
>> >>> https://lists.bufferbloat.net/listinfo/ecn-sane
>
>
> --
>
> Dave Täht
> CTO, TekLibre, LLC
> http://www.teklibre.com
> Tel: 1-831-205-9740