From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <moeller0@gmx.de>
Received: from mout.gmx.net (mout.gmx.net [212.227.17.20])
 (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
 (No client certificate requested)
 by lists.bufferbloat.net (Postfix) with ESMTPS id 81FAD3B2A4
 for <bloat@lists.bufferbloat.net>; Sun, 17 Mar 2019 15:57:38 -0400 (EDT)
Received: from [192.168.42.220] ([77.182.103.198]) by mail.gmx.com (mrgmx103
 [212.227.17.168]) with ESMTPSA (Nemesis) id 0MRGvT-1hSTJD3KsZ-00UWqR; Sun, 17
 Mar 2019 20:57:24 +0100
Content-Type: text/plain;
	charset=utf-8
Mime-Version: 1.0 (Mac OS X Mail 11.5 \(3445.9.1\))
From: Sebastian Moeller <moeller0@gmx.de>
In-Reply-To: <D5774843-A41C-47E9-AF30-809E6C58F939@tzi.org>
Date: Sun, 17 Mar 2019 20:57:23 +0100
Cc: Greg White <g.white@CableLabs.com>,
 Ingemar Johansson S <ingemar.s.johansson@ericsson.com>,
 "bloat@lists.bufferbloat.net" <bloat@lists.bufferbloat.net>
Content-Transfer-Encoding: quoted-printable
Message-Id: <1248F1C4-35E1-4AA6-9553-560C89D19276@gmx.de>
References: <HE1PR07MB442526730269DA318B2ED38BC24B0@HE1PR07MB4425.eurprd07.prod.outlook.com>
 <E3154B34-123E-4A64-B15A-F5F8CF5C55B4@gmx.de>
 <BF9A0862-8C25-43CC-B1C2-0D7B5BE4053B@cablelabs.com>
 <C9000B72-0F6C-4E5A-837A-A864FF773D88@gmx.de>
 <94B04C6B-5997-4971-9698-57BEA3AE5C0E@tzi.org>
 <166A7220-875F-4FA0-A8EE-17F11037EC76@gmx.de>
 <4FA6FA39-7092-4E98-B12E-5236C8EACCE2@tzi.org>
 <C18E0000-CC99-4056-BBC6-9AF9FC15EED8@gmx.de>
 <D5774843-A41C-47E9-AF30-809E6C58F939@tzi.org>
To: Carsten Bormann <cabo@tzi.org>
X-Mailer: Apple Mail (2.3445.9.1)
X-Spam-Flag: NO
Subject: Re: [Bloat] Packet reordering and RACK (was The "Some Congestion
 Experienced" ECN codepoint)
X-BeenThere: bloat@lists.bufferbloat.net
X-Mailman-Version: 2.1.20
Precedence: list
List-Id: General list for discussing Bufferbloat <bloat.lists.bufferbloat.net>
List-Unsubscribe: <https://lists.bufferbloat.net/options/bloat>,
 <mailto:bloat-request@lists.bufferbloat.net?subject=unsubscribe>
List-Archive: <https://lists.bufferbloat.net/pipermail/bloat>
List-Post: <mailto:bloat@lists.bufferbloat.net>
List-Help: <mailto:bloat-request@lists.bufferbloat.net?subject=help>
List-Subscribe: <https://lists.bufferbloat.net/listinfo/bloat>,
 <mailto:bloat-request@lists.bufferbloat.net?subject=subscribe>
X-List-Received-Date: Sun, 17 Mar 2019 19:57:38 -0000

Dear Carsten,

Please excuse my tortured logic; being from outside the field, it seems I
routinely misuse the nomenclature and sow confusion.

> On Mar 17, 2019, at 18:09, Carsten Bormann <cabo@tzi.org> wrote:
>=20
>>>=20
>>>>> The end-to-end argument applies:  Ultimately, there needs to be =
resequencing at the end anyway, so any reordering in the network would =
be a performance optimization.  It turns out that keeping packets lying =
around in some buffer somewhere in the network just to do resequencing =
before they exit an L2 domain (or a tunnel) is a pessimization, not an =
optimization.
>>>>=20
>>>> 	I do not buy the end to end argument here, because in the =
extreme why do ARQ on individual links anyway, we can just leave it to =
the end-points to do the ARQ and TCP does anyway.
>>>=20
>>> The optimization is that the retransmission on a single link (or =
within a path segment, which is what I=E2=80=99m interested in) does not =
need to span the entire end-to-end path.  That is strictly better than =
an end-to-end retransmission. =20
>>=20
>> 	I agree, and by the same logic local resequencing is also =
better,
>=20
> Non sequitur.  The same logic simply does not apply.  A resequenced =
packet consumes the same transmission resources.  (It also consumes more =
buffer resources.  So it is strictly worse when just looking at network =
resources expended, which is the basis for the kind of logic applied =
here.)

In my tortured example, the latency cost of resequencing at the fast,
re-ordering link was much smaller than the latency cost of resequencing
after traversing the bottleneck link, which is the typical situation for
end users. My focus is on the latency visible to the end user; I agree
that for the intermediate hop it will be simpler to just forward those
packets that traversed the link intact.

>=20
>> unless the re-ordering event happened at the bottleneck link.
>=20
> Not sure how this comes in now.

	This comes from my vantage point at the edge. I really am sympathetic
to what the core network needs to do to maintain the illusion of temporal
ordering, but I care more about end-to-end latency than about allowing
core routers to get away with less buffer memory.

>=20
>>> Also, a local segment may allow faster recovery by not implicating =
the entire e2e latency, which allows for strictly better latency.
>>> So, yes, there are significant optimizations in doing local =
retransmissions, but there are also interesting interactions with =
end-to-end retransmission that need to be taken care of.  This has been =
known for a long time, e.g., see =
https://tools.ietf.org/html/rfc3819#section-8 which documents things =
that were considered to be well known in the early 2000s.
>>=20
>> 	Thanks, but my understanding of this is basically that a link
>> should just drop a packet unless it can be retransmitted with
>> reasonable effort (like the G.INP retransmissions on DSL links will
>> give up); sure we can argue about what "reasonable effort" is in
>> reality, but I fear if we move away from 3 dupACKs to say X ms all
>> transport links will assume they have leeway to allow re-ordering
>> close to X, that will certainly be worse than today. And since I am an
>> end-user and do not operate a transport network, I know what I prefer
>> here…
>=20
> I=E2=80=99m sorry, I grew up as transport layer guy, so =
=E2=80=9Ctransport=E2=80=9D means L4 (transport layer) for me, not =
=E2=80=9Ctransport network=E2=80=9D.
> You may want to re-read my sentences with that knowledge; they might =
make more sense.

	Sorry for my misuse of the nomenclature, will try to stick to =
transport network.


>=20
>>> Resequencing (which is the term I prefer for putting things back in =
sequence again, after they have been reordered) requires storing packets =
that are ahead of later packets.
>>=20
>> 	Obviously.
>>=20
>>> This is strictly suboptimal if these packets could be delivered =
instead (in contrast, it *is* a good idea to resequence packets that are =
in a queue waiting for a transmission opportunity).
>>=20
>> 	Fair enough, but that basically expects the bottleneck link that =
actually accumulates a queue to do the heavy lifting, not sure that the =
economic incentives are properly aligned here.
>=20
> It can actually do so more easily, because the speeds are lower.

	Tell that to the person paying for the CMTS/BNG. The issue is that
queueing here seems to happen not directly at the edge but at centralized
places that need to queue traffic for tens of thousands of end-users.
This still might be easier than in the core.

> But deployment economy arguments are interesting as well; I was making =
theoretical arguments first.
>=20
>>> So *requiring*(*) local path segments to resequence is strictly =
suboptimal.
>>>=20
>>> (*) even if this is not a strict requirement, but just a statement =
of the form =E2=80=9Cthe transport will be much more efficient if you =
deliver in order=E2=80=9D.
>>=20
>> 	My point is the transport will be much more useful if it undertakes
>> (reasonable) effort to deliver in-order,
>=20
> Please re-read as advised above.
>=20
>> that is slightly different, and I understand that those responsible
>> for transport networks have a different viewpoint on this.
>>=20
>>>=20
>>>> To put numbers to my example, assume I am on a 1/1 Mbps link and I =
get TCP data at 1 Mbps rate and MTU1500 packets (I am going to keep the =
numbers approximate) and I get a burst of say 10 packets containing say =
10 individual messages for my application telling the position of say an =
object in 3d space
>>>>=20
>>>> each packet is going to "hog" the link for: 1000 ms/s * (1500 * 8 =
b/packet ) / (1000 * 1000 b/s)  =3D 12 ms
>>>> So I get access to messages/new positions every 12 ms and I can =
display this smoothly
>>>=20
>>> That is already broken by design.
>>=20
>> 	Does not matter much, a well designed network should also allow =
to do stupid things=E2=80=A6
>=20
> Sure, but it won=E2=80=99t work very well then (and there is no point =
in optimizing for that =E2=80=94 remember: all in-network work is just =
an optimization under the end-to-end principle).
>=20
>>> If you are not accounting for latency variation (=E2=80=9Cjitter=E2=80=
=9D), you won=E2=80=99t be able to deal with it.
>>=20
>> 	Which would just complicate the issue a bit if we would =
introduce a say 25 ms de-jitter buffer without affecting the gist of it.
>=20
> That buffer increases the total latency but also the (useful) packet =
delivery rate in the presence of reordering.

	Yes, but IMHO it does not change the problem in a qualitative =
way.

>=20
>>> Your example also makes sure it does not work well by being based on =
100 % utilization.
>>=20
>> 	Same here, access links certainly run closer to 100% utilization =
than core links, so operation at full saturation is not completely =
unrealistic, but I really just set it up that way for clarity.
>=20
> Please use an example that is more realistic.

	Why? Unless your argument is that an additional 100 ms of latency
does not matter, I fail to see how the "realness" of my example is
relevant. TCP will only pass data to the application after (internal)
resequencing, and if we allow extreme out-of-order delivery, that will
show up as additional extreme latency for the TCP-using application. Now
one can claim that TCP might be the wrong "transport" (I assume that is
the correct use of the term), but that effectively demotes TCP to a bulk
transport, while clearly it does a reasonable job even for mild real-time
requirements (I note that even "once a day" is a real-time requirement,
one that should be easily achievable with TCP).

>=20
>>>> Now if the first packet gets re-ordered to be last, I either drop
>>>> that packet
>>>=20
>>> =E2=80=A6which is another nice function the network could do for you =
before expending further resources on useless delivery; see e.g. =
draft-ietf-6lo-deadline-time for one way to do this.
>>=20
>> 	Yes, but typically I do not want the network to do this, as I =
would be quite interested in knowing how much too late the packet =
arrived.
>=20
> I don=E2=80=99t know how to make use of that knowledge, do you?

	Well, in a packet capture, seeing a packet out-of-sequence with a
delay has more value than not seeing a packet at all. If it is missing,
was it culled due to the deadline or was it lost for other reasons, and
do I really care? In the first case (late but visible) it can be used for
diagnosis, in the second it cannot.

> Early discarding of a late packet (e.g., by not retransmitting it in =
the first place) is so much better.

	This seems to be a) pretty cool and b) restricted to a number of
quite special use-cases, no? This might be okay for real-time packets
(where the consumer needs to react with tight timing constraints), but
even just for VoIP (a mild RT application) it might be better to be able
to reconstruct intelligible speech with a massive delay than to get
something garbled beyond recognition by dropped packets in a timely
fashion. But I am sure there are examples where a deadline drop might be
advantageous; it is just that none of my use-cases fall into that
category.
And I naively see a use for that only if the bandwidth/tx-slot advantage
gained from dropping instead of transmitting a packet is larger than the
loss incurred from not getting the information at all. I would guess
real-time control with lots of redundant sensors to be such a case,
assuming we talk about rarely dropping a packet.

>=20
>>>> and accept a 12 ms gap or if that is not an option I get to wait =
9*12 =3D 108ms before positions can be updated, that IMHO shows why =
re-ordering is terrible even if TCP would be more tolerant.=20
>>>=20
>>> You are assuming that the network can magically resequence a packet =
into place that it does not have.
>>=20
>> 	All I expect is that the network makes a reasonable effort to =
undo re-ordering close to where re-ordering happened.
>=20
> All I=E2=80=99m trying to say is that this is bad engineering, =
apparently perpetuated by bad transport layer implementations.

	??? Now I am confused; to me engineering is all about balancing
trade-offs, and this is all about how to evaluate different dimensions.
I believe we agree that a network should not re-order packets
artificially without some justification, and we also agree that some
level of re-ordering might be unavoidable; we basically haggle over how
much is acceptable. I also believe, and correct me if I am wrong, that we
agree that with TCP, endpoints prefer less over more in-transit
re-ordering. So why is it bad for a transport layer to aim for pleasing
its users (in my book the transport network only exists to allow
end-to-end communication)?

>=20
>>> Now I do understand that forwarding an out-of-order packet will =
block the output port for the time needed to serialize it.  So if you =
get it right before what would have been an in-order packet, the latter =
incurs additional latency.  Note that this requires a bottleneck =
configuration, i.e., packets to be forwarded arrive faster than they can =
be serialized out.  Don=E2=80=99t do bottlenecks if you want ultra-low =
latency.  (And don=E2=80=99t do links where you need to retransmit, =
either.)
>>=20
>> 	I agree, but that is live with a home internet access link, the =
bottleneck is there. This also points out a problem with the L4S =
argument for end-users, as the ultra-low latency (their words, not mine) =
will not realize for end-users close to what the project seems to =
promise.
>=20
> I think reordering is not really a problem for ultra-low latency, or =
more specifically, once reordering happens, you are no longer in the =
ultra-low latency domain,

	In this thread it has been mentioned that L4S will allow more
reordering by requiring participating hosts to implement RACK, so from a
transport network's perspective the L4S identifier can be seen as a
license to allow more re-ordering. This runs afoul of L4S's claim of
ultra-low latency, I agree, but I do not have to square that circle.
But in the context of this discussion we have the transport network that
re-orders packets simply because this apparently makes the network more
efficient and L4S allows it, which according to your argument would
break the L4S stated goal of low latency. I cannot believe that this is
the L4S position with regard to re-ordering (reading
https://tools.ietf.org/html/draft-ietf-tsvwg-ecn-l4s-id-06#page-23
A.1.7.  Measuring Reordering Tolerance in Time Units tells me they fail
to actually make the connection between reordering and increased latency
for the affected flow).
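
To see why a packet-count threshold such as 3 dupACKs corresponds to
very different time budgets depending on link speed, here is a small
sketch. It assumes back-to-back full-size (1500-byte) packets at line
rate, so the tolerated re-ordering is roughly three serialization times.
The helper name and the chosen rates are illustrative assumptions, not
taken from the L4S or RACK drafts.

```python
def dupack_tolerance_ms(link_bps, packet_bytes=1500, dupack_threshold=3):
    """Approximate time span (ms) covered by `dupack_threshold`
    back-to-back packets of `packet_bytes` at `link_bps`.

    Hypothetical helper: it only converts a packet-count threshold
    into a time span under the assumptions named above.
    """
    return dupack_threshold * 1000.0 * packet_bytes * 8 / link_bps


if __name__ == "__main__":
    # The same "3 packets" threshold at three assumed link rates.
    for name, bps in [("1 Mbps", 1e6), ("100 Mbps", 1e8), ("1 Gbps", 1e9)]:
        print(f"{name}: {dupack_tolerance_ms(bps):.3f} ms")
```

At 1 Mbps the three-packet threshold spans 36 ms, at 1 Gbps only
0.036 ms, which is the inverse scaling with bandwidth that a fixed
temporal threshold would avoid.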

>=20
>>>> Especially in the context of L4S something like this seems to be =
totally unacceptable if ultra-low latency is supposed to be anything =
more than marketing.=20
>>>=20
>>> Dropping packets that can=E2=80=99t be used anyway is strictly =
better than delivering them.
>>=20
>> 	Well, not for L4S, as TCP Prague is supposed to fall back to
>> legacy congestion control behavior upon encountering packet drops…
>=20
> L4S is for reliable transport, which is a different scenario than the =
one that benefits a lot from deadlines for packets.  (Well, deadlines =
might be used to make sure there is no dual retransmission, both local =
and end-to-end, but again, this is not where you would use L4S.)
>=20
>>> But apart from that, forwarding packets that I have is strictly =
better for low latency than leaving the output port idle and waiting for =
previous-in-order packets to send them out in sequence.
>>=20
>> 	It really depends what we mean when we talk about latency here,
>> as shown for an end-user that might be quite different…
>=20
> Apart from the port blocking effect I talked about (which is mostly =
relevant for highly scheduled transmission schemes), I really have no =
idea how the end-to-end latency would benefit from sitting on packets =
while the port is idle.

	Because, as demonstrated with my toy example above, the decision to
send intact packets immediately might incur a visible delay increase of
108 ms, so from the end-point's perspective transmitting before
re-sequencing can have a noticeable effect.
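
The arithmetic of the toy example can be replayed in a few lines of
Python. This is only a sketch under the example's own assumptions: a
1 Mbps link, 1500-byte packets, a burst of 10, the first packet
re-ordered to arrive last, and a TCP-like receiver that hands data to
the application strictly in sequence. The function names are made up
for illustration and do not come from any real stack.

```python
def serialization_delay_ms(packet_bytes, link_bps):
    """Time in ms that one packet occupies the link."""
    return 1000.0 * packet_bytes * 8 / link_bps


def in_order_delivery_times(arrival_order, slot_ms):
    """Delivery time per sequence number for an in-order receiver.

    arrival_order[k] is the sequence number of the k-th packet on the
    wire; sequence numbers are assumed to be 0..n-1 without gaps. A
    packet is handed to the application only once it and all earlier
    sequence numbers have arrived.
    """
    arrival = {seq: (k + 1) * slot_ms for k, seq in enumerate(arrival_order)}
    delivered = {}
    latest = 0.0
    for seq in sorted(arrival):
        latest = max(latest, arrival[seq])  # wait for the slowest predecessor
        delivered[seq] = latest
    return delivered


if __name__ == "__main__":
    slot = serialization_delay_ms(1500, 1_000_000)
    in_order = in_order_delivery_times(list(range(10)), slot)
    reordered = in_order_delivery_times(list(range(1, 10)) + [0], slot)
    print("slot:", slot, "ms")
    print("extra delay of first update:", reordered[0] - in_order[0], "ms")
```

With these assumptions each packet occupies the link for 12 ms, and the
first position update reaches the application 108 ms later than it
would have without the re-ordering, because packets 1 to 9 sit in the
receive buffer until packet 0 finally arrives.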


>=20
>>>>> For three decades now, we have acted as if there is no cost for =
in-order delivery from L2 =E2=80=94 not because that is true, but =
because deployed transport protocol implementations were built and =
tested with simple links that don=E2=80=99t reorder. =20
>>>>=20
>>>> 	Well, that is similar to the argument for performing non-aligned =
loads fast in hardware, yes this comes with a considerable cost in =
complexity and it is harder to make this go fast than just allowing =
aligned loads and fixing up unaligned loads by trapping to software, but =
from a user perspective the fast hardware beats the fickle only make =
aligned loads go fast approach any old day.
>>>=20
>>> CPUs have an abundance of transistors you can throw at this problem =
so the support of unaligned loads has become standard practice for CPUs =
with enough transistors.
>>> I=E2=80=99m not sure this argument transfers, because this is not =
about transistors (except maybe when we talk about in-queue =
resequencing, which would be a nice feature if we had information in the =
packets to allow it).
>>=20
>> Like the 5-tuple in TCP and UDP?
>=20
> That doesn=E2=80=99t help.  I need a sequence number for resequencing,

	If you do ARQ on your link, you will in all likelihood have =
something equivalent as you need to identify the packets that need to be =
retransmitted. As my proposed goal is not generic re-sequencing, but =
simply not to introduce any additional re-ordering, that should be =
sufficient, no?

> and I can=E2=80=99t use the transport layer one because that is being =
encrypted.  Again, this is mostly theoretical as I don=E2=80=99t see =
people rushing to do in-queue resequencing any time soon.

	I guess you are right, but then please do not complain that you =
need to stay idle while waiting for retransmit, if that is a conscious =
trade-off you engineered into your system ;)

>=20
> (Skipping some text that is not relevant to my argument here.)
>=20
>>> Where does this number come from?  100 ms is pretty long as a =
reordering maximum for most paths outside of satellite links. Instead, =
you would do something based on an RTT estimate.
>>=20
>> 	I just made that number up as the exact N does not matter, the
>> argument is whatever we set as the new threshold will be approached by
>> transport characteristics. Then again having something that inversely
>> scales with bandwidth is certainly terrible from a transport
>> perspective, so I can understand the argument for a fixed temporal
>> threshold.
>=20
> I don=E2=80=99t follow at all here.

	Well, the RACK draft makes the same point, as I discovered later
(https://tools.ietf.org/html/draft-ietf-tcpm-rack-04):

"From a network or link designer's viewpoint, parallelization (eg.
   link bonding) is the easiest way to get a network to go faster.
   Therefore their main constraint on speed is reordering, and there is
   pressure to relax that constraint.  If RACK becomes widely deployed,
   the underlying networks may introduce more reordering for higher
   throughput.  But this may result in excessive reordering that hurts
   end to end performance:

   1.  End host packet processing: extreme reordering on high-speed
       networks would incur high CPU cost by greatly reducing the
       effectiveness of aggregation mechanisms, such as large receive
       offload (LRO) and generic receive offload (GRO), and
       significantly increasing the number of ACKs.

   2.  Congestion control: TCP congestion control implicitly assumes the
       feedback from ACKs are from the same bottleneck.  Therefore it
       cannot handle well scenarios where packets are traversing largely
       disjoint paths.

   3.  Loss recovery: Having an excessively large reordering window to
       accommodate widely different latencies from different paths would
       increase the latency of loss recovery."

I note that the benefit of reordering is all in the transport network,
while the costs are all carried by the endpoints. With this kind of
skewed incentives, what outcome do you expect?


>=20
>>>>> at least within some limits that we still have to find.
>>>>> That probably requires some evolution at the end-to-end transport =
implementation layer.  We are in a better position to make that happen =
than we have been for a long time.
>>>>=20
>>>> 	Probably true, but also not very attractive from an end-user =
perspective=E2=80=A6. unless this will allow transport innovations that =
will allow massively more bandwidth at a smallish latency cost.
>>>=20
>>> The argument against in-network resequencing is mostly a latency =
argument (but, as a second order effect, that reduced latency may also =
allow more throughput), so, again, I don=E2=80=99t quite understand.
>>=20
>> 	As I tried to show for TCP the flow with re-ordered packets =
certainly pays a latency cost that especially if re-ordering does not =
happen on the bottleneck link but at a faster link could be smaller.
>=20
> I can=E2=80=99t parse this sentence, but my main point remains:
>=20
> In-network resequencing increases latency (with a potential impact on =
throughput, too), unless it happens within a queue. =20

	But this latency is not necessarily end-point visible latency. If a
tree falls in a wood and no one is there to hear it, does it make a
sound?

> We wouldn=E2=80=99t want to do that, unless forced by a transport =
protocol that can=E2=80=99t cope.  If we can fix the transport protocols =
to enable (out-of-order) immediate forwarding, then let=E2=80=99s do it; =
this might also enable doing more in-network recovery, with the =
attendant performance improvements.

	If all of this does not increase the end-point visible latency and
latency variation (too much), I am all for it, but if it does, I maintain
that the network should serve its users, not the other way around (easy
to say if one's position is pure end-user and the complexity of making it
happen falls onto others).
Anyway, thanks for your time, arguments and information; that gives me
something to think about.

Best regards
	Sebastian

>=20
> Regards, Carsten
>=20