[Rpm] draft-ietf-ippm-responsiveness

Sun Dec 3 13:13:47 EST 2023

Dear IPPM members,

On re-reading the current responsiveness draft, I stumbled over the following section:

Parallel vs Sequential Uplink and Downlink

Poor responsiveness can be caused by queues in either (or both) the upstream and the downstream direction. Furthermore, both paths may differ significantly due to access link conditions (e.g., 5G downstream and LTE upstream) or routing changes within the ISPs. To measure responsiveness under working conditions, the algorithm must explore both directions.

One approach could be to measure responsiveness in the uplink and downlink in parallel. It would allow for a shorter test run-time.

However, a number of caveats come with measuring in parallel:

	• Half-duplex links may not permit simultaneous uplink and downlink traffic. This restriction means the test might not reach the path's capacity in both directions at once and thus not expose all the potential sources of low responsiveness.
	• Debuggability of the results becomes harder: During parallel measurement it is impossible to differentiate whether the observed latency happens in the uplink or the downlink direction.
Thus, we recommend testing uplink and downlink sequentially. Parallel testing is considered a future extension.

I argue that this is not the correct diagnosis and hence not the correct decision.
For half-duplex links, the given argument is not incorrect but incomplete. When such a link is forced to multiplex more bi-directional traffic, it will quite likely expose different "potential sources of low responsiveness", so ignoring either of the two modes seems ill-advised. (All TCP testing is bi-directional, so we are only arguing about the amount of reverse traffic, not whether it exists; even if we switched to QUIC/UDP, we would still need a feedback channel.)
Debuggability is not "rocket science" either. All one needs is a timestamp format carrying three values (similar to what NTP uses). With it, one can establish baseline one-way delays (OWDs), even without synchronized clocks, and then, under bi-directional load, see which of these unloaded OWDs actually increases. So I argue that "it is impossible to differentiate whether the observed latency happens in the uplink or the downlink direction" is simply an incorrect assertion. In fact, we are already doing this successfully on the existing Internet as part of the cake-autorate project [h++ps://github.com/lynxthecat/cake-autorate/tree/master], based on ICMP timestamps. The relevant observation here is that we are not necessarily interested in veridical OWDs under idle conditions; we want to see which OWD(s) increase under working conditions, and that works with unsynchronized clocks and is also robust against slow clock drift.
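To make the argument concrete, here is a minimal Python sketch, assuming each probe yields an originate, a receive and a transmit timestamp (as ICMP timestamp replies do) plus the sender's local arrival time. The names `Probe`, `baseline` and `owd_increase` are hypothetical and are not taken from cake-autorate or the draft. The raw per-direction "OWDs" contain the unknown clock offset between the hosts, but subtracting an idle baseline cancels that offset and leaves the increase in each direction.

```python
from dataclasses import dataclass
from statistics import median


# Hypothetical helper types/functions for illustration; not taken from cake-autorate.
@dataclass
class Probe:
    t_orig: float  # sender clock: probe sent (originate timestamp)
    t_recv: float  # remote clock: probe received (receive timestamp)
    t_xmit: float  # remote clock: reply sent (transmit timestamp)
    t_back: float  # sender clock: reply received (recorded locally)

    def raw_owds(self) -> tuple[float, float]:
        # Both values include the unknown clock offset between the two hosts
        # (with opposite signs), so they are NOT true one-way delays.
        up = self.t_recv - self.t_orig
        down = self.t_back - self.t_xmit
        return up, down


def baseline(idle: list[Probe]) -> tuple[float, float]:
    """Minimum raw OWD per direction, measured while the path is idle."""
    ups, downs = zip(*(p.raw_owds() for p in idle))
    return min(ups), min(downs)


def owd_increase(loaded: list[Probe], base: tuple[float, float]) -> tuple[float, float]:
    """Median increase over the idle baseline, per direction (uplink, downlink).

    Subtracting the baseline cancels the clock offset, assuming the offset
    stays constant for the duration of the test.
    """
    ups, downs = zip(*(p.raw_owds() for p in loaded))
    return median(ups) - base[0], median(downs) - base[1]


if __name__ == "__main__":
    # Remote clock runs 1000 ms ahead; idle uplink ~10 ms, downlink ~15 ms.
    idle = [Probe(0, 1010, 1011, 26), Probe(100, 1112, 1113, 130)]
    # Under load, a 40 ms uplink queue builds up; the downlink is unchanged.
    loaded = [Probe(200, 1250, 1251, 266)]
    print(owd_increase(loaded, baseline(idle)))  # prints (40, 0)
```

The example shows the uplink delay rising by 40 ms while the downlink stays flat, even though neither host knows the true clock offset. This sketch treats the offset as constant; a long-running tool would additionally need to track slow drift, for instance by refreshing the baseline periodically.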

Given these observations, I ask that we change this design parameter: require both measurement modes by default, with parallel testing as the default mode (or randomly select between the two modes, but report which one was chosen).

Best Regards
	Sebastian