[Rpm] [ippm] draft-ietf-ippm-responsiveness
Sebastian Moeller
moeller0 at gmx.de
Fri Jan 19 08:14:09 EST 2024
Hi Christoph
> On 16. Jan 2024, at 20:01, Christoph Paasch <cpaasch at apple.com> wrote:
>
> Hello Sebastian,
>
>
> thanks for the feedback, please see inline!
>
>> On Dec 3, 2023, at 10:13 AM, Sebastian Moeller <moeller0 at gmx.de> wrote:
>>
>> Dear IPPM members,
>>
>> On re-reading the current responsiveness draft I stumbled over the following section:
>>
>>
>> Parallel vs Sequential Uplink and Downlink
>>
>> Poor responsiveness can be caused by queues in either (or both) the upstream and the downstream direction. Furthermore, both paths may differ significantly due to access link conditions (e.g., 5G downstream and LTE upstream) or routing changes within the ISPs. To measure responsiveness under working conditions, the algorithm must explore both directions.
>>
>> One approach could be to measure responsiveness in the uplink and downlink in parallel. It would allow for a shorter test run-time.
>>
>> However, a number of caveats come with measuring in parallel:
>>
>> • Half-duplex links may not permit simultaneous uplink and downlink traffic. This restriction means the test might not reach the path's capacity in both directions at once and thus not expose all the potential sources of low responsiveness.
>> • Debuggability of the results becomes harder: During parallel measurement it is impossible to differentiate whether the observed latency happens in the uplink or the downlink direction.
>> Thus, we recommend testing uplink and downlink sequentially. Parallel testing is considered a future extension.
>>
>>
>> I argue that this is not the correct diagnosis and hence not the correct decision.
>> For half-duplex links the given argument is not incorrect, but it is incomplete: when forced to multiplex more bi-directional traffic, we will likely see different "potential sources of low responsiveness", so ignoring either direction seems ill advised. (All TCP testing is bi-directional anyway, so we only argue about the amount of reverse traffic, not whether it exists; and even if we switched to QUIC/UDP we would still need a feedback channel.)
>
> You are saying that parallel bi-directional traffic exposes different sources of responsiveness issues than uni-directional traffic (up and down) ? What kind of different sources would that expose ? Can you give some examples and maybe a suggestion on how to word things ?
[SM] If the bottleneck is a WiFi link, we occasionally see that some OSes are more aggressive than others in acquiring airtime, which easily results in differential throughput for the two directions and often in higher queueing delay for the direction that is slowed down. In theory that should not happen, but in practice it does, e.g. when the ISP unhelpfully passes undesired DSCP marks into a home network, where they are then acted upon by WiFi WMM. To elaborate: Comcast for a long time had an issue where large fractions (IIRC up to 25%) of packets were inadvertently marked as CS1, which in default WMM maps to AC_BK. If the client sends its upload traffic via the default AC_BE, this differential AC usage can result in different queueing delay compared to measuring upload and download individually. (If all traffic of a channel used AC_BK instead of AC_BE, this should not affect latency much.)
Side-note: Comcast, after being alerted, took notice of the issue and fixed it, but I think this kind of issue can happen to other ISPs as well.
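To make the WMM effect above concrete, here is a small sketch (my own illustration, not part of the draft) of the default mapping from a DSCP codepoint to a WiFi access category, showing how CS1-marked traffic ends up competing from AC_BK while default-marked traffic uses AC_BE:

```python
# 802.1d User Priority -> WMM access category (default mapping).
# The UP is taken from the upper three bits of the 6-bit DSCP field.
UP_TO_AC = {
    1: "AC_BK", 2: "AC_BK",   # background
    0: "AC_BE", 3: "AC_BE",   # best effort
    4: "AC_VI", 5: "AC_VI",   # video
    6: "AC_VO", 7: "AC_VO",   # voice
}

def wmm_ac(dscp: int) -> str:
    """Map a DSCP codepoint to its default WMM access category."""
    user_priority = dscp >> 3  # upper 3 bits of the 6-bit DSCP
    return UP_TO_AC[user_priority]

CS1, BEST_EFFORT = 8, 0
print(wmm_ac(CS1))          # AC_BK: the mismarked download traffic
print(wmm_ac(BEST_EFFORT))  # AC_BE: the client's upload traffic
```

With this default mapping the two directions of one measurement flow can end up in different airtime-priority classes, which is exactly the asymmetry described above.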
>
>> Debuggability is not "rocket science" either: all one needs is a three-value timestamp format (similar to what NTP uses) and one can, even without synchronized clocks, establish baseline OWDs and then, under bi-directional load, see which of these unloaded OWDs actually increases. So I argue that "it is impossible to differentiate whether the observed latency happens in the uplink or the downlink direction" is simply an incorrect assertion. (We are actually doing this successfully in the existing internet as part of the cake-autorate project [h++ps://github.com/lynxthecat/cake-autorate/tree/master], based on ICMP timestamps.) The relevant observation here is that we are not necessarily interested in veridical OWDs under idle conditions; we want to see which OWD(s) increase under working conditions, and that works with desynchronized clocks and is also robust against slow clock drift.
>
> Unfortunately, this would require for the server to add timestamps to the HTTP-response, right ?
[SM] Yes, in a sense... but that could be a small process that simply updates the content of that file every couple of milliseconds, so it would not strictly need to be the server process...
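A minimal sketch of that "small process" idea; the file name, JSON field, and update interval are my own invention, not anything specified by the draft. The web server would simply serve this file as static content:

```python
# Sketch: a helper process, independent of the web server, that
# rewrites a tiny file with a fresh server timestamp every few
# milliseconds. All names here are illustrative assumptions.
import json
import os
import tempfile
import time

def run_timestamp_writer(path: str = "server_time.json",
                         interval_s: float = 0.005,
                         iterations: int = 3) -> None:
    for _ in range(iterations):  # in a real deployment: loop forever
        payload = json.dumps({"server_send_ns": time.time_ns()})
        # Write to a temp file and rename, so readers never see a
        # partially written file.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
        with os.fdopen(fd, "w") as f:
            f.write(payload)
        os.replace(tmp, path)
        time.sleep(interval_s)

run_timestamp_writer()
```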
> We opted against this because the “power” of the responsiveness methodology is that it is extremely lightweight on the server-side. And with lightweight I mean not only from an implementation/CPU perspective but also from a deployment perspective. All one needs to do on the server in order to provide a responsiveness-measurement-endpoint is to host 2 files (one very large one and a very small one) and provide an endpoint to “POST” data to. All of these are standard capabilities in every webserver that can easily be configured. And we have seen a rise of endpoints showing up thanks to the simplicity to deploy it.
>
> So, it is IMO a balance between “deployability” and “debuggability”. The responsiveness test is clearly aiming towards being deployable and accessible. Thus I think we would prefer keeping things on the server-side simple.
>
>
> Thoughts ?
[SM] I really would like some way to get OWDs, if only optionally, but even more than that I think RPM should get as wide a deployment as possible; ubiquity has its own inherent value for measurement platforms, so if this makes deployment harder it would be a no-go.
Now, I get that this is a long shot, but I fear that if the draft does not mention this at all, the chance will be gone forever...
Could we maybe add a description of an optional 'time' payload, so clients could expect a single standardised format for it, if a server chooses to support it?
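To illustrate the desynchronized-clock argument from above with made-up numbers: with a three-value timestamp exchange (client send, server, client receive), both pseudo-OWDs contain the unknown clock offset, but the offset cancels when comparing loaded measurements against an idle baseline, so the queueing direction is still identifiable. This is only a toy sketch of the reasoning, not code from cake-autorate or the draft:

```python
def pseudo_owds(t1_client_send, t2_server, t3_client_recv):
    """Pseudo-OWDs; both include the unknown client/server clock offset."""
    return t2_server - t1_client_send, t3_client_recv - t2_server

OFFSET = 1_000_000  # unknown server clock offset (microseconds)

# Idle baseline: 10 ms up, 12 ms down (true values, in microseconds)
base_up, base_down = pseudo_owds(0, 10_000 + OFFSET, 22_000 + OFFSET)

# Under upstream load: the uplink queue adds 80 ms, downlink unchanged
load_up, load_down = pseudo_owds(100_000, 190_000 + OFFSET,
                                 202_000 + OFFSET)

# The offset cancels in the deltas, exposing the queueing direction:
print(load_up - base_up)      # 80000 us: queueing is in the uplink
print(load_down - base_down)  # 0 us: the downlink is fine
```

Slow clock drift only perturbs these deltas gradually, which is why the baseline comparison remains usable over a test's run-time.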
> That being said, I’m not entirely opposed to recommending the parallel mode as well. The interesting bit about the parallel mode is not so much the responsiveness measurement but rather the capacity measurement. Because, surprisingly many modems/… that are supposedly (according to their spec-sheet) able to handle 1 Gbps full-duplex suddenly show their weakness and are no more able to handle line-rate. So, it is more about capacity than responsiveness IMO.
[SM] True, yet such overload also occasionally affects queueing delay and jitter (sure, RPM does not report jitter, but it likely affects a test's ability to reach the required stability criteria).
> However, as a frequent user of the networkQuality-tool I realize myself that whenever I want to test my network I end up using a sequential test in favor of the parallel test.
[SM] I agree that a full complement of upload, then download, then combined upload & download is a great tool for understanding network behaviour. I also want to applaud Apple's networkQuality as an excellent implementation of the ideas behind this draft, offering a great and well-selected set of options:
USAGE: networkQuality [-C <configuration_url>] [-c] [-d] [-f <comma-separated list>] [-h] [-I <network interface name>] [-k] [-p] [-r host] [-S <port>] [-s] [-u] [-v]
-C: Override Configuration URL or path (with scheme file://)
-c: Produce computer-readable output
-d: Do not run a download test (implies -s)
-f: <comma-separated list>: Enforce Protocol selections. Available options:
h1: Force-enable HTTP/1.1
h2: Force-enable HTTP/2
h3: Force-enable HTTP/3 (QUIC)
L4S: Force-enable L4S
noL4S: Force-disable L4S
-h: Show help (this message)
-I: Bind test to interface (e.g., en0, pdp_ip0,...)
-k: Disable certificate validation
-p: Use iCloud Private Relay
-r: Connect to host or IP, overriding DNS for initial config request
-S: Start and run server on specified port. Other specified options ignored
-s: Run tests sequentially instead of parallel upload/download
-u: Do not run an upload test (implies -s)
-v: Verbose output
that cover a lot of cases with a relatively small set of control parameters.
>
>
>
> Christoph
>
>
>>
>> Given these observations, I ask that we change this design parameter to require both measurement modes, defaulting to parallel testing (or randomly selecting between both modes, but reporting which one was chosen).
>>
>> Best Regards
>> Sebastian
>> _______________________________________________
>> ippm mailing list
>> ippm at ietf.org
>> https://www.ietf.org/mailman/listinfo/ippm
>