[Starlink] Starlink hidden buffers

Ulrich Speidel u.speidel at auckland.ac.nz
Sun May 14 02:06:42 EDT 2023


On 14/05/2023 10:57 am, David Lang wrote:
> On Sat, 13 May 2023, Ulrich Speidel via Starlink wrote:
>
>> Here's a bit of a question to you all. See what you make of it.
>>
>> I've been thinking a bit about the latencies we see in the Starlink 
>> network. This is why this list exists (right, Dave?). So what do we know?
>>
>> 1) We know that RTTs can be in the 100's of ms even in what appear to 
>> be bent-pipe scenarios where the physical one-way path should be well 
>> under 3000 km, with physical RTT under 20 ms.
>> 2) We know from plenty of traceroutes that these RTTs accrue in the 
>> Starlink network, not between the Starlink handover point (POP) and 
>> the Internet.
>> 3) We know that they aren't an artifact of the Starlink WiFi router 
>> (our traceroutes were done through their Ethernet adaptor, which 
>> bypasses the router), so they must be delays on the satellites or the 
>> teleports.
>
> the Ethernet adapter bypasses the WiFi, but not the router - you have 
> to cut the cable and replace the plug to bypass the router
Good point - but you still don't get the WiFi buffering here. Or at 
least we don't seem to, looking at the difference between running with 
and without the adapter.
>
>> 4) We know that processing delay isn't a huge factor because we also 
>> see RTTs well under 30 ms.
>> 5) That leaves queuing delays.
>>
>> This issue has been known for a while now. Starlink have been 
>> innovating their heart out around pretty much everything here - and 
>> yet, this bufferbloat issue hasn't changed, despite Dave proposing 
>> what appears to be an easy fix compared to a lot of other things they 
>> have done. So what are we possibly missing here?
>>
>> Going back to first principles: The purpose of a buffer on a network 
>> device is to act as a shock absorber against sudden traffic bursts. 
>> If I want to size that buffer correctly, I need to know at the very 
>> least (paraphrasing queueing theory here) something about my packet 
>> arrival process.
>
> The question is over what timeframe. If you have a huge buffer, you 
> can buffer 10s of seconds of traffic and eventually send it. That will 
> make benchmarks look good, but not the user experience. The rapid drop 
> in RAM prices (beyond merely a free fall) and the benchmark scores 
> that heavily penalized any dropped packets encouraged buffers to get 
> larger than is sane.
>
> it's still a good question to define what is sane. The longer the 
> buffer, the more of a chance of finding time to catch up, but having 
> packets in the buffer that have timed out (e.g. DNS queries tend to 
> time out after 3 seconds, TCP will give up and send replacement 
> packets, making the initial packets meaningless) is counterproductive. 
> What is the acceptable delay to your users?
>
> Here at the bufferbloat project, we tend to say that buffers past a 
> few 10s of ms worth of traffic are probably bad, and we are aiming for 
> single-digit ms in many cases.
Taken as read.
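To put rough numbers on "a few 10s of ms worth of traffic": the queue depth that corresponds to a delay target is simply link rate times delay. A minimal sketch (the 100 Mbit/s rate is an assumed illustrative figure, not a measured Starlink one):

```python
def buffer_bytes(link_rate_bps: float, delay_target_s: float) -> int:
    """Queue depth corresponding to a sojourn-time target at a given rate."""
    return int(link_rate_bps / 8 * delay_target_s)

rate = 100e6  # assumed 100 Mbit/s link, for illustration only
print(buffer_bytes(rate, 0.005))  # 5 ms target -> 62500
print(buffer_bytes(rate, 10.0))   # 10 s of buffering -> 125000000
```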
>
>> If I look at conventional routers, then that arrival process involves 
>> traffic generated by a user population that changes relatively 
>> slowly: WiFi users come and go. One at a time. Computers in a company 
>> get turned on and off and rebooted, but there are no instantaneous 
>> jumps in load - you don't suddenly have a hundred users in the middle 
>> of watching Netflix turning up that weren't there a second ago. Most 
>> of what we know about Internet traffic behaviour is based on this 
>> sort of network, and this is what we've designed our queuing systems 
>> around, right?
>
> not true - for businesses, every hour as meetings start and let out, 
> and as people arrive in the morning or back from lunch, you have 
> very sharp changes in the traffic.
And herein lies the crunch: All of these things that you list happen 
over much longer timeframes than a switch to a different satellite. 
Also, folk coming back from lunch would start with something like 
cwnd=10. Users whose TCP connections get switched over to a different 
satellite by some underlying tunneling protocol could have much larger 
cwnd.
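The difference between those two cases can be put in bytes. A back-of-the-envelope sketch (cwnd=1000 segments for the handed-over bulk flow is an assumed illustrative value, not a measurement):

```python
MSS = 1460  # bytes; typical MSS on an Ethernet path

def inflight_bytes(cwnd_segments: int) -> int:
    """Upper bound on unacknowledged data at a given cwnd."""
    return cwnd_segments * MSS

# Fresh flow (e.g. someone just back from lunch), initial window of 10:
print(inflight_bytes(10))    # 14600
# Mature bulk flow carried across a handover, cwnd=1000 (assumed):
print(inflight_bytes(1000))  # 1460000
```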
>
> at home you have fewer changes in users, but you also may have less 
> bandwidth (although many tech enthusiasts have more bandwidth than 
> many companies, two of my last 3 jobs have had <400Mb at their main 
> office with hundreds of employees while many people would consider 
> that 'slow' for home use). As such a parent arriving home with a 
> couple of kids will make a drastic change to the network usage in a 
> very short time.
I think you've missed my point - I'm talking about changes in the 
network mid-flight, not people coming home and getting started over a 
period of a few minutes. The change you see in a handover is sudden, 
probably with a sub-second ramp-up. And it's something that doesn't just 
happen when people come home or return from lunch - it happens every few 
minutes.
>
>
> but the active queueing systems that we are designing (cake, fq_codel) 
> handle these conditions very well because they don't try to guess what 
> the usage is going to be, they just look at the packets that they have 
> to process and figure out how to dispatch them in the best way.
Understood - I've followed your work.
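For reference, on a Linux box routing for the dish, switching from the default FIFO to the queue disciplines David mentions is a one-line change (eth0 and the 100 Mbit/s shaping rate are placeholders for your actual WAN interface and measured bottleneck rate):

```shell
# Replace the default FIFO with CAKE, shaping just below the bottleneck
# so the queue forms where CAKE can manage it:
tc qdisc replace dev eth0 root cake bandwidth 100mbit

# Or fq_codel, with its default 5 ms sojourn-time target:
tc qdisc replace dev eth0 root fq_codel
```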
>
> because we have observed that latency tends to be more noticeable for 
> short connections (DNS, checking if cached web pages are up to date, 
> etc), our algorithms give a slight priority to new-low-traffic 
> connections over long-running-high-traffic connections rather than 
> just splitting the bandwidth evenly across all connections, and can 
> even go further to split bandwidth between endpoints, not just 
> connections (with endpoints being a configurable definition)
>
> without active queue management, the default is FIFO, which allows the 
> high-user-impact, short connection packets to sit in a queue behind 
> the low-user-impact, bulk data transfers. For benchmarks, 
> a-packet-is-a-packet and they all count, so until you have enough 
> buffering that you start having expired packets in flight, it doesn't 
> matter, but for the user experience, there can be a huge difference.

All understood - you're preaching to the converted. It's just that I 
think Starlink may be a different ballpark.

Put another way: If a protocol (TCP) that is designed to reasonably 
expect that its current cwnd is OK to use for now is put into a 
situation where there are relatively frequent, huge and lasting step 
changes in available BDP within sub-second periods, are its underlying 
assumptions still valid?
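The step change can be put as simple arithmetic. A sketch with made-up but plausible rates and RTTs: a flow whose cwnd has grown to fill one satellite path is handed over to a path with a much smaller BDP, and the difference has to land in a buffer (or on the floor):

```python
def bdp_bytes(rate_bps: float, rtt_s: float) -> int:
    """Bandwidth-delay product: bytes the path itself can hold."""
    return round(rate_bps / 8 * rtt_s)

def excess_after_step(cwnd_bytes: int, new_rate_bps: float, new_rtt_s: float) -> int:
    """Bytes that must queue (or be dropped) right after a path change."""
    return max(0, cwnd_bytes - bdp_bytes(new_rate_bps, new_rtt_s))

# Flow has grown its cwnd to fill a 200 Mbit/s, 30 ms path:
cwnd = bdp_bytes(200e6, 0.030)               # 750,000 bytes in flight
# Handover moves it to a 50 Mbit/s, 20 ms path (BDP 125,000 bytes):
print(excess_after_step(cwnd, 50e6, 0.020))  # 625000
```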

I suspect they're handing over whole cells, not individual users, at a 
time.

>
> David Lang
>
-- 
****************************************************************
Dr. Ulrich Speidel

School of Computer Science

Room 303S.594 (City Campus)

The University of Auckland
u.speidel at auckland.ac.nz
http://www.cs.auckland.ac.nz/~ulrich/
****************************************************************




