[Starlink] Starlink hidden buffers

David Lang david at lang.hm
Sat May 13 18:57:56 EDT 2023


On Sat, 13 May 2023, Ulrich Speidel via Starlink wrote:

> Here's a bit of a question to you all. See what you make of it.
>
> I've been thinking a bit about the latencies we see in the Starlink 
> network. This is why this list exists (right, Dave?). So what do we know?
>
> 1) We know that RTTs can be in the 100's of ms even in what appear to be 
> bent-pipe scenarios where the physical one-way path should be well under 
> 3000 km, with physical RTT under 20 ms.
> 2) We know from plenty of traceroutes that these RTTs accrue in the 
> Starlink network, not between the Starlink handover point (POP) and the 
> Internet.
> 3) We know that they aren't an artifact of the Starlink WiFi router (our 
> traceroutes were done through their Ethernet adaptor, which bypasses the 
> router), so they must be delays on the satellites or the teleports.

The Ethernet adapter bypasses the WiFi, but not the router; you have to cut the 
cable and replace the plug to bypass the router.

> 4) We know that processing delay isn't a huge factor because we also see 
> RTTs well under 30 ms.
> 5) That leaves queuing delays.
>
> This issue has been known for a while now. Starlink have been innovating 
> their heart out around pretty much everything here - and yet, this 
> bufferbloat issue hasn't changed, despite Dave proposing what appears to 
> be an easy fix compared to a lot of other things they have done. So what 
> are we possibly missing here?
>
> Going back to first principles: The purpose of a buffer on a network 
> device is to act as a shock absorber against sudden traffic bursts. If I 
> want to size that buffer correctly, I need to know at the very least 
> (paraphrasing queueing theory here) something about my packet arrival 
> process.

The question is over what timeframe. If you have a huge buffer, you can absorb 
tens of seconds of traffic and eventually send it all. That will make benchmarks 
look good, but not the user experience. The precipitous drop in RAM prices and 
benchmark scores that heavily penalized any dropped packets encouraged buffers 
to grow far larger than is sane.
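
To put rough numbers on that, here is a quick Python sketch (the buffer size 
and link rate are made-up illustrations, not measured Starlink values):

# How much delay a full FIFO buffer adds: the time to drain it at the link rate.
def queue_delay_ms(buffer_bytes, link_rate_mbps):
    return buffer_bytes * 8 / (link_rate_mbps * 1e6) * 1000

# A hypothetical 32 MB buffer on a 50 Mbit/s link holds over 5 seconds of
# traffic once it fills, well past the ~3 s point where a DNS client has
# already given up and retried.
print(queue_delay_ms(32 * 1024 * 1024, 50))   # ~5369 ms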

It's still a good question to define what is sane. The longer the buffer, the 
more of a chance of finding time to catch up, but holding packets in the buffer 
past the point where they have timed out (e.g. DNS queries tend to time out 
after 3 seconds, and TCP will give up and send replacement packets, making the 
original packets meaningless) is counterproductive. What is the acceptable delay 
to your users?

Here at the bufferbloat project, we tend to say that buffers holding more than a 
few tens of ms worth of traffic are probably bad, and we are aiming for 
single-digit ms in many cases.
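
Run the other way, that target tells you how small a buffer has to be. Another 
quick sketch (the link rates are arbitrary examples):

def buffer_bytes_for_delay(target_ms, link_rate_mbps):
    # largest buffer that can still drain within the target delay
    return int(target_ms / 1000 * link_rate_mbps * 1e6 / 8)

for rate_mbps in (20, 100, 500):
    print(rate_mbps, buffer_bytes_for_delay(10, rate_mbps))
# 10 ms of buffering is roughly 25 KB at 20 Mbit/s, 125 KB at 100 Mbit/s and
# 625 KB at 500 Mbit/s, a long way from multi-megabyte buffers.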

> If I look at conventional routers, then that arrival process involves 
> traffic generated by a user population that changes relatively slowly: 
> WiFi users come and go. One at a time. Computers in a company get turned 
> on and off and rebooted, but there are no instantaneous jumps in load - 
> you don't suddenly have a hundred users in the middle of watching 
> Netflix turning up that weren't there a second ago. Most of what we know 
> about Internet traffic behaviour is based on this sort of network, and 
> this is what we've designed our queuing systems around, right?

Not true. For businesses, every hour as meetings start and let out, and as 
people arrive in the morning or come back from lunch, you see very sharp 
changes in traffic.

At home you have fewer changes in users, but you may also have less bandwidth 
(although many tech enthusiasts have more bandwidth than many companies: two of 
my last three jobs had <400 Mb at their main office with hundreds of employees, 
while many people would consider that 'slow' for home use). As such, a parent 
arriving home with a couple of kids will make a drastic change to network usage 
in a very short time.

But the active queueing systems that we are designing (cake, fq_codel) handle 
these conditions very well, because they don't try to guess what the usage is 
going to be; they just look at the packets they have to process and figure out 
how to dispatch them in the best way.

Because we have observed that latency tends to be more noticeable for short 
connections (DNS, checking whether cached web pages are up to date, etc.), our 
algorithms give a slight priority to new, low-traffic connections over 
long-running, high-traffic connections rather than just splitting the bandwidth 
evenly across all connections, and can even go further and split bandwidth 
between endpoints, not just connections (with 'endpoint' being a configurable 
definition).
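
For the curious, here is a heavily simplified Python sketch of that 
flow-queueing idea. It keys flows by an explicit id rather than hashing the 
packet headers, and it leaves out CoDel's delay-based dropping, byte quantums 
and the per-host fairness that cake adds, so treat it as an illustration of the 
new-flow/old-flow rotation rather than how fq_codel is actually implemented:

from collections import deque, defaultdict

class FlowQueueSketch:
    """Toy flow-queue scheduler: per-flow queues, new flows served first."""

    def __init__(self):
        self.queues = defaultdict(deque)   # flow id -> queued packets
        self.new_flows = deque()           # flows that just became active
        self.old_flows = deque()           # long-running flows

    def enqueue(self, flow_id, packet):
        if (not self.queues[flow_id] and flow_id not in self.new_flows
                and flow_id not in self.old_flows):
            self.new_flows.append(flow_id)   # new, low-traffic flow jumps the line
        self.queues[flow_id].append(packet)

    def dequeue(self):
        for active in (self.new_flows, self.old_flows):
            while active:
                flow_id = active.popleft()
                if self.queues[flow_id]:
                    # serve one packet, then rotate the flow to the back of
                    # the old list so every flow gets a share of the link
                    self.old_flows.append(flow_id)
                    return self.queues[flow_id].popleft()
                # empty queue: the flow drops out until it sends again
        return None

With this structure a one-packet DNS query lands in new_flows and goes out on 
the next dequeue, even if a bulk flow has thousands of packets sitting in its 
own queue.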

Without active queue management, the default is FIFO, which allows the 
high-user-impact, short-connection packets to sit in a queue behind the 
low-user-impact bulk data transfers. For benchmarks, a packet is a packet and 
they all count, so until you have enough buffering that you start having 
expired packets in flight, it doesn't matter, but for the user experience, 
there can be a huge difference.
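
A toy comparison makes the difference concrete (made-up numbers again):

# With FIFO, a single DNS query arriving behind bulk data waits for all of it.
bulk_bytes = 5 * 1024 * 1024     # 5 MB already queued ahead of the query
link_mbps = 100
wait_ms = bulk_bytes * 8 / (link_mbps * 1e6) * 1000
print(f"FIFO: the DNS query waits ~{wait_ms:.0f} ms")   # ~419 ms
# With flow queueing, the same query sits in its own (new) flow queue and goes
# out within a packet or two, i.e. well under a millisecond at this rate.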

David Lang

