[Starlink] Starlink hidden buffers
Ulrich Speidel
u.speidel at auckland.ac.nz
Sun May 14 02:06:42 EDT 2023
On 14/05/2023 10:57 am, David Lang wrote:
> On Sat, 13 May 2023, Ulrich Speidel via Starlink wrote:
>
>> Here's a bit of a question to you all. See what you make of it.
>>
>> I've been thinking a bit about the latencies we see in the Starlink
>> network. This is why this list exists (right, Dave?). So what do we know?
>>
>> 1) We know that RTTs can be in the hundreds of ms even in what appear to
>> be bent-pipe scenarios where the physical one-way path should be well
>> under 3000 km, with physical RTT under 20 ms.
>> 2) We know from plenty of traceroutes that these RTTs accrue in the
>> Starlink network, not between the Starlink handover point (POP) and
>> the Internet.
>> 3) We know that they aren't an artifact of the Starlink WiFi router
>> (our traceroutes were done through their Ethernet adapter, which
>> bypasses the router), so they must be delays on the satellites or the
>> teleports.
>
> the ethernet adapter bypasses the wifi, but not the router; you have
> to cut the cable and replace the plug to bypass the router.
Good point - but you still don't get the WiFi buffering here. Or at
least we don't seem to, looking at the difference between running with
and without the adapter.
>
>> 4) We know that processing delay isn't a huge factor because we also
>> see RTTs well under 30 ms.
>> 5) That leaves queuing delays.
>>
>> This issue has been known for a while now. Starlink have been
>> innovating their heart out around pretty much everything here - and
>> yet, this bufferbloat issue hasn't changed, despite Dave proposing
>> what appears to be an easy fix compared to a lot of other things they
>> have done. So what are we possibly missing here?
>>
>> Going back to first principles: The purpose of a buffer on a network
>> device is to act as a shock absorber against sudden traffic bursts.
>> If I want to size that buffer correctly, I need to know at the very
>> least (paraphrasing queueing theory here) something about my packet
>> arrival process.
>
> The question is over what timeframe. If you have a huge buffer, you
> can buffer 10s of seconds of traffic and eventually send it. That will
> make benchmarks look good, but not the user experience. The rapid drop
> in RAM prices (beyond merely a free fall) and the benchmark scores
> that heavily penalized any dropped packets encouraged buffers to get
> larger than is sane.
>
> it's still a good question to define what is sane: the longer the
> buffer, the more of a chance of finding time to catch up, but having
> packets in the buffer that have timed out (i.e. DNS queries tend to
> time out after 3 seconds, and TCP will give up and send replacement
> packets, making the initial packets meaningless) is counterproductive.
> What is the acceptable delay to your users?
>
> Here at the bufferbloat project, we tend to say that buffers past a
> few 10s of ms worth of traffic are probably bad, and we are aiming
> for single-digit ms in many cases.
Taken as read.
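Just to put an illustrative number on "sane" (my figures, nothing
measured on Starlink): the standing delay a full buffer adds is simply
its depth divided by the drain rate, so the sanity check is one line of
arithmetic:

# Illustrative only -- buffer size and link rate are assumed values.
def buffer_delay_ms(buffer_bytes: int, link_rate_bps: int) -> float:
    """Standing delay (ms) added by a full FIFO draining at link_rate_bps."""
    return buffer_bytes * 8 / link_rate_bps * 1000

# A 2 MB buffer draining at 20 Mbit/s holds 800 ms of traffic:
print(buffer_delay_ms(2_000_000, 20_000_000))  # -> 800.0

So a buffer of only a couple of MB somewhere on the satellite or
gateway path would be enough to explain RTTs in the hundreds of ms.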
>
>> If I look at conventional routers, then that arrival process involves
>> traffic generated by a user population that changes relatively
>> slowly: WiFi users come and go. One at a time. Computers in a company
>> get turned on and off and rebooted, but there are no instantaneous
>> jumps in load - you don't suddenly have a hundred users in the middle
>> of watching Netflix turning up that weren't there a second ago. Most
>> of what we know about Internet traffic behaviour is based on this
>> sort of network, and this is what we've designed our queuing systems
>> around, right?
>
> not true: for businesses, every hour as meetings start and let out,
> and as people arrive in the morning or arrive back from lunch, you
> have very sharp changes in the traffic.
And herein lies the crunch: All of these things that you list happen
over much longer timeframes than a switch to a different satellite.
Also, folk coming back from lunch would start with something like
cwnd=10. Users whose TCP connections get switched over to a different
satellite by some underlying tunneling protocol could have much larger
cwnd.
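To put rough numbers on that difference (assumed values, not
measurements):

# Assumed values, for illustration only.
MSS = 1460  # bytes, a typical Ethernet-path MSS

def inflight_bytes(cwnd_segments: int) -> int:
    """Data a sender may have in flight at a given congestion window."""
    return cwnd_segments * MSS

print(inflight_bytes(10))    # fresh flow after lunch: ~14.6 KB
print(inflight_bytes(2000))  # mature bulk flow: ~2.9 MB

A fresh flow has to probe its way up from ~15 KB; a mature flow that
gets silently re-routed keeps its window and is entitled to have
megabytes in flight on the new path immediately - and a handover moves
many such flows at once.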
>
> at home you have fewer changes in users, but you also may have less
> bandwidth (although many tech enthusiasts have more bandwidth than
> many companies, two of my last 3 jobs have had <400Mb at their main
> office with hundreds of employees while many people would consider
> that 'slow' for home use). As such a parent arriving home with a
> couple of kids will make a drastic change to the network usage in a
> very short time.
I think you've missed my point - I'm talking about changes in the
network mid-flight, not people coming home and getting started over a
period of a few minutes. The change you see in a handover is sudden,
probably with a sub-second ramp-up. And it's something that doesn't
just happen when people come home or return from lunch - it happens
every few minutes.
>
>
> but the active queueing systems that we are designing (cake, fq_codel)
> handle these conditions very well because they don't try to guess what
> the usage is going to be, they just look at the packets that they have
> to process and figure out how to dispatch them out in the best way.
Understood - I've followed your work.
>
> because we have observed that latency tends to be more noticeable for
> short connections (DNS, checking if cached web pages are up to date,
> etc.), our algorithms give a slight priority to new, low-traffic
> connections over long-running, high-traffic connections rather than
> just splitting the bandwidth evenly across all connections, and can
> even go further to split bandwidth between endpoints, not just
> connections (with endpoints being a configurable definition)
>
> without active queue management, the default is FIFO, which allows the
> high-user-impact, short-connection packets to sit in a queue behind
> the low-user-impact bulk data transfers. For benchmarks,
> a-packet-is-a-packet and they all count, so until you have enough
> buffering that you start having expired packets in flight, it doesn't
> matter, but for the user experience there can be a huge difference.
All understood - you're preaching to the converted (a toy sketch of
that FIFO-vs-flow-queueing contrast follows below). It's just that I
think Starlink may be a different ballpark.
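Here's the sketch - flows and packet counts are made up, it's just the
dequeue-order contrast:

# Toy sketch (hypothetical flows, nothing Starlink-specific): dequeue
# order under plain FIFO vs a simple per-flow round-robin, the basic
# idea behind fq_codel/cake.
from collections import deque

arrivals = [("bulk", i) for i in range(5)] + [("dns", 0)]

fifo = list(arrivals)  # FIFO: the DNS packet waits behind all the bulk

flows = {}             # one queue per flow, serviced round-robin
for pkt in arrivals:
    flows.setdefault(pkt[0], deque()).append(pkt)
rr = []
while any(flows.values()):
    for q in flows.values():
        if q:
            rr.append(q.popleft())

print(fifo.index(("dns", 0)))  # 5 -- last out under FIFO
print(rr.index(("dns", 0)))    # 1 -- near the front under round-robin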
Put another way: if a protocol (TCP) that is designed around the
reasonable expectation that its current cwnd is OK to use for now is
put into a situation where there are relatively frequent, huge, and
lasting step changes in available BDP within sub-second periods, are
your underlying assumptions still valid?
I suspect they're handing over whole cells, not individual users, at a
time.
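A back-of-the-envelope way to frame that question (all numbers assumed,
purely illustrative): when the available BDP steps down and the sender
keeps its old cwnd, the excess in-flight data has nowhere to go but a
queue:

# Back-of-the-envelope, with assumed numbers: excess data queued when a
# handover suddenly shrinks the path's bandwidth-delay product.
def queue_after_step(inflight_bytes: float, new_rate_bps: float,
                     new_rtt_s: float) -> tuple[float, float]:
    """Excess bytes queued, and the time (s) to drain them at the new rate."""
    new_bdp = new_rate_bps / 8 * new_rtt_s
    excess = max(0.0, inflight_bytes - new_bdp)
    return excess, excess * 8 / new_rate_bps

# One sender with 2.5 MB in flight lands on a 50 Mbit/s x 20 ms path:
excess, drain_s = queue_after_step(2_500_000, 50_000_000, 0.020)
print(excess, drain_s)  # ~2.4 MB queued, ~0.38 s to drain

And that's one flow: if a whole cell is handed over at once, every
mature flow in it contributes its own excess to the same queue at the
same instant.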
>
> David Lang
>
--
****************************************************************
Dr. Ulrich Speidel
School of Computer Science
Room 303S.594 (City Campus)
The University of Auckland
u.speidel at auckland.ac.nz
http://www.cs.auckland.ac.nz/~ulrich/
****************************************************************