[Starlink] Natural Internet Traffic matters more than the network architects want to think about

David P. Reed dpreed at deepplum.com
Sat Jul 30 17:12:01 EDT 2022


There's been a good discussion here triggered by my comments on "avoiding queueing delay" being a "good operating point" of the Internet. I have always thought that the research community interested in bettering networks ought to invest its time in modeling the "real needs" of "real people" rather than in creating yet another queueing model that misses what matters.
 
Queueing theory is a pretty nice foundation, but it isn't sufficient unto itself, because it is full of toy examples and narrowly relevant theorems. That's not a complaint about queueing theory itself, but about assuming it provides the answers in and of itself.
 
There's a related set of applications of queueing theory that suffers from similar misuse - one I encountered long before network packet switching - and that's computer OS scheduling of processes. There is no valid reason to believe that the load presented to a computer timesharing system can be modeled as Poisson arrivals. But in the 1970's, OS research was full of claims about optimizing timesharing schedulers that assumed each user was a Poisson process. I worked on a real timesharing system with real users, and we worked on its scheduler to make the system usable. (It was a 70-user Multics system with two processors and 784K 36-bit words of memory, paged from a Librafile "drum" that had a 16 msec. latency.)
 
I mention this because it became VERY obvious that NONE of the queueing theory theorems could tell us how to schedule user processes on that system. And the reason was simple - humans are NOT Poisson processes, and neither are algorithms.
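
To make that concrete, here's a tiny single-server simulation (Python; the load numbers are made up for illustration, not measured from any real system). It feeds the same average load in two ways - Poisson arrivals and heavy-tailed "bursty" arrivals - and the bursty case has far worse waiting-time tails even though the "average utilization" is identical:

    # Purely illustrative: same average load, very different queueing delay.
    import random

    def simulate(interarrival, n=200_000, service=1.0, seed=1):
        rng = random.Random(seed)
        t = 0.0          # arrival clock
        free_at = 0.0    # when the single server next becomes free
        waits = []
        for _ in range(n):
            t += interarrival(rng)
            start = max(t, free_at)
            waits.append(start - t)      # time spent waiting in the queue
            free_at = start + service
        waits.sort()
        return sum(waits) / n, waits[int(0.99 * n)]

    mean_gap = 2.0   # mean inter-arrival = 2x service time -> 50% utilization
    alpha = 1.5      # heavy-tailed shape: same mean, wildly higher variance

    poisson = lambda rng: rng.expovariate(1.0 / mean_gap)
    bursty  = lambda rng: rng.paretovariate(alpha) * mean_gap * (alpha - 1) / alpha

    for name, gen in [("Poisson", poisson), ("bursty", bursty)]:
        mean_w, p99_w = simulate(gen)
        print(f"{name:8s} mean wait {mean_w:6.2f}   99th-percentile wait {p99_w:6.2f}")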
 
If you look at the whole Internet today, the primary load is the WWW, and now "conversational AV" [Zoom]. Not telephony, not video streaming, not voice.
And if you look at what happens when ANY user clicks on a link - and often even when some program running in the browser makes a decision at some scheduled instant - a cascade of highly correlated events happens at unpredictable parts of the whole system.
 
This cascade of events doesn't "average out" or "smooth". In fact, what happens actually changes the users' behavior in the future - a feedback loop that is really hard to predict in advance.
Also, the event cascade is highly parallelized across endpoints. A typical landing page on the web launches near-simultaneous requests for up to 100 different sets of data (cookies, advertising, etc.).
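
A back-of-envelope sketch of what that fan-out does to a bottleneck (Python; the 100-request count comes from the paragraph above, while the 10 KB response size and 100 Mbit/s link are assumptions for illustration):

    # One click fans out ~100 near-simultaneous responses toward the user.
    # Even if the long-run utilization of the bottleneck looks small, the
    # instantaneous burst builds a queue, and the LAST response is what the
    # user actually waits for.
    link_Bps   = 100e6 / 8        # assumed 100 Mbit/s bottleneck, in bytes/sec
    n_requests = 100
    resp_bytes = 10_000           # assumed ~10 KB per response

    burst_bytes = n_requests * resp_bytes
    drain_ms    = burst_bytes / link_Bps * 1000      # queue drain time
    util_1s     = burst_bytes / link_Bps / 1.0       # "utilization" averaged over 1 s

    print(f"burst size: {burst_bytes/1e6:.1f} MB")
    print(f"queueing delay seen by the last response: {drain_ms:.0f} ms")
    print(f"one-second average utilization: {util_1s:.0%}")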
 
This cascade is also getting much worse, because Google has decided to invent a non-TCP protocol called QUIC, which uses UDP packets for all kinds of concurrent activity.
 
Statistically, this traffic is incredibly bursty (despite what Comcast researchers might share with academics).
Moreover, response time involves latencies that add up, because many packets can be emitted only after earlier packets have made many RTTs over the Internet.
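
Rough arithmetic on that accumulation (Python; the 60 ms RTT, the 40 ms of queueing delay, and the round-trip counts are assumptions, roughly a classic HTTPS page fetch):

    # User-perceived latency is a SUM of serialized round trips, so any
    # queueing delay added to the RTT gets multiplied several times over.
    rtt_ms   = 60      # assumed base round-trip time
    queue_ms = 40      # assumed extra queueing delay at a congested bottleneck

    serialized_rtts = {
        "DNS lookup":        1,
        "TCP handshake":     1,
        "TLS handshake":     2,   # TLS 1.2-style; TLS 1.3 would need only 1
        "HTML request":      1,
        "dependent fetches": 3,   # scripts/CSS discovered only after the HTML arrives
    }

    n = sum(serialized_rtts.values())
    print(f"{n} serialized RTTs")
    print(f"idle network:  {n * rtt_ms} ms before the page is usable")
    print(f"with queueing: {n * (rtt_ms + queue_ms)} ms")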
 
This is the actual reality of the Internet (or any corporate Intranet). 
 
Yet, every user-driven action (retrieving a file of any size, or firing a bullet at an "enemy") wants to have predictable, very low response time overall - AS SEEN BY THE USERS.
 
What to do?
 
Well, in fact, a lot can be done. You can't tell the designers of applications (or whatever) to redesign their applications to be "less bursty" at a bottleneck link. Instead, what you do is make it so there are never any substantive queues anywhere in the network, 99.99% of the time. (Notice I talk about "availability" here, and not "goodput" - 4 nines of availability is 99.99%, which is OK; 5 nines, 99.999%, is great.)
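
For a sense of scale, here is what those availability targets mean in wall-clock terms (simple arithmetic, nothing system-specific assumed):

    # How much time per day each target leaves for "substantive queues" to exist.
    seconds_per_day = 24 * 3600
    for label, frac in [("4 nines (99.99%)", 0.9999), ("5 nines (99.999%)", 0.99999)]:
        allowed = seconds_per_day * (1 - frac)
        print(f"{label}: queues may build for at most {allowed:.1f} s/day")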
 
How is this done in a related but simpler environment? Well, I happen to be familiar with high-performance computing architecture in the data center - which also has networks in it. They are called "buses" - the memory buses, the I/O buses, and so on. Ask yourself what the "average utilization" of the memory bus connecting the CPU to DRAM actually is. It's well under 10%. Well under. That is achieved by caching DRAM contents in the CPU, to get a very high hit ratio in the L3 cache. But the L3 cache isn't fast enough to feed 6-10 cores, so each group of cores has an L2 cache that caches what is in L3, to get a pretty high hit ratio, and so on. And the L1 cache can respond to CPU demand much of the time; even that path, however, is too slow for the rate at which the processor can do adds, tests, etc., which is why each core runs multiple hyperthreads concurrently - to cover the stalls.
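
To put rough numbers on that, a toy model (Python; the hit ratios, core demand, and DRAM bandwidth are illustrative assumptions, not measurements of any real part):

    # Each cache level absorbs most of the requests, so only a small fraction
    # of the cores' demand ever reaches the DRAM bus.
    core_demand_GBps = 200.0      # assumed aggregate load/store demand from all cores
    dram_bus_GBps    = 50.0       # assumed DRAM channel bandwidth
    hit_ratio        = {"L1": 0.95, "L2": 0.80, "L3": 0.75}   # assumed per level

    traffic = core_demand_GBps
    for level, h in hit_ratio.items():
        traffic *= (1 - h)        # only the misses proceed to the next level
        print(f"traffic past {level}: {traffic:6.1f} GB/s")

    print(f"DRAM bus utilization: {traffic / dram_bus_GBps:.0%}")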
 
The insight common to the Internet and to high-performance computing bus architectures is that *you can't design application loads to fit the network architecture*. Instead, you create very flexible, overprovisioned switching frameworks, and you don't spend your time on what happens when one link is saturated - because when you are at that operating point, your design has FAILED.
 
The Hotrodder Fallacy infects some of computer architecture, too. The Stream benchmark gives you a throughput number from a test of raw memory-channel bandwidth - a completely useless benchmark. But in fact, no one buys systems or builds applications as if the Stream number matters very much.
 
Sadly, just as in high-performance computer architecture, it is very, very difficult to create a simple model of the workload on a complex, highly concurrent system.
 
And you can't get "data" that would let you design without looking at the real world - any measurement is corrupted by the limitations of a particular architecture running a particular workload at a particular time.
 
So, what I would suggest is taking a really good look at the real world.
 
No one is setting up any observation mechanisms for the potentially serious issues that will pop up when QUIC becomes the replacement for HTTP/1.1 and HTTP/2. Yet that is coming. They claim they have "congestion control". I have reviewed the spec. There are no simulations, no measurements, no thought about "time constants". There are no "flows", so "flow fairness" won't help balance and stabilize traffic among independent users.
 
But since we won't even have a way to LOOK at the behavior of the rate-limiting edges of the network with QUIC, we will be left wondering what the hell happened.
 
This is exactly what I think is happening with Starlink, too. There are no probe points that can show where the queueing delay is in the Starlink design. In fairness, they have been racing to get a product out for Musk to brag about, which encourages using the Hotrodder Fallacy for measurements instead of real Internet use.
 