Sorry I didn't engage with this, folks. It probably came across as rude, but I've had a large and unexpected career shift ongoing (https://twitter.com/stub_AS/status/1469283183132876809?s=20) and didn't feel up to it, especially as I'm largely abandoning my research along these lines due to these developments. In any case, I have a lot of respect for you folks educating everyone on latency and bufferbloat, and I have been following Dave (Taht)'s great work in the space for a while.

Best,
Ankit

On Jul 19, 2021, at 17:50, George Burdell wrote:

On Sat, Jul 10, 2021 at 01:27:28PM -0700, David Lang wrote:
> any buffer sizing based on the number of packets is wrong. Base your
> buffer size on transmit time and you have a chance of being reasonable.

This is very true. Packets have a dynamic range of 64 bytes to 64k (GRO), and sizing queues in terms of packets leads to bad behavior, particularly on mixed up- and downstream traffic.

Also... people doing AQM and TCP designs tend to almost always test one-way traffic only, and this leads to less than desirable behavior on real-world traffic. Strike that: terrible behavior! A pure single-queue AQM struggles mightily to find a good hit rate when there are a ton of ACKs, DNS, gaming, VoIP, etc., mixed in with the capacity-seeking flows. Nearly every AQM paper you read never tests real, bidirectional traffic. It's a huge blind spot, which is why the bufferbloat effort *starts* with the rrul test and related tests on any new idea we have. bfifos are better, but harder to implement in hardware.

A fun trick: if you are trying to optimize your traffic for real-time communications rather than for a speedtest, you can clamp your TCP MSS to smaller than 600 bytes *at the router*, and your network gets better.
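The MSS-clamping trick works because a FIFO sized in packets holds less *delay* when each packet is smaller. A minimal sketch of the arithmetic, assuming a hypothetical 20 Mbit/s uplink and a 1000-packet queue (the function name and all the numbers are illustrative, not measurements from this thread):

```python
# Illustrative arithmetic: a FIFO counted in packets drains faster when
# packets are smaller, so clamping TCP MSS at the router shrinks the
# worst-case queue delay. All numbers below are assumptions.

def fifo_delay_ms(queue_packets, packet_bytes, uplink_mbps):
    """Worst-case drain time of a packet-counted FIFO, in milliseconds."""
    bits = queue_packets * packet_bytes * 8
    return bits / (uplink_mbps * 1e6) * 1e3

# Hypothetical 20 Mbit/s upstream with a 1000-packet FIFO:
full_mtu = fifo_delay_ms(1000, 1500, 20)   # full-size packets
clamped  = fifo_delay_ms(1000, 600, 20)    # MSS clamped near 600 bytes
print(f"1500-byte packets: {full_mtu:.0f} ms, 600-byte packets: {clamped:.0f} ms")
```

With those assumed numbers the same packet-counted queue goes from roughly 600 ms of potential bloat to roughly 240 ms, which is the effect being described.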
(We really should get around to publishing something on that: when you are plagued by a giant upstream FIFO, filling it with smaller packets really helps, and it's something a smart user could easily do regardless of the ISP's preferences.)

> In cases like wifi where packets aren't sent individually, but are sent
> in blobs of packets going to the same destination, yes... you want to
> buffer at least a blob's worth of packets to each destination so that
> when your transmit slot comes up, you can maximize it.

Nooooooo! This is one of those harder tradeoffs that is pretty counterintuitive. You want per-station queuing, yes. However, the decision as to how much service time to grant each station is absolutely not about maximizing the transmit slot, but about maximizing the number of stations you can serve in reasonable time.

Simple (and inaccurate) example: 100 stations at a 4 ms txop each, stuffed full of *udp* data, is 400 ms per round (plus usually insane numbers of retries). This breaks a lot of things and doesn't respect the closely coupled nature of TCP (please re-read the codel paper!). Cutting the txop in this case to 1 ms cuts inter-station service time... at the cost of "bandwidth" that can't be stuffed into the slow-header-plus-wifi-data-rate equation. But what you really want to do is give the sparsest stations quicker access to the media so they can ramp up to parity (and usually complete their short flows much faster, and then get off).

I run with BE 2.4 ms txops and announce the same in the beacon. I'd be willing to bet your scale conference network would work much better if you did that also.
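The service-round arithmetic above can be sketched directly (the station count and txop sizes are the example's own numbers; the function is just an illustration of the tradeoff, not a scheduler):

```python
# Sketch of the inter-station service-time tradeoff: with N stations
# each granted a full txop, the time before a station is served again
# grows linearly with the txop size.

def round_time_ms(stations, txop_ms):
    """Time for one full service round if every station uses a full txop."""
    return stations * txop_ms

print(round_time_ms(100, 4.0))   # 4 ms txops: 400 ms per round
print(round_time_ms(100, 1.0))   # 1 ms txops: 100 ms per round
print(round_time_ms(100, 2.4))   # the 2.4 ms BE txop mentioned above
```

The tension is that each shorter txop pays the fixed low-rate header cost more often, so the win in inter-station latency comes out of aggregate throughput.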
(It would be better if we could scale txop size to the load, but fq_codel on wifi already does the sparse-station optimization, which translates into many shorter txops than you would see from other wifi schedulers, and the bulk of the problem I see is the *stations*.)

Lastly, you need to defer constructing the blob as long as possible, so you can shoot at, mark, or reschedule (FQ) the packets in there at the last moment before they are committed to the hardware. Ideally you would not construct any blob at all until a few microseconds before the transmit opportunity.

Shifting this back to Starlink: they have a marvelous opportunity to do just this in the dishy, as they are half duplex and could defer grabbing the packets from a sch_cake buffer until precisely before that txop to the sat arrives (my guess would be no more than 400 us, based on what I understand of the ARM chip they are using). This would be much better than what we could do in the ath9k, where we were forced to always have "one in the hardware, one ready to go" due to limitations in that chip. We're making some progress on the openwifi FPGA here, btw...

> Wifi has the added issue that the blob headers are at a much lower data
> rate than the data itself, so you can cram a LOT of data into a blob
> without making a significant difference in the airtime used. So you
> really do want to be able to send full blobs (not at the cost of
> delaying transmission if you don't have a full blob, a mistake some
> people make, but you do want to buffer enough to fill the blobs). And
> given that dropped packets result in timeouts and retransmissions that
> affect the rest of the network, it's not obviously wrong for a lossy
> hop like wifi to retry a failed transmission; it just needs to not
> retry too many times.
>
> David Lang
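The low-rate-header effect described above can be made concrete with a toy airtime model. The 40 us header time and 600 Mbit/s data rate below are assumed round numbers for illustration, not figures from this thread:

```python
# Rough airtime model for an aggregated wifi transmission: the PHY
# preamble/header goes out at a slow legacy rate (modeled as a fixed
# cost) while the aggregated payload rides at the full data rate, so a
# large blob costs only a little more airtime than a small one.
# header_us and data_rate_mbps are assumed illustrative values.

def airtime_us(payload_bytes, header_us=40.0, data_rate_mbps=600.0):
    """Fixed low-rate header time plus payload serialization time."""
    return header_us + payload_bytes * 8 / data_rate_mbps

small = airtime_us(1500)     # one MTU-sized packet
big   = airtime_us(60000)    # a 40x larger aggregate
print(f"{small:.0f} us vs {big:.0f} us for 40x the data")
```

Under these assumptions the 40x larger blob takes only about 14x the airtime, which is why filling blobs is worth it, while delaying transmission just to fill one is not.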
On Sat, 10 Jul 2021, Rodney W. Grimes wrote:
To: Dave Taht
Cc: starlink@lists.bufferbloat.net, Ankit Singla, Sam Kumar
Subject: Re: [Starlink] SatNetLab: A call to arms for the next global Internet testbed

While it is good to have a call to arms like this: ... much information removed, as I only want to reply to one very narrow but, IMHO, very real problem in our networks today ...

Here's another piece of pre-history (alohanet): the TTL field was the "time to live" field. The intent was that the packet would indicate how much time it would be valid before it was discarded. It didn't work out, and it was replaced by hop count, which of course switched networks ignore, and which is only semi-useful for detecting loops and the like.

TTL works perfectly fine where the original assumptions hold: that a device along a network path only hangs on to a packet for a reasonably short duration, and that there is not some "retry" mechanism in place causing this time to explode. BSD, and as far as I can recall almost ALL original IP stacks, had a queue depth limit of 50 packets on egress interfaces. Everything pretty much worked well and the net was happy. Then these base assumptions got blasted in the name of "measurable bandwidth" and the notion that packets are so precious we must not lose them, at almost any cost. Linux crammed the per-interface queue up to 1000, and wifi decided it was reasonable to retry at the link layer so many times that I have seen packets that are >60 seconds old.

Proposed FIX: any device that transmits packets and does not already have an inherent FIXED transmission time MUST consider the current TTL of that packet and give up if more than 10 ms * TTL elapses while it is trying to transmit. AND change the default interface queue size in Linux to 50 for fifo; the codel etc. AQM stuff is fine at 1000, as it has delay targets that prevent the issue that initially bumping the queue to 1000 caused.

... end of Rod's Rant ...
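The proposed give-up rule above can be sketched in a few lines. The 10 ms-per-TTL budget is Rod's number; the function and its arguments are a hypothetical illustration, not an actual stack implementation:

```python
# Minimal sketch of the proposed TTL-derived transmission deadline:
# a device without a fixed transmission time drops a packet once it
# has spent more than 10 ms * TTL trying to transmit it.
import time

MS_PER_TTL = 0.010  # Rod's proposed budget: 10 ms of trying per TTL unit

def should_drop(enqueue_time, ttl, now=None):
    """True if a packet has waited longer than its TTL-derived budget."""
    now = time.monotonic() if now is None else now
    return (now - enqueue_time) > MS_PER_TTL * ttl

# A TTL-64 packet gets a 640 ms budget under this rule:
print(should_drop(enqueue_time=0.0, ttl=64, now=0.5))   # False: 500 ms < 640 ms
print(should_drop(enqueue_time=0.0, ttl=64, now=0.7))   # True: 700 ms > 640 ms
```

Note how this directly rules out the >60-second-old wifi packets mentioned above: no TTL value gives a packet more than about 2.5 seconds of retry budget.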
--
Rod Grimes                                                 rgrimes@freebsd.org

_______________________________________________
Starlink mailing list
Starlink@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/starlink