[Ecn-sane] [bbr-dev] duplicating the BBRv2 tests at iccrg in flent?

Dave Taht dave.taht at gmail.com
Fri Apr 5 11:51:03 EDT 2019


Thanks!

On Fri, Apr 5, 2019 at 5:11 PM Neal Cardwell <ncardwell at google.com> wrote:
>
> On Fri, Apr 5, 2019 at 3:42 AM Dave Taht <dave.taht at gmail.com> wrote:
>>
>> I see from the iccrg preso at 7 minutes 55 s in, that there is a test
>> described as:
>>
>> 20 BBRv2 flows
>> starting each 100ms, 1G, 1ms
>> Linux codel with ECN ce_threshold at 242us sojourn time.
>
>
> Hi, Dave! Thanks for your e-mail.

I have added you to ecn-sane's allowed sender filters.

>
>>
>> I interpret this as
>>
>> 20 flows, starting 100ms apart
>> on a 1G link
>> with a 1ms transit time
>> and linux codel with ce_threshold 242us
>
>
> Yes, except the 1ms is end-to-end two-way propagation time.
>
>>
>> 0) This is iperf? There is no crypto?
>
>
> Each flow is a netperf TCP stream, with no crypto.

OK. I do wish netperf had a tls mode.
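
To duplicate that in flent (per the subject line) I'm imagining
something roughly like the following - just a sketch, with the
netserver hostname as a placeholder, and I don't think flent can
stagger the 20 flow starts by 100ms out of the box, so that part is
only approximate:

  # assumes the sender kernel has BBRv2 built and selected as the
  # default congestion control, and a netserver running on the far host
  flent tcp_nup -H netperf.example.com -l 60 \
        --test-parameter upload_streams=20 \
        --socket-stats -t bbr2-codel-ce_threshold-242us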

>
>>
>>
>> 1) "sojourn time" not as as setting the codel target to 242us?
>>
>> I tend to mentally tie the concept of sojourn time to the target
>> variable, not ce_threshold
>
>
> Right. I didn't mean setting the codel target to 242us. Where the slide says "Linux codel with ECN ce_threshold at 242us sojourn time" I literally mean a Linux machine with a codel qdisc configured as:
>
>   codel ce_threshold 242us
>
> This is using the ce_threshold feature added in:
>   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=80ba92fa1a92dea1
>
> ... for which the commit message says:
>
> "A DCTCP enabled egress port simply have a queue occupancy threshold
> above which ECT packets get CE mark. In codel language this translates to a sojourn time, so that one doesn't have to worry about bytes or bandwidth but delays."

I had attempted to discuss deprecating this option back in August on
the codel list:

https://lists.bufferbloat.net/pipermail/codel/2018-August/002367.html

As well as changing a few other core features. I put most of what I
discussed there into https://github.com/dtaht/fq_codel_fast, which I
was using for comparisons for the upcoming cake paper, and which is
now also where the first cut at the SCE work resides.
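
For anyone else wanting to replicate your marking setup as-is, the
bottleneck egress config should amount to something like this (a
sketch; the interface name is a placeholder):

  # 242us is ~20 MTU-sized packets at 1Gbit/s: 20 * 1514 * 8 / 10^9
  tc qdisc replace dev eth1 root codel ce_threshold 242us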

> The 242us comes from the serialization delay for 20 packets at 1Gbps.

I thought it was more because of how hard it is to get an accurate
measurement below about 500us. In our early work on attempting to use
virtualization, things like Xen would frequently jitter scheduling by
10-20ms or more. While that situation has gotten much better, I still
tend to prefer "bare metal" when working on this stuff - and often
"weak" bare metal, like the mips processors we mostly use in the
cerowrt project.

Even then I get nervous below 500us unless it's an r/t kernel.

A while back I used irtt to profile this underlying packet +
scheduling jitter, from 2ms down to 10us, on various virtual machine
fabrics (google cloud, aws, linode), but I never got around to
publishing the work. I guess I should go pull those numbers out...
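
If anyone wants to repeat that sort of jitter profiling, it doesn't
take much - roughly this, with the interval, duration and hostname
all as placeholders:

  # on the far end
  irtt server

  # on the near end: probe every 10ms for a minute, keep json for later
  irtt client -i 10ms -d 1m -o jitter.json irtt.example.com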

>
>> 2) In our current SCE work we have repurposed ce_threshold to do sce
>> instead (to save on cpu and also to make it possible to fiddle without
>> making a userspace api change). Should we instead create a separate
>> sce_threshold option to allow for backward compatible usage?
>
>
> Yes, you would need to maintain the semantics of ce_threshold for backwards compatibility for users who are relying on the current semantics. IMHO your suggestion to use a separate sce_threshold sounds like the way to go, if adding SCE to qdiscs in Linux.
>
>>
>> 3) Transit time on your typical 1G link is actually 13us for a big
>> packet, why 1ms?
>
>
> The 1ms is the path two-way propagation delay ("min RTT"). We run a range of RTTs in our tests, and the graph happens to be for an RTT of 1ms.
>
OK.

>>
>> is that 1ms from netem?
>
>
> Yes.
>
>>
>> 4) What is the topology here?
>>
>> host -> qdisc -> wire -> host?
>>
>> host -> qdisc -> wire -> router -> host?
>
>
> Those two won't work with Linux TCP, because putting the qdisc on the sender pulls the qdisc delays inside the TSQ control loop, giving a behavior very different from reality (even CUBIC won't bloat if the network emulation qdiscs are on the sender host).
>
> What we use for our testing is:
>
>   host -> wire -> qdiscs -> host
>
> Where "qdiscs" includes netem and whatever AQM is in use, if any.

Normally, how I do the "qdiscs" box is that I call it a "router" :)
and then the qdiscs usually look like this:

eth0 -> netem -> aqm_alg -> eth1
eth0 <- aqm_alg <- netem <- eth1

using ifb for the inbound management.
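
Concretely, on the router that works out to something like the
following - a sketch, where the interface names, the ifb device, and
splitting the 1ms path delay as 500us each way are all placeholders
for whatever the real testbed does:

  # forward path out eth1: netem for the propagation delay, with the
  # AQM under test hung off it as the child qdisc
  tc qdisc replace dev eth1 root handle 1: netem delay 500us
  tc qdisc add dev eth1 parent 1:1 handle 10: codel ce_threshold 242us

  # return path: redirect eth1 ingress onto an ifb and hang the same
  # netem + AQM combination there
  ip link add ifb0 type ifb
  ip link set ifb0 up
  tc qdisc add dev eth1 handle ffff: ingress
  tc filter add dev eth1 parent ffff: protocol all prio 10 u32 \
        match u32 0 0 action mirred egress redirect dev ifb0
  tc qdisc replace dev ifb0 root handle 1: netem delay 500us
  tc qdisc add dev ifb0 parent 1:1 handle 10: codel ce_threshold 242us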

I didn't get to where I trusted netem to do this right until about a
year ago; up until that point I had always also used a separate
"delay" box.

Was GRO/GSO enabled on the router? host? server?
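
For reference, checking (and, if need be, disabling) those on any of
the boxes is just something like this - interface name as a
placeholder:

  # see which offloads are currently enabled
  ethtool -k eth1 | egrep 'segmentation-offload|receive-offload'

  # turn GSO/GRO/TSO off for the duration of the test
  ethtool -K eth1 gso off gro off tso off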

>
>>
>> 5) What was the result with fq_codel instead?
>
>
> With fq_codel and the same ECN marking threshold (fq_codel ce_threshold 242us), we see slightly smoother fairness properties (not surprising) but with slightly higher latency.
>
> The basic summary:
>
> retransmits: 0
> flow throughput: [46.77 .. 51.48]
> RTT samples at various percentiles:
>   %   | RTT (ms)
> ------+---------
>    0    1.009
>   50    1.334
>   60    1.416
>   70    1.493
>   80    1.569
>   90    1.655
>   95    1.725
>   99    1.902
>   99.9  2.328
>  100    6.414

This is lovely. Is there an open source tool you are using to generate
this from the packet capture? From wireshark? Or is this from sampling
the TCP_INFO parameter of netperf?
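
The way I'd grab something comparable without a full capture is to
just poll the kernel's tcp_info via ss for the length of the test -
a rough sketch, with the destination address as a placeholder, and
note this only yields the smoothed rtt/rttvar, not per-ack samples
like a pcap would:

  # sample srtt/rttvar (ms) roughly every 100ms for the test flows;
  # ctrl-c it when the test ends
  while sleep 0.1; do
      ss -tin dst 192.0.2.1 | grep -o ' rtt:[0-9./]*'
  done > rtt_samples.txt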

>
> Bandwidth share graphs are attached. (Hopefully the graphs will make it through various lists; if not, you can check the bbr-dev group thread.)
>
> best,
> neal
>


-- 

Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-205-9740

