From: Dave Taht <dave.taht@gmail.com>
To: Neal Cardwell <ncardwell@google.com>
Cc: ECN-Sane <ecn-sane@lists.bufferbloat.net>,
	 BBR Development <bbr-dev@googlegroups.com>,
	flent-users <flent-users@flent.org>
Subject: Re: [Ecn-sane] [bbr-dev] duplicating the BBRv2 tests at iccrg in flent?
Date: Fri, 5 Apr 2019 17:51:03 +0200	[thread overview]
Message-ID: <CAA93jw6HUzjq1Rk9OsqXRuWze3tzXfhz1p3promaD_Zx1Xbdbw@mail.gmail.com> (raw)
In-Reply-To: <CADVnQy=DfST=dHFkZg9EeQRL0OH9HOqgRfJ0uWgUu4fBLD9tSA@mail.gmail.com>

Thanks!

On Fri, Apr 5, 2019 at 5:11 PM Neal Cardwell <ncardwell@google.com> wrote:
>
> On Fri, Apr 5, 2019 at 3:42 AM Dave Taht <dave.taht@gmail.com> wrote:
>>
>> I see from the iccrg preso at 7 minutes 55 s in, that there is a test
>> described as:
>>
>> 20 BBRv2 flows
>> starting each 100ms, 1G, 1ms
>> Linux codel with ECN ce_threshold at 242us sojourn time.
>
>
> Hi, Dave! Thanks for your e-mail.

I have added you to ecn-sane's allowed sender filters.

>
>>
>> I interpret this as
>>
>> 20 flows, starting 100ms apart
>> on a 1G link
>> with a 1ms transit time
>> and Linux codel with ce_threshold 242us
>
>
> Yes, except the 1ms is end-to-end two-way propagation time.
>
>>
>> 0) This is iperf? There is no crypto?
>
>
> Each flow is a netperf TCP stream, with no crypto.

OK. I do wish netperf had a TLS mode.
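
For the record, my mental model of the load generation here is a
shell loop along these lines (an untested sketch; the hostname and
test length are invented):

  # 20 netperf TCP_STREAM flows, started 100ms apart
  for i in $(seq 1 20); do
      netperf -H receiver.example.net -t TCP_STREAM -l 60 &
      sleep 0.1
  done
  wait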

>
>>
>>
>> 1) "sojourn time" not as as setting the codel target to 242us?
>>
>> I tend to mentally tie the concept of sojourn time to the target
>> variable, not ce_threshold
>
>
> Right. I didn't mean setting the codel target to 242us. Where the slide says "Linux codel with ECN ce_threshold at 242us sojourn time" I literally mean a Linux machine with a codel qdisc configured as:
>
>   codel ce_threshold 242us
>
> This is using the ce_threshold feature added in:
>   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=80ba92fa1a92dea1
>
> ... for which the commit message says:
>
> "A DCTCP enabled egress port simply have a queue occupancy threshold
> above which ECT packets get CE mark. In codel language this translates to a sojourn time, so that one doesn't have to worry about bytes or bandwidth but delays."

I had attempted to discuss deprecating this option back in August on
the codel list:

https://lists.bufferbloat.net/pipermail/codel/2018-August/002367.html

as well as changing a few other core features. I put most of what I
discussed there into https://github.com/dtaht/fq_codel_fast, which I
was using for comparisons for the upcoming cake paper, and which is
now also where the first cut of the SCE work resides.

> The 242us comes from the serialization delay for 20 packets at 1Gbps.

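(The arithmetic checks out: 20 packets * 1514 bytes * 8 bits/byte at
1Gbit/s is ~242us.)
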
I thought it was more because of how hard it is to get an accurate
measurement below ~500us. In our early attempts at virtualization,
things like Xen would frequently jitter scheduling by 10-20ms or
more. While that situation has gotten much better, I still tend to
prefer "bare metal" when working on this stuff - and often "weak"
bare metal, like the MIPS processors we mostly use in the CeroWrt
project.

Even then I get nervous below 500us unless it's an r/t kernel.

A while back I used irtt to profile this underlying packet +
scheduling jitter, which ranged from 2ms down to 10us, on various
virtual machine fabrics (Google Cloud, AWS, Linode), but I never got
around to publishing the work. I guess I should go pull those
numbers out...
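
That jitter profiling is easy enough to repeat with irtt; roughly
(the hostname is invented, and I'm quoting the flags from memory):

  # on the far end
  irtt server

  # on the near end: one probe every 1ms for 60s; irtt then prints
  # delay and IPDV (jitter) statistics for both directions
  irtt client -i 1ms -d 60s server.example.net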

>
>> 2) In our current SCE work we have repurposed ce_threshold to do SCE
>> instead (to save on CPU and also to make it possible to fiddle without
>> making a userspace API change). Should we instead create a separate
>> sce_threshold option to allow for backward-compatible usage?
>
>
> Yes, you would need to maintain the semantics of ce_threshold for backwards compatibility for users who are relying on the current semantics. IMHO your suggestion to use a separate sce_threshold sounds like the way to go, if adding SCE to qdiscs in Linux.
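
For concreteness, the backward-compatible split would keep today's
knob and add a new one beside it, something like this (sce_threshold
is entirely hypothetical at this point):

  # existing semantics: CE-mark ECT packets above the sojourn threshold
  tc qdisc replace dev eth0 root codel ce_threshold 242us

  # hypothetical new knob: SCE-mark instead, leaving ce_threshold alone
  tc qdisc replace dev eth0 root codel sce_threshold 242us
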
>
>>
>> 3) Transit time on your typical 1G link is actually ~13us for a big
>> packet; why 1ms?
>
>
> The 1ms is the path two-way propagation delay ("min RTT"). We run a range of RTTs in our tests, and the graph happens to be for an RTT of 1ms.
>
OK.

>>
>> is that 1ms from netem?
>
>
> Yes.
>
>>
>> 4) What is the topology here?
>>
>> host -> qdisc -> wire -> host?
>>
>> host -> qdisc -> wire -> router -> host?
>
>
> Those two won't work with Linux TCP, because putting the qdisc on the sender pulls the qdisc delays inside the TSQ control loop, giving a behavior very different from reality (even CUBIC won't bloat if the network emulation qdiscs are on the sender host).
>
> What we use for our testing is:
>
>   host -> wire -> qdiscs -> host
>
> Where "qdiscs" includes netem and whatever AQM is in use, if any.

Normally, how I do the "qdiscs" is that I call it a "router" :) and
then the qdiscs usually look like this:

eth0 -> netem -> aqm_alg -> eth1
eth0 <- aqm_alg <- netem <- eth1

using ifb for the inbound management.
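
In tc terms that usually amounts to something like this on the router
(a from-memory sketch, not a tested script; interface names, handles,
and delays are illustrative):

  # forward path, egress on eth1: half the 1ms two-way delay, then
  # the AQM as netem's child
  tc qdisc add dev eth1 root handle 1: netem delay 500us
  tc qdisc add dev eth1 parent 1:1 handle 10: codel ce_threshold 242us

  # reverse path, managed on an ifb fed by ingress redirection
  ip link add ifb0 type ifb
  ip link set ifb0 up
  tc qdisc add dev eth1 handle ffff: ingress
  tc filter add dev eth1 parent ffff: protocol all matchall \
      action mirred egress redirect dev ifb0
  tc qdisc add dev ifb0 root handle 1: netem delay 500us
  tc qdisc add dev ifb0 parent 1:1 handle 10: codel ce_threshold 242us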

I didn't get to where I trusted netem to do this right until about a
year ago; up until that point I had always used a separate "delay"
box as well.

Was GRO/GSO enabled on the router? host? server?
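
(Concretely, the knobs I mean are the per-interface offloads toggled
with ethtool, e.g.:

  ethtool -K eth1 gro off gso off tso off

on the router and on each end host.)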

>
>>
>> 5) What was the result with fq_codel instead?
>
>
> With fq_codel and the same ECN marking threshold (fq_codel ce_threshold 242us), we see slightly smoother fairness properties (not surprising) but with slightly higher latency.
>
> The basic summary:
>
> retransmits: 0
> flow throughput: [46.77 .. 51.48]
> RTT samples at various percentiles:
>   %   | RTT (ms)
> ------+---------
>    0    1.009
>   50    1.334
>   60    1.416
>   70    1.493
>   80    1.569
>   90    1.655
>   95    1.725
>   99    1.902
>   99.9  2.328
>  100    6.414

This is lovely. Is there an open source tool you are using to
generate this from the packet capture? From Wireshark? Or is this
from sampling netperf's TCP_INFO output?
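
My own crude approach to percentiles like these is to poll ss on the
sender and post-process the samples, along the lines of (the filter
address is a placeholder):

  while sleep 0.1; do
      ss -tin dst 192.0.2.1 | grep -oE ' rtt:[0-9.]+/[0-9.]+'
  done >> rtt_samples.txt

though that caps the sample rate well below per-packet resolution.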

>
> Bandwidth share graphs are attached. (Hopefully the graphs will make it through various lists; if not, you can check the bbr-dev group thread.)
>
> best,
> neal
>


-- 

Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-205-9740
