Let's make wifi fast again!
* [Make-wifi-fast] emulating wifi better - coupling qdiscs in netem?
@ 2018-05-30 17:28 Dave Taht
  2018-05-30 18:32 ` Bob McMahon
  2018-05-30 22:57 ` dpreed
  0 siblings, 2 replies; 23+ messages in thread
From: Dave Taht @ 2018-05-30 17:28 UTC (permalink / raw)
  To: make-wifi-fast

The match to reality of my "wifi slotting" code for netem was so
disappointing that I was extremely reluctant to push support for it up
to mainline iproute2.

I've now spent months failing to come up with something that
could emulate in linux the non-duplex behavior and arbitration steps
that wifi goes through in order to find a new station to transmit to,
or receive from, using netem as a base.

Getting that non-duplex behavior right is, I think, the *single most
important thing* for emulating real wireless behaviors in real time
(and thus for being able to run and improve various e2e transports
against such an emulation).

A potential tc API seems simple:

tc qdisc add dev veth1 root netem coupled # master (AP)
tc qdisc add dev veth2 root netem couple veth1 # client
tc qdisc add dev veth3 root netem couple veth2 # client

Something more complicated would be to create some sort of
arbitration device and attach that to the qdiscs (which would make
it possible to write arbitration devices that emulate LTE, GPON,
cable, wireless mesh, and other non-duplex behaviors in real time).

But how to convince qdiscs to be arbitrated, only allowing one in a
set to transmit at the same time? (and worse, in the long run,
allowing MU-MIMO-like behaviors).

I'm tempted to *not* put my failed thinking down here in the hope that
someone says, out there, "oh, that's easy, just create this structure
with X API call and use Y function and you're clear of all the
potential deadlock and RCU issues, and we've been doing that for
years, you idiot! Here's the code for how we do it, sorry we didn't
submit it earlier."

What I thought (*and still think*) is that creating a superset of the
qdisc_watchdog_schedule_ns() function is a start at it:

tag = qdisc_watchdog_create_arb("some identifier");
qdisc_watchdog_schedule_arb(nsec, tag); /* null tag = schedule_ns */

which doesn't allow that qdisc instance to run until the arbitrator
says it can (essentially overriding the specified timeout).
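
Roughly, as a pure sketch (qdisc_watchdog and qdisc_watchdog_schedule_ns()
are real; every other name below is invented, and the locking/RCU story is
exactly the part I haven't solved):

/* Hypothetical arbitrated watchdog, layered on the existing qdisc_watchdog. */
struct qdisc_watchdog_arb {
        spinlock_t              lock;
        struct list_head        waiters;  /* qdiscs waiting for the "air" */
        struct qdisc_watchdog   *active;  /* the one currently allowed to run */
};

struct qdisc_watchdog_arb *qdisc_watchdog_create_arb(const char *id);

/* Like qdisc_watchdog_schedule_ns(), but when the timer fires the qdisc
 * is only queued on the arbitrator; it is not re-run until the arbitrator
 * grants it.  A NULL tag degenerates to plain qdisc_watchdog_schedule_ns(). */
void qdisc_watchdog_schedule_arb(struct qdisc_watchdog *wd, u64 expires,
                                 struct qdisc_watchdog_arb *tag);

/* Called by the arbitration device: pick the next eligible waiter and
 * __netif_schedule() its qdisc, so only one member of the set is
 * dequeuing at any instant. */
void qdisc_watchdog_arb_grant(struct qdisc_watchdog_arb *arb);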

But I actually wouldn't mind something that worked at the veth, or
device, rather than qdisc level...

thoughts?

PS I just spent several days working on another aspect of the problem,
which is replaying delay distributions (caused by interference and
such)... and that, sigh, to me, also belongs in some sort of
arbitration device rather than directly in netem. Maybe tossing netem
entirely is the answer. I don't know.

-- 

Dave Täht
CEO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-669-226-2619

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Make-wifi-fast] emulating wifi better - coupling qdiscs in netem?
  2018-05-30 17:28 [Make-wifi-fast] emulating wifi better - coupling qdiscs in netem? Dave Taht
@ 2018-05-30 18:32 ` Bob McMahon
  2018-05-30 18:54   ` Dave Taht
  2018-05-30 22:57 ` dpreed
  1 sibling, 1 reply; 23+ messages in thread
From: Bob McMahon @ 2018-05-30 18:32 UTC (permalink / raw)
  To: Dave Taht; +Cc: Make-Wifi-fast

[-- Attachment #1: Type: text/plain, Size: 3106 bytes --]

Sorry, I may be coming late to this.  What exactly is the goal?  Instead of
emulating interference with netem is it possible to create real
interference?

Bob

On Wed, May 30, 2018 at 10:28 AM, Dave Taht <dave.taht@gmail.com> wrote:

> The match to reality of my "wifi slotting" code for netem was so
> disappointing that I was extremely reluctant to push support for it up
> to mainline iproute2.
>
> I've now spent months failing to come up with something that
> could emulate in linux the non-duplex behavior and arbitration steps
> that wifi goes through in order to find a new station to transmit to,
> or receive from, using netem as a base.
>
> Getting that non-duplex behavior right is the *single most important
> thing*, I think,  for trying to emulate real wireless behaviors in
> real time that I can think of (and to thus be able to run and improve
> various e2e transports against it).
>
> A potential tc API seems simple:
>
> tc qdisc add dev veth1 root netem coupled # master (AP)
> tc qdisc add dev veth2 root netem couple veth1 # client
> tc qdisc add dev veth3 root netem couple veth2 # client
>
> Something more complicated would be to create some sort of
> arbitration device and attach that to the qdiscs. (which would make
> it more possible to write arbitration devices to emulate lte, gpon,
> cable, wireless mesh and other non-duplex behaviors in real time)
>
> But how to convince qdiscs to be arbitrated, only allowing one in a
> set to transmit at the same time? (and worse, in the long run,
> allowing MU-MIMO-like behaviors).
>
> I'm tempted to *not* put my failed thinking down here in the hope that
> someone says, out there, "oh, that's easy, just create this structure
> with X API call and use Y function and you're clear of all the
> potential deadlock and RCU issues, and we've been doing that for
> years, you idiot! Here's the code for how we do it, sorry we didn't
> submit it earlier."
>
> What I thought (*and still think*) is of creating a superset of the
> qdisc_watchdog_schedule_ns() function is a start at it:
>
> tag = qdisc_watchdog_create_arb("some identifier");
> qdisc_watchdog_schedule_arb(nsec, tag); /* null tag = schedule_ns */
>
> which doesn't allow that qdisc instance to be run until the arbitrator
> says it can run (essentially overriding the timeout specified)
>
> But I actually wouldn't mind something that worked at the veth, or
> device, rather than qdisc level...
>
> thoughts?
>
> PS I just spent several days working on another aspect of the problem,
> which is replaying delay distributions (caused by interference and
> such)... and that, sigh, to me, also belongs in some sort of
> arbitration device rather than directly in netem. Maybe tossing netem
> entirely is the answer. I don't know.
>
> --
>
> Dave Täht
> CEO, TekLibre, LLC
> http://www.teklibre.com
> Tel: 1-669-226-2619
> _______________________________________________
> Make-wifi-fast mailing list
> Make-wifi-fast@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/make-wifi-fast

[-- Attachment #2: Type: text/html, Size: 3925 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Make-wifi-fast] emulating wifi better - coupling qdiscs in netem?
  2018-05-30 18:32 ` Bob McMahon
@ 2018-05-30 18:54   ` Dave Taht
  2018-05-30 18:58     ` Jonathan Morton
  2018-05-30 19:19     ` Bob McMahon
  0 siblings, 2 replies; 23+ messages in thread
From: Dave Taht @ 2018-05-30 18:54 UTC (permalink / raw)
  To: Bob McMahon; +Cc: Make-Wifi-fast

On Wed, May 30, 2018 at 11:32 AM, Bob McMahon <bob.mcmahon@broadcom.com> wrote:
> Sorry, I may be coming late to this.  What exactly is the goal?  Instead of
> emulating interference with netem is it possible to create real
> interference?

Interference to me is a secondary, but important part of the problem.

The core requirement is somehow emulating the single transmitter at a
time behavior of wireless technologies. In this way of thinking, an
interferer is just another transmitter in the emulation.

Linux's behaviors are all full duplex, except at the very lowest
driver levels. Being able to move the concept of a
"single bulk transmitter at a time" much higher in the stack (at least
for netem emulation) is what I'd like to do, so as to be able to
reliably look at the behaviors of e2e protocols against a decently
correct wireless emulation...

Does that help? Just getting to where I could describe the problem(s)
well enough to talk about 'em on the mailing list has taken me forever,
and if I/we can get to where we can describe the problem better, maybe
solutions will materialize. ;)

Did anyone but me ever play with the slotting models I put into netem last year?


>
> Bob
>
> On Wed, May 30, 2018 at 10:28 AM, Dave Taht <dave.taht@gmail.com> wrote:
>>
>> The match to reality of my "wifi slotting" code for netem was so
>> disappointing that I was extremely reluctant to push support for it up
>> to mainline iproute2.
>>
>> I've now spent months failing to come up with something that
>> could emulate in linux the non-duplex behavior and arbitration steps
>> that wifi goes through in order to find a new station to transmit to,
>> or receive from, using netem as a base.
>>
>> Getting that non-duplex behavior right is the *single most important
>> thing*, I think,  for trying to emulate real wireless behaviors in
>> real time that I can think of (and to thus be able to run and improve
>> various e2e transports against it).
>>
>> A potential tc API seems simple:
>>
>> tc qdisc add dev veth1 root netem coupled # master (AP)
>> tc qdisc add dev veth2 root netem couple veth1 # client
>> tc qdisc add dev veth3 root netem couple veth2 # client
>>
>> Something more complicated would be to create some sort of
>> arbitration device and attach that to the qdiscs. (which would make
>> it more possible to write arbitration devices to emulate lte, gpon,
>> cable, wireless mesh and other non-duplex behaviors in real time)
>>
>> But how to convince qdiscs to be arbitrated, only allowing one in a
>> set to transmit at the same time? (and worse, in the long run,
>> allowing MU-MIMO-like behaviors).
>>
>> I'm tempted to *not* put my failed thinking down here in the hope that
>> someone says, out there, "oh, that's easy, just create this structure
>> with X API call and use Y function and you're clear of all the
>> potential deadlock and RCU issues, and we've been doing that for
>> years, you idiot! Here's the code for how we do it, sorry we didn't
>> submit it earlier."
>>
>> What I thought (*and still think*) is of creating a superset of the
>> qdisc_watchdog_schedule_ns() function is a start at it:
>>
>> tag = qdisc_watchdog_create_arb("some identifier");
>> qdisc_watchdog_schedule_arb(nsec, tag); /* null tag = schedule_ns */
>>
>> which doesn't allow that qdisc instance to be run until the arbitrator
>> says it can run (essentially overriding the timeout specified)
>>
>> But I actually wouldn't mind something that worked at the veth, or
>> device, rather than qdisc level...
>>
>> thoughts?
>>
>> PS I just spent several days working on another aspect of the problem,
>> which is replaying delay distributions (caused by interference and
>> such)... and that, sigh, to me, also belongs in some sort of
>> arbitration device rather than directly in netem. Maybe tossing netem
>> entirely is the answer. I don't know.
>>
>> --
>>
>> Dave Täht
>> CEO, TekLibre, LLC
>> http://www.teklibre.com
>> Tel: 1-669-226-2619
>> _______________________________________________
>> Make-wifi-fast mailing list
>> Make-wifi-fast@lists.bufferbloat.net
>> https://lists.bufferbloat.net/listinfo/make-wifi-fast
>
>



-- 

Dave Täht
CEO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-669-226-2619

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Make-wifi-fast] emulating wifi better - coupling qdiscs in netem?
  2018-05-30 18:54   ` Dave Taht
@ 2018-05-30 18:58     ` Jonathan Morton
  2018-05-30 19:19     ` Bob McMahon
  1 sibling, 0 replies; 23+ messages in thread
From: Jonathan Morton @ 2018-05-30 18:58 UTC (permalink / raw)
  To: Dave Taht; +Cc: Bob McMahon, Make-Wifi-fast

> On 30 May, 2018, at 9:54 pm, Dave Taht <dave.taht@gmail.com> wrote:
> 
> The core requirement is somehow emulating the single transmitter at a
> time behavior of wireless technologies. In this way of thinking, an
> interfere-er is just another transmitter in emulation.

Does it help if you redirect all that traffic through a single ifb device, and thus a single qdisc?

 - Jonathan Morton


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Make-wifi-fast] emulating wifi better - coupling qdiscs in netem?
  2018-05-30 18:54   ` Dave Taht
  2018-05-30 18:58     ` Jonathan Morton
@ 2018-05-30 19:19     ` Bob McMahon
  2018-05-30 23:26       ` Dave Taht
  1 sibling, 1 reply; 23+ messages in thread
From: Bob McMahon @ 2018-05-30 19:19 UTC (permalink / raw)
  To: Dave Taht; +Cc: Make-Wifi-fast

[-- Attachment #1: Type: text/plain, Size: 5247 bytes --]

I'm still confused.  Not exactly sure what "wifi single transmitter at a
time behavior" means.  Is this phy level testing such as energy detect and NAV
<http://www.revolutionwifi.net/revolutionwifi/2011/03/understanding-wi-fi-carrier-sense.html>?
Are you trying to consume time "slots" from a transmitter perspective and
seeing how that peer device responds w/respect to its transmit scheduling?

To move the noise floor on either a peer RX or TX we just send tones for a
duration on a frequency band at a power level without worrying about any
slotting (but that's a special tool in our chip, not released).

Bob


On Wed, May 30, 2018 at 11:54 AM, Dave Taht <dave.taht@gmail.com> wrote:

> On Wed, May 30, 2018 at 11:32 AM, Bob McMahon <bob.mcmahon@broadcom.com>
> wrote:
> > Sorry, I may be coming late to this.  What exactly is the goal?  Instead
> of
> > emulating interference with netem is it possible to create real
> > interference?
>
> Interference to me is a secondary, but important part of the problem.
>
> The core requirement is somehow emulating the single transmitter at a
> time behavior of wireless technologies. In this way of thinking, an
> interfere-er is just another transmitter in emulation.
>
> Linux's behaviors are all full duplex, except at the very lowest
> driver levels. Being able to move the concept of a
> "single bulk transmitter at a time" much higher in stack (at least,
> for netem emulation), is what I'd like to do. Being better able to
> reliable look at the behaviors of e2e protocols with a decently
> correct wireless emulation...
>
> Does that help? Just getting to where I could describe the problem(s)
> well enough to talk about 'em
> in the mailing list has taken me forever, and if I/we can get to where
> we can describe the problem
> better, maybe solutions will materialize. ;)
>
> Did anyone but me ever play with the slotting models I put into netem last
> year?
>
>
> >
> > Bob
> >
> > On Wed, May 30, 2018 at 10:28 AM, Dave Taht <dave.taht@gmail.com> wrote:
> >>
> >> The match to reality of my "wifi slotting" code for netem was so
> >> disappointing that I was extremely reluctant to push support for it up
> >> to mainline iproute2.
> >>
> >> I've now spent months failing to come up with something that
> >> could emulate in linux the non-duplex behavior and arbitration steps
> >> that wifi goes through in order to find a new station to transmit to,
> >> or receive from, using netem as a base.
> >>
> >> Getting that non-duplex behavior right is the *single most important
> >> thing*, I think,  for trying to emulate real wireless behaviors in
> >> real time that I can think of (and to thus be able to run and improve
> >> various e2e transports against it).
> >>
> >> A potential tc API seems simple:
> >>
> >> tc qdisc add dev veth1 root netem coupled # master (AP)
> >> tc qdisc add dev veth2 root netem couple veth1 # client
> >> tc qdisc add dev veth3 root netem couple veth2 # client
> >>
> >> Something more complicated would be to create some sort of
> >> arbitration device and attach that to the qdiscs. (which would make
> >> it more possible to write arbitration devices to emulate lte, gpon,
> >> cable, wireless mesh and other non-duplex behaviors in real time)
> >>
> >> But how to convince qdiscs to be arbitrated, only allowing one in a
> >> set to transmit at the same time? (and worse, in the long run,
> >> allowing MU-MIMO-like behaviors).
> >>
> >> I'm tempted to *not* put my failed thinking down here in the hope that
> >> someone says, out there, "oh, that's easy, just create this structure
> >> with X API call and use Y function and you're clear of all the
> >> potential deadlock and RCU issues, and we've been doing that for
> >> years, you idiot! Here's the code for how we do it, sorry we didn't
> >> submit it earlier."
> >>
> >> What I thought (*and still think*) is of creating a superset of the
> >> qdisc_watchdog_schedule_ns() function is a start at it:
> >>
> >> tag = qdisc_watchdog_create_arb("some identifier");
> >> qdisc_watchdog_schedule_arb(nsec, tag); /* null tag = schedule_ns */
> >>
> >> which doesn't allow that qdisc instance to be run until the arbitrator
> >> says it can run (essentially overriding the timeout specified)
> >>
> >> But I actually wouldn't mind something that worked at the veth, or
> >> device, rather than qdisc level...
> >>
> >> thoughts?
> >>
> >> PS I just spent several days working on another aspect of the problem,
> >> which is replaying delay distributions (caused by interference and
> >> such)... and that, sigh, to me, also belongs in some sort of
> >> arbitration device rather than directly in netem. Maybe tossing netem
> >> entirely is the answer. I don't know.
> >>
> >> --
> >>
> >> Dave Täht
> >> CEO, TekLibre, LLC
> >> http://www.teklibre.com
> >> Tel: 1-669-226-2619
> >> _______________________________________________
> >> Make-wifi-fast mailing list
> >> Make-wifi-fast@lists.bufferbloat.net
> >> https://lists.bufferbloat.net/listinfo/make-wifi-fast
> >
> >
>
>
>
> --
>
> Dave Täht
> CEO, TekLibre, LLC
> http://www.teklibre.com
> Tel: 1-669-226-2619
>

[-- Attachment #2: Type: text/html, Size: 6855 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Make-wifi-fast] emulating wifi better - coupling qdiscs in netem?
  2018-05-30 17:28 [Make-wifi-fast] emulating wifi better - coupling qdiscs in netem? Dave Taht
  2018-05-30 18:32 ` Bob McMahon
@ 2018-05-30 22:57 ` dpreed
  2018-06-15 22:30   ` Dave Taht
  1 sibling, 1 reply; 23+ messages in thread
From: dpreed @ 2018-05-30 22:57 UTC (permalink / raw)
  To: Dave Taht; +Cc: make-wifi-fast

[-- Attachment #1: Type: text/plain, Size: 4245 bytes --]


I would toss netem rather than kludging around what appears to be a fundamental design choice made in its conceptualization. Make a "netem2".
 
FreeBSD has a very nice framework for emulating far more general packet queuing/routing/... in the kernel, called NetGraph. It's incredibly general, and could straightforwardly, with high performance, have modules that do exactly the right emulations of network structures with such blocking, etc. and even random delays.
 
I know this because in my day job at TidalScale, we heavily use NetGraph to implement new very low level protocols, which is pretty straightforward, even including complex multi-adapter adaptive forwarding of our private protocols on 10 and 40 GigE links. Super flexible, entirely in the kernel, running either at real-time priority or not, in a mix.
 
In contrast, the Linux TC framework seems very inflexible, as you've found, in trying to push it to do what it is not designed to do.
 
So tossing netem might be far better. I wonder if NetGraph has ever been ported into some Linux kernel environment...
-----Original Message-----
From: "Dave Taht" <dave.taht@gmail.com>
Sent: Wednesday, May 30, 2018 1:28pm
To: make-wifi-fast@lists.bufferbloat.net
Subject: [Make-wifi-fast] emulating wifi better - coupling qdiscs in netem?



The match to reality of my "wifi slotting" code for netem was so
disappointing that I was extremely reluctant to push support for it up
to mainline iproute2.

I've now spent months failing to come up with something that
could emulate in linux the non-duplex behavior and arbitration steps
that wifi goes through in order to find a new station to transmit to,
or receive from, using netem as a base.

Getting that non-duplex behavior right is the *single most important
thing*, I think, for trying to emulate real wireless behaviors in
real time that I can think of (and to thus be able to run and improve
various e2e transports against it).

A potential tc API seems simple:

tc qdisc add dev veth1 root netem coupled # master (AP)
tc qdisc add dev veth2 root netem couple veth1 # client
tc qdisc add dev veth3 root netem couple veth2 # client

Something more complicated would be to create some sort of
arbitration device and attach that to the qdiscs. (which would make
it more possible to write arbitration devices to emulate lte, gpon,
cable, wireless mesh and other non-duplex behaviors in real time)

But how to convince qdiscs to be arbitrated, only allowing one in a
set to transmit at the same time? (and worse, in the long run,
allowing MU-MIMO-like behaviors).

I'm tempted to *not* put my failed thinking down here in the hope that
someone says, out there, "oh, that's easy, just create this structure
with X API call and use Y function and you're clear of all the
potential deadlock and RCU issues, and we've been doing that for
years, you idiot! Here's the code for how we do it, sorry we didn't
submit it earlier."

What I thought (*and still think*) is of creating a superset of the
qdisc_watchdog_schedule_ns() function is a start at it:

tag = qdisc_watchdog_create_arb("some identifier");
qdisc_watchdog_schedule_arb(nsec, tag); /* null tag = schedule_ns */

which doesn't allow that qdisc instance to be run until the arbitrator
says it can run (essentially overriding the timeout specified)

But I actually wouldn't mind something that worked at the veth, or
device, rather than qdisc level...

thoughts?

PS I just spent several days working on another aspect of the problem,
which is replaying delay distributions (caused by interference and
such)... and that, sigh, to me, also belongs in some sort of
arbitration device rather than directly in netem. Maybe tossing netem
entirely is the answer. I don't know.

-- 

Dave Täht
CEO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-669-226-2619
_______________________________________________
Make-wifi-fast mailing list
Make-wifi-fast@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/make-wifi-fast


[-- Attachment #2: Type: text/html, Size: 5690 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Make-wifi-fast] emulating wifi better - coupling qdiscs in netem?
  2018-05-30 19:19     ` Bob McMahon
@ 2018-05-30 23:26       ` Dave Taht
  0 siblings, 0 replies; 23+ messages in thread
From: Dave Taht @ 2018-05-30 23:26 UTC (permalink / raw)
  To: Bob McMahon; +Cc: Make-Wifi-fast

On Wed, May 30, 2018 at 12:19 PM, Bob McMahon <bob.mcmahon@broadcom.com> wrote:
> I'm still confused.  Not exactly sure what "wifi single transmitter at time
> behavior" means.  Is this phy level testing such as energy detect and NAV?
> Are you trying to consume time "slots" from a transmitter perspective and
> seeing how that peer device responds w/respect to its transmit scheduling?
>
> To move the noise floor on either a peer RX or TX we just send tones for a
> duration on a frequency band at a power level without worrying about any
> slotting (but that's a special tool in our chip not released.)

That's cool.
But *way* lower level than what I'm trying for.

take a faked wifi topology like this

AP <-> A
     <-> B
     <-> C
     <-> D

There are 9 possibilities as to who will transmit where at any given
time, with the number of packets and bytes governed by the current rate
achieved between the AP and that client (+ multicast), plus possible
interferers.

All the top-level queues in Linux assume bidirectionality, but here we
don't have that, and a burst of traffic can see arbitrary delays on the
order of 20ms (on average, assuming a perfectly random distribution of
access to the medium - and averages lie in this case, and way more
stations can be competing than this). The bursts can range in size from
a single 64-byte packet to a few hundred packets (802.11ac). So the
goal with the netem slotting model was to at least accumulate bursts,
which it does, and to capture the randomness and delays wifi imposes on
traffic to the various stations (which it doesn't).
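
To make those combinatorics concrete, here's a toy model of the grant
sequence for that topology (plain userspace C, nothing to do with the
actual netem code, and every number in it is invented):

#include <stdio.h>
#include <stdlib.h>

#define STATIONS 4

int main(void)
{
        double t_us = 0;
        int grant;

        for (grant = 0; grant < 10; grant++) {
                /* 2*STATIONS+1 = 9 possibilities: AP->sta, sta->AP, multicast */
                int slot = rand() % (2 * STATIONS + 1);
                int burst = 1 + rand() % 42;    /* 1..42 aggregated packets */
                int access_us = rand() % 20000; /* 0..20ms to win the medium */
                double txop_us = burst * 30.0;  /* crude per-packet airtime */

                t_us += access_us;
                if (slot == 2 * STATIONS)
                        printf("t=%8.0fus  multicast burst, %d pkts\n",
                               t_us, burst);
                else
                        printf("t=%8.0fus  %s station %c, %d pkts\n", t_us,
                               slot < STATIONS ? "AP ->" : "AP <-",
                               'A' + slot % STATIONS, burst);
                t_us += txop_us;
        }
        return 0;
}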




> Bob
>
>
> On Wed, May 30, 2018 at 11:54 AM, Dave Taht <dave.taht@gmail.com> wrote:
>>
>> On Wed, May 30, 2018 at 11:32 AM, Bob McMahon <bob.mcmahon@broadcom.com>
>> wrote:
>> > Sorry, I may be coming late to this.  What exactly is the goal?  Instead
>> > of
>> > emulating interference with netem is it possible to create real
>> > interference?
>>
>> Interference to me is a secondary, but important part of the problem.
>>
>> The core requirement is somehow emulating the single transmitter at a
>> time behavior of wireless technologies. In this way of thinking, an
>> interfere-er is just another transmitter in emulation.
>>
>> Linux's behaviors are all full duplex, except at the very lowest
>> driver levels. Being able to move the concept of a
>> "single bulk transmitter at a time" much higher in stack (at least,
>> for netem emulation), is what I'd like to do. Being better able to
>> reliable look at the behaviors of e2e protocols with a decently
>> correct wireless emulation...
>>
>> Does that help? Just getting to where I could describe the problem(s)
>> well enough to talk about 'em
>> in the mailing list has taken me forever, and if I/we can get to where
>> we can describe the problem
>> better, maybe solutions will materialize. ;)
>>
>> Did anyone but me ever play with the slotting models I put into netem last
>> year?
>>
>>
>> >
>> > Bob
>> >
>> > On Wed, May 30, 2018 at 10:28 AM, Dave Taht <dave.taht@gmail.com> wrote:
>> >>
>> >> The match to reality of my "wifi slotting" code for netem was so
>> >> disappointing that I was extremely reluctant to push support for it up
>> >> to mainline iproute2.
>> >>
>> >> I've now spent months failing to come up with something that
>> >> could emulate in linux the non-duplex behavior and arbitration steps
>> >> that wifi goes through in order to find a new station to transmit to,
>> >> or receive from, using netem as a base.
>> >>
>> >> Getting that non-duplex behavior right is the *single most important
>> >> thing*, I think,  for trying to emulate real wireless behaviors in
>> >> real time that I can think of (and to thus be able to run and improve
>> >> various e2e transports against it).
>> >>
>> >> A potential tc API seems simple:
>> >>
>> >> tc qdisc add dev veth1 root netem coupled # master (AP)
>> >> tc qdisc add dev veth2 root netem couple veth1 # client
>> >> tc qdisc add dev veth3 root netem couple veth2 # client
>> >>
>> >> Something more complicated would be to create some sort of
>> >> arbitration device and attach that to the qdiscs. (which would make
>> >> it more possible to write arbitration devices to emulate lte, gpon,
>> >> cable, wireless mesh and other non-duplex behaviors in real time)
>> >>
>> >> But how to convince qdiscs to be arbitrated, only allowing one in a
>> >> set to transmit at the same time? (and worse, in the long run,
>> >> allowing MU-MIMO-like behaviors).
>> >>
>> >> I'm tempted to *not* put my failed thinking down here in the hope that
>> >> someone says, out there, "oh, that's easy, just create this structure
>> >> with X API call and use Y function and you're clear of all the
>> >> potential deadlock and RCU issues, and we've been doing that for
>> >> years, you idiot! Here's the code for how we do it, sorry we didn't
>> >> submit it earlier."
>> >>
>> >> What I thought (*and still think*) is of creating a superset of the
>> >> qdisc_watchdog_schedule_ns() function is a start at it:
>> >>
>> >> tag = qdisc_watchdog_create_arb("some identifier");
>> >> qdisc_watchdog_schedule_arb(nsec, tag); /* null tag = schedule_ns */
>> >>
>> >> which doesn't allow that qdisc instance to be run until the arbitrator
>> >> says it can run (essentially overriding the timeout specified)
>> >>
>> >> But I actually wouldn't mind something that worked at the veth, or
>> >> device, rather than qdisc level...
>> >>
>> >> thoughts?
>> >>
>> >> PS I just spent several days working on another aspect of the problem,
>> >> which is replaying delay distributions (caused by interference and
>> >> such)... and that, sigh, to me, also belongs in some sort of
>> >> arbitration device rather than directly in netem. Maybe tossing netem
>> >> entirely is the answer. I don't know.
>> >>
>> >> --
>> >>
>> >> Dave Täht
>> >> CEO, TekLibre, LLC
>> >> http://www.teklibre.com
>> >> Tel: 1-669-226-2619
>> >> _______________________________________________
>> >> Make-wifi-fast mailing list
>> >> Make-wifi-fast@lists.bufferbloat.net
>> >> https://lists.bufferbloat.net/listinfo/make-wifi-fast
>> >
>> >
>>
>>
>>
>> --
>>
>> Dave Täht
>> CEO, TekLibre, LLC
>> http://www.teklibre.com
>> Tel: 1-669-226-2619
>
>



-- 

Dave Täht
CEO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-669-226-2619

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Make-wifi-fast] emulating wifi better - coupling qdiscs in netem?
  2018-05-30 22:57 ` dpreed
@ 2018-06-15 22:30   ` Dave Taht
  2018-06-16 22:53     ` Pete Heist
  0 siblings, 1 reply; 23+ messages in thread
From: Dave Taht @ 2018-06-15 22:30 UTC (permalink / raw)
  To: dpreed; +Cc: Make-Wifi-fast

I think tossing netem entirely, ditching the slot models I added to it
last year, and going to userspace to better emulate wifi, is the
answer. Eric just suggested using the iptables NFQUEUE ability to toss
packets to userspace.

https://home.regit.org/netfilter-en/using-nfqueue-and-libnetfilter_queue/

nfqueue has batching support built in, so an arbitrary number of
packets can be released as determined by userspace.

# Rough setup sketch: one nfqueue per station per direction,
# queue 0 for multicast

stations=2

# or match on the multicast mac address

iptables -A INPUT -i veth-ap0 -d 224.0.0.0/8 -j NFQUEUE --queue-num 0

# queues 1..$stations: traffic towards each station
for i in `seq 1 $stations`
do
iptables -A INPUT -i veth-ap0 -d 10.0.0.$i -j NFQUEUE --queue-num $i
done

# queues $stations+1..2*$stations: traffic from each station
for i in `seq 1 $stations`
do
iptables -A OUTPUT -o veth-ap0 -s 10.0.0.$i -j NFQUEUE --queue-num $((i + stations))
done

The wifi-emulating daemon then listens on these queues and decides
when to deliver each, and how many packets in a batch.
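
For a sense of the daemon side, here's a minimal, untested sketch of one
queue's listener using the classic libnetfilter_queue API (helper names
are generic). The callback here just ACCEPTs immediately; the real daemon
would stash packet ids per station and only issue verdicts when the
arbitrator grants the "air":

/* Untested sketch: single nfqueue listener on queue 0. */
#include <stdlib.h>
#include <stdint.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/netfilter.h>
#include <libnetfilter_queue/libnetfilter_queue.h>

static int cb(struct nfq_q_handle *qh, struct nfgenmsg *nfmsg,
              struct nfq_data *nfa, void *data)
{
        struct nfqnl_msg_packet_hdr *ph = nfq_get_msg_packet_hdr(nfa);
        uint32_t id = ph ? ntohl(ph->packet_id) : 0;

        /* placeholder: release immediately instead of waiting for a grant */
        return nfq_set_verdict(qh, id, NF_ACCEPT, 0, NULL);
}

int main(void)
{
        struct nfq_handle *h = nfq_open();
        struct nfq_q_handle *qh;
        char buf[65536];
        int fd, n;

        if (!h)
                exit(1);
        nfq_unbind_pf(h, AF_INET);
        nfq_bind_pf(h, AF_INET);

        qh = nfq_create_queue(h, 0, &cb, NULL);
        if (!qh)
                exit(1);
        nfq_set_mode(qh, NFQNL_COPY_PACKET, 0xffff);

        fd = nfq_fd(h);
        while ((n = recv(fd, buf, sizeof(buf), 0)) >= 0)
                nfq_handle_packet(h, buf, n);

        nfq_destroy_queue(qh);
        nfq_close(h);
        return 0;
}

Something like "gcc listener.c -o listener -lnetfilter_queue -lnfnetlink"
should build it.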

For wifi, at least, timings are not hugely critical; a few hundred
usec is something userspace can handle reasonably accurately. I very
much like being able to separate out mcast and treat it correctly in
userspace, also. I did want to be below 10usec (wifi "bus"
arbitration), which I am dubious about....

Maybe something "out there" already does this? ns3 comes close... I've
burned the last 4 months of my life trying to do this in-kernel...

Now as for an implementation language? C++? C? Go? Python? The
condition of the wrapper library for Go leaves a bit to be desired
( https://github.com/chifflier/nfqueue-go ) and given a choice I'd
MUCH rather use Go than C.

There is of course a hideous amount of complexity moved to the daemon,
as a pure FIFO AP queue forms aggregates much differently
than an fq_codeled one. But, yea! userspace....


On Wed, May 30, 2018 at 3:57 PM, dpreed@deepplum.com
<dpreed@deepplum.com> wrote:
> I would toss netem rather than kludging around what appears to be a
> fundamental design choice made in its conceptualization. Make a "netem2".
>
>
>
> FreeBSD has a very nice framework for emulating far more general packet
> queuing/routing/... in the kernel, called NetGraph. It's incredibly general,
> and could straightforwardly, with high performance, have modules that do
> exactly the right emulations of network structures with such blocking, etc.
> and even random delays.
>
>
>
> I know this because in my day job at TidalScale, we heavily use NetGraph to
> implement new very low level protocols, which is pretty straightforward,
> even including complex multi-adapter adaptive forwarding of our private
> protocols on 10 and 40 GigE links. Super flexible, entirely in the kernel,
> running either at real-time priority or not, in a mix.
>
>
>
> In contrast, the Linux TC framework seems very inflexible, as you've found,
> in trying to push it to do what it is not designed to do.
>
>
>
> So tossing netem might be far better. I wonder if NetGraph has ever been
> ported into some Linux kernel environment...
>
> -----Original Message-----
> From: "Dave Taht" <dave.taht@gmail.com>
> Sent: Wednesday, May 30, 2018 1:28pm
> To: make-wifi-fast@lists.bufferbloat.net
> Subject: [Make-wifi-fast] emulating wifi better - coupling qdiscs in netem?
>
> The match to reality of my "wifi slotting" code for netem was so
> disappointing that I was extremely reluctant to push support for it up
> to mainline iproute2.
>
> I've now spent months failing to come up with something that
> could emulate in linux the non-duplex behavior and arbitration steps
> that wifi goes through in order to find a new station to transmit to,
> or receive from, using netem as a base.
>
> Getting that non-duplex behavior right is the *single most important
> thing*, I think, for trying to emulate real wireless behaviors in
> real time that I can think of (and to thus be able to run and improve
> various e2e transports against it).
>
> A potential tc API seems simple:
>
> tc qdisc add dev veth1 root netem coupled # master (AP)
> tc qdisc add dev veth2 root netem couple veth1 # client
> tc qdisc add dev veth3 root netem couple veth2 # client
>
> Something more complicated would be to create some sort of
> arbitration device and attach that to the qdiscs. (which would make
> it more possible to write arbitration devices to emulate lte, gpon,
> cable, wireless mesh and other non-duplex behaviors in real time)
>
> But how to convince qdiscs to be arbitrated, only allowing one in a
> set to transmit at the same time? (and worse, in the long run,
> allowing MU-MIMO-like behaviors).
>
> I'm tempted to *not* put my failed thinking down here in the hope that
> someone says, out there, "oh, that's easy, just create this structure
> with X API call and use Y function and you're clear of all the
> potential deadlock and RCU issues, and we've been doing that for
> years, you idiot! Here's the code for how we do it, sorry we didn't
> submit it earlier."
>
> What I thought (*and still think*) is of creating a superset of the
> qdisc_watchdog_schedule_ns() function is a start at it:
>
> tag = qdisc_watchdog_create_arb("some identifier");
> qdisc_watchdog_schedule_arb(nsec, tag); /* null tag = schedule_ns */
>
> which doesn't allow that qdisc instance to be run until the arbitrator
> says it can run (essentially overriding the timeout specified)
>
> But I actually wouldn't mind something that worked at the veth, or
> device, rather than qdisc level...
>
> thoughts?
>
> PS I just spent several days working on another aspect of the problem,
> which is replaying delay distributions (caused by interference and
> such)... and that, sigh, to me, also belongs in some sort of
> arbitration device rather than directly in netem. Maybe tossing netem
> entirely is the answer. I don't know.
>
> --
>
> Dave Täht
> CEO, TekLibre, LLC
> http://www.teklibre.com
> Tel: 1-669-226-2619
> _______________________________________________
> Make-wifi-fast mailing list
> Make-wifi-fast@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/make-wifi-fast
>



-- 

Dave Täht
CEO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-669-226-2619

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Make-wifi-fast] emulating wifi better - coupling qdiscs in netem?
  2018-06-15 22:30   ` Dave Taht
@ 2018-06-16 22:53     ` Pete Heist
  2018-06-17 11:19       ` Jesper Dangaard Brouer
  2018-06-18  0:59       ` Eric Dumazet
  0 siblings, 2 replies; 23+ messages in thread
From: Pete Heist @ 2018-06-16 22:53 UTC (permalink / raw)
  To: Dave Taht; +Cc: Make-Wifi-fast

[-- Attachment #1: Type: text/plain, Size: 7043 bytes --]


> On Jun 16, 2018, at 12:30 AM, Dave Taht <dave.taht@gmail.com> wrote:
> 
> Eric just suggested using the iptables NFQUEUE ability to toss
> packets to userspace.
> 
> https://home.regit.org/netfilter-en/using-nfqueue-and-libnetfilter_queue/
> For wifi, at least, timings are not hugely critical, a few hundred
> usec is something userspace can handle reasonably accurately. I like
> very much being able to separate out mcast and treat that correctly in
> userspace, also. I did want to be below 10usec (wifi "bus"
> arbitration), which I am dubious about....
> 
> Now as for an implementation language? C++ C? Go? Python? The
> condition of the wrapper library for go leaves a bit to be desired
> ( https://github.com/chifflier/nfqueue-go ) and given a choice I'd
> MUCH rather use a go than a C.

This sounds cool... So for fun, I compared ping and iperf3 with no-op nfqueue callbacks in both C and Go. As for the hardware setup, I used two lxc containers (effectively just veth) on an APU2.

For the Go program, I used test_nfqueue from the wrapper above (which yes, does need some work) and removed debugging / logging.

For the C program I used this:
https://github.com/irontec/netfilter-nfqueue-samples/blob/master/sample-helloworld.c
I removed any per-packet printf calls and compiled with "gcc sample-helloworld.c -o nfq -lnfnetlink -lnetfilter_queue”.

Ping results:

ping without nfqueue:
root@lsrv:~# iptables -F OUTPUT
root@lsrv:~# ping -c 500 -i 0.01 -q 10.182.122.11
500 packets transmitted, 500 received, 0% packet loss, time 7985ms
rtt min/avg/max/mdev = 0.056/0.058/0.185/0.011 ms

ping with no-op nfqueue callback in C:
root@lsrv:~# iptables -A OUTPUT -d 10.182.122.11/32 -j NFQUEUE --queue-num 0
root@lsrv:~/nfqueue# ping -c 500 -i 0.01 -q 10.182.122.11
500 packets transmitted, 500 received, 0% packet loss, time 7981ms
rtt min/avg/max/mdev = 0.117/0.123/0.384/0.020 ms

ping with no-op nfqueue callback in Go:
root@lsrv:~# iptables -A OUTPUT -d 10.182.122.11/32 -j NFQUEUE --queue-num 0
root@lsrv:~# ping -c 500 -i 0.01 -q 10.182.122.11
500 packets transmitted, 500 received, 0% packet loss, time 7982ms
rtt min/avg/max/mdev = 0.095/0.172/0.532/0.042 ms

The mean induced latency of 65us for C or 114us for Go might be within your parameters, except you mentioned 10us for WiFi bus arbitration, which does indeed look impossible with this setup, even in C.

Iperf3 results:

iperf3 without nfqueue:
root@lsrv:~# iptables -F OUTPUT
root@lsrv:~# iperf3 -t 5 -c 10.182.122.11
Connecting to host 10.182.122.11, port 5201
[  4] local 10.182.122.1 port 55810 connected to 10.182.122.11 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec   452 MBytes  3.79 Gbits/sec    0    178 KBytes       
[  4]   1.00-2.00   sec   454 MBytes  3.82 Gbits/sec    0    320 KBytes       
[  4]   2.00-3.00   sec   450 MBytes  3.77 Gbits/sec    0    320 KBytes       
[  4]   3.00-4.00   sec   451 MBytes  3.79 Gbits/sec    0    352 KBytes       
[  4]   4.00-5.00   sec   451 MBytes  3.79 Gbits/sec    0    352 KBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-5.00   sec  2.21 GBytes  3.79 Gbits/sec    0             sender
[  4]   0.00-5.00   sec  2.21 GBytes  3.79 Gbits/sec                  receiver
iperf Done.

iperf3 with no-op nfqueue callback in C:
root@lsrv:~# iptables -A OUTPUT -d 10.182.122.11/32 -j NFQUEUE --queue-num 0
root@lsrv:~/nfqueue# iperf3 -t 5 -c 10.182.122.11
Connecting to host 10.182.122.11, port 5201
[  4] local 10.182.122.1 port 55868 connected to 10.182.122.11 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec  17.4 MBytes   146 Mbits/sec    0    107 KBytes       
[  4]   1.00-2.00   sec  16.9 MBytes   142 Mbits/sec    0    107 KBytes       
[  4]   2.00-3.00   sec  17.0 MBytes   142 Mbits/sec    0    107 KBytes       
[  4]   3.00-4.00   sec  17.0 MBytes   142 Mbits/sec    0    107 KBytes       
[  4]   4.00-5.00   sec  17.0 MBytes   143 Mbits/sec    0    115 KBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-5.00   sec  85.3 MBytes   143 Mbits/sec    0             sender
[  4]   0.00-5.00   sec  84.7 MBytes   142 Mbits/sec                  receiver

iperf3 with no-op nfqueue callback in Go:
root@lsrv:~# iptables -A OUTPUT -d 10.182.122.11/32 -j NFQUEUE --queue-num 0
root@lsrv:~# iperf3 -t 5 -c 10.182.122.11
Connecting to host 10.182.122.11, port 5201
[  4] local 10.182.122.1 port 55864 connected to 10.182.122.11 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec  14.6 MBytes   122 Mbits/sec    0   96.2 KBytes       
[  4]   1.00-2.00   sec  14.1 MBytes   118 Mbits/sec    0   96.2 KBytes       
[  4]   2.00-3.00   sec  14.0 MBytes   118 Mbits/sec    0    102 KBytes       
[  4]   3.00-4.00   sec  14.0 MBytes   117 Mbits/sec    0    102 KBytes       
[  4]   4.00-5.00   sec  13.7 MBytes   115 Mbits/sec    0    107 KBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-5.00   sec  70.5 MBytes   118 Mbits/sec    0             sender
[  4]   0.00-5.00   sec  69.9 MBytes   117 Mbits/sec                  receiver
iperf Done.

So rats, throughput gets brutalized for both C and Go. For Go, a rate of 117 Mbit with a 1500 byte MTU is 9750 packets/sec, which is 103us / packet. Mean induced latency measured by ping is 114us, which is not far off 103us, so the rate slowdown looks to be mostly caused by the per-packet nfqueue calls. The core running test_nfqueue is pinned at 100% during the test. "nice -n -20" does nothing.

Presumably you’ll sometimes be releasing more than one packet at a time(?) so I guess whether or not this is workable depends on how many you release at once, what hardware you’re on and what rates you need to test at. But when you’re trying to test a qdisc, I guess you’d want to minimize the burden you add to the CPU, or else move it to a core the qdisc isn’t running on, or something, so the qdisc itself isn’t affected by the test rig.

> There is of course a hideous amount of complexity moved to the daemon,

I can only imagine.

> as a pure fifo ap queue forms aggregates much differently
> than a fq_codeled one. But, yea! userspace....

This would be awesome if it works out! After that iperf3 test though, I think I may have smashed my dreams of writing a libnetfilter_queue userspace qdisc in Go, or C for that matter.

If this does somehow turn out to be good enough performance-wise, I think you’d have a lot more fun and spend a lot less time on it in Go than C, but that’s just an opinion... :)


[-- Attachment #2: Type: text/html, Size: 21499 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Make-wifi-fast] emulating wifi better - coupling qdiscs in netem?
  2018-06-16 22:53     ` Pete Heist
@ 2018-06-17 11:19       ` Jesper Dangaard Brouer
  2018-06-17 15:16         ` Pete Heist
  2018-06-18  0:59       ` Eric Dumazet
  1 sibling, 1 reply; 23+ messages in thread
From: Jesper Dangaard Brouer @ 2018-06-17 11:19 UTC (permalink / raw)
  To: Pete Heist
  Cc: Dave Taht, Make-Wifi-fast, brouer, Florian Westphal, Marek Majkowski


Hi Pete,

I happened to be at the Netfilter Workshop and discussed nfqueue with
Florian and Marek. I saw this attempt to use nfqueue, and Florian
points out that you are not using the GSO facility of nfqueue.

I'll quote what Florian said below:

On Sun, 17 Jun 2018 12:45:52 +0200 Florian Westphal <fw@strlen.de> wrote:
 
> The linked example code is old and does not set
> 	mnl_attr_put_u32(nlh, NFQA_CFG_FLAGS, htonl(NFQA_CFG_F_GSO));
> 
> When requesting the queue.
> 
> This means kernel has to do software segmentation of GSO skbs.
> 
> Consider using
> https://git.netfilter.org/libnetfilter_queue/tree/examples/nf-queue.c
> 
> instead if you need a template, it does this correctly.
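
In the nf-queue.c template that boils down to a fragment roughly like
this when building the NFQNL_MSG_CONFIG request (the helper name here is
just for illustration; the mask attribute tells the kernel which flag
bits to change):

#include <arpa/inet.h>
#include <libmnl/libmnl.h>
#include <linux/netfilter/nfnetlink_queue.h>

/* Ask the kernel to hand over GSO skbs as-is instead of segmenting
 * them in software; 'nlh' is the NFQNL_MSG_CONFIG request being built,
 * as in the linked nf-queue.c example. */
static void nfq_request_gso(struct nlmsghdr *nlh)
{
        mnl_attr_put_u32(nlh, NFQA_CFG_FLAGS, htonl(NFQA_CFG_F_GSO));
        mnl_attr_put_u32(nlh, NFQA_CFG_MASK, htonl(NFQA_CFG_F_GSO));
}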

--Jesper


On Sun, 17 Jun 2018 00:53:03 +0200 Pete Heist <pete@heistp.net> wrote:

> > On Jun 16, 2018, at 12:30 AM, Dave Taht <dave.taht@gmail.com> wrote:
> > 
> > Eric just suggested using the iptables NFQUEUE ability to toss
> > packets to userspace.
> > 
> > https://home.regit.org/netfilter-en/using-nfqueue-and-libnetfilter_queue/ <https://home.regit.org/netfilter-en/using-nfqueue-and-libnetfilter_queue/>
> > For wifi, at least, timings are not hugely critical, a few hundred
> > usec is something userspace can handle reasonably accurately. I like
> > very much being able to separate out mcast and treat that correctly in
> > userspace, also. I did want to be below 10usec (wifi "bus"
> > arbitration), which I am dubious about....
> > 
> > Now as for an implementation language? C++ C? Go? Python? The
> > condition of the wrapper library for go leaves a bit to be desired
> > ( https://github.com/chifflier/nfqueue-go <https://github.com/chifflier/nfqueue-go> ) and given a choice I'd
> > MUCH rather use a go than a C.  
> 
> This sounds cool... So for fun, I compared ping and iperf3 with no-op nfqueue callbacks in both C and Go. As for the hardware setup, I used two lxc containers (effectively just veth) on an APU2.
> 
> For the Go program, I used test_nfqueue from the wrapper above (which yes, does need some work) and removed debugging / logging.
> 
> For the C program I used this:
> https://github.com/irontec/netfilter-nfqueue-samples/blob/master/sample-helloworld.c
> I removed any per-packet printf calls and compiled with "gcc sample-helloworld.c -o nfq -lnfnetlink -lnetfilter_queue”.
> 
> Ping results:
> 
> ping without nfqueue:
> root@lsrv:~# iptables -F OUTPUT
> root@lsrv:~# ping -c 500 -i 0.01 -q 10.182.122.11
> 500 packets transmitted, 500 received, 0% packet loss, time 7985ms
> rtt min/avg/max/mdev = 0.056/0.058/0.185/0.011 ms
> 
> ping with no-op nfqueue callback in C:
> root@lsrv:~# iptables -A OUTPUT -d 10.182.122.11/32 -j NFQUEUE --queue-num 0
> root@lsrv:~/nfqueue# ping -c 500 -i 0.01 -q 10.182.122.11
> 500 packets transmitted, 500 received, 0% packet loss, time 7981ms
> rtt min/avg/max/mdev = 0.117/0.123/0.384/0.020 ms
> 
> ping with no-op nfqueue callback in Go:
> root@lsrv:~# iptables -A OUTPUT -d 10.182.122.11/32 -j NFQUEUE --queue-num 0
> root@lsrv:~# ping -c 500 -i 0.01 -q 10.182.122.11
> 500 packets transmitted, 500 received, 0% packet loss, time 7982ms
> rtt min/avg/max/mdev = 0.095/0.172/0.532/0.042 ms
> 
> The mean induced latency of 65us for C or 114us for Go might be within your parameters, except you mentioned 10us for WiFi bus arbitration, which does indeed look impossible with this setup, even in C.
> 
> Iperf3 results:
> 
> iperf3 without nfqueue:
> root@lsrv:~# iptables -F OUTPUT
> root@lsrv:~# iperf3 -t 5 -c 10.182.122.11
> Connecting to host 10.182.122.11, port 5201
> [  4] local 10.182.122.1 port 55810 connected to 10.182.122.11 port 5201
> [ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
> [  4]   0.00-1.00   sec   452 MBytes  3.79 Gbits/sec    0    178 KBytes       
> [  4]   1.00-2.00   sec   454 MBytes  3.82 Gbits/sec    0    320 KBytes       
> [  4]   2.00-3.00   sec   450 MBytes  3.77 Gbits/sec    0    320 KBytes       
> [  4]   3.00-4.00   sec   451 MBytes  3.79 Gbits/sec    0    352 KBytes       
> [  4]   4.00-5.00   sec   451 MBytes  3.79 Gbits/sec    0    352 KBytes       
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bandwidth       Retr
> [  4]   0.00-5.00   sec  2.21 GBytes  3.79 Gbits/sec    0             sender
> [  4]   0.00-5.00   sec  2.21 GBytes  3.79 Gbits/sec                  receiver
> iperf Done.
> 
> iperf3 with no-op nfqueue callback in C:
> root@lsrv:~# iptables -A OUTPUT -d 10.182.122.11/32 -j NFQUEUE --queue-num 0
> root@lsrv:~/nfqueue# iperf3 -t 5 -c 10.182.122.11
> Connecting to host 10.182.122.11, port 5201
> [  4] local 10.182.122.1 port 55868 connected to 10.182.122.11 port 5201
> [ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
> [  4]   0.00-1.00   sec  17.4 MBytes   146 Mbits/sec    0    107 KBytes       
> [  4]   1.00-2.00   sec  16.9 MBytes   142 Mbits/sec    0    107 KBytes       
> [  4]   2.00-3.00   sec  17.0 MBytes   142 Mbits/sec    0    107 KBytes       
> [  4]   3.00-4.00   sec  17.0 MBytes   142 Mbits/sec    0    107 KBytes       
> [  4]   4.00-5.00   sec  17.0 MBytes   143 Mbits/sec    0    115 KBytes       
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bandwidth       Retr
> [  4]   0.00-5.00   sec  85.3 MBytes   143 Mbits/sec    0             sender
> [  4]   0.00-5.00   sec  84.7 MBytes   142 Mbits/sec                  receiver
> 
> iperf3 with no-op nfqueue callback in Go:
> root@lsrv:~# iptables -A OUTPUT -d 10.182.122.11/32 -j NFQUEUE --queue-num 0
> root@lsrv:~# iperf3 -t 5 -c 10.182.122.11
> Connecting to host 10.182.122.11, port 5201
> [  4] local 10.182.122.1 port 55864 connected to 10.182.122.11 port 5201
> [ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
> [  4]   0.00-1.00   sec  14.6 MBytes   122 Mbits/sec    0   96.2 KBytes       
> [  4]   1.00-2.00   sec  14.1 MBytes   118 Mbits/sec    0   96.2 KBytes       
> [  4]   2.00-3.00   sec  14.0 MBytes   118 Mbits/sec    0    102 KBytes       
> [  4]   3.00-4.00   sec  14.0 MBytes   117 Mbits/sec    0    102 KBytes       
> [  4]   4.00-5.00   sec  13.7 MBytes   115 Mbits/sec    0    107 KBytes       
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bandwidth       Retr
> [  4]   0.00-5.00   sec  70.5 MBytes   118 Mbits/sec    0             sender
> [  4]   0.00-5.00   sec  69.9 MBytes   117 Mbits/sec                  receiver
> iperf Done.
> 
> So rats, throughput gets brutalized for both C and Go. For Go, a rate of 117 Mbit with a 1500 byte MTU is 9750 packets/sec, which is 103us / packet. Mean induced latency measured by ping is 114us, which is not far off 103us, so the rate slowdown looks to be mostly caused by the per-packet nfqueue calls. The core running test_nfqueue is pinned at 100% during the test. "nice -n -20" does nothing.
> 
> Presumably you’ll sometimes be releasing more than one packet at a time(?) so I guess whether or not this is workable depends on how many you release at once, what hardware you’re on and what rates you need to test at. But when you’re trying to test a qdisc, I guess you’d want to minimize the burden you add to the CPU, or else move it to a core the qdisc isn’t running on, or something, so the qdisc itself isn’t affected by the test rig.
> 
> > There is of course a hideous amount of complexity moved to the daemon,  
> 
> I can only imagine.
> 
> > as a pure fifo ap queue forms aggregates much differently
> > than a fq_codeled one. But, yea! userspace....  
> 
> This would be awesome if it works out! After that iperf3 test though, I think I may have smashed my dreams of writing a libnetfilter_queue userspace qdisc in Go, or C for that matter.
> 
> If this does somehow turn out to be good enough performance-wise, I think you’d have a lot more fun and spend a lot less time on it in Go than C, but that’s just an opinion... :)
> 



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Make-wifi-fast] emulating wifi better - coupling qdiscs in netem?
  2018-06-17 11:19       ` Jesper Dangaard Brouer
@ 2018-06-17 15:16         ` Pete Heist
  2018-06-17 16:09           ` Dave Taht
  0 siblings, 1 reply; 23+ messages in thread
From: Pete Heist @ 2018-06-17 15:16 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Dave Taht, Florian Westphal, Marek Majkowski, Make-Wifi-fast

[-- Attachment #1: Type: text/plain, Size: 10147 bytes --]

Hi Jesper/Florian, thanks for noticing that. Not surprisingly, it doesn't change the ping results much, but it improves throughput a lot (now only ~20% less than without nfqueue):

root@lsrv:~# iperf3 -t 5 -c 10.182.122.11
Connecting to host 10.182.122.11, port 5201
[  4] local 10.182.122.1 port 55936 connected to 10.182.122.11 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec   375 MBytes  3.14 Gbits/sec  173    372 KBytes       
[  4]   1.00-2.00   sec   365 MBytes  3.06 Gbits/sec  316    382 KBytes       
[  4]   2.00-3.00   sec   372 MBytes  3.13 Gbits/sec  368    427 KBytes       
[  4]   3.00-4.00   sec   364 MBytes  3.05 Gbits/sec  137    402 KBytes       
[  4]   4.00-5.00   sec   364 MBytes  3.05 Gbits/sec  342    382 KBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-5.00   sec  1.80 GBytes  3.09 Gbits/sec  1336             sender
[  4]   0.00-5.00   sec  1.79 GBytes  3.08 Gbits/sec                  receiver
iperf Done.

I don’t know if/how the use of GSO affects Dave’s simulation work, but I’ll leave that to him. I only wanted to contribute a quick evaluation. :)

Pete

> On Jun 17, 2018, at 1:19 PM, Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> 
> 
> Hi Pete,
> 
> Happened to be at the Netfilter Workshop, and discussed nfqueue with
> Florian and Marek, and I saw this attempt to use nfqueue, and Florian
> points out that you are not using the GSO facility of nfqueue.
> 
> I'll quote what Florian said below:
> 
> On Sun, 17 Jun 2018 12:45:52 +0200 Florian Westphal <fw@strlen.de> wrote:
> 
>> The linked example code is old and does not set
>> 	mnl_attr_put_u32(nlh, NFQA_CFG_FLAGS, htonl(NFQA_CFG_F_GSO));
>> 
>> When requesting the queue.
>> 
>> This means kernel has to do software segmentation of GSO skbs.
>> 
>> Consider using
>> https://git.netfilter.org/libnetfilter_queue/tree/examples/nf-queue.c
>> 
>> instead if you need a template, it does this correctly.
> 
> --Jesper
> 
> 
> On Sun, 17 Jun 2018 00:53:03 +0200 Pete Heist <pete@heistp.net> wrote:
> 
>>> On Jun 16, 2018, at 12:30 AM, Dave Taht <dave.taht@gmail.com> wrote:
>>> 
>>> Eric just suggested using the iptables NFQUEUE ability to toss
>>> packets to userspace.
>>> 
>>> https://home.regit.org/netfilter-en/using-nfqueue-and-libnetfilter_queue/
>>> For wifi, at least, timings are not hugely critical, a few hundred
>>> usec is something userspace can handle reasonably accurately. I like
>>> very much being able to separate out mcast and treat that correctly in
>>> userspace, also. I did want to be below 10usec (wifi "bus"
>>> arbitration), which I am dubious about....
>>> 
>>> Now as for an implementation language? C++ C? Go? Python? The
>>> condition of the wrapper library for go leaves a bit to be desired
>>> ( https://github.com/chifflier/nfqueue-go ) and given a choice I'd
>>> MUCH rather use a go than a C.  
>> 
>> This sounds cool... So for fun, I compared ping and iperf3 with no-op nfqueue callbacks in both C and Go. As for the hardware setup, I used two lxc containers (effectively just veth) on an APU2.
>> 
>> For the Go program, I used test_nfqueue from the wrapper above (which yes, does need some work) and removed debugging / logging.
>> 
>> For the C program I used this:
>> https://github.com/irontec/netfilter-nfqueue-samples/blob/master/sample-helloworld.c
>> I removed any per-packet printf calls and compiled with "gcc sample-helloworld.c -o nfq -lnfnetlink -lnetfilter_queue”.
>> 
>> Ping results:
>> 
>> ping without nfqueue:
>> root@lsrv:~# iptables -F OUTPUT
>> root@lsrv:~# ping -c 500 -i 0.01 -q 10.182.122.11
>> 500 packets transmitted, 500 received, 0% packet loss, time 7985ms
>> rtt min/avg/max/mdev = 0.056/0.058/0.185/0.011 ms
>> 
>> ping with no-op nfqueue callback in C:
>> root@lsrv:~# iptables -A OUTPUT -d 10.182.122.11/32 -j NFQUEUE --queue-num 0
>> root@lsrv:~/nfqueue# ping -c 500 -i 0.01 -q 10.182.122.11
>> 500 packets transmitted, 500 received, 0% packet loss, time 7981ms
>> rtt min/avg/max/mdev = 0.117/0.123/0.384/0.020 ms
>> 
>> ping with no-op nfqueue callback in Go:
>> root@lsrv:~# iptables -A OUTPUT -d 10.182.122.11/32 -j NFQUEUE --queue-num 0
>> root@lsrv:~# ping -c 500 -i 0.01 -q 10.182.122.11
>> 500 packets transmitted, 500 received, 0% packet loss, time 7982ms
>> rtt min/avg/max/mdev = 0.095/0.172/0.532/0.042 ms
>> 
>> The mean induced latency of 65us for C or 114us for Go might be within your parameters, except you mentioned 10us for WiFi bus arbitration, which does indeed look impossible with this setup, even in C.
>> 
>> Iperf3 results:
>> 
>> iperf3 without nfqueue:
>> root@lsrv:~# iptables -F OUTPUT
>> root@lsrv:~# iperf3 -t 5 -c 10.182.122.11
>> Connecting to host 10.182.122.11, port 5201
>> [  4] local 10.182.122.1 port 55810 connected to 10.182.122.11 port 5201
>> [ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
>> [  4]   0.00-1.00   sec   452 MBytes  3.79 Gbits/sec    0    178 KBytes       
>> [  4]   1.00-2.00   sec   454 MBytes  3.82 Gbits/sec    0    320 KBytes       
>> [  4]   2.00-3.00   sec   450 MBytes  3.77 Gbits/sec    0    320 KBytes       
>> [  4]   3.00-4.00   sec   451 MBytes  3.79 Gbits/sec    0    352 KBytes       
>> [  4]   4.00-5.00   sec   451 MBytes  3.79 Gbits/sec    0    352 KBytes       
>> - - - - - - - - - - - - - - - - - - - - - - - - -
>> [ ID] Interval           Transfer     Bandwidth       Retr
>> [  4]   0.00-5.00   sec  2.21 GBytes  3.79 Gbits/sec    0             sender
>> [  4]   0.00-5.00   sec  2.21 GBytes  3.79 Gbits/sec                  receiver
>> iperf Done.
>> 
>> iperf3 with no-op nfqueue callback in C:
>> root@lsrv:~# iptables -A OUTPUT -d 10.182.122.11/32 -j NFQUEUE --queue-num 0
>> root@lsrv:~/nfqueue# iperf3 -t 5 -c 10.182.122.11
>> Connecting to host 10.182.122.11, port 5201
>> [  4] local 10.182.122.1 port 55868 connected to 10.182.122.11 port 5201
>> [ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
>> [  4]   0.00-1.00   sec  17.4 MBytes   146 Mbits/sec    0    107 KBytes       
>> [  4]   1.00-2.00   sec  16.9 MBytes   142 Mbits/sec    0    107 KBytes       
>> [  4]   2.00-3.00   sec  17.0 MBytes   142 Mbits/sec    0    107 KBytes       
>> [  4]   3.00-4.00   sec  17.0 MBytes   142 Mbits/sec    0    107 KBytes       
>> [  4]   4.00-5.00   sec  17.0 MBytes   143 Mbits/sec    0    115 KBytes       
>> - - - - - - - - - - - - - - - - - - - - - - - - -
>> [ ID] Interval           Transfer     Bandwidth       Retr
>> [  4]   0.00-5.00   sec  85.3 MBytes   143 Mbits/sec    0             sender
>> [  4]   0.00-5.00   sec  84.7 MBytes   142 Mbits/sec                  receiver
>> 
>> iperf3 with no-op nfqueue callback in Go:
>> root@lsrv:~# iptables -A OUTPUT -d 10.182.122.11/32 -j NFQUEUE --queue-num 0
>> root@lsrv:~# iperf3 -t 5 -c 10.182.122.11
>> Connecting to host 10.182.122.11, port 5201
>> [  4] local 10.182.122.1 port 55864 connected to 10.182.122.11 port 5201
>> [ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
>> [  4]   0.00-1.00   sec  14.6 MBytes   122 Mbits/sec    0   96.2 KBytes       
>> [  4]   1.00-2.00   sec  14.1 MBytes   118 Mbits/sec    0   96.2 KBytes       
>> [  4]   2.00-3.00   sec  14.0 MBytes   118 Mbits/sec    0    102 KBytes       
>> [  4]   3.00-4.00   sec  14.0 MBytes   117 Mbits/sec    0    102 KBytes       
>> [  4]   4.00-5.00   sec  13.7 MBytes   115 Mbits/sec    0    107 KBytes       
>> - - - - - - - - - - - - - - - - - - - - - - - - -
>> [ ID] Interval           Transfer     Bandwidth       Retr
>> [  4]   0.00-5.00   sec  70.5 MBytes   118 Mbits/sec    0             sender
>> [  4]   0.00-5.00   sec  69.9 MBytes   117 Mbits/sec                  receiver
>> iperf Done.
>> 
>> So rats, throughput gets brutalized for both C and Go. For Go, a rate of 117 Mbit with a 1500 byte MTU is 9750 packets/sec, which is 103us / packet. Mean induced latency measured by ping is 114us, which is not far off 103us, so the rate slowdown looks to be mostly caused by the per-packet nfqueue calls. The core running test_nfqueue is pinned at 100% during the test. "nice -n -20" does nothing.
>> 
>> Presumably you’ll sometimes be releasing more than one packet at a time(?) so I guess whether or not this is workable depends on how many you release at once, what hardware you’re on and what rates you need to test at. But when you’re trying to test a qdisc, I guess you’d want to minimize the burden you add to the CPU, or else move it to a core the qdisc isn’t running on, or something, so the qdisc itself isn’t affected by the test rig.
>> 
>>> There is of course a hideous amount of complexity moved to the daemon,  
>> 
>> I can only imagine.
>> 
>>> as a pure fifo ap queue forms aggregregates much differently
>>> than a fq_codeled one. But, yea! userspace....  
>> 
>> This would be awesome if it works out! After that iperf3 test though, I think I may have smashed my dreams of writing a libnetfilter_queue userspace qdisc in Go, or C for that matter.
>> 
>> If this does somehow turn out to be good enough performance-wise, I think you’d have a lot more fun and spend a lot less time on it in Go than C, but that’s just an opinion... :)
>> 
> 
> 
> 
> -- 
> Best regards,
>  Jesper Dangaard Brouer
>  MSc.CS, Principal Kernel Engineer at Red Hat
>  LinkedIn: http://www.linkedin.com/in/brouer

[-- Attachment #2: Type: text/html, Size: 30504 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Make-wifi-fast] emulating wifi better - coupling qdiscs in netem?
  2018-06-17 15:16         ` Pete Heist
@ 2018-06-17 16:09           ` Dave Taht
  2018-06-17 18:38             ` Pete Heist
  0 siblings, 1 reply; 23+ messages in thread
From: Dave Taht @ 2018-06-17 16:09 UTC (permalink / raw)
  To: Pete Heist
  Cc: Jesper Dangaard Brouer, Florian Westphal, Marek Majkowski,
	Make-Wifi-fast

Regrettably, I consider most of the gains of GRO/GSO to be illusory for
a router - in my ideal world packets are well distributed - and, as with
Jesper's 40+GigE work, I care a lot (as a simulation baseline) about the
overhead of a single ACK packet in the wifi case.

I'm pleased that these are apu2 results, I have some hope that a
heftier box can do better.

And these are not using the bulk verdict facility?

Still, it seems like continuing to try to couple qdiscs in netem inside
the kernel would land us in the 6-10us range. If only...



On Sun, Jun 17, 2018 at 8:16 AM, Pete Heist <pete@heistp.net> wrote:
> Hi Jesper/Florian, thanks for noticing that, not surprisingly it doesn’t
> change the ping results much, but it improves throughput a lot (now only
> ~20% less than without nfqueue):
>
> root@lsrv:~# iperf3 -t 5 -c 10.182.122.11
> Connecting to host 10.182.122.11, port 5201
> [  4] local 10.182.122.1 port 55936 connected to 10.182.122.11 port 5201
> [ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
> [  4]   0.00-1.00   sec   375 MBytes  3.14 Gbits/sec  173    372 KBytes
> [  4]   1.00-2.00   sec   365 MBytes  3.06 Gbits/sec  316    382 KBytes
> [  4]   2.00-3.00   sec   372 MBytes  3.13 Gbits/sec  368    427 KBytes
> [  4]   3.00-4.00   sec   364 MBytes  3.05 Gbits/sec  137    402 KBytes
> [  4]   4.00-5.00   sec   364 MBytes  3.05 Gbits/sec  342    382 KBytes
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bandwidth       Retr
> [  4]   0.00-5.00   sec  1.80 GBytes  3.09 Gbits/sec  1336             sender
> [  4]   0.00-5.00   sec  1.79 GBytes  3.08 Gbits/sec                  receiver
> iperf Done.
>
> I don’t know if/how the use of GSO affects Dave’s simulation work, but I’ll
> leave that to him. I only wanted to contribute a quick evaluation. :)
>
> Pete
>



-- 

Dave Täht
CEO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-669-226-2619

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Make-wifi-fast] emulating wifi better - coupling qdiscs in netem?
  2018-06-17 16:09           ` Dave Taht
@ 2018-06-17 18:38             ` Pete Heist
  2018-06-17 18:47               ` Jonathan Morton
  2018-06-17 20:42               ` Dave Taht
  0 siblings, 2 replies; 23+ messages in thread
From: Pete Heist @ 2018-06-17 18:38 UTC (permalink / raw)
  To: Dave Taht
  Cc: Jesper Dangaard Brouer, Florian Westphal, Marek Majkowski,
	Make-Wifi-fast


> On Jun 17, 2018, at 6:09 PM, Dave Taht <dave.taht@gmail.com> wrote:
> 
> I'm pleased that these are apu2 results, I have some hope that a
> heftier box can do better.

I suspect 10-20x is possible on more modern, desktop class hardware.

> And these are not using the bulk verdict facility?

It looks to me like they're not. nfq_nlmsg_verdict_put is called with a single ID, and I see no references to batch verdict support there. Presumably there are significant gains to be had there.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Make-wifi-fast] emulating wifi better - coupling qdiscs in netem?
  2018-06-17 18:38             ` Pete Heist
@ 2018-06-17 18:47               ` Jonathan Morton
  2018-06-18  9:24                 ` Pete Heist
  2018-06-17 20:42               ` Dave Taht
  1 sibling, 1 reply; 23+ messages in thread
From: Jonathan Morton @ 2018-06-17 18:47 UTC (permalink / raw)
  To: Pete Heist; +Cc: Dave Taht, Make-Wifi-fast, Marek Majkowski, Florian Westphal

[-- Attachment #1: Type: text/plain, Size: 402 bytes --]

On Sun, 17 Jun 2018 at 9:38 pm, Pete Heist <pete@heistp.net> wrote:
>
>
> > On Jun 17, 2018, at 6:09 PM, Dave Taht <dave.taht@gmail.com> wrote:
> >
> > I'm pleased that these are apu2 results, I have some hope that a
> > heftier box can do better.
>
> I suspect 10-20x is possible on more modern, desktop class hardware.

If you have instructions for setting up a test, I could try it.

- Jonathan Morton

[-- Attachment #2: Type: text/html, Size: 620 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Make-wifi-fast] emulating wifi better - coupling qdiscs in netem?
  2018-06-17 18:38             ` Pete Heist
  2018-06-17 18:47               ` Jonathan Morton
@ 2018-06-17 20:42               ` Dave Taht
  2018-06-18  1:02                 ` Eric Dumazet
  1 sibling, 1 reply; 23+ messages in thread
From: Dave Taht @ 2018-06-17 20:42 UTC (permalink / raw)
  To: Pete Heist
  Cc: Jesper Dangaard Brouer, Florian Westphal, Marek Majkowski,
	Make-Wifi-fast

On Sun, Jun 17, 2018 at 11:38 AM, Pete Heist <pete@heistp.net> wrote:
>
>> On Jun 17, 2018, at 6:09 PM, Dave Taht <dave.taht@gmail.com> wrote:
>>
>> I'm pleased that these are apu2 results, I have some hope that a
>> heftier box can do better.
>
> I suspect 10-20x is possible on more modern, desktop class hardware.

I appreciate the optimism, but it's context switch time that dominates
here, and that does not scale.

>> And these are not using the bulk verdict facility?
>
> It looks to me like they're not. nfq_nlmsg_verdict_put is called with a single ID, and I see no references to batch verdict support there. Presumably there are significant gains to be had there.

But: I have high hopes for the batch verdict capability. We issue a
verdict for one batch, then wait the time it would actually take to
deliver a simulated wifi aggregate (minus the projected cpu latency),
then release the next batch. This seems to me to let us overlap, and
thus almost entirely ignore, the bus arbitration step.
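
A rough sketch of that release loop, in C against the legacy
libnetfilter_queue API (untested; BATCH, aggregate_airtime_ns(),
cpu_latency_ns and sleep_ns() are placeholders for an airtime model
that does not exist yet):

#include <stdint.h>
#include <arpa/inet.h>
#include <linux/netfilter.h>
#include <libnetfilter_queue/libnetfilter_queue.h>

#define BATCH 42                                 /* placeholder batch size */

extern uint64_t aggregate_airtime_ns(int pkts);  /* hypothetical airtime model */
extern uint64_t cpu_latency_ns;                  /* projected processing cost  */
extern void sleep_ns(uint64_t ns);               /* e.g. via clock_nanosleep() */

static int batched;

/* Withhold verdicts until a whole "aggregate" has been collected, wait
 * out its simulated airtime, then accept the batch in one go.  A real
 * version would also flush on a timer so sparse traffic is not stalled. */
static int cb(struct nfq_q_handle *qh, struct nfgenmsg *nfmsg,
              struct nfq_data *nfa, void *data)
{
    struct nfqnl_msg_packet_hdr *ph = nfq_get_msg_packet_hdr(nfa);
    uint32_t id = ntohl(ph->packet_id);

    if (++batched < BATCH)
        return 0;                                /* no verdict yet: stays queued */

    sleep_ns(aggregate_airtime_ns(batched) - cpu_latency_ns);
    batched = 0;
    return nfq_set_verdict_batch(qh, id, NF_ACCEPT);
}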

I'll get on this myself after I'm done with the cake talk next week.
Thx for the thoughts and evaluations!

(still wish I could do it in-kernel)
-- 

Dave Täht
CEO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-669-226-2619

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Make-wifi-fast] emulating wifi better - coupling qdiscs in netem?
  2018-06-16 22:53     ` Pete Heist
  2018-06-17 11:19       ` Jesper Dangaard Brouer
@ 2018-06-18  0:59       ` Eric Dumazet
  1 sibling, 0 replies; 23+ messages in thread
From: Eric Dumazet @ 2018-06-18  0:59 UTC (permalink / raw)
  To: make-wifi-fast



On 06/16/2018 03:53 PM, Pete Heist wrote:
 
> The mean induced latency of 65us for C or 114us for Go might be within your parameters, except you mentioned 10us for WiFi bus arbitration, which does indeed look impossible with this setup, even in C.
>

You guys have to learn about busy polling, to keep the scheduler out of the picture.

Basically, do not ever block in a system call; dedicate one thread/cpu to this netem stuff.
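
A minimal sketch of the "dedicate one thread/cpu" part, in C (roughly
what "chrt -rr 99" plus CPU pinning does from the shell; poll_loop() is
a placeholder for whatever reads the nfqueue fd):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

extern void *poll_loop(void *arg);              /* hypothetical busy-poll loop */

static pthread_t start_pinned_rt_thread(int cpu)
{
    pthread_t tid;
    pthread_attr_t attr;
    struct sched_param sp = { .sched_priority = 99 };
    cpu_set_t cpus;

    CPU_ZERO(&cpus);
    CPU_SET(cpu, &cpus);                        /* dedicate one core to it */

    pthread_attr_init(&attr);
    pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
    pthread_attr_setschedpolicy(&attr, SCHED_RR);   /* like chrt -rr */
    pthread_attr_setschedparam(&attr, &sp);
    pthread_attr_setaffinity_np(&attr, sizeof(cpus), &cpus);

    pthread_create(&tid, &attr, poll_loop, NULL);
    return tid;
}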


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Make-wifi-fast] emulating wifi better - coupling qdiscs in netem?
  2018-06-17 20:42               ` Dave Taht
@ 2018-06-18  1:02                 ` Eric Dumazet
  0 siblings, 0 replies; 23+ messages in thread
From: Eric Dumazet @ 2018-06-18  1:02 UTC (permalink / raw)
  To: make-wifi-fast



On 06/17/2018 01:42 PM, Dave Taht wrote:
> On Sun, Jun 17, 2018 at 11:38 AM, Pete Heist <pete@heistp.net> wrote:
>>
>>> On Jun 17, 2018, at 6:09 PM, Dave Taht <dave.taht@gmail.com> wrote:
>>>
>>> I'm pleased that these are apu2 results, I have some hope that a
>>> heftier box can do better.
>>
>> I suspect 10-20x is possible on more modern, desktop class hardware.
> 
> I appreciate the optimism, but it's context switch time that dominates
> here which does not scale.

Hey, do not use context switches if you want low latencies on something that will
be used for testing purposes.




^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Make-wifi-fast] emulating wifi better - coupling qdiscs in netem?
  2018-06-17 18:47               ` Jonathan Morton
@ 2018-06-18  9:24                 ` Pete Heist
  2018-06-18 16:08                   ` Eric Dumazet
  0 siblings, 1 reply; 23+ messages in thread
From: Pete Heist @ 2018-06-18  9:24 UTC (permalink / raw)
  To: Jonathan Morton; +Cc: Dave Taht, Make-Wifi-fast

[-- Attachment #1: Type: text/plain, Size: 1368 bytes --]


> On Jun 17, 2018, at 8:47 PM, Jonathan Morton <chromatix99@gmail.com> wrote:
> If you have instructions for setting up a test, I could try it.
> 

Ok, thanks for that, code and scripts are attached, see README.txt.

I now use plain netns (no lxc containers), which is easier to set up and has lower RTTs and higher throughputs, probably because there is no bridge device.

Per Eric’s tip, the nfq no-op code is run with chrt -rr 99, which reduces RTTs somewhat and increases throughputs ~2-3x. No busy polling yet.

I also tried it on VMWare with a 2011 MBP, which looks radically better than the APU2 (for nfq, RTTs ~16% of APU2 and throughput 4x higher in the non-GSO case, 11x with GSO). Results attached, and to summarize:

ping mean (min-max) RTTs:

APU2, no nfq: 35 us (23-291)
APU2, nfq without GSO: 80 us (53-288)
APU2, nfq with GSO: 85 us (56-270)
2011 MBP, no nfq: 4 us (4-529) [11% of APU2]
2011 MBP, nfq without GSO: 13 us (11-197) [16% of APU2]
2011 MBP, nfq with GSO: 14 us (11-1568) [16% of APU2]

iperf3 throughputs (Gbps):

APU2, no nfq: 5.01 Gbps
APU2, nfq without GSO: 391 Mbps
APU2, nfq with GSO: 3.35 Gbps
2011 MBP, no nfq: 39.8 Gbps [7.9x APU2]
2011 MBP, nfq without GSO: 1.48 Gbps [3.8x APU2]
2011 MBP, nfq with GSO: 38.0 Gbps [11.3x APU2]

Results from a decent physical box instead of a VM may be interesting to see.

[-- Attachment #2.1: Type: text/html, Size: 2466 bytes --]

[-- Attachment #2.2: nfqtest.tar.gz --]
[-- Type: application/x-gzip, Size: 5226 bytes --]

[-- Attachment #2.3: Type: text/html, Size: 233 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Make-wifi-fast] emulating wifi better - coupling qdiscs in netem?
  2018-06-18  9:24                 ` Pete Heist
@ 2018-06-18 16:08                   ` Eric Dumazet
  2018-06-18 19:33                     ` Pete Heist
  0 siblings, 1 reply; 23+ messages in thread
From: Eric Dumazet @ 2018-06-18 16:08 UTC (permalink / raw)
  To: make-wifi-fast



On 06/18/2018 02:24 AM, Pete Heist wrote:
> 
>> On Jun 17, 2018, at 8:47 PM, Jonathan Morton <chromatix99@gmail.com> wrote:
>>
>> If you have instructions for setting up a test, I could try it.
>>
> Ok, thanks for that, code and scripts are attached, see README.txt.
> 
> I now use plain netns (no lxc containers), which is easier to set up and has lower RTTs and higher throughputs, probably due to no bridge device.
> 
> Per Eric’s tip, the nfq no-op code is run with chrt -rr 99, which reduces RTTs somewhat and increases throughputs ~2-3x. No busy polling yet.
> 
> I also tried it on VMWare with a 2011 MBP, which looks radically better than the APU2 (for nfq, RTTs ~16% of APU2 and throughput 4x higher in the non-GSO case, 11x with GSO). Results attached, and to summarize:
> 
> ping mean (min-max) RTTs:
> 
> APU2, no nfq: 35 us (23-291)
> APU2, nfq without GSO: 80 us (53-288)
> APU2, nfq with GSO: 85 us (56-270)
> 2011 MBP, no nfq: 4 us (4-529) [11% of APU2]
> 2011 MBP, nfq without GSO: 13 us (11-197) [16% of APU2]
> 2011 MBP, nfq with GSO: 14 us (11-1568) [16% of APU2]
> 
> iperf3 throughputs (Gbps):
> 
> APU2, no nfq: 5.01 Gbps
> APU2, nfq without GSO: 391 Mbps
> APU2, nfq with GSO: 3.35 Gbps
> 2011 MBP, no nfq: 39.8 Gbps [7.9x APU2]
> 2011 MBP, nfq without GSO: 1.48 Gbps [3.8x APU2]
> 2011 MBP, nfq with GSO: 38.0 Gbps [11.3x APU2]
> 
> Results from a decent physical box instead of a VM may be interesting to see.


If all you want to achieve is to delay packets given a set of rules, you do not need their contents.

Copying the full packets to user space is not needed; you are not doing deep packet inspection.

Really, the throughput should be the same; only latencies should be somewhat increased:

nfq_set_mode(qh, NFQNL_COPY_PACKET, 0xffff)

->

nfq_set_mode(qh, NFQNL_COPY_PACKET, 128); // assuming you want to inspect headers
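
For reference, the whole legacy-API setup is only a few calls; a sketch
with the truncated copy range, plus the flag that asks the kernel not to
software-segment GSO skbs (untested, error handling omitted; cb is the
packet callback from the samples, and nfq_set_queue_flags() needs a
reasonably recent libnetfilter_queue):

#include <sys/socket.h>
#include <linux/netfilter/nfnetlink_queue.h>
#include <libnetfilter_queue/libnetfilter_queue.h>

extern int cb(struct nfq_q_handle *, struct nfgenmsg *,
              struct nfq_data *, void *);

static struct nfq_q_handle *setup_queue(struct nfq_handle **hp)
{
    struct nfq_handle *h = nfq_open();
    struct nfq_q_handle *qh;

    nfq_bind_pf(h, AF_INET);
    qh = nfq_create_queue(h, 0, &cb, NULL);      /* matches --queue-num 0 */

    /* copy only the first 128 bytes to userspace: enough for headers,
     * no payload copy, so throughput should stay close to line rate */
    nfq_set_mode(qh, NFQNL_COPY_PACKET, 128);

    /* hand us unsegmented GSO skbs instead of splitting them first
     * (the NFQA_CFG_F_GSO flag mentioned earlier in the thread) */
    nfq_set_queue_flags(qh, NFQA_CFG_F_GSO, NFQA_CFG_F_GSO);

    *hp = h;
    return qh;
}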



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Make-wifi-fast] emulating wifi better - coupling qdiscs in netem?
  2018-06-18 16:08                   ` Eric Dumazet
@ 2018-06-18 19:33                     ` Pete Heist
  2018-06-18 19:44                       ` Dave Taht
  0 siblings, 1 reply; 23+ messages in thread
From: Pete Heist @ 2018-06-18 19:33 UTC (permalink / raw)
  To: Make-Wifi-fast

[-- Attachment #1: Type: text/plain, Size: 868 bytes --]


> On Jun 18, 2018, at 6:08 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> nfq_set_mode(qh, NFQNL_COPY_PACKET, 0xffff)
> 
> ->
> 
> nfq_set_mode(qh, NFQNL_COPY_PACKET, 128); // assuming you want to inspect headers


Thanks for that. I see flat RTTs, and a sometimes significant increase in throughputs. Unexpectedly, nfq with GSO throughputs are higher than without nfq at all.

ping mean (min-max) RTTs:

APU2, nfq without GSO: 80 us -> 82 us
APU2, nfq with GSO: 85 us -> 83 us
2011 MBP, nfq without GSO: 13 us -> 14 us
2011 MBP, nfq with GSO: 14 us -> 13 us

iperf3 throughputs:

APU2, nfq without GSO: 391 -> 415 Mbps
APU2, nfq with GSO: 3.35 -> 6.07 Gbps [higher than no nfqueue, 5.55 -> 6.07]
2011 MBP, nfq without GSO: 1.48 Gbps -> 2.73 Gbps
2011 MBP, nfq with GSO: 38.0 Gbps -> 45.5 Gbps [higher than no nfqueue, 39.2 -> 45.5]


[-- Attachment #2.1: Type: text/html, Size: 4455 bytes --]

[-- Attachment #2.2: nfqtest.tar.gz --]
[-- Type: application/x-gzip, Size: 6112 bytes --]

[-- Attachment #2.3: Type: text/html, Size: 273 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Make-wifi-fast] emulating wifi better - coupling qdiscs in netem?
  2018-06-18 19:33                     ` Pete Heist
@ 2018-06-18 19:44                       ` Dave Taht
  2018-06-18 21:54                         ` Pete Heist
  0 siblings, 1 reply; 23+ messages in thread
From: Dave Taht @ 2018-06-18 19:44 UTC (permalink / raw)
  To: Pete Heist; +Cc: Make-Wifi-fast

On Mon, Jun 18, 2018 at 12:33 PM, Pete Heist <pete@heistp.net> wrote:
>
> On Jun 18, 2018, at 6:08 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>
> nfq_set_mode(qh, NFQNL_COPY_PACKET, 0xffff)
>
> ->
>
> nfq_set_mode(qh, NFQNL_COPY_PACKET, 128); // assuming you want to inspect
> headers
>
>
> Thanks for that. I see flat RTTs, and a sometimes significant increase in
> throughputs. Unexpectedly, nfq with GSO throughputs are higher than without
> nfq at all.
>
> ping mean (min-max) RTTs:
>
> APU2, nfq without GSO: 80 us -> 82 us
> APU2, nfq with GSO: 85 us -> 83 us
> 2011 MBP, nfq without GSO: 13 us -> 14 us
> 2011 MBP, nfq with GSO: 14 us -> 13 us
>
> iperf3 throughputs:
>
> APU2, nfq without GSO: 391 -> 415 Mbps
> APU2, nfq with GSO: 3.35 -> 6.07 Gbps [higher than no nfqueue, 5.55 -> 6.07]
> 2011 MBP, nfq without GSO: 1.48 Gbps -> 2.73 Gbps
> 2011 MBP, nfq with GSO: 38.0 Gbps -> 45.5 Gbps [higher than no nfqueue, 39.2
> -> 45.5]

This is still without batch releases, yes?

In any case, the now achieved rates and latencies seem sufficient to
try and adapt these methods to emulating wifi/lte etc better! We only
need to get to a gbit. Obviously doing more expensive userspace
processing is going to hurt, and, well, for the sake of argument
emulating a 32 station wifi 802.11n network would be proof of the
pudding, but I'd settle for even the simplest case of one ap and two
stations
actually rendering sane-looking behavior.

Originally, when thinking about this, I'd thought we'd use one veth
per station and toss packets to userspace based on one nfqueue per
input/output interface. I still lean that way (do we get multicast mac
addrs on packets this way?), but perhaps a single interface could be
used and we could
sort out the src/dst ips and batching in userspace, starting with
fifos to represent current behavior and gradually working our way back
up to the fq_codel on wifi emulation. Or, with one veth per station,
still use a fq_codel qdisc, but I don't see how we can create
backpressure for that actually to engage.

Better to be reordering the verdict on packets in the batch for an
fq_codel emulation. I think.
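
The per-station-queue shape could be fairly small with the legacy API -
one handle, one fd, one callback context per station (sketch only;
NSTATIONS and station_cb() are placeholders, and the iptables side would
steer each veth to its own --queue-num):

#include <libnetfilter_queue/libnetfilter_queue.h>

#define NSTATIONS 3                     /* e.g. one AP plus two stations */

extern int station_cb(struct nfq_q_handle *, struct nfgenmsg *,
                      struct nfq_data *, void *);

static int station_idx[NSTATIONS];      /* per-queue callback context */

static void setup_station_queues(struct nfq_handle *h)
{
    for (int i = 0; i < NSTATIONS; i++) {
        station_idx[i] = i;
        /* queue-num i <-> the veth for station i (iptables -j NFQUEUE) */
        struct nfq_q_handle *qh =
            nfq_create_queue(h, i, &station_cb, &station_idx[i]);
        nfq_set_mode(qh, NFQNL_COPY_PACKET, 128);
    }
    /* all the queues share one fd, nfq_fd(h), so a single dedicated
     * poll loop can arbitrate "airtime" across stations in userspace */
}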






-- 

Dave Täht
CEO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-669-226-2619

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Make-wifi-fast] emulating wifi better - coupling qdiscs in netem?
  2018-06-18 19:44                       ` Dave Taht
@ 2018-06-18 21:54                         ` Pete Heist
  2018-06-18 22:27                           ` Eric Dumazet
  0 siblings, 1 reply; 23+ messages in thread
From: Pete Heist @ 2018-06-18 21:54 UTC (permalink / raw)
  To: Dave Taht; +Cc: Make-Wifi-fast

[-- Attachment #1: Type: text/plain, Size: 2558 bytes --]


> On Jun 18, 2018, at 9:44 PM, Dave Taht <dave.taht@gmail.com> wrote:
> 
> This is still without batch releases, yes?

Yes, I should've tried that earlier, but I’m scratching my head now as to how it works. Perhaps it’s because the old example I’m using for the non-GSO case uses deprecated functions and I ought to just ditch it, but I thought if in my callback I just switched:

return nfq_set_verdict(qh, id, NF_ACCEPT, 0, NULL);

to

return nfq_set_verdict_batch(qh, id + 8, NF_ACCEPT);

that my callback might not be called for the subsequent 8 packets I’ve accepted. However, it continues to be called for each id sequentially anyway, and throughput is no better. If I change 8 to something unreasonable, like 1000000, throughput is cut in half, so it’s doing “something”.
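
If I read the kernel side right (not verified), a batch verdict only
covers packets that are already sitting in the queue with ids up to the
one given, so issuing it from every callback with id + 8 has nothing
extra to act on - those packets have not arrived yet. The gain would
have to come from withholding verdicts and only issuing one every Nth
callback, roughly (fragment only, N and held are placeholders):

if (++held < N)
        return 0;                /* no verdict: packet stays queued */
held = 0;
return nfq_set_verdict_batch(qh, ntohl(ph->packet_id), NF_ACCEPT);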

There are functions in the newer GSO example like nfq_nlmsg_verdict_put, but I don’t see a batch version of that. So, I’m likely missing something…

BTW I don’t see any change when setting SO_BUSY_POLL on nfq’s fd (tried 1000 - 1000000 usec).

> In any case, the now achieved rates and latencies seem sufficient to
> try and adapt these methods to emulating wifi/lte etc better! We only
> need to get to a gbit.

Indeed, it’s there. :)

> Obviously doing more expensive userspace
> processing is going to hurt, and, well, for the sake of argument
> emulating a 32 station wifi 802.11n network would be proof of the
> pudding, but I'd settle for even the simplest case of one ap and two
> stations
> actually rendering sane-looking behavior.
> Originally, when thinking about this, I'd thought we'd use one veth
> per station and toss packets to userspace based on one nfqueue per
> input/output interface. I still lean that way (do we get multicast mac
> addrs on packets this way?), but perhaps a single interface could be
> used and we could
> sort out the src/dst ips and batching in userspace, starting with
> fifos to represent current behavior and gradually working our way back
> up to the fq_codel on wifi emulation. Or, with one veth per station,
> still use a fq_codel qdisc, but I don't see how we can create
> backpressure for that actually to engage.
> 
> Better to be reordering the verdict on packets in the batch for an
> fq_codel emulation. I think.

Is it worth measuring the aggregate throughput of 32 iperf3 client veth devices to one server device?

Is it worth trying to get the newer code into Go? I may have to start over without the wrapper and just write something simpler based on the newer code.


[-- Attachment #2: Type: text/html, Size: 18349 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Make-wifi-fast] emulating wifi better - coupling qdiscs in netem?
  2018-06-18 21:54                         ` Pete Heist
@ 2018-06-18 22:27                           ` Eric Dumazet
  0 siblings, 0 replies; 23+ messages in thread
From: Eric Dumazet @ 2018-06-18 22:27 UTC (permalink / raw)
  To: make-wifi-fast



On 06/18/2018 02:54 PM, Pete Heist wrote:
> 
>> On Jun 18, 2018, at 9:44 PM, Dave Taht <dave.taht@gmail.com> wrote:
>>
>> This is still without batch releases, yes?
> 
> Yes, I should've tried that earlier, but I’m scratching my head now as to how it works. Perhaps it’s because the old example I’m using for the non-GSO case uses deprecated functions and I ought to just ditch it, but I thought if in my callback I just switched:
> 
> return nfq_set_verdict(qh, id, NF_ACCEPT, 0, NULL);
> 
> to
> 
> return nfq_set_verdict_batch(qh, id + 8, NF_ACCEPT);
> 
> that my callback might not be called for the subsequent 8 packets I’ve accepted. However, it continues to be called for each id sequentially anyway, and throughput is no better. If I change 8 to something unreasonable, like 1000000, throughput is cut in half, so it’s doing “something”.
> 
> There are functions in the newer GSO example like nfq_nlmsg_verdict_put, but I don’t see a batch version of that. So, I’m likely missing something…
> 
> BTW I don’t see any change when setting SO_BUSY_POLL on nfq’s fd (tried 1000 - 1000000 usec).
>

Busy polling does not require SO_BUSY_POLL.

Usually, we simply use non-blocking system calls in a big loop
(nl_socket_get_fd(), then set the fd to O_NDELAY / non-blocking mode).

SO_BUSY_POLL is a way to directly call the NAPI handler of the device (or more precisely RX queue)
feeding packets. This saves the hard interrupt latency.

For NFQUEUE, that would require a bit of plumbing I guess.

Each NF Queue would have to record (in the kernel) the NAPI id of intercepted packets.
-> A bit complicated since the number of RX queues on a NIC/host is quite variable.
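
A sketch of that loop shape for the nfqueue fd (legacy API, so nfq_fd()
rather than nl_socket_get_fd(); untested):

#include <fcntl.h>
#include <sys/socket.h>
#include <libnetfilter_queue/libnetfilter_queue.h>

/* Busy-poll style: never block in recv(), just spin on the fd from a
 * dedicated core.  h is the nfq_handle from nfq_open(). */
static void busy_loop(struct nfq_handle *h)
{
    char buf[65536];
    int fd = nfq_fd(h);

    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);

    for (;;) {
        ssize_t n = recv(fd, buf, sizeof(buf), 0);
        if (n >= 0)
            nfq_handle_packet(h, buf, (int)n);   /* dispatch to callbacks */
        /* else EAGAIN: nothing queued, spin again */
    }
}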


^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2018-06-18 22:27 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-05-30 17:28 [Make-wifi-fast] emulating wifi better - coupling qdiscs in netem? Dave Taht
2018-05-30 18:32 ` Bob McMahon
2018-05-30 18:54   ` Dave Taht
2018-05-30 18:58     ` Jonathan Morton
2018-05-30 19:19     ` Bob McMahon
2018-05-30 23:26       ` Dave Taht
2018-05-30 22:57 ` dpreed
2018-06-15 22:30   ` Dave Taht
2018-06-16 22:53     ` Pete Heist
2018-06-17 11:19       ` Jesper Dangaard Brouer
2018-06-17 15:16         ` Pete Heist
2018-06-17 16:09           ` Dave Taht
2018-06-17 18:38             ` Pete Heist
2018-06-17 18:47               ` Jonathan Morton
2018-06-18  9:24                 ` Pete Heist
2018-06-18 16:08                   ` Eric Dumazet
2018-06-18 19:33                     ` Pete Heist
2018-06-18 19:44                       ` Dave Taht
2018-06-18 21:54                         ` Pete Heist
2018-06-18 22:27                           ` Eric Dumazet
2018-06-17 20:42               ` Dave Taht
2018-06-18  1:02                 ` Eric Dumazet
2018-06-18  0:59       ` Eric Dumazet

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox