This is still without batch releases, yes?
Yes. I should've tried that earlier, but I’m scratching my head now as to how it works. Perhaps it’s because the old example I’m using for the non-GSO case uses deprecated functions and I ought to just ditch it, but I thought if, in my callback, I just switched:
return nfq_set_verdict(qh, id, NF_ACCEPT, 0, NULL);
to
return nfq_set_verdict_batch(qh, id + 8, NF_ACCEPT);
that my callback might not be called for the next 8 packets I’ve accepted. However, it continues to be called for each id sequentially anyway, and throughput is no better. If I change 8 to something unreasonable, like 1000000, throughput is cut in half, so it’s doing “something”.
There are functions in the newer GSO example like nfq_nlmsg_verdict_put, but I don’t see a batch version of that. So, I’m likely missing something…
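For what it's worth, my reading of the libnetfilter_queue docs is that nfq_set_verdict_batch() applies the verdict to every currently queued packet whose id is less than or equal to the one given. It doesn't stop the kernel from delivering each packet to the callback, so any saving would come from sending fewer verdict messages, not from fewer callbacks. A minimal sketch of that counting logic follows, assuming a hypothetical verdict_batcher helper (not part of the library) and leaving the actual nfq calls as a comment:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper state; nothing here comes from libnetfilter_queue. */
struct verdict_batcher {
    uint32_t every;   /* issue one batch verdict per this many packets */
    uint32_t pending; /* packets seen since the last verdict was issued */
};

/*
 * Record one packet seen by the callback. Returns true when the caller
 * should issue a batch verdict for the id of the packet it is handling.
 */
static bool batcher_note_packet(struct verdict_batcher *b)
{
    assert(b->every > 0);
    if (++b->pending < b->every)
        return false;
    b->pending = 0;
    return true;
}

/*
 * Intended use inside the nfq callback (sketch only, not compiled here):
 *
 *     if (batcher_note_packet(&batcher))
 *         return nfq_set_verdict_batch(qh, id, NF_ACCEPT);
 *     return 0;   // no verdict yet, the packet stays queued in the kernel
 */
```

A real version would presumably also need a timer to flush a partial batch, so the tail of a burst isn't left waiting, and a queue length (nfq_set_queue_maxlen) comfortably larger than the batch size.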
BTW, I don’t see any change from setting SO_BUSY_POLL on nfq’s fd (tried 1000 to 1000000 usec).
In any case, the rates and latencies now achieved seem sufficient to
try adapting these methods to emulating wifi/lte etc better! We only
need to get to a gbit.
Indeed, it’s there. :)
Obviously doing more expensive userspace
processing is going to hurt, and, well, for the sake of argument
emulating a 32 station wifi 802.11n network would be proof of the
pudding, but I'd settle for even the simplest case of one ap and two
stations actually rendering sane-looking behavior.
Originally, when thinking about this, I'd thought we'd use one veth
per station and toss packets to userspace based on one nfqueue per
input/output interface. I still lean that way (do we get multicast mac
addrs on packets this way?), but perhaps a single interface could be
used and we could sort out the src/dst ips and batching in userspace,
starting with
fifos to represent current behavior and gradually working our way back
up to the fq_codel on wifi emulation. Or, with one veth per station,
still use a fq_codel qdisc, but I don't see how we can create
backpressure for that actually to engage.
Better to reorder the verdicts on packets within the batch for an
fq_codel emulation, I think.
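As a starting point for the userspace side, here is a rough sketch of holding packet ids in per-station FIFOs and choosing the release order in userspace. It relies on verdicts being issued per id with nfq_set_verdict(), so the order of release need not match the order of arrival; a batch verdict would release everything up to an id and so wouldn't mix with reordering. The emulator and station_fifo types, the sizes, and the plain round-robin pick are all assumptions for illustration. This is not fq_codel, just the simplest scheduler that reorders across stations:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_STATIONS 32
#define FIFO_DEPTH   64

/* Hypothetical per-station queue of held nfqueue packet ids. */
struct station_fifo {
    uint32_t ids[FIFO_DEPTH];
    size_t head; /* index of the oldest held id */
    size_t len;  /* number of ids currently held */
};

struct emulator {
    struct station_fifo sta[MAX_STATIONS];
    size_t nsta; /* number of stations in use */
    size_t next; /* station to try first on the next release */
};

/* Hold a packet id for a station. Returns 0, or -1 if that FIFO is full
 * (the caller would then drop the packet, e.g. with NF_DROP). */
static int emu_enqueue(struct emulator *e, size_t station, uint32_t id)
{
    assert(station < e->nsta);
    struct station_fifo *f = &e->sta[station];
    if (f->len == FIFO_DEPTH)
        return -1;
    f->ids[(f->head + f->len) % FIFO_DEPTH] = id;
    f->len++;
    return 0;
}

/* Pick the next id to release, round-robin across non-empty stations.
 * Returns 1 and stores the id, or 0 if nothing is held. */
static int emu_dequeue(struct emulator *e, uint32_t *id)
{
    for (size_t i = 0; i < e->nsta; i++) {
        size_t s = (e->next + i) % e->nsta;
        struct station_fifo *f = &e->sta[s];
        if (f->len == 0)
            continue;
        *id = f->ids[f->head];
        f->head = (f->head + 1) % FIFO_DEPTH;
        f->len--;
        e->next = (s + 1) % e->nsta;
        return 1;
    }
    return 0;
}
```

The callback would call emu_enqueue() and return without a verdict, and a pacing loop would call emu_dequeue() and then nfq_set_verdict(qh, id, NF_ACCEPT, 0, NULL) for each released id. Swapping the round-robin pick for something closer to fq_codel would only touch emu_dequeue().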
Is it worth measuring the aggregate throughput of 32 iperf3 client veth devices to one server device?
Worth trying to get the newer code into Go? I may have to start over without the wrapper and just write something simpler with newer code.