[Bloat] RE: DETNET

Matthias Tafelmeier matthias.tafelmeier at gmx.net
Sun Dec 17 07:46:26 EST 2017

> Well, the trick of running RDMA side by side with TCP inside a datacenter using two DiffServ classes would fit the broad theme of deterministic traffic classes, provided that you use RDMA in just the right way.  You get the highest level of determinism for the reliable one-sided write case, provided that your network only has switches and no routers in it (so in a Clos network, rack-scale cases, or perhaps racks with one ToR switch, but not the leaf and spine routers).  The reason for this is that with routers you can have resource delays (RDMA sends only with permission, in the form of credits).  Switches always allow sending, and have full bisection bandwidth, and in this specific configuration of RDMA, the receiver grants permission at the time the one-sided receive buffer is registered, so after that setup, the delays will be a function of (1) traffic on the sender NIC, (2) traffic on the receiver NIC, (3) queue priorities, when there are multiple queues sharing one NIC.
> Other sources of non-determinism for hardware RDMA would include limited resources within the NIC itself.  An RDMA NIC has to cache the DMA mappings for pages you are using, as well as qpair information for the connected qpairs.  The DMA mapping itself has a two-level structure.  So there are three kinds of caches, and each of these can become overfull.  If that happens, the needed mapping is fetched from host memory, but this evicts data, so you can see a form of cache-miss-thrashing occur in which performance will degrade sharply.  Derecho avoids putting too much pressure on these NIC resources, but some systems accidentally overload one cache or another and then they see performance collapse as they scale.
> But you can control for essentially all of these factors.
> You would then only see non-determinism to the extent that your application triggers it, through paging, scheduling effects, poor memory allocation area affinities (e.g. core X allocates block B, but then core Y tries to read or write into it), locking, etc.  Those effects can be quite large.  Getting Derecho to run at the full 100Gbps network rates was really hard because of issues of these kinds -- and there are more and more papers reporting similar issues for Linux and Mesos as a whole.  Copying will also kill performance: 100Gbps is faster than memcpy for a large, non-cached object.  So even a single copy operation, or a single checksum computation, can actually turn out to be by far the bottleneck -- and can be a huge source of non-determinism if you trigger this but only now and then, as with a garbage collected language.
> Priority inversions are another big issue, at the OS thread level or in threaded applications.  What happens with this case is that you have a lock and accidentally end up sharing it between a high priority thread (like an RDMA NIC, which acts like a super-thread with the highest possible priority), and a lower priority thread (like any random application thread).  If the application uses the thread priorities features of Linux/Mesos, this can exacerbate the chance of causing inversions.
> So an inversion would arise if for high priority thread A to do something, like check a qpair for an enabled transfer, a lower priority thread B needs to run (like if B holds the lock but then got preempted).  This is a rare race-condition sort of problem, but when it bites, A gets stuck until B runs.  If C is high priority and doing something like busy-waiting for a doorbell from the RDMA NIC, or for a completion, C prevents B from running, and we get a form of deadlock that can persist until something manages to stall C. Then B finishes, A resumes, etc.  Non-deterministic network delay ensues.
> So those are the kinds of examples I spend a lot of my time thinking about.  The puzzle for me isn't at the lower levels -- RDMA and the routers already have reasonable options.  The puzzle is that the software can introduce tons of non-determinism even at the very lowest kernel or container layers, more or less in the NIC itself or in the driver, or perhaps in memory management and thread scheduling.
> I could actually give more examples that relate to interactions between devices: networks plus DMA into a frame buffer for a video or GPU, for example (in this case the real issue is barriers: how do you know if the cache and other internal pipelines of that device flushed when the transfer into its memory occurred?  Turns out that there is no hardware standard for this, and it might not always be a sure thing).  If they use a sledgehammer solution, like a bus reset (available with RDMA), that's going to have a BIG impact on perceived network determinism...  yet it actually is an end-host "issue", not a network issue.
Nothing to challenge here. For the scheduler part I would only add (you
certainly know this better than I do) that there are quite nifty software
techniques, priority inheritance for one, to all but eradicate at least
the priority-inversion problem. Speaking only for Linux, there is quite
some movement at the moment toward scheduler amendments for network
processing. I am not sure the vendors of embedded variants, or the RT
patch set, have not made the problem extinct already. Though that is not
the point you are making. Moreover, it would still leave the clock source
as an introducer of non-determinism.

A quite valuable research endeavour would be to quantify all of those
traits and make them comparable to the characteristics of certain other
approaches, e.g. the quite promising Linux kernel busy-polling
mechanisms [1]. All of them suffer from similar weaknesses; the open
question is which suffers least. I say that a little briskly, without
having thought through its feasibility for the time being.

[1] https://www.netdevconf.org/2.1/papers/BusyPollingNextGen.pdf

Best regards

Matthias Tafelmeier

