[Bloat] RE: DETNET

Ken Birman kpb3 at cornell.edu
Mon Nov 20 14:04:39 EST 2017


Well, the trick of running RDMA side by side with TCP inside a datacenter using two DiffServ classes would fit the broad theme of deterministic traffic classes, provided that you use RDMA in just the right way.  You get the highest level of determinism for the reliable one-sided write case, provided that your network has only switches and no routers in it (so in a Clos network, rack-scale cases, or perhaps racks with one ToR switch, but not the leaf-and-spine routers).  The reason is that with routers you can have resource delays (RDMA sends only with permission, in the form of credits).  Switches always allow sending and have full bisection bandwidth, and in this specific configuration of RDMA, the receiver grants permission at the time the one-sided receive buffer is registered.  After that setup, the delays are a function of (1) traffic on the sender NIC, (2) traffic on the receiver NIC, and (3) queue priorities, when multiple queues share one NIC.
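
To make the one-sided case concrete, here is a minimal sender-side sketch using libibverbs.  It assumes a connected RC queue pair and that the receiver has already registered its buffer and shipped us (remote_addr, rkey) out of band; qp, pd, buf and the helper name are placeholders, not Derecho code.

/* Hedged sketch: post a one-sided RDMA write over an already-connected RC
 * queue pair.  Assumes the peer registered its buffer earlier and sent us
 * (remote_addr, rkey) out of band. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <stddef.h>

int post_one_sided_write(struct ibv_qp *qp, struct ibv_pd *pd,
                         void *buf, size_t len,
                         uint64_t remote_addr, uint32_t rkey)
{
    /* Register the local source buffer so the NIC can DMA out of it. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
    if (!mr)
        return -1;

    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };

    struct ibv_send_wr wr = {0}, *bad_wr = NULL;
    wr.opcode     = IBV_WR_RDMA_WRITE;      /* one-sided: no receiver CPU involvement */
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.send_flags = IBV_SEND_SIGNALED;      /* request a completion so we can reap it */
    wr.wr.rdma.remote_addr = remote_addr;   /* address the receiver registered */
    wr.wr.rdma.rkey        = rkey;          /* permission granted at registration time */

    return ibv_post_send(qp, &wr, &bad_wr); /* 0 on success */
}

Once this is posted, the only things between us and the completion are the two NICs and the switch queues, which is exactly why this configuration behaves so deterministically.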

Other sources of non-determinism for hardware RDMA include limited resources within the NIC itself.  An RDMA NIC has to cache the DMA mappings for the pages you are using, as well as the qpair state for connected qpairs.  The DMA mapping itself has a two-level structure.  So there are three kinds of caches, and each of them can become overfull.  If that happens, the needed mapping is fetched from host memory, but fetching it evicts other entries, so you can see a form of cache-miss thrashing in which performance degrades sharply.  Derecho avoids putting too much pressure on these NIC resources, but some systems accidentally overload one cache or another and then see performance collapse as they scale.
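
Derecho's internals aren't reproduced here, but as a generic illustration of keeping that cache footprint small, one common pattern is to register a single large slab up front and carve message buffers out of it, rather than registering each buffer separately.  The slab size and the bump allocator below are arbitrary placeholder choices.

/* Hedged sketch: one big registration instead of thousands of small ones,
 * so the NIC has fewer DMA mappings and MRs to cache. */
#include <infiniband/verbs.h>
#include <stdlib.h>

#define SLAB_BYTES (1UL << 30)          /* arbitrary: one 1 GiB slab */

struct slab {
    char          *base;
    size_t         next;
    struct ibv_mr *mr;                  /* a single MR covers every buffer we hand out */
};

static int slab_init(struct slab *s, struct ibv_pd *pd)
{
    if (posix_memalign((void **)&s->base, 4096, SLAB_BYTES))
        return -1;
    s->next = 0;
    s->mr = ibv_reg_mr(pd, s->base, SLAB_BYTES,
                       IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
    return s->mr ? 0 : -1;
}

/* Bump-allocate a buffer that is already covered by the slab's lkey/rkey. */
static void *slab_alloc(struct slab *s, size_t len)
{
    if (s->next + len > SLAB_BYTES)
        return NULL;
    void *p = s->base + s->next;
    s->next += len;
    return p;
}

The same reasoning applies to qpairs: a small, fixed set of connected qpairs keeps the NIC's connection-state cache from thrashing.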

But you can control for essentially all of these factors.

You would then only see non-determinism to the extent that your application triggers it, through paging, scheduling effects, poor memory-allocation affinities (e.g. core X allocates block B, but then core Y tries to read or write it), locking, etc.  Those effects can be quite large.  Getting Derecho to run at the full 100Gbps network rate was really hard because of issues of these kinds -- and there are more and more papers reporting similar issues for Linux and Mesos as a whole.  Copying will also kill performance: 100Gbps is faster than memcpy for a large, non-cached object.  So even a single copy operation, or a single checksum computation, can turn out to be by far the bottleneck -- and a huge source of non-determinism if you trigger it only now and then, as with a garbage-collected language.
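
To put a rough number on the copy cost: 100Gbps is about 12.5 GB/s of payload, and a single core copying a large, cache-cold buffer often can't sustain much more than that, so one copy per message can already eat the whole link budget.  A quick (unscientific) way to check on a given machine -- buffer size and setup here are arbitrary, not tuned:

/* Hedged micro-benchmark sketch: measure single-core memcpy bandwidth on a
 * buffer large enough to defeat the caches, and compare it to the ~12.5 GB/s
 * that a 100Gbps link can deliver. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(void)
{
    const size_t len = 1UL << 30;             /* 1 GiB: far larger than the LLC */
    char *src = malloc(len), *dst = malloc(len);
    if (!src || !dst)
        return 1;
    memset(src, 1, len);                      /* fault the pages in */
    memset(dst, 2, len);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    memcpy(dst, src, len);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("memcpy: %.2f GB/s (a 100Gbps link needs ~12.5 GB/s)\n",
           (len / 1e9) / secs);
    free(src);
    free(dst);
    return 0;
}

If the measured number comes out anywhere near 12.5 GB/s, a per-message copy or checksum is already the bottleneck at 100Gbps.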

Priority inversions are another big issue, at the OS thread level or in threaded applications.  What happens in this case is that a lock accidentally ends up shared between a high-priority thread (like an RDMA NIC, which acts like a super-thread with the highest possible priority) and a lower-priority thread (like any random application thread).  If the application uses the thread-priority features of Linux/Mesos, this increases the chance of causing inversions.

So an inversion arises when, for high-priority thread A to do something like check a qpair for an enabled transfer, a lower-priority thread B needs to run first (say, B holds the lock but got preempted).  This is a rare race-condition sort of problem, but when it bites, A is stuck until B runs.  If C is high priority and busy-waiting for a doorbell from the RDMA NIC, or for a completion, C prevents B from running, and we get a form of deadlock that can persist until something manages to stall C.  Then B finishes, A resumes, and so on.  Non-deterministic network delay ensues.
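
Here is a minimal pthreads sketch of that pattern; the thread roles and priorities are illustrative only, and actually reproducing the hang needs real-time priorities and fewer cores than runnable threads.

/* Hedged sketch of the inversion pattern above, in plain pthreads.  The
 * roles are illustrative; reproducing the hang needs SCHED_FIFO priorities
 * (root) and fewer cores than runnable threads. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t qp_lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_int completion_ready = 0;

/* Low-priority thread B: takes the lock, then can be preempted while holding it. */
static void *worker(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&qp_lock);
    usleep(1000);                      /* stand-in for "preempted at the wrong moment" */
    atomic_store(&completion_ready, 1);
    pthread_mutex_unlock(&qp_lock);
    return NULL;
}

/* High-priority thread A/C: busy-polls for a completion, then needs the lock.
 * Under real-time priorities the spin keeps B off the CPU, the lock is never
 * released, and A is stuck until something stalls the spinner. */
static void *poller(void *arg)
{
    (void)arg;
    while (!atomic_load(&completion_ready))
        ;                              /* busy-wait, like polling an RDMA doorbell */
    pthread_mutex_lock(&qp_lock);      /* blocks until B finally releases the lock */
    printf("completion handled\n");
    pthread_mutex_unlock(&qp_lock);
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&b, NULL, worker, NULL);
    pthread_create(&a, NULL, poller, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}

Once anything stalls the spinner, B finishes, A resumes, and the delay shows up as network non-determinism even though the network did nothing wrong.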

So those are the kinds of examples I spend a lot of my time thinking about.  The puzzle for me isn't at the lower levels -- RDMA and the routers already have reasonable options.  The puzzle is that the software can introduce tons of non-determinism even at the very lowest kernel or container layers, more or less in the NIC itself or in the driver, or perhaps in memory management and thread scheduling.

I could actually give more examples that relate to interactions between devices: networks plus DMA into a frame buffer for a video device or GPU, for example (here the real issue is barriers: how do you know whether the caches and other internal pipelines of that device were flushed when the transfer into its memory occurred?  It turns out there is no hardware standard for this, and it isn't always a sure thing).  If you use a sledgehammer solution, like a bus reset (available with RDMA), that's going to have a BIG impact on perceived network determinism...  yet it is really an end-host "issue", not a network issue.

Ken

-----Original Message-----
From: Matthias Tafelmeier [mailto:matthias.tafelmeier at gmx.net] 
Sent: Monday, November 20, 2017 12:56 PM
To: Ken Birman <kpb3 at cornell.edu>; 'Dave Taht' <dave at taht.net>
Cc: Bob Briscoe <ietf at bobbriscoe.net>; bloat at lists.bufferbloat.net
Subject: Re: *** GMX Spamverdacht *** RE: [Bloat] DETNET


> If this thread is about a specific scenario, maybe someone could point me to the OP where the scenario was first described?

I've forwarded the root of the thread to you -- there's no specific scenario; I was just sharing/referring to the DETNET IETF papers.

--
Best regards

Matthias Tafelmeier


