[Bloat] RE: DETNET

Ken Birman kpb3 at cornell.edu
Sun Dec 17 11:06:38 EST 2017


I see this as a situation that really argues for systematic experiments, maybe a paper you could aim towards SIGCOMM IMC.  Beyond a certain point, you simply need to pull out all the stops and try to understand what the main obstacles to determinism turn out to be in practice, for realistic examples of systems that might need determinism (medical monitoring in a hospital, for example, or wide-area tracking of transients in the power grid, things of that sort).

In fact I can see how a case could be made for doing a series of such papers: one looking purely at network settings, one at datacenter environments (not virtualized, unless you want to do yet another paper), one at WAN distributed infrastructures.  Maybe one for embedded systems like self-driving cars or planes.

I would find that sort of paper interesting if the work were done really well, and really broke down the causes of unpredictability, traced each one to a root cause, and maybe even showed how to fix the issues identified.

There are also several dimensions to consider (one paper could still tackle multiple aspects): latency and throughput.  And then there are sometimes important tradeoffs between people trying to run at the highest practical data rates and those satisfied with very low rates, so you might also want to look at how the offered load impacts the stability of the system.

But I think the level of interest in this topic would be very high.  The key is to acknowledge that with so many layers of hardware and software playing roles, only an experimental study can really shed light.  Very likely, this is like anything else: 99% of the variability is coming from 1% of the end-to-end pathway... fix that 1% and you'll find that there is a similar issue but emerging from some other place. But fix enough of them, and you could have a significant impact -- and industry would adopt solutions that really work...

Ken 

-----Original Message-----
From: Matthias Tafelmeier [mailto:matthias.tafelmeier at gmx.net] 
Sent: Sunday, December 17, 2017 7:46 AM
To: Ken Birman <kpb3 at cornell.edu>; 'Dave Taht' <dave at taht.net>
Cc: Bob Briscoe <ietf at bobbriscoe.net>; bloat at lists.bufferbloat.net
Subject: Re: [Bloat] DETNET


> Well, the trick of running RDMA side by side with TCP inside a datacenter using 2 DiffServ classes would fit the broad theme of deterministic traffic classes, provided that you use RDMA in just the right way.  You get the highest level of determinism for the reliable one-sided write case, provided that your network only has switches and no routers in it (so in a Clos network, rack-scale cases, or perhaps racks with one ToR switch, but not the leaf and spine routers).  The reason for this is that with routers you can have resource delays (RDMA sends only with permission, in the form of credits).  Switches always allow sending and have full bisection bandwidth, and in this specific configuration of RDMA the receiver grants permission at the time the one-sided receive buffer is registered, so after that setup the delays will be a function of (1) traffic on the sender NIC, (2) traffic on the receiver NIC, and (3) queue priorities, when there are multiple queues sharing one NIC.
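>
> To make that setup concrete, here is a minimal sketch of the one-sided write case using libibverbs (illustrative only, not Derecho's actual code; qp, mr, remote_addr and rkey are assumed to come from the usual connection setup):
>
>     #include <infiniband/verbs.h>
>     #include <stdint.h>
>     #include <string.h>
>
>     /* Post a one-sided RDMA write: once the receiver has registered its
>        buffer and shared remote_addr/rkey, no receiver CPU is involved. */
>     static int post_one_sided_write(struct ibv_qp *qp, struct ibv_mr *mr,
>                                     uint64_t remote_addr, uint32_t rkey,
>                                     uint32_t len)
>     {
>         struct ibv_sge sge = {
>             .addr   = (uintptr_t)mr->addr,
>             .length = len,
>             .lkey   = mr->lkey,
>         };
>         struct ibv_send_wr wr, *bad_wr;
>         memset(&wr, 0, sizeof wr);
>         wr.opcode              = IBV_WR_RDMA_WRITE;
>         wr.sg_list             = &sge;
>         wr.num_sge             = 1;
>         wr.send_flags          = IBV_SEND_SIGNALED;  /* request a completion */
>         wr.wr.rdma.remote_addr = remote_addr;
>         wr.wr.rdma.rkey        = rkey;
>         return ibv_post_send(qp, &wr, &bad_wr);
>     }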
>
> Other sources of non-determinism for hardware RDMA would include limited resources within the NIC itself.  An RDMA NIC has to cache the DMA mappings for pages you are using, as well as qpair information for the connected qpairs.  The DMA mapping itself has a two-level structure.  So there are three kinds of caches, and each of these can become overfull.  If that happens, the needed mapping is fetched from host memory, but this evicts data, so you can see a form of cache-miss-thrashing occur in which performance will degrade sharply.  Derecho avoids putting too much pressure on these NIC resources, but some systems accidentally overload one cache or another and then they see performance collapse as they scale.
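>
> One way to keep that pressure down -- a hedged illustration, with sizes and flags chosen for the example -- is to back the buffer with huge pages and register it once, so the NIC caches a handful of translation entries rather than thousands of per-object mappings:
>
>     #include <infiniband/verbs.h>
>     #include <sys/mman.h>
>
>     /* One large huge-page-backed registration: far fewer DMA-mapping
>        entries for the NIC's caches than many small registrations.
>        Requires huge pages to be reserved on the host; returns NULL
>        on failure.  pd is an already-allocated protection domain. */
>     static struct ibv_mr *reg_big_region(struct ibv_pd *pd, size_t len)
>     {
>         void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
>                          MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
>         if (buf == MAP_FAILED)
>             return NULL;
>         return ibv_reg_mr(pd, buf, len,
>                           IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
>     }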
>
> But you can control for essentially all of these factors.
>
> You would then only see non-determinism to the extent that your application triggers it, through paging, scheduling effects, poor memory allocation area affinities (e.g. core X allocates block B, but then core Y tries to read or write into it), locking, etc.  Those effects can be quite large.  Getting Derecho to run at the full 100Gbps network rates was really hard because of issues of these kinds -- and there are more and more papers reporting similar issues for Linux and Mesos as a whole.  Copying will also kill performance: 100Gbps is faster than memcpy for a large, non-cached object.  So even a single copy operation, or a single checksum computation, can actually turn out to be by far the bottleneck -- and can be a huge source of non-determinism if you trigger this but only now and then, as with a garbage collected language.
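>
> The arithmetic is easy to check: 100 Gbps is 12.5 GB/s, and a single-core memcpy of a cold, larger-than-cache buffer often measures in that range or below.  A rough self-contained check (numbers are illustrative):
>
>     #include <stdio.h>
>     #include <stdlib.h>
>     #include <string.h>
>     #include <time.h>
>
>     int main(void)
>     {
>         size_t len = 1UL << 30;            /* 1 GiB, far larger than any LLC */
>         char *src = malloc(len), *dst = malloc(len);
>         if (!src || !dst)
>             return 1;
>         memset(src, 1, len);               /* fault the pages in */
>         memset(dst, 0, len);
>
>         struct timespec t0, t1;
>         clock_gettime(CLOCK_MONOTONIC, &t0);
>         memcpy(dst, src, len);
>         clock_gettime(CLOCK_MONOTONIC, &t1);
>
>         double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
>         printf("memcpy: %.2f GB/s (100 Gbps line rate needs 12.5 GB/s)\n",
>                len / s / 1e9);
>         free(src); free(dst);
>         return 0;
>     }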
>
> Priority inversions are another big issue, at the OS thread level or in threaded applications.  What happens in this case is that you have a lock and accidentally end up sharing it between a high priority thread (like an RDMA NIC, which acts like a super-thread with the highest possible priority) and a lower priority thread (like any random application thread).  If the application uses the thread priority features of Linux/Mesos, this increases the chance of an inversion.
>
> So an inversion arises when, for a high priority thread A to do something (like check a qpair for an enabled transfer), a lower priority thread B needs to run first (say, B holds the lock but got preempted).  This is a rare race-condition sort of problem, but when it bites, A gets stuck until B runs.  If C is high priority and doing something like busy-waiting for a doorbell from the RDMA NIC, or for a completion, C prevents B from running, and we get a form of deadlock that can persist until something manages to stall C. Then B finishes, A resumes, etc.  Non-deterministic network delay ensues.
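>
> For the thread-level version of this (it cannot help with the NIC itself), the standard Linux defense is a priority-inheritance mutex, so that while A waits, the holder B is temporarily boosted past C.  A minimal sketch, error handling trimmed:
>
>     #include <pthread.h>
>
>     /* PTHREAD_PRIO_INHERIT boosts a low-priority lock holder to the
>        priority of the highest-priority waiter, so C cannot starve B. */
>     static int init_pi_mutex(pthread_mutex_t *m)
>     {
>         pthread_mutexattr_t attr;
>         int rc = pthread_mutexattr_init(&attr);
>         if (rc != 0)
>             return rc;
>         rc = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
>         if (rc == 0)
>             rc = pthread_mutex_init(m, &attr);
>         pthread_mutexattr_destroy(&attr);
>         return rc;
>     }
>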
> So those are the kinds of examples I spend a lot of my time thinking about.  The puzzle for me isn't at the lower levels -- RDMA and the routers already have reasonable options.  The puzzle is that the software can introduce tons of non-determinism even at the very lowest kernel or container layers, more or less in the NIC itself or in the driver, or perhaps in memory management and thread scheduling.
>
> I could actually give more examples that relate to interactions between devices: networks plus DMA into a frame buffer for a video device or GPU, for example (in this case the real issue is barriers: how do you know whether the cache and other internal pipelines of that device were flushed when the transfer into its memory occurred?  It turns out that there is no hardware standard for this, and it might not always be a sure thing).  If they use a sledgehammer solution, like a bus reset (available with RDMA), that's going to have a BIG impact on perceived network determinism...  yet it actually is an end-host "issue", not a network issue.
>
>
Nothing to challenge here. On the scheduler part I only want to add -- you certainly know this better than I do -- that there are quite nifty software techniques to all but eradicate at least the priority inversion problem. Speaking only for Linux, I know there is quite some movement at the moment around scheduler amendments for network processing. I am not sure whether vendors of embedded variants, or the RT patch, haven't made it extinct already. Though that is not the point you're making.
Further, it would still leave the clock source as an introducer of non-determinism.

A quite valuable research endeavor would be to quantify all of those traits and compare them, or make them comparable, to the characteristics of certain other approaches, e.g., the quite promising Linux kernel busy-polling mechanisms [1] (a minimal sketch follows the reference below). All of them suffer from similar weaknesses; the question is which ones dominate. I say that a little briskly, without having clearly thought through its feasibility for the time being.

[1] https://www.netdevconf.org/2.1/papers/BusyPollingNextGen.pdf
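
For reference, the per-socket variant of the mechanism described in [1] looks roughly like this (the 50-microsecond budget is illustrative; the system-wide knobs are the net.core.busy_poll and net.core.busy_read sysctls, and setting the option may require CAP_NET_ADMIN):

    #include <sys/socket.h>

    /* Ask the kernel to busy-poll the device queue for up to 'usec'
       microseconds before sleeping on this socket. */
    static int enable_busy_poll(int fd, int usec)
    {
        return setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL,
                          &usec, sizeof(usec));
    }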

--
Best regards

Matthias Tafelmeier


