[Bloat] DETNET

Ken Birman kpb3 at cornell.edu
Sat Nov 18 14:47:02 EST 2017


From: Ken Birman
Sent: Saturday, November 18, 2017 2:44 PM
To: Matthias Tafelmeier <matthias.tafelmeier at gmx.net>
Cc: Dave Taht <dave at taht.net>; Bob Briscoe <ietf at bobbriscoe.net>; ken at cs.cornell.edu; bloat at lists.bufferbloat.net
Subject: Re: [Bloat] DETNET

Several remarks:
- If you have hardware RDMA, you can use it instead of TCP, but only within a data center or, at most, between two side-by-side data centers.  In such settings the guarantees of RDMA are identical to TCP's: lossless, uncorrupted, ordered data delivery.  In fact there are versions of TCP that simply map your requests to RDMA.  But for peak speed and lowest latency, the RDMA transfer needs to start in user space and terminate in user space (end to end).  Any kernel involvement slows things down, even with DMA scatter-gather: copying would kill performance, but as it turns out, scheduling delays between user and kernel, or interrupts, are almost as bad.
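
To make the "user space end to end" point concrete, here is a minimal sketch of the hot path using the libibverbs API.  It assumes a reliably connected queue pair and a registered buffer, with the remote address and rkey already exchanged out of band; the helper name and error handling are illustrative, not Derecho code:

    /* One-sided RDMA write, posted and completed entirely in user space. */
    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <string.h>

    int rdma_write(struct ibv_qp *qp, struct ibv_cq *cq, struct ibv_mr *mr,
                   void *buf, uint32_t len, uint64_t remote_addr, uint32_t rkey)
    {
        struct ibv_sge sge = {
            .addr   = (uint64_t)(uintptr_t)buf,  /* local registered buffer */
            .length = len,
            .lkey   = mr->lkey,
        };
        struct ibv_send_wr wr, *bad_wr = NULL;
        memset(&wr, 0, sizeof(wr));
        wr.opcode              = IBV_WR_RDMA_WRITE;  /* one-sided: no remote CPU */
        wr.sg_list             = &sge;
        wr.num_sge             = 1;
        wr.send_flags          = IBV_SEND_SIGNALED;  /* request a completion */
        wr.wr.rdma.remote_addr = remote_addr;
        wr.wr.rdma.rkey        = rkey;

        if (ibv_post_send(qp, &wr, &bad_wr))
            return -1;

        /* Busy-poll the completion queue from user space; no interrupts. */
        struct ibv_wc wc;
        while (ibv_poll_cq(cq, 1, &wc) == 0)
            ;
        return wc.status == IBV_WC_SUCCESS ? 0 : -1;
    }

With typical providers neither ibv_post_send() nor ibv_poll_cq() enters the kernel on this path: the post rings a doorbell register mapped into the process, and completions are polled directly out of user-space memory.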

- RDMA didn’t work well on Ethernet until recently, but this was fixed by a technique called DCQCN (Mellanox), or its cousin TIMELY (Google).  Microsoft recently had a SIGCOMM paper on running RDMA+DCQCN side by side with TCP/IP to support their Azure platform, on a single 100Gbps data-center network.  They found it very feasible, although configuring the system requires some sophistication.  Azure supports Linux, Mesos, Windows, you name it.  The one thing they didn’t try was heavy virtualization; in fact, they disabled enterprise VLAN functionality in their routers.  So if you need that, you might have issues.
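
For the curious, the core of TIMELY is simple enough to sketch.  This is a simplified rendition of the rate update from the SIGCOMM 2015 paper (it omits the hyperactive-increase mode), and the constants are illustrative, not tuned values:

    /* TIMELY-style rate update: the smoothed RTT *gradient*, not the
     * absolute RTT, drives the multiplicative decrease. */
    #define ALPHA   0.875   /* EWMA weight for the RTT gradient */
    #define BETA    0.8     /* multiplicative-decrease factor */
    #define DELTA   10e6    /* additive-increase step, bits/sec */
    #define T_LOW   50e-6   /* below this RTT, always increase (sec) */
    #define T_HIGH  500e-6  /* above this RTT, always decrease (sec) */
    #define MIN_RTT 20e-6   /* propagation RTT used to normalize (sec) */

    double timely_update(double rate, double new_rtt,
                         double *prev_rtt, double *rtt_diff)
    {
        double new_diff = new_rtt - *prev_rtt;
        *prev_rtt = new_rtt;
        /* Smooth the per-sample gradient with an EWMA. */
        *rtt_diff = (1 - ALPHA) * (*rtt_diff) + ALPHA * new_diff;
        double gradient = *rtt_diff / MIN_RTT;

        if (new_rtt < T_LOW)      /* queues near empty: probe harder */
            return rate + DELTA;
        if (new_rtt > T_HIGH)     /* hard bound on queueing delay */
            return rate * (1 - BETA * (1 - T_HIGH / new_rtt));
        if (gradient <= 0)        /* RTT falling: additive increase */
            return rate + DELTA;
        return rate * (1 - BETA * gradient);  /* RTT rising: back off */
    }

Because it needs only NIC hardware timestamps to measure RTT, TIMELY requires no switch support; DCQCN instead relies on ECN marking in the switches plus QCN-style rate control in the NICs.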

- One-way latencies tend to be in the range reported earlier today, maybe 1-2us for medium-sized transfers.  A weakness of RDMA is that its minimum latency, in its TCP-like reliable mode, may still be larger than you would wish.  Latency is lower for unreliable RDMA, or for direct writes into shared remote memory regions: you can get down to perhaps 0.1us in those cases, for a small write like a single integer.  In fact the wire format always moves a fairly large number of bytes at a time, maybe 96?  It varies by wire speed.  But the effect is that writing one bit and writing 512 bytes can be pretty much identical in latency.

- The HPC people figured out how to solve this issue of not having hardware RDMA on development machines.  The main package they use is called MPI, and it has an internal split: the user-mode half talks to an adaptor library called libfabric, and this then maps to RDMA.  It can also discover that you lack RDMA hardware, and in that case it will automatically use TCP (see the discovery sketch after the next paragraph).  We plan to port Derecho to run on this soon, almost certainly by early spring 2018, perhaps sooner: the API mimics the RDMA one, so it won’t be hard to do.  I would recommend this for anyone doing new development.  The only issue is that for now libfabric is a fancy C header file that uses C macro expansion, which means you can’t use it directly from C++; you need a stub library, which can add a tiny bit of delay.  I’m told that the C++ library folks are going to create a variadic-templates version, which would eliminate the limitation and offer the same inline code expansion as the C header, but I don’t know when that will happen.

We are doing some testing of pure libfabric performance now, both in data centers and in WAN networks (after all, you can get 100Gbps over substantial distances these days: Cornell has it from Ithaca to New York City, where we have a hospital and our new Tech campus).  We think this could let us run Derecho over a WAN with no hardware RDMA at all.
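
As a concrete illustration of that fallback, here is a minimal libfabric discovery sketch: it asks fi_getinfo() for a reliable connected endpoint and prints the providers that can satisfy it, which will include verbs where RDMA hardware exists and a sockets/TCP provider otherwise.  Error handling is trimmed and the version constant is illustrative:

    /* Ask libfabric which providers can supply reliable messaging. */
    #include <stdio.h>
    #include <rdma/fabric.h>

    int main(void)
    {
        struct fi_info *hints = fi_allocinfo();
        hints->ep_attr->type = FI_EP_MSG;   /* connected, reliable endpoint */
        hints->caps = FI_MSG;               /* send/recv messaging */

        struct fi_info *info = NULL;
        if (fi_getinfo(FI_VERSION(1, 5), NULL, NULL, 0, hints, &info)) {
            fprintf(stderr, "no usable provider\n");
            return 1;
        }
        /* The list is ordered by preference; the first entry is what you
         * would open with fi_fabric()/fi_domain()/fi_endpoint(). */
        for (struct fi_info *cur = info; cur; cur = cur->next)
            printf("provider: %s\n", cur->fabric_attr->prov_name);

        fi_freeinfo(info);
        fi_freeinfo(hints);
        return 0;
    }

Compile with -lfabric.  Opening the first entry proceeds the same way whether it is backed by RDMA hardware or by TCP, which is what makes the fallback transparent to the application.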

- There is also a way to run Derecho on SoftRoCE, a Linux software emulation of RDMA.  We tried this, and it is a solid 34-100x slower, so not interesting except for development; I would steer away from SoftRoCE as an option.  It also pegs two cores at 100%, one in the kernel and one in user space.  Maybe that is just in need of tuning, but the code path certainly does not seem well optimized.  At this stage it is a poor option for developing code meant to be interoperable between software and hardware-accelerated RDMA.  libfabric probably isn’t a superstar either in terms of speed, but so far it appears to be faster, quite stable, easier to install, and much less of a background load.

Ken

