General list for discussing Bufferbloat
From: Ken Birman <kpb3@cornell.edu>
To: Matthias Tafelmeier <matthias.tafelmeier@gmx.net>
Cc: Dave Taht <dave@taht.net>, Bob Briscoe <ietf@bobbriscoe.net>,
	"ken@cs.cornell.edu" <ken@cs.cornell.edu>,
	"bloat@lists.bufferbloat.net" <bloat@lists.bufferbloat.net>
Subject: Re: [Bloat] DETNET
Date: Sat, 18 Nov 2017 19:47:02 -0000	[thread overview]
Message-ID: <BN6PR04MB11871AED343EF0703811F61B9F2C0@BN6PR04MB1187.namprd04.prod.outlook.com> (raw)
In-Reply-To: <A200909C-1FB0-439B-AC61-E241AEF95D12@cornell.edu>


“clan” is an iPad typo; read “vlan”.

From: Ken Birman
Sent: Saturday, November 18, 2017 2:44 PM
To: Matthias Tafelmeier <matthias.tafelmeier@gmx.net>
Cc: Dave Taht <dave@taht.net>; Bob Briscoe <ietf@bobbriscoe.net>; ken@cs.cornell.edu; bloat@lists.bufferbloat.net
Subject: Re: [Bloat] DETNET

Several remarks:
- If you have hardware RDMA, you can use it instead of TCP, but only within a data center or, at most, between two side-by-side data centers. In such settings the guarantees of RDMA are identical to TCP's: lossless, uncorrupted, ordered data delivery. In fact, there are versions of TCP that simply map your requests to RDMA. But for peak speed and lowest latency, you need the RDMA transfer to start in user space and terminate in user space (end to end). Any kernel involvement slows things down, even with DMA scatter-gather: copying would kill performance, but as it turns out, scheduling delays between user and kernel, or interrupts, are almost as bad.
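
The cost of crossing into the kernel can be felt even from Python. The sketch below (illustrative only, and obviously not RDMA itself) times a pure user-space memory copy against a write() system call to /dev/null, to show the kind of per-operation overhead any kernel crossing adds:

```python
import os
import time

N = 100_000
buf = b"x" * 64
dst = bytearray(64)

# Pure user-space work: copy 64 bytes within the process, no kernel crossing.
t0 = time.perf_counter()
for _ in range(N):
    dst[:] = buf
t_user = time.perf_counter() - t0

# The same payload, but pushed through a write() syscall each iteration.
fd = os.open(os.devnull, os.O_WRONLY)
t0 = time.perf_counter()
for _ in range(N):
    os.write(fd, buf)
t_kernel = time.perf_counter() - t0
os.close(fd)

print(f"user-space copy:    {t_user / N * 1e9:6.0f} ns/op")
print(f"write() round trip: {t_kernel / N * 1e9:6.0f} ns/op")
```

On a typical machine the syscall path is markedly slower per operation, which is why an RDMA transfer that initiates and completes entirely in user space matters once you care about microsecond-scale latencies.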

- RDMA didn’t work well on Ethernet until recently, but this was fixed by a technique called DCQCN (from Mellanox), or its cousin TIMELY (from Google).  Microsoft recently had a SIGCOMM paper on running RDMA+DCQCN side by side with TCP/IP to support their Azure platform, on a single 100Gb data-center network.  They found it quite feasible, although configuring the system requires some sophistication.  Azure supports Linux, Mesos, Windows, you name it.  The one thing they didn’t try was heavy virtualization; in fact they disabled enterprise vlan functionality in their routers.  So if you need that, you might have issues.
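
For intuition about the TIMELY side of this, here is a toy rate update driven by the RTT gradient. Everything here (constants, clamping, structure) is an illustrative stand-in, not the published algorithm, which uses smoothed gradients, a hyperactive-increase phase, and carefully tuned parameters:

```python
def timely_step(rate, prev_rtt, rtt,
                add=5.0, beta=0.8, t_low=50.0, t_high=500.0):
    """One toy sender-rate update in the spirit of TIMELY.

    rate in Gb/s, RTTs in microseconds. Illustrative only: not the
    real algorithm's constants or full structure.
    """
    gradient = (rtt - prev_rtt) / t_low   # normalized RTT gradient
    if rtt < t_low:                       # far below target: grow
        return rate + add
    if rtt > t_high:                      # far above target: cut hard
        return rate * (1 - beta * (1 - t_high / rtt))
    if gradient <= 0:                     # queues draining: additive increase
        return rate + add
    # Queues building: multiplicative decrease scaled by the gradient.
    return rate * (1 - beta * min(gradient, 1.0))
```

The key idea it captures is that the *trend* in delay, not a drop or an ECN mark, is the congestion signal, which is what lets these schemes keep Ethernet queues short enough for RDMA.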

- One-way latencies tend to be in the range reported earlier today, maybe 1-2us for medium-sized transfers.  A weakness of RDMA is that in its TCP-like mode its minimum latency may still be larger than you would wish.  Latency is lower for unreliable RDMA, or for direct writes into shared remote memory regions: you can get down to perhaps 0.1us in those cases, for a small write like a single integer.  In fact, the wire transfer format always moves a fairly large number of bytes, maybe 96?  It varies by wire speed.  But the effect is that writing one bit or writing 512 bytes can be pretty much identical in terms of latency.
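
Back-of-the-envelope arithmetic shows why payload size barely matters at these scales: on a 100 Gb/s link, clocking 512 bytes onto the wire takes about 0.04 us, small next to a base one-way latency of around 1 us (both figures here are assumptions for illustration, not measurements):

```python
LINK_GBPS = 100          # assumed link speed, Gb/s
BASE_LATENCY_US = 1.0    # assumed one-way base latency, microseconds

def serialization_us(payload_bytes):
    """Time to clock payload_bytes onto the wire, in microseconds."""
    # 100 Gb/s is 1e5 bits per microsecond, i.e. LINK_GBPS * 1e3.
    return payload_bytes * 8 / (LINK_GBPS * 1e3)

for size in (1, 64, 512, 4096):
    total = BASE_LATENCY_US + serialization_us(size)
    print(f"{size:5d} B -> {total:.3f} us one-way")
```

Under these assumptions the 1-byte and 512-byte totals differ by about 4%, consistent with the observation that tiny and medium-sized writes cost nearly the same.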

- The HPC people figured out how to solve this issue of not having hardware RDMA on development machines.  The main package they use is called MPI, and it has an internal split: the user-mode half talks to an adaptor library called LibFabrics, and this maps to RDMA.  It can also discover that you lack RDMA hardware, and in that case will automatically fall back to TCP.  We plan to port Derecho to run on this soon, almost certainly by early spring 2018, perhaps sooner: the API mimics the RDMA one, so it won’t be hard to do.  I would recommend this for anyone doing new development.  The only issue is that, for now, LibFabrics is a fancy C header file that uses C macro expansion, which means you can’t use it directly from C++; you need a stub library, which can add a tiny bit of delay.  I’m told the C++ library folks are going to create a variadic-templates version, which would eliminate that limitation and offer the same inline code expansion as the C header, but I don’t know when that will happen.
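
The fallback behavior can be pictured as a simple preference list. The provider names below are real libfabric providers, but the selection logic is a simplified stand-in for what the library's fi_getinfo() call negotiates, not its actual code:

```python
def pick_provider(available):
    """Prefer hardware RDMA ("verbs"); otherwise fall back to a
    TCP-based provider. Simplified illustration of libfabric-style
    provider selection, not the library's own algorithm."""
    for name in ("verbs", "tcp", "sockets"):
        if name in available:
            return name
    raise RuntimeError("no usable fabric provider")
```

On a cluster with RDMA NICs, pick_provider({"verbs", "sockets"}) selects "verbs"; on a development laptop with no such hardware, pick_provider({"sockets"}) falls back to the TCP-based path, matching the automatic fallback described above.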

We are doing some testing of pure LibFabrics performance now, both in data centers and over WAN links (after all, you can get 100Gbps over substantial distances these days; Cornell has it from Ithaca to New York City, where we have a hospital and our new Tech campus).  We think this could let us run Derecho over a WAN with no hardware RDMA at all.

- There is also a way to run Derecho on SoftRoCE, a Linux software emulation of RDMA.  We tried this, and it is a solid 34-100x slower, so not interesting except for development; I would steer away from SoftRoCE as an option.  It also pegs two cores at 100%, one in the kernel and one in user space.  Maybe this is just in need of tuning, but it certainly seems that code path is not well optimized.  At this stage it is a poor option for developing code meant to be interoperable between software- and hardware-accelerated RDMA.  LibFabrics probably isn’t a superstar either in terms of speed, but so far it appears to be faster, quite stable, easier to install, and much less of a background load.

Ken





Thread overview: 24+ messages
2017-11-04 13:45 Matthias Tafelmeier
     [not found] ` <87shdr0vt6.fsf@nemesis.taht.net>
2017-11-12 14:58   ` Matthias Tafelmeier
2017-11-12 19:58     ` Bob Briscoe
2017-11-13 17:56       ` Matthias Tafelmeier
2017-11-15 19:31         ` Dave Taht
2017-11-15 19:45           ` Ken Birman
2017-11-15 20:09             ` Matthias Tafelmeier
2017-11-15 20:16               ` Dave Taht
2017-11-15 21:01                 ` Ken Birman
2017-11-18 15:56                   ` Matthias Tafelmeier
2017-12-11 20:32                   ` Toke Høiland-Jørgensen
2017-12-11 20:43                     ` Ken Birman
2017-11-18 15:38           ` Matthias Tafelmeier
2017-11-18 15:45             ` Ken Birman
2017-11-19 18:33             ` Dave Taht
2017-11-19 20:24               ` Ken Birman
2017-11-20 17:56                 ` [Bloat] *** GMX Spamverdacht *** DETNET Matthias Tafelmeier
2017-11-20 19:04                   ` Ken Birman
2017-12-17 12:46                     ` Matthias Tafelmeier
2017-12-17 16:06                       ` Ken Birman
2017-11-18 17:55           ` [Bloat] DETNET Matthias Tafelmeier
2017-11-18 19:43             ` Ken Birman
2017-11-18 19:47               ` Ken Birman [this message]
2017-11-20 18:32               ` Matthias Tafelmeier
