From: Matthias Tafelmeier
To: Ken Birman, Dave Taht
Cc: Bob Briscoe, bloat@lists.bufferbloat.net
Date: Sun, 17 Dec 2017 13:46:26 +0100
Subject: Re: [Bloat] DETNET
Message-ID: <6c758e2d-4cae-390c-b75b-e829ed6221ef@gmx.net>
List-Id: General list for discussing Bufferbloat

> Well, the trick of running RDMA side by side with TCP inside a datacenter using 2 Diffsrv classes would fit the broad theme of deterministic traffic classes, provided that you use RDMA in just the right way. You get the highest level of determinism for the reliable one-sided write case, provided that your network only has switches and no routers in it (so in a COS network, rack-scale cases, or perhaps racks with one TOR switch, but not the leaf and spine routers). The reason for this is that with routers you can have resource delays (RDMA sends only with permission, in the form of credits). Switches always allow sending, and have full bisection bandwidth, and in this specific configuration of RDMA, the receiver grants permission at the time the one-sided receive buffer is registered, so after that setup, the delays will be a function of (1) traffic on the sender NIC, (2) traffic on the receiver NIC, (3) queue priorities, when there are multiple queues sharing one NIC.
>
> Other sources of non-determinism for hardware RDMA would include limited resources within the NIC itself. An RDMA NIC has to cache the DMA mappings for pages you are using, as well as qpair information for the connected qpairs. The DMA mapping itself has a two-level structure. So there are three kinds of caches, and each of these can become overfull. If that happens, the needed mapping is fetched from host memory, but this evicts data, so you can see a form of cache-miss-thrashing occur in which performance will degrade sharply. Derecho avoids putting too much pressure on these NIC resources, but some systems accidentally overload one cache or another and then they see performance collapse as they scale.
>
> But you can control for essentially all of these factors.
>
> You would then only see non-determinism to the extent that your application triggers it, through paging, scheduling effects, poor memory allocation area affinities (e.g. core X allocates block B, but then core Y tries to read or write into it), locking, etc. Those effects can be quite large. Getting Derecho to run at the full 100Gbps network rates was really hard because of issues of these kinds -- and there are more and more papers reporting similar issues for Linux and Mesos as a whole. Copying will also kill performance: 100Gbps is faster than memcpy for a large, non-cached object. So even a single copy operation, or a single checksum computation, can actually turn out to be by far the bottleneck -- and can be a huge source of non-determinism if you trigger this but only now and then, as with a garbage collected language.
>
> Priority inversions are another big issue, at the OS thread level or in threaded applications. What happens with this case is that you have a lock and accidentally end up sharing it between a high priority thread (like an RDMA NIC, which acts like a super-thread with the highest possible priority), and a lower priority thread (like any random application thread). If the application uses the thread priorities features of Linux/Mesos, this can exacerbate the chance of causing inversions.
>
> So an inversion would arise if for high priority thread A to do something, like check a qpair for an enabled transfer, a lower priority thread B needs to run (like if B holds the lock but then got preempted). This is a rare race-condition sort of problem, but when it bites, A gets stuck until B runs. If C is high priority and doing something like busy-waiting for a doorbell from the RDMA NIC, or for a completion, C prevents B from running, and we get a form of deadlock that can persist until something manages to stall C. Then B finishes, A resumes, etc. Non-deterministic network delay ensues.
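
As a rough illustration of the A/B/C scenario in the quoted paragraph above, here is a toy single-CPU simulation with strict priority scheduling. It is only a sketch under stated assumptions: the tick counts, the numeric priorities (C is placed between A and B, the textbook arrangement), and the `inherit` switch are all made up for illustration. The switch models priority inheritance, one well-known remedy that the quoted text does not itself mention.

```python
from dataclasses import dataclass


@dataclass
class Thread:
    name: str
    priority: int                 # larger number = more urgent (assumed convention)
    work: int                     # CPU ticks still needed
    waits_for_lock: bool = False
    holds_lock: bool = False


def ticks_until_a_finishes(inherit: bool, c_busy_ticks: int = 50) -> int:
    """Return the tick at which high-priority thread A completes."""
    a = Thread("A", priority=3, work=1, waits_for_lock=True)
    b = Thread("B", priority=1, work=2, holds_lock=True)  # preempted while holding the lock
    c = Thread("C", priority=2, work=c_busy_ticks)        # busy-waits, never touches the lock
    threads = [a, b, c]

    def blocked(t: Thread) -> bool:
        # A cannot run while B still holds the lock A needs.
        return t.waits_for_lock and b.holds_lock

    def effective_priority(t: Thread) -> int:
        # Priority inheritance: the lock holder borrows the waiter's priority.
        if inherit and t is b and b.holds_lock and a.work > 0:
            return max(b.priority, a.priority)
        return t.priority

    tick = 0
    while a.work > 0:
        runnable = [t for t in threads if t.work > 0 and not blocked(t)]
        current = max(runnable, key=effective_priority)
        current.work -= 1
        tick += 1
        if current is b and b.work == 0:
            b.holds_lock = False   # critical section finished, lock released
    return tick


if __name__ == "__main__":
    print("without inheritance:", ticks_until_a_finishes(inherit=False))  # 53
    print("with inheritance:   ", ticks_until_a_finishes(inherit=True))   # 3
```

Without inheritance, A's delay grows with however long C keeps spinning (`c_busy_ticks`), which is the non-determinism described above. With inheritance, it is bounded by the remainder of B's critical section.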
>
> So those are the kinds of examples I spend a lot of my time thinking about. The puzzle for me isn't at the lower levels -- RDMA and the routers already have reasonable options. The puzzle is that the software can introduce tons of non-determinism even at the very lowest kernel or container layers, more or less in the NIC itself or in the driver, or perhaps in memory management and thread scheduling.
>
> I could actually give more examples that relate to interactions between devices: networks plus DMA into a frame buffer for a video or GPU, for example (in this case the real issue is barriers: how do you know if the cache and other internal pipelines of that device flushed when the transfer into its memory occurred? Turns out that there is no hardware standard for this, and it might not always be a sure thing). If they use a sledgehammer solution, like a bus reset (available with RDMA), that's going to have a BIG impact on perceived network determinism... yet it actually is an end-host "issue", not a network issue.

Nothing to challenge here. On the scheduler part I only want to add (you certainly know this better than I do) that there are quite nifty software techniques that literally eradicate at least the priority inversion problem. Speaking only for Linux, there is quite some movement at the moment on scheduler amendments for network processing. I am not sure whether vendors of embedded versions, or the RT patch, haven't already made the problem extinct. That is not the point you're making, though.

Further, it would still leave the clock source as an introducer of non-determinism.

A quite valuable research endeavor would be to quantify all of those traits and compare them, or make them comparable, to the characteristics of certain other approaches, e.g., the quite promising Linux kernel busy polling [1] mechanisms. All of them suffer from similar weaknesses; the remaining question is which ones.
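
To make the "quantify those traits" idea a bit more concrete for the clock source, here is a minimal user-space sketch that records the gaps between back-to-back reads of the monotonic clock and summarizes their spread. It is an assumption-laden proxy, not a proper measurement methodology: the function names and sample count are invented for illustration, and the numbers include Python interpreter overhead and scheduling noise, not just clock-source behavior.

```python
import statistics
import time


def sample_clock_deltas(samples: int = 10_000) -> list[int]:
    """Return nanosecond gaps between consecutive monotonic clock reads."""
    deltas = []
    previous = time.monotonic_ns()
    for _ in range(samples):
        now = time.monotonic_ns()
        deltas.append(now - previous)
        previous = now
    return deltas


def summarize(deltas: list[int]) -> dict[str, float]:
    """Report the spread of the gaps; a long tail hints at jitter."""
    ordered = sorted(deltas)
    return {
        "min_ns": ordered[0],
        "median_ns": statistics.median(ordered),
        "p99_ns": ordered[int(0.99 * (len(ordered) - 1))],
        "max_ns": ordered[-1],
    }


if __name__ == "__main__":
    info = time.get_clock_info("monotonic")
    print("implementation:", info.implementation, "resolution:", info.resolution)
    print(summarize(sample_clock_deltas()))
```

Comparing the median against the 99th percentile and the maximum across different machines or kernel configurations would give a first, coarse idea of how much jitter is visible from user space.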

I am saying that a little hastily, without having clearly thought through its feasibility for the time being.

[1] https://www.netdevconf.org/2.1/papers/BusyPollingNextGen.pdf

-- 
Best regards

Matthias Tafelmeier