From: msokolov@ivan.Harhan.ORG (Michael Spacefalcon)
Date: Sun, 14 Oct 2012 03:41:57 GMT
Message-Id: <1210140341.AA29907@ivan.Harhan.ORG>
To: bloat@lists.bufferbloat.NET
Subject: [Bloat] Designer of a new HW gadget wishes to avoid bufferbloat

Hello esteemed anti-bufferbloat folks,

I am designing a new networking-related *hardware* gadget, and I wish to design it in such a way that it won't be guilty of bufferbloat. I am posting on this mailing list to solicit some buffering- and bloat-related design advice.

The HW gadget I am designing will be an improved-performance successor to this OSHW design:

http://ifctfvax.Harhan.ORG/OpenWAN/OSDCU/

The device targets a vanishingly small audience of those few wretched souls who are still voluntarily using SDSL, i.e., deliberately paying more per month for less bandwidth. (I am one of those wretched souls, and my reasons have to do with a very precious non-portable IPv4 address block assignment that is inseparably tied to its associated 384 kbps SDSL circuit.)

What my current OSDCU board does (the new one is intended to do the exact same thing, but better) is convert SDSL to V.35/HDLC. My own SDSL line (the one with the precious IPv4 block) is served via a Nokia D50 DSLAM operated by what used to be Covad, and to the best of my knowledge the same holds for all other still-remaining SDSL lines in the USA-occupied territories, now that the last CM DSLAM operator has bitten the dust.

The unfortunate thing about the Nokia/Covad flavor of SDSL is that the bit stream sent toward the CPE (and expected from the CPE in return) is that abomination called ATM. Hence my hardware device is essentially a converter between ATM cells on the SDSL side and HDLC packets on the V.35 side. On my current OSDCU board the conversion is mediated by the CPU, which has to handle every packet and manage its reassembly from, or chopping into, ATM cells. The performance sucks, unfortunately.

I am now designing a new version in which the entire Layer 2 conversion function will be implemented in a single FPGA. The CPU will stay out of the data path, and the FPGA will contain two independent and autonomous logic functions: HDLC->SDSL and SDSL->HDLC bit stream reformatters.

The SDSL->HDLC direction involves no bufferbloat issues: I can set things up so that no received packet ever has to be dropped, and the greatest latency that may be experienced by any packet is the HDLC-side (DSU->DTE router) transmission time of the longest packet size allowed by the static configuration - and I can statically prove that both conditions I've just stated will be satisfied given a rather small buffer of only M+1 ATM cells, where M is the maximum packet size set by the static configuration, translated into ATM cells. (For IPv4 packets of up to 1500 octets, including the IPv4 header, using the standard RFC 1483 encapsulation, M=32.)
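For reference, here is the cell arithmetic behind that M figure, written out as a throwaway Python sketch rather than as FPGA logic. I'm taking "standard RFC 1483 encapsulation" to mean an 8-octet LLC/SNAP header on top of AAL5, whose trailer adds another 8 octets before the packet is padded up to whole cells:

    import math

    LLC_SNAP = 8       # RFC 1483 LLC/SNAP header, octets (my reading of
                       # "standard RFC 1483 encapsulation")
    AAL5_TRAILER = 8   # AAL5 CPCS-PDU trailer, octets
    CELL_PAYLOAD = 48  # octets of AAL5 payload carried per ATM cell

    def cells_per_packet(ip_len):
        # AAL5 pads the encapsulated packet to a whole number of cells
        return math.ceil((ip_len + LLC_SNAP + AAL5_TRAILER) / CELL_PAYLOAD)

    M = cells_per_packet(1500)   # -> 32
    print("M =", M, "cells; SDSL->HDLC buffer =", M + 1, "cells")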
However, the HDLC->SDSL direction is the tricky one in terms of bufferbloat issues, and that's the one I am soliciting advice for. Unlike the SDSL->HDLC direction, HDLC->SDSL can't be designed in such a way that no packets will ever have to be dropped.

Aside from the infamous cell tax (the Nokia SDSL frame structure imposes 6 octets of overhead, including both cell headers and SDSL-specific crud, for every 48 octets of payload), which is data-independent, the ATM creep imposes some data-dependent overhead: the padding of every AAL5 packet to the next-up multiple of 48 octets, and the RFC 1483 headers and trailers, which are longer than their Frame Relay counterparts on the HDLC/V.35 side of the DSU. Both of the latter need to be viewed as data-dependent overhead because both are incurred per packet, rather than per octet of bulk payload, and thus penalize small packets more than large ones.

Just to clarify, I can set the bit rate on the V.35 side to whatever I want (a trivial programmable clock divider in the FPGA), and I can set different bit rates for the DSU->router and router->DSU directions. (Setting the bit rate for the DSU->router direction to at least the SDSL bit rate times 1.07 is part of the trick for ensuring that the SDSL->HDLC direction can never overflow its tiny buffer.)

Strictly speaking, one could set the bit rate for the router->DSU direction of the V.35 interface so low that no matter what the router sends, that packet stream will always fit on the SDSL side without a packet ever having to be dropped. However, the worst-case expansion in the HDLC->SDSL direction is so high (in one hypothetical case I've considered, UDP packets with 5 octets of payload, such that each IPv4 packet is 33 octets long, the RFC 1490->1483 expansion is 2.4x *before* the cell tax!) that setting the clock so slow that even a continuous HDLC line-rate stream of worst-case packets will fit is not a serious proposition.

Thus I have to design the HDLC->SDSL logic function in the FPGA with the expectation that the packet stream it receives from the HDLC side may exceed the line capacity on the SDSL side, and because the attached V.35 router "has the right" to send a continuous line-rate stream of such packets, a no-drop policy would require an infinite buffer in the DSU. Whatever finite buffer size I implement, my logic will have to be prepared for the possibility of that buffer filling up, and will need a policy for dropping packets.

What I am soliciting from the bufferbloat-experienced minds of this list is some advice on the sizing of my HDLC->SDSL buffer and on the choice of the packet dropping policy.

Because the smallest indivisible unit of transmission on the SDSL side (the output side of the HDLC->SDSL logic function in question) is one ATM cell (48 octets of payload + 6 octets of overhead, averaged over the rigidly repeating SDSL frame structure), one sensible way to structure the buffer would be to provide enough FPGA RAM resources to hold a certain number of ATM cells, call it N. Wire it up as a ring buffer, such that the HDLC Rx side adds ATM cells at the tail, while the SDSL Tx side takes ATM cells from the head. With this design the simplest packet drop policy would be in the form of a latency limit: a configurable register in the FPGA would set the maximum allowed latency in ATM cells, call it L.
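To make sure I'm describing this design unambiguously, here is a rough software model of it - purely illustrative Python, not the actual FPGA logic, with the same assumed 16 octets of per-packet encapsulation overhead as above:

    import math

    CELL_PAYLOAD = 48        # octets of AAL5 payload per ATM cell
    ENCAP_OVERHEAD = 8 + 8   # assumed LLC/SNAP header + AAL5 trailer, octets

    class CellRing:
        """Ring of N ATM cells shared by HDLC Rx (tail) and SDSL Tx (head)."""

        def __init__(self, n_cells, latency_limit_cells):
            self.N = n_cells               # physical capacity of the ring
            self.L = latency_limit_cells   # configurable latency limit
            self.queued = 0                # cells currently awaiting SDSL Tx

        def packet_arrives(self, ip_len):
            """Drop decision taken as the packet starts arriving from V.35.
            Returns True if the packet is accepted."""
            if self.queued > self.L:
                return False               # tail drop: latency limit exceeded
            cells = math.ceil((ip_len + ENCAP_OVERHEAD) / CELL_PAYLOAD)
            # N >= L + M guarantees an accepted packet always fits
            self.queued += cells
            return True

        def sdsl_cell_sent(self):
            """Called once per cell time by the SDSL Tx side."""
            if self.queued:
                self.queued -= 1

(In the real logic, "queued" would of course just be the distance between the ring's tail and head pointers.)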
At the beginning of each incoming packet, the HDLC Rx logic would check the number of ATM cells queued up in the buffer, waiting for SDSL Tx: if that number exceeds L, drop the incoming packet; otherwise accept it, adding more cells to the tail of the queue as the bits trickle in from V.35. The constraint on L is that L+M (the max packet size in ATM cells) must never exceed N (the number of cells the HW is capable of storing).

If I choose the design just described, I know what M is (32 for the standard IPv4 usage), and L would be a configuration parameter, but N affects the HW design, i.e., I need to know how many FPGA RAM blocks to reserve. And because I need N >= L+M, in order to decide on the N for my HW design, I need to have some idea of what a reasonable value for L would be.

L is the maximum allowed HDLC->SDSL packet latency measured in ATM cells, which directly translates into milliseconds for each given SDSL kbps tier, of which there are only five: 192, 384, 768, 1152 and 1536. At 384 kbps, one ATM cell (which has to be reckoned as 54 octets rather than 53 because of Nokia SDSL) is 1.125 ms; scale accordingly for the other tiers. A packet of 1500 octets (32 ATM cells) will take 36 ms to transmit - or just 9 ms at the top SDSL tier of 1536 kbps.

With the logic design proposed above, the HDLC->SDSL latency of every packet (from the moment the V.35 router starts transmitting that packet on the HDLC interface to the moment its first cell starts Tx on the physical SDSL pipe) will be exactly known to the logic in the FPGA the moment the packet begins to arrive from the V.35 port: it will simply be equal to the number of ATM cells in the Tx queue at that moment. My proposed logic design will drop the packet if that latency measure exceeds a set threshold, or let it through otherwise.

My questions to the list are:

a) would it be a good packet drop policy, or not?

b) if it is a good policy, what would be a reasonable value for the latency threshold L? (In ATM cells or in ms, I can convert :)

The only major downside I can see with the approach I've just outlined is that it is a tail drop. I've heard it said in the bufferbloat community that tail drop is bad and head drop is better. However, implementing head drop or any other policy besides tail drop with the HW logic design outlined above would be very difficult: if the buffer is physically structured as a queue of ATM cells, rather than packets, then deleting a packet from the middle of the queue becomes quite a challenge. (It does no good to abort the transmission of a packet already started, hence head drop effectively becomes middle drop in terms of ATM cells.)

Another approach I have considered (actually my first idea, before I came up with the ring-of-cells buffer idea above) is to have a more old-fashioned correspondence of 1 buffer = 1 packet. Size each buffer in the HW for the expected max number of cells M (e.g., a 2 KiB HW RAM block would allow M<=42), and have some fixed number of these packet buffers, say, 2, 4 or 8. Each buffer would have a "fill level" register associated with it, giving the number of ready-to-Tx cells in it, so the SDSL Tx block can still begin transmitting a packet before it has been fully received from HDLC Rx. (In the very unlikely case that SDSL Tx is faster than HDLC Rx, SDSL Tx can always put idle cells in the middle of a packet, which ATM allows.)
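In the same rough software-model terms (again just illustrative Python; the pool size of 4 and all the names are arbitrary), this buffer-pool alternative would look something like:

    CELLS_PER_BUFFER = 42          # e.g. one 2 KiB RAM block: 2048 // 48 = 42

    class PacketBuffer:
        def __init__(self):
            self.cells_filled = 0  # the per-buffer "fill level" register

    class BufferPool:
        """A small fixed pool of per-packet buffers, say 2, 4 or 8 of them."""

        def __init__(self, n_buffers=4):
            self.free = [PacketBuffer() for _ in range(n_buffers)]
            self.txq = []          # txq[0] = packet whose SDSL Tx is in progress

        def packet_arrives(self):
            """Grab a buffer for a packet arriving from HDLC Rx;
            returning None means the incoming packet has to be dropped."""
            if not self.free:
                return None
            buf = self.free.pop()
            self.txq.append(buf)
            return buf

        def drop_after_head(self):
            """Head drop turned middle drop: discard the whole buffer
            immediately behind the one already being transmitted."""
            if len(self.txq) > 1:
                victim = self.txq.pop(1)
                victim.cells_filled = 0
                self.free.append(victim)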
Advantage over the ring-of-cells approach: head drop turned middle drop becomes easy: simply drop the complete buffer right after the head (the one whose Tx is already in progress).

Disadvantage: less of a direct relationship between the packet drop policy and the latency equivalent of the ATM cells buffered up for Tx.

Which approach would the bufferbloat experts here recommend?

TIA for reading my ramblings and for any technical advice,

Michael Spacefalcon,
retro-telecom nut