Date: Fri, 26 Oct 2012 18:54:56 GMT
From: msokolov@ivan.Harhan.ORG (Michael Spacefalcon)
Message-Id: <1210261854.AA17675@ivan.Harhan.ORG>
To: bloat@lists.bufferbloat.NET
Cc: opensdsl@ifctfvax.Harhan.ORG
Subject: Re: [Bloat] Designer of a new HW gadget wishes to avoid bufferbloat

Albert Rafetseder wrote:

> Hi Michael,
> Not that I'm an expert of any means on the topics you touch, but I'll
> share my point of view on some of the questions raised.

Yay, someone was interested enough to respond!

> Please excuse my
> aggressive shortening of your original post.

No problem. :-)

> > http://ifctfvax.Harhan.ORG/OpenWAN/OSDCU/
> The site appears down,

It was indeed down for about 31 h, from about 2012-10-21T09:11 to about
2012-10-22T16:28 GMT.  It is back up now, but on a *very* slow
connection: the 384 kbps pipe it is supposed to be on is still down, and
I've had to activate the emergency dial backup mechanism, which is a
31200 bps analog modem connection.  (Unfortunately, my switchover
mechanism is manual and an incredible PITA, which tends to extend the
downtime.)

> but from your description I think I understand
> what you are building.

OK. :-)

> You must also rule out ATM cell loss and reordering. Otherwise, there is
> too little data in your receive buffer to reassemble the transmitted
> frame (temporarily with reordering, terminally with loss). This calls
> for a timeout of sorts.

Ahh, I guess I need to clarify something.  Basically, there are two
kinds of ATM users:

1. People who use ATM because they actually like it, and extol its
   supposed virtue of allowing the cell streams on different VCs to be
   interleaved, such that a cell on one VC can be sent in the middle of
   a long packet on another VC.  Of course these are the people who gave
   us this ATM abomination in the first place, but I don't know if any
   of them are still alive and still uphold those beliefs.

2. People who use ATM because that's what comes down the pipe from the
   service provider, hence they have to deal with it whether they like
   it or not - usually the latter.  I strongly suspect that the vast
   majority of ATM users today are in this category.  The service
   providers themselves (*cough* Covad *cough*) might be in this boat as
   well, with the like-it-or-not condition imposed by their vast
   sunk-cost investment in their nationwide access network, which is
   100% ATM.

The issue of reordering is relevant only when there are 2 or more VCs
on the same physical circuit, and I have yet to encounter an xDSL
circuit of any flavor that is set up that way.  OK, it's very likely
that the people from category 1 above had their circuits set up that
way, but once again, I don't know if any of those people are still
alive.  All xDSL/ATM circuits I've ever used, worked on, or encountered
in any other way belong to category 2, and have only one VC at some
fixed VPI/VCI (0/38 for Covad SDSL).
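(Since everything below keys off the cell header, here is a minimal
Verilog sketch, with module and signal names of my own invention, of how
the FPGA side might match that one fixed VPI/VCI and pick out the AAL5
"terminal cell" bit from the 5-octet UNI header - a sketch under stated
assumptions, not my actual design:)

  // Parse a UNI cell header latched into a 40-bit register,
  // first octet off the wire in hdr[39:32].
  module cell_hdr_parse (
      input  wire [39:0] hdr,
      output wire [7:0]  vpi,
      output wire [15:0] vci,
      output wire        pvc_match, // the one PVC we care about: 0/38
      output wire        user_cell, // PTI MSB clear => user data, not OAM
      output wire        last_cell  // AAL5 end-of-packet: PTI LSB set
  );
      assign vpi       = hdr[35:28];   // UNI format: GFC sits in hdr[39:36]
      assign vci       = hdr[27:12];
      assign pvc_match = (vpi == 8'd0) && (vci == 16'd38);
      assign user_cell = ~hdr[11];     // PTI[2] == 0 for user data cells
      assign last_cell = hdr[9];       // PTI[0] == 1 only on the terminal
                                       // cell (meaningful when user_cell)
  endmodule

A nice side effect: idle/unassigned cells carry VCI 0, so they fail the
pvc_match filter automatically and never touch the buffer.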
When there is only one PVC on the underlying physical circuit, there is
no possibility of cell reordering or any other supposed benefits of ATM:
every packet is sent as a series of cells in order from start to finish,
in exactly the same order in which it would have been transmitted over
an Ethernet or HDLC (Frame Relay) medium, and the ATM cell structure
does absolutely nothing except waste the circuit's bit bandwidth.

The basic/default version of the FPGA logic function on my BlitzDSU will
support the conversion of SDSL/ATM to HDLC for the dominant
configuration of only one PVC.  If anyone ever wishes to use the device
on an SDSL/ATM circuit with 2 or more active VCs, well, it's an FPGA and
not an ASIC: we can just load a different logic function for those
special-case users.

Cell loss: it can't be detected directly, and it manifests itself in the
reassembled AAL5 packets appearing to be corrupt (bad CRC-32 at the
end).  If the first cell or some middle cell of a packet gets lost, only
that one packet is lost as a result.  If the last cell of a packet gets
lost, the overall IP-over-ATM link loses two packets: the one whose last
cell got lost, and the next one, which appears to be a continuation of
the previous one.  So that's another way in which ATM on an xDSL circuit
is worse than Ethernet or HDLC.

Oversize packets: the design of my Layer 2 converter thoroughly assumes
that the maximum size of a valid packet in ATM cells is a static
configuration parameter fixed at provisioning time.  In my original post
I called this configuration parameter M; for classic IPv4 packets of up
to 1500 octets in one of the standard encapsulations, M=32.  This limit
on the number of ATM cells per AAL5 packet will be strictly enforced by
the L2 converter.  As ATM cells arrive from the SDSL side, they'll be
stored in the SDSL->HDLC buffer, and a count register will be
incremented.  If that count register reaches M on a cell whose "terminal
cell" header bit isn't set (meaning that more cells are coming), the
packet will be declared invalid for the reason of being oversize, an
error stats counter will be incremented, and the buffer write pointer
will be reset, discarding the previously accepted cells of that oversize
packet.  (A sketch of this check appears a few paragraphs below.)  An
oversize packet can appear for one of two reasons: either
misconfiguration on the other end (perhaps miscommunication between the
user and the service provider as to what the MTU is or should be), or,
more likely, the result of cell loss "merging" two valid packets into
one bogon packet.

Reassembly timeout: yes, it would be a good measure, as it might reduce
the needless packet loss resulting from cell-loss-induced "merging".  If
the last cell of a packet gets lost, then the line goes quiescent for a
while, then another packet begins, the reassembly timeout might prevent
the second good packet from being interpreted as a continuation of the
corrupt one and therefore lost as well.  But it won't help prevent
exactly the same loss scenario if the second good packet follows
immediately (or shortly) after the corrupt one on a busy pipe.

ATM sucks.  Majorly.  The DSLAM vendors could have ameliorated the
problem somewhat by keeping ATM for the backhaul links behind the
DSLAMs, but converting to HDLC (doing FRF.8) before sending the bits to
the individual subscribers.  Copper Mountain did it, but the 3 CM DSL
network operators I know of (NorthPoint, Rhythms, DSL.net) have all
bitten the dust now (in the listed order).  All we are left with is what
used to be Covad, now MegaPath: Nokia D50 DSLAMs, ATM cells all the way
to the CPE.
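Here is the oversize-packet check promised above, again as an
illustrative Verilog sketch.  The "discarding" resync behavior (skipping
the remaining cells of the bogon packet until its terminal cell) is my
own assumption about what a complete design would do; the CRC-32 check
and the commit of a finished packet to the drain side are assumed to
live elsewhere:

  module oversize_guard #(
      parameter M  = 32,           // max cells per packet (1500-octet MTU)
      parameter AW = 9             // cell-buffer address width
  )(
      input  wire          clk,
      input  wire          rst,
      input  wire          cell_strobe,   // one pulse per accepted cell
      input  wire          last_cell,     // "terminal cell" bit of that cell
      output reg  [AW-1:0] wr_ptr,        // cell-buffer write pointer
      output reg  [15:0]   oversize_errs  // error stats counter
  );
      reg [AW-1:0] pkt_start;   // wr_ptr saved at the start of each packet
      reg [5:0]    cell_cnt;    // cells accepted so far for this packet
      reg          discarding;  // assumed resync: skip rest of bogon packet

      always @(posedge clk) begin
          if (rst) begin
              wr_ptr <= 0; pkt_start <= 0; cell_cnt <= 0;
              oversize_errs <= 0; discarding <= 0;
          end else if (cell_strobe) begin
              if (discarding) begin
                  if (last_cell) discarding <= 0;  // resync at packet end
              end else if (cell_cnt == M-1 && !last_cell) begin
                  // count reached M on a non-terminal cell: oversize
                  wr_ptr        <= pkt_start;      // drop cells taken so far
                  cell_cnt      <= 0;
                  oversize_errs <= oversize_errs + 1;
                  discarding    <= 1;
              end else if (last_cell) begin
                  wr_ptr    <= wr_ptr + 1;         // packet complete;
                  pkt_start <= wr_ptr + 1;         // CRC-32 verdict elsewhere
                  cell_cnt  <= 0;
              end else begin
                  wr_ptr   <= wr_ptr + 1;
                  cell_cnt <= cell_cnt + 1;
              end
          end
      end
  endmodule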
Oh, and for all the "regular" DSL users out there: ADSL, if I'm not
mistaken, specifies ATM cells as part of the standard, so if you have
regular ADSL, no matter which ISP or DSLAM brand, you are getting
exactly the same nonsense as I get from Covad/Nokia SDSL: a single PVC
carrying your IP packets as sequences of ATM cells, with all the same
issues.

> My colleague tells me his Linux boxes (3.5/x64, 2.6.32/x86) have an
> ip.ipfrag_time of 30 seconds. Anyway, it's lots of cells to buffer, I
> suppose.

In my design the AAL5 reassembly timeout does not affect the size of the
SDSL->HDLC buffer in any way.  There is only one PVC, cells get written
into the buffer as they arrive, at whatever time that happens, and the
buffer is allowed to empty out onto the HDLC interface when a cell
arrives with the "terminal cell" header bit set and the CRC-32 check at
the end of that cell passes.  This design will work with a finite buffer
(only M+1 cells of buffer space needed to guarantee zero possibility of
out-of-buffer packet loss) even if the reassembly timeout is set to
infinity: if the reassembly timer is ticking, that means we are getting
idle cells from the line in the middle of a packet reassembly, and those
don't need to be stored.

Latency considerations: I've entertained the possibility of sending the
bits to the HDLC side as they arrive from the SDSL side, without waiting
to buffer the whole packet, but that would require doing data-dependent
clock gating on the V.35 Rx side, and I find the idea of data-dependent
clock gating to be a little too evil for me.  So at least for the
initial version, I'll have my SDSL->HDLC logic function start pushing a
packet out on the HDLC side only when that packet has been buffered up
in its entirety.  The greatest added latency that any packet may
experience (from the moment the last bit of that packet has arrived from
the SDSL transceiver chip to the moment the attached V.35 router
receives the last bit of the transformed packet) equals the time it
would take to transmit the largest allowed packet (M cells) on the V.35
Rx link, at whatever bit rate that link has been configured for.

Choice of bit rate for the V.35 Rx link: it will be a programmable clock
divider, totally independent of the SDSL physical layer bit clock.  The
guarantee of zero out-of-buffer packet loss under every possible
condition requires that the V.35 Rx link bit rate be set to no less than
SDSL_bit_rate*1.2*48/53 (a little under x1.09) for the minimum buffer
size of M+1, or no less than SDSL_bit_rate*1.2*8/9 (a little under
x1.07) for the minimum buffer size of M+8.  The 1.2 factor accounts for
the possibility of HDLC's worst-case expansion (if someone were to send
big packets filled with FFs, or all 1s), and the difference between
x48/53 vs. x8/9 changing the minimum buffer size requirement has to do
with how the ATM cells are packed into the Nokia SDSL frame structure.

But the above calculations give only the minimum required bit rate for
the V.35 Rx link; it can be set higher to reduce the latency.  Most V.35
routers have been built for use with T1/E1 circuits, and should have no
problem comfortably handling bit rates up to 2 Mbps or at least
1.5 Mbps.  My current OSDCU (the feeble CPU-mediated version of the
Layer 2 converter I'm planning to implement in FPGA logic) serves a
384 kbps circuit, but I have the V.35 Rx bit rate set to ~1562 kbps: a
25 MHz clock divided by 16.
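That divider is trivial in FPGA logic.  A minimal sketch, assuming an
even divisor of at least 2 and a 50% duty cycle output (names mine, not
from any actual design): for a 384 kbps circuit the floor is
384 * 1.2 * 48/53 ~= 418 kbps, so 25 MHz / 16 = 1562.5 kbps clears it
with plenty of room.

  module v35_clk_div #(
      parameter W = 8
  )(
      input  wire         clk25,    // 25 MHz reference
      input  wire [W-1:0] div,      // programmable divisor, e.g. 16
      output reg          v35_clk   // V.35 Rx bit clock = 25 MHz / div
  );
      reg [W-1:0] cnt;
      initial begin cnt = 0; v35_clk = 0; end  // FPGA power-up init

      always @(posedge clk25) begin
          if (cnt == (div >> 1) - 1) begin
              cnt     <= 0;
              v35_clk <= ~v35_clk;  // toggle every div/2 cycles: period div
          end else
              cnt <= cnt + 1;
      end
  endmodule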
And if someone does want to try the data-dependent clock gating trick
(sending the bits to the HDLC side before the complete packet has been
received), well, it's an FPGA, not an ASIC - go ahead and try it!

[moving on to the HDLC->SDSL direction]

> Canceling packets halfway through transmission makes no sense.

Of course, and I'm not suggesting doing that.  The only time that would
happen with my FPGA-based HDLC->SDSL logic function is if a corrupt
packet arrives on the V.35 port.  Because ATM allows idle cells in the
middle of a packet, we can send the first cell out on the SDSL side as
soon as we've received the first 48 octets of the packet payload from
the HDLC side, without waiting for the end of the packet, then apply the
same logic to the next cell, and so on.  But the HDLC FCS (CRC-16) check
happens at the end of the packet being received from the V.35 Tx side.
What do we do if that CRC check fails, or we receive an HDLC abort
sequence (7 or more ones)?  If we've already started transmitting that
packet on the SDSL side, we can "cancel" it by sending what's called an
AAL5 abort: a cell with the "terminal cell" header bit set, and 0000 in
the 16-bit packet length field in the AAL5 trailer.  It won't recover
the bandwidth wasted sending that invalid packet, but it will tell the
router on the other end to discard it, rather than turn a corrupt packet
into one claiming to be good.  The assumption is that the end user is
able to ensure that such errors won't happen in normal operation: a V.35
cable connects two pieces of equipment sitting right next to each other
in the same room, totally under the end user's control, so there should
be no reason for errors on that V.35 cable.

> I think
> there are many good arguments for head-drop, and while I'm not a
> hardware engineer, I don't see why it would be terribly difficult to
> implement. Suppose you have a list of starts of packets addresses within
> your cell transmit ring buffer. Drop the first list element. Head drop!
> Done! Once the current packet's cells are transmitted, you jump to the
> next start of packet in the list, wherever it is. As long as you ensure
> that packets you receive on the HDLC side don't overwrite the cells you
> are currently transmitting, you are fine. (If the buffer was really
> small, this probably meant lots of head drop.)

Let me try to understand what you are suggesting.  It seems to me that
you are suggesting having two separate elastic buffers between the fill
side and the drain side: one storing cells, the other storing
one-element-per-packet information (such as a buffer start address).
You implement head drop by dropping the head element from the buffer
that reckons in packets, rather than cells.  But the cells are still
there, still taking up memory in the cell buffer, and that cell storage
memory can't be used to buffer up a new ingress packet, because that
storage sits in the middle, between the cells of a packet in the middle
of transmission and the next packet which is still "officially" in the
queue.

I can see how the head drop mechanism you are suggesting could perhaps
be used as a secondary AQM-style packet dropper, but it can't be the
primary mechanism which ensures that the fill logic doesn't overwrite
the cell buffer RAM which the drain side is still reading from.  I was
talking about the latter in my original post when I said that I couldn't
do anything other than tail drop.

But I've come up with an enhancement to the tail drop which we could
call "modified tail drop" - is there a better, more widely accepted term
perhaps?  Here's what I mean: when the fill logic sees the beginning of
a new packet on the HDLC side, it checks the buffer fill level.  If the
limit has been reached, instead of simply proceeding to skip/ignore the
new HDLC ingress packet, drop the packet which has been most recently
added to the cell buffer (reset the write pointer back to a saved
position), and proceed to buffer up the new packet normally.  (A sketch
of this rule follows below.)
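A minimal Verilog sketch of that fill-side rule, with illustrative names
and an arbitrary LIMIT value; as long as LIMIT exceeds the largest
packet size M, the rewound region can never overlap the packet the drain
side is currently transmitting:

  module mod_tail_drop #(
      parameter AW    = 9,         // 2^9 slots -> 511 usable cells
      parameter LIMIT = 400        // fill threshold, in cells; LIMIT > M
  )(
      input  wire          clk,
      input  wire          rst,
      input  wire          pkt_start,  // first octet of a new HDLC packet
      input  wire          cell_wr,    // one pulse per 48-octet cell written
      input  wire [AW-1:0] rd_ptr,     // from the SDSL Tx drain logic
      output reg  [AW-1:0] wr_ptr,
      output reg  [15:0]   drops       // stats: packets displaced by new ones
  );
      reg [AW-1:0] pkt_base;           // saved wr_ptr at newest packet start
      wire [AW-1:0] fill = wr_ptr - rd_ptr;  // ring occupancy, in cells

      always @(posedge clk) begin
          if (rst) begin
              wr_ptr <= 0; pkt_base <= 0; drops <= 0;
          end else if (pkt_start) begin
              if (fill >= LIMIT) begin
                  wr_ptr <= pkt_base;  // drop the newest queued packet;
                  drops  <= drops + 1; // new arrival takes its place
              end else
                  pkt_base <= wr_ptr;  // remember where this packet begins
          end else if (cell_wr)
              wr_ptr <= wr_ptr + 1;
      end
  endmodule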
> As regards values for the maximum latency L, this might be a too
> restrictive way of thinking of queue lengths. Usually, you'd like to
> accommodate for short bursts of packets if you are sure you can get the
> packets out in reasonable time, but don't want a long "standing queue"
> to develop. You might have read over at ACM Queue [1] that the CoDel
> queue management algorithm starts dropping packets if the time packets
> are enqueued exceeds a threshold. This doesn't mean that the queue
> length has a hard upper limit, though

In a hardware implementation, every queue length has a hard upper limit:
the physical size of the RAM block allocated for that queue.  I suppose
that a software implementation can have "infinite" queues, dynamically
adding all the available gigabytes of system RAM to them, but in
hardware-based solutions the physical RAM resources are almost always
hard-allocated to specific functions at design time.  Internal RAM in an
FPGA has a much higher cost per bit than ordinary computer RAM.  While
PC memories are now measured in gigabytes, the RAM blocks in low-cost
FPGAs (Altera Cyclone or Xilinx Spartan) still measure in kilobytes.

The main purpose of my original inquiry was to gauge what FPGA part I
should select for my DSU board design, in terms of the available buffer
RAM capacity.  Right now I'm leaning toward the Cyclone III family from
Altera.  Because I have a very low I/O pin count (only a handful of
serial interfaces leaving the FPGA), and I want the FPGA to be
physically small, inexpensive, and assembly-friendly (i.e., no BGAs), I
am only looking at the lowest-end parts of each family in the TQFP-144
packages.  (144 pins is way overkill for what I need, but that's the
smallest package available in the FPGA families of interest to me.)

What draws me to the Cyclone III family is that even the lowest-end
parts feature 46 RAM blocks of 9 Kibits each (1 KiByte + parity), i.e.,
up to 46 KiB of RAM capacity.  That's huge for a low-cost FPGA.  If I
allocate 26 of those M9K blocks for the HDLC->SDSL cell buffer (24
blocks for the cell payload buffer and 2 blocks for the parallel buffer
storing cell header data for the SDSL Tx logic), we can buffer up to 511
ATM cells.  (For implementation convenience, the buffer size in cells
must be a power of 2 minus 1.)  That corresponds to ~575 ms at 384 kbps
or ~144 ms at 1536 kbps.  Do you think that would be enough, or do you
think we need even more RAM for the "good queue" function?

I have not yet gone through the hassle of installing a new version of
Quartus (Altera's FPGA compiler) that supports Cyclone III - the version
I'm using right now only supports up to Cyclone II, and that's what I'm
using for my preliminary Verilog development.  With Cyclone II the RAM
blocks are only 4096 bits each, and there are only 36 of them in the
EP2C8 device (the biggest Cyclone II available in TQFP-144), so that's
only 18 KiB of internal RAM.  Taking up 13 of those M4K blocks for the
HDLC->SDSL function would give us 127 cells; 26 blocks would give us 255
cells.
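To make the block-count arithmetic easy to check, here it is as
compile-time constants in a throwaway Verilog module.  The 12+1 and 24+2
payload/header splits for the M4K cases are my assumption, mirroring the
stated 24+2 split for the M9K case:

  module buf_size_check;
      localparam M9K_BYTES  = 1024;  // 9 Kibit block: 1 KiByte + parity
      localparam M4K_BYTES  = 512;   // 4096-bit Cyclone II block
      localparam CELL_BYTES = 48;    // AAL5 payload octets per cell

      // payload blocks only; header data lives in the parallel buffer
      localparam C3_SLOTS    = (24 * M9K_BYTES) / CELL_BYTES;  // 512
      localparam C2_13_SLOTS = (12 * M4K_BYTES) / CELL_BYTES;  // 128
      localparam C2_26_SLOTS = (24 * M4K_BYTES) / CELL_BYTES;  // 256

      // one slot is given up so a power-of-2 ring can tell full from
      // empty, hence the "power of 2 minus 1" usable cell counts
      initial begin
          $display("Cyclone III, 26 M9K: %0d cells", C3_SLOTS - 1);    // 511
          $display("Cyclone II,  13 M4K: %0d cells", C2_13_SLOTS - 1); // 127
          $display("Cyclone II,  26 M4K: %0d cells", C2_26_SLOTS - 1); // 255
          // 511 cells * 53 octets * 8 bits / 384 kbps ~= 0.56 s of raw
          // cell time, in the ballpark of the ~575 ms figure above once
          // SDSL framing overhead is counted
      end
  endmodule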
In terms of the cost of the actual parts (in hobbyist single-piece
quantities from Digi-Key), there is no significant difference between
Cyclone II and Cyclone III.  As the extra RAM and logic capacity can't
hurt (we don't have to use them if we don't need/want them), I'm hoping
to be able to use Cyclone III, unless I run into an obstacle somewhere.

> -- it will try to keep the queue
> short (as dropping packets signals "back off!" to the end hosts
> communicating across the link) with some elasticity towards longer
> enqueueing times. Maybe you'd like to be the pioneer acid-testing
> CoDel's implementability in silicon.
> [1] http://queue.acm.org/detail.cfm?id=2209336

In that paper they say: "If the buffer is full when a packet arrives,
then the packet can be dropped as usual."  So it seems that the
mechanism that prevents system malfunction as a result of the writer
overrunning the reader is still essentially tail drop (or my "modified
tail drop"), whereas CoDel serves as an add-on that artificially drops
some of the queued-up packets.  Thus it seems to me that a sensible way
to proceed would be to build my hardware first, using the simplistic
buffer limit mechanism I described in my original post, and then try to
add CoDel later, once the basic functionality has been brought up.  Or
do you think this approach would be problematic?

> Does any of this make sense to you?

Yes, it does.  But my application is quite different from what the CoDel
folks had in mind:

1. They have put a lot of emphasis on variable-bandwidth links that may
   be subject to degradation.  In the case of SDSL the bandwidth is
   fixed by the user's budget, and all SDSL circuits I've ever used have
   been 100% error-free at the physical layer.  (That doesn't mean they
   don't go down, though - but every single time my IP-over-SDSL circuit
   has crapped out, it has always been a high-level problem on the ISP's
   side, never a physical layer failure or degradation.  My current
   downtime is no exception.)

2. I work on a different bandwidth scale.  They talk a lot about packets
   crossing from 1 Gbps or 100 Mbps to 10 Mbps, etc., but the highest
   possible SDSL rate is 2320 kbps, and even that is only on paper: the
   line cards actually deployed USA-wide by what used to be Covad max
   out at 1536 kbps.  We still have a bandwidth-reduction bottleneck, as
   the V.35 Tx link (router->DSU) has a higher capacity than SDSL/ATM,
   but it isn't *that* much higher.

3. The CoDel (and bufferbloat/AQM) folks in general seem to view the
   arrival and departure of packets as atomic, instantaneous events.
   While that may be true for a software router which receives packets
   from or hands them over to a hardware Ethernet interface in an
   instantaneous, atomic manner, the model absolutely does not hold for
   the underlying HW device itself as it reformats a stream of packets
   from HDLC to SDSL/ATM in a flow-through fashion, without waiting for
   each packet to be received in its entirety.

So I think I'll need to do some investigation and experimentation of my
own to apply these ideas to my peculiar circumstances.  If I can use an
FPGA part with plenty of RAM and logic capacity to spare after I've
implemented the basic minimum, then that should facilitate such later
investigation and experimentation.

SF