[Bloat] Designer of a new HW gadget wishes to avoid bufferbloat

Michael Spacefalcon msokolov at ivan.Harhan.ORG
Fri Oct 26 14:54:56 EDT 2012


Albert Rafetseder <albert.rafetseder+bufferbloat at univie.ac.at> wrote:

> Hi Michael,
> Not that I'm an expert of any means on the topics you touch, but I'll
> share my point of view on some of the questions raised.

Yay, someone was interested enough to respond!

> Please excuse my
> aggressive shortening of your original post.

No problem. :-)

> > http://ifctfvax.Harhan.ORG/OpenWAN/OSDCU/
> The site appears down,

It was indeed down for about 31 h, from about 2012-10-21T09:11 to
about 2012-10-22T16:28 GMT.  It is back up now, but on a *very* slow
connection - the 384 kbps pipe it is supposed to be on is still down,
and I've had to activate the emergency dial backup mechanism.  The
latter is a 31200 bps analog modem connection.

(Unfortunately, my switchover mechanism is manual and an incredible
 pita, which tends to extend the downtime.)

> but from your description I think I understand
> what you are building.

OK. :-)

> You must also rule out ATM cell loss and reordering. Otherwise, there is
> too little data in your receive buffer to reassemble the transmitted
> frame (temporarily with reordering, terminally with loss). This calls
> for a timeout of sorts.

Ahh, I guess I need to clarify something.  Basically, there are two
kinds of ATM users:

1. People who use ATM because they actually like it, and extol its
   supposed virtue of allowing the cell streams on different VCs to be
   interleaved, such that a cell on one VC can be sent in the middle
   of a long packet on another VC.  Of course these are the people who
   gave us this ATM abomination in the first place, but I don't know
   if any of them are still alive and still uphold those beliefs.

2. People who use ATM because that's what comes down the pipe from the
   service provider, hence they have to deal with it whether they like
   it or not, usually the latter.  I strongly suspect that the vast
   majority of ATM users today are in this category.  The service
   providers themselves (*cough* Covad *cough*) might be in this boat
   as well, with the like-it-or-not imposed condition being their vast
   sunk-cost investment in their nationwide access network, which is
   100% ATM.

The issue of reordering is relevant only when there are 2 or more VCs
on the same physical circuit, and I have yet to encounter an xDSL
circuit of any flavor that is set up that way.  OK, it's very likely
that the people from category 1 above had their circuits set up that
way, but once again, I don't know if any of those people are still
alive.

All xDSL/ATM circuits I've ever used, worked on, or encountered in any
other way belong to category 2 ATM users, and have only one VC at some
fixed VPI/VCI (0/38 for Covad SDSL).  When there is only one PVC on
the underlying physical circuit, there is no possibility of cell
reordering or any other supposed benefits of ATM: every packet is sent
as a series of cells in order from start to finish, in exactly the
same order in which it would have been transmitted over an Ethernet or
HDLC (Frame Relay) medium, and the ATM cell structure does absolutely
nothing except waste the circuit's bit bandwidth.

The basic/default version of the FPGA logic function on my BlitzDSU
will support the conversion of SDSL/ATM to HDLC for the dominant
configuration of only one PVC.  If anyone ever wishes to use the
device on an SDSL/ATM circuit with 2 or more active VCs, well, it's an
FPGA and not an ASIC: we can just load a different logic function for
those special-case users.

Cell loss: it can't be detected directly, and it manifests itself in
the reassembled AAL5 packets appearing to be corrupt (bad CRC-32 at
the end).  If the first cell or some middle cell of a packet gets
lost, only that one packet is lost as a result.  If the last cell of a
packet gets lost, the overall IP-over-ATM link loses two packets: the
one whose last cell got lost, and the next one, which appears to be a
continuation of the previous one.  So that's another way in which ATM
on an xDSL circuit is worse than Ethernet or HDLC.

Oversize packets: the design of my Layer 2 converter assumes
throughout that the maximum size of a valid packet in ATM cells is a
static configuration parameter fixed at provisioning time.  In my
original post I called this configuration parameter M; for classic
IPv4 packets of up to 1500 octets in one of the standard
encapsulations, M=32.  This limit on the number of ATM cells per AAL5
packet will be strictly enforced by the L2 converter.  As ATM cells
arrive from the SDSL side, they'll be stored in the SDSL->HDLC buffer,
and a count register will be incremented.  If that count register
reaches M on a cell whose "terminal cell" header bit isn't set
(meaning that more cells are coming), the packet will be declared
invalid for the reason of being oversize, an error stats counter will
be incremented, and the buffer write pointer will be reset, discarding
the previously accepted cells of that oversize packet.
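
To make that concrete, here is a rough C model of what the fill side
of the SDSL->HDLC buffer is meant to do per incoming cell.  (The real
thing will be Verilog; every name below is made up for illustration,
the HDLC drain side is stubbed out, and idle/unassigned cells are
assumed to have been filtered out before this point.)

/* Rough C model of the per-cell fill logic on the SDSL->HDLC side. */
#include <stdio.h>
#include <string.h>

#define M        32               /* max cells per valid AAL5 packet */
#define BUF_SIZE 64                /* cell slots; at least M+1, see below */

static unsigned char buf[BUF_SIZE][48];  /* cell payload storage */
static unsigned wr_ptr, pkt_start;       /* write ptr, start of current packet */
static unsigned cell_count;              /* cells accepted for current packet */
static unsigned long oversize_errors;    /* error stats counter */

static void release_packet_to_hdlc(unsigned start, unsigned ncells)
{
    /* stub: in hardware this lets the HDLC Tx side drain the packet */
    printf("packet of %u cells released at slot %u\n", ncells, start);
}

void on_cell_from_sdsl(const unsigned char payload[48],
                       int terminal, int crc32_ok)
{
    memcpy(buf[wr_ptr], payload, 48);
    wr_ptr = (wr_ptr + 1) % BUF_SIZE;
    cell_count++;

    if (terminal) {                  /* "terminal cell" header bit set */
        if (crc32_ok)
            release_packet_to_hdlc(pkt_start, cell_count);
        else
            wr_ptr = pkt_start;      /* bad AAL5 CRC-32: discard the packet */
        cell_count = 0;
        pkt_start = wr_ptr;
    } else if (cell_count >= M) {    /* oversize: more cells still coming */
        oversize_errors++;
        wr_ptr = pkt_start;          /* discard the cells accepted so far */
        cell_count = 0;
        /* what to do with the remaining cells of the oversize packet is
           a separate question, not covered in this sketch */
    }
}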

An oversize packet can appear for one of two reasons: either
misconfiguration on the other end (perhaps miscommunication between
the user and the service provider as to what the MTU is or should be),
or more likely, the result of cell loss "merging" two valid packets
into one bogon packet.

Reassembly timeout: yes, it would be a good measure, as it might
reduce the needless packet loss resulting from cell loss-induced
"merging".  If the last cell of a packet gets lost and the line then
goes quiescent for a while before another packet begins, the
reassembly timeout might prevent that second, good packet from being
interpreted as a continuation of the corrupt one and therefore being
lost as well.  But it
won't help prevent exactly the same loss scenario if the second good
packet follows immediately (or shortly) after the corrupt one on a
busy pipe.

ATM sucks.  Majorly.  The DSLAM vendors could have ameliorated the
problem somewhat by keeping ATM for the backhaul links behind the
DSLAMs, but converting to HDLC (doing FRF.8) before sending the bits
to the individual subscribers.  Copper Mountain did it, but the 3 CM
DSL network operators I know of (NorthPoint, Rhythms, DSL.net) have
all bitten the dust now (in the listed order).  All we are left with is
what used to be Covad, now MegaPath.  Nokia D50 DSLAMs, ATM cells all
the way to the CPE.

Oh, and for all the "regular" DSL users out there: ADSL, if I'm not
mistaken, specifies ATM cells as part of the standard, so if you have
regular ADSL, no matter which ISP or DSLAM brand, you are getting
exactly the same nonsense as I get from Covad/Nokia SDSL: a single PVC
carrying your IP packets as sequences of ATM cells, with all the same
issues.

> My colleague tells me his Linux boxes (3.5/x64, 2.6.32/x86) have an
> ip.ipfrag_time of 30 seconds. Anyway, it's lots of cells to buffer, I
> suppose.

In my design the AAL5 reassembly timeout does not affect the size of
the SDSL->HDLC buffer in any way.  There is only one PVC, cells get
written into the buffer as they arrive, at whatever time that happens,
and the buffer is allowed to empty out onto the HDLC interface when a
cell arrives with the "terminal cell" header bit set, and the CRC-32
check at the end of that cell passes.  This design will work with a
finite buffer (only M+1 cells of buffer space needed to guarantee zero
possibility of out-of-buffer packet loss) even if the reassembly
timeout is set to infinity: if the reassembly timer is ticking, that
means we are getting idle cells from the line in the middle of a packet
reassembly, and those don't need to be stored.

Latency considerations: I've entertained the possibility of sending
the bits to the HDLC side as they arrive from the SDSL side, without
waiting to buffer the whole packet, but that would require doing data-
dependent clock gating on the V.35 Rx side, and I find the idea of
data-dependent clock gating to be a little too evil for me.  So at
least for the initial version, I'll have my SDSL->HDLC logic function
start pushing a packet out on the HDLC side only when that packet has
been buffered up in its entirety.  The greatest added latency that any
packet may experience (from the moment the last bit of that packet has
arrived from the SDSL transceiver chip to the moment the attached V.35
router receives the last bit of the transformed packet) equals the
time it would take to transmit the largest allowed packet (M cells) on
the V.35 Rx link, at whatever bit rate that link has been configured
for.
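
To put a rough number on that (my own back-of-the-envelope figure,
ignoring HDLC flags, FCS and bit stuffing): with M=32 the largest
packet carries about 32*48 = 1536 octets = 12288 bits, so at the
~1562 kbps V.35 Rx rate I mention below, the added latency tops out
somewhere around 12288/1562500 s, i.e., roughly 8 ms; a slower V.35 Rx
clock scales that bound up proportionally.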

Choice of bit rate for the V.35 Rx link: it will be a programmable
clock divider, totally independent of the SDSL physical layer bit
clock.  The guarantee of zero out-of-buffer packet loss under every
possible condition requires that the V.35 Rx link bit rate be set to
no less than SDSL_bit_rate*1.2*48/53 (a little under x1.09) for the
minimum buffer size of M+1, or no less than SDSL_bit_rate*1.2*8/9 (a
little under x1.07) for the minimum buffer size of M+8.  The 1.2
factor accounts for the possibility of HDLC's worst-case expansion (if
someone were to send big packets filled with FFs, i.e., all 1s), and
the difference between the x48/53 and x8/9 factors (and their
corresponding minimum buffer sizes) has to do with how the ATM cells
are packed into the Nokia SDSL frame structure.
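
In code form (plain C, just to pin down the arithmetic; the function
and parameter names are mine):

/* Minimum V.35 Rx bit rate per the formulas above. */
#include <stdio.h>

static double min_v35_rx_rate(double sdsl_bps, int min_buffer_is_m_plus_1)
{
    /* 1.2 = HDLC worst-case bit-stuffing expansion (6/5); the 48/53 vs.
     * 8/9 factor goes with the M+1 vs. M+8 minimum buffer size and the
     * way cells are packed into the Nokia SDSL frame structure. */
    double factor = min_buffer_is_m_plus_1 ? 48.0 / 53.0 : 8.0 / 9.0;
    return sdsl_bps * 1.2 * factor;
}

int main(void)
{
    printf("%.0f\n", min_v35_rx_rate(384000.0, 1));  /* ~417328 bps (x1.087) */
    printf("%.0f\n", min_v35_rx_rate(384000.0, 0));  /* 409600 bps (x1.067) */
    return 0;
}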

But the above calculations show the minimum required bit rate for the
V.35 Rx link.  It can be set higher to reduce the latency.  Most V.35
routers have been built for use with T1/E1 circuits, and should have
no problem comfortably handling bit rates up to 2 Mbps or at least
1.5 Mbps.  My current OSDCU (the feeble CPU-mediated version of the
Layer 2 converter I'm planning to implement in FPGA logic) serves a
384 kbps circuit, but I have the V.35 Rx bit rate set to ~1562 kbps:
25 MHz clock divided by 16.

And if someone does want to try the data-dependent clock gating trick
(sending the bits to the HDLC side before the complete packet has been
received), well, it's an FPGA, not an ASIC - go ahead and try it!

[moving on to the HDLC->SDSL direction]

> Canceling packets halfway through transmission makes no sense.

Of course, I'm not suggesting doing that.  The only time that would
happen with my FPGA-based HDLC->SDSL logic function is if a corrupt
packet arrives on the V.35 port.  Because ATM allows idle cells in the
middle of a packet, we can send the first cell out on the SDSL side as
soon as we've received the first 48 octets of the packet payload from
the HDLC side, without waiting for the end of the packet, then apply
the same logic to the next cell, and so on.  But the HDLC FCS (CRC-16)
check happens at the end of the packet being received from the V.35 Tx
side.  What do we do if that CRC check fails, or we receive an HDLC
abort sequence (7 or more ones)?  If we've already started
transmitting that packet on the SDSL side, we can "cancel" it by
sending what's called an AAL5 abort: a cell with the "terminal cell"
header bit set, and 0000 in the 16-bit packet length field in the AAL5
trailer.  It won't recover the bandwidth wasted sending that invalid
packet, but it would tell the router on the other end to discard it,
rather than turn a corrupt packet into one claiming to be good.
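
In rough C terms (made-up names, with the actual cell generation
stubbed out), the end-of-frame decision on this path is just:

/* End-of-frame handling on the HDLC->SDSL path. */
#include <stdio.h>

static void send_final_cell(void)  { printf("normal terminal cell\n"); }
static void send_aal5_abort(void)  { printf("terminal cell, AAL5 length field = 0\n"); }
static void discard_pending(void)  { printf("frame dropped, nothing was sent\n"); }

void on_hdlc_frame_end(int fcs_ok, int hdlc_abort, int cells_already_sent)
{
    if (fcs_ok && !hdlc_abort)
        send_final_cell();          /* close out the AAL5 packet normally */
    else if (cells_already_sent > 0)
        send_aal5_abort();          /* tell the far end to discard the packet */
    else
        discard_pending();          /* nothing has gone out yet: just drop it */
}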

The assumption is that the end user is able to ensure that such errors
won't happen in normal operation.  A V.35 cable connects two pieces of
equipment sitting right next to each other in the same room, totally
under the end user's control, so there should be no reason for errors
on that V.35 cable.

> I think
> there are many good arguments for head-drop, and while I'm not a
> hardware engineer, I don't see why it would be terribly difficult to
> implement. Suppose you have a list of starts of packets addresses within
> your cell transmit ring buffer. Drop the first list element. Head drop!
> Done! Once the current packet's cells are transmitted, you jump to the
> next start of packet in the list, wherever it is. As long as you ensure
> that packets you receive on the HDLC side don't overwrite the cells you
> are currently transmitting, you are fine. (If the buffer was really
> small, this probably meant lots of head drop.)

Let me try to understand what you are suggesting.  It seems to me that
you are suggesting having two separate elastic buffers between the
fill side and the drain side: one storing cells, the other storing
one-element-per-packet information (such as a buffer start address).
You implement head drop by dropping the head element from the buffer
that reckons in packets, rather than cells.  But the cells are still
there, still taking up memory in the cell buffer, and that cell
storage memory can't be used to buffer up a new ingress packet because
that storage sits between the cells of the packet currently being
transmitted and the next packet that is still "officially" in the
queue.

But I can see how perhaps the head drop mechanism you are suggesting
could be used as a secondary AQM-style packet dropper - but it can't
be the primary mechanism which ensures that the fill logic doesn't
overwrite the cell buffer RAM which the drain side is still reading
from.  I was talking about the latter in my original post when I said
that I couldn't do anything other than tail drop.

But I've come up with an enhancement to tail drop which we could
call "modified tail drop" - is there a better, more widely accepted
term perhaps?  Here's what I mean: when the fill logic sees the
beginning of a new packet on the HDLC side, it checks the buffer fill
level.  If the limit has been reached, instead of simply proceeding to
skip/ignore the new HDLC ingress packet, drop the packet which has
been most recently added to the cell buffer (reset the write pointer
back to a saved position), and proceed to buffer up the new packet
normally.
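
In rough C terms (made-up names, and assuming the most recently queued
packet is never the one currently draining to the SDSL side), the
check at the start of each new ingress packet would be something like:

/* "Modified tail drop" at the start of a new HDLC ingress packet. */
#define FILL_LIMIT 480             /* configured queue limit, in cells */

static unsigned wr_ptr;            /* cell buffer write pointer */
static unsigned last_pkt_start;    /* wr_ptr value when the newest packet began */
static unsigned long drops;        /* drop stats counter */

static unsigned buffer_fill_cells(void)
{
    /* stub: in hardware this is (wr_ptr - rd_ptr) mod buffer size */
    return 0;
}

void on_new_hdlc_packet(void)
{
    if (buffer_fill_cells() >= FILL_LIMIT) {
        wr_ptr = last_pkt_start;   /* un-queue the most recently added packet */
        drops++;
    }
    last_pkt_start = wr_ptr;       /* remember where the new packet begins */
    /* ...and proceed to buffer up the new packet normally */
}

The point is that the drop decision remains a whole-packet affair taken
at a packet boundary, which is cheap for the fill logic to do.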

> As regards values for the maximum latency L, this might be a too
> restrictive way of thinking of queue lengths. Usually, you'd like to
> accommodate for short bursts of packets if you are sure you can get the
> packets out in reasonable time, but don't want a long "standing queue"
> to develop. You might have read over at ACM Queue [1] that the CoDel
> queue management algorithm starts dropping packets if the time packets
> are enqueued exceeds a threshold. This doesn't mean that the queue
> length has a hard upper limit, though

In a hardware implementation, every queue length has a hard upper
limit - the physical size of the RAM block allocated for that queue.
I suppose that a software implementation can have "infinite" queues,
dynamically adding all the available gigabytes of system RAM to them,
but in hardware-based solutions, the physical RAM resources are almost
always hard-allocated to specific functions at design time.

Internal RAM in an FPGA has a much higher cost per bit than ordinary
computer RAM.  While PC memories are now measured in gigabytes, the
RAM blocks in low-cost FPGAs (Altera Cyclone or Xilinx Spartan) still
measure in kilobytes.

The main purpose of my original inquiry was to gauge what FPGA part I
should select for my DSU board design, in terms of the available
buffer RAM capacity.  Right now I'm leaning toward the Cyclone III
family from Altera.  Because I have a very low I/O pin count (only a
handful of serial interfaces leaving the FPGA), and I want the FPGA to
be physically small, inexpensive, and assembly-friendly (i.e., no
BGAs), I am only looking at the lowest-end parts of each family in the
TQFP-144 packages.  (144 pins is way overkill for what I need, but
that's the smallest package available in the FPGA families of interest
to me.)

What draws me to the Cyclone III family is that even the lowest-end
parts feature 46 RAM blocks of 9 Kibits each (1 Kibyte + parity),
i.e., up to 46 KiB of RAM capacity.  That's huge for a low-cost FPGA.
If I allocate 26 of those M9K blocks for the HDLC->SDSL cell buffer
(24 blocks for the cell payload buffer and 2 blocks for the parallel
buffer storing cell header data for the SDSL Tx logic), we can buffer
up to 511 ATM cells.  (For implementation convenience, the buffer size
in cells needs to be a power of 2 minus 1.)  That corresponds to
~575 ms at 384 kbps or ~144 ms at 1536 kbps.  Do you think that would
be enough, or do you
think we need even more RAM for the "good queue" function?
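
(For the record, the 511-cell figure is straightforward arithmetic:
24 M9K blocks x 1024 octets = 24576 octets of payload RAM, divided by
48 payload octets per cell gives 512 cell slots, and the
power-of-2-minus-1 constraint brings the usable depth to 511.)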

I have not yet gone through the hassle of installing a new version of
Quartus (Altera's FPGA compiler) that supports Cyclone III - the
version I'm using right now only supports up to Cyclone II, and that's
what I'm using for my preliminary Verilog development.  With
Cyclone II the RAM blocks are only 4096 bits each, and there are only
36 of them in the EP2C8 device (the biggest Cyclone II available in
TQFP-144), so that's only 18 KiB of internal RAM.  Taking up 13 of
those M4K blocks for the HDLC->SDSL function would give us 127 cells;
26 blocks would give us 255 cells.

In terms of the cost of the actual parts (in hobbyist single-piece
quantities from Digi-Key), there is no significant difference between
Cyclone II and Cyclone III.  As the extra RAM and logic capacity can't
hurt (we don't have to use them if we don't need/want them), I'm
hoping to be able to use Cyclone III, unless I run into an obstacle
somewhere.

> -- it will try to keep the queue
> short (as dropping packets signals "back off!" to the end hosts
> communicating across the link) with some elasticity towards longer
> enqueueing times. Maybe you'd like to be the pioneer acid-testing
> CoDel's implementability in silicon.
> [1] http://queue.acm.org/detail.cfm?id=2209336

In that paper they say: "If the buffer is full when a packet arrives,
then the packet can be dropped as usual."  So it seems that the
mechanism that prevents system malfunction as a result of the writer
overrunning the reader is still essentially tail drop (or my "modified
tail drop"), whereas CoDel serves as an add-on that artificially drops
some of the queued-up packets.
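
Just to convince myself that the core idea is implementable at all,
here is a heavily simplified C sketch of the CoDel-style check (my own
reading of the paper, with made-up names, and omitting the square-root
drop-spacing control law): once each buffered packet carries an
enqueue timestamp, the drain side can ask, per packet, whether the
sojourn time has stayed above the target for a whole interval, while
the buffer-full / tail-drop path stays exactly as described above.

/* Heavily simplified CoDel-style drop check at dequeue time. */
#define TARGET_US    5000          /* 5 ms target sojourn time (per the paper) */
#define INTERVAL_US  100000        /* 100 ms interval (per the paper) */

static unsigned long first_above_time;  /* 0 = not currently above target */

int codel_should_drop(unsigned long enqueue_us, unsigned long now_us)
{
    unsigned long sojourn_us = now_us - enqueue_us;

    if (sojourn_us < TARGET_US) {
        first_above_time = 0;      /* back below target: reset the state */
        return 0;
    }
    if (first_above_time == 0) {
        first_above_time = now_us; /* start timing this excursion */
        return 0;
    }
    return (now_us - first_above_time) >= INTERVAL_US;
}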

Thus it seems to me that a sensible way to proceed would be to build
my hardware first, using the simplistic buffer limit mechanism I have
described in my original post, and then try to add CoDel later, once
the basic functionality has been brought up.  Or do you think this
approach would be problematic?

> Does any of this make sense to you?

Yes, it does.  But my application is quite different from what the
CoDel folks had in mind:

1. They have put a lot of emphasis into variable-bandwidth links that
   may be subject to degradation.  In the case of SDSL the bandwidth
   is fixed by the user's budget, and all SDSL circuits I've ever used
   have been 100% error-free at the physical layer.  (That doesn't
   mean they don't go down, though - but every single time my IP-over-
   SDSL circuit has crapped out, it has always been a high-level
   problem on the ISP's side, never a physical layer failure or
   degradation.  My current downtime is no exception.)

2. I work on a different bandwidth scale.  They talk a lot about
   packets crossing from 1 Gbps or 100 Mbps to 10 Mbps, etc, but the
   highest possible SDSL rate is 2320 kbps, and even that is only on
   paper: the line cards which are actually deployed USA-wide by what
   used to be Covad max out at 1536 kbps.  We still have a bandwidth-
   reduction bottleneck as the V.35 Tx link (router->DSU) has a higher
   capacity than SDSL/ATM, but it isn't *that* much higher bandwidth.

3. The CoDel (and bufferbloat/AQM) folks in general seem to view the
   arrival and departure of packets as atomic instantaneous events.
   While that model may hold for a software router which receives
   packets from, or hands them over to, a hardware Ethernet interface
   in an instantaneous atomic manner, it absolutely does not hold
   for the underlying HW device itself as it reformats a stream of
   packets from HDLC to SDSL/ATM in a flow-through fashion, without
   waiting for each packet to be received in its entirety.

So I think I'll need to do some investigation and experimentation of
my own to apply these ideas to my peculiar circumstances.  If I can
use an FPGA part with plenty of RAM and logic capacity to spare after
I've implemented the basic minimum, then that should facilitate such
later investigation and experimentation.

SF


