[Bloat] Designer of a new HW gadget wishes to avoid bufferbloat

Michael Spacefalcon msokolov at ivan.Harhan.ORG
Sat Oct 13 23:41:57 EDT 2012


Hello esteemed anti-bufferbloat folks,

I am designing a new networking-related *hardware* gadget, and I wish
to design it in such a way that won't be guilty of bufferbloat.  I am
posting on this mailing list in order to solicit some buffering and
bloat-related design advice.

The HW gadget I am designing will be an improved-performance successor
to this OSHW design:

http://ifctfvax.Harhan.ORG/OpenWAN/OSDCU/

The device targets a vanishingly small audience of those few wretched
souls who are still voluntarily using SDSL, i.e., deliberately paying
more per month for less bandwidth.  (I am one of those wretched souls,
and my reasons have to do with a very precious non-portable IPv4
address block assignment that is inseparably tied to its associated
384 kbps SDSL circuit.)

What my current OSDCU board does (the new one is intended to do the
exact same thing, but better) is convert SDSL to V.35/HDLC.  My own
SDSL line (the one with the precious IPv4 block) is served via a Nokia
D50 DSLAM operated by what used to be Covad, and to the best of my
knowledge the same holds for all other still-remaining SDSL lines in
the USA-occupied territories, now that the last CM DSLAM operator has
bitten the dust.  The unfortunate thing about the Nokia/Covad flavor of
SDSL is that the bit stream sent toward the CPE (and expected from the
CPE in return) is that abomination called ATM.  Hence my hardware
device is essentially a converter between ATM cells on the SDSL side
and HDLC packets on the V.35 side.

On my current OSDCU board the conversion is mediated by the CPU, which
has to handle every packet and manage its reassembly from or chopping
into ATM cells.  The performance sucks, unfortunately.  I am now
designing a new version in which the entire Layer 2 conversion
function will be implemented in a single FPGA.  The CPU will stay out
of the data path, and the FPGA will contain two independent and
autonomous logic functions: HDLC->SDSL and SDSL->HDLC bit stream
reformatters.

The SDSL->HDLC direction involves no bufferbloat issues: I can set
things up so that no received packet ever has to be dropped, and the
greatest latency that any packet may experience is the HDLC side
(DSU->DTE router) transmission time of the longest packet allowed by
the static configuration - and I can statically prove that
both conditions I've just stated will be satisfied given a rather
small buffer of only M+1 ATM cells, where M is the maximum packet size
set by the static configuration, translated into ATM cells.  (For IPv4
packets of up to 1500 octets, including the IPv4 header, using the
standard RFC 1483 encapsulation, M=32.)
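
To make the cell-count arithmetic explicit, here is a minimal sketch
in Python, assuming the usual 8-octet RFC 1483 LLC/SNAP header and the
8-octet AAL5 CPCS trailer (adjust the constants for other
encapsulations):

    LLC_SNAP = 8        # RFC 1483 LLC/SNAP header, octets
    AAL5_TRAILER = 8    # AAL5 CPCS trailer, octets
    CELL_PAYLOAD = 48   # payload octets per ATM cell

    def cells_per_packet(ip_len):
        """ATM cells needed to carry one IPv4 packet of ip_len octets."""
        cpcs = ip_len + LLC_SNAP + AAL5_TRAILER
        return -(-cpcs // CELL_PAYLOAD)   # ceiling division

    print(cells_per_packet(1500))   # -> 32, the M quoted above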

However, the HDLC->SDSL direction is the tricky one in terms of
bufferbloat issues, and that's the one I am soliciting advice for.
Unlike the SDSL->HDLC direction, HDLC->SDSL can't be designed in such
a way that no packets will ever have to be dropped.  Aside from the
infamous cell tax (the Nokia SDSL frame structure imposes 6 octets of
overhead, including both cell headers and SDSL-specific crud, for
every 48 octets of payload), which is data-independent, the ATM creep
imposes some data-dependent overhead: the padding of every AAL5 packet
to the next-up multiple of 48 octets, and the RFC 1483 headers and
trailers which are longer than their Frame Relay counterparts on the
HDLC/V.35 side of the DSU.  Both of the latter need to be viewed as
data-dependent overhead because both are incurred per packet, rather
than per octet of bulk payload, and thus penalize small packets more
than large ones.

Just to clarify, I can set the bit rate on the V.35 side to whatever
I want (put a trivial programmable clock divider in the FPGA), and I
can set different bit rates for the DSU->router and router->DSU
directions.  (Setting the bit rate for the DSU->router direction to at
least the SDSL bit rate times 1.07 is part of the trick for ensuring
that the SDSL->HDLC direction can never overflow its tiny buffer.)
Strictly speaking, one could set the bit rate for the router->DSU
direction of the V.35 interface so low that no matter what the router
sends, that packet stream will always fit on the SDSL side without a
packet ever having to be dropped.  However, because the worst case
expansion in the HDLC->SDSL direction is so high (in one hypothetical
case I've considered, UDP packets with 5 octets of payload, such that
each IPv4 packet is 33 octets long, the RFC 1490->1483 expansion is
2.4x *before* the cell tax!), setting the clock so slow that even a
continuous HDLC line rate stream of worst-case packets will fit is not
a serious proposition.
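
Continuing the little sketch from a couple of paragraphs back
(cells_per_packet and friends), and assuming a 4-octet RFC 1490 header
plus a 2-octet FCS and one flag per HDLC frame (those overhead figures
are assumptions on my part; the exact numbers don't change the
picture), the expansion works out like this:

    RFC1490_OVERHEAD = 4 + 2 + 1   # header + FCS + one flag, octets (assumed)

    def expansion(ip_len):
        hdlc_octets = ip_len + RFC1490_OVERHEAD
        sdsl_payload = cells_per_packet(ip_len) * CELL_PAYLOAD
        return sdsl_payload / hdlc_octets   # before the 54/48 cell tax

    print(round(expansion(33), 2))     # -> 2.4, the worst case quoted above
    print(round(expansion(1500), 2))   # -> ~1.02 for a full-size packet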

Thus I have to design the HDLC->SDSL logic function in the FPGA with
the expectation that the packet stream it receives from the HDLC side
may be such that it exceeds the line capacity on the SDSL side, and
because the attached V.35 router "has the right" to send a continuous
line rate stream of such packets, a no-drop policy would require an
infinite buffer in the DSU.  Whatever finite buffer size I implement,
my logic will have to be prepared for the possibility of that buffer
filling up, and must have a policy for dropping packets.  What I am
soliciting from the bufferbloat-experienced minds of this list is some
advice with the sizing of my HDLC->SDSL buffer and the choice of the
packet dropping policy.

Because the smallest indivisible unit of transmission on the SDSL side
(the output side of the HDLC->SDSL logic function in question) is one
ATM cell (48 octets of payload + 6 octets of overhead, averaged over
the rigidly repeating SDSL frame structure), one sensible way to
structure the buffer would be to provide enough FPGA RAM resources to
hold a certain number of ATM cells, call it N.  Wire it up as a ring
buffer, such that the HDLC Rx side adds ATM cells at the tail, while
the SDSL Tx side takes ATM cells from the head.  With this design the
simplest packet drop policy would be in the form of a latency limit: a
configurable register in the FPGA would set the maximum allowed
latency in ATM cells, call it L.  At the beginning of each incoming
packet, the HDLC Rx logic would check the number of ATM cells queued
up in the buffer, waiting for SDSL Tx: if that number exceeds L, drop
the incoming packet, otherwise accept it, adding more cells to the
tail of the queue as the bits trickle in from V.35.  The constraint
on L is that L+M (where M is the maximum packet size in ATM cells)
must never exceed N (the number of cells the HW is capable of
storing).
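
Just to pin the semantics down, here is a behavioral sketch of that
policy (a Python pseudo-model, not RTL; in the FPGA it amounts to a
comparator on the ring's occupancy counter):

    class CellRing:
        def __init__(self, n_cells, latency_limit):
            self.N = n_cells          # physical capacity, in ATM cells
            self.L = latency_limit    # configurable latency threshold, in cells
            self.queued = 0           # cells currently waiting for SDSL Tx

        def packet_arrives(self, cells_in_packet):
            """Drop/accept decision, taken as the packet starts arriving from V.35."""
            if self.queued > self.L:
                return "drop"                  # latency-threshold tail drop
            self.queued += cells_in_packet     # cells appended as bits trickle in
            assert self.queued <= self.N       # holds as long as N >= L + M
            return "accept"

        def cell_sent(self):
            """SDSL Tx has taken one cell from the head of the ring."""
            if self.queued:
                self.queued -= 1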

If I choose the design just described, I know what M is (32 for the
standard IPv4 usage), and L would be a configuration parameter, but N
affects the HW design, i.e., I need to know how many FPGA RAM blocks
I should reserve.  And because I need N >= L+M, in order to decide on
the N for my HW design, I need to have some idea of what would be a
reasonable value for L.
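
Purely to illustrate the sizing arithmetic (the value of L below is a
placeholder, not a recommendation - a reasonable L is exactly what I'm
asking about):

    M = 32        # max packet size in cells, standard IPv4 case from above
    L = 53        # hypothetical threshold, ~60 ms at 384 kbps
    N = L + M     # -> 85 cells, i.e. 85 * 48 = 4080 octets of cell-payload RAM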

L is the maximum allowed HDLC->SDSL packet latency measured in ATM
cells, which directly translates into milliseconds for each given SDSL
kbps tier, of which there are only 5: 192, 384, 768, 1152 and 1536.
At 384 kbps, one ATM cell (which has to be reckoned as 54 octets
rather than 53 because of Nokia SDSL) is 1.125 ms; scale accordingly
for other kbps tiers.  A packet of 1500 octets (32 ATM cells) will
take 36 ms to transmit - or just 9 ms at the top SDSL tier of 1536
kbps.  With the logic design proposed above, the HDLC->SDSL latency of
every packet (from the moment the V.35 router starts transmitting that
packet on the HDLC interface to the moment its first cell starts Tx on
the physical SDSL pipe) will be exactly known to the logic in the FPGA
the moment the packet starts arriving from the V.35 port: it
will be simply equal to the number of ATM cells in the Tx queue at
that moment.  My proposed logic design will drop the packet if that
latency measure exceeds a set threshold, or allow it through
otherwise.  My questions to the list are:

a) would it be a good packet drop policy, or not?

b) if it is a good policy, what would be a reasonable value for the
   latency threshold L?  (In ATM cells or in ms, I can convert :)
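
To help with (b), here is a trivial converter from a threshold in ATM
cells to milliseconds per SDSL tier, using the 54-octets-per-cell
figure from above:

    CELL_WIRE_OCTETS = 54   # 48 payload + 6 octets of Nokia SDSL overhead

    def cells_to_ms(cells, sdsl_kbps):
        # kbps is conveniently bits per millisecond
        return cells * CELL_WIRE_OCTETS * 8 / sdsl_kbps

    for kbps in (192, 384, 768, 1152, 1536):
        print(kbps, cells_to_ms(32, kbps))   # one 1500-octet packet: 72 .. 9 ms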

The only major downside I can see with the approach I've just outlined
is that it is a tail drop.  I've heard it said in the bufferbloat
community that tail drop is bad and head drop is better.  However,
implementing head drop or any other policy besides tail drop with the
HW logic design outlined above would be very difficult: if the buffer
is physically structured as a queue of ATM cells, rather than packets,
then deleting a packet from the middle of the queue (it does no good
to abort the transmission of a packet already started, hence head drop
effectively becomes middle drop in terms of ATM cells) becomes quite a
challenge.

Another approach I have considered (actually my first idea, before I
came up with the ring-of-cells buffer idea above) is to have a more
old-fashioned correspondence of 1 buffer = 1 packet.  Size each buffer
in the HW for the expected max number of cells M (e.g., a 2 KiB HW RAM
block would allow M<=42), and have some fixed number of these packet
buffers, say, 2, 4 or 8.  Each buffer would have a "fill level"
register associated with it, giving the number of ready-to-Tx cells in
it, so the SDSL Tx block can still begin transmitting a packet before
it's been fully received from HDLC Rx.  (In the very unlikely case
that SDSL Tx is faster than HDLC Rx, SDSL Tx can always put idle cells
in the middle of a packet, which ATM allows.)  Advantage over the
ring-of-cells approach: head-drop turned middle-drop becomes easy:
simply drop the complete buffer right after the head (the head being
the packet whose Tx is already in progress).  Disadvantage: less of a
direct relationship
between the packet drop policy and the latency equivalent of the
buffered-up ATM cells for Tx.
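
Here is a behavioral sketch (Python pseudo-model again, not RTL) of
that alternative, under one possible trigger for the drop - an
arriving packet finding all buffers occupied; whether that is the
right trigger is part of what I'm unsure about:

    class PacketBuffers:
        def __init__(self, n_buffers=4):
            self.n_buffers = n_buffers   # fixed number of per-packet buffers, >= 2
            self.queue = []              # index 0 = packet whose SDSL Tx is in progress

        def packet_arrives(self, cells_in_packet):
            if len(self.queue) == self.n_buffers:
                # Head drop turned middle drop: never abort the packet whose
                # Tx is already in progress (index 0), so free the buffer
                # right behind it instead.
                self.queue.pop(1)
            self.queue.append(cells_in_packet)   # fill level grows as bits arrive

        def packet_sent(self):
            # SDSL Tx finished the head packet; its buffer is free again.
            if self.queue:
                self.queue.pop(0)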

Which approach would the bufferbloat experts here recommend?

TIA for reading my ramblings and for any technical advice,

Michael Spacefalcon,
retro-telecom nut


