General list for discussing Bufferbloat
* [Bloat] Designer of a new HW gadget wishes to avoid bufferbloat
From: Michael Spacefalcon @ 2012-10-14  3:41 UTC (permalink / raw)
  To: bloat

Hello esteemed anti-bufferbloat folks,

I am designing a new networking-related *hardware* gadget, and I wish
to design it in such a way that won't be guilty of bufferbloat.  I am
posting on this mailing list in order to solicit some buffering and
bloat-related design advice.

The HW gadget I am designing will be an improved-performance successor
to this OSHW design:

http://ifctfvax.Harhan.ORG/OpenWAN/OSDCU/

The device targets a vanishingly small audience of those few wretched
souls who are still voluntarily using SDSL, i.e., deliberately paying
more per month for less bandwidth.  (I am one of those wretched souls,
and my reasons have to do with a very precious non-portable IPv4
address block assignment that is inseparably tied to its associated
384 kbps SDSL circuit.)

What my current OSDCU board does (the new one is intended to do the
exact same thing, but better) is convert SDSL to V.35/HDLC.  My own
SDSL line (the one with the precious IPv4 block) is served via a Nokia
D50 DSLAM operated by what used to be Covad, and to the best of my
knowledge the same holds for all other still-remaining SDSL lines in
the USA-occupied territories, now that the last CM DSLAM operator has
bitten the dust.  The unfortunate thing about the Nokia/Covad flavor of
SDSL is that the bit stream sent toward the CPE (and expected from the
CPE in return) is that abomination called ATM.  Hence my hardware
device is essentially a converter between ATM cells on the SDSL side
and HDLC packets on the V.35 side.

On my current OSDCU board the conversion is mediated by the CPU, which
has to handle every packet and manage its reassembly from or chopping
into ATM cells.  The performance sucks, unfortunately.  I am now
designing a new version in which the entire Layer 2 conversion
function will be implemented in a single FPGA.  The CPU will stay out
of the data path, and the FPGA will contain two independent and
autonomous logic functions: HDLC->SDSL and SDSL->HDLC bit stream
reformatters.

The SDSL->HDLC direction involves no bufferbloat issues: I can set
things up so that no received packet ever has to be dropped, and the
greatest latency that may be experienced by any packet is the HDLC
side (DSU->DTE router) transmission time of the longest packet size
allowed by the static configuration - and I can statically prove that
both conditions I've just stated will be satisfied given a rather
small buffer of only M+1 ATM cells, where M is the maximum packet size
set by the static configuration, translated into ATM cells.  (For IPv4
packets of up to 1500 octets, including the IPv4 header, using the
standard RFC 1483 encapsulation, M=32.)
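
(If you want to check that figure, the arithmetic is simple enough to
sketch in a few lines of Python; the 8-octet LLC/SNAP header and the
8-octet AAL5 trailer are the standard RFC 1483 routed-encapsulation
numbers:)

import math

def cells_per_packet(ip_octets, encap_hdr=8, aal5_trailer=8):
    # AAL5 pads the PDU so that its 8-octet trailer ends exactly on a
    # 48-octet cell boundary.
    return math.ceil((ip_octets + encap_hdr + aal5_trailer) / 48)

print(cells_per_packet(1500))   # -> 32, the M value quoted above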

However, the HDLC->SDSL direction is the tricky one in terms of
bufferbloat issues, and that's the one I am soliciting advice for.
Unlike the SDSL->HDLC direction, HDLC->SDSL can't be designed in such
a way that no packets will ever have to be dropped.  Aside from the
infamous cell tax (the Nokia SDSL frame structure imposes 6 octets of
overhead, including both cell headers and SDSL-specific crud, for
every 48 octets of payload), which is data-independent, the ATM creep
imposes some data-dependent overhead: the padding of every AAL5 packet
to the next-up multiple of 48 octets, and the RFC 1483 headers and
trailers which are longer than their Frame Relay counterparts on the
HDLC/V.35 side of the DSU.  Both of the latter need to be viewed as
data-dependent overhead because both are incurred per packet, rather
than per octet of bulk payload, and thus penalize small packets more
than large ones.

Just to clarify, I can set the bit rate on the V.35 side to whatever
I want (put a trivial programmable clock divider in the FPGA), and I
can set different bit rates for the DSU->router and router->DSU
directions.  (Setting the bit rate for the DSU->router direction to at
least the SDSL bit rate times 1.07 is part of the trick for ensuring
that the SDSL->HDLC direction can never overflow its tiny buffer.)
Strictly speaking, one could set the bit rate for the router->DSU
direction of the V.35 interface so low that no matter what the router
sends, that packet stream will always fit on the SDSL side without a
packet ever having to be dropped.  However, because the worst case
expansion in the HDLC->SDSL direction is so high (in one hypothetical
case I've considered, UDP packets with 5 octets of payload, such that
each IPv4 packet is 33 octets long, the RFC 1490->1483 expansion is
2.4x *before* the cell tax!), setting the clock so slow that even a
continuous HDLC line rate stream of worst-case packets will fit is not
a serious proposition.
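
(To show where that 2.4x figure comes from, here is the back of the
envelope in Python; the 7 octets of per-frame overhead assumed on the
V.35 side - 4 of RFC 1490 header, 2 of FCS, 1 shared flag - are
approximate, but any similar figure gives roughly the same ratio:)

import math

ip_octets = 33                       # 20 IP + 8 UDP + 5 octets of payload
hdlc_octets = ip_octets + 4 + 2 + 1  # assumed RFC 1490 overhead -> 40
cells = math.ceil((ip_octets + 8 + 8) / 48)  # RFC 1483 + AAL5 -> 2 cells
print(cells * 48 / hdlc_octets)      # -> 2.4, before the 6-per-48 cell tax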

Thus I have to design the HDLC->SDSL logic function in the FPGA with
the expectation that the packet stream it receives from the HDLC side
may be such that it exceeds the line capacity on the SDSL side, and
because the attached V.35 router "has the right" to send a continuous
line rate stream of such packets, a no-drop policy would require an
infinite buffer in the DSU.  Whatever finite buffer size I implement,
my logic will have to be prepared for the possibility of that buffer
filling up, and has to have a policy for dropping packets.  What I am
soliciting from the bufferbloat-experienced minds of this list is some
advice with the sizing of my HDLC->SDSL buffer and the choice of the
packet dropping policy.

Because the smallest indivisible unit of transmission on the SDSL side
(the output side of the HDLC->SDSL logic function in question) is one
ATM cell (48 octets of payload + 6 octets of overhead, averaged over
the rigidly repeating SDSL frame structure), one sensible way to
structure the buffer would be to provide enough FPGA RAM resources to
hold a certain number of ATM cells, call it N.  Wire it up as a ring
buffer, such that the HDLC Rx side adds ATM cells at the tail, while
the SDSL Tx side takes ATM cells from the head.  With this design the
simplest packet drop policy would be in the form of a latency limit: a
configurable register in the FPGA would set the maximum allowed
latency in ATM cells, call it L.  At the beginning of each incoming
packet, the HDLC Rx logic would check the number of ATM cells queued
up in the buffer, waiting for SDSL Tx: if that number exceeds L, drop
the incoming packet, otherwise accept it, adding more cells to the
tail of the queue as the bits trickle in from V.35.  The constraint
on L is that L+M (where M is the max packet size in ATM cells) must
never exceed N (the number of cells that the HW is capable of
storing).
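
In software terms, the accept/drop check behaves like this little
Python model (a behavioral sketch only - the real thing will be
Verilog registers and counters, and the names here are made up):

class CellRingModel:
    def __init__(self, n, l):
        self.n = n         # N: physical buffer capacity in ATM cells
        self.l = l         # L: max cells allowed queued at packet start
        self.queued = 0    # cells currently waiting for SDSL Tx

    def packet_starts(self, pkt_cells):
        # Checked once, as the first bits of a packet arrive from V.35.
        if self.queued > self.l:
            return False   # drop the whole incoming packet
        # Holds because N >= L+M, assuming pkt_cells <= M:
        assert self.queued + pkt_cells <= self.n
        self.queued += pkt_cells   # cells appended as the bits trickle in
        return True

    def sdsl_tx_cell(self):
        if self.queued:
            self.queued -= 1       # drain side takes cells from the head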

If I choose the design just described, I know what M is (32 for the
standard IPv4 usage), and L would be a configuration parameter, but N
affects the HW design, i.e., I need to know how many FPGA RAM blocks
I should reserve.  And because I need N >= L+M, in order to decide on
the N for my HW design, I need to have some idea of what would be a
reasonable value for L.

L is the maximum allowed HDLC->SDSL packet latency measured in ATM
cells, which directly translates into milliseconds for each given SDSL
kbps tier, of which there are only 5: 192, 384, 768, 1152 and 1536.
At 384 kbps, one ATM cell (which has to be reckoned as 54 octets
rather than 53 because of Nokia SDSL) is 1.125 ms; scale accordingly
for other kbps tiers.  A packet of 1500 octets (32 ATM cells) will
take 36 ms to transmit - or just 9 ms at the top SDSL tier of 1536
kbps.  With the logic design proposed above, the HDLC->SDSL latency of
every packet (from the moment the V.35 router starts transmitting that
packet on the HDLC interface to the moment its first cell starts Tx on
the physical SDSL pipe) will be exactly known to the logic in the FPGA
the moment the packet's start begins to arrive from the V.35 port: it
will be simply equal to the number of ATM cells in the Tx queue at
that moment.  My proposed logic design will drop the packet if that
latency measure exceeds a set threshold, or allow it through
otherwise.  My questions to the list are:

a) would it be a good packet drop policy, or not?

b) if it is a good policy, what would be a reasonable value for the
   latency threshold L?  (In ATM cells or in ms, I can convert :)
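
(For reference, the conversion both ways, using the 54-octets-per-cell
reckoning explained above - a trivial Python helper:)

def cells_to_ms(cells, kbps):
    return cells * 54 * 8 / kbps     # kbps is the same as bits per ms

def ms_to_cells(ms, kbps):
    return int(ms * kbps / (54 * 8))

print(cells_to_ms(32, 384))          # -> 36.0 ms, one max-size packet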

The only major downside I can see with the approach I've just outlined
is that it is a tail drop.  I've heard it said in the bufferbloat
community that tail drop is bad and head drop is better.  However,
implementing head drop or any other policy besides tail drop with the
HW logic design outlined above would be very difficult: if the buffer
is physically structured as a queue of ATM cells, rather than packets,
then deleting a packet from the middle of the queue (it does no good
to abort the transmission of a packet already started, hence head drop
effectively becomes middle drop in terms of ATM cells) becomes quite a
challenge.

Another approach I have considered (actually my first idea, before I
came up with the ring-of-cells buffer idea above) is to have a more
old-fashioned correspondence of 1 buffer = 1 packet.  Size each buffer
in the HW for the expected max number of cells M (e.g., a 2 KiB HW RAM
block would allow M<=42), and have some fixed number of these packet
buffers, say, 2, 4 or 8.  Each buffer would have a "fill level"
register associated with it, giving the number of ready-to-Tx cells in
it, so the SDSL Tx block can still begin transmitting a packet before
it's been fully received from HDLC Rx.  (In the very unlikely case
that SDSL Tx is faster than HDLC Rx, SDSL Tx can always put idle cells
in the middle of a packet, which ATM allows.)  Advantage over the
ring-of-cells approach: head-drop turned middle-drop becomes easy:
simply drop the complete buffer right after the head (the one whose Tx
is already in progress.)  Disadvantage: less of a direct relationship
between the packet drop policy and the latency equivalent of the
buffered-up ATM cells for Tx.
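
To make the middle-drop operation concrete, here is the same kind of
toy Python model for this second scheme (again only a sketch of the
intended HW behavior, with invented names):

from collections import deque

class PacketBufferModel:
    def __init__(self, n_buffers=4):
        self.free = n_buffers      # fixed pool of per-packet buffers
        self.queue = deque()       # packets awaiting SDSL Tx, head first

    def packet_starts(self, pkt_cells):
        if self.free == 0:
            return False           # no free buffer: drop the packet
        self.free -= 1
        self.queue.append(pkt_cells)
        return True

    def middle_drop(self):
        # Drop the complete buffer right behind the one whose Tx is
        # already in progress.
        if len(self.queue) > 1:
            del self.queue[1]
            self.free += 1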

Which approach would the bufferbloat experts here recommend?

TIA for reading my ramblings and for any technical advice,

Michael Spacefalcon,
retro-telecom nut


* Re: [Bloat] Designer of a new HW gadget wishes to avoid bufferbloat
From: Albert Rafetseder @ 2012-10-22 15:58 UTC (permalink / raw)
  To: bloat

Hi Michael,

Not that I'm an expert by any means on the topics you touch, but I'll share my point of view on some of the questions raised. Please excuse my aggressive shortening of your original post.

> http://ifctfvax.Harhan.ORG/OpenWAN/OSDCU/

The site appears down, but from your description I think I understand what you are building.

> The SDSL->HDLC direction involves no bufferbloat issues: I can set
> things up so that no received packet ever has to be dropped, and the
> greatest latency that may be experienced by any packet is the HDLC
> side (DSU->DTE router) transmission time of the longest packet size
> allowed by the static configuration - and I can statically prove that
> both conditions I've just stated will be satisfied given a rather
> small buffer of only M+1 ATM cells, where M is the maximum packet size
> set by the static configuration, translated into ATM cells.  (For IPv4
> packets of up to 1500 octets, including the IPv4 header, using the
> standard RFC 1483 encapsulation, M=32.)

You must also rule out ATM cell loss and reordering. Otherwise, there is too little data in your receive buffer to reassemble the transmitted frame (temporarily with reordering, terminally with loss). This calls for a timeout of sorts. Quoting RFC 791 "Internet Protocol":

"The current recommendation for the initial timer setting is 15 seconds."

My colleague tells me his Linux boxes (3.5/x64, 2.6.32/x86) have an ip.ipfrag_time of 30 seconds. Anyway, it's lots of cells to buffer, I suppose.

> Strictly speaking, one could set the bit rate for the router->DSU
> direction of the V.35 interface so low that no matter what the router
> sends, that packet stream will always fit on the SDSL side without a
> packet ever having to be dropped.  However, because the worst case
> expansion in the HDLC->SDSL direction is so high (in one hypothetical
> case I've considered, UDP packets with 5 octets of payload, such that
> each IPv4 packet is 33 octets long, the RFC 1490->1483 expansion is
> 2.4x *before* the cell tax!), setting the clock so slow that even a
> continuous HDLC line rate stream of worst-case packets will fit is not
> a serious proposition.

Setting the ingress rate into your board on the HDLC side that low also pushes the potential bloat issue into the router's transmit buffer, which might be even more difficult to control.

> L is the maximum allowed HDLC->SDSL packet latency measured in ATM
> cells, which directly translates into milliseconds for each given SDSL
> kbps tier [...]
>  My proposed logic design will drop the packet if that
> latency measure exceeds a set threshold, or allow it through
> otherwise.  My questions to the list are:
> 
> a) would it be a good packet drop policy, or not?
> 
> b) if it is a good policy, what would be a reasonable value for the
>   latency threshold L?  (In ATM cells or in ms, I can convert :)
> 
> The only major downside I can see with the approach I've just outlined
> is that it is a tail drop.  I've heard it said in the bufferbloat
> community that tail drop is bad and head drop is better.  However,
> implementing head drop or any other policy besides tail drop with the
> HW logic design outlined above would be very difficult: if the buffer
> is physically structured as a queue of ATM cells, rather than packets,
> then deleting a packet from the middle of the queue (it does no good
> to abort the transmission of a packet already started, hence head drop
> effectively becomes middle drop in terms of ATM cells) becomes quite a
> challenge.

Canceling packets halfway through transmission makes no sense. I think there are many good arguments for head-drop, and while I'm not a hardware engineer, I don't see why it would be terribly difficult to implement. Suppose you have a list of start-of-packet addresses within your cell transmit ring buffer. Drop the first list element. Head drop! Done! Once the current packet's cells are transmitted, you jump to the next start of packet in the list, wherever it is. As long as you ensure that packets you receive on the HDLC side don't overwrite the cells you are currently transmitting, you are fine. (If the buffer were really small, this would probably mean lots of head drop.)
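
In rough Python terms, I am thinking of something like this (just a
sketch, assuming the transmitter consults the list only at packet
boundaries):

from collections import deque

pkt_starts = deque()   # cell-ring addresses of queued, not-yet-started packets

def head_drop():
    # The dropped packet's cells stay in the ring; they are simply
    # skipped over when the transmitter jumps to the next start address.
    if pkt_starts:
        pkt_starts.popleft()

def jump_to_next_packet():
    # Called once the current packet's last cell has been transmitted.
    return pkt_starts.popleft() if pkt_starts else None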

As regards values for the maximum latency L, this might be too restrictive a way of thinking about queue lengths. Usually, you'd like to accommodate short bursts of packets if you are sure you can get the packets out in reasonable time, but don't want a long "standing queue" to develop. You might have read over at ACM Queue [1] that the CoDel queue management algorithm starts dropping packets if the time packets are enqueued exceeds a threshold. This doesn't mean that the queue length has a hard upper limit, though -- it will try to keep the queue short (as dropping packets signals "back off!" to the end hosts communicating across the link) with some elasticity towards longer enqueueing times. Maybe you'd like to be the pioneer acid-testing CoDel's implementability in silicon.

[1] http://queue.acm.org/detail.cfm?id=2209336
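
For a flavor of the control law, a grossly condensed Python sketch
(over-simplified; see [1] and the published pseudocode for the real
state machine, which also paces successive drops as
INTERVAL/sqrt(count)):

TARGET, INTERVAL = 5.0, 100.0    # ms, the values suggested by CoDel's authors

class CoDelSketch:
    def __init__(self):
        self.first_above = None  # when sojourn time first exceeded TARGET

    def on_dequeue(self, now_ms, sojourn_ms):
        # sojourn_ms: how long this packet sat in the queue.
        # Returns True if the packet should be dropped.
        if sojourn_ms < TARGET:
            self.first_above = None      # queue is draining: reset
            return False
        if self.first_above is None:
            self.first_above = now_ms + INTERVAL
            return False                 # above TARGET, but not yet for INTERVAL
        return now_ms >= self.first_above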

Does any of this make sense to you?

Cheers,
 Albert.

-- 
Albert Rafetseder BSc. MSc.
Member of Scientific Staff

University of Vienna
Faculty of Computer Science
Research Group Future Communication (Endowed by A1 Telekom Austria AG)
Waehringer Strasse 29/5.46, A-1090 Vienna, Austria

T +43-1-42777-8620
albert.rafetseder@univie.ac.at
http://www.cs.univie.ac.at/fc





* Re: [Bloat] Designer of a new HW gadget wishes to avoid bufferbloat
From: Michael Spacefalcon @ 2012-11-18  0:53 UTC (permalink / raw)
  To: bloat

Albert Rafetseder <albert.rafetseder+bufferbloat@univie.ac.at> wrote:

> To save you the additional ring buffer, the start-of-packet
> information could be stored as a length or number-of-cells field in
> front of the stream of cells of one packet.

It seems to me that the logic complexity of what you are suggesting
(at least when fitting it into the architecture of my implementation)
would be more than an additional ring buffer, and the total number of
RAM bits needed to store the necessary info (allowing for the worst
case of each ATM cell being its own packet) would be the same either
way.

In case someone here enjoys reading Verilog, you can see my current
work-in-progress in this public CVS repository:

$ cvs -d :pserver:anoncvs@ifctfvax.Harhan.ORG:/fs1/IFCTF-cvs co BlitzDSU

The HDLC_to_SDSL logic block is complete in that all logic is there to
produce an outgoing SDSL bit stream whose ultimate data source is the
incoming HDLC one, but there aren't any provisions for AQM yet, i.e.,
it is the simple first version in which the only queue control
mechanism is the "caudal" drop on the fill side of the ring buffer.
All logic is completely untested except that it passes through the
Quartus compiler to produce FPGA configuration bits.  I still need to
write the SDSL_to_HDLC logic, then hopefully move it from Cyclone II
to Cyclone III (i.e., fight through the process of getting a newer
version of Quartus running), then design and build the PCB for this
FPGA to go onto, and only then we'll be able to test any of this logic
for real...

Oh, and in the unlikely case that someone is actually bored enough to
look at my FPGA design in detail, one clarification is in order: the
design assumes that the SYSCLK frequency fed to the FPGA will be an
independent clock source that has no relation to the SDSL bit rate or
the clock arriving from the V.35/HDLC interface, i.e., SYSCLK will
always tick at its own steady rate no matter what happens in the data
path, which SDSL speed is in use, etc.  Furthermore, the logic assumes
that the SYSCLK frequency is significantly higher than the highest
supported data path speed - SDSL tops out at 2.3 Mbps, and I'm thinking
running SYSCLK at something like 40 MHz - plenty fast for my purposes,
and plenty slow for the Cyclone FPGAs.  This way SYSCLK becomes the
fast clock to which everything else can be synchronized inside the
FPGA.

> In a 46kibibit RAM,

Huh?  46 kibibit?  Where did that number come from?  The "46K" figure
must be from my mention of desiring to use a Cyclone III FPGA, but the
total RAM capacity in the latter parts is 46 kibibyte, not kibibit -
more specifically, 46 so-called M9K blocks, each of the latter holding
9 kibibit (including parity bits) or 1 kibibyte.

But that is the total RAM capacity of an FPGA part.  We won't be able
to devote all of it to the HDLC->SDSL cell ring buffer as there are
other needs: the HDLC->SDSL cell header buffer, a possible additional
buffer to allow the drain logic to skip packets, and several buffers
for the other logic block handling the SDSL->HDLC direction.  (The
latter can be set up to have no bottleneck, hence no bufferbloat
issues to discuss on this list, but it still needs a certain amount of
buffer RAM for the packet reassembly and debug capture features.)

Back to the actual usable capacity of the HDLC->SDSL cell ring buffer
we are discussing here, the version that is currently in CVS above and
compiles for Cyclone II has the tick-define set to size the buffer at
128 ATM cells of 48 octets each.  (The current design requires the
buffer capacity in cells to be a power of 2.)  This buffer, together
with a smaller companion that stores 32 bits of header info per cell,
for exactly the same number of cells, currently takes up 13 M4K blocks
of the Cyclone II fabric - that is 6.5 kibibyte.  One Ethernet MTU is
32 ATM cells, hence with a Cyclone II FPGA we can buffer up to 4
Ethernet MTUs.

But if I can move this design to a Cyclone III FPGA, I should be able
to bump the buffer size to 512 cells by changing one tick-define.  That
would take up 26 M9K blocks (CIII has M9K blocks, twice the size of
CII's M4K) out of the 46 available, leaving enough for the SDSL->HDLC
logic, and would hold 16 Ethernet MTUs worth of ATM cells.
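
The block-count arithmetic, for anyone checking along (reckoning an
M4K as 4096 usable bits and an M9K as 8192 usable bits plus parity):

for cells, block_bits, name in ((128, 4096, "Cyclone II M4K"),
                                (512, 8192, "Cyclone III M9K")):
    bits = cells * 48 * 8 + cells * 32   # cell payloads + 32-bit headers
    print(name, bits // block_bits, "blocks,", cells // 32, "Ethernet MTUs")
# -> Cyclone II M4K 13 blocks, 4 Ethernet MTUs
# -> Cyclone III M9K 26 blocks, 16 Ethernet MTUs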

> "Caudal" (near tail / tail-side) drop?

Thanks for the term - it now appears in my Verilog code in the name of
one of the states in the fill logic state machine. :-)

> I wonder if it suffices to head-drop more than one packet's worth of
> cells if required to keep the read and write pointers sufficiently
> far apart.

We'll have to revisit the whole AQM head drop logic design once we
have this stuff physically running on a board and get the rest of the
system working - not a small task...  I'm hoping that we'll be OK with
this late addition, as long as the FPGA part is sized with some logic
and RAM capacity to spare.

> With a buffer of less than four Ethernet MTUs, what's the worst case
> really?

See above regarding how many Ethernet MTUs we can buffer up using
different FPGA part choices.

SF


* Re: [Bloat] Designer of a new HW gadget wishes to avoid bufferbloat
From: Albert Rafetseder @ 2012-11-12  8:03 UTC (permalink / raw)
  To: bloat

> Yes, I know how ring buffers work. :-)

No offense meant :-)

>> The trick is now to push the read pointer forward a bit instead of
>> holding back the write pointer.
> 
> Can't do that: the ^r pointer will normally point at some cell in the
> middle of a packet being transmitted.  Jerking that pointer forward
> will cause us to transmit garbage consisting of partial packets, which
> is the last thing we want.

As it turns out from your answer to Jonathan, we were talking about the same (kind of) implementation, i.e. keep additional information on the start cells of packets, advance the read pointer systematically if needed, so as not to stop transmitting in the middle of packets, etc.

To save you the additional ring buffer, the start-of-packet information could be stored as a length or number-of-cells field in front of the stream of cells of one packet. In a 46kibibit RAM, this will consume 3-4 cells' worth of data as overhead.

> But we still have to have tail drop (or my
> "modified tail drop") as the backup packet drop mechanism that
> protects the integrity of the ring buffer.  I guess the trick is to
> design the AQM policy such that it does most of the job and the
> out-of-buffer packet drop mechanism is rarely exercised.

"Caudal" (near tail / tail-side) drop? I wonder if it suffices to head-drop more than one packet's worth of cells if required to keep the read and write pointers sufficiently far apart. With a buffer of less than four Ethernet MTUs, what's the worst case really?

I'll need to get myself an envelope to scribble on...
  Albert.
-- 
Albert Rafetseder BSc. MSc.
Member of Scientific Staff

University of Vienna
Faculty of Computer Science
Research Group Future Communication
(Endowed by A1 Telekom Austria AG)
Waehringer Strasse 29/5.46, A-1090 Vienna, Austria

T +43-1-42777-8620
albert.rafetseder@univie.ac.at
http://www.cs.univie.ac.at/fc


* Re: [Bloat] Designer of a new HW gadget wishes to avoid bufferbloat
From: Michael Spacefalcon @ 2012-10-27  2:49 UTC (permalink / raw)
  To: bloat

Jonathan Morton <chromatix99@gmail.com> wrote:

> Okay, so you can fudge the write pointer to do a different kind of tail
> drop. That's great for dealing with a physically full buffer on packet
> arrival.
>
> What you need to do for head drop is the same kind of pointer manipulation
> at the read end. You only do it between IP packets,

Yup, that's what I was thinking - give the drain side logic the
ability to skip packet N+1 when it's done transmitting the cells of
packet N, and start transmitting the cells of packet N+2.

> so you may want to make
> the decision of whether to do so in advance, so as to avoid having to
> insert an idle cell while that decision is made.

Of course, my whole point in moving from a CPU-mediated implementation
to an all-FPGA one is to eliminate the possibility of bandwidth
underutilization or unnecessary latency because the L2 converter logic
is "thinking about it".

Having the head drop decision made while the last cell of the previous
in-progress packet is being transmitted seems like a reasonable
approach to me.  IOW, as we transmit the last cell of packet N, decide
whether we'll do packet N+1 or N+2 next.

> Half a second at bottleneck rate is plenty. At the higher link rate the
> buffer will fill more gradually, so you can still absorb a half second
> burst in practice, if your inbound link rate is 2Mbps.

OK, so it sounds like we'll be in a good shape with an EP3C5 or EP3C10
FPGA.  (Those are the lowest-end Cyclone III parts with 46 KiB of RAM
inside.)

> You should definitely think ahead to how you will implement head drop
> eventually. Changing that part of the system later might be more painful
> than you anticipate.

Well, I do for sure need to plan the FPGA logic and RAM resource usage
with this subsequent addition in mind.  The main thing I'll need to
add to my logic in order to implement this head drop mechanism is a
way for the drain side logic to know where the packet boundaries are.

In the simplistic HDLC->SDSL logic function I have in mind right now,
the logic on the drain side of the cell ring buffer literally won't
need to know what a packet is: it will only see cells queued up for
transmission.  But this won't be good enough for head drop: the latter
has to happen on the drain side, and because we need to drop whole
packets, not partial, there will need to be additional communication
from the packet-aware fill side logic.

I'm thinking of implementing an additional ring buffer, with the same
fill and drain sides, whose pointers would increment once per packet,
rather than once per cell as the main buffer.  The data elements
stored in this auxiliary ring buffer would be packet start pointers
into the main cell ring buffer, written by the fill logic.  The drain
logic, when running normally (not dropping any packets) would simply
need to increment its read pointer for the auxiliary buffer whenever
it transmits a cell with the "terminal cell" bit set.  But when
implementing head drop, the pointers read from this auxiliary buffer
can be used to know how to increment the main (cell) read pointer to
skip a whole packet.
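
As a quick software sketch of that fill/drain interplay (hypothetical
names; the real thing will be Verilog):

class AuxRing:
    # One entry per packet: where in the cell ring that packet starts.
    def __init__(self, size):
        self.starts = [0] * size
        self.wr = self.rd = 0     # advance once per packet, not per cell

    def fill_side_packet_start(self, cell_addr):
        self.starts[self.wr % len(self.starts)] = cell_addr
        self.wr += 1

    def drain_side_next(self, skip=False):
        # Called as the terminal cell of the current packet goes out.
        # skip=True is the head drop: jump over packet N+1 and return
        # the cell-ring read pointer for the start of packet N+2.
        self.rd += 2 if skip else 1
        return self.starts[self.rd % len(self.starts)]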

> Aside from that, starting with a simple tail drop queue will validate the
> rest of the system and give you a point of comparison.

That's what I had in mind.  The "validate the rest of the system" part
is quite critical - there will be quite a bit of complex logic to
debug and make work...

> The next step might
> be to test the head drop mechanism using a trivial rule such as dropping
> alternate packets if the queue is more than half full.

Good idea.

> It's a valid model for decision making, since you make decisions at
> discrete points in time corresponding to packet boundaries. If you come up
> with some specific objections, feel free to bring them up here.

No objections, I just haven't fully internalized this stuff yet.  But
it looks like my basic logic design will be OK, so we can revisit the
implementation of the actual AQM policy after we do all the other
legwork: getting the basic logic design to compile into an FPGA image,
building the physical board, bringing up all the hardware functionality
of that board which isn't relevant to this discussion, and getting the
L2 converter logic function to work at the basic level.

Albert Rafetseder <albert.rafetseder@univie.ac.at> wrote:

> I.e., you rule out ATM cell loss and reordering! :-)

Yup.

> I'm suggesting a *ring* buffer.

Of course; all of the hardware buffers I've discussed in this thread
are ring buffers.  I don't know of any other way to implement a
continuous fill-drain buffer in hardware.

> Let me try to sketch what I mean (view in your favorite fixed-space
> font).

My xterm window always uses the same fixed-width font. :-)

> ^r is the read pointer (output to SDSL, head of the queue), ^w is
> the write pointer (input from HDLC, tail). Pointers can only go
> right, but all addressing is done modulo the buffer size, so you
> start over on the left side if you were ever to point past the
> buffer.

Yes, I know how ring buffers work. :-)

> Currently, the read pointer processes "2", whereas the write pointer
> will overwrite "A" next.
>
> 0123456789ABCDEF
>   ^r      ^w
>
> Now the read pointer has progressed a bit, but the write pointer was
> much faster:
>
> xyz3456789rstuvw
>    ^w ^r

Yup, you are describing my design.

> Obviously, ^w must not overtake ^r.

Yes, exactly.

> The trick is now to push the read pointer forward a bit instead of
> holding back the write pointer.

Can't do that: the ^r pointer will normally point at some cell in the
middle of a packet being transmitted.  Jerking that pointer forward
will cause us to transmit garbage consisting of partial packets, which
is the last thing we want.

Jonathan's idea of implementing head drop on the drain (read) side
instead of the fill (write) side makes a lot more sense to me.  That
head drop mechanism would be controlled by the AQM policy, which is a
kind of optional add-on.  But we still have to have tail drop (or my
"modified tail drop") as the backup packet drop mechanism that
protects the integrity of the ring buffer.  I guess the trick is to
design the AQM policy such that it does most of the job and the
out-of-buffer packet drop mechanism is rarely exercised.

> The sketch above applies to both the list of addresses of cell
> starts and the actual cell buffer. (You could also implement both
> ring buffers as a single one, as a linked list: Store the address of
> the start cell of the next packet in front of the cells of the
> current packet. To head-drop packets, dereference the pointer
> multiple times. ASCII diagrams on request :-)

See my reply to Jonathan's comments above.  Does that clarify how I
envision the buffers being structured?

SF


* Re: [Bloat] Designer of a new HW gadget wishes to avoid bufferbloat
From: Albert Rafetseder @ 2012-10-26 21:45 UTC (permalink / raw)
  To: bloat; +Cc: opensdsl

Am 26.10.2012 um 20:54 schrieb Michael Spacefalcon:

> Albert Rafetseder <albert.rafetseder+bufferbloat@univie.ac.at> wrote:
> 
>> You must also rule out ATM cell loss and reordering. Otherwise,
>> there is too little data in your receive buffer to reassemble the
>> transmitted frame (temporarily with reordering, terminally with
>> loss). This calls for a timeout of sorts.
> 
> (...)
> The issue of reordering is relevant only when there are 2 or more VCs
> on the same physical circuit, and I have yet to encounter an xDSL
> circuit of any flavor that is set up that way.
> (...)

> Cell loss: it can't be detected directly, and it manifests itself in
> the reassembled AAL5 packets appearing to be corrupt (bad CRC-32 at
> the end).  If the first cell or some middle cell of a packet gets
> lost, only that one packet is lost as a result.  If the last cell of a
> packet gets lost, the overall IP-over-ATM link loses two packets: the
> one whose last cell got lost, and the next one, which appears to be a
> continuation of the previous one.  So that's another way in which ATM
> on an xDSL circuit is worse than Ethernet or HDLC.

I.e., you rule out ATM cell loss and reordering! :-)

>> Suppose you have a list of start-of-packet addresses within your
>> cell transmit ring buffer. Drop the first list element. Head drop!
>> Done! Once the current packet's cells are transmitted, you jump to
>> the next start of packet in the list, wherever it is. As long as
>> you ensure that packets you receive on the HDLC side don't
>> overwrite the cells you are currently transmitting, you are fine.
>> (If the buffer were really small, this would probably mean lots of
>> head drop.)
> 
> Let me try to understand what you are suggesting.  It seems to me that
> you are suggesting having two separate elastic buffers between the
> fill side and the drain side: one storing cells, the other storing
> one-element-per-packet information (such as a buffer start address).
> You implement head drop by dropping the head element from the buffer
> that reckons in packets, rather than cells.  But the cells are still
> there, still taking up memory in the cell buffer, and that cell
> storage memory can't be used to buffer up a new ingress packet because
> that storage sits in the middle between cells of a packet in the
> middle of transmission and the next packet which is still "officially"
> in the queue.

I'm suggesting a *ring* buffer. Let me try to sketch what I mean (view in your favorite fixed-space font). ^r is the read pointer (output to SDSL, head of the queue), ^w is the write pointer (input from HDLC, tail). Pointers can only go right, but all addressing is done modulo the buffer size, so you start over on the left side if you were ever to point past the buffer.

Currently, the read pointer processes "2", whereas the write pointer will overwrite "A" next. 

0123456789ABCDEF
  ^r      ^w

Now the read pointer has progressed a bit, but the write pointer was much faster:

xyz3456789rstuvw
   ^w ^r

Obviously, ^w must not overtake ^r. The trick is now to push the read pointer forward a bit instead of holding back the write pointer. Push it how far? Far enough to buy us some headroom, e.g., based on how likely you estimate worst-case packets are to accumulate. Finally, we will have lost buffer slots (marked with asterisks) from the head of the queue, but were able to save all of the new incoming data, which means no tail drop:

xyz!#$6*****tuvw
      ^w    ^r

> But I can see how perhaps the head drop mechanism you are suggesting
> could be used as a secondary AQM-style packet dropper - but it can't
> be the primary mechanism which ensures that the fill logic doesn't
> overwrite the cell buffer RAM which the drain side is still reading
> from.  I was talking about the latter in my original post when I said
> that I couldn't do anything other than tail drop.

The sketch above applies to both the list of addresses of cell starts and the actual cell buffer. (You could also implement both ring buffers as a single one, as a linked list: Store the address of the start cell of the next packet in front of the cells of the current packet. To head-drop packets, dereference the pointer multiple times. ASCII diagrams on request :-)

> 3. The CoDel (and bufferbloat/AQM) folks in general seem to view the
>   arrival and departure of packets as atomic instantaneous events.
>   While the latter may be true for a software router which receives
>   packets from or hands them over to a hardware Ethernet interface in
>   an instantaneous atomic manner, that model absolutely does not hold
>   for the underlying HW device itself as it reformats a stream of
>   packets from HDLC to SDSL/ATM in a flow-through fashion, without
>   waiting for each packet to be received in its entirety.

That's an interesting notion indeed. Still, even for sending cells from packets that haven't fully gone into the ring buffer from the HDLC side, I think that the approach described would work. The closer the write and read pointers get, the more headaches, though.



* Re: [Bloat] Designer of a new HW gadget wishes to avoid bufferbloat
From: Jonathan Morton @ 2012-10-26 19:56 UTC (permalink / raw)
  To: Michael Spacefalcon; +Cc: opensdsl, bloat


> > I think there are many good arguments for head-drop, and while I'm
> > not a hardware engineer, I don't see why it would be terribly
> > difficult to implement. Suppose you have a list of start-of-packet
> > addresses within your cell transmit ring buffer. Drop the first
> > list element. Head drop! Done! Once the current packet's cells are
> > transmitted, you jump to the next start of packet in the list,
> > wherever it is. As long as you ensure that packets you receive on
> > the HDLC side don't overwrite the cells you are currently
> > transmitting, you are fine. (If the buffer were really small, this
> > would probably mean lots of head drop.)
>
> Let me try to understand what you are suggesting.  It seems to me that
> you are suggesting having two separate elastic buffers between the
> fill side and the drain side: one storing cells, the other storing
> one-element-per-packet information (such as a buffer start address).
> You implement head drop by dropping the head element from the buffer
> that reckons in packets, rather than cells.  But the cells are still
> there, still taking up memory in the cell buffer, and that cell
> storage memory can't be used to buffer up a new ingress packet because
> that storage sits in the middle between cells of a packet in the
> middle of transmission and the next packet which is still "officially"
> in the queue.
>
> But I can see how perhaps the head drop mechanism you are suggesting
> could be used as a secondary AQM-style packet dropper - but it can't
> be the primary mechanism which ensures that the fill logic doesn't
> overwrite the cell buffer RAM which the drain side is still reading
> from.  I was talking about the latter in my original post when I said
> that I couldn't do anything other than tail drop.
>
> But I've come up with an enhancement to the tail drop which we could
> call "modified tail drop" - is there a better, more widely accepted
> term perhaps?  Here's what I mean: when the fill logic sees the
> beginning of a new packet on the HDLC side, it checks the buffer fill
> level.  If the limit has been reached, instead of simply proceeding to
> skip/ignore the new HDLC ingress packet, drop the packet which has
> been most recently added to the cell buffer (reset the write pointer
> back to a saved position), and proceed to buffer up the new packet
> normally.

Okay, so you can fudge the write pointer to do a different kind of tail
drop. That's great for dealing with a physically full buffer on packet
arrival.

What you need to do for head drop is the same kind of pointer manipulation
at the read end. You only do it between IP packets, so you may want to make
the decision of whether to do so in advance, so as to avoid having to
insert an idle cell while that decision is made.

> What draws me to the Cyclone III family is that even the lowest-end
> parts feature 46 RAM blocks of 9 Kibits each (1 Kibyte + parity),
> i.e., up to 46 KiB of RAM capacity.  That's huge for a low-cost FPGA.
> If I allocate 26 of those M9K blocks for the HDLC->SDSL cell buffer
> (24 blocks for the cell payload buffer and 2 blocks for the parallel
> buffer storing cell header data for the SDSL Tx logic), we can buffer
> up to 511 ATM cells.  (To be convenient, the buffer size in cells must
> be a power of 2 minus 1.)  That corresponds to ~575 ms at 384 kbps or
> ~144 ms at 1536 kbps.  Do you think that would be enough, or do you
> think we need even more RAM for the "good queue" function?

Half a second at bottleneck rate is plenty. At the higher link rate the
buffer will fill more gradually, so you can still absorb a half second
burst in practice, if your inbound link rate is 2Mbps.

> Thus it seems to me that a sensible way to proceed would be to build
> my hardware first, using the simplistic buffer limit mechanism I have
> described in my original post, and then try to add CoDel later, once
> the basic functionality has been brought up.  Or do you think this
> approach would be problematic?

You should definitely think ahead to how you will implement head drop
eventually. Changing that part of the system later might be more painful
than you anticipate.

Aside from that, starting with a simple tail drop queue will validate the
rest of the system and give you a point of comparison. The next step might
be to test the head drop mechanism using a trivial rule such as dropping
alternate packets if the queue is more than half full.

> 2. I work on a different bandwidth scale.  They talk a lot about
>    packets crossing from 1 Gbps or 100 Mbps to 10 Mbps, etc, but the
>    highest possible SDSL kbps rate is 2320, and even that is only on
>    paper: the line cards which are actually deployed USA-wide by what
>    used to be Covad max out at 1536 kbps.  We still have a bandwidth-
>    reduction bottleneck as the V.35 Tx link (router->DSU) has a higher
>    capacity than SDSL/ATM, but it isn't *that* much higher bandwidth.

Lower bandwidth links are still very important in the consumer space.
Sitting on a commuter train in Helsinki, I still regularly get GPRS links
instead of 3G in certain specific places. Even 3G is more typically at your
SDSL bandwidths than at the widely advertised HSPA rates.

> 3. The CoDel (and bufferbloat/AQM) folks in general seem to view the
>    arrival and departure of packets as atomic instantaneous events.
>    While the latter may be true for a software router which receives
>    packets from or hands them over to a hardware Ethernet interface in
>    an instantaneous atomic manner, that model absolutely does not hold
>    for the underlying HW device itself as it reformats a stream of
>    packets from HDLC to SDSL/ATM in a flow-through fashion, without
>    waiting for each packet to be received in its entirety.

It's a valid model for decision making, since you make decisions at
discrete points in time corresponding to packet boundaries. If you come up
with some specific objections, feel free to bring them up here.

- Jonathan Morton



* Re: [Bloat] Designer of a new HW gadget wishes to avoid bufferbloat
From: Michael Spacefalcon @ 2012-10-26 18:54 UTC (permalink / raw)
  To: bloat; +Cc: opensdsl

Albert Rafetseder <albert.rafetseder+bufferbloat@univie.ac.at> wrote:

> Hi Michael,
> Not that I'm an expert by any means on the topics you touch, but
> I'll share my point of view on some of the questions raised.

Yay, someone was interested enough to respond!

> Please excuse my aggressive shortening of your original post.

No problem. :-)

> > http://ifctfvax.Harhan.ORG/OpenWAN/OSDCU/
> The site appears down,

It was indeed down for about 31 h, from about 2012-10-21T09:11 to
about 2012-10-22T16:28 GMT.  It is back up now, but on a *very* slow
connection - the 384 kbps pipe it is supposed to be on is still down,
and I've had to activate the emergency dial backup mechanism.  The
latter is a 31200 bps analog modem connection.

(Unfortunately, my switchover mechanism is manual and an incredible
 pita, which tends to extend the downtime.)

> but from your description I think I understand what you are building.

OK. :-)

>> You must also rule out ATM cell loss and reordering. Otherwise,
>> there is too little data in your receive buffer to reassemble the
>> transmitted frame (temporarily with reordering, terminally with
>> loss). This calls for a timeout of sorts.

Ahh, I guess I need to clarify something.  Basically, there are two
kinds of ATM users:

1. People who use ATM because they actually like it, and extoll its
   supposed virtue of allowing the cell streams on different VCs to be
   interleaved, such that a cell on one VC can be sent in the middle
   of a long packet on another VC.  Of course these are the people who
   gave us this ATM abomination in the first place, but I don't know
   if any of them are still alive and still uphold those beliefs.

2. People who use ATM because that's what comes down the pipe from the
   service provider, hence they have to deal with it whether they like
   it or not, usually the latter.  I strongly suspect that the vast
   majority of ATM users today are in this category.  The service
   providers themselves (*cough* Covad *cough*) might be in this boat
   as well, with the like-it-or-not imposed condition being their vast
   sunk-cost investment in their nationwide access network, which is
   100% ATM.

The issue of reordering is relevant only when there are 2 or more VCs
on the same physical circuit, and I have yet to encounter an xDSL
circuit of any flavor that is set up that way.  OK, it's very likely
that the people from category 1 above had their circuits set up that
way, but once again, I don't know if any of those people are still
alive.

All xDSL/ATM circuits I've ever used, worked on, or encountered in any
other way belong to category 2 ATM users, and have only one VC at some
fixed VPI/VCI (0/38 for Covad SDSL).  When there is only one PVC on
the underlying physical circuit, there is no possibility of cell
reordering or any other supposed benefits of ATM: every packet is sent
as a series of cells in order from start to finish, in exactly the
same order in which it would have been transmitted over an Ethernet or
HDLC (Frame Relay) medium, and the ATM cell structure does absolutely
nothing except wasting the circuit's bit bandwidth.

The basic/default version of the FPGA logic function on my BlitzDSU
will support the conversion of SDSL/ATM to HDLC for the dominant
configuration of only one PVC.  If anyone ever wishes to use the
device on an SDSL/ATM circuit with 2 or more active VCs, well, it's an
FPGA and not an ASIC: we can just load a different logic function for
those special-case users.

Cell loss: it can't be detected directly, and it manifests itself in
the reassembled AAL5 packets appearing to be corrupt (bad CRC-32 at
the end).  If the first cell or some middle cell of a packet gets
lost, only that one packet is lost as a result.  If the last cell of a
packet gets lost, the overall IP-over-ATM link loses two packets: the
one whose last cell got lost, and the next one, which appears to be a
continuation of the previous one.  So that's another way in which ATM
on an xDSL circuit is worse than Ethernet or HDLC.

Oversize packets: the design of my Layer 2 converter thoroughly
assumes that the maximum size of a valid packet in ATM cells is a
static configuration parameter fixed at provisioning time.  In my
original post I've called this configuration parameter M, and for
classic IPv4 packets of up to 1500 octets in one of the standard
encapsulations M=32.  This limit on the number of ATM cells per AAL5
packet will be strictly enforced by the L2 converter.  As ATM cells
arrive from the SDSL side, they'll be stored in the SDSL->HDLC buffer,
and a count register will be incremented.  If that count register
reaches M on a cell whose "terminal cell" header bit isn't set
(meaning that more cells are coming), the packet will be declared
invalid for the reason of being oversize, an error stats counter will
be incremented, and the buffer write pointer will be reset, discarding
the previously accepted cells of that oversize packet.
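
In software terms the per-cell check amounts to this (a sketch with
invented names; the Verilog keeps the same state in registers, and the
actual cell payload writes are omitted):

st = {'count': 0, 'wr': 0, 'pkt_start': 0, 'oversize_errors': 0}

def on_rx_cell(st, terminal, M=32):
    st['count'] += 1
    if st['count'] == M and not terminal:   # more cells still coming
        st['count'] = 0
        st['wr'] = st['pkt_start']          # rewind: discard buffered cells
        st['oversize_errors'] += 1
        return 'oversize'
    if terminal:
        st['count'] = 0
        st['pkt_start'] = st['wr']          # next packet reassembles here
        return 'end-of-packet'
    return 'stored'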

An oversize packet can appear for one of two reasons: either
misconfiguration on the other end (perhaps miscommunication between
the user and the service provider as to what the MTU is or should be),
or more likely, the result of cell loss "merging" two valid packets
into one bogon packet.

Reassembly timeout: yes, it would be a good measure, as it might
reduce the needless packet loss resulting from cell loss-induced
"merging".  If the last cell of a packet gets lost, then the line goes
quiescent for a while, then another packet begins, the reassembly
timeout might prevent the second good packet from being interpreted as
a continuation of the corrupt one and therefore lost as well.  But it
won't help prevent exactly the same loss scenario if the second good
packet follows immediately (or shortly) after the corrupt one on a
busy pipe.

ATM sucks.  Majorly.  The DSLAM vendors could have ameliorated the
problem somewhat by keeping ATM for the backhaul links behind the
DSLAMs, but converting to HDLC (doing FRF.8) before sending the bits
to the individual subscribers.  Copper Mountain did it, but the 3 CM
DSL network operators I know of (NorthPoint, Rhythms, DSL.net) have
all bitten the dust now (in the listed order).  All we are left with is
what used to be Covad, now MegaPath.  Nokia D50 DSLAMs, ATM cells all
the way to the CPE.

Oh, and for all the "regular" DSL users out there: ADSL, if I'm not
mistaken, specifies ATM cells as part of the standard, so if you have
regular ADSL, no matter which ISP or DSLAM brand, you are getting
exactly the same nonsense as I get from Covad/Nokia SDSL: a single PVC
carrying your IP packets as sequences of ATM cells, with all the same
issues.

> My colleague tells me his Linux boxes (3.5/x64, 2.6.32/x86) have an
> ip.ipfrag_time of 30 seconds. Anyway, it's lots of cells to buffer,
> I suppose.

In my design the AAL5 reassembly timeout does not affect the size of
the SDSL->HDLC buffer in any way.  There is only one PVC, cells get
written into the buffer as they arrive, at whatever time that happens,
and the buffer is allowed to empty out onto the HDLC interface when a
cell arrives with the "terminal cell" header bit set, and the CRC-32
check at the end of that cell passes.  This design will work with a
finite buffer (only M+1 cells of buffer space needed to guarantee zero
possibility of out-of-buffer packet loss) even if the reassembly
timeout is set to infinity: if the reassembly timer is ticking, that
means we are getting idle cells from the line in a middle of a packet
reassembly, and those don't need to be stored.

Latency considerations: I've entertained the possibility of sending
the bits to the HDLC side as they arrive from the SDSL side, without
waiting to buffer the whole packet, but that would require doing data-
dependent clock gating on the V.35 Rx side, and I find the idea of
data-dependent clock gating to be a little too evil for me.  So at
least for the initial version, I'll have my SDSL->HDLC logic function
start pushing a packet out on the HDLC side only when that packet has
been buffered up in its entirety.  The greatest added latency that any
packet may experience (from the moment the last bit of that packet has
arrived from the SDSL transceiver chip to the moment the attached V.35
router receives the last bit of the transformed packet) equals the
time it would take to transmit the largest allowed packet (M cells) on
the V.35 Rx link, at whatever bit rate that link has been configured
for.

Choice of bit rate for the V.35 Rx link: it will be a programmable
clock divider, totally independent of the SDSL physical layer bit
clock.  The guarantee of zero out-of-buffer packet loss under every
possible condition requires that the V.35 Rx link bit rate be set to
no less than SDSL_bit_rate*1.2*48/53 (a little under x1.09) for the
minimum buffer size of M+1, or no less than SDSL_bit_rate*1.2*8/9 (a
little under x1.07) for the minimum buffer size of M+8.  The 1.2
factor accounts for the possibility of HDLC's worst-case expansion (if
someone were to send big packets filled with FFs, or all 1s), and the
difference between x48/53 vs. x8/9 changing the minimum buffer size
requirement has to do with how the ATM cells are packed into the Nokia
SDSL frame structure.
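
Numerically (the 1.2 being the worst case of HDLC zero-insertion: 6
bits on the wire for every 5 bits of all-ones data):

def min_v35_rx_kbps(sdsl_kbps, deep_buffer=False):
    # Framing factor depends on the SDSL->HDLC buffer depth: 48/53 for
    # the minimal M+1 cells, 8/9 if we spend M+8 cells.
    factor = 8 / 9 if deep_buffer else 48 / 53
    return sdsl_kbps * 1.2 * factor

print(min_v35_rx_kbps(384))                    # ~417.4 kbps, x1.087
print(min_v35_rx_kbps(384, deep_buffer=True))  # ~409.6 kbps, x1.067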

But the above calculations show the minimum required bit rate for the
V.35 Rx link.  It can be set higher to reduce the latency.  Most V.35
routers have been built for use with T1/E1 circuits, and should have
no problem comfortably handling bit rates up to 2 Mbps or at least
1.5 Mbps.  My current OSDCU (the feeble CPU-mediated version of the
Layer 2 converter I'm planning to implement in FPGA logic) serves a
384 kbps circuit, but I have the V.35 Rx bit rate set to ~1562 kbps:
25 MHz clock divided by 16.

And if someone does want to try the data-dependent clock gating trick
(sending the bits to the HDLC side before the complete packet has been
received), well, it's an FPGA, not an ASIC - go ahead and try it!

[moving on to the HDLC->SDSL direction]

> Canceling packets halfway through transmission makes no sense.

Of course, I'm not suggesting doing that.  The only time that would
happen with my FPGA-based HDLC->SDSL logic function is if a corrupt
packet arrives on the V.35 port.  Because ATM allows idle cells in the
middle of a packet, we can send the first cell out on the SDSL side as
soon as we've received the first 48 octets of the packet payload from
the HDLC side, without waiting for the end of the packet, then apply
the same logic to the next cell, and so on.  But the HDLC FCS (CRC-16)
check happens at the end of the packet being received from the V.35 Tx
side.  What do we do if that CRC check fails, or we receive an HDLC
abort sequence (7 or more ones)?  If we've already started
transmitting that packet on the SDSL side, we can "cancel" it by
sending what's called an AAL5 abort: a cell with the "terminal cell"
header bit set, and 0000 in the 16-bit packet length field in the AAL5
trailer.  It won't recover the bandwidth wasted sending that invalid
packet, but it would tell the router on the other end to discard it,
rather than turn a corrupt packet into one claiming to be good.

The assumption is that the end user is able to ensure that such errors
won't happen in normal operation.  A V.35 cable connects two pieces of
equipment sitting right next to each other in the same room, totally
under the end user's control, so there should be no reason for errors
on that V.35 cable.

> I think there are many good arguments for head-drop, and while I'm
> not a hardware engineer, I don't see why it would be terribly
> difficult to implement. Suppose you have a list of start-of-packet
> addresses within your cell transmit ring buffer. Drop the first list
> element. Head drop! Done! Once the current packet's cells are
> transmitted, you jump to the next start of packet in the list,
> wherever it is. As long as you ensure that packets you receive on
> the HDLC side don't overwrite the cells you are currently
> transmitting, you are fine. (If the buffer were really small, this
> would probably mean lots of head drop.)

Let me try to understand what you are suggesting.  It seems to me that
you are suggesting having two separate elastic buffers between the
fill side and the drain side: one storing cells, the other storing
one-element-per-packet information (such as a buffer start address).
You implement head drop by dropping the head element from the buffer
that reckons in packets, rather than cells.  But the cells are still
there, still taking up memory in the cell buffer, and that cell
storage memory can't be used to buffer up a new ingress packet because
that storage sits in the middle between cells of a packet in the
middle of transmission and the next packet which is still "officially"
in the queue.

But I can see how perhaps the head drop mechanism you are suggesting
could be used as a secondary AQM-style packet dropper - but it can't
be the primary mechanism which ensures that the fill logic doesn't
overwrite the cell buffer RAM which the drain side is still reading
from.  I was talking about the latter in my original post when I said
that I couldn't do anything other than tail drop.

But I've come up with an enhancement to the tail drop which we could
call "modified tail drop" - is there a better, more widely accepted
term perhaps?  Here's what I mean: when the fill logic sees the
beginning of a new packet on the HDLC side, it checks the buffer fill
level.  If the limit has been reached, instead of simply proceeding to
skip/ignore the new HDLC ingress packet, drop the packet which has
been most recently added to the cell buffer (reset the write pointer
back to a saved position), and proceed to buffer up the new packet
normally.
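
As a C model on the same queue structure as in the sketch above
(again illustrative, not the Verilog):

/* "Modified tail drop".  Called when the fill logic sees the first
 * cell of a new HDLC ingress packet; fill_limit is the configured
 * cell budget for the queue. */
static void accept_new_packet(struct hdlc2sdsl_q *q, uint16_t fill_limit)
{
    uint16_t used = (q->wr - q->rd) & (NCELLS - 1);

    if (used >= fill_limit) {
        /* Plain tail drop would just skip the new packet.  Instead,
         * unwind the most recently buffered one (care is needed so
         * this is never the packet the drain side is transmitting). */
        q->pkt_tail = (q->pkt_tail - 1u) & (NPKTS - 1);
        q->wr = q->pkt_start[q->pkt_tail];
    }
    /* ...and buffer the new packet in the space just reclaimed. */
    q->pkt_start[q->pkt_tail] = q->wr;
    q->pkt_tail = (q->pkt_tail + 1u) & (NPKTS - 1);
}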

> As regards values for the maximum latency L, this might be too
> restrictive a way of thinking about queue lengths.  Usually, you'd
> like to accommodate short bursts of packets if you are sure you can
> get the packets out in reasonable time, but don't want a long
> "standing queue" to develop.  You might have read over at ACM Queue
> [1] that the CoDel queue management algorithm starts dropping
> packets if the time packets are enqueued exceeds a threshold.  This
> doesn't mean that the queue length has a hard upper limit, though

In a hardware implementation, every queue length has a hard upper
limit - the physical size of the RAM block allocated for that queue.
I suppose that a software implementation can have "infinite" queues,
dynamically adding all the available gigabytes of system RAM to them,
but in hardware-based solutions, the physical RAM resources are almost
always hard-allocated to specific functions at design time.

Internal RAM in an FPGA has a much higher cost per bit than ordinary
computer RAM.  While PC memories are now measured in gigabytes, the
RAM blocks in low-cost FPGAs (Altera Cyclone or Xilinx Spartan) still
measure in kilobytes.

The main purpose of my original inquiry was to gauge what FPGA part I
should select for my DSU board design, in terms of the available
buffer RAM capacity.  Right now I'm leaning toward the Cyclone III
family from Altera.  Because I have a very low I/O pin count (only a
handful of serial interfaces leaving the FPGA), and I want the FPGA to
be physically small, inexpensive, and assembly-friendly (i.e., no
BGAs), I am only looking at the lowest-end parts of each family in the
TQFP-144 packages.  (144 pins is way overkill for what I need, but
that's the smallest package available in the FPGA families of interest
to me.)

What draws me to the Cyclone III family is that even the lowest-end
parts feature 46 RAM blocks of 9 Kibits each (1 Kibyte + parity),
i.e., up to 46 KiB of RAM capacity.  That's huge for a low-cost FPGA.
If I allocate 26 of those M9K blocks for the HDLC->SDSL cell buffer
(24 blocks for the cell payload buffer and 2 blocks for the parallel
buffer storing cell header data for the SDSL Tx logic), we can buffer
up to 511 ATM cells.  (For implementation convenience, the buffer size
in cells needs to be a power of 2 minus 1.)  That corresponds to ~575 ms
at 384 kbps or ~144 ms at 1536 kbps.  Do you think that would be enough,
or do you
think we need even more RAM for the "good queue" function?
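
For reference, the arithmetic behind those latency numbers is just
cells x 54 octets x 8 bits / line rate; a quick C sanity check:

#include <stdio.h>

int main(void)
{
    /* 511 cells = the Cyclone III plan above; 255 and 127 are the
     * Cyclone II options mentioned below.  Each buffered cell costs
     * 48 payload + 6 overhead = 54 octets of SDSL wire time. */
    const int    sizes[] = { 511, 255, 127 };
    const double rates[] = { 384e3, 1536e3 };   /* line rates, bps */

    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 2; j++)
            printf("%3d cells at %4.0f kbps: %5.1f ms\n",
                   sizes[i], rates[j] / 1e3,
                   sizes[i] * 54 * 8 / rates[j] * 1e3);
    return 0;
}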

I have not yet gone through the hassle of installing a new version of
Quartus (Altera's FPGA compiler) that supports Cyclone III - the
version I'm using right now only supports up to Cyclone II, and that's
what I'm using for my preliminary Verilog development.  With
Cyclone II the RAM blocks are only 4096 bits each, and there are only
36 of them in the EP2C8 device (the biggest Cyclone II available in
TQFP-144), so that's only 18 KiB of internal RAM.  Taking up 13 of
those M4K blocks for the HDLC->SDSL function would give us 127 cells;
26 blocks would give us 255 cells.

In terms of the cost of the actual parts (in hobbyist single-piece
quantities from Digi-Key), there is no significant difference between
Cyclone II and Cyclone III.  As the extra RAM and logic capacity can't
hurt (we don't have to use them if we don't need/want them), I'm
hoping to be able to use Cyclone III, unless I run into an obstacle
somewhere.

> -- it will try to keep the queue short (as dropping packets signals
> "back off!" to the end hosts communicating across the link) with
> some elasticity towards longer enqueueing times.  Maybe you'd like
> to be the pioneer acid-testing CoDel's implementability in silicon.
> [1] http://queue.acm.org/detail.cfm?id=2209336

In that paper they say: "If the buffer is full when a packet arrives,
then the packet can be dropped as usual."  So it seems that the
mechanism that prevents system malfunction as a result of the writer
overrunning the reader is still essentially tail drop (or my "modified
tail drop"), whereas CoDel serves as an add-on that artificially drops
some of the queued-up packets.

Thus it seems to me that a sensible way to proceed would be to build
my hardware first, using the simplistic buffer limit mechanism I have
described in my original post, and then try to add CoDel later, once
the basic functionality has been brought up.  Or do you think this
approach would be problematic?
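
If and when I get to that stage, my loose reading of the control law
in that paper, recast in 54-octet cell times and with the drop spacing
simplified to a constant (the paper uses interval/sqrt(count), which
in an FPGA would likely become a small lookup table), would be
something like this C model.  All names are mine, and it assumes the
fill logic stamps each packet with a free-running cell-time counter at
enqueue:

#include <stdbool.h>
#include <stdint.h>

#define TARGET     4   /* ~5 ms at 384 kbps (1 cell time = 1.125 ms) */
#define INTERVAL  89   /* ~100 ms in the same units                  */

static uint32_t first_above;   /* when sojourn first exceeded TARGET */
static uint32_t drop_next;     /* earliest time of the next drop     */
static bool     dropping;

/* now / enq_time in cell times; true = artificially drop the packet
 * about to be dequeued. */
static bool codel_should_drop(uint32_t now, uint32_t enq_time)
{
    if (now - enq_time < TARGET) {     /* queue draining fine: reset */
        first_above = 0;
        dropping = false;
        return false;
    }
    if (first_above == 0) {            /* start the grace period     */
        first_above = now + INTERVAL;
        return false;
    }
    if (now < first_above)             /* above TARGET, but not yet  */
        return false;                  /* for a whole INTERVAL       */
    if (!dropping || now >= drop_next) {
        dropping = true;               /* paper: next drop comes at  */
        drop_next = now + INTERVAL;    /* INTERVAL/sqrt(count); kept */
        return true;                   /* constant here for brevity  */
    }
    return false;
}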

> Does any of this make sense to you?

Yes, it does.  But my application is quite different from what the
CoDel folks had in mind:

1. They have put a lot of emphasis into variable-bandwidth links that
   may be subject to degradation.  In the case of SDSL the bandwidth
   is fixed by the user's budget, and all SDSL circuits I've ever used
   have been 100% error-free at the physical layer.  (That doesn't
   mean they don't go down, though - but every single time my IP-over-
   SDSL circuit has crapped out, it has always been a high-level
   problem on the ISP's side, never a physical layer failure or
   degradation.  My current downtime is no exception.)

2. I work on a different bandwidth scale.  They talk a lot about
   packets crossing from 1 Gbps or 100 Mbps to 10 Mbps, etc., but the
   highest possible SDSL kbps rate is 2320, and even that is only on
   paper: the line cards which are actually deployed USA-wide by what
   used to be Covad max out at 1536 kbps.  We still have a bandwidth-
   reduction bottleneck as the V.35 Tx link (router->DSU) has a higher
   capacity than SDSL/ATM, but it isn't *that* much higher bandwidth.

3. The CoDel (and bufferbloat/AQM) folks in general seem to view the
   arrival and departure of packets as atomic instantaneous events.
   While that may be true for a software router which receives
   packets from or hands them over to a hardware Ethernet interface in
   an instantaneous atomic manner, that model absolutely does not hold
   for the underlying HW device itself as it reformats a stream of
   packets from HDLC to SDSL/ATM in a flow-through fashion, without
   waiting for each packet to be received in its entirety.

So I think I'll need to do some investigation and experimentation of
my own to apply these ideas to my peculiar circumstances.  If I can
use an FPGA part with plenty of RAM and logic capacity to spare after
I've implemented the basic minimum, then that should facilitate such
later investigation and experimentation.

SF
