General list for discussing Bufferbloat
* [Bloat] Fwd: 400G forwarding - how does it work?
       [not found] <CAAWx_pX3fHc96N8Miti6Tuk9Km3YLUqAqZxDu_M5WJw2NErwHA@mail.gmail.com>
@ 2022-07-25 13:12 ` Dave Taht
  2022-07-25 14:58   ` Simon Leinen
       [not found] ` <20220725131600.GC30425@cmadams.net>
  1 sibling, 1 reply; 3+ messages in thread
From: Dave Taht @ 2022-07-25 13:12 UTC (permalink / raw)
  To: bloat

I'd like to understand more deeply, too.

---------- Forwarded message ---------
From: James Bensley <jwbensley+nanog@gmail.com>
Date: Mon, Jul 25, 2022 at 5:55 AM
Subject: 400G forwarding - how does it work?
To: NANOG <nanog@nanog.org>


Hi All,

I've been trying to understand how forwarding at 400G is possible,
specifically in this example in relation to the Broadcom J2 chips, but I
don't think the mystery is anything specific to them...

According to the Broadcom Jericho2 BCM88690 data sheet it provides
4.8Tbps of traffic processing and supports packet forwarding at 2Bpps.
According to my maths, that means it requires an average packet size of
300B to reach line rate across all ports. The data sheet says packet sizes
above 284B, so I guess this excludes some per-frame overhead like the
inter-frame gap and CRC (nothing after the PHY/MAC needs to know about
them if the CRC is valid)? As I interpret the data sheet, the J2 should
then support a chassis with 12x 400Gbps ports at line rate with 284B
packets.
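
Sketching that maths in Python (only the 4.8Tbps and 2Bpps data-sheet
figures are used; the gap down to 284B is presumably the excluded
overhead, as guessed above):

# Back-of-the-envelope: what packet size does 2 Bpps at 4.8 Tbps imply?
BPS = 4.8e12          # traffic processing capacity, bits per second
PPS = 2e9             # packet forwarding rate, packets per second

min_packet_bytes = BPS / PPS / 8
print(min_packet_bytes)   # 300.0 -> line rate needs ~300B packets
# The data sheet's 284B figure presumably excludes some per-frame
# overhead; exactly which fields are excluded is a guess.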

Jericho2 can be linked to a BCM16K for expanded packet forwarding
tables and lookup processing (i.e. to hold the full global routing
table, in such a case, forwarding lookups are offloaded to the
BCM16K). The BCM16K documentation suggests that it uses TCAM for exact
matching (e.g.,for ACLs) in something called the "Database Array"
(with 2M 40b entries?), and SRAM for LPM (e.g., IP lookups) in
something called the "User Data Array" (with 16M 32b entries?).
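
Purely as a software analogy (this is not how the BCM16K arrays are
actually organised, and the prefixes/next-hops below are made up), exact
match is a lookup on a fixed key while LPM has to find the most specific
covering prefix:

import ipaddress

# Exact match (TCAM-style): the whole key must match exactly.
exact_acl = {("192.0.2.1", "198.51.100.7"): "deny"}
print(exact_acl.get(("192.0.2.1", "198.51.100.7"), "permit"))   # deny

# LPM (what an IP forwarding lookup needs): most specific prefix wins.
lpm_table = {
    ipaddress.ip_network("0.0.0.0/0"): "default",
    ipaddress.ip_network("198.51.100.0/24"): "port-3",
    ipaddress.ip_network("198.51.100.128/25"): "port-7",
}

def lpm_lookup(dst: str) -> str:
    """Return the next hop for the longest prefix covering dst."""
    for prefix_len in range(32, -1, -1):        # longest prefix first;
        net = ipaddress.ip_network(f"{dst}/{prefix_len}", strict=False)
        if net in lpm_table:                    # hardware does this in parallel
            return lpm_table[net]
    raise LookupError("no route")

print(lpm_lookup("198.51.100.200"))             # port-7 (the /25 beats the /24)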

A BCM16K supports 16 parallel searches, which means that each of the
12x 400G ports on a Jericho2 could perform a forwarding lookup at the
same time. This means that the BCM16K "only" needs to perform
forwarding look-ups at a linear rate of 1x 400Gbps, not 4.8Tbps, and
"only" for packets larger than 284 bytes, because that is the packet size
at which the Jericho2 reaches its line-rate pps. This means that each of
the 16 parallel searches in the BCM16K needs to support a rate of 164Mpps
(164,473,684 pps) to keep up with a single 400Gbps port. This is much more
in the realm of the feasible, but still pretty extreme...
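
That per-port figure falls out of the numbers above if you assume 284B
frames plus 20B of preamble and inter-frame gap on the wire (the 20B
overhead is my assumption, chosen because it reproduces the figure):

LINE_RATE_BPS = 400e9
WIRE_BYTES    = 284 + 20        # 284B frame + assumed 20B preamble/IFG

pps = LINE_RATE_BPS / (WIRE_BYTES * 8)
print(f"{pps:,.0f} packets/s per 400G port")    # 164,473,684
print(f"{1e9 / pps:.2f} ns per packet")         # 6.08 ns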

1 second / 164,473,684 packets = 1 packet every 6.08 nanoseconds, which
is within the access time of TCAM and SRAM, but this needs to include
some computing time too, e.g. generating a key for a lookup and passing
the results along the pipeline etc. The BCM16K has a clock speed of
1GHz (1,000,000,000 cycles per second, i.e. one cycle every nanosecond)
and supports an SRAM memory access in a single clock cycle (according
to the data sheet). If one cycle is required for an SRAM lookup, the
BCM16K only has 5 cycles left to perform other computation tasks, and the
J2 chip needs to do the various header re-writes and counter updates
etc., so how is this magic happening?!?
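
Spelling out that naive, serial budget (just the arithmetic above restated;
the replies below explain why a pipelined design is not bound by it):

CLOCK_HZ      = 1e9        # BCM16K clock
NS_PER_PACKET = 6.08       # from the per-port rate above
SRAM_CYCLES   = 1          # single-cycle SRAM access per the data sheet

cycles_per_packet = NS_PER_PACKET * CLOCK_HZ / 1e9
print(cycles_per_packet)                    # ~6 cycles per packet
print(cycles_per_packet - SRAM_CYCLES)      # ~5 cycles left for everything else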

The obvious answer is that it's not magic and my understanding is
fundamentally flawed, so please enlighten me.

Cheers,
James.


-- 
FQ World Domination pending: https://blog.cerowrt.org/post/state_of_fq_codel/
Dave Täht CEO, TekLibre, LLC

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [Bloat] Fwd: 400G forwarding - how does it work?
  2022-07-25 13:12 ` [Bloat] Fwd: 400G forwarding - how does it work? Dave Taht
@ 2022-07-25 14:58   ` Simon Leinen
  0 siblings, 0 replies; 3+ messages in thread
From: Simon Leinen @ 2022-07-25 14:58 UTC (permalink / raw)
  To: Dave Taht via Bloat

Dave Taht via Bloat writes:
> I'd like to understand more deeply, too.

"Deep" is the right word here, 'cause there is DEEP pipelining going on.

"1 packet every 6.08 nanoseconds" (per pipeline) does NOT mean that you
only have (at 1GHz clock speed) 6 cycles to spend on each packet, just
that *each stage of the pipeline* needs to be ready to accept a new
packet every 6 cycles.

So if you can distribute the work over, say, 20 pipeline stages, then
you have 120 cycles to work on each packet.

Implementing this is surely tricky! Especially as you have to ship
relatively large amounts of (packet) data to the correct output port as
part of that processing.  Definitely a major engineering feat...
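
A toy model of that arithmetic (the 20-stage depth is the illustrative
number above, not Broadcom's actual pipeline):

ISSUE_INTERVAL_CYCLES = 6     # a new packet may enter every ~6 cycles (~6.08 ns at 1 GHz)
PIPELINE_STAGES       = 20    # hypothetical depth

# Each packet spends ISSUE_INTERVAL_CYCLES in each stage, so the total work
# available per packet is the product, while steady-state throughput is
# still one packet per ISSUE_INTERVAL_CYCLES once the pipeline is full.
work_budget_cycles = PIPELINE_STAGES * ISSUE_INTERVAL_CYCLES
print(work_budget_cycles)                       # 120 cycles of work per packet
print(1 / ISSUE_INTERVAL_CYCLES)                # ~0.167 packets per cycle throughput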

Cheers,
-- 
Simon.

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [Bloat] 400G forwarding - how does it work?
       [not found]                           ` <CAFwHcn=YH0vL6RNkr3wcHXmF405FY0x+YobgfTTMV5wBexwMFw@mail.gmail.com>
@ 2022-08-07 17:34                             ` Dave Taht
  0 siblings, 0 replies; 3+ messages in thread
From: Dave Taht @ 2022-08-07 17:34 UTC (permalink / raw)
  To: dip; +Cc: Masataka Ohta, NANOG, bloat

If it's of any help... the bloat mailing list at lists.bufferbloat.net has
the largest concentration of queue theorists and network operators +
developers I know of. (Also, bloat readers, this ongoing thread on nanog
about 400Gbit is fascinating.)

There is 10+ years' worth of debate in the archives;
https://lists.bufferbloat.net/pipermail/bloat/2012-May/thread.html is one
example.

On Sun, Aug 7, 2022 at 10:14 AM dip <diptanshu.singh@gmail.com> wrote:

>
> Disclaimer: I often use the M/M/1 queuing assumption for much of my work
> to keep the maths simple, and I believe I am reasonably aware of the
> contexts in which it's a right or a wrong application :). Also, I don't
> intend to change the core topic of the thread, but since this has come up,
> I couldn't resist.
>
> >> With 99% load M/M/1, 500 packets (750kB for 1500B MTU) of
> >> buffer is enough to make packet drop probability less than
> >> 1%. With 98% load, the probability is 0.0041%.
>
> To expand on the above a bit so that there is no ambiguity: the above
> assumes that the router behaves like an M/M/1 queue. The expected number
> of packets in the system is given by
>
>     E[N] = ρ / (1 - ρ)
>
> where ρ is the utilization. The probability that at least B packets are
> in the system is given by ρ^B, where B is the buffer size in packets. For
> a link utilization of 0.98, the packet drop probability is
> 0.98**500 ≈ 0.000041 (0.0041%); for a link utilization of 0.99,
> 0.99**500 ≈ 0.00657 (0.66%).
>
>
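
For what it's worth, a quick Python check of those M/M/1 numbers (using the
standard tail formula ρ^B, with B = 500 packets of buffer as above):

def mm1_tail_probability(rho: float, b: int) -> float:
    """P(at least b packets in an M/M/1 system at utilization rho)."""
    return rho ** b

for rho in (0.98, 0.99):
    p = mm1_tail_probability(rho, 500)
    mean_pkts = rho / (1 - rho)          # E[N] = rho / (1 - rho)
    print(f"rho={rho}: E[N]~{mean_pkts:.0f} pkts, P(N>=500)={p:.2e} ({p * 100:.4f}%)")
# rho=0.98: ~4.1e-05 (0.0041%); rho=0.99: ~6.6e-03 (0.657%)
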
Regrettably, TCP congestion controls, by design, do not stop growing until
you get that drop, i.e. until 100+% utilization.


> >> When many TCPs are running, burst is averaged and traffic
> >> is poisson.
>
> M/M/1 queuing assumes that traffic is Poisson, and the Poisson assumptions
> are:
> 1) The number of sources is infinite
> 2) The traffic arrival pattern is random.
>
> I think the second assumption is where I often question whether the
> traffic arrival pattern is truly random. I have seen cases where traffic
> behaves more like a self-similar process. Most Poisson models rely on the
> Central Limit Theorem, which loosely states that the sample distribution
> will approach a normal distribution as we aggregate more from various
> distributions, so the mean smooths towards a value.
>
> Do you have any good pointers to research showing that today's internet
> traffic can be modeled accurately as Poisson? For as many papers
> supporting Poisson, I have seen as many saying it's not Poisson.
>
> https://www.icir.org/vern/papers/poisson.TON.pdf
> https://www.cs.wustl.edu/~jain/cse567-06/ftp/traffic_models2/#sec1.2
>

I am firmly in the not-Poisson camp; however, by inserting (especially) FQ
and AQM techniques on the bottleneck links, it is very possible to smooth
traffic into this more analytically tractable model - and gain enormous
benefits from doing so.



> On Sun, 7 Aug 2022 at 04:18, Masataka Ohta <
> mohta@necom830.hpcl.titech.ac.jp> wrote:
>
>> Saku Ytti wrote:
>>
>> >> I'm afraid you imply too much buffer bloat only to cause
>> >> unnecessary and unpleasant delay.
>> >>
>> >> With 99% load M/M/1, 500 packets (750kB for 1500B MTU) of
>> >> buffer is enough to make packet drop probability less than
>> >> 1%. With 98% load, the probability is 0.0041%.
>>
>> > I feel like I'll live to regret asking. Which congestion control
>> > algorithm are you thinking of?
>>
>> I'm not assuming a LAN environment, for which paced TCP may
>> be desirable (if the bandwidth requirement is tight, which is
>> unlikely in a LAN).
>>
>> > But Cubic and Reno will burst TCP window growth at the sender rate, which
>> > may be much more than the receiver rate; someone has to store that growth
>> > and pace it out at the receiver rate, otherwise the window won't grow and
>> > the receiver rate won't be achieved.
>>
>> When many TCPs are running, burst is averaged and traffic
>> is poisson.
>>
>> > So in an ideal scenario, no we don't need a lot of buffer, in
>> > practical situations today, yes we need quite a bit of buffer.
>>
>> That is an old theory known to be invalid (Ethernet switches with
>> small buffers are enough for IXes) and theoretically refuted by:
>>
>>         Sizing router buffers
>>         https://dl.acm.org/doi/10.1145/1030194.1015499
>>
>> after which paced TCP was developed for unimportant exceptional
>> cases such as LANs.
>>
>>  > Now add to this multiple logical interfaces, each having 4-8 queues,
>>  > it adds up.
>>
>> Having so many queues requires sorting of queues to properly
>> prioritize them, which costs a lot of computation (and
>> performance loss) for no benefit and is a bad idea.
>>
>>  > Also, the shallow ingress buffers discussed in the thread are not delay
>>  > buffers, and the problem is complex because no marketable device can
>>  > accept wire rate at minimum packet size. So what trade-offs do we make
>>  > when we get bad traffic at wire rate and small packet size? We can't
>>  > empty the ingress buffers fast enough; do we have physical memory for
>>  > each port, do we share, and how do we share?
>>
>> People who use irrationally small packets will suffer, which is
>> not a problem for the rest of us.
>>
>>                                                 Masataka Ohta
>>
>>
>>

-- 
FQ World Domination pending:
https://blog.cerowrt.org/post/state_of_fq_codel/
Dave Täht CEO, TekLibre, LLC

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2022-08-07 17:34 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CAAWx_pX3fHc96N8Miti6Tuk9Km3YLUqAqZxDu_M5WJw2NErwHA@mail.gmail.com>
2022-07-25 13:12 ` [Bloat] Fwd: 400G forwarding - how does it work? Dave Taht
2022-07-25 14:58   ` Simon Leinen
     [not found] ` <20220725131600.GC30425@cmadams.net>
     [not found]   ` <CAAWx_pUk1rOwc+jvVEED4Es5384hLEb_eDHH0DtdnS8m-3K8ZQ@mail.gmail.com>
     [not found]     ` <CAAeewD9h6ci89WecsO72-v-975xGcwtXZqnWHyFHXHXi=dDHYw@mail.gmail.com>
     [not found]       ` <884F632E-BB1C-44DA-9FF8-9D20AC66D158@gmail.com>
     [not found]         ` <CAEHH8rG2w_3nBAQhAfu=bby2TarnmWaDQJVjBPQxvJiQxUyYkw@mail.gmail.com>
     [not found]           ` <CAAWx_pXJ6X36odR2pHRHn4So7qHwDDCtgcB7heKAA2FWhfS05Q@mail.gmail.com>
     [not found]             ` <49E386AE-BC73-458F-B679-2A438CF73E7A@hxcore.ol>
     [not found]               ` <CAFAzdPUKijZeJV+j1C+j=TtM7q3sgW5smdADRPpN191dEQmv-Q@mail.gmail.com>
     [not found]                 ` <CAAeewD-M5chnDF-dLDn_=8ow245yWbuvdOZgeL-kF51nvQeL-A@mail.gmail.com>
     [not found]                   ` <02ac01d8a8f1$3114efb0$933ecf10$@gmail.com>
     [not found]                     ` <1e895c68-bc7b-dfbf-3fdc-e31394c68626@necom830.hpcl.titech.ac.jp>
     [not found]                       ` <CAAeewD-_zE+huidNw_a__EEjRwFSGM+3uDV3COGL9EC0_=suBw@mail.gmail.com>
     [not found]                         ` <7e0b044d-3963-9c60-6a88-66d462278118@necom830.hpcl.titech.ac.jp>
     [not found]                           ` <CAFwHcn=YH0vL6RNkr3wcHXmF405FY0x+YobgfTTMV5wBexwMFw@mail.gmail.com>
2022-08-07 17:34                             ` [Bloat] " Dave Taht
