Development issues regarding the cerowrt test router project
* [Cerowrt-devel] 800gige
@ 2020-04-11 23:08 Dave Taht
  2020-04-12 16:15 ` David P. Reed
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Dave Taht @ 2020-04-11 23:08 UTC (permalink / raw)
  To: cerowrt-devel

The way I've basically looked at things since 25Gbit ethernet was that
improvements in single stream throughput were dead. I see a lot of
work on out of order delivery tolerance as an outgrowth of that,
but... am I wrong?

https://ethernettechnologyconsortium.org/wp-content/uploads/2020/03/800G-Specification_r1.0.pdf

-- 
Make Music, Not War

Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-435-0729


* Re: [Cerowrt-devel] 800gige
  2020-04-11 23:08 [Cerowrt-devel] 800gige Dave Taht
@ 2020-04-12 16:15 ` David P. Reed
  2020-04-15 17:39 ` Mikael Abrahamsson
       [not found] ` <mailman.1077.1586972355.1241.cerowrt-devel@lists.bufferbloat.net>
  2 siblings, 0 replies; 6+ messages in thread
From: David P. Reed @ 2020-04-12 16:15 UTC (permalink / raw)
  To: Dave Taht; +Cc: cerowrt-devel

Sadly, out-of-order delivery tolerance was a "requirement" when we designed TCP originally. There was a big motivation: spreading traffic across a variety of roughly equivalent paths, when you look at the center of the network activity (not the stupid image called "backbone" that forces you to think it is just one pipe in the middle).
Instead, a bunch of bell-head, circuit-oriented thought was engineered into TCP's later assumptions (though not UDP's, thank the lord). And I mean to be insulting there.

It continues to appall me how much the post-1990 TCP tinkerers have assumed "almost perfectly in-order" delivery of packets that are in transit in the network between endpoints, and how much they screw up when that isn't true.

Almost every paper in the literature (and the RFCs) makes this assumption.

But here's the point. With a little careful thought, it is unnecessary to make this assumption in almost all cases. For example: you can get the effect of SACK without having to assume that delivery is almost in-order. And the result will be a protocol that works better for out-of-order delivery, and also has much better performance when in-order delivery happens to occur. (And that's not even taking advantage of erasure coding like the invention called "Digital Fountains", which is also an approach for out-of-order delivery in TCP.)
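
To make that concrete, here is a rough sketch in Python (purely illustrative, all names hypothetical; no real TCP stack is written this way): a receiver that simply records which byte ranges have arrived, in whatever order they arrive, and reports the holes. Nothing in it cares whether the network reordered anything, yet it gives the sender exactly the SACK-style information it needs.

class RangeTracker:
    def __init__(self):
        self.ranges = []  # sorted, non-overlapping (start, end) byte ranges

    def receive(self, start, end):
        # Record bytes [start, end) no matter what order segments arrive in.
        self.ranges.append((start, end))
        self.ranges.sort()
        merged = [self.ranges[0]]
        for s, e in self.ranges[1:]:
            last_s, last_e = merged[-1]
            if s <= last_e:
                merged[-1] = (last_s, max(last_e, e))
            else:
                merged.append((s, e))
        self.ranges = merged

    def holes(self, upto):
        # Missing ranges below 'upto': exactly what a sender would resend.
        gaps, cursor = [], 0
        for s, e in self.ranges:
            if s > cursor:
                gaps.append((cursor, min(s, upto)))
            cursor = max(cursor, e)
            if cursor >= upto:
                break
        if cursor < upto:
            gaps.append((cursor, upto))
        return gaps

rx = RangeTracker()
for seg in [(3000, 4000), (0, 1000), (2000, 3000)]:  # arbitrary arrival order
    rx.receive(*seg)
print(rx.holes(4000))  # -> [(1000, 2000)]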

This is another example of the failure to adhere to the end-to-end argument. You don't need to put "near-in-order-delivery" as a function into the network to get the result you want (congestion control, efficient error-tolerance). So don't put that requirement on the network. Let it choose a different route for every packet from A to B.



On Saturday, April 11, 2020 7:08pm, "Dave Taht" <dave.taht@gmail.com> said:

> The way I've basically looked at things since 25Gbit ethernet was that
> improvements in single stream throughput were dead. I see a lot of
> work on out of order delivery tolerance as an outgrowth of that,
> but... am I wrong?
> 
> https://ethernettechnologyconsortium.org/wp-content/uploads/2020/03/800G-Specification_r1.0.pdf
> 
> --
> Make Music, Not War
> 
> Dave Täht
> CTO, TekLibre, LLC
> http://www.teklibre.com
> Tel: 1-831-435-0729
> _______________________________________________
> Cerowrt-devel mailing list
> Cerowrt-devel@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/cerowrt-devel
> 




* Re: [Cerowrt-devel] 800gige
  2020-04-11 23:08 [Cerowrt-devel] 800gige Dave Taht
  2020-04-12 16:15 ` David P. Reed
@ 2020-04-15 17:39 ` Mikael Abrahamsson
       [not found] ` <mailman.1077.1586972355.1241.cerowrt-devel@lists.bufferbloat.net>
  2 siblings, 0 replies; 6+ messages in thread
From: Mikael Abrahamsson @ 2020-04-15 17:39 UTC (permalink / raw)
  To: Dave Taht; +Cc: cerowrt-devel

On Sat, 11 Apr 2020, Dave Taht wrote:

> The way I've basically looked at things since 25Gbit ethernet was that
> improvements in single stream throughput were dead. I see a lot of
> work on out of order delivery tolerance as an outgrowth of that,
> but... am I wrong?

Backbone ISPs today are built with lots of parallel links (20x100GE for
instance) and then we do L4 hashing for flows across these. This means a
single L4 flow is capped at less than 100GE. This is not a huge problem,
but we're always trying to get faster and faster ports for a single flow
(and for other reasons).
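
As a toy illustration (not any vendor's implementation; the link names are
made up), the hashing looks roughly like this in Python: every packet of a
flow hashes to the same member link, which keeps the flow in order but also
caps it at that one member's rate.

import hashlib

MEMBER_LINKS = [f"100GE-{i}" for i in range(20)]  # hypothetical 20x100GE bundle

def pick_member(src_ip, dst_ip, proto, sport, dport):
    # Static L4 hash: same 5-tuple -> same member link, so ordering holds.
    key = f"{src_ip}|{dst_ip}|{proto}|{sport}|{dport}".encode()
    digest = hashlib.sha256(key).digest()
    return MEMBER_LINKS[int.from_bytes(digest[:4], "big") % len(MEMBER_LINKS)]

# Every packet of this one flow lands on the same 100GE member:
print(pick_member("192.0.2.1", "198.51.100.7", "tcp", 443, 51515))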

We're now going for 100 gigabit/s per lane (it's been going up from 4x2.5G
for 10GE to 1x10G, then we went for lane speeds of 10G, 25G, 50G and now
we're at 100G per lane), and it seems the 800GE in your link has 8 lanes
of that. This means a single L4 flow can be 800GE even though it's in
reality 8x100G lanes, as a single packet's bits are sprayed across all
the lanes.
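
A grossly simplified sketch of that spraying (the real 800GE PCS
distribution, FEC and alignment markers are much more involved): every lane
carries a slice of every packet, so a single flow sees the aggregate rate
rather than one lane's rate.

LANES = 8
LANE_GBPS = 100

def spray_blocks(packet_bits, block_bits=66):
    # Round-robin 66-bit blocks of one packet across all lanes.
    lanes = [[] for _ in range(LANES)]
    for i in range(0, packet_bits, block_bits):
        lanes[(i // block_bits) % LANES].append((i, min(i + block_bits, packet_bits)))
    return lanes

per_lane = spray_blocks(1500 * 8)            # one 1500-byte packet
print([len(blocks) for blocks in per_lane])  # each lane carries ~1/8 of the packet
print(LANES * LANE_GBPS, "Gbit/s usable by a single flow")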

The lane speeds are going up and up, and this relates to PCI-E as well,
but it's not fast enough, so we're going wider as well (think the
equivalent of PCI-E x16).

Out of the port we're doing DWDM long-haul transmission, and we're getting
closer and closer to the Shannon limit, so we're throwing lots of DSP at
the problem (the new DSPs are 7nm, and there are already 5nm and 3nm
roadmaps for these DSPs to keep power down).
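
For a rough sense of what "close to the Shannon limit" means for one DWDM
carrier, here is a back-of-envelope calculation; the bandwidth and SNR
figures below are assumed for illustration only, not measurements of any
real system.

from math import log2

bandwidth_hz = 75e9   # assumed usable optical bandwidth per carrier (~75 GHz)
snr_db = 20.0         # assumed signal-to-noise ratio at the receiver
snr_linear = 10 ** (snr_db / 10)

# Shannon: C = B * log2(1 + SNR); coherent optics carry two polarizations.
capacity_bps = 2 * bandwidth_hz * log2(1 + snr_linear)
print(f"~{capacity_bps / 1e9:.0f} Gbit/s theoretical ceiling per carrier")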

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se


* Re: [Cerowrt-devel] 800gige
       [not found] ` <mailman.1077.1586972355.1241.cerowrt-devel@lists.bufferbloat.net>
@ 2020-04-15 20:57   ` Michael Richardson
  2020-04-15 21:08     ` Joel Wirāmu Pauling
  0 siblings, 1 reply; 6+ messages in thread
From: Michael Richardson @ 2020-04-15 20:57 UTC (permalink / raw)
  To: Mikael Abrahamsson, cerowrt-devel



Mikael Abrahamsson via Cerowrt-devel wrote:
    > Backbone ISPs today are built with lots of parallel links (20x100GE for
    > instance) and then we do L4 hashing for flows across these. This means

got it. inverse multiplexing of flows across *links*

    > We're now going for 100 gigabit/s per lane (it's been going up from 4x2.5G
    > for 10GE to 1x10G, then we went for lane speeds of 10G, 25G, 50G and now
    > we're at 100G per lane), and it seems the 800GE in your link has 8 lanes of
    > that. This means a single L4 flow can be 800GE even though it's in reality
    > 8x100G lanes, as a single packet's bits are sprayed across all the
    > lanes.

Here you talk about *lanes*, and inverse multiplexing of a single frame across *lanes*.
Your allusion to PCI-E is well taken, but if I am completing the analogy, and
the reference to DWDM, I'm thinking that you are talking about 100 gigabit/s
per lambda, with a single frame being inverse multiplexed across lambdas (as lanes).

Did I understand this correctly?

I understand a bit of "because we can".
I also understand that 20 x 800GE parallel links is better than 20 x 100GE
parallel links across the same long-haul (dark) fiber.

But what is the reason for ISPs to want a single L4 flow to be able to use
more than 100GE?  Given that it seems harder to L3 switch 800GE than to
switch 8 flows of already L4-ordered 100GE (flow label!), why pay the
extra price here?

While I can see L2VPN use cases, I can also see that L2VPNs could generate
multiple flows themselves if they wanted.

--
]               Never tell me the odds!                 | ipv6 mesh networks [
]   Michael Richardson, Sandelman Software Works        |    IoT architect   [
]     mcr@sandelman.ca  http://www.sandelman.ca/        |   ruby on rails    [




* Re: [Cerowrt-devel] 800gige
  2020-04-15 20:57   ` Michael Richardson
@ 2020-04-15 21:08     ` Joel Wirāmu Pauling
  2020-04-15 21:35       ` Dave Taht
  0 siblings, 1 reply; 6+ messages in thread
From: Joel Wirāmu Pauling @ 2020-04-15 21:08 UTC (permalink / raw)
  To: Michael Richardson; +Cc: Mikael Abrahamsson, cerowrt-devel


Another neat thing about 400 and 800GE is that you can get MPO optics that
allow splitting a single 4x100 or 8x100 into individual 100G feeds. Good
for port density and/or for adding capacity to processing/edge/appliances.

Now that there are decent ER optics for 100G, you can do 40-70 km runs of
each 100G link without additional active electronics on the path or going
to an optical transport route.

On Thu, 16 Apr 2020 at 08:57, Michael Richardson <mcr@sandelman.ca> wrote:

>
> Mikael Abrahamsson via Cerowrt-devel wrote:
>     > Backbone ISPs today are built with lots of parallel links (20x100GE for
>     > instance) and then we do L4 hashing for flows across these. This means
>
> got it. inverse multiplexing of flows across *links*
>
>     > We're now going for 100 gigabit/s per lane (it's been going up from
>     > 4x2.5G for 10GE to 1x10G, then we went for lane speeds of 10G, 25G,
>     > 50G and now we're at 100G per lane), and it seems the 800GE in your
>     > link has 8 lanes of that. This means a single L4 flow can be 800GE
>     > even though it's in reality 8x100G lanes, as a single packet's bits
>     > are sprayed across all the lanes.
>
> Here you talk about *lanes*, and inverse multiplexing of a single frame
> across *lanes*.
> Your allusion to PCI-E is well taken, but if I am completing the analogy,
> and the reference to DWDM, I'm thinking that you are talking about 100
> gigabit/s per lambda, with a single frame being inverse multiplexed across
> lambdas (as lanes).
>
> Did I understand this correctly?
>
> I understand a bit of "because we can".
> I also understand that 20 x 800GE parallel links is better than 20 x 100GE
> parallel links across the same long-haul (dark) fiber.
>
> But what is the reason for ISPs to want a single L4 flow to be able to use
> more than 100GE?  Given that it seems harder to L3 switch 800GE than to
> switch 8 flows of already L4-ordered 100GE (flow label!), why pay the
> extra price here?
>
> While I can see L2VPN use cases, I can also see that L2VPNs could generate
> multiple flows themselves if they wanted.
>
> --
> ]               Never tell me the odds!                 | ipv6 mesh networks [
> ]   Michael Richardson, Sandelman Software Works        |    IoT architect   [
> ]     mcr@sandelman.ca  http://www.sandelman.ca/        |   ruby on rails    [
>
> _______________________________________________
> Cerowrt-devel mailing list
> Cerowrt-devel@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/cerowrt-devel
>



* Re: [Cerowrt-devel] 800gige
  2020-04-15 21:08     ` Joel Wirāmu Pauling
@ 2020-04-15 21:35       ` Dave Taht
  0 siblings, 0 replies; 6+ messages in thread
From: Dave Taht @ 2020-04-15 21:35 UTC (permalink / raw)
  To: Joel Wirāmu Pauling; +Cc: Michael Richardson, cerowrt-devel

I've always kind of wanted a guesstimate and cost breakdown
(politics/fiber cost/trenching) as to how much it would cost to run, oh,
16km of quality 100Gbit fiber from Los Gatos to me. I know, month to
month that would kind of cost a lot to fill....
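
Just to show the shape of the breakdown I mean, in Python; every number
below is a made-up placeholder, not a quote from anyone, so treat it as a
template rather than an estimate.

# All figures below are hypothetical placeholders, not real quotes.
run_km = 16
trenching_per_km = 50_000   # placeholder: conduit/trenching along the route
fiber_per_km = 5_000        # placeholder: cable, splicing, closures
permits_fixed = 100_000     # placeholder: the "politics" line item
optics_per_end = 10_000     # placeholder: a long-reach 100G optic per end

total = run_km * (trenching_per_km + fiber_per_km) + permits_fixed + 2 * optics_per_end
print(f"~${total:,} one-time, before any monthly transit to fill it")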

I costed out what it would take to trench the whole community once
upon a time, and instead of that I've been patiently awaiting my first
Starlink terminals....

https://www.google.com/maps/place/20600+Aldercroft+Heights+Rd,+Los+Gatos,+CA+95033/@37.1701322,-121.9806674,17z/data=!3m1!4b1!4m5!3m4!1s0x808e37ced60da4fd:0x189086a00c73ad37!8m2!3d37.1701322!4d-121.9784787

On Wed, Apr 15, 2020 at 2:09 PM Joel Wirāmu Pauling <joel@aenertia.net> wrote:
>
> Another neat thing about 400 and 800GE is that you can get MPO optics that allow splitting a single 4x100 or 8x100 into individual 100G feeds. Good for port density and/or for adding capacity to processing/edge/appliances.
>
> Now that there are decent ER optics for 100G, you can do 40-70 km runs of each 100G link without additional active electronics on the path or going to an optical transport route.
>
> On Thu, 16 Apr 2020 at 08:57, Michael Richardson <mcr@sandelman.ca> wrote:
>>
>>
>> Mikael Abrahamsson via Cerowrt-devel wrote:
>>     > Backbone ISPs today are built with lots of parallel links (20x100GE for
>>     > instance) and then we do L4 hashing for flows across these. This means
>>
>> got it. inverse multiplexing of flows across *links*
>>
>>     > We're now going for 100 gigabit/s per lane (it's been going up from 4x2.5G
>>     > for 10GE to 1x10G, then we went for lane speeds of 10G, 25G, 50G and now
>>     > we're at 100G per lane), and it seems the 800GE in your link has 8 lanes of
>>     > that. This means a single L4 flow can be 800GE even though it's in reality
>>     > 8x100G lanes, as a single packet's bits are sprayed across all the
>>     > lanes.
>>
>> Here you talk about *lanes*, and inverse multiplexing of a single frame across *lanes*.
>> Your allusion to PCI-E is well taken, but if I am completing the analogy, and
>> the reference to DWDM, I'm thinking that you are talking about 100 gigabit/s
>> per lambda, with a single frame being inverse multiplexed across lambdas (as lanes).
>>
>> Did I understand this correctly?
>>
>> I understand a bit of "because we can".
>> I also understand that 20 x 800GE parallel links is better than 20 x 100GE
>> parallel links across the same long-haul (dark) fiber.
>>
>> But what is the reason for ISPs to want a single L4 flow to be able to use
>> more than 100GE?  Given that it seems harder to L3 switch 800GE than to
>> switch 8 flows of already L4-ordered 100GE (flow label!), why pay the
>> extra price here?
>>
>> While I can see L2VPN use cases, I can also see that L2VPNs could generate
>> multiple flows themselves if they wanted.
>>
>> --
>> ]               Never tell me the odds!                 | ipv6 mesh networks [
>> ]   Michael Richardson, Sandelman Software Works        |    IoT architect   [
>> ]     mcr@sandelman.ca  http://www.sandelman.ca/        |   ruby on rails    [
>>
>> _______________________________________________
>> Cerowrt-devel mailing list
>> Cerowrt-devel@lists.bufferbloat.net
>> https://lists.bufferbloat.net/listinfo/cerowrt-devel
>
> _______________________________________________
> Cerowrt-devel mailing list
> Cerowrt-devel@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/cerowrt-devel



-- 
Make Music, Not War

Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-435-0729


end of thread, other threads:[~2020-04-15 21:35 UTC | newest]

Thread overview: 6+ messages
2020-04-11 23:08 [Cerowrt-devel] 800gige Dave Taht
2020-04-12 16:15 ` David P. Reed
2020-04-15 17:39 ` Mikael Abrahamsson
     [not found] ` <mailman.1077.1586972355.1241.cerowrt-devel@lists.bufferbloat.net>
2020-04-15 20:57   ` Michael Richardson
2020-04-15 21:08     ` Joel Wirāmu Pauling
2020-04-15 21:35       ` Dave Taht
