[Bloat] [Cerowrt-devel] DC behaviors today

Sun Dec 17 16:37:28 EST 2017

This is an interesting topic to me. Over the past 5+ years, I've been
reading about GPON fiber aggregators (GPON chassis, for lack of a proper
term) with 400Gb-1Tb/s of uplink, 1-2Tb/s line-cards, and enough GPON ports
for several thousand customers.

When my current ISP started rolling out fiber (all of it underground, no
above-ground fiber), I called support during a graveyard hour on the
weekend, and I got a senior network admin answering the phone instead of
normal tech support. When talking to him, I asked him what they meant by
"guaranteed" bandwidth. I guess I should mention that my ISP claims
dedicated bandwidth for everyone. He told me that they played with
over-subscription for a while, but it just resulted in complex situations
that caused customers to complain. Complaining customers are expensive
because they eat up support phone time. They eventually went to a
non-oversubscribed flat model. He told me that the GPON chassis plugs
straight into the core router. I asked him about GPON port shared bandwidth
and the GPON uplink. He said they will not over-subscribe a GPON port, so
all ONTs on the port can use 100% of their provisioned rate, and they will
not place more provisioned bandwidth on a single GPON chassis than what the
uplink can support.
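
As I understood the rule he described, it amounts to two sum checks before
a new customer is provisioned. Here is a minimal sketch of that; the class
and function names are mine, the uplink size and customer counts are made
up, and 2488 Mb/s is just the nominal GPON downstream line rate, not
something my ISP told me.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GponPort:
    """One shared GPON port and the rates provisioned on it (Mb/s)."""
    capacity_mbps: float
    ont_rates_mbps: List[float] = field(default_factory=list)

    def provisioned_mbps(self) -> float:
        return sum(self.ont_rates_mbps)


def can_provision(port, chassis_ports, uplink_mbps, new_rate_mbps):
    """True if adding new_rate_mbps over-subscribes neither the port
    nor the chassis uplink. `port` must be one of `chassis_ports`."""
    port_ok = port.provisioned_mbps() + new_rate_mbps <= port.capacity_mbps
    chassis_total = sum(p.provisioned_mbps() for p in chassis_ports)
    chassis_ok = chassis_total + new_rate_mbps <= uplink_mbps
    return port_ok and chassis_ok


# Illustrative numbers only (hypothetical port with 48 x 50 Mb/s customers).
port = GponPort(capacity_mbps=2488, ont_rates_mbps=[50.0] * 48)
print(can_provision(port, [port], uplink_mbps=400_000, new_rate_mbps=50))
port.ont_rates_mbps.append(50.0)  # 49 customers now provisioned
print(can_provision(port, [port], uplink_mbps=400_000, new_rate_mbps=50))
```

With those numbers the 49th customer fits (2450 Mb/s) and the 50th does
not (2500 Mb/s), so it would have to go on another port.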

For the longest time, their max sold bandwidth was 50Mb/s. After some time,
they were having some issues resulting in packet-loss during peak hours.
Turned out their old core router could not support all of the new customers
in the ARP cache and was causing massive amounts of broadcast packets. I
actually helped them solve this issue. They had me work with a hired
consulting service that was having issues diagnosing the problem, largely
because the older hardware did not support modern diagnostic features.
They fixed the problem by upgrading the core router. Because I was already
in contact with them during this issue, I was made privy that their new
core router could handle about 10Tb/s with a lot of room for 100Gb+ ports.
I got no exact details, but was told their slowest internal link was now
100Gb.

Their core router actually had traffic shaping and an AQM built in. They
switched from using ONT rate limiting for provisioning to letting the core
router handle provisioning. I can actually see 1Gb bursts as their shaping
seems to be like a sliding window over a few tens of ms. I have actually
tested their AQM a bit via a DOS testing service. At the time, I had a
100Mb/100Mb service, and externally flooding my connection with 110Mb/s
resulted in about 10% packet loss, but my ping stayed under 20ms. I tried
200Mb/s for about 20 seconds, which resulted in about 50% loss and still
~20ms pings. For about 10 seconds I tested 1Gb/s DOS and had about 90%
loss (not a long time to sample, but was sampled at a rate of 10pps against
their speed-test server), but 20-40ms pings. I tested this during off
hours, like 1am.
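
Those loss numbers are about what you would expect if the shaper simply
drops whatever exceeds the provisioned rate. A quick sanity check, assuming
an ideal shaper that passes exactly the provisioned rate and drops the rest
(the function name is mine, and this says nothing about how their AQM
actually works):

```python
def expected_loss(offered_mbps: float, shaped_mbps: float) -> float:
    """Fraction of offered traffic dropped by an ideal rate limiter."""
    if offered_mbps <= shaped_mbps:
        return 0.0
    return 1.0 - shaped_mbps / offered_mbps


# The three flood rates I tested against my 100Mb service.
for offered in (110, 200, 1000):
    loss = expected_loss(offered, 100)
    print(f"{offered:>5} Mb/s offered -> {loss * 100:.1f}% expected loss")
```

That gives 9.1%, 50.0%, and 90.0%, which lines up with the roughly 10%,
50%, and 90% I measured.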

A few months after the upgrade, I got upgraded to a 100Mb connection with
no change in price and several new higher tiers were added, all the way up
to 1Gb/s. I asked them about this. Yes, the 1Gb tier was also not
over-subscribed. I'm not sure if some lone customer pretty much got their
own GPON port or they had some WDM-PON linecards.

I'm currently paying about $40/m for 150/150 for a "dedicated" connection.
I'm currently getting about 1ms +/- 0.1ms pings to my ISP's speedtest server
24/7.  If I do a ping flood, I can get my avg ping down near 0.12ms. I
assume this is because of GPON scheduling. Of course I only test this
against their speedtest server and during off hours.

As for the trunk, I've also talked to them about that, at least in the past
and I can't speak for more current times. They had 3 trunks, 2 to Level 3
Chicago and one to Global Crossing Minnesota. I was told each link was a
paired link for immediate fail-over. I was told that in some cases, they've
bonded the links, primarily due to DDOS attacks, to quickly double their
bandwidth. Their GX link was their fail-over and the two Chicago Level 3
links were the load balanced primaries. Based on trace-routes, they seemed
to be load-balanced by some lower-bits in the IP address. This gave a total
of 6 links. The network admin told me that any given link had enough
bandwidth provisioned that, if all 5 other links were down, that one link
would have a 95th percentile below 80% utilization during peak hours, and
customers should be completely unaffected.
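
To put that sizing rule in concrete terms, here is a small sketch. The
link sizes and the peak traffic figure are hypothetical, since I was never
given the real numbers; only the "one surviving link stays below 80% at
the 95th percentile" rule comes from the admin.

```python
def required_link_capacity(peak_p95_gbps: float,
                           max_utilization: float = 0.80) -> float:
    """Smallest single-link capacity that keeps the 95th percentile
    at or under max_utilization when that link carries everything."""
    return peak_p95_gbps / max_utilization


def survives_failures(link_capacities_gbps, peak_p95_gbps, failed,
                      max_utilization: float = 0.80) -> bool:
    """True if the links not in `failed` (a set of indexes) keep the
    95th percentile below max_utilization."""
    remaining = sum(c for i, c in enumerate(link_capacities_gbps)
                    if i not in failed)
    return remaining > 0 and peak_p95_gbps / remaining < max_utilization


# Hypothetical: six 10Gb links and a 7Gb/s peak-hour 95th percentile.
links = [10.0] * 6
print(required_link_capacity(7.0))                       # capacity needed per link
print(survives_failures(links, 7.0, failed={0, 1, 2, 3, 4}))  # one link left
print(survives_failures(links, 9.0, failed={0, 1, 2, 3, 4}))  # too much traffic
```

In this made-up case a single 10Gb link carrying 7Gb/s sits at 70%, so the
rule holds, while 9Gb/s of peak traffic would break it.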

They've been advertising guaranteed dedicated bandwidth for over 15 years
now. They recently had a marketing campaign against the local incumbent
where they poked fun at them for only selling "up to" bandwidth. This went
on for at least a year. They openly advertised that their bandwidth was not
"up to", but that customers will always get all of their bandwidth all the
time. In the small print it said "to their transit provider". In short, my
ISP is claiming I should always get my provisioned bandwidth to Level 3
24/7. As far as I have cared to measure, this is true. At one point I had a
month long ping of 2 pps against AWS Frankfurt: ~140ms avg, ~139ms min,
std-dev ~3ms, max ping of ~160ms, and fewer than 100 lost packets. 6-12ms
to Chicago, depending on which link, 30-35ms to New York City depending on
the link, 90ms to London, and 110ms to Paris. Interesting note, AWS Frankfurt
was only claiming about 6 hops from Midwest USA. That's impressive.
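
For scale, here is the arithmetic on that month-long ping, assuming a
30-day month and treating "fewer than 100 lost" as an upper bound of 100:

```python
SECONDS_PER_DAY = 86_400


def probes_sent(pps: int, days: int) -> int:
    """Total echo requests sent at a constant rate."""
    return pps * SECONDS_PER_DAY * days


def loss_percent(lost: int, sent: int) -> float:
    """Packet loss as a percentage of probes sent."""
    return 100.0 * lost / sent


sent = probes_sent(pps=2, days=30)   # assumed 30-day month
print(sent)                          # 5184000
print(f"{loss_percent(100, sent):.4f}%")  # 0.0019%
```

So that is a bit over 5 million probes with loss under about 0.002%.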

Back when I was load testing my 100Mb connection, I queued up a bunch of
well seeded large Linux ISOs, and downloaded to my SSDs. Between my traffic
shaping via pfSense and my ISP's unknown AQM, I averaged 99.8Mb/s with a
max of 99.9Mb/s and a min of 99.7Mb/s, sampled over a 1.5 hour window from
8:30p to 10p. Those averages were as reported by pfSense in 1min slices. 0
ping packets lost to my ISP with no ping more than ~10ms and the
avg/std-dev was identical to idle to within 0.1ms. When doing the DDOS,
pfSense reported exactly 100.0Mb/s hitting the WAN with zero dips.

In short, if I wanted to, I could purchase a 500/500 "dedicated" connection
for $110/m, plus tax but no other fees, with free install, a passive
point-to-point self-healing ring back to the CO from my house, and a /29
static block for an additional $10/m. I was told I can do web hosting, but
with no SLA, even though I get near-perfect connectivity and single-digit
minutes of yearly downtime, between 1am and 2am.

This is all from a local private ISP that openly brags that they do not
accept any government grants, loans, or other subsidies. My ISP is about
120 years old and started off as a telegraph service. I've gotten the
feeling that fast dedicated bandwidth is cheap and easy, assuming you're an
established ISP that doesn't have to fight through red-tape. We've got
farmers with 1Gb/1Gb dedicated fiber connections, all without government
support.

About 3 years ago I was reading about petabit core routers with 1Tb/s ports
and single-fiber ~40Tb/s multiplexers. Recently I heard that 100Gb PON with
2.5Tb/s of bandwidth is already partially working in labs, with an expected
cost not much more than current day XG2-PON, which is what... 10Gb/s or so
split among 32 customers? As far as I can tell, last mile bandwidth is a
solved problem short of incompetence, greed, or extreme circumstances.

Ahh yes. Statistical over-subscription was the topic. This works well for
backbone providers where they have many peering links with a heavy mix of
flows. Level 3 has a blog where they were showing off a 10Gb link where
below the 95th percentile, the link had zero total packets lost and a
queuing delay of less than 0.1ms. But above 80%, suddenly loss and jitter
went up with a hockey-stick curve. Then they showed a 400Gb link. It was at
98% utilization for the 95th percentile and it had zero total packets lost
and a max queuing delay of 0.01ms with an average of 0.00ms.

There was a major European IX that had a blog about bandwidth planning and
over-provisioning. They had a 95th percentile in the many-terabits, and
they said they could always predict peak bandwidth to within 1% for any
given day. Given a large mix of flow types, the statistics are very good.
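
A toy model shows why the statistics get so good at scale. Assume N
customers who are each independently active with probability p at the same
rate (a big simplification, since real demand is correlated by time of
day). Then aggregate demand is binomial, and its relative spread shrinks
as 1/sqrt(N). The function name and the 10% activity figure below are just
my assumptions for illustration.

```python
import math


def peak_to_mean_headroom(customers: int, p_active: float,
                          k_sigma: float = 3.0):
    """Return (coefficient of variation, capacity/mean ratio needed to
    cover mean + k_sigma standard deviations) for a binomial model of
    independent, identical on/off customers."""
    mean = customers * p_active
    sigma = math.sqrt(customers * p_active * (1.0 - p_active))
    cv = sigma / mean
    return cv, 1.0 + k_sigma * cv


for n in (100, 10_000, 1_000_000):
    cv, headroom = peak_to_mean_headroom(n, p_active=0.10)
    print(f"N={n:>9,}  spread={cv:.1%}  capacity/mean={headroom:.3f}")
```

With 10% activity the spread works out to 3/sqrt(N): roughly 30% for 100
customers, 3% for 10,000, and 0.3% for a million. That is the same effect
that lets a big link run close to full while a small one needs a lot of
headroom.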

On a slightly different topic, I wonder what trunk providers are using for
AQMs. My ISP was under a massive DDOS some time in the past year and I used
a Level 3 looking glass from Chicago, which showed only a 40ms delta
between the pre-hop and hitting my ISP, where it was normally about 11ms
for that link. You could say about 30ms of buffering was going on. The
really interesting thing is I was only getting about 5-10Mb/s, which means
there was virtually zero free bandwidth, but I had almost no packet loss. I
called my ISP shortly after the issue started and that's when they told me
they were under a DDOS and were at 100% trunk, and they said they were
going to have their trunk bandwidth increased shortly. 5 minutes later, the
issue was gone. About 30 minutes later I was called back and told the DDOS
was still on-going, they just upgraded to enough bandwidth to soak it all.
I found it very interesting that a DDOS large enough to effectively kill
95% of my provisioned bandwidth and increase my ping 30ms over normal
barely seemed to affect packet loss. It was well under 0.1%. Is
this due to the statistical nature of large links or did Level 3 have an
AQM to my ISP?

On Thu, Dec 14, 2017 at 2:22 AM, Mikael Abrahamsson <swmike at swm.pp.se>
wrote:

> On Wed, 13 Dec 2017, Jonathan Morton wrote:
>
>> Ten times average demand estimated at time of deployment, and struggling
>> badly with peak demand a decade later, yes.  And this is the transportation
>> industry, where a decade is a *short* time - like less than a year in
>> telecoms.
>>
>
> I've worked in ISPs since 1999 or so. I've been at startups and I've been
> at established ISPs.
>
> It's kind of an S curve when it comes to traffic growth, when you're
> adding customers you can easily see 100%-300% growth per year (or more).
> Then after market becomes saturated growth comes from per-customer
> increased usage, and for the past 20 years or so, this has been in the
> neighbourhood of 20-30% per year.
>
> Running a network that congests parts of the day, it's hard to tell what
> "Quality of Experience" your customers will have. I've heard of horror
> stories from the 90s where a then large US ISP was running an OC3 (155
> megabit/s) full most of the day. So someone said "oh, we need to upgrade
> this", and after a while, they did, to 2xOC3. Great, right? No, after that
> upgrade both OC3s were completely congested. Ok, then upgrade to OC12 (622
> megabit/s). After that upgrade, evidently that link was not congested a few
> hours of the day, and of course needed more upgrades.
>
> So at the places I've been, I've advocated for planning rules that say
> that when the link is peaking at 5 minute averages of more than 50% of link
> capacity, then upgrade needs to be ordered. This 50% number can be larger
> if the link aggregates a larger number of customers, because typically your
> "statistical overbooking" varies less the more customers participate.
>
> These devices do not do per-flow anything. They might have 10G or 100G
> link to/from it with many many millions of flows, and it's all NPU
> forwarding. Typically they might do DiffServ-based queueing and WRED to
> mitigate excessive buffering. Today, they typically don't even do ECN
> marking (which I have advocated for, but there is not much support from
> other ISPs in this mission).
>
> Now, on the customer access line it's a completely different matter.
> Typically people build with BRAS or similar, where (tens of) thousands of
> customers might sit on a (very expensive) access card with hundreds of
> thousands of queues per NPU. This still leaves just a few queues per
> customer, unfortunately. So these do not do per-flow anything either. This
> is where PIE comes in, because devices like these can do PIE in the
> NPU fairly easily because it's kind of like WRED.
>
> So back to the capacity issue. Since these devices typically aren't good
> at assuring per-customer access to the shared medium (backbone links), it's
> easier to just make sure the backbone links are not regularly full. This
> doesn't mean you're going to have 10x capacity all the time, it probably
> means you're going to be bouncing between 25-70% utilization of your links
> (for the normal case, because you need spare capacity to handle events that
> increase traffic temporarily, plus handle loss of capacity in case of a
> link fault). The upgrade might be to add another link, or a higher tier
> speed interface, bringing down the utilization to typically half or quarter
> of what you had before.
>
>
> --
> Mikael Abrahamsson    email: swmike at swm.pp.se