[LibreQoS] Fwd: CGNAT growing pains

Many ISPs need the kinds of quality shaping cake can do

* [LibreQoS] Fwd: CGNAT growing pains
       [not found] ` <a92b5fec-5d5c-e28-afd0-b5a29b2b3f5@richweb.com>
@ 2024-10-08 19:40   ` Dave Taht
  2024-10-08 20:10     ` dan
  0 siblings, 1 reply; 2+ messages in thread
From: Dave Taht @ 2024-10-08 19:40 UTC (permalink / raw)
  To: libreqos


---------- Forwarded message ---------
From: C. Jon Larsen <jlarsen@richweb.com>
Date: Tue, Oct 8, 2024 at 12:34 PM
Subject: Re: CGNAT growing pains
To: Jon Lewis <jlewis@lewis.org>
Cc: <nanog@nanog.org>



We have had very good success with A10 vThunder on rural broadband
co-op networks for residential subscribers. No problems with the NAT
aspect, literally 0. Operationally it just works. Games, streaming, Xbox,
Nintendo Switch, all just work.

We typically do 32:1, or about 2000 UDP/TCP ports allocated per
customer, behind the A10. As you climb toward 48:1, 64:1, 128:1, etc.,
the rate of CDN blocking because "you are behind a VPN" starts to go up
noticeably.
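
For reference, the arithmetic behind a sharing ratio can be sketched roughly as below. It assumes ports below 1024 are held back and the rest of the 16-bit range is split evenly among subscribers, which lands close to the "32:1 or about 2000 ports" figure above. The reserved range and the function name are assumptions for illustration, not A10 defaults.

```python
# Sketch: ports available per subscriber for a given CGNAT sharing ratio.
# Assumption: ports below 1024 are reserved and the rest is divided evenly.
TOTAL_PORTS = 65536
RESERVED_PORTS = 1024  # assumed reservation, not taken from any vendor default


def ports_per_subscriber(ratio: int, reserved: int = RESERVED_PORTS) -> int:
    """Return how many ports each subscriber gets, per protocol, on one public IP."""
    if ratio < 1:
        raise ValueError("ratio must be at least 1")
    return (TOTAL_PORTS - reserved) // ratio


if __name__ == "__main__":
    for ratio in (32, 48, 64, 128):
        print(f"{ratio}:1 -> {ports_per_subscriber(ratio)} ports per subscriber")
```

Under these assumptions 32:1 gives 2016 ports per subscriber and 64:1 gives 1008.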

If you have your LIDs (what A10 calls the inside IPs that get mapped
to NAT pools) set up properly and your inside CGN 100.64/10 IP space
sanely laid out, it's pretty easy. You can carve out pools for each
market (say a couple of /21s or a /19), map each to a pool of
public IPs accordingly, and then in your self-hosted geofeed lay out
that block with the correct data.
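
A minimal sketch of the per-market mapping and the matching geofeed entries might look like the following. The geofeed line layout (prefix, country, region, city, postal code) follows RFC 8805; every prefix, market, and name in the table is made up for illustration (the public prefixes are documentation ranges), and only the public pools are published.

```python
# Sketch: emit self-published geofeed lines (RFC 8805 CSV format) for
# per-market public NAT pools. All prefixes and markets here are hypothetical.
import csv
import io
import ipaddress

# Hypothetical mapping: (inside CGN pool, public pool, country, region, city)
MARKETS = [
    ("100.64.0.0/19", "192.0.2.0/25", "US", "US-FL", "Miami"),
    ("100.64.32.0/19", "192.0.2.128/25", "US", "US-FL", "Jacksonville"),
]


def geofeed_lines(markets) -> str:
    """Return geofeed CSV text: prefix,country,region,city,postal_code."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for inside, public, country, region, city in markets:
        # Validate both prefixes; only the public one goes into the feed.
        ipaddress.ip_network(inside)
        public_net = ipaddress.ip_network(public)
        writer.writerow([str(public_net), country, region, city, ""])
    return out.getvalue()


if __name__ == "__main__":
    print(geofeed_lines(MARKETS), end="")
```

The postal code field is left empty here; whether to populate it is a policy choice for whoever publishes the feed.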

We try to give all business customers a /32 public IP, either from a DHCP
reservation or a static assignment on an EVPN subnet, so business
customers would not typically get CGN IPs. We also encourage them to
enable v6 and get that set up where possible.

> We started rolling out CGNAT about 6 months ago.  It was smooth sailing
> for the first few months, but we eventually did run into a number of
> issues.
>
> Our customer base is primarily FTTH with "dynamic" IP assignment via
> DHCP.  Since connections are always-on, customer ONTs/routers get an IP
> assigned, and then when the lease is renewed, they request a new lease
> for the existing IP, and, in general, that request is granted.  This
> gives customers the mistaken impression they have a static IP.  So, my
> impression, from working with some customers who've needed to be moved
> from CGNAT back to public IP is that customers who are doing
> port-forwarding don't even bother with dynamic DNS.  They just know they
> can connect to their IP as they've never seen it change.  We do
> offer/sell static IP, but pre-CGNAT, it was strictly for business
> customers.  i.e. A residential customer could only get static IP service
> by converting their account to a business account. That may change in
> the near future.
>
> One issue we didn't foresee has been IP Geo issues.  i.e.  We all knew
> that streaming services like Netflix use IP Geo to determine what
> content should be made available, but that's, AFAIK, limited by country
> or region. What we didn't anticipate is services like Hulu Live TV doing
> IP Geo down to the city level to determine which local channels are a
> subscriber's local channels.  We're using Juniper MX gear and SPC3 cards
> for our CGNAT routers, each one having a single large external pool.
> Since we serve most of FL, one external pool can't IP Geo correctly for
> customers as far apart as Miami and Jacksonville hitting the same CGNAT
> router.  We don't currently have an acceptable solution to this other
> than moving impacted customers off CGNAT.
>
> One of the great unknowns (at least for us) with CGNAT was what our PBA
> settings should be.  i.e.  How large each port-block should be, and how
> many port-blocks to allow per customer.  We started with 256x4.  It
> seemed to work.  We eventually noticed that we were logging port-block
> exceeded errors.  This is one aspect where Juniper's CGNAT support is
> lacking. There's a counter for these errors, and it's available via
> SNMP, but there's no way to attribute the errors to subscriber IPs.
> We're polling the MIB and graphing it, so we know it's a continuing
> issue and can see when it's incrementing faster/slower, but Junos
> provides no means for determining if "PBEs" are all being caused by a
> single customer, a handful of customers, etc.  We have a JTAC case open
> on this.  As a quick & hopeful fix, we increased both the port-block
> size and block limit.  That helped, but didn't stop the errors.  It also
> cut our CGNAT ratio by more than half (64:1 -> 28:1); if we stay at this
> ratio, we'll need much larger external pools than originally
> anticipated.  Tuning these settings is kind of painful as JTAC strongly
> recommends bouncing the CGNAT service anytime CGNAT related config
> changes are made.  This means briefly breaking Internet access for all
> CGNAT'd customers.  For the PBEs, JTAC's suggestions so far have been to
> shorten some of the timeouts in the config and to keep doing what we're
> doing, which is a cron job that essentially does a "show services nat
> source port-block", parses the output looking for subscriber IPs that
> have used up the ports in several of their port-blocks, then does a
> "show services sessions source-prefix ..." and logs all of this.  This
> at least gives us snapshots of "who's a heavy user right now" and lets
> us look at how they were using all their ports.  i.e. was it BitTorrent,
> are they compromised and scanning the internet for more systems to
> compromise, is it legit looking traffic - just lots of it, etc.?
>
> The latest CGNAT issue is a customer with a Palo Alto Networks firewall
> connected to our network and several of their employees are our FTTH
> customers.  On their PANW firewall, they're doing IP Geo based filtering,
> limiting access to internal servers to "US IPs".  Since we only CGNAT
> traffic to the external Internet, their on-net employees hit the
> firewall from their 100.64/10 IPs and get blocked.  I suggested they
> whitelist 100.64/10, saying we block traffic from 100.64/10 from
> entering our network via peering and transit, so they can be assured
> anything from 100.64/10 came from inside our network / our customers.
> They say the firewall won't let them whitelist 100.64.0.0/10, giving an
> error that it's invalid IP space.
>
> I know we're not the first to implement CGNAT, so I'm curious if others
> have run into these sorts of issues, or others we haven't run into yet,
> and if so, how you solved them.
>
>
> ----------------------------------------------------------------------
> Jon Lewis, MCP :)              |  I route
> Blue Stream Fiber, Sr. Neteng  |  therefore you are
> _________ http://www.lewis.org/~jlewis/pgp for PGP public key_________
>
>
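
The PBA numbers quoted above (256x4, and the drop from 64:1 to 28:1) can be related with some rough arithmetic. The sketch below assumes the ratio is sized for the worst case where every subscriber fills every block, and that ports below 1024 are not handed out; neither assumption comes from Junos documentation, and the subscriber count in the usage note is hypothetical.

```python
# Sketch: how port-block allocation (PBA) settings bound the sharing ratio,
# and how many public IPs a given subscriber count then needs.
import math

USABLE_PORTS = 65536 - 1024  # assumption: ports below 1024 are not handed out


def max_ratio(block_size: int, max_blocks: int) -> int:
    """Worst-case subscribers per public IP if everyone uses every block."""
    return USABLE_PORTS // (block_size * max_blocks)


def public_ips_needed(subscribers: int, ratio: int) -> int:
    """Public addresses required to host `subscribers` at `ratio`:1."""
    return math.ceil(subscribers / ratio)


if __name__ == "__main__":
    print("256x4 worst-case ratio:", max_ratio(256, 4))
    for ratio in (64, 28):
        print(f"10000 subscribers at {ratio}:1 ->",
              public_ips_needed(10_000, ratio), "public IPs")
```

With these assumptions, 256x4 works out to 63 subscribers per address, close to the 64:1 mentioned, and a hypothetical 10,000 subscribers would need 157 public IPs at 64:1 versus 358 at 28:1.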


-- 
Dave Täht CSO, LibreQos


* Re: [LibreQoS] Fwd: CGNAT growing pains
  2024-10-08 19:40   ` [LibreQoS] Fwd: CGNAT growing pains Dave Taht
@ 2024-10-08 20:10     ` dan
  0 siblings, 0 replies; 2+ messages in thread
From: dan @ 2024-10-08 20:10 UTC (permalink / raw)
  To: Dave Taht; +Cc: libreqos


A couple of things on CGNAT.  We never do less than 1000 ports per IP.  That
seems to be the floor below which general problems show up.  Dialing back
TCP timeouts to 5-10 minutes also helps; any shorter than that and people
report issues with some security cameras etc. because their keepalives are
longer.  Customers per IP is irrelevant because you run out of ports with
1000-2000 per IP before any other practical limits hit.  This was our
primary issue with CGNAT for a while: connections hanging on for 24 hours
and customers running out of ports.
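
As a back-of-the-envelope illustration of why the timeout matters so much, steady-state port usage can be estimated with Little's law: ports in use is roughly the rate of new flows times how long each mapping is held. The sketch below assumes the worst case, where every mapping lingers for the full timeout (connections that never close cleanly); the function names and sample numbers are illustrative only.

```python
# Sketch: steady-state NAT port usage via Little's law.
# ports in use ~= new flows per second * seconds each mapping is held.
# Assumption: every mapping is held for the full idle timeout (worst case).


def ports_in_use(new_flows_per_sec: float, mapping_timeout_sec: float) -> float:
    """Estimated number of ports held at steady state."""
    return new_flows_per_sec * mapping_timeout_sec


def max_flow_rate(port_budget: int, mapping_timeout_sec: float) -> float:
    """Highest sustained new-flow rate that fits inside the port budget."""
    return port_budget / mapping_timeout_sec


if __name__ == "__main__":
    budget = 1000
    for label, timeout in (("24 hours", 24 * 3600), ("10 minutes", 600)):
        rate = max_flow_rate(budget, timeout)
        print(f"{label}: {rate:.4f} lingering flows/sec exhausts {budget} ports")
```

Under that worst-case assumption, a 1000-port budget with a 24-hour timeout tolerates only about one lingering flow every 86 seconds, while a 10-minute timeout tolerates well over one per second.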

We do a hairpin NAT config on the head end so customers can talk to each
other on the public IP.

The primary issue we see is when some business that many subscribers have
in common, say a local hospital that has work-from-home people, blocks the
/24 because of failed login attempts.  That hits everyone in the CGNAT pool
and they can't RDP to the workplace.  RDP being stupid insecure, these
places have tried to shore it up with whitelists, but that's their flaw
that we have to deal with.

When we get new IPs they are often geocoded elsewhere which causes some
issues I have to chase down.  We keep CGNAT pools very localized because we
are multi-head-end and multi-homed, no one exits our network very far from
where they really are.

