[Cerowrt-devel] Recording RF management info _and_ associated traffic?

David Lang david at lang.hm
Sun Jan 25 04:39:32 EST 2015


On Sun, 25 Jan 2015, Dave Taht wrote:

> I want to make clear that I support dlang's design in the abstract... and
> am just arguing because it is a slow day.

I welcome challenges to the design; it's how I improve things :-)

> On Sat, Jan 24, 2015 at 10:44 PM, David Lang <david at lang.hm> wrote:
>> On Sat, 24 Jan 2015, Dave Taht wrote:
>>

to clarify, the chain of comments was

1. instead of bridging I should route

2. network manager would preserve the IPv4 address to prevent breaking 
established connections.

I was explaining why that can't work. If you are moving between different 
networks, each routed independently, they either need to have different address 
ranges (in which case the old IP just won't work), or they each need to NAT to 
get to the outside (in which case the IP may stay the same, but the connections 
will still break, since the new router won't have the NAT entries for the 
existing connections).
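
To make that concrete: the NAT state lives only in the connection-tracking table 
of the router that created it. A quick way to see those entries on a Linux 
router (conntrack-tools installed; all addresses below are made up):

    # list the tracked TCP connections and their NAT rewrites
    conntrack -L -p tcp

    # a typical entry looks like this: the inside 10.x source has been
    # rewritten to the router's external address. A different router has
    # no such entry, so the old connection's packets have nowhere to go.
    # tcp 6 431999 ESTABLISHED src=10.0.1.23 dst=93.184.216.34 sport=51512
    #   dport=443 src=93.184.216.34 dst=203.0.113.5 sport=443 dport=51512 [ASSURED]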

> Hmm? The first thing I ever do to a router is renumber it to a unique IP 
> address range, and rename the subnet in dns to something unique. The 3 sed 
> lines for this are on a cerowrt web page somewhere. Adding ipv6 statically is 
> a pita, but doable with care and a uci script, and mildly more doable as hnetd 
> matures.
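
(Not the actual lines from that page, but the idea is a sketch like this, run on 
the router; the old/new prefixes and the domain names here are made-up examples:)

    # hypothetical sketch only: swap the stock prefix for a unique one
    sed -i 's/172\.30\.42\./172.20.1./g' /etc/config/network
    sed -i 's/172\.30\.42\./172.20.1./g' /etc/config/dhcp
    # and rename the subnet in dns to something unique
    sed -i 's/home\.lan/ap1.lan/g' /etc/config/dhcp
    /etc/init.d/network restart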
>
> I run local dns services on each in the hope that at least some will be
> cached, and a local dhcp server to serve addresses out of that range. I
> turn off dhcp default route fetching on each routers external interface and
> use babel instead to find the right route(s) out of the system.
>
> On the NAT front, there is no nat on the internal routers, just a flat
> address space (172.20.0.0/14 in my case). I push all the nat to the main
> egress gateway(s), and in a case like yours would probably use multiple
> external IPs and dnat rather than masquerade the entire subnet on one to
> free up port space. You rapidly run out of ports in a NATted environment
> with that many users. I've had to turn down NAT timeouts for udp in
> particular to truly unreasonable levels otherwise (20 seconds in some cases)

Hmm, we haven't seen anything like this, but it could be a problem we just 
haven't noticed because we haven't been looking for it.
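
If we wanted to start looking, something along these lines on the firewall 
should show it (standard Linux sysctls; the 10.0 prefix stands in for whatever 
the client range is):

    # how full is the connection-tracking table?
    sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max

    # how many NAT entries do the busiest clients hold?
    conntrack -L 2>/dev/null | grep -o 'src=10\.0\.[0-9.]*' | \
        sort | uniq -c | sort -rn | head

    # the udp timeout Dave mentions having to turn down
    sysctl net.netfilter.nf_conntrack_udp_timeout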

> Doing this I can get a quick status on what is up with "ip route", and by
> monitoring the activity on each ip range, see if traffic is actually being
> passed, a failure of a given gateway fails over to another, and so on.
> There's a couple snmp hacks to do things like monitor active leases, and
> smokeping/mrtg to access other stats. There's a couple beagles that are on
> wifi that I ping on some APs. The beagles have not been very reliable for
> me, so they switch on and off with digiloggers gear when they fail a local
> ping. In fact the main logging beagle failed entirely the other month, sigh.
>
> I use the ad-hoc links on cerowrt as backups (if they lose ethernet
> connectivity) and extenders (if there is no ethernet connectivity), and (as
> I have 5 different comcast exit nodes spread throughout the network), use
> babel-pinger on each to see if they are up, and insert default routes into
> the mix that are automatically the shortest "distance" between the node and
> exit gateway. If one gw goes down, (usually) all the traffic ends up
> switching to the next nearest default gateway in 16 seconds or so,
> breaking all the nat associations for the net they were on (sigh),
> as well as ipv6 native stuff, but it's happened so often without me
> noticing it that it's nice not to worry.
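
(For anyone following along, babel-pinger is roughly a watchdog of this shape; a 
sketch only, with a made-up gateway address and timings:)

    #!/bin/sh
    # keep a default route present only while the upstream answers pings,
    # and let babeld redistribute it into the mesh
    GW=192.0.2.1    # made-up upstream gateway
    while sleep 5; do
        if ping -c 2 -W 2 "$GW" >/dev/null 2>&1; then
            ip route replace default via "$GW" metric 100
        else
            ip route del default via "$GW" metric 100 2>/dev/null
        fi
    done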
>
> (I have a mostly failed attempt in play for doing better with ipv6 and
> hnetd on a couple of exit nodes, but that isn't solid enough to deploy as
> yet, so it's only sort of working in the yurtlab. I really wish I could buy
> PI space for ipv6 somehow)
>
> (I have been fiddling with dns anycast to try to get more redundancy on the
> main dns gateways. That works pretty well)
>
> Now, your method is simpler! (although mine is mostly scripted) I imagine
> you bridge everything on a vlan, and use a central dhcp/dns server to serve
> up dhcp across (say) a 10.0.0.0/16 subnet. And by blocking local
> multicast/broadcast, in particular, this scales across the 3k user
> population. You've got a critical single point of failure in your gateway,
> but at least that's only one, and I imagine you have that duplicated.

I have two wifi vlans, one for 5GHz (ESSID SCALE) and one for 2.4GHz (ESSID 
SCALE-slow; there are no actual speed limits on it, but the name does a great 
job of encouraging everyone who can to use 5GHz :-) ). There is a central DHCP 
server and firewall that allocates addresses across a /17 for each of the two 
networks. We don't set up active failover, but we have a spare box that we can 
swap in if needed.
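
The server side is nothing exotic; with ISC dhcpd it's essentially two pool 
declarations like this (the addresses here are illustrative, not our real ones):

    # /etc/dhcp/dhcpd.conf (sketch; made-up addressing)
    subnet 10.128.0.0 netmask 255.255.128.0 {        # 5GHz vlan, a /17
        range 10.128.0.100 10.128.127.200;
        option routers 10.128.0.1;
        option domain-name-servers 10.128.0.2;
        default-lease-time 1800;    # short leases suit a conference crowd
    }
    subnet 10.128.128.0 netmask 255.255.128.0 {      # 2.4GHz vlan, its own /17
        range 10.128.128.100 10.128.255.200;
        option routers 10.128.128.1;
        option domain-name-servers 10.128.0.2;
    }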

The APs don't have any IP addresses on either wireless network. They have an IP 
on a different VLAN that's used for management only, which makes it a bit harder 
for any attackers to do anything to them.

Remember, we need to have it work for a few days at a shot.

> (In contrast my network is always broken somewhere, but unless two critical
> nodes break, it's pretty redundant and loss is confined to a single AP -
> my biggest problem is that I need to upgrade the firmware on about half the
> network - which involves climbing trees - and my plan was to deploy hnetd
> last year so I could roll out ipv6)
>
> How do you deal with a dead AP that is not actually connecting with traffic?

Nagios-type monitoring to detect that an AP isn't reachable on the wired 
network, and then we send a runner to find out what's happening. About three 
years ago we had a lot of problems with people unplugging the APs for some reason.
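
The Nagios side is just a ping check per AP against its management-vlan address, 
along these lines (host name and address invented):

    define host {
        use           generic-host
        host_name     ap-room101
        address       10.250.0.101     ; management-vlan address
        check_command check-host-alive
    }
    define service {
        use                 generic-service
        host_name           ap-room101
        service_description PING
        check_command       check_ping!100.0,20%!500.0,60%
    }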

>> For the normal user that we are trying to support at a conference, it's a
>> win.
>>
>> I'll note that we also block streaming sites (which has the side effect of 
>> blocking some useful sites that share the same IPs, Amazon for example) to 
>> help make things better for everyone else, even at the cost of limiting what 
>> some people are able to do. Bandwidth is limited compared to the number of 
>> people we have, and we have to make choices.
>
> Blocking ads is also effective.

We use DNS to block things like this (or actually, we redirect the DNS to point 
to a server that serves an image saying that they are being blocked by SCaLE), 
and then we block port 53 to the outside to force people to use our DNS servers. 
Somewhat heavy-handed, but it works.
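
Mechanically it's only two pieces. With dnsmasq on the resolvers and iptables on 
the firewall, the shape of it is (domain and addresses invented):

    # on the DNS servers: answer blocked names with the "you are blocked" box
    # (/etc/dnsmasq.conf)
    address=/streaming-example.com/10.128.0.5

    # on the firewall: refuse direct port-53 traffic to the outside so our
    # resolvers (10.128.0.2 here) are the only way to look things up
    iptables -A FORWARD -p udp --dport 53 ! -d 10.128.0.2 -j REJECT
    iptables -A FORWARD -p tcp --dport 53 ! -d 10.128.0.2 -j REJECT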

>>> Will you attempt to deploy ipv6?
>>
>>
>> We have been offering IPv6 routable addresses for a few years.
>
> How many do you get and from whom?

I don't remember at the moment.

>>> I am of course interested in how fq_codel performs on your ISP link, and
>>> are you planning on running it for your wifi?
>>
>>
>> I'm running OpenWRT on the APs but haven't done anything in particular to
>> activate it.
>
> fq_codel is on by default in Barrier Breaker and later, on all interfaces. I
> note that it doesn't scale anywhere near as well as we would like under
> contention, but that work is only beginning in Chaos Calmer. A thought I've
> had in an
> environment such as yours would be to rate limit each AP's ingress/egress
> ethernet interface to, say, 20mbits, thus pushing all the potential bloat
> to sqm on ethernet and out of the wifi (which would generally run faster).
> Might even force uploads from the users lower, too (say 10mbit). Or might
> not, and just rely on people retaining low expectations. :)
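
In OpenWRT terms that would just be an sqm-scripts stanza per AP on its ethernet 
uplink, something like this (eth0 as the uplink and the exact rates are 
assumptions):

    # /etc/config/sqm (sketch)
    config queue 'eth'
        option enabled '1'
        option interface 'eth0'
        option download '20000'    # kbit/s toward the AP/users
        option upload '10000'      # kbit/s from the users
        option qdisc 'fq_codel'
        option script 'simple.qos'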
>
> Was it on openwrt last year?

Yes; most of what I did on the wireless side is in the paper at 
https://www.usenix.org/conference/lisa12/technical-sessions/presentation/lang_david_wireless

The first year I did the network I had a total of one month to plan and buy APs, 
so I was running stock firmware. The second year I used DD-WRT and was very 
unhappy with it. I've been running OpenWRT ever since.

>> I'll check what we have on the firewall (a fairly up to date
>> Debian build)
>
> fq_codel has been a part of that for a long time.
>
> I'd port over the sqm-scripts and use those; it's only a 1-line change.
>
>> What's the best way to monitor the queues?
>
> On each router?
>
> I tend to use pdsh a lot, setting up a /etc/genders file for them all so I
> can do a
>
> pdsh -A 'tc qdisc show dev wlan0'   # or uptime, or 'wc -l /etc/dhcp.leases',
> or whatever
>
> Been meaning to get around to something that used snmp instead for a while.
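
(For anyone who hasn't used it: /etc/genders is just hostname-to-attribute 
lines, and pdsh -g selects on an attribute; the names here are invented:)

    # /etc/genders
    ap-room101   aps,5ghz
    ap-room102   aps,2ghz
    gw1          gateways

    # fan a command out to just the APs
    pdsh -g aps 'tc -s qdisc show dev wlan0'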

I'm gathering info on each AP about the number of users currently connected and 
the bandwidth used on all ports. I also have a central log from all APs which 
shows the MAC addresses as they associate with each AP.

So collecting the data in one place is the easy part. What I don't know is what 
I need to gather, from where, and with what commands. Any suggestions for this 
are very welcome.
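
(A plausible starting set, using standard Linux/OpenWRT tools, with wlan0/eth0 
as assumed interface names, for anyone who wants to improve on it:)

    # per-qdisc stats (backlog, drops, marks) on the wifi and the uplink
    tc -s qdisc show dev wlan0
    tc -s qdisc show dev eth0

    # per-station wifi stats: signal, tx/rx rates, retries, failures
    iw dev wlan0 station dump

    # a cheap association count
    iw dev wlan0 station dump | grep -c Station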

David Lang


