[Cerowrt-devel] Recording RF management info _and_ associated traffic?
dave.taht at gmail.com
Mon Jan 26 19:36:06 EST 2015
Jesper now cc'd.
On Tue, Jan 27, 2015 at 1:12 PM, <dpreed at reed.com> wrote:
> Well, we all may want to agree to disagree. I don't buy the argument that
> hash tables are slow compared to the TCAMs - and even if cache misses
> happened, a hash table is still O(1) - you look at exactly one memory
> address on the average in a hash table - that's the point of it. The
> constant factor is the speed of memory - not terribly slow by any means.
> To get into this deeper would require actual measurements, of which I am a
> great fan. But your handwaves are pretty unquantitative, Dave, so at best
> they are similar to mine. I'm very measurement focused, being part hardware
> architecture guy.
Two of the people doing serious optimization and measurement of linux
network behavior are now cc'd. (tho they might want to read back on
the thread). Jesper, in particular, has been working on speeding up
10GigE in preparation for 100GigE and gave a great preso at lca:
http://lwn.net/Articles/629155/ (see slides, video)
Relative to that was:
And alexander has been working specifically on dramatically improving
routing cache lookups. He tells me:
"The amount of gain seen will vary based on the routing configuration
of each system. The biggest gain in all of this is that the
prefix-matching/backtrace portion of the look-up was reduced from
O(N^2) to O(N). So on my test systems that were configured with
rather large tries I saw a reduction from 380ns to 16ns for performing
a prefix-match/backtrace. What this means for most end users is that
anything that falls back to the default route on the system should
take significantly less time for look-up in the fib tables.
A hit at depth 7 in my trie costs about 31ns, though I think that
might be a cache warm hit versus a cache cold look-up. Though you
have to keep in mind if you are dealing with a routing table that is
the main trie, and not the local trie. That means to get a "hit" on
any route you are still going to have to have a failed look-up in the
local trie first, and the size of that trie depends on the number of
local addresses you have configured on the system.
My second set of patches should cut that by about 25% to 50% since I
am dropping a couple of unnecessary items from the look-up process and
compressing things so that the pointer to the next tnode and the key
info for that tnode should always be in the same cache-line.
The second set of patches [if they work out] should reduce the
cache utilization by up to half. Basically it consists of pushing the
key and key information up to the same cache-line that the pointer for
the tnode/leaf lives on. However I have to sort out some RCU ugliness
that this adds, since I have to RCU protect the key information."
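As a rough illustration of what a longest-prefix match with backtracking looks like, here is a toy Python sketch (this is NOT the kernel's fib_trie; all names are invented for illustration, and the real code uses a compressed multi-bit trie rather than this one-bit-per-level version):

```python
# Toy binary trie for IPv4 longest-prefix match. The lookup walks one
# bit at a time, remembering the deepest node that carried a route --
# the "backtrace" state the quoted text is optimizing.

class Node:
    def __init__(self):
        self.children = [None, None]  # next bit 0 / next bit 1
        self.route = None             # set if some prefix ends here

def insert(root, prefix, plen, route):
    """Install `route` for the top `plen` bits of the 32-bit `prefix`."""
    node = root
    for i in range(plen):
        bit = (prefix >> (31 - i)) & 1
        if node.children[bit] is None:
            node.children[bit] = Node()
        node = node.children[bit]
    node.route = route

def lookup(root, addr):
    """Walk the trie, keeping the longest (deepest) match seen so far."""
    node, best = root, root.route
    for i in range(32):
        node = node.children[(addr >> (31 - i)) & 1]
        if node is None:
            break                 # no deeper match; fall back to `best`
        if node.route is not None:
            best = node.route     # longer match found
    return best

# Example: a default route plus one /24.
root = Node()
insert(root, 0x00000000, 0, "default via gw0")       # 0.0.0.0/0
insert(root, 0x0A000100, 24, "10.0.1.0/24 dev eth1") # 10.0.1.0/24
```

A miss deep in the trie still has to fall back to the best shallower match, which is why the depth of the trie and the cost of that fallback dominate the numbers quoted above.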
My main complaint about the work so far is that no-one has been
measuring the total system costs (and latency) of the time it takes
from when a packet enters the system to the time it departs. I am
pretty sure that immense call path could be additionally optimized...
(last I recall it transited a minimum of 34 functions)
My secondary complaint is all the work is being tested on 64 bit
hardware with huge caches.
Somewhat relevant to that:
It has long been my hope that we would see per-packet timestamping
become the default at ingress (from the card or host application's
interaction with the stack), and then be checked at egress through
codel, rather than having the fq_codel queue measure itself.
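A heavily simplified sketch of that idea (hypothetical names; real codel also runs an interval-based control law and drop scheduling, all of which this omits):

```python
import time
from collections import deque

TARGET = 0.005  # codel's default 5 ms target sojourn time

queue = deque()

def enqueue(pkt):
    # stamp every packet at ingress
    queue.append((time.monotonic(), pkt))

def dequeue():
    # at egress, check the ingress stamp to get the sojourn time
    ts, pkt = queue.popleft()
    sojourn = time.monotonic() - ts
    # real codel would compare sojourn against TARGET over an interval
    # and arm its drop state; here we only report the measurement
    return pkt, sojourn, sojourn > TARGET
```

The point of stamping at ingress to the whole stack, rather than at the queue, is that the measurement then covers everything between the card and the qdisc, not just the qdisc itself.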
> David - my comment about HP doing layer 3 switching in TCAMs just was there
> to point out that there's nothing magic about layer 2. I was not suggesting
> that they don't use proprietary binary blobs, because they do. But so do
> the TCAM programs in layer 2 devices.
> Dave - you are conflating the implementation technique of the routing
> algorithm when you focus on "prefix matching" as being hard to do. It's not
> hard to invent a performant algorithm to do that combined with a hash table.
> A simple way to do that is to treat the address one is looking up as several
> addresses (the shorter prefixes of the address). Then look each one up
> separately by its hash. It's still O(1) if you do that, just with a larger
> constant factor. I assume you don't actually think it is optimal to do
> linear searches on the routing table like hosts sometimes do. Linear search
> is not necessary.
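For concreteness, the scheme described above might look like this toy sketch (Python dicts standing in for the hash table; all names invented). Each probe is O(1); the worst case is one probe per candidate prefix length:

```python
# Toy hash-based longest-prefix match: one hash table keyed by
# (prefix_length, masked_address). A lookup probes the table once per
# candidate prefix length, longest first, and stops at the first hit.

table = {}

def add_route(prefix, plen, route):
    mask = (0xFFFFFFFF << (32 - plen)) & 0xFFFFFFFF
    table[(plen, prefix & mask)] = route

def lookup(addr):
    for plen in range(32, -1, -1):  # /32 down to /0 (the default route)
        mask = (0xFFFFFFFF << (32 - plen)) & 0xFFFFFFFF
        route = table.get((plen, addr & mask))
        if route is not None:
            return route            # first hit is the longest match
    return None

add_route(0x00000000, 0, "default")
add_route(0x0A000100, 24, "10.0.1.0/24")
```

Whether those (up to 33, for IPv4) probes amount to "O(1) with a larger constant" or dominate the lookup is exactly the disagreement in this thread; binary search over prefix lengths can cut the probe count to about log2(32) = 5, at the cost of inserting extra marker entries.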
I am tracking alexander's fine work closely. See recent commits to the
> There is literally nothing magical about looking up 48-bit random Ethernet
> addresses in a LAN.
The difference between 48 bits and 128 bits is quite large.
> As far as NAT'ing is concerned - that is done by the gateways. It's
> possible in principle to create a distributed NAT face to an Enterprise - if
> you do so, then roaming within the enterprise just amounts to telling the
> NAT face about the new internal IP address that corresponds to the old one -
> an update of one address translation with another.
> This is how phones roam, by the way. They update their location via an HLR
> as they roam.
> On Sunday, January 25, 2015 10:45pm, "Dave Taht" <dave.taht at gmail.com> said:
>> On Sun, Jan 25, 2015 at 7:17 PM, <dpreed at reed.com> wrote:
>> > Looking up an address in a routing table is O(1) if the routing table is
>> > a
>> > hash table. That's much more efficient than a TCAM. My simple example
>> > just
>> > requires a delete/insert at each node's route lookup table.
>> Regrettably it is not O(1) once you take into account the cpu cache
>> or the potential collisions you will have once you shrink the hash to
>> something reasonable.
>> Also I think you are ignoring the problem of covering routes. Say I have
>> to get something to a.b.c.z/32. I do a lookup of that and find nothing. I
>> look for a.b.c.z/31 and find nothing, then /30, then /29, /28, until I get
>> a hit for the next hop. Now you can of course do a binary search for
>> subprefixes, but in any case, the search is not O(1).
>> In terms of cache efficient data structures, a straight hash is not the way
>> to go; of late I have been trying to wrap my head around the hat-trie as
>> possibly being useful in these circumstances.
>> Now, if you think about limiting the domain of the problem to something
>> greater than the typical mac table, but less than the whole internet,
>> it starts looking more reasonable to have a 1:1 ratio of destination
>> IPs to hash table entries for lookups, but updates have to probe/change
>> large segments of the table in order to deal with covering prefixes.
>> > My point was about collections of WLAN's bridged together. Look at what
>> > happens (at the packet/radio layer) when a new node joins a bridged set
>> > of
>> > WLANs using STP. It is not exactly simple to rebuild the Ethernet
>> > layer's
>> > bridge routing tables in a complex network. And the limit of 4096
>> > entries
>> > in many inexpensive switches is not a trivial limit.
>> Agreed. But see http://en.wikipedia.org/wiki/Virtual_Extensible_LAN
>> > Routers used to be memory-starved (a small number of KB of RAM was the
>> > norm). Perhaps the thinking then (back before 2000) has not been
>> > revised,
>> > even though the hardware is a lot more capacious.
>> The profit margins have not been revised.
>> I would not mind, incidentally, expanding the scope of the fqswitch project
>> to try to build something that would scale up at l3 farther than we've ever
>> gone before, however funding for needed gear like:
>> and time, and fpga expertise, is lacking. I am currently distracted by
>> a very cool new cpu architecture ( see
>> http://www.millcomputing.com/wiki/Memory )
>> and even as nifty as that is I foresee a need for a lot of dedicated
>> processing logic and memories to get into the 40GBit+ range.
>> > Remember, the Ethernet layer in WLANs is implemented by
>> > microcontrollers,
>> > typically not very capable ones, plus TCAMs which are pretty limited in
>> > their flexibility.
>> I do tend to think that the next era of SDN enabled hardware will
>> lead to more innovation in both the control and data plane - however it
>> seems we are still in a "me-too" phase
>> of development of openvswitch (btw: there is a new software switch for
>> linux called rocker we should look at, and make sure runs fq_codel), and
>> a long way from flexibly programmable switch hardware in general.
>> > While it is tempting to use the "pre-packaged, proprietary" Ethernet
>> > switch
>> > functionality, routing gets you out of the binary blobs, and let's you
>> > be a
>> > lot smarter and more scalable. Given that it does NOT cost more to do
>> > routing at the IP layer, building complex Ethernet bridging is not
>> > obviously
>> > a win.
>> SDN is certainly a way out of this mess. Eventually. But I fear we are
>> making all the same mistakes over again, and making slower hardware, where
>> in the end, it needs to be faster, to win.
>> > BTW, TCAMs are used in IP layer switching, too, and also are used in
>> > packet
>> > filtering. Maybe not in cheap consumer switches, but lots of Gigabit
>> > switches implement IP layer switching and filtering. At HP, their
>> > switches
>> > routinely did all their IP layer switching entirely in TCAMs.
>> Yep. I really wish big, fat TCAMS were standard equipment.
>> > On Sunday, January 25, 2015 9:58pm, "Dave Taht" <dave.taht at gmail.com> said:
>> >> On Sun, Jan 25, 2015 at 6:43 PM, David Lang <david at lang.hm> wrote:
>> >> > On Sun, 25 Jan 2015, Dave Taht wrote:
>> >> >
>> >> >> To your roaming point, yes this is certainly one place where
>> >> >> bridged vms across machines breaks down, and yet more and more
>> >> >> layers are doing it. I would certainly prefer routing in this
>> >> >
>> >> >
>> >> > What's the difference between "roaming" and moving a VM from one place
>> >> > in the network to another?
>> >> I think most people think of "roaming" as moving fairly rapidly from
>> >> one
>> >> piece of edge connectivity to another, while moving a vm is a far more
>> >> permanent operation.
>> >> > As far as layer 2 vs layer 3 goes. If you try to operate at layer 3, you
>> >> > are going to have quite a bit of smarts in the endpoint, even if it's
>> >> > connected via a single link. If you think about it, even if your
>> >> > routing tables list every machine in your environment individually, you
>> >> > still have a problem of what gateway the endpoint uses. It would have to
>> >> > change every time it moved. Since DHCP doesn't update frequently enough
>> >> > to be transparent, you would need to have each endpoint running a
>> >> > routing protocol.
>> >> Hmm? I don't ever use a dhcp-supplied default gateway, I depend on the
>> >> routing
>> >> protocol to supply that. In terms of each vm running a routing
>> >> protocol,
>> >> well, no, I would rely on the underlying bare metal OS to be doing
>> >> that, supplying
>> >> the FIB tables to the overlying vms, if they need it, but otherwise the
>> >> vms
>> >> just see a "default" route and don't bother with it. They do need to
>> >> inform the
>> >> bare metal OS (better term for this please? hypervisor?) of what IPs they
>> >> own.
>> >> static default gateways are evil. and easily disabled. in linux you
>> >> merely comment
>> >> out the "routers" in /etc/dhcp/dhclient.conf, in openwrt, set
>> >> "defaultroute 0" for the
>> >> interface fetching dhcp.
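Concretely, the two knobs just described look roughly like this (sketches only; exact file paths and option names vary by distribution and version):

```
# /etc/dhcp/dhclient.conf: leave "routers" out of the request list so
# dhclient never installs a DHCP-supplied default gateway
request subnet-mask, broadcast-address, time-offset,
        domain-name, domain-name-servers, host-name;

# OpenWrt /etc/config/network: keep the dhcp lease but refuse its
# default route
config interface 'wan'
        option proto 'dhcp'
        option defaultroute '0'
```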
>> >> When a box migrates, it tells the hypervisor its addresses, and then the
>> >> box propagates out the route change to elsewhere.
>> >> >
>> >> > This can work for individual hobbyists, but not when you need to support
>> >> > random devices (how would you configure an iPhone to support this?)
>> >> Carefully. :)
>> >> I do note that this stuff does (or at least did) work on some of the open
>> >> source variants of android. I would rather like it if android added ipv6
>> >> tethering soon, and made it possible to mesh together multiple phones.
>> >> >
>> >> >
>> >> > Letting the layer 2 equipment deal with the traffic within the building,
>> >> > and invoking layer 3 to go outside the building (or to a different
>> >> > domain), makes a lot of sense. Even if that means that layer 2 within
>> >> > the building looks very similar to what layer 3 used to look like
>> >> > around a city.
>> >> Be careful what you wish for.
>> >> >
>> >> >
>> >> > back to the topic of wifi, I'm not aware of any APs that participate in
>> >> > the switch protocols at this level. I also don't know of any reasonably
>> >> > priced
>> >> > switches that can do anything smarter than plain spanning tree when
>> >> > connected through multiple paths (I'd love to learn otherwise)
>> >> >
>> >> > David Lang
>> >> --
>> >> Dave Täht
>> >> http://www.bufferbloat.net/projects/bloat/wiki/Upcoming_Talks
>> Dave Täht