Date: Tue, 27 Jan 2015 13:36:06 +1300
From: Dave Taht
To: David Reed, Jesper Dangaard Brouer
Cc: cerowrt-devel@lists.bufferbloat.net
Subject: Re: [Cerowrt-devel] Recording RF management info _and_ associated traffic?

Jesper now cc'd.

On Tue, Jan 27, 2015 at 1:12 PM, David Reed wrote:
> Well, we all may want to agree to disagree. I don't buy the argument
> that hash tables are slow compared to the TCAMs - and even if cache
> misses happened, a hash table is still O(1) - you look at exactly one
> memory address on average in a hash table - that's the point of it. The
> constant factor is the speed of memory - not terribly slow by any means.
>
> To get into this deeper would require actual measurements, of which I
> am a great fan. But your handwaves are pretty unquantitative, Dave, so
> at best they are similar to mine. I'm very measurement-focused, being
> in part a hardware architecture guy.

Two of the people doing serious optimization and measurement of linux
network behavior are now cc'd (though they might want to read back
through the thread).

Jesper, in particular, has been working on speeding up 10GigE in
preparation for 100GigE, and gave a great presentation at LCA:

http://lwn.net/Articles/629155/ (see slides, video)

Related to that was: http://lwn.net/Articles/629152/

And Alexander has been working specifically on dramatically improving
route lookups.
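(For a sense of the time budgets Jesper is fighting - my back-of-the-
envelope arithmetic, not a quote from his slides: a minimum-size
Ethernet frame occupies 84 bytes on the wire once you count preamble
and inter-frame gap, so line rate at 10GigE is

  10^10 bits/sec / (84 bytes * 8 bits) = ~14.88 million packets/sec
  1 / 14.88e6                          = ~67.2 ns per packet

and at 100GigE that budget shrinks to ~6.7 ns per packet, less than a
single cache miss.)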
Alexander tells me: "The amount of gain seen will vary based on the
routing configuration of each system. The biggest gain in all of this is
that the prefix-matching/backtrace portion of the look-up was reduced
from O(N^2) to O(N). So on my test systems that were configured with
rather large tries I saw a reduction from 380ns to 16ns for performing a
prefix-match/backtrace. What this means for most end users is that
anything that falls back to the default route on the system should take
significantly less time for look-up in the fib tables.

A hit at depth 7 in my trie costs about 31ns, though I think that might
be a cache-warm hit versus a cache-cold look-up. Though keep in mind
that if you are dealing with a routing table, that is the main trie, not
the local trie. That means that to get a "hit" on any route you are
still going to have to have a failed look-up in the local trie first,
and the size of that trie depends on the number of local addresses you
have configured on the system.

My second set of patches [if they work out] should cut that by about 25%
to 50%, since I am dropping a couple of unnecessary items from the
look-up process and compressing things so that the pointer to the next
tnode and the key info for that tnode should always be in the same cache
line; that should reduce the cache utilization by up to half. Basically
it consists of pushing the key and key information up to the same cache
line that the pointer for the tnode/leaf lives on. However I have to
sort out some RCU ugliness that this adds, since I have to RCU-protect
the key information."

My main complaint about the work so far is that no one has been
measuring the total system costs (and latency) of the time it takes from
when a packet enters the system to the time it departs. I am pretty sure
that immense call path could be additionally optimized... (last I recall
it transited a minimum of 34 functions)

My secondary complaint is that all the work is being tested on 64-bit
hardware with huge caches.

Somewhat relevant to that: it has long been my hope that we would see
per-packet timestamping become the default at ingress (from the card or
the host application's interaction with the stack), and then checked at
egress through codel, rather than having the fq_codel queue merely
measure itself.

> David - my comment about HP doing layer 3 switching in TCAMs was just
> there to point out that there's nothing magic about layer 2. I was not
> suggesting that they don't use proprietary binary blobs, because they
> do. But so do the TCAM programs in layer 2 devices.
>
> Dave - you are conflating the implementation technique of the routing
> algorithm when you focus on "prefix matching" as being hard to do. It's
> not hard to invent a performant algorithm to do that combined with a
> hash table. A simple way to do that is to treat the address one is
> looking up as several addresses (the shorter prefixes of the address).
> Then look each one up separately by its hash. It's still O(1) if you do
> that, just with a larger constant factor. I assume you don't actually
> think it is optimal to do linear searches on the routing table like
> hosts sometimes do. Linear search is not necessary.

I am tracking Alexander's fine work closely. See recent commits to the
net-next tree.

> There is literally nothing magical about looking up 48-bit random
> Ethernet addresses in a LAN.

The difference between 48 bits and 128 bits is quite large.
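To put some code behind that, here is a minimal sketch of the
per-prefix-length hash scheme David describes above, probed from longest
prefix to shortest. The names (struct route, rt_hash_lookup) are mine,
not kernel code:

  #include <stdint.h>

  struct route;                    /* next-hop details omitted */
  struct route *rt_hash_lookup(int plen, uint32_t key);  /* hypothetical */

  struct route *lpm_lookup(uint32_t dst)
  {
      /* One hash table per prefix length, probed /32 down to /0. */
      for (int plen = 32; plen >= 0; plen--) {
          uint32_t mask = plen ? ~0u << (32 - plen) : 0;
          struct route *rt = rt_hash_lookup(plen, dst & mask);
          if (rt)
              return rt;           /* longest matching prefix wins */
      }
      return NULL;                 /* not even a default route */
  }

Each probe is "one" hash access, but the worst case is 33 probes - and
33 potential cache misses - so the lookup is O(W) in the address width,
not O(1). Binary search on the prefix lengths (Waldvogel et al.) cuts
that to O(log W) probes, but it is still not one memory access, and it
gets worse, not better, for 128-bit addresses.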
> As far as NAT'ing is concerned - that is done by the gateways. It's
> possible in principle to create a distributed NAT face to an
> Enterprise - if you do so, then roaming within the enterprise just
> amounts to telling the NAT face about the new internal IP address that
> corresponds to the old one - an update of one address translation with
> another.
>
> This is how phones roam, by the way. They update their location via an
> HLR as they roam.
>
> On Sunday, January 25, 2015 10:45pm, "Dave Taht" said:
>
>> On Sun, Jan 25, 2015 at 7:17 PM, David Reed wrote:
>> > Looking up an address in a routing table is O(1) if the routing
>> > table is a hash table. That's much more efficient than a TCAM. My
>> > simple example just requires a delete/insert at each node's route
>> > lookup table.
>>
>> Regrettably it is not O(1) once you take into account the cpu cache
>> hierarchy, or the potential collisions you will have once you shrink
>> the hash to something reasonable.
>>
>> Also I think you are ignoring the problem of covering routes. Say I
>> have to get something to a.b.c.z/32. I do a lookup of that and find
>> nothing. I then look to find a.b.c.z/31 and find nothing, then /30,
>> then /29, /28, until I find a hit for the next hop. Now you can of
>> course do a binary search for likely subprefixes, but in any case,
>> the search is not O(1).
>>
>> In terms of cache-efficient data structures, a straight hash is not
>> the way to go; of late I have been trying to wrap my head around the
>> hat-trie as possibly being useful in these circumstances.
>>
>> Now, if you think about limiting the domain of the problem to
>> something greater than the typical mac table, but less than the whole
>> internet, it starts looking more reasonable to have a 1:1 ratio of
>> destination IPs to hash table entries for lookups, but updates have
>> to probe/change large segments of the table in order to deal with
>> covering prefixes.
>>
>> > My point was about collections of WLANs bridged together. Look at
>> > what happens (at the packet/radio layer) when a new node joins a
>> > bridged set of WLANs using STP. It is not exactly simple to rebuild
>> > the Ethernet layer's bridge routing tables in a complex network.
>> > And the limit of 4096 entries in many inexpensive switches is not a
>> > trivial limit.
>>
>> Agreed. But see http://en.wikipedia.org/wiki/Virtual_Extensible_LAN
>>
>> > Routers used to be memory-starved (a small number of KB of RAM was
>> > the norm). Perhaps the thinking then (back before 2000) has not
>> > been revised, even though the hardware is a lot more capacious.
>>
>> The profit margins have not been revised.
>>
>> I would not mind, incidentally, expanding the scope of the fqswitch
>> project to try to build something that would scale up at L3 farther
>> than we've ever seen before; however, funding for needed gear like
>>
>> http://www.eetimes.com/document.asp?doc_id=1321334
>>
>> and time, and FPGA expertise, is lacking. I am currently distracted
>> by evaluating a very cool new cpu architecture (see
>> http://www.millcomputing.com/wiki/Memory ), and even as nifty as that
>> is, I foresee a need for a lot of dedicated packet processing logic
>> and memories to get into the 40GBit+ range.
>>
>> > Remember, the Ethernet layer in WLANs is implemented by
>> > microcontrollers, typically not very capable ones, plus TCAMs,
>> > which are pretty limited in their flexibility.
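(An aside, to ground the TCAM references in this thread: a TCAM entry
stores a value/mask pair, and the hardware compares a key against every
entry in parallel, taking the highest-priority hit - which is why
longest-prefix match there costs one lookup regardless of table depth.
A rough software model, all names mine:

  #include <stdint.h>
  #include <stddef.h>

  struct tcam_entry {
      uint32_t value;   /* bits that must match */
      uint32_t mask;    /* 1 = care, 0 = don't care */
  };

  /* In silicon this "loop" runs over all entries at once, in one
     cycle; entries are ordered so longer prefixes take priority. */
  int tcam_match(const struct tcam_entry *t, size_t n, uint32_t key)
  {
      for (size_t i = 0; i < n; i++)
          if ((key & t[i].mask) == t[i].value)
              return (int)i;       /* index of the winning entry */
      return -1;                   /* miss */
  }

The catch is power and die area, which is why big TCAMs stay expensive.)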
>> I do tend to think that the next era of SDN-enabled hardware will
>> eventually lead to more innovation in both the control and data
>> plane - however it seems we are still in a "me-too" phase of
>> development of openvswitch (btw: there is a new software switch for
>> linux called rocker we should look at, and make sure it runs
>> fq_codel), and a long way from flexibly programmable switch hardware
>> in general.
>>
>> http://openvswitch.org/pipermail/dev/2014-September/045084.html
>>
>> > While it is tempting to use the "pre-packaged, proprietary" Ethernet
>> > switch functionality, routing gets you out of the binary blobs, and
>> > lets you be a lot smarter and more scalable. Given that it does NOT
>> > cost more to do routing at the IP layer, building complex Ethernet
>> > bridging is not obviously a win.
>>
>> SDN is certainly a way out of this mess. Eventually. But I fear we
>> are making all the same mistakes over again, and making slower
>> hardware, where in the end, it needs to be faster, to win.
>>
>> > BTW, TCAMs are used in IP layer switching, too, and also are used
>> > in packet filtering. Maybe not in cheap consumer switches, but lots
>> > of Gigabit switches implement IP layer switching and filtering. At
>> > HP, their switches routinely did all their IP layer switching
>> > entirely in TCAMs.
>>
>> Yep. I really wish big, fat TCAMs were standard equipment.
>>
>> > On Sunday, January 25, 2015 9:58pm, "Dave Taht" said:
>> >
>> >> On Sun, Jan 25, 2015 at 6:43 PM, David Lang wrote:
>> >> > On Sun, 25 Jan 2015, Dave Taht wrote:
>> >> >
>> >> >> To your roaming point, yes this is certainly one place where
>> >> >> migrating bridged vms across machines breaks down, and yet more
>> >> >> and more vm layers are doing it. I would certainly prefer
>> >> >> routing in this case.
>> >> >
>> >> > What's the difference between "roaming" and moving a VM from one
>> >> > place in the network to another?
>> >>
>> >> I think most people think of "roaming" as moving fairly rapidly
>> >> from one piece of edge connectivity to another, while moving a vm
>> >> is a great deal more permanent an operation.
>> >>
>> >> > As far as layer 2 vs layer 3 goes: if you try to operate at
>> >> > layer 3, you are going to have quite a bit of smarts in the
>> >> > endpoint, even if it's only connected via a single link. If you
>> >> > think about it, even if your network routing tables list every
>> >> > machine in our environment individually, you still have the
>> >> > problem of what gateway the endpoint uses. It would have to
>> >> > change every time it moved. Since DHCP doesn't update frequently
>> >> > enough to be transparent, you would need to have each endpoint
>> >> > running a routing protocol.
>> >>
>> >> Hmm? I don't ever use a dhcp-supplied default gateway; I depend on
>> >> the routing protocol to supply that. In terms of each vm running a
>> >> routing protocol, well, no, I would rely on the underlying bare
>> >> metal OS to be doing that, supplying the FIB tables to the
>> >> overlying vms if they need it, but otherwise the vms just see a
>> >> "default" route and don't bother with it. They do need to inform
>> >> the bare metal OS (better term for this please? hypervisor?) of
>> >> what IPs they own.
>> >>
>> >> static default gateways are evil, and easily disabled: in linux
>> >> you merely comment out the "routers" option in
>> >> /etc/dhcp/dhclient.conf; in openwrt, set "defaultroute 0" for the
>> >> interface fetching dhcp.
>> >> When a box migrates, it tells the hypervisor its addresses, and
>> >> then that box propagates out the route change to elsewhere.
>> >>
>> >> > This can work for individual hobbyists, but not when you need to
>> >> > support random devices (how would you configure an iPhone to
>> >> > support this?)
>> >>
>> >> Carefully. :)
>> >>
>> >> I do note that this stuff does (or at least did) work on some of
>> >> the open source variants of android. I would rather like it if
>> >> android added ipv6 tethering soon, and made it possible to mesh
>> >> together multiple phones.
>> >>
>> >> > Letting the layer 2 equipment deal with the traffic within the
>> >> > building and invoking layer 3 to go outside the building (or to
>> >> > a different security domain) makes a lot of sense. Even if that
>> >> > means that layer 2 within a building looks very similar to what
>> >> > layer 3 used to look like around a city.
>> >>
>> >> Be careful what you wish for.
>> >>
>> >> > back to the topic of wifi, I'm not aware of any APs that
>> >> > participate in the switch protocols at this level. I also don't
>> >> > know of any reasonably priced switches that can do anything
>> >> > smarter than plain spanning tree when connected through multiple
>> >> > paths (I'd love to learn otherwise)
>> >> >
>> >> > David Lang

--
Dave Täht

http://www.bufferbloat.net/projects/bloat/wiki/Upcoming_Talks