From: Toke Høiland-Jørgensen
To: cerowrt-devel@lists.bufferbloat.net
Date: Thu, 14 Mar 2019 22:25:03 +0100
Message-ID: <87ftrpxi6o.fsf@toke.dk>
Subject: [Cerowrt-devel] Fwd: net: Improve route scalability via support for nexthop objects

To: David Miller
Cc: netdev@vger.kernel.org, Roopa Prabhu, Ido Schimmel
From: David Ahern
Subject: net: Improve route scalability via support for nexthop objects
Date: Thu, 14 Mar 2019 14:19:53 -0600

TL;DR:

The nexthop changes are finally ready for consideration for inclusion
into the networking stack. The patch count currently stands at 86
including tests. The majority of those are refactoring the existing code
base with the last 16 implementing the nexthop feature and selftests.

The first 27 patches move the IPv4 and IPv6 code to work with a
fib_nh_common, a new struct which contains the common elements of fib_nh
and fib6_nh, and then refactor the existing fib_dump_info to work for
both protocols. With that in place, the next 15 patches do more changes
to IPv4 to enable IPv6 gateways with IPv4 routes (a.k.a. RFC 5549) using
the RTA_VIA attribute.

From there the next 24 patches refactor IPv6, introducing a fib6_result
similar to IPv4's fib_result which allows a fib6_nh that is not within a
fib6_info and adding hooks to the ipv6 stubs (bump sernum, send route
notifications and delete routes based on nexthop updates). This is
followed by a few IPv4 exports and then the last 16 patches add the
nexthop feature.

I plan to start sending patches next week once net-next opens. Since it
will take a while to get all of them in, I wanted to make sure the end
goal is known and understood.

For anyone interested in seeing the patches ahead of time, they are here:
    https://github.com/dsahern/linux nexthops-v5.1-next-v2
(order of the patches may change)

==

Long version:

As mentioned at netconf in Seoul, we would like to introduce nexthops as
independent objects from the routes to better align with both routing
daemons and hardware and to improve route insertion times into the
kernel.

This series adds nexthop objects with their own lifecycle. The model
retains a lot of the established semantics from routes. One difference
with nexthop objects is that the behavior better aligns with the target
user - routing daemons and switch ASICs. Specifically, with the exception
of the blackhole nexthop, all nexthops must reference a netdevice and the
device must be admin up with carrier. If a device goes down (admin or
carrier) the nexthop is evicted along with all routes referencing it.

Work flow wise, nexthops are created first:

    { nexthop } --> { gateway, device }

And then prefixes are installed pointing to the nexthop by id:

    { prefix } ----> { nexthop }

with the resulting route looking very similar to the existing code:

    { prefix } ----> { nexthop } --> { gateway, device }

A nexthop can be a group which references other nexthops:

                   /---> { nexthop, weight }
    { nexthop }    ...
                   \---> { nexthop, weight }

Prefixes referencing the group nexthop are then multipath routes:

                                    /---> { nexthop, weight } --> { gw, dev }
    { prefix } ----> { nexthop }    ...
                                    \---> { nexthop, weight } --> { gw, dev }

Nexthop data (gw, dev or entries in a group) can be updated atomically,
allowing for the efficient update of all prefixes in one replace command.

Notifications
=============
1. A new rtnl group is defined, RTNLGRP_NEXTHOP. Since its group id is > 31,
   applications need to use the setsockopt option to add the nexthop
   group to the listeners:

       unsigned int group = RTNLGRP_NEXTHOP;
       setsockopt(fd, SOL_NETLINK, NETLINK_ADD_MEMBERSHIP,
                  &group, sizeof(group));

2. Nexthop notifications are generated for the usual add, delete, replace
   lifecycle.

3. Notifications for route add, replace, and delete are identical to the
   legacy ones with one new attribute, RTA_NHID, if the route references
   a standalone nexthop object. This applies to route changes and to any
   add, delete, replace of a nexthop object used by FIB entries.

   This model retains backwards compatibility such that the existing
   ecosystem of software that does not natively understand nexthop
   objects is not impacted by the use of nexthop objects. The expectation
   is that unknown attributes (the new RTA_NHID) are ignored by legacy
   apps.

4. Nexthop notifications are NOT generated when a nexthop is removed due
   to a device event (e.g., admin or carrier down). Userspace is expected
   to monitor link events and remove nexthops and routes associated with
   the device. By extension, notifications are NOT generated for routes
   evicted because of the removal of a nexthop when it is removed by a
   device event.

Key Features
============
1. Atomic replace of the configuration data for any nexthop object - a
   standalone nexthop or a group. This allows existing route entries to
   have their nexthop config updated without the overhead of removing and
   re-inserting (or replacing) the routes individually. Instead, one
   update of the nexthop object implicitly updates all routes referencing
   it.

   One limitation with the atomic replace is that a nexthop group can
   only be replaced with a new group spec, and similarly a single nexthop
   can only be replaced by a single nexthop spec. Specifically, a nexthop
   id cannot move between a single nexthop and a group nexthop except by
   delete and add.

2. Blackhole nexthop: a nexthop object can be designated a blackhole
   which means any lookups that resolve to it fail with the result
   RTN_BLACKHOLE. Blackhole nexthops can be used with nexthop groups but
   only as the sole nexthop.
   Combined with atomic replace this allows routes to be installed
   pointing to a blackhole nexthop or group and then switched to an
   actual gateway or multipath nexthop with a single replace command (or
   vice versa, a gateway/device nexthop can be flipped to a blackhole).

3. Nexthop groups for multipath routes. A nexthop group is a nexthop that
   references other nexthops with a weight for weighted multipath. A
   multipath group cannot be used as a nexthop in another nexthop group
   (i.e., groups cannot be nested).

4. Multipath routes for IPv6 with device only nexthops. There is a
   demonstrated need for this feature and the existing route semantics do
   not allow it because of mistakes with the past implementation of
   multipath routes. This series provides a means for that end - create a
   nexthop that has a device only specification.

5. IPv6 nexthops with IPv4 routes for users wanting support of RFC 5549.
   This feature is enabled natively (without nexthop objects) as a result
   of the heavy refactoring of fib_nh and fib6_nh into a common nexthop.

6. Dramatic reduction in time to install routes in the kernel, most
   notably with increasing number of legs in a multipath route. Formal
   data will be presented with the nexthop commits.

7. Lower memory footprint for IPv6

   While individual data structures show a minor increase in size

                        old    new
                        ===    ===
       fib_nh_common      -     72

       IPv4
       fib_nh           104    120
       fib_info         104    128*
       rtable           160    176

       IPv6
       fib6_nh           48     96
       fib6_info        224    160*
       rt6_info         224    224

       [*] with 0 nexthops; for 1 fib{6}_nh add the respective cost

   the *effective* allocation sizes for fib{6}_info have not changed. The
   data structure increase is due to a combination of factors: nexthop
   reference and list_head tracking of fib entries along with IPv6
   address in IPv4 structures.

   While IPv4 has consolidated similar nexthop data into single fib_info
   instances that are referenced by multiple fib entries, IPv6 does not.
   For IPv6 the nexthop data is repeated for each route.
   This means with nexthop objects the memory overhead of IPv6 fib
   entries drops significantly - especially with multipath routes.

8. Future extensions

   I believe the code is set up to allow a future extension where apps
   can pass a flag that effectively says "I understand nexthop objects -
   don't expand them in the route dump" and for a sysctl knob for
   notifiers to do the same once all apps running in the control plane
   are known to understand nexthop objects. Together this means less data
   going from kernel to userspace and less processing by userspace.

Examples
========
1. Single path
    $ ip nexthop add id 1 via 10.99.1.2 dev veth1
    $ ip route add 10.1.1.0/24 nhid 1

    $ ip next ls
    id 1 via 10.99.1.2 src 10.99.1.1 dev veth1 scope link

    $ ip ro ls
    10.1.1.0/24 nhid 1 scope link
    ...

2. ECMP
    $ ip nexthop add id 2 via 10.99.3.2 dev veth3
    $ ip nexthop add id 1001 group 1/2
      --> creates a nexthop group with 2 component nexthops:
          id 1 and id 2 both the same weight

    $ ip route add 10.1.2.0/24 nhid 1001

    $ ip next ls
    id 1 via 10.99.1.2 src 10.99.1.1 dev veth1 scope link
    id 2 via 10.99.3.2 src 10.99.3.1 dev veth3 scope link
    id 1001 group 1/2

    $ ip ro ls
    10.1.1.0/24 nhid 1 scope link
    10.1.2.0/24 nhid 1001 scope link
    ...

3. Weighted multipath
    $ ip nexthop add id 1002 group 1,10/2,20
      --> creates a nexthop group with 2 component nexthops:
          id 1 with a weight of 10 and id 2 with a weight of 20

    $ ip route add 10.1.3.0/24 nhid 1002

    $ ip next ls
    id 1 via 10.99.1.2 src 10.99.1.1 dev veth1 scope link
    id 2 via 10.99.3.2 src 10.99.3.1 dev veth3 scope link
    id 1001 group 1/2
    id 1002 group 1,10/2,20

    $ ip ro ls
    10.1.1.0/24 nhid 1 scope link
    10.1.2.0/24 nhid 1001 scope link
    10.1.3.0/24 nhid 1002 scope link
    ...

Acknowledgements
- test writers - especially Stefano Brivio. The pmtu script has been
  invaluable in verifying changes to the exception code (and it exposed
  a few other bugs).
- kbuild robot for catching compile errors through the maze of config
  options and Dan Carpenter's tools for catching a NULL vs IS_ERR check

v2
- a few changes to the uapi - most notably, the requirement to have a v4
  or v6 version of all single nexthops. A group is done as AF_UNSPEC, but
  individual nexthops must be AF_INET or AF_INET6. This is forced by the
  cached (per cpu) routes and exceptions, and the expectation in too many
  places that a fib{6}_nh exists.

- backwards compatibility - route notifications changed to include
  nexthop data such that legacy apps are not impacted. This drove much of
  the refactoring towards a fib_nh_common but that also enabled the IPv6
  gateways with IPv4 routes with trivial additional changes.
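
For readers who want to poke at the object model from the long version
(routes pointing at a nexthop id, groups of weighted nexthops, atomic
replace), here is a toy userspace sketch in Python. It is not kernel code
and not the proposed uapi: the names Fib, Nexthop, NexthopGroup,
nexthop_add, nexthop_replace, route_add and resolve, and the replacement
gateway 10.99.1.3, are all made up for illustration. It only models the
rules stated above (no nested groups, blackhole only as the sole group
member, no moving an id between single and group) and ignores device
state and notifications.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple, Union


@dataclass(frozen=True)
class Nexthop:
    """Hypothetical single nexthop: gateway/device pair or a blackhole."""
    gateway: Optional[str] = None
    dev: Optional[str] = None
    blackhole: bool = False


@dataclass(frozen=True)
class NexthopGroup:
    """Hypothetical group nexthop: (member nexthop id, weight) pairs."""
    members: Tuple[Tuple[int, int], ...]


NexthopObject = Union[Nexthop, NexthopGroup]
Leg = Tuple[Optional[str], Optional[str], int]  # (gateway, dev, weight)


class Fib:
    """Toy table: prefixes reference nexthop ids, never gateway/device."""

    def __init__(self) -> None:
        self.nexthops: Dict[int, NexthopObject] = {}
        self.routes: Dict[str, int] = {}

    def _check_group(self, nh: NexthopObject) -> None:
        if not isinstance(nh, NexthopGroup):
            return
        members = [self.nexthops.get(nhid) for nhid, _weight in nh.members]
        # Members must exist and be single nexthops (groups do not nest).
        if not members or not all(isinstance(m, Nexthop) for m in members):
            raise ValueError("group members must be existing single nexthops")
        if len(members) > 1 and any(m.blackhole for m in members):
            raise ValueError("a blackhole must be the sole member of a group")

    def nexthop_add(self, nhid: int, nh: NexthopObject) -> None:
        if nhid in self.nexthops:
            raise ValueError(f"nexthop id {nhid} already exists")
        self._check_group(nh)
        self.nexthops[nhid] = nh

    def nexthop_replace(self, nhid: int, nh: NexthopObject) -> None:
        old = self.nexthops[nhid]
        # An id cannot move between a single nexthop and a group.
        if isinstance(old, NexthopGroup) != isinstance(nh, NexthopGroup):
            raise ValueError("cannot replace single with group or vice versa")
        self._check_group(nh)
        self.nexthops[nhid] = nh  # every route using nhid sees this at once

    def route_add(self, prefix: str, nhid: int) -> None:
        if nhid not in self.nexthops:
            raise KeyError(nhid)
        self.routes[prefix] = nhid

    def resolve(self, prefix: str) -> List[Leg]:
        """Expand a route into its legs; an empty list means blackhole."""
        nh = self.nexthops[self.routes[prefix]]
        if isinstance(nh, NexthopGroup):
            singles = [(self.nexthops[nhid], w) for nhid, w in nh.members]
        else:
            singles = [(nh, 1)]
        return [(s.gateway, s.dev, w) for s, w in singles if not s.blackhole]


def demo() -> Fib:
    fib = Fib()
    fib.nexthop_add(1, Nexthop("10.99.1.2", "veth1"))
    fib.nexthop_add(2, Nexthop("10.99.3.2", "veth3"))
    fib.nexthop_add(1002, NexthopGroup(((1, 10), (2, 20))))
    fib.route_add("10.1.1.0/24", 1)
    fib.route_add("10.1.3.0/24", 1002)
    # One replace of nexthop 1 (hypothetical new gateway) updates both routes.
    fib.nexthop_replace(1, Nexthop("10.99.1.3", "veth1"))
    return fib


if __name__ == "__main__":
    table = demo()
    for prefix in table.routes:
        print(prefix, table.resolve(prefix))
```

In this sketch the single nexthop_replace() call changes what both
10.1.1.0/24 and the weighted multipath route 10.1.3.0/24 resolve to,
without touching either route entry, which is the point of key feature 1.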