From: Eric Dumazet
To: David Miller
Cc: nanditad@google.com, netdev@vger.kernel.org, ycheng@google.com,
 codel@lists.bufferbloat.net, mattmathis@google.com, ncardwell@google.com
Subject: Re: [Codel] [RFC PATCH v2] tcp: TCP Small Queues
Date: Tue, 10 Jul 2012 19:06:27 +0200
Message-ID: <1341939987.3265.5741.camel@edumazet-glaptop>
In-Reply-To: <1341933215.3265.5476.camel@edumazet-glaptop>
References: <1340945457.29822.7.camel@edumazet-glaptop>
 <1341396687.2583.1757.camel@edumazet-glaptop>
 <20120709.000834.1182150057463599677.davem@davemloft.net>
 <1341845722.3265.3065.camel@edumazet-glaptop>
 <1341933215.3265.5476.camel@edumazet-glaptop>

On Tue, 2012-07-10 at 17:13 +0200, Eric Dumazet wrote:
> This introduces TSQ (TCP Small Queues)
>
> TSQ goal is to reduce number of TCP packets in xmit queues (qdisc &
> device queues), to reduce RTT and cwnd bias, part of the bufferbloat
> problem.
>
> sk->sk_wmem_alloc is not allowed to grow above a given limit,
> allowing no more than ~128KB [1] per tcp socket in qdisc/dev layers at a
> given time.
>
> TSO packets are sized/capped to half the limit, so that we have two
> TSO packets in flight, allowing better bandwidth use.
>
> As a side effect, setting the limit to 40000 automatically reduces the
> standard gso max limit (65536) to 40000/2 : It can help to reduce
> latencies of high prio packets, having smaller TSO packets.
>
> This means we divert sock_wfree() to a tcp_wfree() handler, to
> queue/send following frames when skb_orphan() [2] is called for the
> already queued skbs.
>
> Results on my dev machine (tg3 nic) are really impressive, using
> standard pfifo_fast, and with or without TSO/GSO. Without reduction of
> nominal bandwidth.
>
> I no longer have 3MBytes backlogged in qdisc by a single netperf
> session, and both side socket autotuning no longer use 4 Mbytes.
>
> As skb destructor cannot restart xmit itself ( as qdisc lock might be
> taken at this point ), we delegate the work to a tasklet. We use one
> tasklet per cpu for performance reasons.
>
>
>
> [1] New /proc/sys/net/ipv4/tcp_limit_output_bytes tunable
> [2] skb_orphan() is usually called at TX completion time,
>     but some drivers call it in their start_xmit() handler.
>     These drivers should at least use BQL, or else a single TCP
>     session can still fill the whole NIC TX ring, since TSQ will
>     have no effect.
>
> Not-Yet-Signed-off-by: Eric Dumazet
> ---

By the way, Rick Jones asked me :

"Is there also any change in service demand?"

I copy here my answer since it's a very good point:

I worked on the idea of a CoDel-like feedback, to have a timed limit
instead of a byte limit ("allow up to 1ms" delay in qdisc/dev queue).

But it seemed a bit complex: I would need to add skb fields to properly
track the residence time (sojourn time) of queued packets.

An alternative would be to have a per tcp socket tracking array, but it
might be expensive to search for a packet in it...

With multi queue devices or bad qdiscs, skbs can be orphaned out of
order, so the lookup can be relatively expensive.
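
To make the lookup cost a bit more concrete, here is a toy user-space
model in Python. It is only a sketch and not kernel code: the class
SojournTracker, its methods, the integer packet ids and the probe
counter are all hypothetical names chosen for illustration. It assumes
each queued packet is recorded with its enqueue time, and that the
orphan event has to find that record again by linear scan.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class SojournTracker:
    """Toy per-socket array of (packet id, enqueue time in us).

    Hypothetical model for illustration; nothing here is a kernel API.
    """
    entries: List[Tuple[int, int]] = field(default_factory=list)
    probes: int = 0  # slots inspected so far, a rough proxy for lookup cost

    def enqueue(self, pkt_id: int, now_us: int) -> None:
        # Record the time at which the packet entered the qdisc/dev queue.
        self.entries.append((pkt_id, now_us))

    def orphan(self, pkt_id: int, now_us: int) -> Optional[int]:
        """Return the sojourn time in us, or None if the id is unknown."""
        for i, (pid, t0) in enumerate(self.entries):
            self.probes += 1
            if pid == pkt_id:
                del self.entries[i]
                return now_us - t0
        return None


tracker = SojournTracker()
for pkt in range(4):
    tracker.enqueue(pkt, now_us=pkt * 100)

in_order = tracker.orphan(0, now_us=1000)      # found in the first slot
out_of_order = tracker.orphan(3, now_us=1500)  # has to scan past ids 1 and 2
```

When packets are orphaned in the order they were queued, each lookup
hits the first slot. With reordering the scan has to walk past older
entries, which is the expense mentioned above.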