From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mx1.redhat.com (mx1.redhat.com [209.132.183.28]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by lists.bufferbloat.net (Postfix) with ESMTPS id 4BBEE3B2A0 for ; Mon, 21 Nov 2016 11:49:16 -0500 (EST) Received: from int-mx09.intmail.prod.int.phx2.redhat.com (int-mx09.intmail.prod.int.phx2.redhat.com [10.5.11.22]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mx1.redhat.com (Postfix) with ESMTPS id 8456F81226; Mon, 21 Nov 2016 16:49:15 +0000 (UTC) Received: from localhost (ovpn-200-33.brq.redhat.com [10.40.200.33]) by int-mx09.intmail.prod.int.phx2.redhat.com (8.14.4/8.14.4) with ESMTP id uALGnCVk025584; Mon, 21 Nov 2016 11:49:13 -0500 Date: Mon, 21 Nov 2016 17:49:12 +0100 From: Jesper Dangaard Brouer To: Dave Taht Cc: make-wifi-fast@lists.bufferbloat.net, brouer@redhat.com Message-ID: <20161121174912.4c041949@redhat.com> In-Reply-To: References: <20161118085524.62d6cfdd@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-Scanned-By: MIMEDefang 2.68 on 10.5.11.22 X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.25]); Mon, 21 Nov 2016 16:49:15 +0000 (UTC) Subject: Re: [Make-wifi-fast] a bit of profiling on the archer X-BeenThere: make-wifi-fast@lists.bufferbloat.net X-Mailman-Version: 2.1.20 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 21 Nov 2016 16:49:16 -0000 On Sat, 19 Nov 2016 09:33:51 -0800 Dave Taht wrote: > On Thu, Nov 17, 2016 at 11:55 PM, Jesper Dangaard Brouer > wrote: > > > > On Thu, 17 Nov 2016 20:14:49 -0800 Dave Taht wrote: > > > >> I have not been profiling much on lower end platforms (it's hard, you > >> can crash a box pretty easily with the wrong options or sample rates). > > > > I'm happy to hear that perf does work on this lower end HW, although > > the disclaimer of sample rates. > > > > Does anyone know if hardware based PMU (Performance Monitor Units) > > exists for these kind of devices? > > > >> While watching the ath10k peak at 150-200mbits, at 99% of cpu in > >> softirq, I spent a bit of time profiling various counters and > >> statistics. > >> > >> for this one (while downloading 12 flows at the same time via flent) > >> > >> perf record -F 99 -e cpu-clock -ag -- sleep 10 > >> perf report > > > > Below perf report is not well suite for email, could you instead > > provide output from command below: > > > > perf report --no-children --stdio --call-graph none > > Thanks! That is way more readable output. Appended at the end of this post. > > > > >> > >> [[31m 67.81%[[m 0.00% ksoftirqd/0 [kernel.kallsyms] [k] > >> run_ksoftirqd > >> | > >> ---run_ksoftirqd > >> | > >> |[[31m--67.61%-- [[m __do_softirq > >> | | > >> | |[[31m--66.80%-- [[m net_rx_action > >> | | | > >> | | |[[31m--41.07%-- [[m ag71xx_poll > >> > >> ... > > > > (Looks like you managed to copy-paste the terminal escape codes for > > colors) > > Not sure what terminal emulation perf is expecting. TERM is set to XTERM. > > >> > >> It appears we're spending 66% of the time in the *ethernet* portion of > >> the path. > > > > Be careful you don't fool yourself. In your output you have the > > "children" mode on, so everything being called "under" net_rx_action is > > summed up. It could be it goes all the way through to the wifi TX > > parts and that is part of the sum. Even the memory allocations gets > > summed into this 66% number. > > > > > >> I'm going to stop worrying so much about the performance of the new > >> wifi algorithms. > > for at least 24 hours > > # To display the perf.data header info, please use > --header/--header-only options. > # > # > # Total Lost Samples: 0 > # > # Samples: 990 of event 'cpu-clock' > # Event count (approx.): 9999999900 > # > # Overhead Command Shared Object Symbol > # ........ ........... ...................... > ......................................... > # > 4.34% ksoftirqd/0 [ebtables] [k] ebt_do_table Do you actually need ebtables? you should check.... > 2.93% ksoftirqd/0 [kernel.kallsyms] [k] __netif_receive_skb_core > 2.83% ksoftirqd/0 [ip_tables] [k] ipt_do_table One should be careful just to blame "ebt_do_table" and "ipt_do_table", because this can easily just be a symptom of a cache-miss on packet data. On these low-end CPUs with small caches, one can even experience cache-misses happening on the same packet different places in the netstack. > 2.53% ksoftirqd/0 [kernel.kallsyms] [k] nf_iterate Upstream made changes in this area. 26dfab721629 ("netfilter: merge nf_iterate() into nf_hook_slow()") https://git.kernel.org/davem/net-next/c/26dfab721629 What kernel release is this? > 2.42% ksoftirqd/0 [mac80211] [k] ieee80211_tx_dequeue > 2.02% ksoftirqd/0 [kernel.kallsyms] [k] __local_bh_enable_ip > 2.02% ksoftirqd/0 [mac80211] [k] __ieee80211_subif_start_xmit > 1.82% ksoftirqd/0 [kernel.kallsyms] [k] __dev_queue_xmit > 1.72% ksoftirqd/0 [nf_conntrack] [k] tcp_packet Notice this "tcp_packet" call happens in "nf_conntrack" > 1.52% ksoftirqd/0 [mac80211] [k] ieee80211_prepare_and_rx_handle > 1.52% ksoftirqd/0 [mac80211] [k] ieee80211_rx_handlers > 1.41% ksoftirqd/0 [kernel.kallsyms] [k] br_handle_frame > 1.41% ksoftirqd/0 [kernel.kallsyms] [k] ip_finish_output2 > 1.41% ksoftirqd/0 [kernel.kallsyms] [k] ip_forward > 1.41% ksoftirqd/0 [nf_conntrack] [k] __nf_conntrack_find_get If you sum things up, then conntrack seems to take a lot of time. I find that the FlameGraphs are the best way to get such an overview. > 1.31% ksoftirqd/0 [kernel.kallsyms] [k] format_decode > 1.31% ksoftirqd/0 [mac80211] [k] ieee80211_xmit_fast_finish > 1.21% ksoftirqd/0 [mac80211] [k] ieee80211_tx_h_select_key > 1.11% ksoftirqd/0 [kernel.kallsyms] [k] __copy_user_common > 1.11% ksoftirqd/0 [mac80211] [k] ieee80211_queue_skb > 1.01% ksoftirqd/0 [ath10k_core] [k] ath10k_htt_txrx_compl_task > 1.01% ksoftirqd/0 [ath10k_core] [k] ath10k_mac_tx.isra.35 > 1.01% ksoftirqd/0 [kernel.kallsyms] [k] dev_hard_start_xmit > 1.01% ksoftirqd/0 [kernel.kallsyms] [k] eth_type_trans > 1.01% ksoftirqd/0 [kernel.kallsyms] [k] netif_skb_features > 1.01% ksoftirqd/0 [nf_conntrack_rtcache] [k] nf_rtcache_in.part.2 AFAIK nf_conntrack_rtcache is not upstream, which is too bad. Felix Fietkau showed it improves performance a lot (slide 20): http://people.netfilter.org/hawk/presentations/NetDev1.1_2016/links.html > 0.91% ksoftirqd/0 [ath10k_pci] [k] ath10k_pci_hif_tx_sg > 0.91% ksoftirqd/0 [kernel.kallsyms] [k] ip_rcv > 0.91% ksoftirqd/0 [kernel.kallsyms] [k] memcmp > 0.91% ksoftirqd/0 [mac80211] [k] ieee80211_rx_napi > 0.81% ksoftirqd/0 [ath10k_core] [k] ath10k_dbg > 0.81% ksoftirqd/0 [ath10k_core] [k] ath10k_htt_tx > 0.81% ksoftirqd/0 [ath10k_pci] [k] ath10k_bus_pci_write32 > 0.81% ksoftirqd/0 [br_netfilter] [k] br_nf_pre_routing > 0.81% ksoftirqd/0 [ebtable_broute] [k] ebt_broute > 0.81% ksoftirqd/0 [kernel.kallsyms] [k] ag71xx_poll > 0.81% ksoftirqd/0 [kernel.kallsyms] [k] ip_output > 0.81% ksoftirqd/0 [kernel.kallsyms] [k] skb_get_hash_perturb > 0.81% ksoftirqd/0 [mac80211] [k] sta_info_hash_lookup > 0.81% ksoftirqd/0 [nf_conntrack_ipv4] [k] ipv4_helper > 0.81% kworker/0:1 [kernel.kallsyms] [k] __delay > 0.71% ksoftirqd/0 [kernel.kallsyms] [k] __bzero > 0.71% ksoftirqd/0 [kernel.kallsyms] [k] __rmemcpy > 0.71% ksoftirqd/0 [kernel.kallsyms] [k] _find_next_bit.part.0 > 0.71% ksoftirqd/0 [kernel.kallsyms] [k] ip_rcv_finish > 0.71% ksoftirqd/0 [mac80211] [k] ieee80211_lookup_ra_sta > 0.71% ksoftirqd/0 [nf_conntrack] [k] hash_conntrack_raw > 0.71% ksoftirqd/0 [nf_conntrack] [k] nf_conntrack_in > 0.71% ksoftirqd/0 [nf_conntrack] [k] tcp_error > 0.61% ksoftirqd/0 [br_netfilter] [k] br_nf_post_routing > 0.61% ksoftirqd/0 [cfg80211] [k] cfg80211_classify8021d > 0.61% ksoftirqd/0 [iptable_raw] [k] iptable_raw_hook > 0.61% ksoftirqd/0 [kernel.kallsyms] [k] br_handle_frame_finish > 0.61% ksoftirqd/0 [kernel.kallsyms] [k] netdev_pick_tx > 0.61% ksoftirqd/0 [kernel.kallsyms] [k] r4k_dma_cache_wback_inv > 0.61% ksoftirqd/0 [kernel.kallsyms] [k] skb_network_protocol > 0.61% ksoftirqd/0 [nf_conntrack_rtcache] [k] nf_ct_rtcache_find_usable > 0.61% ksoftirqd/0 [nf_nat_ipv4] [k] nf_nat_ipv4_fn > 0.51% ksoftirqd/0 [ath10k_pci] [k] ath10k_ce_send_nolock > 0.51% ksoftirqd/0 [kernel.kallsyms] [k] __skb_flow_get_ports > 0.51% ksoftirqd/0 [kernel.kallsyms] [k] atomic64_add Hmm, a 64bit add on a 32-bit platform? > 0.51% ksoftirqd/0 [kernel.kallsyms] [k] br_dev_queue_push_xmit > 0.51% ksoftirqd/0 [kernel.kallsyms] [k] br_fdb_update > 0.51% ksoftirqd/0 [kernel.kallsyms] [k] dst_release > 0.51% ksoftirqd/0 [kernel.kallsyms] [k] idr_get_empty_slot > 0.51% ksoftirqd/0 [kernel.kallsyms] [k] r4k_dma_cache_inv > 0.51% ksoftirqd/0 [nf_conntrack] [k] nf_ct_deliver_cached_events > 0.51% ksoftirqd/0 [nf_conntrack_ipv4] [k] ipv4_conntrack_in > 0.51% ksoftirqd/0 [nf_nat] [k] nf_nat_packet > 0.40% ksoftirqd/0 [ath10k_core] [k] ath10k_dbg_dump > 0.40% ksoftirqd/0 [ath10k_core] [k] ath10k_mac_op_wake_tx_queue > 0.40% ksoftirqd/0 [ath10k_pci] [k] ath10k_bus_pci_read32 > 0.40% ksoftirqd/0 [ath10k_pci] [k] ath10k_ce_completed_send_next_nolock > 0.40% ksoftirqd/0 [br_netfilter] [k] ip_sabotage_in > 0.40% ksoftirqd/0 [kernel.kallsyms] [k] __br_fdb_get > 0.40% ksoftirqd/0 [kernel.kallsyms] [k] __delay > 0.40% ksoftirqd/0 [kernel.kallsyms] [k] __skb_flow_dissect > 0.40% ksoftirqd/0 [kernel.kallsyms] [k] __wake_up_sync_key > 0.40% ksoftirqd/0 [kernel.kallsyms] [k] memcpy > 0.40% ksoftirqd/0 [kernel.kallsyms] [k] sch_direct_xmit > 0.40% ksoftirqd/0 [kernel.kallsyms] [k] skb_release_data > 0.40% ksoftirqd/0 [kernel.kallsyms] [k] validate_xmit_skb.isra.30.part.31 > 0.40% ksoftirqd/0 [mac80211] [k] fq_flow_classify.constprop.17 > 0.40% ksoftirqd/0 [mac80211] [k] ieee80211_tx_status > 0.40% ksoftirqd/0 [nf_conntrack] [k] __nf_ct_refresh_acct > 0.40% ksoftirqd/0 [nf_conntrack_ipv4] [k] ipv4_get_l4proto > 0.40% ksoftirqd/0 [nf_conntrack_rtcache] [k] nf_rtcache_forward > 0.40% ksoftirqd/0 [nf_conntrack_rtcache] [k] nf_rtcache_forward4 > 0.40% ksoftirqd/0 [nf_nat_ipv4] [k] nf_nat_ipv4_in > 0.40% ksoftirqd/0 [nf_nat_ipv4] [k] nf_nat_ipv4_out > 0.30% ksoftirqd/0 [ath10k_core] [k] __ath10k_htt_rx_ring_fill_n > 0.30% ksoftirqd/0 [ath10k_core] [k] ath10k_htt_rx_h_deliver > 0.30% ksoftirqd/0 [ath10k_core] [k] ath10k_htt_rx_h_mpdu > 0.30% ksoftirqd/0 [ath10k_core] [k] ath10k_mac_tx_push_txq > 0.30% ksoftirqd/0 [ath10k_pci] [k] ath10k_ce_completed_send_next > 0.30% ksoftirqd/0 [ath10k_pci] [k] ath10k_pci_write32 > 0.30% ksoftirqd/0 [br_netfilter] [k] br_nf_dev_xmit > 0.30% ksoftirqd/0 [br_netfilter] [k] nf_bridge_encap_header_len > 0.30% ksoftirqd/0 [cfg80211] [k] ieee80211_hdrlen > 0.30% ksoftirqd/0 [kernel.kallsyms] [k] __free_page_frag > 0.30% ksoftirqd/0 [kernel.kallsyms] [k] ag71xx_hard_start_xmit > 0.30% ksoftirqd/0 [kernel.kallsyms] [k] br_dev_xmit > 0.30% ksoftirqd/0 [kernel.kallsyms] [k] br_pass_frame_up > 0.30% ksoftirqd/0 [kernel.kallsyms] [k] build_skb > 0.30% ksoftirqd/0 [kernel.kallsyms] [k] dev_gro_receive > 0.30% ksoftirqd/0 [kernel.kallsyms] [k] idr_alloc > 0.30% ksoftirqd/0 [kernel.kallsyms] [k] inet_gro_receive > 0.30% ksoftirqd/0 [kernel.kallsyms] [k] ip_finish_output > 0.30% ksoftirqd/0 [kernel.kallsyms] [k] nf_hook_slow > 0.30% ksoftirqd/0 [kernel.kallsyms] [k] tcp_gro_receive > 0.30% ksoftirqd/0 [mac80211] [k] codel_dequeue_func > 0.30% ksoftirqd/0 [mac80211] [k] fq_flow_dequeue > 0.30% ksoftirqd/0 [mac80211] [k] ieee80211_drop_unencrypted.part.3 > 0.30% ksoftirqd/0 [mac80211] [k] ieee80211_get_bssid > 0.30% ksoftirqd/0 [nf_conntrack_ipv4] [k] ipv4_confirm > 0.30% ksoftirqd/0 [nf_defrag_ipv4] [k] ipv4_conntrack_defrag > 0.20% ksoftirqd/0 [ath10k_core] [k] ath10k_txrx_tx_unref > 0.20% ksoftirqd/0 [ath10k_pci] [k] ath10k_pci_htt_rx_cb > 0.20% ksoftirqd/0 [br_netfilter] [k] br_nf_local_in > 0.20% ksoftirqd/0 [ebtable_filter] [k] ebt_out_hook > 0.20% ksoftirqd/0 [ebtable_nat] [k] ebt_nat_in > 0.20% ksoftirqd/0 [ebtable_nat] [k] ebt_nat_out > 0.20% ksoftirqd/0 [iptable_mangle] [k] iptable_mangle_hook > 0.20% ksoftirqd/0 [kernel.kallsyms] [k] __dma_sync Interesting __dma_sync does not take more than 0.2%. > 0.20% ksoftirqd/0 [kernel.kallsyms] [k] __free_pages_ok > 0.20% ksoftirqd/0 [kernel.kallsyms] [k] __netif_receive_skb > 0.20% ksoftirqd/0 [kernel.kallsyms] [k] __slab_alloc.isra.13.constprop.17 > 0.20% ksoftirqd/0 [kernel.kallsyms] [k] __slab_free.isra.14 > 0.20% ksoftirqd/0 [kernel.kallsyms] [k] br_deliver > 0.20% ksoftirqd/0 [kernel.kallsyms] [k] finish_task_switch > 0.20% ksoftirqd/0 [kernel.kallsyms] [k] ipv4_dst_check > 0.20% ksoftirqd/0 [kernel.kallsyms] [k] is_skb_forwardable > 0.20% ksoftirqd/0 [kernel.kallsyms] [k] kmem_cache_free > 0.20% ksoftirqd/0 [kernel.kallsyms] [k] ktime_get > 0.20% ksoftirqd/0 [kernel.kallsyms] [k] net_rx_action > 0.20% ksoftirqd/0 [kernel.kallsyms] [k] skb_release_head_state > 0.20% ksoftirqd/0 [kernel.kallsyms] [k] vsnprintf > 0.20% ksoftirqd/0 [mac80211] [k] ieee80211_deliver_skb > 0.20% ksoftirqd/0 [mac80211] [k] ieee80211_select_queue > 0.20% ksoftirqd/0 [mac80211] [k] ieee80211_skb_resize > 0.20% ksoftirqd/0 [nf_conntrack_rtcache] [k] nf_rtcache_in4 > 0.10% hostapd [kernel.kallsyms] [k] __copy_user_common > 0.10% hostapd [kernel.kallsyms] [k] core_sys_select > 0.10% hostapd [kernel.kallsyms] [k] datagram_poll > 0.10% hostapd [kernel.kallsyms] [k] do_select > 0.10% hostapd [kernel.kallsyms] [k] finish_task_switch > 0.10% hostapd [kernel.kallsyms] [k] handle_sys > 0.10% hostapd [kernel.kallsyms] [k] hrtimer_try_to_cancel > 0.10% hostapd [kernel.kallsyms] [k] netlink_recvmsg > 0.10% hostapd [kernel.kallsyms] [k] poll_initwait > 0.10% hostapd [kernel.kallsyms] [k] schedule_hrtimeout_range_clock > 0.10% hostapd [kernel.kallsyms] [k] sock_poll > 0.10% hostapd [kernel.kallsyms] [k] timespec_add_safe > 0.10% hostapd libc.so [.] 0x0000ef74 > 0.10% hostapd libc.so [.] 0x000288e4 > 0.10% hostapd libc.so [.] 0x000722fc > 0.10% hostapd wpad [.] 0x0001ab97 > 0.10% hostapd wpad [.] 0x0001b1a1 > 0.10% hostapd wpad [.] 0x0001cbd9 > 0.10% hostapd wpad [.] 0x0001cd0d > 0.10% hostapd wpad [.] 0x0004c589 > 0.10% ksoftirqd/0 [ath10k_core] [k] ath10k_htc_rx_completion_handler > 0.10% ksoftirqd/0 [ath10k_core] [k] ath10k_htt_rx_amsdu_allowed.isra.4 > 0.10% ksoftirqd/0 [ath10k_core] [k] ath10k_htt_tx_inc_pending > 0.10% ksoftirqd/0 [ath10k_core] [k] ath10k_htt_tx_txq_update > 0.10% ksoftirqd/0 [ath10k_core] [k] ath10k_mac_tx_frm_has_freq > 0.10% ksoftirqd/0 [ath10k_core] [k] ath10k_mac_tx_h_get_txmode.isra.4 > 0.10% ksoftirqd/0 [ath10k_core] [k] ath10k_mac_tx_push_pending > 0.10% ksoftirqd/0 [ath10k_core] [k] ath10k_process_rx > 0.10% ksoftirqd/0 [ath10k_pci] [k] ath10k_ce_completed_recv_next_nolock > 0.10% ksoftirqd/0 [ath9k] [k] ath_rx_tasklet > 0.10% ksoftirqd/0 [ath9k_common] [k] ath9k_cmn_process_rssi > 0.10% ksoftirqd/0 [ath9k_hw] [k] ath9k_hw_numtxpending > 0.10% ksoftirqd/0 [cfg80211] [k] __ieee80211_data_to_8023 > 0.10% ksoftirqd/0 [cfg80211] [k] cfg80211_rx_mgmt > 0.10% ksoftirqd/0 [kernel.kallsyms] [k] __build_skb > 0.10% ksoftirqd/0 [kernel.kallsyms] [k] __do_softirq > 0.10% ksoftirqd/0 [kernel.kallsyms] [k] ag71xx_tx_packets > 0.10% ksoftirqd/0 [kernel.kallsyms] [k] c0_hpt_read > 0.10% ksoftirqd/0 [kernel.kallsyms] [k] consume_skb > 0.10% ksoftirqd/0 [kernel.kallsyms] [k] idr_mark_full > 0.10% ksoftirqd/0 [kernel.kallsyms] [k] idr_remove > 0.10% ksoftirqd/0 [kernel.kallsyms] [k] ip_forward_finish > 0.10% ksoftirqd/0 [kernel.kallsyms] [k] kmem_cache_alloc > 0.10% ksoftirqd/0 [kernel.kallsyms] [k] memmove > 0.10% ksoftirqd/0 [kernel.kallsyms] [k] mod_timer > 0.10% ksoftirqd/0 [kernel.kallsyms] [k] number.isra.12 > 0.10% ksoftirqd/0 [kernel.kallsyms] [k] passthru_features_check > 0.10% ksoftirqd/0 [kernel.kallsyms] [k] skb_copy_bits > 0.10% ksoftirqd/0 [kernel.kallsyms] [k] skb_pull > 0.10% ksoftirqd/0 [kernel.kallsyms] [k] skb_push > 0.10% ksoftirqd/0 [mac80211] [k] __ieee80211_beacon_add_tim > 0.10% ksoftirqd/0 [mac80211] [k] codel_should_drop.isra.4.constprop.20 > 0.10% ksoftirqd/0 [mac80211] [k] ieee80211_report_used_skb > 0.10% ksoftirqd/0 [mac80211] [k] ieee80211_subif_start_xmit > 0.10% ksoftirqd/0 [mac80211] [k] remove_monitor_info > 0.10% ksoftirqd/0 [nf_conntrack] [k] nf_ct_get_tuple > 0.10% ksoftirqd/0 [nf_conntrack] [k] tcp_get_timeouts > 0.10% ksoftirqd/0 [nf_conntrack_ipv4] [k] ipv4_pkt_to_tuple > 0.10% ksoftirqd/0 [nf_nat] [k] nf_ct_nat_ext_add No single function "jumps-out", but the amount of function calls for sending packets are astonishing, I'm sure that the icache is also hurting. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer