[Make-wifi-fast] a bit of profiling on the archer

Jesper Dangaard Brouer brouer at redhat.com
Mon Nov 21 11:49:12 EST 2016


On Sat, 19 Nov 2016 09:33:51 -0800
Dave Taht <dave.taht at gmail.com> wrote:

> On Thu, Nov 17, 2016 at 11:55 PM, Jesper Dangaard Brouer
> <brouer at redhat.com> wrote:
> >
> > On Thu, 17 Nov 2016 20:14:49 -0800 Dave Taht <dave.taht at gmail.com> wrote:
> >  
> >> I have not been profiling much on lower end platforms (it's hard, you
> >> can crash a box pretty easily with the wrong options or sample rates).  
> >
> > I'm happy to hear that perf does work on this lower end HW, although
> > the disclaimer of sample rates.
> >
> > Does anyone know if hardware based PMU (Performance Monitor Units)
> > exists for these kind of devices?
> >  
> >> While watching the ath10k peak at 150-200mbits, at 99% of cpu in
> >> softirq, I spent a bit of time profiling various counters and
> >> statistics.
> >>
> >> for this one (while downloading 12 flows at the same time via flent)
> >>
> >> perf record -F 99 -e cpu-clock -ag -- sleep 10
> >> perf report  
> >
> > Below perf report is not well suite for email, could you instead
> > provide output from command below:
> >
> > perf report --no-children --stdio --call-graph none  
> 
> Thanks! That is way more readable output. Appended at the end of this post.
> 
> >  
> >>
> >>   [[31m  67.81%[[m     0.00%  ksoftirqd/0  [kernel.kallsyms]       [k]
> >> run_ksoftirqd
> >>             |
> >>             ---run_ksoftirqd
> >>                |
> >>                |[[31m--67.61%-- [[m __do_softirq
> >>                |          |
> >>                |          |[[31m--66.80%-- [[m net_rx_action
> >>                |          |          |
> >>                |          |          |[[31m--41.07%-- [[m ag71xx_poll
> >>
> >> ...  
> >
> > (Looks like you managed to copy-paste the terminal escape codes for
> > colors)  
> 
> Not sure what terminal emulation perf is expecting. TERM is set to XTERM.
> 
> >>
> >> It appears we're spending 66% of the time in the *ethernet* portion of
> >> the path.  
> >
> > Be careful you don't fool yourself. In your output you have the
> > "children" mode on, so everything being called "under" net_rx_action is
> > summed up.  It could be it goes all the way through to the wifi TX
> > parts and that is part of the sum.  Even the memory allocations gets
> > summed into this 66% number.
> >
> >  
> >> I'm going to stop worrying so much about the performance of the new
> >> wifi algorithms.  
> 
> for at least 24 hours
> 
> # To display the perf.data header info, please use
> --header/--header-only options.
> #
> #
> # Total Lost Samples: 0
> #
> # Samples: 990  of event 'cpu-clock'
> # Event count (approx.): 9999999900
> #
> # Overhead  Command      Shared Object           Symbol
> # ........  ...........  ......................
> .........................................
> #
>      4.34%  ksoftirqd/0  [ebtables]              [k] ebt_do_table

Do you actually need ebtables? you should check....

>      2.93%  ksoftirqd/0  [kernel.kallsyms]       [k] __netif_receive_skb_core
>      2.83%  ksoftirqd/0  [ip_tables]             [k] ipt_do_table

One should be careful just to blame "ebt_do_table" and "ipt_do_table",
because this can easily just be a symptom of a cache-miss on packet
data.  On these low-end CPUs with small caches, one can even experience
cache-misses happening on the same packet different places in the
netstack.

>      2.53%  ksoftirqd/0  [kernel.kallsyms]       [k] nf_iterate

Upstream made changes in this area.

 26dfab721629 ("netfilter: merge nf_iterate() into nf_hook_slow()")
 https://git.kernel.org/davem/net-next/c/26dfab721629

What kernel release is this?

>      2.42%  ksoftirqd/0  [mac80211]              [k] ieee80211_tx_dequeue
>      2.02%  ksoftirqd/0  [kernel.kallsyms]       [k] __local_bh_enable_ip
>      2.02%  ksoftirqd/0  [mac80211]              [k] __ieee80211_subif_start_xmit
>      1.82%  ksoftirqd/0  [kernel.kallsyms]       [k] __dev_queue_xmit
>      1.72%  ksoftirqd/0  [nf_conntrack]          [k] tcp_packet

Notice this "tcp_packet" call happens in "nf_conntrack"

>      1.52%  ksoftirqd/0  [mac80211]              [k] ieee80211_prepare_and_rx_handle
>      1.52%  ksoftirqd/0  [mac80211]              [k] ieee80211_rx_handlers
>      1.41%  ksoftirqd/0  [kernel.kallsyms]       [k] br_handle_frame
>      1.41%  ksoftirqd/0  [kernel.kallsyms]       [k] ip_finish_output2
>      1.41%  ksoftirqd/0  [kernel.kallsyms]       [k] ip_forward
>      1.41%  ksoftirqd/0  [nf_conntrack]          [k] __nf_conntrack_find_get

If you sum things up, then conntrack seems to take a lot of time.
I find that the FlameGraphs are the best way to get such an overview.


>      1.31%  ksoftirqd/0  [kernel.kallsyms]       [k] format_decode
>      1.31%  ksoftirqd/0  [mac80211]              [k] ieee80211_xmit_fast_finish
>      1.21%  ksoftirqd/0  [mac80211]              [k] ieee80211_tx_h_select_key
>      1.11%  ksoftirqd/0  [kernel.kallsyms]       [k] __copy_user_common
>      1.11%  ksoftirqd/0  [mac80211]              [k] ieee80211_queue_skb
>      1.01%  ksoftirqd/0  [ath10k_core]           [k] ath10k_htt_txrx_compl_task
>      1.01%  ksoftirqd/0  [ath10k_core]           [k] ath10k_mac_tx.isra.35
>      1.01%  ksoftirqd/0  [kernel.kallsyms]       [k] dev_hard_start_xmit
>      1.01%  ksoftirqd/0  [kernel.kallsyms]       [k] eth_type_trans
>      1.01%  ksoftirqd/0  [kernel.kallsyms]       [k] netif_skb_features
>      1.01%  ksoftirqd/0  [nf_conntrack_rtcache]  [k] nf_rtcache_in.part.2

AFAIK nf_conntrack_rtcache is not upstream, which is too bad.

Felix Fietkau showed it improves performance a lot (slide 20):
 http://people.netfilter.org/hawk/presentations/NetDev1.1_2016/links.html

>      0.91%  ksoftirqd/0  [ath10k_pci]            [k] ath10k_pci_hif_tx_sg
>      0.91%  ksoftirqd/0  [kernel.kallsyms]       [k] ip_rcv
>      0.91%  ksoftirqd/0  [kernel.kallsyms]       [k] memcmp
>      0.91%  ksoftirqd/0  [mac80211]              [k] ieee80211_rx_napi
>      0.81%  ksoftirqd/0  [ath10k_core]           [k] ath10k_dbg
>      0.81%  ksoftirqd/0  [ath10k_core]           [k] ath10k_htt_tx
>      0.81%  ksoftirqd/0  [ath10k_pci]            [k] ath10k_bus_pci_write32
>      0.81%  ksoftirqd/0  [br_netfilter]          [k] br_nf_pre_routing
>      0.81%  ksoftirqd/0  [ebtable_broute]        [k] ebt_broute
>      0.81%  ksoftirqd/0  [kernel.kallsyms]       [k] ag71xx_poll
>      0.81%  ksoftirqd/0  [kernel.kallsyms]       [k] ip_output
>      0.81%  ksoftirqd/0  [kernel.kallsyms]       [k] skb_get_hash_perturb
>      0.81%  ksoftirqd/0  [mac80211]              [k] sta_info_hash_lookup
>      0.81%  ksoftirqd/0  [nf_conntrack_ipv4]     [k] ipv4_helper
>      0.81%  kworker/0:1  [kernel.kallsyms]       [k] __delay
>      0.71%  ksoftirqd/0  [kernel.kallsyms]       [k] __bzero
>      0.71%  ksoftirqd/0  [kernel.kallsyms]       [k] __rmemcpy
>      0.71%  ksoftirqd/0  [kernel.kallsyms]       [k] _find_next_bit.part.0
>      0.71%  ksoftirqd/0  [kernel.kallsyms]       [k] ip_rcv_finish
>      0.71%  ksoftirqd/0  [mac80211]              [k] ieee80211_lookup_ra_sta
>      0.71%  ksoftirqd/0  [nf_conntrack]          [k] hash_conntrack_raw
>      0.71%  ksoftirqd/0  [nf_conntrack]          [k] nf_conntrack_in
>      0.71%  ksoftirqd/0  [nf_conntrack]          [k] tcp_error
>      0.61%  ksoftirqd/0  [br_netfilter]          [k] br_nf_post_routing
>      0.61%  ksoftirqd/0  [cfg80211]              [k] cfg80211_classify8021d
>      0.61%  ksoftirqd/0  [iptable_raw]           [k] iptable_raw_hook
>      0.61%  ksoftirqd/0  [kernel.kallsyms]       [k] br_handle_frame_finish
>      0.61%  ksoftirqd/0  [kernel.kallsyms]       [k] netdev_pick_tx
>      0.61%  ksoftirqd/0  [kernel.kallsyms]       [k] r4k_dma_cache_wback_inv
>      0.61%  ksoftirqd/0  [kernel.kallsyms]       [k] skb_network_protocol
>      0.61%  ksoftirqd/0  [nf_conntrack_rtcache]  [k] nf_ct_rtcache_find_usable
>      0.61%  ksoftirqd/0  [nf_nat_ipv4]           [k] nf_nat_ipv4_fn
>      0.51%  ksoftirqd/0  [ath10k_pci]            [k] ath10k_ce_send_nolock
>      0.51%  ksoftirqd/0  [kernel.kallsyms]       [k] __skb_flow_get_ports
>      0.51%  ksoftirqd/0  [kernel.kallsyms]       [k] atomic64_add

Hmm, a 64bit add on a 32-bit platform?

>      0.51%  ksoftirqd/0  [kernel.kallsyms]       [k] br_dev_queue_push_xmit
>      0.51%  ksoftirqd/0  [kernel.kallsyms]       [k] br_fdb_update
>      0.51%  ksoftirqd/0  [kernel.kallsyms]       [k] dst_release
>      0.51%  ksoftirqd/0  [kernel.kallsyms]       [k] idr_get_empty_slot
>      0.51%  ksoftirqd/0  [kernel.kallsyms]       [k] r4k_dma_cache_inv
>      0.51%  ksoftirqd/0  [nf_conntrack]          [k] nf_ct_deliver_cached_events
>      0.51%  ksoftirqd/0  [nf_conntrack_ipv4]     [k] ipv4_conntrack_in
>      0.51%  ksoftirqd/0  [nf_nat]                [k] nf_nat_packet
>      0.40%  ksoftirqd/0  [ath10k_core]           [k] ath10k_dbg_dump
>      0.40%  ksoftirqd/0  [ath10k_core]           [k] ath10k_mac_op_wake_tx_queue
>      0.40%  ksoftirqd/0  [ath10k_pci]            [k] ath10k_bus_pci_read32
>      0.40%  ksoftirqd/0  [ath10k_pci]            [k] ath10k_ce_completed_send_next_nolock
>      0.40%  ksoftirqd/0  [br_netfilter]          [k] ip_sabotage_in
>      0.40%  ksoftirqd/0  [kernel.kallsyms]       [k] __br_fdb_get
>      0.40%  ksoftirqd/0  [kernel.kallsyms]       [k] __delay
>      0.40%  ksoftirqd/0  [kernel.kallsyms]       [k] __skb_flow_dissect
>      0.40%  ksoftirqd/0  [kernel.kallsyms]       [k] __wake_up_sync_key
>      0.40%  ksoftirqd/0  [kernel.kallsyms]       [k] memcpy
>      0.40%  ksoftirqd/0  [kernel.kallsyms]       [k] sch_direct_xmit
>      0.40%  ksoftirqd/0  [kernel.kallsyms]       [k] skb_release_data
>      0.40%  ksoftirqd/0  [kernel.kallsyms]       [k] validate_xmit_skb.isra.30.part.31
>      0.40%  ksoftirqd/0  [mac80211]              [k] fq_flow_classify.constprop.17
>      0.40%  ksoftirqd/0  [mac80211]              [k] ieee80211_tx_status
>      0.40%  ksoftirqd/0  [nf_conntrack]          [k] __nf_ct_refresh_acct
>      0.40%  ksoftirqd/0  [nf_conntrack_ipv4]     [k] ipv4_get_l4proto
>      0.40%  ksoftirqd/0  [nf_conntrack_rtcache]  [k] nf_rtcache_forward
>      0.40%  ksoftirqd/0  [nf_conntrack_rtcache]  [k] nf_rtcache_forward4
>      0.40%  ksoftirqd/0  [nf_nat_ipv4]           [k] nf_nat_ipv4_in
>      0.40%  ksoftirqd/0  [nf_nat_ipv4]           [k] nf_nat_ipv4_out
>      0.30%  ksoftirqd/0  [ath10k_core]           [k] __ath10k_htt_rx_ring_fill_n
>      0.30%  ksoftirqd/0  [ath10k_core]           [k] ath10k_htt_rx_h_deliver
>      0.30%  ksoftirqd/0  [ath10k_core]           [k] ath10k_htt_rx_h_mpdu
>      0.30%  ksoftirqd/0  [ath10k_core]           [k] ath10k_mac_tx_push_txq
>      0.30%  ksoftirqd/0  [ath10k_pci]            [k] ath10k_ce_completed_send_next
>      0.30%  ksoftirqd/0  [ath10k_pci]            [k] ath10k_pci_write32
>      0.30%  ksoftirqd/0  [br_netfilter]          [k] br_nf_dev_xmit
>      0.30%  ksoftirqd/0  [br_netfilter]          [k] nf_bridge_encap_header_len
>      0.30%  ksoftirqd/0  [cfg80211]              [k] ieee80211_hdrlen
>      0.30%  ksoftirqd/0  [kernel.kallsyms]       [k] __free_page_frag
>      0.30%  ksoftirqd/0  [kernel.kallsyms]       [k] ag71xx_hard_start_xmit
>      0.30%  ksoftirqd/0  [kernel.kallsyms]       [k] br_dev_xmit
>      0.30%  ksoftirqd/0  [kernel.kallsyms]       [k] br_pass_frame_up
>      0.30%  ksoftirqd/0  [kernel.kallsyms]       [k] build_skb
>      0.30%  ksoftirqd/0  [kernel.kallsyms]       [k] dev_gro_receive
>      0.30%  ksoftirqd/0  [kernel.kallsyms]       [k] idr_alloc
>      0.30%  ksoftirqd/0  [kernel.kallsyms]       [k] inet_gro_receive
>      0.30%  ksoftirqd/0  [kernel.kallsyms]       [k] ip_finish_output
>      0.30%  ksoftirqd/0  [kernel.kallsyms]       [k] nf_hook_slow
>      0.30%  ksoftirqd/0  [kernel.kallsyms]       [k] tcp_gro_receive
>      0.30%  ksoftirqd/0  [mac80211]              [k] codel_dequeue_func
>      0.30%  ksoftirqd/0  [mac80211]              [k] fq_flow_dequeue
>      0.30%  ksoftirqd/0  [mac80211]              [k] ieee80211_drop_unencrypted.part.3
>      0.30%  ksoftirqd/0  [mac80211]              [k] ieee80211_get_bssid
>      0.30%  ksoftirqd/0  [nf_conntrack_ipv4]     [k] ipv4_confirm
>      0.30%  ksoftirqd/0  [nf_defrag_ipv4]        [k] ipv4_conntrack_defrag
>      0.20%  ksoftirqd/0  [ath10k_core]           [k] ath10k_txrx_tx_unref
>      0.20%  ksoftirqd/0  [ath10k_pci]            [k] ath10k_pci_htt_rx_cb
>      0.20%  ksoftirqd/0  [br_netfilter]          [k] br_nf_local_in
>      0.20%  ksoftirqd/0  [ebtable_filter]        [k] ebt_out_hook
>      0.20%  ksoftirqd/0  [ebtable_nat]           [k] ebt_nat_in
>      0.20%  ksoftirqd/0  [ebtable_nat]           [k] ebt_nat_out
>      0.20%  ksoftirqd/0  [iptable_mangle]        [k] iptable_mangle_hook
>      0.20%  ksoftirqd/0  [kernel.kallsyms]       [k] __dma_sync

Interesting __dma_sync does not take more than 0.2%.

>      0.20%  ksoftirqd/0  [kernel.kallsyms]       [k] __free_pages_ok
>      0.20%  ksoftirqd/0  [kernel.kallsyms]       [k] __netif_receive_skb
>      0.20%  ksoftirqd/0  [kernel.kallsyms]       [k] __slab_alloc.isra.13.constprop.17
>      0.20%  ksoftirqd/0  [kernel.kallsyms]       [k] __slab_free.isra.14
>      0.20%  ksoftirqd/0  [kernel.kallsyms]       [k] br_deliver
>      0.20%  ksoftirqd/0  [kernel.kallsyms]       [k] finish_task_switch
>      0.20%  ksoftirqd/0  [kernel.kallsyms]       [k] ipv4_dst_check
>      0.20%  ksoftirqd/0  [kernel.kallsyms]       [k] is_skb_forwardable
>      0.20%  ksoftirqd/0  [kernel.kallsyms]       [k] kmem_cache_free
>      0.20%  ksoftirqd/0  [kernel.kallsyms]       [k] ktime_get
>      0.20%  ksoftirqd/0  [kernel.kallsyms]       [k] net_rx_action
>      0.20%  ksoftirqd/0  [kernel.kallsyms]       [k] skb_release_head_state
>      0.20%  ksoftirqd/0  [kernel.kallsyms]       [k] vsnprintf
>      0.20%  ksoftirqd/0  [mac80211]              [k] ieee80211_deliver_skb
>      0.20%  ksoftirqd/0  [mac80211]              [k] ieee80211_select_queue
>      0.20%  ksoftirqd/0  [mac80211]              [k] ieee80211_skb_resize
>      0.20%  ksoftirqd/0  [nf_conntrack_rtcache]  [k] nf_rtcache_in4
>      0.10%  hostapd      [kernel.kallsyms]       [k] __copy_user_common
>      0.10%  hostapd      [kernel.kallsyms]       [k] core_sys_select
>      0.10%  hostapd      [kernel.kallsyms]       [k] datagram_poll
>      0.10%  hostapd      [kernel.kallsyms]       [k] do_select
>      0.10%  hostapd      [kernel.kallsyms]       [k] finish_task_switch
>      0.10%  hostapd      [kernel.kallsyms]       [k] handle_sys
>      0.10%  hostapd      [kernel.kallsyms]       [k] hrtimer_try_to_cancel
>      0.10%  hostapd      [kernel.kallsyms]       [k] netlink_recvmsg
>      0.10%  hostapd      [kernel.kallsyms]       [k] poll_initwait
>      0.10%  hostapd      [kernel.kallsyms]       [k] schedule_hrtimeout_range_clock
>      0.10%  hostapd      [kernel.kallsyms]       [k] sock_poll
>      0.10%  hostapd      [kernel.kallsyms]       [k] timespec_add_safe
>      0.10%  hostapd      libc.so                 [.] 0x0000ef74
>      0.10%  hostapd      libc.so                 [.] 0x000288e4
>      0.10%  hostapd      libc.so                 [.] 0x000722fc
>      0.10%  hostapd      wpad                    [.] 0x0001ab97
>      0.10%  hostapd      wpad                    [.] 0x0001b1a1
>      0.10%  hostapd      wpad                    [.] 0x0001cbd9
>      0.10%  hostapd      wpad                    [.] 0x0001cd0d
>      0.10%  hostapd      wpad                    [.] 0x0004c589
>      0.10%  ksoftirqd/0  [ath10k_core]           [k] ath10k_htc_rx_completion_handler
>      0.10%  ksoftirqd/0  [ath10k_core]           [k] ath10k_htt_rx_amsdu_allowed.isra.4
>      0.10%  ksoftirqd/0  [ath10k_core]           [k] ath10k_htt_tx_inc_pending
>      0.10%  ksoftirqd/0  [ath10k_core]           [k] ath10k_htt_tx_txq_update
>      0.10%  ksoftirqd/0  [ath10k_core]           [k] ath10k_mac_tx_frm_has_freq
>      0.10%  ksoftirqd/0  [ath10k_core]           [k] ath10k_mac_tx_h_get_txmode.isra.4
>      0.10%  ksoftirqd/0  [ath10k_core]           [k] ath10k_mac_tx_push_pending
>      0.10%  ksoftirqd/0  [ath10k_core]           [k] ath10k_process_rx
>      0.10%  ksoftirqd/0  [ath10k_pci]            [k] ath10k_ce_completed_recv_next_nolock
>      0.10%  ksoftirqd/0  [ath9k]                 [k] ath_rx_tasklet
>      0.10%  ksoftirqd/0  [ath9k_common]          [k] ath9k_cmn_process_rssi
>      0.10%  ksoftirqd/0  [ath9k_hw]              [k] ath9k_hw_numtxpending
>      0.10%  ksoftirqd/0  [cfg80211]              [k] __ieee80211_data_to_8023
>      0.10%  ksoftirqd/0  [cfg80211]              [k] cfg80211_rx_mgmt
>      0.10%  ksoftirqd/0  [kernel.kallsyms]       [k] __build_skb
>      0.10%  ksoftirqd/0  [kernel.kallsyms]       [k] __do_softirq
>      0.10%  ksoftirqd/0  [kernel.kallsyms]       [k] ag71xx_tx_packets
>      0.10%  ksoftirqd/0  [kernel.kallsyms]       [k] c0_hpt_read
>      0.10%  ksoftirqd/0  [kernel.kallsyms]       [k] consume_skb
>      0.10%  ksoftirqd/0  [kernel.kallsyms]       [k] idr_mark_full
>      0.10%  ksoftirqd/0  [kernel.kallsyms]       [k] idr_remove
>      0.10%  ksoftirqd/0  [kernel.kallsyms]       [k] ip_forward_finish
>      0.10%  ksoftirqd/0  [kernel.kallsyms]       [k] kmem_cache_alloc
>      0.10%  ksoftirqd/0  [kernel.kallsyms]       [k] memmove
>      0.10%  ksoftirqd/0  [kernel.kallsyms]       [k] mod_timer
>      0.10%  ksoftirqd/0  [kernel.kallsyms]       [k] number.isra.12
>      0.10%  ksoftirqd/0  [kernel.kallsyms]       [k] passthru_features_check
>      0.10%  ksoftirqd/0  [kernel.kallsyms]       [k] skb_copy_bits
>      0.10%  ksoftirqd/0  [kernel.kallsyms]       [k] skb_pull
>      0.10%  ksoftirqd/0  [kernel.kallsyms]       [k] skb_push
>      0.10%  ksoftirqd/0  [mac80211]              [k] __ieee80211_beacon_add_tim
>      0.10%  ksoftirqd/0  [mac80211]              [k] codel_should_drop.isra.4.constprop.20
>      0.10%  ksoftirqd/0  [mac80211]              [k] ieee80211_report_used_skb
>      0.10%  ksoftirqd/0  [mac80211]              [k] ieee80211_subif_start_xmit
>      0.10%  ksoftirqd/0  [mac80211]              [k] remove_monitor_info
>      0.10%  ksoftirqd/0  [nf_conntrack]          [k] nf_ct_get_tuple
>      0.10%  ksoftirqd/0  [nf_conntrack]          [k] tcp_get_timeouts
>      0.10%  ksoftirqd/0  [nf_conntrack_ipv4]     [k] ipv4_pkt_to_tuple
>      0.10%  ksoftirqd/0  [nf_nat]                [k] nf_ct_nat_ext_add

No single function "jumps-out", but the amount of function calls for
sending packets are astonishing, I'm sure that the icache is also
hurting.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


More information about the Make-wifi-fast mailing list