[Make-wifi-fast] a bit of profiling on the archer
Jesper Dangaard Brouer
brouer at redhat.com
Mon Nov 21 11:49:12 EST 2016
On Sat, 19 Nov 2016 09:33:51 -0800
Dave Taht <dave.taht at gmail.com> wrote:
> On Thu, Nov 17, 2016 at 11:55 PM, Jesper Dangaard Brouer
> <brouer at redhat.com> wrote:
> >
> > On Thu, 17 Nov 2016 20:14:49 -0800 Dave Taht <dave.taht at gmail.com> wrote:
> >
> >> I have not been profiling much on lower end platforms (it's hard, you
> >> can crash a box pretty easily with the wrong options or sample rates).
> >
> > I'm happy to hear that perf does work on this lower end HW, although
> > the disclaimer of sample rates.
> >
> > Does anyone know if hardware based PMU (Performance Monitor Units)
> > exists for these kind of devices?
> >
> >> While watching the ath10k peak at 150-200mbits, at 99% of cpu in
> >> softirq, I spent a bit of time profiling various counters and
> >> statistics.
> >>
> >> for this one (while downloading 12 flows at the same time via flent)
> >>
> >> perf record -F 99 -e cpu-clock -ag -- sleep 10
> >> perf report
> >
> > Below perf report is not well suite for email, could you instead
> > provide output from command below:
> >
> > perf report --no-children --stdio --call-graph none
>
> Thanks! That is way more readable output. Appended at the end of this post.
>
> >
> >>
> >> [[31m 67.81%[[m 0.00% ksoftirqd/0 [kernel.kallsyms] [k]
> >> run_ksoftirqd
> >> |
> >> ---run_ksoftirqd
> >> |
> >> |[[31m--67.61%-- [[m __do_softirq
> >> | |
> >> | |[[31m--66.80%-- [[m net_rx_action
> >> | | |
> >> | | |[[31m--41.07%-- [[m ag71xx_poll
> >>
> >> ...
> >
> > (Looks like you managed to copy-paste the terminal escape codes for
> > colors)
>
> Not sure what terminal emulation perf is expecting. TERM is set to XTERM.
>
> >>
> >> It appears we're spending 66% of the time in the *ethernet* portion of
> >> the path.
> >
> > Be careful you don't fool yourself. In your output you have the
> > "children" mode on, so everything being called "under" net_rx_action is
> > summed up. It could be it goes all the way through to the wifi TX
> > parts and that is part of the sum. Even the memory allocations gets
> > summed into this 66% number.
> >
> >
> >> I'm going to stop worrying so much about the performance of the new
> >> wifi algorithms.
>
> for at least 24 hours
>
> # To display the perf.data header info, please use
> --header/--header-only options.
> #
> #
> # Total Lost Samples: 0
> #
> # Samples: 990 of event 'cpu-clock'
> # Event count (approx.): 9999999900
> #
> # Overhead Command Shared Object Symbol
> # ........ ........... ......................
> .........................................
> #
> 4.34% ksoftirqd/0 [ebtables] [k] ebt_do_table
Do you actually need ebtables? you should check....
> 2.93% ksoftirqd/0 [kernel.kallsyms] [k] __netif_receive_skb_core
> 2.83% ksoftirqd/0 [ip_tables] [k] ipt_do_table
One should be careful just to blame "ebt_do_table" and "ipt_do_table",
because this can easily just be a symptom of a cache-miss on packet
data. On these low-end CPUs with small caches, one can even experience
cache-misses happening on the same packet different places in the
netstack.
> 2.53% ksoftirqd/0 [kernel.kallsyms] [k] nf_iterate
Upstream made changes in this area.
26dfab721629 ("netfilter: merge nf_iterate() into nf_hook_slow()")
https://git.kernel.org/davem/net-next/c/26dfab721629
What kernel release is this?
> 2.42% ksoftirqd/0 [mac80211] [k] ieee80211_tx_dequeue
> 2.02% ksoftirqd/0 [kernel.kallsyms] [k] __local_bh_enable_ip
> 2.02% ksoftirqd/0 [mac80211] [k] __ieee80211_subif_start_xmit
> 1.82% ksoftirqd/0 [kernel.kallsyms] [k] __dev_queue_xmit
> 1.72% ksoftirqd/0 [nf_conntrack] [k] tcp_packet
Notice this "tcp_packet" call happens in "nf_conntrack"
> 1.52% ksoftirqd/0 [mac80211] [k] ieee80211_prepare_and_rx_handle
> 1.52% ksoftirqd/0 [mac80211] [k] ieee80211_rx_handlers
> 1.41% ksoftirqd/0 [kernel.kallsyms] [k] br_handle_frame
> 1.41% ksoftirqd/0 [kernel.kallsyms] [k] ip_finish_output2
> 1.41% ksoftirqd/0 [kernel.kallsyms] [k] ip_forward
> 1.41% ksoftirqd/0 [nf_conntrack] [k] __nf_conntrack_find_get
If you sum things up, then conntrack seems to take a lot of time.
I find that the FlameGraphs are the best way to get such an overview.
> 1.31% ksoftirqd/0 [kernel.kallsyms] [k] format_decode
> 1.31% ksoftirqd/0 [mac80211] [k] ieee80211_xmit_fast_finish
> 1.21% ksoftirqd/0 [mac80211] [k] ieee80211_tx_h_select_key
> 1.11% ksoftirqd/0 [kernel.kallsyms] [k] __copy_user_common
> 1.11% ksoftirqd/0 [mac80211] [k] ieee80211_queue_skb
> 1.01% ksoftirqd/0 [ath10k_core] [k] ath10k_htt_txrx_compl_task
> 1.01% ksoftirqd/0 [ath10k_core] [k] ath10k_mac_tx.isra.35
> 1.01% ksoftirqd/0 [kernel.kallsyms] [k] dev_hard_start_xmit
> 1.01% ksoftirqd/0 [kernel.kallsyms] [k] eth_type_trans
> 1.01% ksoftirqd/0 [kernel.kallsyms] [k] netif_skb_features
> 1.01% ksoftirqd/0 [nf_conntrack_rtcache] [k] nf_rtcache_in.part.2
AFAIK nf_conntrack_rtcache is not upstream, which is too bad.
Felix Fietkau showed it improves performance a lot (slide 20):
http://people.netfilter.org/hawk/presentations/NetDev1.1_2016/links.html
> 0.91% ksoftirqd/0 [ath10k_pci] [k] ath10k_pci_hif_tx_sg
> 0.91% ksoftirqd/0 [kernel.kallsyms] [k] ip_rcv
> 0.91% ksoftirqd/0 [kernel.kallsyms] [k] memcmp
> 0.91% ksoftirqd/0 [mac80211] [k] ieee80211_rx_napi
> 0.81% ksoftirqd/0 [ath10k_core] [k] ath10k_dbg
> 0.81% ksoftirqd/0 [ath10k_core] [k] ath10k_htt_tx
> 0.81% ksoftirqd/0 [ath10k_pci] [k] ath10k_bus_pci_write32
> 0.81% ksoftirqd/0 [br_netfilter] [k] br_nf_pre_routing
> 0.81% ksoftirqd/0 [ebtable_broute] [k] ebt_broute
> 0.81% ksoftirqd/0 [kernel.kallsyms] [k] ag71xx_poll
> 0.81% ksoftirqd/0 [kernel.kallsyms] [k] ip_output
> 0.81% ksoftirqd/0 [kernel.kallsyms] [k] skb_get_hash_perturb
> 0.81% ksoftirqd/0 [mac80211] [k] sta_info_hash_lookup
> 0.81% ksoftirqd/0 [nf_conntrack_ipv4] [k] ipv4_helper
> 0.81% kworker/0:1 [kernel.kallsyms] [k] __delay
> 0.71% ksoftirqd/0 [kernel.kallsyms] [k] __bzero
> 0.71% ksoftirqd/0 [kernel.kallsyms] [k] __rmemcpy
> 0.71% ksoftirqd/0 [kernel.kallsyms] [k] _find_next_bit.part.0
> 0.71% ksoftirqd/0 [kernel.kallsyms] [k] ip_rcv_finish
> 0.71% ksoftirqd/0 [mac80211] [k] ieee80211_lookup_ra_sta
> 0.71% ksoftirqd/0 [nf_conntrack] [k] hash_conntrack_raw
> 0.71% ksoftirqd/0 [nf_conntrack] [k] nf_conntrack_in
> 0.71% ksoftirqd/0 [nf_conntrack] [k] tcp_error
> 0.61% ksoftirqd/0 [br_netfilter] [k] br_nf_post_routing
> 0.61% ksoftirqd/0 [cfg80211] [k] cfg80211_classify8021d
> 0.61% ksoftirqd/0 [iptable_raw] [k] iptable_raw_hook
> 0.61% ksoftirqd/0 [kernel.kallsyms] [k] br_handle_frame_finish
> 0.61% ksoftirqd/0 [kernel.kallsyms] [k] netdev_pick_tx
> 0.61% ksoftirqd/0 [kernel.kallsyms] [k] r4k_dma_cache_wback_inv
> 0.61% ksoftirqd/0 [kernel.kallsyms] [k] skb_network_protocol
> 0.61% ksoftirqd/0 [nf_conntrack_rtcache] [k] nf_ct_rtcache_find_usable
> 0.61% ksoftirqd/0 [nf_nat_ipv4] [k] nf_nat_ipv4_fn
> 0.51% ksoftirqd/0 [ath10k_pci] [k] ath10k_ce_send_nolock
> 0.51% ksoftirqd/0 [kernel.kallsyms] [k] __skb_flow_get_ports
> 0.51% ksoftirqd/0 [kernel.kallsyms] [k] atomic64_add
Hmm, a 64bit add on a 32-bit platform?
> 0.51% ksoftirqd/0 [kernel.kallsyms] [k] br_dev_queue_push_xmit
> 0.51% ksoftirqd/0 [kernel.kallsyms] [k] br_fdb_update
> 0.51% ksoftirqd/0 [kernel.kallsyms] [k] dst_release
> 0.51% ksoftirqd/0 [kernel.kallsyms] [k] idr_get_empty_slot
> 0.51% ksoftirqd/0 [kernel.kallsyms] [k] r4k_dma_cache_inv
> 0.51% ksoftirqd/0 [nf_conntrack] [k] nf_ct_deliver_cached_events
> 0.51% ksoftirqd/0 [nf_conntrack_ipv4] [k] ipv4_conntrack_in
> 0.51% ksoftirqd/0 [nf_nat] [k] nf_nat_packet
> 0.40% ksoftirqd/0 [ath10k_core] [k] ath10k_dbg_dump
> 0.40% ksoftirqd/0 [ath10k_core] [k] ath10k_mac_op_wake_tx_queue
> 0.40% ksoftirqd/0 [ath10k_pci] [k] ath10k_bus_pci_read32
> 0.40% ksoftirqd/0 [ath10k_pci] [k] ath10k_ce_completed_send_next_nolock
> 0.40% ksoftirqd/0 [br_netfilter] [k] ip_sabotage_in
> 0.40% ksoftirqd/0 [kernel.kallsyms] [k] __br_fdb_get
> 0.40% ksoftirqd/0 [kernel.kallsyms] [k] __delay
> 0.40% ksoftirqd/0 [kernel.kallsyms] [k] __skb_flow_dissect
> 0.40% ksoftirqd/0 [kernel.kallsyms] [k] __wake_up_sync_key
> 0.40% ksoftirqd/0 [kernel.kallsyms] [k] memcpy
> 0.40% ksoftirqd/0 [kernel.kallsyms] [k] sch_direct_xmit
> 0.40% ksoftirqd/0 [kernel.kallsyms] [k] skb_release_data
> 0.40% ksoftirqd/0 [kernel.kallsyms] [k] validate_xmit_skb.isra.30.part.31
> 0.40% ksoftirqd/0 [mac80211] [k] fq_flow_classify.constprop.17
> 0.40% ksoftirqd/0 [mac80211] [k] ieee80211_tx_status
> 0.40% ksoftirqd/0 [nf_conntrack] [k] __nf_ct_refresh_acct
> 0.40% ksoftirqd/0 [nf_conntrack_ipv4] [k] ipv4_get_l4proto
> 0.40% ksoftirqd/0 [nf_conntrack_rtcache] [k] nf_rtcache_forward
> 0.40% ksoftirqd/0 [nf_conntrack_rtcache] [k] nf_rtcache_forward4
> 0.40% ksoftirqd/0 [nf_nat_ipv4] [k] nf_nat_ipv4_in
> 0.40% ksoftirqd/0 [nf_nat_ipv4] [k] nf_nat_ipv4_out
> 0.30% ksoftirqd/0 [ath10k_core] [k] __ath10k_htt_rx_ring_fill_n
> 0.30% ksoftirqd/0 [ath10k_core] [k] ath10k_htt_rx_h_deliver
> 0.30% ksoftirqd/0 [ath10k_core] [k] ath10k_htt_rx_h_mpdu
> 0.30% ksoftirqd/0 [ath10k_core] [k] ath10k_mac_tx_push_txq
> 0.30% ksoftirqd/0 [ath10k_pci] [k] ath10k_ce_completed_send_next
> 0.30% ksoftirqd/0 [ath10k_pci] [k] ath10k_pci_write32
> 0.30% ksoftirqd/0 [br_netfilter] [k] br_nf_dev_xmit
> 0.30% ksoftirqd/0 [br_netfilter] [k] nf_bridge_encap_header_len
> 0.30% ksoftirqd/0 [cfg80211] [k] ieee80211_hdrlen
> 0.30% ksoftirqd/0 [kernel.kallsyms] [k] __free_page_frag
> 0.30% ksoftirqd/0 [kernel.kallsyms] [k] ag71xx_hard_start_xmit
> 0.30% ksoftirqd/0 [kernel.kallsyms] [k] br_dev_xmit
> 0.30% ksoftirqd/0 [kernel.kallsyms] [k] br_pass_frame_up
> 0.30% ksoftirqd/0 [kernel.kallsyms] [k] build_skb
> 0.30% ksoftirqd/0 [kernel.kallsyms] [k] dev_gro_receive
> 0.30% ksoftirqd/0 [kernel.kallsyms] [k] idr_alloc
> 0.30% ksoftirqd/0 [kernel.kallsyms] [k] inet_gro_receive
> 0.30% ksoftirqd/0 [kernel.kallsyms] [k] ip_finish_output
> 0.30% ksoftirqd/0 [kernel.kallsyms] [k] nf_hook_slow
> 0.30% ksoftirqd/0 [kernel.kallsyms] [k] tcp_gro_receive
> 0.30% ksoftirqd/0 [mac80211] [k] codel_dequeue_func
> 0.30% ksoftirqd/0 [mac80211] [k] fq_flow_dequeue
> 0.30% ksoftirqd/0 [mac80211] [k] ieee80211_drop_unencrypted.part.3
> 0.30% ksoftirqd/0 [mac80211] [k] ieee80211_get_bssid
> 0.30% ksoftirqd/0 [nf_conntrack_ipv4] [k] ipv4_confirm
> 0.30% ksoftirqd/0 [nf_defrag_ipv4] [k] ipv4_conntrack_defrag
> 0.20% ksoftirqd/0 [ath10k_core] [k] ath10k_txrx_tx_unref
> 0.20% ksoftirqd/0 [ath10k_pci] [k] ath10k_pci_htt_rx_cb
> 0.20% ksoftirqd/0 [br_netfilter] [k] br_nf_local_in
> 0.20% ksoftirqd/0 [ebtable_filter] [k] ebt_out_hook
> 0.20% ksoftirqd/0 [ebtable_nat] [k] ebt_nat_in
> 0.20% ksoftirqd/0 [ebtable_nat] [k] ebt_nat_out
> 0.20% ksoftirqd/0 [iptable_mangle] [k] iptable_mangle_hook
> 0.20% ksoftirqd/0 [kernel.kallsyms] [k] __dma_sync
Interesting __dma_sync does not take more than 0.2%.
> 0.20% ksoftirqd/0 [kernel.kallsyms] [k] __free_pages_ok
> 0.20% ksoftirqd/0 [kernel.kallsyms] [k] __netif_receive_skb
> 0.20% ksoftirqd/0 [kernel.kallsyms] [k] __slab_alloc.isra.13.constprop.17
> 0.20% ksoftirqd/0 [kernel.kallsyms] [k] __slab_free.isra.14
> 0.20% ksoftirqd/0 [kernel.kallsyms] [k] br_deliver
> 0.20% ksoftirqd/0 [kernel.kallsyms] [k] finish_task_switch
> 0.20% ksoftirqd/0 [kernel.kallsyms] [k] ipv4_dst_check
> 0.20% ksoftirqd/0 [kernel.kallsyms] [k] is_skb_forwardable
> 0.20% ksoftirqd/0 [kernel.kallsyms] [k] kmem_cache_free
> 0.20% ksoftirqd/0 [kernel.kallsyms] [k] ktime_get
> 0.20% ksoftirqd/0 [kernel.kallsyms] [k] net_rx_action
> 0.20% ksoftirqd/0 [kernel.kallsyms] [k] skb_release_head_state
> 0.20% ksoftirqd/0 [kernel.kallsyms] [k] vsnprintf
> 0.20% ksoftirqd/0 [mac80211] [k] ieee80211_deliver_skb
> 0.20% ksoftirqd/0 [mac80211] [k] ieee80211_select_queue
> 0.20% ksoftirqd/0 [mac80211] [k] ieee80211_skb_resize
> 0.20% ksoftirqd/0 [nf_conntrack_rtcache] [k] nf_rtcache_in4
> 0.10% hostapd [kernel.kallsyms] [k] __copy_user_common
> 0.10% hostapd [kernel.kallsyms] [k] core_sys_select
> 0.10% hostapd [kernel.kallsyms] [k] datagram_poll
> 0.10% hostapd [kernel.kallsyms] [k] do_select
> 0.10% hostapd [kernel.kallsyms] [k] finish_task_switch
> 0.10% hostapd [kernel.kallsyms] [k] handle_sys
> 0.10% hostapd [kernel.kallsyms] [k] hrtimer_try_to_cancel
> 0.10% hostapd [kernel.kallsyms] [k] netlink_recvmsg
> 0.10% hostapd [kernel.kallsyms] [k] poll_initwait
> 0.10% hostapd [kernel.kallsyms] [k] schedule_hrtimeout_range_clock
> 0.10% hostapd [kernel.kallsyms] [k] sock_poll
> 0.10% hostapd [kernel.kallsyms] [k] timespec_add_safe
> 0.10% hostapd libc.so [.] 0x0000ef74
> 0.10% hostapd libc.so [.] 0x000288e4
> 0.10% hostapd libc.so [.] 0x000722fc
> 0.10% hostapd wpad [.] 0x0001ab97
> 0.10% hostapd wpad [.] 0x0001b1a1
> 0.10% hostapd wpad [.] 0x0001cbd9
> 0.10% hostapd wpad [.] 0x0001cd0d
> 0.10% hostapd wpad [.] 0x0004c589
> 0.10% ksoftirqd/0 [ath10k_core] [k] ath10k_htc_rx_completion_handler
> 0.10% ksoftirqd/0 [ath10k_core] [k] ath10k_htt_rx_amsdu_allowed.isra.4
> 0.10% ksoftirqd/0 [ath10k_core] [k] ath10k_htt_tx_inc_pending
> 0.10% ksoftirqd/0 [ath10k_core] [k] ath10k_htt_tx_txq_update
> 0.10% ksoftirqd/0 [ath10k_core] [k] ath10k_mac_tx_frm_has_freq
> 0.10% ksoftirqd/0 [ath10k_core] [k] ath10k_mac_tx_h_get_txmode.isra.4
> 0.10% ksoftirqd/0 [ath10k_core] [k] ath10k_mac_tx_push_pending
> 0.10% ksoftirqd/0 [ath10k_core] [k] ath10k_process_rx
> 0.10% ksoftirqd/0 [ath10k_pci] [k] ath10k_ce_completed_recv_next_nolock
> 0.10% ksoftirqd/0 [ath9k] [k] ath_rx_tasklet
> 0.10% ksoftirqd/0 [ath9k_common] [k] ath9k_cmn_process_rssi
> 0.10% ksoftirqd/0 [ath9k_hw] [k] ath9k_hw_numtxpending
> 0.10% ksoftirqd/0 [cfg80211] [k] __ieee80211_data_to_8023
> 0.10% ksoftirqd/0 [cfg80211] [k] cfg80211_rx_mgmt
> 0.10% ksoftirqd/0 [kernel.kallsyms] [k] __build_skb
> 0.10% ksoftirqd/0 [kernel.kallsyms] [k] __do_softirq
> 0.10% ksoftirqd/0 [kernel.kallsyms] [k] ag71xx_tx_packets
> 0.10% ksoftirqd/0 [kernel.kallsyms] [k] c0_hpt_read
> 0.10% ksoftirqd/0 [kernel.kallsyms] [k] consume_skb
> 0.10% ksoftirqd/0 [kernel.kallsyms] [k] idr_mark_full
> 0.10% ksoftirqd/0 [kernel.kallsyms] [k] idr_remove
> 0.10% ksoftirqd/0 [kernel.kallsyms] [k] ip_forward_finish
> 0.10% ksoftirqd/0 [kernel.kallsyms] [k] kmem_cache_alloc
> 0.10% ksoftirqd/0 [kernel.kallsyms] [k] memmove
> 0.10% ksoftirqd/0 [kernel.kallsyms] [k] mod_timer
> 0.10% ksoftirqd/0 [kernel.kallsyms] [k] number.isra.12
> 0.10% ksoftirqd/0 [kernel.kallsyms] [k] passthru_features_check
> 0.10% ksoftirqd/0 [kernel.kallsyms] [k] skb_copy_bits
> 0.10% ksoftirqd/0 [kernel.kallsyms] [k] skb_pull
> 0.10% ksoftirqd/0 [kernel.kallsyms] [k] skb_push
> 0.10% ksoftirqd/0 [mac80211] [k] __ieee80211_beacon_add_tim
> 0.10% ksoftirqd/0 [mac80211] [k] codel_should_drop.isra.4.constprop.20
> 0.10% ksoftirqd/0 [mac80211] [k] ieee80211_report_used_skb
> 0.10% ksoftirqd/0 [mac80211] [k] ieee80211_subif_start_xmit
> 0.10% ksoftirqd/0 [mac80211] [k] remove_monitor_info
> 0.10% ksoftirqd/0 [nf_conntrack] [k] nf_ct_get_tuple
> 0.10% ksoftirqd/0 [nf_conntrack] [k] tcp_get_timeouts
> 0.10% ksoftirqd/0 [nf_conntrack_ipv4] [k] ipv4_pkt_to_tuple
> 0.10% ksoftirqd/0 [nf_nat] [k] nf_ct_nat_ext_add
No single function "jumps-out", but the amount of function calls for
sending packets are astonishing, I'm sure that the icache is also
hurting.
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer
More information about the Make-wifi-fast
mailing list