Historic archive of defunct list bloat-devel@lists.bufferbloat.net
* oprofiling is much saner looking now with rc6-smoketest
@ 2011-08-31  0:32 Dave Taht
  2011-08-31  1:01 ` Rick Jones
  2011-08-31  1:41 ` Dave Taht
  0 siblings, 2 replies; 10+ messages in thread
From: Dave Taht @ 2011-08-31  0:32 UTC (permalink / raw)
  To: bloat-devel

I get about 190Mbit/sec from netperf now, on GigE, with oprofiling
enabled, driver buffers of 4, txqueue of 8, cerowrt default iptables
rules, AND web10g patched into kernel 3.0.3.

This is much saner than rc3, and judging from the csum_partial and
copy_user being roughly equal, there isn't much left to be gained...

Nice work.

(Without oprofiling, and without web10g and with tcp cubic I can get
past 250Mbit)
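
(For anyone wanting to reproduce this: the test is a netperf TCP
stream between the laptop and the router, with netserver on the
receiving end. A minimal invocation along these lines - the flags,
direction, and address are illustrative, not a record of this run -
would be:

  netserver                                   # on the receiving end
  netperf -H 172.29.1.33 -t TCP_STREAM -l 60  # on the sending end
)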


CPU: MIPS 24K, speed 0 MHz (estimated)
Counted INSTRUCTIONS events (Instructions completed) with a unit mask
of 0x00 (No unit mask) count 100000
samples  %        app name                 symbol name
-------------------------------------------------------------------------------
17277    13.8798  vmlinux                  csum_partial
  17277    100.000  vmlinux                  csum_partial [self]
-------------------------------------------------------------------------------
16607    13.3415  vmlinux                  __copy_user
  16607    100.000  vmlinux                  __copy_user [self]
-------------------------------------------------------------------------------
11913     9.5705  ip_tables                /ip_tables
  11913    100.000  ip_tables                /ip_tables [self]
-------------------------------------------------------------------------------
8949      7.1893  nf_conntrack             /nf_conntrack
  8949     100.000  nf_conntrack             /nf_conntrack [self]


In this case I was going from laptop - gige - through another
rc6-smoketest router - to_this_box's internal lan port.

It bugs me that iptables and conntrack eat so much cpu for what
is an internal-only connection, i.e. one that
doesn't need conntracking.

That said, I understand that people like their statistics, and me,
I'm trying to make split-tcp work better, ultimately, one day....

I'm going to rerun this without the fw rules next.
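
(Clearing them amounts to something like the following - illustrative;
note that flushing the rules does not unload nf_conntrack, so some
conntrack overhead can remain:

  iptables -F            # flush all rules in the filter table
  iptables -t nat -F     # ... and the nat table
  iptables -t mangle -F  # ... and the mangle table
  iptables -X            # delete user-defined chains
)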

-- 
Dave Täht
SKYPE: davetaht
US Tel: 1-239-829-5608
http://the-edge.blogspot.com


* Re: oprofiling is much saner looking now with rc6-smoketest
  2011-08-31  0:32 oprofiling is much saner looking now with rc6-smoketest Dave Taht
@ 2011-08-31  1:01 ` Rick Jones
  2011-08-31  1:10   ` Simon Barber
  2011-08-31  1:45   ` Dave Taht
  2011-08-31  1:41 ` Dave Taht
  1 sibling, 2 replies; 10+ messages in thread
From: Rick Jones @ 2011-08-31  1:01 UTC (permalink / raw)
  To: Dave Taht; +Cc: bloat-devel

On 08/30/2011 05:32 PM, Dave Taht wrote:
> I get about 190Mbit/sec from netperf now, on GigE, with oprofiling
> enabled, driver buffers of 4, txqueue of 8, cerowrt default iptables
> rules, AND web10g patched into kernel 3.0.3.
>
> This is much saner than rc3, and judging from the csum_partial and
> copy_user being roughly equal, there isn't much left to be gained...
>
> Nice work.
>
> (Without oprofiling, and without web10g and with tcp cubic I can get
> past 250Mbit)
>
>
> CPU: MIPS 24K, speed 0 MHz (estimated)
> Counted INSTRUCTIONS events (Instructions completed) with a unit mask
> of 0x00 (No unit mask) count 100000
> samples  %        app name                 symbol name
> -------------------------------------------------------------------------------
> 17277    13.8798  vmlinux                  csum_partial
>    17277    100.000  vmlinux                  csum_partial [self]
> -------------------------------------------------------------------------------
> 16607    13.3415  vmlinux                  __copy_user
>    16607    100.000  vmlinux                  __copy_user [self]
> -------------------------------------------------------------------------------
> 11913     9.5705  ip_tables                /ip_tables
>    11913    100.000  ip_tables                /ip_tables [self]
> -------------------------------------------------------------------------------
> 8949      7.1893  nf_conntrack             /nf_conntrack
>    8949     100.000  nf_conntrack             /nf_conntrack [self]
>
> In this case I was going from laptop - gige - through another
> rc6-smoketest router - to_this_box's internal lan port.
>
> It bugs me that iptables and conntrack eat so much cpu for what
> is an internal-only connection, i.e. one that
> doesn't need conntracking.

The csum_partial is a bit surprising - I thought every NIC and its dog 
offered CKO these days - or is that something happening with 
ip_tables/conntrack?  I also thought that Linux used an integrated
copy/checksum in at least one direction, or did that go away when CKO 
became prevalent?
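
(A quick way to see what the NIC claims to offload, assuming ethtool
is built for the platform and the usual interface naming:

  ethtool -k eth0   # lists rx/tx checksumming, scatter-gather, tso, etc.
)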

If this is inbound, and there is just plain checksumming and not 
anything funny from conntrack, I would have expected checksum to be much 
larger than copy.  Checksum (in the inbound direction) will take the 
cache misses and the copy would not.  Unless... the data cache of the 
processor is getting completely trashed - say from the netserver running 
on the router not keeping up with the inbound data fully and so the copy 
gets "far away" from the checksum verification.

Does perf/perf_events (whatever the follow-on to perfmon2 is called)
have support for the CPU used in the device?  (Assuming it even has a
PMU to be queried in the first place.)
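
(One quick check, assuming the oprofile userland is installed:

  opcontrol --list-events

should list the 24K hardware events if the PMU is usable, and fall
back to a timer-based list otherwise.)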

> That said, I understand that people like their statistics, and me,
> I'm trying to make split-tcp work better, ultimately, one day....
>
> I'm going to rerun this without the fw rules next.

It would be interesting to see if the csum time goes away.  Long ago and 
far away when I was beating on a 32-core system with aggregate netperf 
TCP_RR and enabling or not FW rules, conntrack had a non-trivial effect 
indeed on performance.

http://markmail.org/message/exjtzel7vq2ugt66#query:netdev%20conntrack%20rick%20jones%2032%20netperf+page:1+mid:s5v5kylvmlfrpb7a+state:results

I think that will get to the start of that thread.  The subject is '32 core
net-next stack/netfilter "scaling"'

rick jones


* Re: oprofiling is much saner looking now with rc6-smoketest
  2011-08-31  1:01 ` Rick Jones
@ 2011-08-31  1:10   ` Simon Barber
  2011-08-31  1:20     ` Simon Barber
  2011-08-31  1:45   ` Dave Taht
  1 sibling, 1 reply; 10+ messages in thread
From: Simon Barber @ 2011-08-31  1:10 UTC (permalink / raw)
  To: bloat-devel

Why is conntrack even getting involved?

Simon

On 08/30/2011 06:01 PM, Rick Jones wrote:
> [snip - full quote of the previous message]


* Re: oprofiling is much saner looking now with rc6-smoketest
  2011-08-31  1:10   ` Simon Barber
@ 2011-08-31  1:20     ` Simon Barber
  0 siblings, 0 replies; 10+ messages in thread
From: Simon Barber @ 2011-08-31  1:20 UTC (permalink / raw)
  To: bloat-devel

Apologies - I should have read the scenario better - the connection is 
terminated on the router.

Simon

On 08/30/2011 06:10 PM, Simon Barber wrote:
> Why is conntrack even getting involved?
>
> Simon
>
> [snip - earlier messages quoted in full]


* Re: oprofiling is much saner looking now with rc6-smoketest
  2011-08-31  0:32 oprofiling is much saner looking now with rc6-smoketest Dave Taht
  2011-08-31  1:01 ` Rick Jones
@ 2011-08-31  1:41 ` Dave Taht
  1 sibling, 0 replies; 10+ messages in thread
From: Dave Taht @ 2011-08-31  1:41 UTC (permalink / raw)
  To: bloat-devel

[-- Attachment #1: Type: text/plain, Size: 2424 bytes --]

This is the same test, repeated, sans firewall rules - all iptables
rules cleared. I get about 220Mbit/sec without oprofile running, and
208Mbit/sec with it running (vs. about 190Mbit/sec with the iptables
rules in the previous run).

As noted earlier, this is with netperf running on the router itself,
with web10g patched in. Web10g supplies some interesting statistics
(attached), and I have tcptrace/xplot.org screenshots of the
previous run here:

http://huchra.bufferbloat.net/~d/rc6-smoke-captures/

which are also interesting. There is a knee at 30 seconds and other
somewhat odd-looking behavior when you zoom in.
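
(The capture/plot pipeline, roughly - filenames illustrative:

  tcpdump -s 128 -w run.cap   # capture during the netperf run
  tcptrace -G run.cap         # emit xplot-format graph files
  xplot.org a2b_tsg.xpl       # view the time-sequence graph
)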

I further note that the laptop driving this test (via a gige pcmcia
card) has a default txqueuelen of 1000 (I don't presently know the
length of its dma tx ring), and is running the stock Ubuntu 11.04
kernel (2.6.38-11-generic) with ecn, sack, and dsack enabled.
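
(In sysctl terms, assuming the stock knob names, that's:

  sysctl -w net.ipv4.tcp_ecn=1
  sysctl -w net.ipv4.tcp_sack=1
  sysctl -w net.ipv4.tcp_dsack=1
)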

The best performance I've gotten from the laptop to a pentium 4 box
was 290Mbit, so I'm pretty happy with 220Mbit. I wanted to have a
baseline value before I started fiddling with vlan and dscp stuff....


CPU: MIPS 24K, speed 0 MHz (estimated)
Counted INSTRUCTIONS events (Instructions completed) with a unit mask
of 0x00 (No unit mask) count 100000
samples  %        app name                 symbol name
-------------------------------------------------------------------------------
17141    14.8045  vmlinux                  csum_partial
  17141    100.000  vmlinux                  csum_partial [self]
-------------------------------------------------------------------------------
17024    14.7035  vmlinux                  __copy_user
  17024    100.000  vmlinux                  __copy_user [self]
-------------------------------------------------------------------------------
8888      7.6765  nf_conntrack             /nf_conntrack
  8888     100.000  nf_conntrack             /nf_conntrack [self]
-------------------------------------------------------------------------------
4139      3.5748  vmlinux                  __do_softirq
  4139     100.000  vmlinux                  __do_softirq [self]
-------------------------------------------------------------------------------
4055      3.5023  ip_tables                /ip_tables
  4055     100.000  ip_tables                /ip_tables [self]



-- 
Dave Täht
SKYPE: davetaht
US Tel: 1-239-829-5608
http://the-edge.blogspot.com

[-- Attachment #2: estats.stuff.txt --]
[-- Type: text/plain, Size: 11288 bytes --]

Connection 64 (172.29.1.33_22 172.29.1.123_51581)
    LocalAddressType    : 1
    LocalAddress        : 172.29.1.33
    LocalPort           : 22
    RemAddressType      : 1
    RemAddress          : 172.29.1.123
    RemPort             : 51581
    SegsOut             : 2668
    DataSegsOut         : 2668
    DataOctetsOut       : 493373
    SegsRetrans         : 3
    OctetsRetrans       : 208
    SegsIn              : 4272
    DataSegsIn          : 1885
    DataOctetsIn        : 93215
    ElapsedSecs         : 6618
    ElapsedMicroSecs    : 723887
    CurMSS              : 1460
    PipeSize            : 0
    MaxPipeSize         : 4380
    SmoothedRTT         : 10
    CurRTO              : 210
    CongSignals         : 0
    CurCwnd             : 8760
    CurSsthresh         : 4294967295
    Timeouts            : 0
    CurRwinSent         : 16888
    MaxRwinSent         : 16888
    ZeroRwinSent        : 0
    CurRwinRcvd         : 293248
    MaxRwinRcvd         : 293248
    ZeroRwinRcvd        : 0
    SndLimTransRwin     : 0
    SndLimTransCwnd     : 1
    SndLimTransSnd      : 2
    SndLimTimeRwin      : 0
    SndLimTimeCwnd      : 400477716
    SndLimTimeSnd       : 4073612272
    SendStall           : 0
    RetranThresh        : 3
    NonRecovDAEpisodes  : 1
    SumOctetsReordered  : 256
    NonRecovDA          : 0
    SampleRTT           : 0
    RTTVar              : 50
    MaxRTT              : 40
    MinRTT              : 0
    SumRTT              : 9160
    CountRTT            : 2507
    MaxRTO              : 230
    MinRTO              : 210
    IpTtl               : 64
    IpTosIn             : 16
    IpTosOut            : 0
    PreCongSumCwnd      : 0
    PreCongSumRTT       : 0
    PostCongSumRTT      : 0
    PostCongCountRTT    : 0
    ECNsignals          : 0
    DupAckEpisodes      : 0
    RcvRTT              : 3134114
    DupAcksOut          : 0
    CERcvd              : 0
    ECESent             : 0
    ActiveOpen          : 0
    MSSSent             : 1460
    MSSRcvd             : 1460
    WinScaleSent        : 1
    WinScaleRcvd        : 7
    TimeStamps          : 2
    ECN                 : 3
    WillSendSACK        : 1
    WillUseSACK         : 1
    State               : 1953653102
    Nagle               : 1
    MaxSsCwnd           : 14600
    MaxCaCwnd           : 10220
    MaxSsthresh         : 5840
    MinSsthresh         : 2920
    InRecovery          : 2
    DupAcksIn           : 0
    SpuriousFrDetected  : 0
    SpuriousRtoDetected : 0
    SoftErrors          : 0
    SoftErrorReason     : 0
    SlowStart           : 2
    CongAvoid           : 41
    OtherReductions     : 0
    CongOverCount       : 0
    FastRetran          : 0
    SubsequentTimeouts  : 0
    CurTimeoutCount     : 0
    AbruptTimeouts      : 0
    SACKsRcvd           : 0
    SACKBlocksRcvd      : 0
    DSACKDups           : 0
    MaxMSS              : 1460
    MinMSS              : 1440
    SndInitial          : 2230931079
    RecInitial          : 1469570772
    CurRetxQueue        : 0
    MaxRetxQueue        : 0
    CurReasmQueue       : 0
    MaxReasmQueue       : 0
    SndUna              : 2231370872
    SndNxt              : 2231370872
    SndMax              : 2231370872
    ThruOctetsAcked     : 439793
    RcvNxt              : 1469663891
    ThruOctetsReceived  : 93119
    CurAppWQueue        : 0
    MaxAppWQueue        : 0
    CurAppRQueue        : 0
    MaxAppRQueue        : 1144
    LimCwnd             : 4294965836
    LimSsthresh         : 0
    LimRwin             : 64075
    LimMSS              : 95682560
    OtherReductionsCV   : 0
    OtherReductionsCM   : 0
    StartTimeSecs       : 1314747615
    StartTimeMicroSecs  : 690002
    Sndbuf              : 16384
    Rcvbuf              : 87380
Connection 56 (172.29.1.97_22 172.29.1.123_52588)
    LocalAddressType    : 1
    LocalAddress        : 172.29.1.97
    LocalPort           : 22
    RemAddressType      : 1
    RemAddress          : 172.29.1.123
    RemPort             : 52588
    SegsOut             : 1241
    DataSegsOut         : 1241
    DataOctetsOut       : 110645
    SegsRetrans         : 0
    OctetsRetrans       : 0
    SegsIn              : 2000
    DataSegsIn          : 914
    DataOctetsIn        : 45711
    ElapsedSecs         : 4990
    ElapsedMicroSecs    : 629033
    CurMSS              : 1460
    PipeSize            : 0
    MaxPipeSize         : 1460
    SmoothedRTT         : 10
    CurRTO              : 210
    CongSignals         : 0
    CurCwnd             : 14600
    CurSsthresh         : 4294965836
    Timeouts            : 0
    CurRwinSent         : 16888
    MaxRwinSent         : 16888
    ZeroRwinSent        : 0
    CurRwinRcvd         : 64128
    MaxRwinRcvd         : 64128
    ZeroRwinRcvd        : 0
    SndLimTransRwin     : 0
    SndLimTransCwnd     : 0
    SndLimTransSnd      : 1
    SndLimTimeRwin      : 0
    SndLimTimeCwnd      : 0
    SndLimTimeSnd       : 4171379644
    SendStall           : 0
    RetranThresh        : 3
    NonRecovDAEpisodes  : 0
    SumOctetsReordered  : 0
    NonRecovDA          : 0
    SampleRTT           : 0
    RTTVar              : 50
    MaxRTT              : 120
    MinRTT              : 0
    SumRTT              : 3280
    CountRTT            : 1114
    MaxRTO              : 230
    MinRTO              : 210
    IpTtl               : 64
    IpTosIn             : 16
    IpTosOut            : 0
    PreCongSumCwnd      : 0
    PreCongSumRTT       : 0
    PostCongSumRTT      : 0
    PostCongCountRTT    : 0
    ECNsignals          : 0
    DupAckEpisodes      : 0
    RcvRTT              : 1532633
    DupAcksOut          : 0
    CERcvd              : 0
    ECESent             : 0
    ActiveOpen          : 0
    MSSSent             : 1460
    MSSRcvd             : 1460
    WinScaleSent        : 1
    WinScaleRcvd        : 7
    TimeStamps          : 2
    ECN                 : 3
    WillSendSACK        : 1
    WillUseSACK         : 1
    State               : 1953653102
    Nagle               : 1
    MaxSsCwnd           : 14600
    MaxCaCwnd           : 0
    MaxSsthresh         : 0
    MinSsthresh         : 4294967295
    InRecovery          : 2
    DupAcksIn           : 0
    SpuriousFrDetected  : 0
    SpuriousRtoDetected : 0
    SoftErrors          : 0
    SoftErrorReason     : 0
    SlowStart           : 0
    CongAvoid           : 0
    OtherReductions     : 0
    CongOverCount       : 0
    FastRetran          : 0
    SubsequentTimeouts  : 0
    CurTimeoutCount     : 0
    AbruptTimeouts      : 0
    SACKsRcvd           : 0
    SACKBlocksRcvd      : 0
    DSACKDups           : 0
    MaxMSS              : 1460
    MinMSS              : 1440
    SndInitial          : 1838946708
    RecInitial          : 749866429
    CurRetxQueue        : 0
    MaxRetxQueue        : 0
    CurReasmQueue       : 0
    MaxReasmQueue       : 0
    SndUna              : 1839032533
    SndNxt              : 1839032533
    SndMax              : 1839032533
    ThruOctetsAcked     : 85825
    RcvNxt              : 749912140
    ThruOctetsReceived  : 45711
    CurAppWQueue        : 0
    MaxAppWQueue        : 0
    CurAppRQueue        : 0
    MaxAppRQueue        : 1144
    LimCwnd             : 4294965836
    LimSsthresh         : 0
    LimRwin             : 64075
    LimMSS              : 95682560
    OtherReductionsCV   : 0
    OtherReductionsCM   : 0
    StartTimeSecs       : 1314742638
    StartTimeMicroSecs  : 218763
    Sndbuf              : 16384
    Rcvbuf              : 87380
Connection 0 (172.29.1.97_22 172.29.1.123_55029)
    LocalAddressType    : 1
    LocalAddress        : 172.29.1.97
    LocalPort           : 22
    RemAddressType      : 1
    RemAddress          : 172.29.1.123
    RemPort             : 55029
    SegsOut             : 2927
    DataSegsOut         : 2927
    DataOctetsOut       : 563021
    SegsRetrans         : 0
    OctetsRetrans       : 0
    SegsIn              : 4394
    DataSegsIn          : 1820
    DataOctetsIn        : 93439
    ElapsedSecs         : 65542
    ElapsedMicroSecs    : 507312
    CurMSS              : 1460
    PipeSize            : 0
    MaxPipeSize         : 4380
    SmoothedRTT         : 20
    CurRTO              : 220
    CongSignals         : 0
    CurCwnd             : 14600
    CurSsthresh         : 4294965836
    Timeouts            : 0
    CurRwinSent         : 22712
    MaxRwinSent         : 22712
    ZeroRwinSent        : 0
    CurRwinRcvd         : 357760
    MaxRwinRcvd         : 357760
    ZeroRwinRcvd        : 0
    SndLimTransRwin     : 0
    SndLimTransCwnd     : 0
    SndLimTransSnd      : 1
    SndLimTimeRwin      : 0
    SndLimTimeCwnd      : 0
    SndLimTimeSnd       : 2439344612
    SendStall           : 0
    RetranThresh        : 3
    NonRecovDAEpisodes  : 0
    SumOctetsReordered  : 0
    NonRecovDA          : 0
    SampleRTT           : 0
    RTTVar              : 50
    MaxRTT              : 50
    MinRTT              : 0
    SumRTT              : 12590
    CountRTT            : 2722
    MaxRTO              : 230
    MinRTO              : 210
    IpTtl               : 64
    IpTosIn             : 16
    IpTosOut            : 0
    PreCongSumCwnd      : 0
    PreCongSumRTT       : 0
    PostCongSumRTT      : 0
    PostCongCountRTT    : 0
    ECNsignals          : 0
    DupAckEpisodes      : 0
    RcvRTT              : 644526
    DupAcksOut          : 0
    CERcvd              : 0
    ECESent             : 0
    ActiveOpen          : 0
    MSSSent             : 1460
    MSSRcvd             : 1460
    WinScaleSent        : 1
    WinScaleRcvd        : 7
    TimeStamps          : 2
    ECN                 : 3
    WillSendSACK        : 1
    WillUseSACK         : 1
    State               : 1953653102
    Nagle               : 1
    MaxSsCwnd           : 14600
    MaxCaCwnd           : 0
    MaxSsthresh         : 0
    MinSsthresh         : 4294967295
    InRecovery          : 2
    DupAcksIn           : 7
    SpuriousFrDetected  : 0
    SpuriousRtoDetected : 0
    SoftErrors          : 7
    SoftErrorReason     : 1
    SlowStart           : 0
    CongAvoid           : 0
    OtherReductions     : 0
    CongOverCount       : 0
    FastRetran          : 0
    SubsequentTimeouts  : 0
    CurTimeoutCount     : 0
    AbruptTimeouts      : 0
    SACKsRcvd           : 0
    SACKBlocksRcvd      : 0
    DSACKDups           : 0
    MaxMSS              : 1460
    MinMSS              : 1440
    SndInitial          : 1018333932
    RecInitial          : 1871301081
    CurRetxQueue        : 0
    MaxRetxQueue        : 0
    CurReasmQueue       : 0
    MaxReasmQueue       : 0
    SndUna              : 1018838413
    SndNxt              : 1018838413
    SndMax              : 1018838413
    ThruOctetsAcked     : 504481
    RcvNxt              : 1871394520
    ThruOctetsReceived  : 93439
    CurAppWQueue        : 0
    MaxAppWQueue        : 0
    CurAppRQueue        : 0
    MaxAppRQueue        : 1456
    LimCwnd             : 4294965836
    LimSsthresh         : 0
    LimRwin             : 64075
    LimMSS              : 95682560
    OtherReductionsCV   : 0
    OtherReductionsCM   : 0
    StartTimeSecs       : 1314684283
    StartTimeMicroSecs  : 261134
    Sndbuf              : 16384
    Rcvbuf              : 87380


* Re: oprofiling is much saner looking now with rc6-smoketest
  2011-08-31  1:01 ` Rick Jones
  2011-08-31  1:10   ` Simon Barber
@ 2011-08-31  1:45   ` Dave Taht
  2011-08-31  1:58     ` Dave Taht
  2011-08-31 15:55     ` Rick Jones
  1 sibling, 2 replies; 10+ messages in thread
From: Dave Taht @ 2011-08-31  1:45 UTC (permalink / raw)
  To: Rick Jones; +Cc: bloat-devel

On Tue, Aug 30, 2011 at 6:01 PM, Rick Jones <rick.jones2@hp.com> wrote:
> On 08/30/2011 05:32 PM, Dave Taht wrote:

>> It bugs me that iptables and conntrack eat so much cpu for what
>> is an internal-only connection, i.e. one that
>> doesn't need conntracking.
>
> The csum_partial is a bit surprising - I thought every NIC and its dog
> offered CKO these days - or is that something happening with
> ip_tables/conntrack?

If this chipset supports it, so far as I know that support is neither
documented nor implemented in the driver.

> I also thought that Linux used an integrated
> copy/checksum in at least one direction, or did that go away when CKO became
> prevalent?

Don't know.

>
> If this is inbound, and there is just plain checksumming and not anything
> funny from conntrack, I would have expected checksum to be much larger than
> copy.  Checksum (in the inbound direction) will take the cache misses and
> the copy would not.  Unless... the data cache of the processor is getting
> completely trashed - say from the netserver running on the router not
> keeping up with the inbound data fully and so the copy gets "far away" from
> the checksum verification.

220Mbit isn't good enough for ya? Previous tests ran at about 140Mbit,
before some major optimizations by felix fixed a bunch of mis-alignment
issues. Through the router, I've seen 260Mbit - which is perilously
close to the speed that I can drive it at from the test boxes.

>
> Does perf/perf_events (whatever the follow-on to perfmon2 is called)
> have support for the CPU used in the device?  (Assuming it even has a
> PMU to be queried in the first place.)

Yes, but I don't think it's enabled. The CPU is running flat out,
according to top.

>
>> That said, I understand that people like their statistics, and me,
>> I'm trying to make split-tcp work better, ultimately, one day....
>>
>> I'm going to rerun this without the fw rules next.
>
> It would be interesting to see if the csum time goes away.  Long ago and far
> away when I was beating on a 32-core system with aggregate netperf TCP_RR
> and enabling or not FW rules, conntrack had a non-trivial effect indeed on
> performance.

Stays about the same. The iptables time drops. How do you disable
conntrack? Don't you really only need it for nat?
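
(The raw table's NOTRACK target looks like one answer - untested here.
It exempts packets from connection tracking before conntrack ever sees
them; interface name illustrative:

  iptables -t raw -A PREROUTING -i eth0 -j NOTRACK
  iptables -t raw -A OUTPUT -o eth0 -j NOTRACK

NAT does require conntrack, so this only makes sense for un-natted
internal traffic.)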

>
> http://markmail.org/message/exjtzel7vq2ugt66#query:netdev%20conntrack%20rick%20jones%2032%20netperf+page:1+mid:s5v5kylvmlfrpb7a+state:results
>
> I think that will get to the start of that thread.  The subject is '32 core
> net-next stack/netfilter "scaling"'
>
> rick jones
>



-- 
Dave Täht
SKYPE: davetaht
US Tel: 1-239-829-5608
http://the-edge.blogspot.com


* Re: oprofiling is much saner looking now with rc6-smoketest
  2011-08-31  1:45   ` Dave Taht
@ 2011-08-31  1:58     ` Dave Taht
  2011-08-31  3:28       ` Dave Taht
  2011-08-31 15:55     ` Rick Jones
  1 sibling, 1 reply; 10+ messages in thread
From: Dave Taht @ 2011-08-31  1:58 UTC (permalink / raw)
  To: Rick Jones; +Cc: bloat-devel

I have put the current rc6 smoketest up at:

http://huchra.bufferbloat.net/~cero1/rc6-smoketest/

So far it's proving very stable. Wireless performance is excellent and
wired performance is dramatically improved. No crash bugs thus far,
though I had a scare...

For the final rc6, which I hope to have done by friday, I'm in the
process of cleanly re-assembling the patch set (sorry, the sources are
a bit of a mess at present). For this rc I'm hoping, in particular,
that a new iptables lands, and I have numerous other little things in
the queue to sort out.

All that said, getting oprofile running is not hard, and I do
appreciate smoke testers helping out!!! (I don't think I'll be able
to get another release candidate done before linux plumbers.)

# install the correct image on your router from the above, via the web
# interface or sysupgrade -n, then reboot
# edit /etc/opkg.conf to have that url in it
opkg update
opkg install oprofile
cd /tmp
mkdir /tmp/oprofile
wget http://huchra.bufferbloat.net/~d/rc6-smoke-captures/vmlinux
# keep the session dir in /tmp - saving profile data to flash is a bad idea
opcontrol --vmlinux=/tmp/vmlinux --session-dir=/tmp/oprofile

opcontrol --start
# do your testing
opcontrol --dump

opreport -c  # or whatever options you like


On Tue, Aug 30, 2011 at 6:45 PM, Dave Taht <dave.taht@gmail.com> wrote:
> [snip - previous exchange quoted in full]



-- 
Dave Täht
SKYPE: davetaht
US Tel: 1-239-829-5608
http://the-edge.blogspot.com


* Re: oprofiling is much saner looking now with rc6-smoketest
  2011-08-31  1:58     ` Dave Taht
@ 2011-08-31  3:28       ` Dave Taht
  2011-08-31 16:19         ` Rick Jones
  0 siblings, 1 reply; 10+ messages in thread
From: Dave Taht @ 2011-08-31  3:28 UTC (permalink / raw)
  To: Rick Jones; +Cc: bloat-devel

I took a little more time out to play with netperf at these extreme
performance values, while puzzled about the performance knee observed
midway through the previous tests.

The three tests runs this evening (and captures!) are up at:

http://huchra.bufferbloat.net/~d/rc6-smoke-captures/

For test 3, I rebooted the router into its default tx ring (64), and
set a txqueuelen of 128, running cubic...
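
(i.e. something along these lines, with the interface name
illustrative:

  ip link set dev eth0 txqueuelen 128
  echo cubic > /proc/sys/net/ipv4/tcp_congestion_control
)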

Measured throughput was mildly better (admittedly on a fresh boot,
with oprofile not even loaded) at 229Mbit, and we didn't have a
drop-off at all, so I'm still chasing that...

What I found interesting was the 10-second periodicity of the
drop-offs. My assumption is that this is a timer being fired from
somewhere (netperf?) that blocks transmission...

http://huchra.bufferbloat.net/~d/rc6-smoke-captures/txqueuelen128and10seconddropcycle.png

Test 4 will repeat the above sans oprofile, with the current default
cerowrt settings for dma tx (4) and a txqueuelen of 8 - if I get to
it tonight.

On Tue, Aug 30, 2011 at 6:58 PM, Dave Taht <dave.taht@gmail.com> wrote:
> [snip - previous messages quoted in full]



-- 
Dave Täht
SKYPE: davetaht
US Tel: 1-239-829-5608
http://the-edge.blogspot.com


* Re: oprofiling is much saner looking now with rc6-smoketest
  2011-08-31  1:45   ` Dave Taht
  2011-08-31  1:58     ` Dave Taht
@ 2011-08-31 15:55     ` Rick Jones
  1 sibling, 0 replies; 10+ messages in thread
From: Rick Jones @ 2011-08-31 15:55 UTC (permalink / raw)
  To: Dave Taht; +Cc: bloat-devel


>> If this is inbound, and there is just plain checksumming and not anything
>> funny from conntrack, I would have expected checksum to be much larger than
>> copy.  Checksum (in the inbound direction) will take the cache misses and
>> the copy would not.  Unless... the data cache of the processor is getting
>> completely trashed - say from the netserver running on the router not
>> keeping up with the inbound data fully and so the copy gets "far away" from
>> the checksum verification.
>
> 220Mbit isn't good enough for ya? Previous tests ran at about 140Mbit, but due
> to some major optimizations by felix to fix a bunch of mis-alignment
> issues. Through the router, I've seen 260Mbit - which is perilously
> close to the speed that I can drive it at from the test boxes.

It is all a question of context.  The last time I was in a context where 
220 Mbit/s was high speed was when 100 BT first shipped or perhaps FDDI 
before that :)

>> Does perf/perf_events (whatever the followon to perfmon2 is called) have
>> support for the CPU used in the device?  (Assuming it even has a PMU to be
>> queried in the first place)
>
> Yes. Don't think it's enabled. It is running flat out, according to top.

Well, flat-out as far as the basic OS utilities can tell.  Stalled
hardware manifests as CPU time consumed in something like top even
though the processor may be sitting "idle" (in its context), twiddling
its thumbs waiting on cache misses.  Hence the question about PMU
support.
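
(With working PMU support, something like

  perf stat -e cycles,instructions netperf -H <router> -t TCP_STREAM

would show instructions-per-cycle and make such stalls visible -
assuming the generic perf events are wired up for the 24K in that
kernel.)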

rick jones


* Re: oprofiling is much saner looking now with rc6-smoketest
  2011-08-31  3:28       ` Dave Taht
@ 2011-08-31 16:19         ` Rick Jones
  0 siblings, 0 replies; 10+ messages in thread
From: Rick Jones @ 2011-08-31 16:19 UTC (permalink / raw)
  To: Dave Taht; +Cc: bloat-devel

On 08/30/2011 08:28 PM, Dave Taht wrote:
> What I found interesting was the 10 second periodicity of the
> drop-offs. My assumption is that this is a timer being fired from
> somewhere (netperf?) that blocks the transmission...

The only timer that netperf would fire off in the middle of a run
would be if netperf were ./configure'd with --enable-intervals and one
set a burst interval of 10 seconds with some combination of the global
-b <burst length> and -w <burst interval> options.
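
For illustration, and assuming an --enable-intervals build, such a run
would look something like:

  netperf -H <router> -t TCP_STREAM -l 60 -b 4 -w 10000

with -w, as I recall, taking the interval in milliseconds.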

happy benchmarking,

rick jones


end of thread

Thread overview: 10+ messages
2011-08-31  0:32 oprofiling is much saner looking now with rc6-smoketest Dave Taht
2011-08-31  1:01 ` Rick Jones
2011-08-31  1:10   ` Simon Barber
2011-08-31  1:20     ` Simon Barber
2011-08-31  1:45   ` Dave Taht
2011-08-31  1:58     ` Dave Taht
2011-08-31  3:28       ` Dave Taht
2011-08-31 16:19         ` Rick Jones
2011-08-31 15:55     ` Rick Jones
2011-08-31  1:41 ` Dave Taht
