* oprofiling is much saner looking now with rc6-smoketest
From: Dave Taht @ 2011-08-31 0:32 UTC
To: bloat-devel
I get about 190Mbit/sec from netperf now, on GigE, with oprofiling
enabled, driver buffers of 4, txqueue of 8, cerowrt default iptables
rules, AND web10g patched into kernel 3.0.3.
This is much saner than rc3, and judging from the csum_partial and
copy_user being roughly equal, there isn't much left to be gained...
Nice work.
(Without oprofiling, and without web10g and with tcp cubic I can get
past 250Mbit)
CPU: MIPS 24K, speed 0 MHz (estimated)
Counted INSTRUCTIONS events (Instructions completed) with a unit mask
of 0x00 (No unit mask) count 100000
samples  %        app name      symbol name
-------------------------------------------------------------------------------
17277    13.8798  vmlinux       csum_partial
  17277    100.000  vmlinux       csum_partial [self]
-------------------------------------------------------------------------------
16607    13.3415  vmlinux       __copy_user
  16607    100.000  vmlinux       __copy_user [self]
-------------------------------------------------------------------------------
11913     9.5705  ip_tables     /ip_tables
  11913    100.000  ip_tables     /ip_tables [self]
-------------------------------------------------------------------------------
8949      7.1893  nf_conntrack  /nf_conntrack
  8949     100.000  nf_conntrack  /nf_conntrack [self]
In this case I was going from laptop - gige - through another
rc6-smoketest router - to_this_box's internal lan port.
It bugs me that iptables and conntrack eat so much cpu for what
is an internal-only connection, i.e. one that doesn't need conntracking.
That said, I understand that people like their statistics, and me,
I'm trying to make split-tcp work better, ultimately, one day....
I'm going to rerun this without the fw rules next.
--
Dave Täht
SKYPE: davetaht
US Tel: 1-239-829-5608
http://the-edge.blogspot.com
* Re: oprofiling is much saner looking now with rc6-smoketest
From: Rick Jones @ 2011-08-31 1:01 UTC
To: Dave Taht; +Cc: bloat-devel
On 08/30/2011 05:32 PM, Dave Taht wrote:
> I get about 190Mbit/sec from netperf now, on GigE, with oprofiling
> enabled, driver buffers of 4, txqueue of 8, cerowrt default iptables
> rules, AND web10g patched into kernel 3.0.3.
>
> This is much saner than rc3, and judging from the csum_partial and
> copy_user being roughly equal, there isn't much left to be gained...
>
> Nice work.
>
> (Without oprofiling, and without web10g and with tcp cubic I can get
> past 250Mbit)
>
>
> CPU: MIPS 24K, speed 0 MHz (estimated)
> Counted INSTRUCTIONS events (Instructions completed) with a unit mask
> of 0x00 (No unit mask) count 100000
> samples % app name symbol name
> -------------------------------------------------------------------------------
> 17277 13.8798 vmlinux csum_partial
> 17277 100.000 vmlinux csum_partial [self]
> -------------------------------------------------------------------------------
> 16607 13.3415 vmlinux __copy_user
> 16607 100.000 vmlinux __copy_user [self]
> -------------------------------------------------------------------------------
> 11913 9.5705 ip_tables /ip_tables
> 11913 100.000 ip_tables /ip_tables [self]
> -------------------------------------------------------------------------------
> 8949 7.1893 nf_conntrack /nf_conntrack
> 8949 100.000 nf_conntrack /nf_conntrack [self]
>
> In this case I was going from laptop - gige - through another
> rc6-smoketest router - to_this_box's internal lan port.
>
> It bugs me that iptables and conntrack eat so much cpu for what
> is an internal-only connection, e.g. one that
> doesn't need conntracking.
The csum_partial is a bit surprising - I thought every NIC and its dog
offered CKO these days - or is that something happening with
ip_tables/conntrack? I also thought that Linux used an integrated
copy/checksum in at least one direction, or did that go away when CKO
became prevalent?
If this is inbound, and there is just plain checksumming and not
anything funny from conntrack, I would have expected checksum to be much
larger than copy. Checksum (in the inbound direction) will take the
cache misses and the copy would not. Unless... the data cache of the
processor is getting completely trashed - say from the netserver running
on the router not keeping up with the inbound data fully and so the copy
gets "far away" from the checksum verification.
Does perf/perf_events (whatever the follow-on to perfmon2 is called) have
support for the CPU used in the device? (Assuming it even has a PMU to
be queried in the first place.)
> That said, I understand that people like their statistics, and me,
> I'm trying to make split-tcp work better, ultimately, one day....
>
> I'm going to rerun this without the fw rules next.
It would be interesting to see if the csum time goes away. Long ago and
far away when I was beating on a 32-core system with aggregate netperf
TCP_RR and enabling or not FW rules, conntrack had a non-trivial effect
indeed on performance.
http://markmail.org/message/exjtzel7vq2ugt66#query:netdev%20conntrack%20rick%20jones%2032%20netperf+page:1+mid:s5v5kylvmlfrpb7a+state:results
I think that will get you to the start of that thread. The subject is '32 core
net-next stack/netfilter "scaling"'
rick jones
* Re: oprofiling is much saner looking now with rc6-smoketest
From: Simon Barber @ 2011-08-31 1:10 UTC
To: bloat-devel
Why is conntrack even getting involved?
Simon
On 08/30/2011 06:01 PM, Rick Jones wrote:
> On 08/30/2011 05:32 PM, Dave Taht wrote:
>> I get about 190Mbit/sec from netperf now, on GigE, with oprofiling
>> enabled, driver buffers of 4, txqueue of 8, cerowrt default iptables
>> rules, AND web10g patched into kernel 3.0.3.
>>
>> This is much saner than rc3, and judging from the csum_partial and
>> copy_user being roughly equal, there isn't much left to be gained...
>>
>> Nice work.
>>
>> (Without oprofiling, and without web10g and with tcp cubic I can get
>> past 250Mbit)
>>
>>
>> CPU: MIPS 24K, speed 0 MHz (estimated)
>> Counted INSTRUCTIONS events (Instructions completed) with a unit mask
>> of 0x00 (No unit mask) count 100000
>> samples % app name symbol name
>> -------------------------------------------------------------------------------
>>
>> 17277 13.8798 vmlinux csum_partial
>> 17277 100.000 vmlinux csum_partial [self]
>> -------------------------------------------------------------------------------
>>
>> 16607 13.3415 vmlinux __copy_user
>> 16607 100.000 vmlinux __copy_user [self]
>> -------------------------------------------------------------------------------
>>
>> 11913 9.5705 ip_tables /ip_tables
>> 11913 100.000 ip_tables /ip_tables [self]
>> -------------------------------------------------------------------------------
>>
>> 8949 7.1893 nf_conntrack /nf_conntrack
>> 8949 100.000 nf_conntrack /nf_conntrack [self]
>>
>> In this case I was going from laptop - gige - through another
>> rc6-smoketest router - to_this_box's internal lan port.
>>
>> It bugs me that iptables and conntrack eat so much cpu for what
>> is an internal-only connection, e.g. one that
>> doesn't need conntracking.
>
> The csum_partial is a bit surprising - I thought every NIC and its dog
> offered CKO these days - or is that something happening with
> ip_tables/contrack? I also thought that Linux used an integrated
> copy/checksum in at least one direction, or did that go away when CKO
> became prevalent?
>
> If this is inbound, and there is just plain checksumming and not
> anything funny from conntrack, I would have expected checksum to be much
> larger than copy. Checksum (in the inbound direction) will take the
> cache misses and the copy would not. Unless... the data cache of the
> processor is getting completely trashed - say from the netserver running
> on the router not keeping up with the inbound data fully and so the copy
> gets "far away" from the checksum verification.
>
> Does perf/perf_events (whatever the followon to perfmon2 is called) have
> support for the CPU used in the device? (Assuming it even has a PMU to
> be queried in the first place)
>
>> That said, I understand that people like their statistics, and me,
>> I'm trying to make split-tcp work better, ultimately, one day....
>>
>> I'm going to rerun this without the fw rules next.
>
> It would be interesting to see if the csum time goes away. Long ago and
> far away when I was beating on a 32-core system with aggregate netperf
> TCP_RR and enabling or not FW rules, conntrack had a non-trivial effect
> indeed on performance.
>
> http://markmail.org/message/exjtzel7vq2ugt66#query:netdev%20conntrack%20rick%20jones%2032%20netperf+page:1+mid:s5v5kylvmlfrpb7a+state:results
>
>
> I think will get to the start of that thread. The subject is '32 core
> net-next stack/netfilter "scaling"'
>
> rick jones
> _______________________________________________
> Bloat-devel mailing list
> Bloat-devel@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/bloat-devel
* Re: oprofiling is much saner looking now with rc6-smoketest
From: Simon Barber @ 2011-08-31 1:20 UTC
To: bloat-devel
Apologies - I should have read the scenario better - the connection is
terminated on the router.
Simon
On 08/30/2011 06:10 PM, Simon Barber wrote:
> Why is conntrack even getting involved?
>
> Simon
>
> On 08/30/2011 06:01 PM, Rick Jones wrote:
>> On 08/30/2011 05:32 PM, Dave Taht wrote:
>>> I get about 190Mbit/sec from netperf now, on GigE, with oprofiling
>>> enabled, driver buffers of 4, txqueue of 8, cerowrt default iptables
>>> rules, AND web10g patched into kernel 3.0.3.
>>>
>>> This is much saner than rc3, and judging from the csum_partial and
>>> copy_user being roughly equal, there isn't much left to be gained...
>>>
>>> Nice work.
>>>
>>> (Without oprofiling, and without web10g and with tcp cubic I can get
>>> past 250Mbit)
>>>
>>>
>>> CPU: MIPS 24K, speed 0 MHz (estimated)
>>> Counted INSTRUCTIONS events (Instructions completed) with a unit mask
>>> of 0x00 (No unit mask) count 100000
>>> samples % app name symbol name
>>> -------------------------------------------------------------------------------
>>>
>>>
>>> 17277 13.8798 vmlinux csum_partial
>>> 17277 100.000 vmlinux csum_partial [self]
>>> -------------------------------------------------------------------------------
>>>
>>>
>>> 16607 13.3415 vmlinux __copy_user
>>> 16607 100.000 vmlinux __copy_user [self]
>>> -------------------------------------------------------------------------------
>>>
>>>
>>> 11913 9.5705 ip_tables /ip_tables
>>> 11913 100.000 ip_tables /ip_tables [self]
>>> -------------------------------------------------------------------------------
>>>
>>>
>>> 8949 7.1893 nf_conntrack /nf_conntrack
>>> 8949 100.000 nf_conntrack /nf_conntrack [self]
>>>
>>> In this case I was going from laptop - gige - through another
>>> rc6-smoketest router - to_this_box's internal lan port.
>>>
>>> It bugs me that iptables and conntrack eat so much cpu for what
>>> is an internal-only connection, e.g. one that
>>> doesn't need conntracking.
>>
>> The csum_partial is a bit surprising - I thought every NIC and its dog
>> offered CKO these days - or is that something happening with
>> ip_tables/contrack? I also thought that Linux used an integrated
>> copy/checksum in at least one direction, or did that go away when CKO
>> became prevalent?
>>
>> If this is inbound, and there is just plain checksumming and not
>> anything funny from conntrack, I would have expected checksum to be much
>> larger than copy. Checksum (in the inbound direction) will take the
>> cache misses and the copy would not. Unless... the data cache of the
>> processor is getting completely trashed - say from the netserver running
>> on the router not keeping up with the inbound data fully and so the copy
>> gets "far away" from the checksum verification.
>>
>> Does perf/perf_events (whatever the followon to perfmon2 is called) have
>> support for the CPU used in the device? (Assuming it even has a PMU to
>> be queried in the first place)
>>
>>> That said, I understand that people like their statistics, and me,
>>> I'm trying to make split-tcp work better, ultimately, one day....
>>>
>>> I'm going to rerun this without the fw rules next.
>>
>> It would be interesting to see if the csum time goes away. Long ago and
>> far away when I was beating on a 32-core system with aggregate netperf
>> TCP_RR and enabling or not FW rules, conntrack had a non-trivial effect
>> indeed on performance.
>>
>> http://markmail.org/message/exjtzel7vq2ugt66#query:netdev%20conntrack%20rick%20jones%2032%20netperf+page:1+mid:s5v5kylvmlfrpb7a+state:results
>>
>>
>>
>> I think will get to the start of that thread. The subject is '32 core
>> net-next stack/netfilter "scaling"'
>>
>> rick jones
* Re: oprofiling is much saner looking now with rc6-smoketest
From: Dave Taht @ 2011-08-31 1:41 UTC
To: bloat-devel
This is the same test, repeated, sans firewall rules - all iptables
rules cleared. I get about 220Mbit/sec without oprofile running, and 208
with it running (vs. about 190Mbit/sec with the iptables rules in the
previous run).
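To put rough numbers on those deltas, here is a small sketch that uses only the throughput figures quoted above. pct_drop is a hypothetical helper name, not part of netperf or oprofile.

```shell
#!/bin/sh
# pct_drop BASE MEASURED: percentage by which MEASURED falls short of BASE.
pct_drop() {
    LC_ALL=C awk -v base="$1" -v got="$2" \
        'BEGIN { printf "%.1f\n", 100 * (base - got) / base }'
}

pct_drop 220 208   # cost of running oprofile, rules cleared
pct_drop 208 190   # further cost of the default iptables rules, oprofile on
```

On these figures oprofile costs about 5.5% and the default rules a further 8.7% or so, bearing in mind that each number comes from a single run.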
As noted earlier, this is with netperf running on the router itself,
with web10g patched in. Web10g supplies some interesting statistics
(attached), and I have tcptrace/xplot.org screenshots of the
previous run here:
http://huchra.bufferbloat.net/~d/rc6-smoke-captures/
that are also interesting. There is a knee at 30 seconds and other
somewhat odd-looking behavior when you zoom in.
I further note that the laptop driving this test (via a gige pcmcia
card) has a default txqueuelen of 1000, I don't presently know the
length of its dma tx ring, and it is running the std ubuntu 11.04 kernel
(2.6.38-11-generic), with ecn, sack, and dsack enabled.
And the best performance I've got from the laptop to a pentium 4 box
was 290Mbit, so I'm pretty happy with 220Mbit. I wanted to
have a baseline value before I started fiddling with vlan and dscp
stuff....
CPU: MIPS 24K, speed 0 MHz (estimated)
Counted INSTRUCTIONS events (Instructions completed) with a unit mask
of 0x00 (No unit mask) count 100000
samples  %        app name      symbol name
-------------------------------------------------------------------------------
17141    14.8045  vmlinux       csum_partial
  17141    100.000  vmlinux       csum_partial [self]
-------------------------------------------------------------------------------
17024    14.7035  vmlinux       __copy_user
  17024    100.000  vmlinux       __copy_user [self]
-------------------------------------------------------------------------------
8888      7.6765  nf_conntrack  /nf_conntrack
  8888     100.000  nf_conntrack  /nf_conntrack [self]
-------------------------------------------------------------------------------
4139      3.5748  vmlinux       __do_softirq
  4139     100.000  vmlinux       __do_softirq [self]
-------------------------------------------------------------------------------
4055      3.5023  ip_tables     /ip_tables
  4055     100.000  ip_tables     /ip_tables [self]
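For comparing the netfilter cost between the two profiles, a minimal sketch: it sums the "%" column of the top-level ip_tables and nf_conntrack rows from opreport -c style text on stdin. netfilter_share is a hypothetical helper name, and the input rows are copied from the two profiles in this thread.

```shell
#!/bin/sh
# Sum the "%" column of the top-level ip_tables and nf_conntrack rows
# in `opreport -c` style output read from stdin ([self] rows are skipped).
netfilter_share() {
    LC_ALL=C awk '
        $NF != "[self]" && ($3 == "ip_tables" || $3 == "nf_conntrack") { sum += $2 }
        END { printf "%.2f\n", sum }'
}

# Rows copied from the run with the default rules:
printf '%s\n' \
    '11913 9.5705 ip_tables /ip_tables' \
    '11913 100.000 ip_tables /ip_tables [self]' \
    '8949 7.1893 nf_conntrack /nf_conntrack' \
    '8949 100.000 nf_conntrack /nf_conntrack [self]' | netfilter_share

# Rows copied from the run with the rules cleared:
printf '%s\n' \
    '8888 7.6765 nf_conntrack /nf_conntrack' \
    '4055 3.5023 ip_tables /ip_tables' | netfilter_share
```

That works out to roughly 16.76% of samples with the default rules against 11.18% with them cleared. Nearly all of the difference is ip_tables; the nf_conntrack sample count barely moves.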
--
Dave Täht
SKYPE: davetaht
US Tel: 1-239-829-5608
http://the-edge.blogspot.com
[-- Attachment #2: estats.stuff.txt --]
[-- Type: text/plain, Size: 11288 bytes --]
Connection 64 (172.29.1.33_22 172.29.1.123_51581)
LocalAddressType : 1
LocalAddress : 172.29.1.33
LocalPort : 22
RemAddressType : 1
RemAddress : 172.29.1.123
RemPort : 51581
SegsOut : 2668
DataSegsOut : 2668
DataOctetsOut : 493373
SegsRetrans : 3
OctetsRetrans : 208
SegsIn : 4272
DataSegsIn : 1885
DataOctetsIn : 93215
ElapsedSecs : 6618
ElapsedMicroSecs : 723887
CurMSS : 1460
PipeSize : 0
MaxPipeSize : 4380
SmoothedRTT : 10
CurRTO : 210
CongSignals : 0
CurCwnd : 8760
CurSsthresh : 4294967295
Timeouts : 0
CurRwinSent : 16888
MaxRwinSent : 16888
ZeroRwinSent : 0
CurRwinRcvd : 293248
MaxRwinRcvd : 293248
ZeroRwinRcvd : 0
SndLimTransRwin : 0
SndLimTransCwnd : 1
SndLimTransSnd : 2
SndLimTimeRwin : 0
SndLimTimeCwnd : 400477716
SndLimTimeSnd : 4073612272
SendStall : 0
RetranThresh : 3
NonRecovDAEpisodes : 1
SumOctetsReordered : 256
NonRecovDA : 0
SampleRTT : 0
RTTVar : 50
MaxRTT : 40
MinRTT : 0
SumRTT : 9160
CountRTT : 2507
MaxRTO : 230
MinRTO : 210
IpTtl : 64
IpTosIn : 16
IpTosOut : 0
PreCongSumCwnd : 0
PreCongSumRTT : 0
PostCongSumRTT : 0
PostCongCountRTT : 0
ECNsignals : 0
DupAckEpisodes : 0
RcvRTT : 3134114
DupAcksOut : 0
CERcvd : 0
ECESent : 0
ActiveOpen : 0
MSSSent : 1460
MSSRcvd : 1460
WinScaleSent : 1
WinScaleRcvd : 7
TimeStamps : 2
ECN : 3
WillSendSACK : 1
WillUseSACK : 1
State : 1953653102
Nagle : 1
MaxSsCwnd : 14600
MaxCaCwnd : 10220
MaxSsthresh : 5840
MinSsthresh : 2920
InRecovery : 2
DupAcksIn : 0
SpuriousFrDetected : 0
SpuriousRtoDetected : 0
SoftErrors : 0
SoftErrorReason : 0
SlowStart : 2
CongAvoid : 41
OtherReductions : 0
CongOverCount : 0
FastRetran : 0
SubsequentTimeouts : 0
CurTimeoutCount : 0
AbruptTimeouts : 0
SACKsRcvd : 0
SACKBlocksRcvd : 0
DSACKDups : 0
MaxMSS : 1460
MinMSS : 1440
SndInitial : 2230931079
RecInitial : 1469570772
CurRetxQueue : 0
MaxRetxQueue : 0
CurReasmQueue : 0
MaxReasmQueue : 0
SndUna : 2231370872
SndNxt : 2231370872
SndMax : 2231370872
ThruOctetsAcked : 439793
RcvNxt : 1469663891
ThruOctetsReceived : 93119
CurAppWQueue : 0
MaxAppWQueue : 0
CurAppRQueue : 0
MaxAppRQueue : 1144
LimCwnd : 4294965836
LimSsthresh : 0
LimRwin : 64075
LimMSS : 95682560
OtherReductionsCV : 0
OtherReductionsCM : 0
StartTimeSecs : 1314747615
StartTimeMicroSecs : 690002
Sndbuf : 16384
Rcvbuf : 87380
Connection 56 (172.29.1.97_22 172.29.1.123_52588)
LocalAddressType : 1
LocalAddress : 172.29.1.97
LocalPort : 22
RemAddressType : 1
RemAddress : 172.29.1.123
RemPort : 52588
SegsOut : 1241
DataSegsOut : 1241
DataOctetsOut : 110645
SegsRetrans : 0
OctetsRetrans : 0
SegsIn : 2000
DataSegsIn : 914
DataOctetsIn : 45711
ElapsedSecs : 4990
ElapsedMicroSecs : 629033
CurMSS : 1460
PipeSize : 0
MaxPipeSize : 1460
SmoothedRTT : 10
CurRTO : 210
CongSignals : 0
CurCwnd : 14600
CurSsthresh : 4294965836
Timeouts : 0
CurRwinSent : 16888
MaxRwinSent : 16888
ZeroRwinSent : 0
CurRwinRcvd : 64128
MaxRwinRcvd : 64128
ZeroRwinRcvd : 0
SndLimTransRwin : 0
SndLimTransCwnd : 0
SndLimTransSnd : 1
SndLimTimeRwin : 0
SndLimTimeCwnd : 0
SndLimTimeSnd : 4171379644
SendStall : 0
RetranThresh : 3
NonRecovDAEpisodes : 0
SumOctetsReordered : 0
NonRecovDA : 0
SampleRTT : 0
RTTVar : 50
MaxRTT : 120
MinRTT : 0
SumRTT : 3280
CountRTT : 1114
MaxRTO : 230
MinRTO : 210
IpTtl : 64
IpTosIn : 16
IpTosOut : 0
PreCongSumCwnd : 0
PreCongSumRTT : 0
PostCongSumRTT : 0
PostCongCountRTT : 0
ECNsignals : 0
DupAckEpisodes : 0
RcvRTT : 1532633
DupAcksOut : 0
CERcvd : 0
ECESent : 0
ActiveOpen : 0
MSSSent : 1460
MSSRcvd : 1460
WinScaleSent : 1
WinScaleRcvd : 7
TimeStamps : 2
ECN : 3
WillSendSACK : 1
WillUseSACK : 1
State : 1953653102
Nagle : 1
MaxSsCwnd : 14600
MaxCaCwnd : 0
MaxSsthresh : 0
MinSsthresh : 4294967295
InRecovery : 2
DupAcksIn : 0
SpuriousFrDetected : 0
SpuriousRtoDetected : 0
SoftErrors : 0
SoftErrorReason : 0
SlowStart : 0
CongAvoid : 0
OtherReductions : 0
CongOverCount : 0
FastRetran : 0
SubsequentTimeouts : 0
CurTimeoutCount : 0
AbruptTimeouts : 0
SACKsRcvd : 0
SACKBlocksRcvd : 0
DSACKDups : 0
MaxMSS : 1460
MinMSS : 1440
SndInitial : 1838946708
RecInitial : 749866429
CurRetxQueue : 0
MaxRetxQueue : 0
CurReasmQueue : 0
MaxReasmQueue : 0
SndUna : 1839032533
SndNxt : 1839032533
SndMax : 1839032533
ThruOctetsAcked : 85825
RcvNxt : 749912140
ThruOctetsReceived : 45711
CurAppWQueue : 0
MaxAppWQueue : 0
CurAppRQueue : 0
MaxAppRQueue : 1144
LimCwnd : 4294965836
LimSsthresh : 0
LimRwin : 64075
LimMSS : 95682560
OtherReductionsCV : 0
OtherReductionsCM : 0
StartTimeSecs : 1314742638
StartTimeMicroSecs : 218763
Sndbuf : 16384
Rcvbuf : 87380
Connection 0 (172.29.1.97_22 172.29.1.123_55029)
LocalAddressType : 1
LocalAddress : 172.29.1.97
LocalPort : 22
RemAddressType : 1
RemAddress : 172.29.1.123
RemPort : 55029
SegsOut : 2927
DataSegsOut : 2927
DataOctetsOut : 563021
SegsRetrans : 0
OctetsRetrans : 0
SegsIn : 4394
DataSegsIn : 1820
DataOctetsIn : 93439
ElapsedSecs : 65542
ElapsedMicroSecs : 507312
CurMSS : 1460
PipeSize : 0
MaxPipeSize : 4380
SmoothedRTT : 20
CurRTO : 220
CongSignals : 0
CurCwnd : 14600
CurSsthresh : 4294965836
Timeouts : 0
CurRwinSent : 22712
MaxRwinSent : 22712
ZeroRwinSent : 0
CurRwinRcvd : 357760
MaxRwinRcvd : 357760
ZeroRwinRcvd : 0
SndLimTransRwin : 0
SndLimTransCwnd : 0
SndLimTransSnd : 1
SndLimTimeRwin : 0
SndLimTimeCwnd : 0
SndLimTimeSnd : 2439344612
SendStall : 0
RetranThresh : 3
NonRecovDAEpisodes : 0
SumOctetsReordered : 0
NonRecovDA : 0
SampleRTT : 0
RTTVar : 50
MaxRTT : 50
MinRTT : 0
SumRTT : 12590
CountRTT : 2722
MaxRTO : 230
MinRTO : 210
IpTtl : 64
IpTosIn : 16
IpTosOut : 0
PreCongSumCwnd : 0
PreCongSumRTT : 0
PostCongSumRTT : 0
PostCongCountRTT : 0
ECNsignals : 0
DupAckEpisodes : 0
RcvRTT : 644526
DupAcksOut : 0
CERcvd : 0
ECESent : 0
ActiveOpen : 0
MSSSent : 1460
MSSRcvd : 1460
WinScaleSent : 1
WinScaleRcvd : 7
TimeStamps : 2
ECN : 3
WillSendSACK : 1
WillUseSACK : 1
State : 1953653102
Nagle : 1
MaxSsCwnd : 14600
MaxCaCwnd : 0
MaxSsthresh : 0
MinSsthresh : 4294967295
InRecovery : 2
DupAcksIn : 7
SpuriousFrDetected : 0
SpuriousRtoDetected : 0
SoftErrors : 7
SoftErrorReason : 1
SlowStart : 0
CongAvoid : 0
OtherReductions : 0
CongOverCount : 0
FastRetran : 0
SubsequentTimeouts : 0
CurTimeoutCount : 0
AbruptTimeouts : 0
SACKsRcvd : 0
SACKBlocksRcvd : 0
DSACKDups : 0
MaxMSS : 1460
MinMSS : 1440
SndInitial : 1018333932
RecInitial : 1871301081
CurRetxQueue : 0
MaxRetxQueue : 0
CurReasmQueue : 0
MaxReasmQueue : 0
SndUna : 1018838413
SndNxt : 1018838413
SndMax : 1018838413
ThruOctetsAcked : 504481
RcvNxt : 1871394520
ThruOctetsReceived : 93439
CurAppWQueue : 0
MaxAppWQueue : 0
CurAppRQueue : 0
MaxAppRQueue : 1456
LimCwnd : 4294965836
LimSsthresh : 0
LimRwin : 64075
LimMSS : 95682560
OtherReductionsCV : 0
OtherReductionsCM : 0
StartTimeSecs : 1314684283
StartTimeMicroSecs : 261134
Sndbuf : 16384
Rcvbuf : 87380
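A couple of derived figures make dumps like the one above easier to read. This is a sketch only: estats_summary is a hypothetical helper, it expects the "Name : value" lines of a single connection on stdin, and it assumes SumRTT is in milliseconds as RFC 4898 defines it, which may not hold exactly for this web10g patch.

```shell
#!/bin/sh
# Summarise one connection's Web10g dump read from stdin.
# Expects "Name : value" lines; lines without a colon are ignored in effect.
estats_summary() {
    LC_ALL=C awk -F' *: *' '
        { v[$1] = $2 }
        END {
            # Mean RTT = sum of RTT samples / number of samples.
            if (v["CountRTT"] + 0 > 0)
                printf "mean_rtt=%.2f\n", v["SumRTT"] / v["CountRTT"]
            # Retransmitted segments as a percentage of segments sent.
            if (v["SegsOut"] + 0 > 0)
                printf "retrans_pct=%.3f\n", 100 * v["SegsRetrans"] / v["SegsOut"]
        }'
}

# Values copied from "Connection 64" above:
printf '%s\n' 'SegsOut : 2668' 'SegsRetrans : 3' \
    'SumRTT : 9160' 'CountRTT : 2507' | estats_summary
```

For Connection 64 this gives a mean RTT of about 3.65 and a retransmit rate of about 0.112%. The RTT samples themselves look to be quantised in steps of 10, so the mean is only a coarse indication.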
* Re: oprofiling is much saner looking now with rc6-smoketest
From: Dave Taht @ 2011-08-31 1:45 UTC
To: Rick Jones; +Cc: bloat-devel
On Tue, Aug 30, 2011 at 6:01 PM, Rick Jones <rick.jones2@hp.com> wrote:
> On 08/30/2011 05:32 PM, Dave Taht wrote:
>> It bugs me that iptables and conntrack eat so much cpu for what
>> is an internal-only connection, e.g. one that
>> doesn't need conntracking.
>
> The csum_partial is a bit surprising - I thought every NIC and its dog
> offered CKO these days - or is that something happening with
> ip_tables/contrack?
If this chipset supports it, so far as I know, it isn't documented or
implemented.
> I also thought that Linux used an integrated
> copy/checksum in at least one direction, or did that go away when CKO became
> prevalent?
Don't know.
>
> If this is inbound, and there is just plain checksumming and not anything
> funny from conntrack, I would have expected checksum to be much larger than
> copy. Checksum (in the inbound direction) will take the cache misses and
> the copy would not. Unless... the data cache of the processor is getting
> completely trashed - say from the netserver running on the router not
> keeping up with the inbound data fully and so the copy gets "far away" from
> the checksum verification.
220Mbit isn't good enough for ya? Previous tests ran at about 140Mbit; the
gain is due to some major optimizations by felix to fix a bunch of mis-alignment
issues. Through the router, I've seen 260Mbit - which is perilously
close to the speed that I can drive it at from the test boxes.
>
> Does perf/perf_events (whatever the followon to perfmon2 is called) have
> support for the CPU used in the device? (Assuming it even has a PMU to be
> queried in the first place)
Yes. Don't think it's enabled. It is running flat out, according to top.
>
>> That said, I understand that people like their statistics, and me,
>> I'm trying to make split-tcp work better, ultimately, one day....
>>
>> I'm going to rerun this without the fw rules next.
>
> It would be interesting to see if the csum time goes away. Long ago and far
> away when I was beating on a 32-core system with aggregate netperf TCP_RR
> and enabling or not FW rules, conntrack had a non-trivial effect indeed on
> performance.
Stays about the same. iptables time drops. How to disable conntrack?
Don't you only really need it for nat?
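One possible approach is iptables' raw table and its NOTRACK target, which mark packets so conntrack skips them. This is a sketch and has not been tested on cerowrt. notrack_rules is a hypothetical helper that only prints the commands, and "lan0" is a placeholder interface name. Note that conntrack is needed for stateful matches (-m state) as well as for nat, so untracked traffic could fall foul of rules that only accept ESTABLISHED,RELATED.

```shell
#!/bin/sh
# Print (do not apply) raw-table rules that exempt one interface's traffic
# from connection tracking. The interface name is a placeholder.
notrack_rules() {
    iface="$1"
    echo "iptables -t raw -A PREROUTING -i $iface -j NOTRACK"
    echo "iptables -t raw -A OUTPUT -o $iface -j NOTRACK"
}

notrack_rules lan0          # review the output first
# notrack_rules lan0 | sh   # then apply as root, if it looks right
```

Printing rather than applying keeps the sketch harmless to run; the rules take effect only if the output is piped to a root shell.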
>
> http://markmail.org/message/exjtzel7vq2ugt66#query:netdev%20conntrack%20rick%20jones%2032%20netperf+page:1+mid:s5v5kylvmlfrpb7a+state:results
>
> I think will get to the start of that thread. The subject is '32 core
> net-next stack/netfilter "scaling"'
>
> rick jones
>
--
Dave Täht
SKYPE: davetaht
US Tel: 1-239-829-5608
http://the-edge.blogspot.com
* Re: oprofiling is much saner looking now with rc6-smoketest
From: Dave Taht @ 2011-08-31 1:58 UTC
To: Rick Jones; +Cc: bloat-devel
I have put the current rc6 smoketest up at:
http://huchra.bufferbloat.net/~cero1/rc6-smoketest/
So far it's proving very stable. Wireless performance is excellent and
wired performance dramatically improved. No crash bugs thus far,
though I had a scare...
For the final rc6, which I hope to have done by friday, I'm in the
process of cleanly re-assembling the patch set (sorry, the sources are
a bit of a mess at present). For this rc, I'm hoping that a new
iptables lands, in particular, and I have numerous other little things
in the queue to sort out.
All that said, getting oprofile running is not hard, and I do
appreciate smoke testers helping out!!! as I don't think I'll be able
to get another release candidate done before linux plumbers.
install the correct image on your router from the above via web
interface or sysupgrade -n
reboot
edit /etc/opkg.conf to have that url in it
opkg update
opkg install oprofile
cd /tmp
mkdir /tmp/oprofile
wget http://huchra.bufferbloat.net/~d/rc6-smoke-captures/vmlinux
# saving profile data to flash is a bad idea, so keep the session in /tmp
opcontrol --vmlinux=/tmp/vmlinux --session-dir=/tmp/oprofile
opcontrol --start
# do your testing
opcontrol --dump
opreport -c # or whatever options you like.
On Tue, Aug 30, 2011 at 6:45 PM, Dave Taht <dave.taht@gmail.com> wrote:
> On Tue, Aug 30, 2011 at 6:01 PM, Rick Jones <rick.jones2@hp.com> wrote:
>> On 08/30/2011 05:32 PM, Dave Taht wrote:
>
>>> It bugs me that iptables and conntrack eat so much cpu for what
>>> is an internal-only connection, e.g. one that
>>> doesn't need conntracking.
>>
>> The csum_partial is a bit surprising - I thought every NIC and its dog
>> offered CKO these days - or is that something happening with
>> ip_tables/contrack?
>
> If this chipset supports it, so far as I know, it isn't documented or
> implemented.
>
>> I also thought that Linux used an integrated
>> copy/checksum in at least one direction, or did that go away when CKO became
>> prevalent?
>
> Don't know.
>
>>
>> If this is inbound, and there is just plain checksumming and not anything
>> funny from conntrack, I would have expected checksum to be much larger than
>> copy. Checksum (in the inbound direction) will take the cache misses and
>> the copy would not. Unless... the data cache of the processor is getting
>> completely trashed - say from the netserver running on the router not
>> keeping up with the inbound data fully and so the copy gets "far away" from
>> the checksum verification.
>
> 220Mbit isn't good enough for ya? Previous tests ran at about 140Mbit, but due
> to some major optimizations by felix to fix a bunch of mis-alignment
> issues. Through the router, I've seen 260Mbit - which is perilously
> close to the speed that I can drive it at from the test boxes.
>
>>
>> Does perf/perf_events (whatever the followon to perfmon2 is called) have
>> support for the CPU used in the device? (Assuming it even has a PMU to be
>> queried in the first place)
>
> Yes. Don't think it's enabled. It is running flat out, according to top.
>
>>
>>> That said, I understand that people like their statistics, and me,
>>> I'm trying to make split-tcp work better, ultimately, one day....
>>>
>>> I'm going to rerun this without the fw rules next.
>>
>> It would be interesting to see if the csum time goes away. Long ago and far
>> away when I was beating on a 32-core system with aggregate netperf TCP_RR
>> and enabling or not FW rules, conntrack had a non-trivial effect indeed on
>> performance.
>
> Stays about the same. iptables time drops. How to disable conntrack?
> Don't you only really
> need it for nat?
>
>>
>> http://markmail.org/message/exjtzel7vq2ugt66#query:netdev%20conntrack%20rick%20jones%2032%20netperf+page:1+mid:s5v5kylvmlfrpb7a+state:results
>>
>> I think will get to the start of that thread. The subject is '32 core
>> net-next stack/netfilter "scaling"'
>>
>> rick jones
>>
>
>
>
> --
> Dave Täht
> SKYPE: davetaht
> US Tel: 1-239-829-5608
> http://the-edge.blogspot.com
>
--
Dave Täht
SKYPE: davetaht
US Tel: 1-239-829-5608
http://the-edge.blogspot.com
* Re: oprofiling is much saner looking now with rc6-smoketest
From: Dave Taht @ 2011-08-31 3:28 UTC
To: Rick Jones; +Cc: bloat-devel
I took a little more time out to play with netperf at these extreme
performance values, while puzzled about the performance knee observed
midway through the previous tests.
The three test runs this evening (and captures!) are up at:
http://huchra.bufferbloat.net/~d/rc6-smoke-captures/
For test 3, I rebooted the router into its default tx ring (64), and
set a txqueuelen of 128, running cubic...
Measured throughput was mildly better at 229Mbit (admittedly on a fresh boot,
oprofile not even loaded),
and we didn't have a drop off at all, so I'm still chasing that...
What I found interesting was the 10 second periodicity of the
drop-offs. My assumption is that this is a timer being fired from
somewhere (netperf?) that blocks the transmission...
http://huchra.bufferbloat.net/~d/rc6-smoke-captures/txqueuelen128and10seconddropcycle.png
Test 4 will repeat the above sans oprofile, with the current default
cerowrt settings for dma tx (4) and txqueuelen 8, if I get to it
tonight.
On Tue, Aug 30, 2011 at 6:58 PM, Dave Taht <dave.taht@gmail.com> wrote:
> I have put the current rc6 smoketest up at:
>
> http://huchra.bufferbloat.net/~cero1/rc6-smoketest/
>
> So far it's proving very stable. Wireless performance is excellent and
> wired performance dramatically improved. No crash bugs thus far,
> though I had a scare...
>
> For the final rc6, which I hope to have done by friday, I'm in the
> process of cleanly re-assembling the patch set (sorry, the sources are
> a bit of a mess at present). For this rc, I'm hoping that a new
> iptables lands, in particular, and I have numerous other little things
> in the queue to sort out.
>
> All that said, getting oprofile running is not hard, and I do
> appreciate smoke testers helping out!!! as I don't think I'll be able
> to get another release candidate done before linux plumbers.
>
> install the correct image on your router from the above via web
> interface or sysupgrade -n
> reboot
> edit /etc/opkg.conf to have that url in it
> opkg update
> opkg install oprofile
> cd /tmp
> mkdir /tmp/oprofile
> wget http://huchra.bufferbloat.net/~d/rc6-smoke-captures/vmlinux
> opcontrol --vmlinux=/tmp/vmlinux --session-dir=/tmp/oprofile
> # (saving profile data to flash is a bad idea)
>
> opcontrol --start
> # do your testing
> opcontrol --dump
>
> opreport -c # or whatever options you like.
>
>
> On Tue, Aug 30, 2011 at 6:45 PM, Dave Taht <dave.taht@gmail.com> wrote:
>> On Tue, Aug 30, 2011 at 6:01 PM, Rick Jones <rick.jones2@hp.com> wrote:
>>> On 08/30/2011 05:32 PM, Dave Taht wrote:
>>
>>>> It bugs me that iptables and conntrack eat so much cpu for what
>>>> is an internal-only connection, e.g. one that
>>>> doesn't need conntracking.
>>>
>>> The csum_partial is a bit surprising - I thought every NIC and its dog
>>> offered CKO these days - or is that something happening with
>>> ip_tables/conntrack?
>>
>> If this chipset supports it, so far as I know, it isn't documented or
>> implemented.
>>
>>> I also thought that Linux used an integrated
>>> copy/checksum in at least one direction, or did that go away when CKO became
>>> prevalent?
>>
>> Don't know.
>>
>>>
>>> If this is inbound, and there is just plain checksumming and not anything
>>> funny from conntrack, I would have expected checksum to be much larger than
>>> copy. Checksum (in the inbound direction) will take the cache misses and
>>> the copy would not. Unless... the data cache of the processor is getting
>>> completely trashed - say from the netserver running on the router not
>>> keeping up with the inbound data fully and so the copy gets "far away" from
>>> the checksum verification.
>>
>> 220Mbit isn't good enough for ya? Previous tests ran at about 140Mbit; the gain is due
>> to some major optimizations by felix that fixed a bunch of mis-alignment
>> issues. Through the router, I've seen 260Mbit - which is perilously
>> close to the speed that I can drive it at from the test boxes.
>>
>>>
>>> Does perf/perf_events (whatever the followon to perfmon2 is called) have
>>> support for the CPU used in the device? (Assuming it even has a PMU to be
>>> queried in the first place)
>>
>> Yes. Don't think it's enabled. It is running flat out, according to top.
>>
>>>
>>>> That said, I understand that people like their statistics, and me,
>>>> I'm trying to make split-tcp work better, ultimately, one day....
>>>>
>>>> I'm going to rerun this without the fw rules next.
>>>
>>> It would be interesting to see if the csum time goes away. Long ago and far
>>> away when I was beating on a 32-core system with aggregate netperf TCP_RR
>>> and enabling or not FW rules, conntrack had a non-trivial effect indeed on
>>> performance.
>>
>> Stays about the same. iptables time drops. How to disable conntrack?
>> Don't you only really
>> need it for nat?
>>
>>>
>>> http://markmail.org/message/exjtzel7vq2ugt66#query:netdev%20conntrack%20rick%20jones%2032%20netperf+page:1+mid:s5v5kylvmlfrpb7a+state:results
>>>
>>> I think that will get to the start of that thread. The subject is '32 core
>>> net-next stack/netfilter "scaling"'
>>>
>>> rick jones
>>>
--
Dave Täht
SKYPE: davetaht
US Tel: 1-239-829-5608
http://the-edge.blogspot.com
* Re: oprofiling is much saner looking now with rc6-smoketest
2011-08-31 1:45 ` Dave Taht
2011-08-31 1:58 ` Dave Taht
@ 2011-08-31 15:55 ` Rick Jones
1 sibling, 0 replies; 10+ messages in thread
From: Rick Jones @ 2011-08-31 15:55 UTC (permalink / raw)
To: Dave Taht; +Cc: bloat-devel
>> If this is inbound, and there is just plain checksumming and not anything
>> funny from conntrack, I would have expected checksum to be much larger than
>> copy. Checksum (in the inbound direction) will take the cache misses and
>> the copy would not. Unless... the data cache of the processor is getting
>> completely trashed - say from the netserver running on the router not
>> keeping up with the inbound data fully and so the copy gets "far away" from
>> the checksum verification.
>
> 220Mbit isn't good enough for ya? Previous tests ran at about 140Mbit; the gain is due
> to some major optimizations by felix that fixed a bunch of mis-alignment
> issues. Through the router, I've seen 260Mbit - which is perilously
> close to the speed that I can drive it at from the test boxes.
It is all a question of context. The last time I was in a context where
220 Mbit/s was high speed was when 100 BT first shipped or perhaps FDDI
before that :)
>> Does perf/perf_events (whatever the followon to perfmon2 is called) have
>> support for the CPU used in the device? (Assuming it even has a PMU to be
>> queried in the first place)
>
> Yes. Don't think it's enabled. It is running flat out, according to top.
Well, flat-out as far as the basic OS utilities can tell. Stalled
hardware manifests as CPU time consumed in something like top even
though the processor may be sitting "idle," (in its context) twiddling
its thumbs waiting on cache misses. Hence the question about PMU support.
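One rough way to tell a busy CPU from a stalled one, if a PMU and perf
support are available on the device (not confirmed in this thread), is to
compare instructions completed against cycles over the same interval. A
sketch, with the counter values supplied by hand:

```shell
#!/bin/sh
# Sketch only. The counts would come from something like
#   perf stat -a -e cycles,instructions sleep 10
# assuming perf is built for the target, which this thread does not confirm.

# ipc INSTRUCTIONS CYCLES
# Prints instructions per cycle to two decimal places. A value well below
# what the core can sustain suggests time spent stalled (for example on
# cache misses) even though top reports the CPU as fully busy.
ipc() {
    awk -v i="$1" -v c="$2" 'BEGIN { printf "%.2f\n", i / c }'
}
```

For example, `ipc 250000000 1000000000` prints 0.25.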
rick jones
* Re: oprofiling is much saner looking now with rc6-smoketest
2011-08-31 3:28 ` Dave Taht
@ 2011-08-31 16:19 ` Rick Jones
0 siblings, 0 replies; 10+ messages in thread
From: Rick Jones @ 2011-08-31 16:19 UTC (permalink / raw)
To: Dave Taht; +Cc: bloat-devel
On 08/30/2011 08:28 PM, Dave Taht wrote:
> What I found interesting was the 10 second periodicity of the
> drop-offs. My assumption is that this is a timer being fired from
> somewhere (netperf?) that blocks the transmission...
The only timer that netperf would fire-off in the middle of a run would
be if netperf were ./configure'd with --enable-intervals and one set a
burst interval of 10 seconds with some combination of global -b <burst
length> -w <burst interval>
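For reference, a paced run of that kind would look roughly like the command
this sketch prints. It assumes a netperf built with --enable-intervals and
that the global -w value is read as milliseconds when no unit suffix is
given (check the manual for your version); the host name and the 60 second
test length are placeholders.

```shell
#!/bin/sh
# Sketch only; the host, test type and test length are illustrative.

# paced_netperf_cmd HOST BURST_LEN INTERVAL_SECONDS
# Prints a netperf command line using the global -b/-w options, converting
# the interval from seconds to milliseconds.
paced_netperf_cmd() {
    interval_ms=$(( $3 * 1000 ))
    echo "netperf -H $1 -t TCP_STREAM -l 60 -b $2 -w $interval_ms"
}
```

`paced_netperf_cmd router 1 10` prints a command with `-b 1 -w 10000`. A
plain invocation without -b and -w would not set up this timer.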
happy benchmarking,
rick jones
end of thread, other threads:[~2011-08-31 16:19 UTC | newest]
Thread overview: 10+ messages
2011-08-31 0:32 oprofiling is much saner looking now with rc6-smoketest Dave Taht
2011-08-31 1:01 ` Rick Jones
2011-08-31 1:10 ` Simon Barber
2011-08-31 1:20 ` Simon Barber
2011-08-31 1:45 ` Dave Taht
2011-08-31 1:58 ` Dave Taht
2011-08-31 3:28 ` Dave Taht
2011-08-31 16:19 ` Rick Jones
2011-08-31 15:55 ` Rick Jones
2011-08-31 1:41 ` Dave Taht