* oprofiling is much saner looking now with rc6-smoketest
From: Dave Taht @ 2011-08-31 0:32 UTC (permalink / raw)
To: bloat-devel

I get about 190Mbit/sec from netperf now, on GigE, with oprofiling
enabled, driver buffers of 4, txqueue of 8, cerowrt default iptables
rules, AND web10g patched into kernel 3.0.3.

This is much saner than rc3, and judging from the csum_partial and
copy_user being roughly equal, there isn't much left to be gained...

Nice work.

(Without oprofiling, and without web10g and with tcp cubic I can get
past 250Mbit)

CPU: MIPS 24K, speed 0 MHz (estimated)
Counted INSTRUCTIONS events (Instructions completed) with a unit mask
of 0x00 (No unit mask) count 100000
samples  %        app name      symbol name
-------------------------------------------------------------------------------
17277    13.8798  vmlinux       csum_partial
  17277  100.000  vmlinux       csum_partial [self]
-------------------------------------------------------------------------------
16607    13.3415  vmlinux       __copy_user
  16607  100.000  vmlinux       __copy_user [self]
-------------------------------------------------------------------------------
11913     9.5705  ip_tables     /ip_tables
  11913  100.000  ip_tables     /ip_tables [self]
-------------------------------------------------------------------------------
 8949     7.1893  nf_conntrack  /nf_conntrack
  8949   100.000  nf_conntrack  /nf_conntrack [self]

In this case I was going from laptop - gige - through another
rc6-smoketest router - to_this_box's internal lan port.

It bugs me that iptables and conntrack eat so much cpu for what is an
internal-only connection, e.g. one that doesn't need conntracking. That
said, I understand that people like their statistics, and me, I'm trying
to make split-tcp work better, ultimately, one day....

I'm going to rerun this without the fw rules next.
--
Dave Täht
SKYPE: davetaht
US Tel: 1-239-829-5608
http://the-edge.blogspot.com
* Re: oprofiling is much saner looking now with rc6-smoketest
From: Rick Jones @ 2011-08-31 1:01 UTC (permalink / raw)
To: Dave Taht; +Cc: bloat-devel

On 08/30/2011 05:32 PM, Dave Taht wrote:
> I get about 190Mbit/sec from netperf now, on GigE, with oprofiling
> enabled, driver buffers of 4, txqueue of 8, cerowrt default iptables
> rules, AND web10g patched into kernel 3.0.3.
>
> This is much saner than rc3, and judging from the csum_partial and
> copy_user being roughly equal, there isn't much left to be gained...
>
> Nice work.
>
> (Without oprofiling, and without web10g and with tcp cubic I can get
> past 250Mbit)
>
> CPU: MIPS 24K, speed 0 MHz (estimated)
> Counted INSTRUCTIONS events (Instructions completed) with a unit mask
> of 0x00 (No unit mask) count 100000
> samples  %        app name      symbol name
> -------------------------------------------------------------------------------
> 17277    13.8798  vmlinux       csum_partial
>   17277  100.000  vmlinux       csum_partial [self]
> -------------------------------------------------------------------------------
> 16607    13.3415  vmlinux       __copy_user
>   16607  100.000  vmlinux       __copy_user [self]
> -------------------------------------------------------------------------------
> 11913     9.5705  ip_tables     /ip_tables
>   11913  100.000  ip_tables     /ip_tables [self]
> -------------------------------------------------------------------------------
>  8949     7.1893  nf_conntrack  /nf_conntrack
>   8949   100.000  nf_conntrack  /nf_conntrack [self]
>
> In this case I was going from laptop - gige - through another
> rc6-smoketest router - to_this_box's internal lan port.
>
> It bugs me that iptables and conntrack eat so much cpu for what
> is an internal-only connection, e.g. one that
> doesn't need conntracking.
The csum_partial is a bit surprising - I thought every NIC and its dog
offered CKO these days - or is that something happening with
ip_tables/conntrack? I also thought that Linux used an integrated
copy/checksum in at least one direction, or did that go away when CKO
became prevalent?

If this is inbound, and there is just plain checksumming and not
anything funny from conntrack, I would have expected checksum to be much
larger than copy. Checksum (in the inbound direction) will take the
cache misses and the copy would not. Unless... the data cache of the
processor is getting completely trashed - say from the netserver running
on the router not keeping up with the inbound data fully, so the copy
gets "far away" from the checksum verification.

Does perf/perf_events (whatever the follow-on to perfmon2 is called)
have support for the CPU used in the device? (Assuming it even has a
PMU to be queried in the first place.)

> That said, I understand that people like their statistics, and me,
> I'm trying to make split-tcp work better, ultimately, one day....
>
> I'm going to rerun this without the fw rules next.

It would be interesting to see if the csum time goes away. Long ago and
far away, when I was beating on a 32-core system with aggregate netperf
TCP_RR runs with and without FW rules, conntrack had a decidedly
non-trivial effect on performance.

http://markmail.org/message/exjtzel7vq2ugt66#query:netdev%20conntrack%20rick%20jones%2032%20netperf+page:1+mid:s5v5kylvmlfrpb7a+state:results

I think that will get to the start of the thread. The subject is '32
core net-next stack/netfilter "scaling"'.

rick jones
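[Editorial aside, not part of the thread: for context on what csum_partial spends those cycles on, it computes the 16-bit one's-complement Internet checksum (RFC 1071) that a NIC with CKO would verify in hardware. A toy shell sketch of the fold-and-invert arithmetic, using RFC 1071's worked example; the kernel's MIPS routine is of course hand-tuned assembly, this only illustrates the math:]

```shell
# One's-complement Internet checksum over 16-bit words (RFC 1071).
inet_csum() {
    local sum=0 w
    # Sum the 16-bit words as plain integers.
    for w in "$@"; do
        sum=$(( sum + w ))
    done
    # Fold the carries back into the low 16 bits, then invert.
    while [ $(( sum >> 16 )) -ne 0 ]; do
        sum=$(( (sum & 0xFFFF) + (sum >> 16) ))
    done
    printf '0x%04x\n' $(( ~sum & 0xFFFF ))
}

# RFC 1071's worked example: words 0001 f203 f4f5 f6f7 -> checksum 0x220d
inet_csum 0x0001 0xf203 0xf4f5 0xf6f7
```

[Appending the resulting checksum to the data and re-running folds the sum to 0xffff, so the check yields 0x0000 - which is exactly the verification the receive path does per packet when there is no hardware offload.]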
* Re: oprofiling is much saner looking now with rc6-smoketest
From: Simon Barber @ 2011-08-31 1:10 UTC (permalink / raw)
To: bloat-devel

Why is conntrack even getting involved?

Simon

On 08/30/2011 06:01 PM, Rick Jones wrote:
> On 08/30/2011 05:32 PM, Dave Taht wrote:
>> It bugs me that iptables and conntrack eat so much cpu for what
>> is an internal-only connection, e.g. one that
>> doesn't need conntracking.

_______________________________________________
Bloat-devel mailing list
Bloat-devel@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/bloat-devel
* Re: oprofiling is much saner looking now with rc6-smoketest
From: Simon Barber @ 2011-08-31 1:20 UTC (permalink / raw)
To: bloat-devel

Apologies - I should have read the scenario better - the connection is
terminated on the router.

Simon

On 08/30/2011 06:10 PM, Simon Barber wrote:
> Why is conntrack even getting involved?
* Re: oprofiling is much saner looking now with rc6-smoketest
From: Dave Taht @ 2011-08-31 1:45 UTC (permalink / raw)
To: Rick Jones; +Cc: bloat-devel

On Tue, Aug 30, 2011 at 6:01 PM, Rick Jones <rick.jones2@hp.com> wrote:
> On 08/30/2011 05:32 PM, Dave Taht wrote:
>> It bugs me that iptables and conntrack eat so much cpu for what
>> is an internal-only connection, e.g. one that
>> doesn't need conntracking.
>
> The csum_partial is a bit surprising - I thought every NIC and its dog
> offered CKO these days - or is that something happening with
> ip_tables/conntrack?

If this chipset supports it, so far as I know it isn't documented or
implemented.

> I also thought that Linux used an integrated copy/checksum in at least
> one direction, or did that go away when CKO became prevalent?

Don't know.

> If this is inbound, and there is just plain checksumming and not
> anything funny from conntrack, I would have expected checksum to be
> much larger than copy. Checksum (in the inbound direction) will take
> the cache misses and the copy would not. Unless... the data cache of
> the processor is getting completely trashed - say from the netserver
> running on the router not keeping up with the inbound data fully and
> so the copy gets "far away" from the checksum verification.

220Mbit isn't good enough for ya? Previous tests ran at about 140Mbit,
but that was before some major optimizations by felix fixed a bunch of
mis-alignment issues. Through the router, I've seen 260Mbit - which is
perilously close to the speed I can drive it at from the test boxes.

> Does perf/perf_events (whatever the follow-on to perfmon2 is called)
> have support for the CPU used in the device? (Assuming it even has a
> PMU to be queried in the first place)

Yes. Don't think it's enabled. It is running flat out, according to top.

>> That said, I understand that people like their statistics, and me,
>> I'm trying to make split-tcp work better, ultimately, one day....
>>
>> I'm going to rerun this without the fw rules next.
>
> It would be interesting to see if the csum time goes away. Long ago
> and far away when I was beating on a 32-core system with aggregate
> netperf TCP_RR and enabling or not FW rules, conntrack had a
> non-trivial effect indeed on performance.

Stays about the same. iptables time drops. How do I disable conntrack?
Don't you only really need it for nat?

> http://markmail.org/message/exjtzel7vq2ugt66#query:netdev%20conntrack%20rick%20jones%2032%20netperf+page:1+mid:s5v5kylvmlfrpb7a+state:results
>
> I think will get to the start of that thread. The subject is '32 core
> net-next stack/netfilter "scaling"'
>
> rick jones

--
Dave Täht
SKYPE: davetaht
US Tel: 1-239-829-5608
http://the-edge.blogspot.com
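[Editorial aside, not part of the thread: on the "how to disable conntrack" question - conntrack is required for NAT and stateful (-m state/conntrack) rules, but traffic needing neither can be exempted via the raw table's NOTRACK target, which marks packets to bypass tracking before it runs. A sketch; eth0.1 is a placeholder for the internal LAN interface name, and any NAT'd path must remain tracked:]

```shell
# Mark LAN-side packets so they bypass connection tracking entirely.
# eth0.1 is a hypothetical interface name - substitute the real one.
iptables -t raw -A PREROUTING -i eth0.1 -j NOTRACK
iptables -t raw -A OUTPUT -o eth0.1 -j NOTRACK
```

[Firewall configuration fragment; it needs root and a netfilter-enabled kernel, so it is shown untested here.]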
* Re: oprofiling is much saner looking now with rc6-smoketest
From: Dave Taht @ 2011-08-31 1:58 UTC (permalink / raw)
To: Rick Jones; +Cc: bloat-devel

I have put the current rc6 smoketest up at:

http://huchra.bufferbloat.net/~cero1/rc6-smoketest/

So far it's proving very stable. Wireless performance is excellent and
wired performance dramatically improved. No crash bugs thus far, though
I had a scare...

For the final rc6, which I hope to have done by friday, I'm in the
process of cleanly re-assembling the patch set (sorry, the sources are
a bit of a mess at present). For this rc, I'm hoping that a new
iptables lands, in particular, and I have numerous other little things
in the queue to sort out.

All that said, getting oprofile running is not hard, and I do
appreciate smoke testers helping out!!! as I don't think I'll be able
to get another release candidate done before linux plumbers.

install the correct image on your router from the above via web
  interface or sysupgrade -n
reboot
edit /etc/opkg.conf to have that url in it
opkg update
opkg install oprofile
cd /tmp
mkdir /tmp/oprofile
wget http://huchra.bufferbloat.net/~d/rc6-smoke-captures/vmlinux
opcontrol --vmlinux=/tmp/vmlinux --session-dir=/tmp/oprofile
  (saving profile data to flash is a bad idea)

opcontrol --start
# do your testing
opcontrol --dump

opreport -c # or whatever options you like.

--
Dave Täht
SKYPE: davetaht
US Tel: 1-239-829-5608
http://the-edge.blogspot.com
* Re: oprofiling is much saner looking now with rc6-smoketest
From: Dave Taht @ 2011-08-31 3:28 UTC (permalink / raw)
To: Rick Jones; +Cc: bloat-devel

I took a little more time out to play with netperf at these extreme
performance values, while puzzled about the performance knee observed
midway through the previous tests.

The three test runs this evening (and captures!) are up at:

http://huchra.bufferbloat.net/~d/rc6-smoke-captures/

For test 3, I rebooted the router into its default tx ring (64), and
set a txqueuelen of 128, running cubic... Measured throughput was
mildly better (admittedly on a fresh boot, oprofile not even loaded) at
229Mbit, and we didn't have a drop-off at all, so I'm still chasing
that...

What I found interesting was the 10 second periodicity of the
drop-offs. My assumption is that this is a timer being fired from
somewhere (netperf?) that blocks the transmission...

http://huchra.bufferbloat.net/~d/rc6-smoke-captures/txqueuelen128and10seconddropcycle.png

Test 4 will repeat the above sans oprofile, with the current default
cerowrt settings for dma tx (4) and txqueuelen 8. If I get to it
tonight.

--
Dave Täht
SKYPE: davetaht
US Tel: 1-239-829-5608
http://the-edge.blogspot.com
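[Editorial aside, not part of the thread: for reference, the two queues being varied in these tests are normally adjusted along these lines. A sketch; eth0 stands in for the actual interface name, and ethtool -G only works where the driver exposes its ring sizes:]

```shell
# DMA tx ring size (driver-dependent; many drivers don't support -G)
ethtool -G eth0 tx 64

# qdisc transmit queue length
ip link set dev eth0 txqueuelen 128
```

[Interface configuration fragment; it needs root and real hardware, so it is shown untested here.]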
* Re: oprofiling is much saner looking now with rc6-smoketest
From: Rick Jones @ 2011-08-31 16:19 UTC (permalink / raw)
To: Dave Taht; +Cc: bloat-devel

On 08/30/2011 08:28 PM, Dave Taht wrote:
> What I found interesting was the 10 second periodicity of the
> drop-offs. My assumption is that this is a timer being fired from
> somewhere (netperf?) that blocks the transmission...

The only timer that netperf would fire off in the middle of a run would
be if netperf were ./configure'd with --enable-intervals and one set a
burst interval of 10 seconds with some combination of the global
-b <burst length> and -w <burst interval> options.

happy benchmarking,

rick jones
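[Editorial aside, not part of the thread: a paced run of the sort Rick describes might look like the following. This is an assumed invocation, not one from the thread - it requires a netperf built with --enable-intervals, a netserver already running on the far end, the host address is a placeholder, and the units of -w vary by netperf version (commonly milliseconds):]

```shell
# Hypothetical paced netperf run: bursts of 16 transactions with a
# 10-second inter-burst wait, against a placeholder host address.
# Only meaningful if netperf was ./configure'd with --enable-intervals.
netperf -H 172.29.1.33 -t TCP_RR -l 60 -b 16 -w 10000
```

[Command-line fragment; it needs a live netserver peer, so it is shown untested here.]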
* Re: oprofiling is much saner looking now with rc6-smoketest
From: Rick Jones @ 2011-08-31 15:55 UTC (permalink / raw)
To: Dave Taht; +Cc: bloat-devel

>> If this is inbound, and there is just plain checksumming and not
>> anything funny from conntrack, I would have expected checksum to be
>> much larger than copy. Checksum (in the inbound direction) will take
>> the cache misses and the copy would not. Unless... the data cache of
>> the processor is getting completely trashed - say from the netserver
>> running on the router not keeping up with the inbound data fully and
>> so the copy gets "far away" from the checksum verification.
>
> 220Mbit isn't good enough for ya? Previous tests ran at about 140Mbit,
> but due to some major optimizations by felix to fix a bunch of
> mis-alignment issues. Through the router, I've seen 260Mbit - which is
> perilously close to the speed that I can drive it at from the test
> boxes.

It is all a question of context. The last time I was in a context where
220 Mbit/s was high speed was when 100BT first shipped, or perhaps FDDI
before that :)

>> Does perf/perf_events (whatever the follow-on to perfmon2 is called)
>> have support for the CPU used in the device? (Assuming it even has a
>> PMU to be queried in the first place)
>
> Yes. Don't think it's enabled. It is running flat out, according to top.

Well, flat-out as far as the basic OS utilities can tell. Stalled
hardware manifests as CPU time consumed in something like top, even
though the processor may be sitting "idle" (in its context), twiddling
its thumbs waiting on cache misses. Hence the question about PMU
support.

rick jones
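[Editorial aside, not part of the thread: where PMU support does exist, Rick's busy-vs-stalled distinction can be measured directly. An assumed setup, not something run in this thread - event availability and names vary by CPU, and a MIPS 24K needs a kernel with its PMU driver enabled:]

```shell
# System-wide counts for 10 seconds. A low instructions-per-cycle
# ratio together with high cache-misses suggests the CPU that top
# reports as "busy" is really stalled waiting on memory.
perf stat -a -e cycles,instructions,cache-misses sleep 10
```

[Measurement command fragment; it needs root and PMU hardware, so it is shown untested here.]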
* Re: oprofiling is much saner looking now with rc6-smoketest
From: Dave Taht @ 2011-08-31 1:41 UTC (permalink / raw)
To: bloat-devel

[-- Attachment #1: Type: text/plain, Size: 2424 bytes --]

This is the same test, repeated, sans firewall rules - all iptables
rules cleared. I get about 220Mbit/sec without oprofile running, and
208 with it running (vs about 190Mbit with the iptables rules in the
previous run).

As noted earlier, this is with netperf running on the router itself,
with web10g patched in. Web10g supplies some interesting statistics
(attached), and I have tcptrace/xplot.org screenshots of the previous
run here:

http://huchra.bufferbloat.net/~d/rc6-smoke-captures/

That is also interesting. There is a knee at 30 seconds, and other
somewhat odd-looking behavior when you zoom in.

I further note that the laptop driving this test (via a gige pcmcia
card) has a default txqueuelen of 1000 - I don't presently know the
length of its dma tx ring - and is running the std ubuntu 11.4 kernel
(2.6.38-11-generic), with ecn, sack, and dsack enabled. The best
performance I've got from the laptop to a pentium 4 box was 290Mbit,
so I note that I'm pretty happy with 220Mbit, OK!

I wanted to have a baseline value before I started fiddling with vlan
and dscp stuff....
CPU: MIPS 24K, speed 0 MHz (estimated)
Counted INSTRUCTIONS events (Instructions completed) with a unit mask
of 0x00 (No unit mask) count 100000
samples  %        app name      symbol name
-------------------------------------------------------------------------------
17141    14.8045  vmlinux       csum_partial
  17141  100.000  vmlinux       csum_partial [self]
-------------------------------------------------------------------------------
17024    14.7035  vmlinux       __copy_user
  17024  100.000  vmlinux       __copy_user [self]
-------------------------------------------------------------------------------
 8888     7.6765  nf_conntrack  /nf_conntrack
  8888   100.000  nf_conntrack  /nf_conntrack [self]
-------------------------------------------------------------------------------
 4139     3.5748  vmlinux       __do_softirq
  4139   100.000  vmlinux       __do_softirq [self]
-------------------------------------------------------------------------------
 4055     3.5023  ip_tables     /ip_tables
  4055   100.000  ip_tables     /ip_tables [self]

--
Dave Täht
SKYPE: davetaht
US Tel: 1-239-829-5608
http://the-edge.blogspot.com

[-- Attachment #2: estats.stuff.txt --]
[-- Type: text/plain, Size: 11288 bytes --]

Connection 64 (172.29.1.33_22 172.29.1.123_51581)
LocalAddressType: 1  LocalAddress: 172.29.1.33  LocalPort: 22  RemAddressType: 1
RemAddress: 172.29.1.123  RemPort: 51581  SegsOut: 2668  DataSegsOut: 2668
DataOctetsOut: 493373  SegsRetrans: 3  OctetsRetrans: 208  SegsIn: 4272
DataSegsIn: 1885  DataOctetsIn: 93215  ElapsedSecs: 6618  ElapsedMicroSecs: 723887
CurMSS: 1460  PipeSize: 0  MaxPipeSize: 4380  SmoothedRTT: 10
CurRTO: 210  CongSignals: 0  CurCwnd: 8760  CurSsthresh: 4294967295
Timeouts: 0  CurRwinSent: 16888  MaxRwinSent: 16888  ZeroRwinSent: 0
CurRwinRcvd: 293248  MaxRwinRcvd: 293248  ZeroRwinRcvd: 0  SndLimTransRwin: 0
SndLimTransCwnd: 1  SndLimTransSnd: 2  SndLimTimeRwin: 0  SndLimTimeCwnd: 400477716
SndLimTimeSnd: 4073612272  SendStall: 0  RetranThresh: 3  NonRecovDAEpisodes: 1
SumOctetsReordered: 256  NonRecovDA: 0  SampleRTT: 0  RTTVar: 50
MaxRTT: 40  MinRTT: 0  SumRTT: 9160  CountRTT: 2507
MaxRTO: 230  MinRTO: 210  IpTtl: 64  IpTosIn: 16
IpTosOut: 0  PreCongSumCwnd: 0  PreCongSumRTT: 0  PostCongSumRTT: 0
PostCongCountRTT: 0  ECNsignals: 0  DupAckEpisodes: 0  RcvRTT: 3134114
DupAcksOut: 0  CERcvd: 0  ECESent: 0  ActiveOpen: 0
MSSSent: 1460  MSSRcvd: 1460  WinScaleSent: 1  WinScaleRcvd: 7
TimeStamps: 2  ECN: 3  WillSendSACK: 1  WillUseSACK: 1
State: 1953653102  Nagle: 1  MaxSsCwnd: 14600  MaxCaCwnd: 10220
MaxSsthresh: 5840  MinSsthresh: 2920  InRecovery: 2  DupAcksIn: 0
SpuriousFrDetected: 0  SpuriousRtoDetected: 0  SoftErrors: 0  SoftErrorReason: 0
SlowStart: 2  CongAvoid: 41  OtherReductions: 0  CongOverCount: 0
FastRetran: 0  SubsequentTimeouts: 0  CurTimeoutCount: 0  AbruptTimeouts: 0
SACKsRcvd: 0  SACKBlocksRcvd: 0  DSACKDups: 0  MaxMSS: 1460
MinMSS: 1440  SndInitial: 2230931079  RecInitial: 1469570772  CurRetxQueue: 0
MaxRetxQueue: 0  CurReasmQueue: 0  MaxReasmQueue: 0  SndUna: 2231370872
SndNxt: 2231370872  SndMax: 2231370872  ThruOctetsAcked: 439793  RcvNxt: 1469663891
ThruOctetsReceived: 93119  CurAppWQueue: 0  MaxAppWQueue: 0  CurAppRQueue: 0
MaxAppRQueue: 1144  LimCwnd: 4294965836  LimSsthresh: 0  LimRwin: 64075
LimMSS: 95682560  OtherReductionsCV: 0  OtherReductionsCM: 0  StartTimeSecs: 1314747615
StartTimeMicroSecs: 690002  Sndbuf: 16384  Rcvbuf: 87380

Connection 56 (172.29.1.97_22 172.29.1.123_52588)
LocalAddressType: 1  LocalAddress: 172.29.1.97  LocalPort: 22  RemAddressType: 1
RemAddress: 172.29.1.123  RemPort: 52588  SegsOut: 1241  DataSegsOut: 1241
DataOctetsOut: 110645  SegsRetrans: 0  OctetsRetrans: 0  SegsIn: 2000
DataSegsIn: 914  DataOctetsIn: 45711  ElapsedSecs: 4990  ElapsedMicroSecs: 629033
CurMSS: 1460  PipeSize: 0  MaxPipeSize: 1460  SmoothedRTT: 10
CurRTO: 210  CongSignals: 0  CurCwnd: 14600  CurSsthresh: 4294965836
Timeouts: 0  CurRwinSent: 16888  MaxRwinSent: 16888  ZeroRwinSent: 0
CurRwinRcvd: 64128  MaxRwinRcvd: 64128  ZeroRwinRcvd: 0
SndLimTransRwin : 0
SndLimTransCwnd : 0
SndLimTransSnd : 1
SndLimTimeRwin : 0
SndLimTimeCwnd : 0
SndLimTimeSnd : 4171379644
SendStall : 0
RetranThresh : 3
NonRecovDAEpisodes : 0
SumOctetsReordered : 0
NonRecovDA : 0
SampleRTT : 0
RTTVar : 50
MaxRTT : 120
MinRTT : 0
SumRTT : 3280
CountRTT : 1114
MaxRTO : 230
MinRTO : 210
IpTtl : 64
IpTosIn : 16
IpTosOut : 0
PreCongSumCwnd : 0
PreCongSumRTT : 0
PostCongSumRTT : 0
PostCongCountRTT : 0
ECNsignals : 0
DupAckEpisodes : 0
RcvRTT : 1532633
DupAcksOut : 0
CERcvd : 0
ECESent : 0
ActiveOpen : 0
MSSSent : 1460
MSSRcvd : 1460
WinScaleSent : 1
WinScaleRcvd : 7
TimeStamps : 2
ECN : 3
WillSendSACK : 1
WillUseSACK : 1
State : 1953653102
Nagle : 1
MaxSsCwnd : 14600
MaxCaCwnd : 0
MaxSsthresh : 0
MinSsthresh : 4294967295
InRecovery : 2
DupAcksIn : 0
SpuriousFrDetected : 0
SpuriousRtoDetected : 0
SoftErrors : 0
SoftErrorReason : 0
SlowStart : 0
CongAvoid : 0
OtherReductions : 0
CongOverCount : 0
FastRetran : 0
SubsequentTimeouts : 0
CurTimeoutCount : 0
AbruptTimeouts : 0
SACKsRcvd : 0
SACKBlocksRcvd : 0
DSACKDups : 0
MaxMSS : 1460
MinMSS : 1440
SndInitial : 1838946708
RecInitial : 749866429
CurRetxQueue : 0
MaxRetxQueue : 0
CurReasmQueue : 0
MaxReasmQueue : 0
SndUna : 1839032533
SndNxt : 1839032533
SndMax : 1839032533
ThruOctetsAcked : 85825
RcvNxt : 749912140
ThruOctetsReceived : 45711
CurAppWQueue : 0
MaxAppWQueue : 0
CurAppRQueue : 0
MaxAppRQueue : 1144
LimCwnd : 4294965836
LimSsthresh : 0
LimRwin : 64075
LimMSS : 95682560
OtherReductionsCV : 0
OtherReductionsCM : 0
StartTimeSecs : 1314742638
StartTimeMicroSecs : 218763
Sndbuf : 16384
Rcvbuf : 87380

Connection 0 (172.29.1.97_22 172.29.1.123_55029)
LocalAddressType : 1
LocalAddress : 172.29.1.97
LocalPort : 22
RemAddressType : 1
RemAddress : 172.29.1.123
RemPort : 55029
SegsOut : 2927
DataSegsOut : 2927
DataOctetsOut : 563021
SegsRetrans : 0
OctetsRetrans : 0
SegsIn : 4394
DataSegsIn : 1820
DataOctetsIn : 93439
ElapsedSecs : 65542
ElapsedMicroSecs : 507312
CurMSS : 1460
PipeSize : 0
MaxPipeSize : 4380
SmoothedRTT : 20
CurRTO : 220
CongSignals : 0
CurCwnd : 14600
CurSsthresh : 4294965836
Timeouts : 0
CurRwinSent : 22712
MaxRwinSent : 22712
ZeroRwinSent : 0
CurRwinRcvd : 357760
MaxRwinRcvd : 357760
ZeroRwinRcvd : 0
SndLimTransRwin : 0
SndLimTransCwnd : 0
SndLimTransSnd : 1
SndLimTimeRwin : 0
SndLimTimeCwnd : 0
SndLimTimeSnd : 2439344612
SendStall : 0
RetranThresh : 3
NonRecovDAEpisodes : 0
SumOctetsReordered : 0
NonRecovDA : 0
SampleRTT : 0
RTTVar : 50
MaxRTT : 50
MinRTT : 0
SumRTT : 12590
CountRTT : 2722
MaxRTO : 230
MinRTO : 210
IpTtl : 64
IpTosIn : 16
IpTosOut : 0
PreCongSumCwnd : 0
PreCongSumRTT : 0
PostCongSumRTT : 0
PostCongCountRTT : 0
ECNsignals : 0
DupAckEpisodes : 0
RcvRTT : 644526
DupAcksOut : 0
CERcvd : 0
ECESent : 0
ActiveOpen : 0
MSSSent : 1460
MSSRcvd : 1460
WinScaleSent : 1
WinScaleRcvd : 7
TimeStamps : 2
ECN : 3
WillSendSACK : 1
WillUseSACK : 1
State : 1953653102
Nagle : 1
MaxSsCwnd : 14600
MaxCaCwnd : 0
MaxSsthresh : 0
MinSsthresh : 4294967295
InRecovery : 2
DupAcksIn : 7
SpuriousFrDetected : 0
SpuriousRtoDetected : 0
SoftErrors : 7
SoftErrorReason : 1
SlowStart : 0
CongAvoid : 0
OtherReductions : 0
CongOverCount : 0
FastRetran : 0
SubsequentTimeouts : 0
CurTimeoutCount : 0
AbruptTimeouts : 0
SACKsRcvd : 0
SACKBlocksRcvd : 0
DSACKDups : 0
MaxMSS : 1460
MinMSS : 1440
SndInitial : 1018333932
RecInitial : 1871301081
CurRetxQueue : 0
MaxRetxQueue : 0
CurReasmQueue : 0
MaxReasmQueue : 0
SndUna : 1018838413
SndNxt : 1018838413
SndMax : 1018838413
ThruOctetsAcked : 504481
RcvNxt : 1871394520
ThruOctetsReceived : 93439
CurAppWQueue : 0
MaxAppWQueue : 0
CurAppRQueue : 0
MaxAppRQueue : 1456
LimCwnd : 4294965836
LimSsthresh : 0
LimRwin : 64075
LimMSS : 95682560
OtherReductionsCV : 0
OtherReductionsCM : 0
StartTimeSecs : 1314684283
StartTimeMicroSecs : 261134
Sndbuf : 16384
Rcvbuf : 87380
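Dumps like the one above are easy to post-process: each counter is a "Name : value" pair, and several useful figures are derived rather than reported directly (mean RTT is SumRTT/CountRTT; goodput is ThruOctetsAcked over the elapsed time). A minimal sketch of that cross-check, using the Connection 64 counters from the attachment (the parser here is illustrative, not part of web10g itself, and assumes integer-valued counters):

```python
import re

def parse_estats(text):
    # Pull "Name : value" integer pairs out of a web10g-style text dump.
    return {m.group(1): int(m.group(2))
            for m in re.finditer(r'(\w+)\s*:\s*(\d+)', text)}

# Counters copied from Connection 64 in the attachment above.
sample = """
SumRTT : 9160 CountRTT : 2507
ElapsedSecs : 6618 ElapsedMicroSecs : 723887
ThruOctetsAcked : 439793
"""
s = parse_estats(sample)

avg_rtt_ms = s['SumRTT'] / s['CountRTT']            # mean of the RTT samples, in ms
elapsed = s['ElapsedSecs'] + s['ElapsedMicroSecs'] / 1e6
goodput_bps = 8 * s['ThruOctetsAcked'] / elapsed    # acked payload bits per second

print(f"avg RTT {avg_rtt_ms:.2f} ms, goodput {goodput_bps:.0f} bit/s")
```

For this mostly idle ssh session that works out to an average RTT of a few milliseconds and goodput in the hundreds of bits per second, which is consistent with the small SmoothedRTT and long ElapsedSecs reported above.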
end of thread, other threads:[~2011-08-31 16:19 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-08-31  0:32 oprofiling is much saner looking now with rc6-smoketest Dave Taht
2011-08-31  1:01 ` Rick Jones
2011-08-31  1:10 ` Simon Barber
2011-08-31  1:20 ` Simon Barber
2011-08-31  1:45 ` Dave Taht
2011-08-31  1:58 ` Dave Taht
2011-08-31  3:28 ` Dave Taht
2011-08-31 16:19 ` Rick Jones
2011-08-31 15:55 ` Rick Jones
2011-08-31  1:41 ` Dave Taht