<div dir="auto"></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">---------- Forwarded message ---------<br>From: <strong class="gmail_sendername" dir="auto">Thomas Gleixner</strong> <span dir="auto"><<a href="mailto:tglx@linutronix.de">tglx@linutronix.de</a>></span><br>Date: Sun, Mar 5, 2023, 12:57 PM<br>Subject: Re: [PATCH 2/3] softirq: avoid spurious stalls due to need_resched()<br>To: Jakub Kicinski <<a href="mailto:kuba@kernel.org">kuba@kernel.org</a>><br>Cc:  <<a href="mailto:peterz@infradead.org">peterz@infradead.org</a>>,  <<a href="mailto:jstultz@google.com">jstultz@google.com</a>>,  <<a href="mailto:edumazet@google.com">edumazet@google.com</a>>,  <<a href="mailto:netdev@vger.kernel.org">netdev@vger.kernel.org</a>>,  <<a href="mailto:linux-kernel@vger.kernel.org">linux-kernel@vger.kernel.org</a>>, Paul E. McKenney <<a href="mailto:paulmck@kernel.org">paulmck@kernel.org</a>>, Frederic Weisbecker <<a href="mailto:frederic@kernel.org">frederic@kernel.org</a>><br></div><br><br>Jakub!<br>

<br>

On Fri, Mar 03 2023 at 13:31, Jakub Kicinski wrote:<br>

> On Fri, 03 Mar 2023 14:30:46 +0100 Thomas Gleixner wrote:<br>

>> > +          __this_cpu_write(overload_limit, jiffies + limit);  <br>

>> <br>

>> The logic of all this is non-obvious and I had to reread it 5 times to<br>

>> conclude that it is matching the intent. Please add comments.<br>

>> <br>

>> While I'm not a big fan of heuristical duct tape, this looks harmless<br>

>> enough to not end up in an endless stream of tweaking. Famous last<br>

>> words...<br>

><br>

> Would it all be more readable if I named the "overload_limit"<br>

> "overloaded_until" instead? Naming..<br>

<br>

While naming matters it wont change the 'heuristical duct tape' property<br>

of this, right?<br>

<br>

> I'll add comments, too.<br>

<br>

They are definitely appreciated, but I'd prefer to have code which is<br>

self explanatory and does at least have a notion of a halfways<br>

scientific approach to the overall issue of softirqs.<br>

<br>

The point is that softirqs are just the proliferation of an at least 50<br>

years old OS design paradigm. Back then everyhting which run in an<br>

interrupt handler was "important" and more or less allowed to hog the<br>

CPU at will.<br>

<br>

That obviously caused problems because it prevented other interrupt<br>

handlers from being served.<br>

<br>

This was attempted to work around in hardware by providing interrupt<br>

priority levels. No general purpose OS utilized that ever because there<br>

is no way to get this right. Not even on UP, unless you build a designed<br>

for the purpose "OS".<br>

<br>

Soft interrupts are not any better. They avoid the problem of stalling<br>

interrupts by moving the problem one level down to the scheduler.<br>

<br>

Granted they are a cute hack, but at the very end they are still evading<br>

the resource control mechanisms of the OS by defining their own rules:<br>

<br>

    - NET RX defaults to 2ms with the ability to override via /proc<br>

    - RCU defaults to 3ms with the ability to override via /sysfs<br>

<br>

while the "overload detection" in the core defines a hardcoded limit of<br>

2ms. Alone the above does not sum up to the core limit and most of the<br>

other soft interrupt handlers do not even have the notion of limits.<br>

<br>

That clearly does not even remotely allow to do proper coordinated<br>

resource management.<br>

<br>

Not to talk about the sillyness of the jiffy based timouts which result<br>

in a randomized granularity of 0...1/Hz as mentioned before.<br>

<br>

I'm well aware of the fact that consulting a high resolution hardware<br>

clock frequently can be slow and hurting performance, but there are well<br>

understood workarounds, aka batching, which mitigate that. <br>

<br>

There is another aspect to softirqs which makes them a horror show:<br>

<br>

  While they are conceptually seperate, at the very end they are all<br>

  lumped together and especially the network code has implicit<br>

  assumptions about that. It's simply impossible to seperate the<br>

  processing of the various soft interrupt incarnations.<br>

<br>

IOW, resource control by developer preference and coincidence of<br>

events. That truly makes an understandable and to be relied on OS.<br>

<br>

We had seperate softirq threads and per softirq serialization (except<br>

NET_RX/TX which shared) in the early days of preempt RT, which gave fine<br>

grained control. Back then the interaction between different softirqs<br>

was halfways understandable and the handful of interaction points which<br>

relied on the per CPU global BH disable were fixable with local<br>

serializations. That lasted a year or two until we stopped maintaining<br>

that because the interaction between softirqs was becoming a whack a<br>

mole game. So we gave up and enjoy the full glory of a per CPU global<br>

lock, because that's what local BH disable actually is.<br>

<br>

I completely understand that ***GB networking is a challenge, but ***GB<br>

networking does not work without applications wwich use it. Those<br>

applications are unfortunately^Wrightfully subject to the scheduler,<br>

aka. resource control.<br>

<br>

IMO evading resource control is the worst of all approaches and the<br>

amount of heuristics you can apply to mitigate that, is never going to<br>

cover even a subset of the overall application space.<br>

<br>

Just look at the memcache vs. webserver use case vs. need_resched() and<br>

then the requirements coming from the low latency audio folks.<br>

<br>

I know the usual approach to that is to add some more heuristics which<br>

are by nature supposed to fail or to add yet another 'knob'. We have<br>

already too many knobs which are not comprehensible on their own. But<br>

even if a particular knob is comprehensible there is close to zero<br>

documentation and I even claim close to zero understanding of the<br>

interaction of knobs.<br>

<br>

Just for the record. Some of our engineers are working on TSN based<br>

real-time networking which is all about latency anc accuracy. Guess how<br>

well that works with the current overall design. That's not an esoteric<br>

niche use case as low-latency TSN is not restricted to the automation<br>

space. There are quite some use cases which go there even in the high<br>

end networking space.<br>

<br>

>> But without the sched_clock() changes the actual defer time depends on<br>

>> HZ and the point in time where limit is set. That means it ranges from 0<br>

>> to 1/HZ, i.e. the 2ms defer time ends up with close to 10ms on HZ=100 in<br>

>> the worst case, which perhaps explains the 8ms+ stalls you are still<br>

>> observing. Can you test with that sched_clock change applied, i.e. the<br>

>> first two commits from<br>

>> <br>

>>   git://<a href="http://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git" rel="noreferrer noreferrer" target="_blank">git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git</a> core/softirq<br>

>> <br>

>> 59be25c466d9 ("softirq: Use sched_clock() based timeout")<br>

>> bd5a5bd77009 ("softirq: Rewrite softirq processing loop")<br>

><br>

> Those will help, but I spent some time digging into the jiffies related<br>

> warts with kprobes - while annoying they weren't a major source of wake<br>

> ups. (FWIW the jiffies noise on our workloads is due to cgroup stats<br>

> disabling IRQs for multiple ms on the timekeeping CPU).<br>

<br>

What? That's completely insane and needs to be fixed.<br>

<br>

> Here are fresh stats on why we wake up ksoftirqd on our Web workload<br>

> (collected over 100 sec):<br>

><br>

> Time exceeded:      484<br>

> Loop max run out:  6525<br>

> need_resched():   10219<br>

> (control: 17226 - number of times wakeup_process called for ksirqd)<br>

><br>

> As you can see need_resched() dominates.<br>

<br>

> Zooming into the time exceeded - we can count nanoseconds between<br>

> __do_softirq starting and the check. This is the histogram of actual<br>

> usecs as seen by BPF (AKA ktime_get_mono_fast_ns() / 1000):<br>

><br>

> [256, 512)             1 |                                                    |<br>

> [512, 1K)              0 |                                                    |<br>

> [1K, 2K)             217 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@         |<br>

> [2K, 4K)             266 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|<br>

><br>

> So yes, we can probably save ourselves ~200 wakeup with a better clock<br>

> but that's just 1.3% of the total wake ups :(<br>

<br>

Fair enough. Though that does not make our time limit handling any more<br>

consistent and we need to fix that too to handle the other issues.<br>

<br>

> Now - now about the max loop count. I ORed the pending softirqs every<br>

> time we get to the end of the loop. Looks like vast majority of the<br>

> loop counter wake ups are exclusively due to RCU:<br>

><br>

> @looped[512]: 5516<br>

<br>

If the loop counter breaks without consuming the time budget that's<br>

silly.<br>

<br>

> Where 512 is the ORed pending mask over all iterations<br>

> 512 == 1 << RCU_SOFTIRQ.<br>

><br>

> And they usually take less than 100us to consume the 10 iterations.<br>

> Histogram of usecs consumed when we run out of loop iterations:<br>

><br>

> [16, 32)               3 |                                                    |<br>

> [32, 64)            4786 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|<br>

> [64, 128)            871 |@@@@@@@@@                                           |<br>

> [128, 256)            34 |                                                    |<br>

> [256, 512)             9 |                                                    |<br>

> [512, 1K)            262 |@@                                                  |<br>

> [1K, 2K)              35 |                                                    |<br>

> [2K, 4K)               1 |                                                    |<br>

><br>

> Paul, is this expected? Is RCU not trying too hard to be nice?<br>

><br>

> # cat /sys/module/rcutree/parameters/blimit<br>

> 10<br>

><br>

> Or should we perhaps just raise the loop limit? Breaking after less <br>

> than 100usec seems excessive :(<br>

<br>

No. Can we please stop twiddling a parameter here and there and go and<br>

fix this whole problem space properly. Increasing the loop count for RCU<br>

might work for your particular usecase and cause issues in other<br>

scenarios.<br>

<br>

Btw, RCU seems to be a perfect candidate to delegate batches from softirq<br>

into a seperate scheduler controllable entity.<br>

<br>

>> So for the remaining 90ms any invocation of raise_softirq() outside of<br>

>> (soft)interrupt context, which wakes ksoftirqd again, prevents<br>

>> processing on return from interrupt until ksoftirqd gets on the CPU and<br>

>> goes back to sleep, because task_is_running() == true and the stale<br>

>> limit is not after jiffies.<br>

>> <br>

>> Probably not a big issue, but someone will notice on some weird workload<br>

>> sooner than later and the tweaking will start nevertheless. :) So maybe<br>

>> we fix it right away. :)<br>

><br>

> Hm, Paolo raised this point as well, but the overload time is strictly<br>

> to stop paying attention to the fact ksoftirqd is running.<br>

> IOW current kernels behave as if they had overload_limit of infinity.<br>

><br>

> The current code already prevents processing until ksoftirqd schedules<br>

> in, after raise_softirq() from a funky context.<br>

<br>

Correct and it does so because we are just applying duct tape over and<br>

over.<br>

<br>

That said, I have no brilliant solution for that off the top of my head,<br>

but I'm not comfortable with applying more adhoc solutions which are<br>

contrary to the efforts of e.g. the audio folks.<br>

<br>

I have some vague ideas how to approach that, but I'm traveling all of<br>

next week, so I neither will be reading much email, nor will I have time<br>

to think deeply about softirqs. I'll resume when I'm back.<br>

<br>

Thanks,<br>

<br>

        tglx<br>

</div>