Depends on how short a timescale you're talking about. Shared global state that is read and written very quickly by multiple threads is bad enough on a single-package system, but once you get to something like an AMD Ryzen or a NUMA machine, shared global state becomes really expensive. Accuracy is expensive. Loosen the accuracy and gain scalability.
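To make the accuracy/scalability trade concrete, here is a rough sketch of a counter where each thread accumulates locally and only touches the shared value once per batch. Everything in it (`BatchedCounter`, `batch`, `approximate_total`) is a made-up name for illustration and has nothing to do with cake's actual code; it only shows the general idea of accepting a bounded error in exchange for fewer writes to shared state.

```python
import threading


class BatchedCounter:
    """Illustrative only: shared total that is updated once per batch."""

    def __init__(self, batch: int):
        self.batch = batch                 # flush threshold (hypothetical knob)
        self._lock = threading.Lock()      # protects the shared total
        self._global = 0                   # the contended shared state
        self._local = threading.local()    # per-thread pending amount

    def add(self, n: int) -> None:
        # Accumulate locally; no shared write on the common path.
        pending = getattr(self._local, "pending", 0) + n
        if pending >= self.batch:
            # Only now touch the shared state, once per batch.
            with self._lock:
                self._global += pending
            pending = 0
        self._local.pending = pending

    def approximate_total(self) -> int:
        # May lag the true total by less than `batch` per thread.
        with self._lock:
            return self._global
```

With `batch=10`, 25 single-unit adds from one thread leave the shared total at 20 and 5 still pending locally. A larger batch means fewer shared writes and a larger possible error.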
I would be interested in the pseudo-code, or a high-level description, of what state needs to be shared and how that state is used.
I was also thinking more of some hybrid. Instead of a "token" representing a bucketed amount of bandwidth that can be used immediately, I was thinking more of a "future" of bandwidth that could be used. So instead of saying "here's a token of bandwidth", you have each core doing its own deficit bandwidth shaping, but when a token is received, a core can temporarily increase its assigned shaping bandwidth. If I remember correctly, cake already supports having its bandwidth changed on the fly.
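A minimal sketch of what I mean, in Python rather than kernel C to keep it short. All names here (`BandwidthFuture`, `CoreShaper`, `grant`, `try_send`) are hypothetical and not taken from cake; the time-based scheduling is a simplification I am assuming for illustration, and how futures get handed out between cores is left out entirely.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BandwidthFuture:
    """Hypothetical token: permission to shape at a higher rate for a while."""
    extra_rate: float   # additional rate granted, in bytes per second
    expires_at: float   # time in seconds at which the grant lapses


class CoreShaper:
    """Per-core shaper whose rate can be boosted temporarily (illustrative)."""

    def __init__(self, base_rate: float, now: float = 0.0):
        self.base_rate = base_rate                   # bytes per second
        self.boost: Optional[BandwidthFuture] = None
        self.next_send_time = now                    # earliest next departure

    def grant(self, future: BandwidthFuture) -> None:
        # Receiving a future raises this core's rate; no other core is touched.
        self.boost = future

    def current_rate(self, now: float) -> float:
        # Drop the boost once it has expired, then report the effective rate.
        if self.boost is not None and now >= self.boost.expires_at:
            self.boost = None
        extra = self.boost.extra_rate if self.boost is not None else 0.0
        return self.base_rate + extra

    def try_send(self, packet_len: int, now: float) -> bool:
        # Purely local decision: no shared state is read on the send path.
        if now < self.next_send_time:
            return False
        # Charge the packet's transmission time at the current rate.
        self.next_send_time = now + packet_len / self.current_rate(now)
        return True
```

The point is that the send path only reads per-core state. The shared coordination happens at the much slower rate at which futures are issued, which is where the accuracy gets loosened.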
Of course, it may be simpler to say cake is meant to be used on no more than 8 cores, on a non-NUMA system where all cores share a low-latency cache.