[Cake] [Bloat] Little's Law mea culpa, but not invalidating my main point
Jason_Livingood at comcast.com
Mon Jul 12 09:46:22 EDT 2021
> 2) Users are pissed off, because they clicked on a web page, and got nothing back. They retry on their screen, or they try another site. Meanwhile, the underlying TCP connection remains there, pumping the network full of more packets on that old path, which is still backed up with packets that haven't been delivered that are sitting in queues.
Agree. I’ve experienced that as utilization of a network segment or supporting network systems (e.g. DNS) increases, you may see very small delay creep in - but not much as things are stable until they are *quite suddenly* not so. At that stability inflection point you immediately & dramatically fall off a cliff, which is then exacerbated by what you note here – user and machine-based retries/retransmissions that drives a huge increase in traffic. The solution has typically been throwing massive new capacity at it until the storm recedes.
> I should say that most operators, and especially ATT in this case, do not measure end-to-end latency. Instead they use Little's Lemma to query routers for their current throughput in bits per second, and calculate latency as if Little's Lemma applied.
IMO network operators views/practices vary widely & have been evolving quite a bit in recent years. Yes, it used to be all about capacity utilization metrics but I think that is changing. In my day job, we run E2E latency tests (among others) to CPE and the distribution is a lot more instructive than the mean/median to continuously improving the network experience.
> And management responds, Hooray! Because utilization of 100% of their hardware is their investors' metric of maximizing profits. The hardware they are operating is fully utilized. No waste! And users are happy because no packets have been dropped!
Well, I hope it wasn’t 100% utilization meant they were ‘green’ on their network KPIs. ;-) Ha. But I think you are correct that a network engineering team would have been measured by how well they kept ahead of utilization/demand & network capacity decisions driven in large part by utilization trends. In a lot of networks I suspect an informal rule of thumb arose that things got a little funny once p98 utilization got to around 94-95% of link capacity – so backup from there to figure out when you need to trigger automatic capacity augments to avoid that. While I do not think managing via utilization in that way is incorrect, ISTM it’s mostly being used as the measure is an indirect proxy for end user QoE. I think latency/delay is becoming seen to be as important certainly, if not a more direct proxy for end user QoE. This is all still evolving and I have to say is a super interesting & fun thing to work on. :-)
-------------- next part --------------
An HTML attachment was scrubbed...
More information about the Cake