There should also be a way to track the "ack backlog". By that I mean: if you can see that the packets being acked were consistently sent 10 seconds ago, you should be able to determine that you are likely (10 seconds - real RTT - processing delay) deep in buffers somewhere. If you back off on the number of packets in flight and that ack backlog doesn't change much, then the congestion is probably not related to your specific flow. It is likely due to aggregate congestion somewhere in the path: a congested peering point, a POP, a busy distant end, whatever. But if backing off DOES significantly reduce the ack backlog (acks are now arriving for packets sent only 5 seconds ago rather than 10), then you have a notion that the flow is a significant contributor to the total backlog. Exactly what one would do with that information is the question, I guess.
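
To make the arithmetic concrete, here is a rough Python sketch of that estimate and the back-off test. The function names and the 25% threshold are my own assumptions for illustration, not part of any real TCP stack; the 50 ms RTT is borrowed from the original message below.

```python
def ack_backlog(ack_age_s, base_rtt_s, processing_delay_s=0.0):
    """Estimated time spent sitting in buffers, in seconds.

    ack_age_s: how long ago the packets now being acked were sent.
    base_rtt_s: the "real" (unloaded) RTT of the path.
    """
    return max(0.0, ack_age_s - base_rtt_s - processing_delay_s)


def flow_is_contributor(backlog_before_s, backlog_after_s, threshold=0.25):
    """True if backing off cut the backlog by at least `threshold` (a fraction).

    The 0.25 default is an arbitrary assumption, not a tuned value.
    """
    if backlog_before_s <= 0.0:
        return False  # nothing queued, so nothing to attribute
    reduction = (backlog_before_s - backlog_after_s) / backlog_before_s
    return reduction >= threshold


# Acks were for packets sent 10 s ago; after backing off, 5 s ago.
before = ack_backlog(10.0, 0.05)
after = ack_backlog(5.0, 0.05)
print(flow_is_contributor(before, after))
```

If the backlog barely moves after backing off, the same function returns False, which is the "aggregate congestion somewhere in the path" case.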

Is the backlog consistent across all flows or just one? If it is consistent across all flows, then the source of buffering is very close to you. If it is wildly different, it is likely somewhere in the path of that particular flow. And looking at the document linked concerning CDG, I see they take that into account: if I back off but the RTT doesn't decrease, then my flow is not a significant contributor to the delay. The problem with the algorithm, to my mind, is that finding the size of "the queue" for any particular flow is practically impossible, because each flow will have its own specific amount of buffering along the path. It gets harder with asymmetric routing, where the reply path might not be the same as the send path (a multihomed transit provider, or an end node sending reply traffic over a different peer than the one the traffic in the other direction arrives on), or (worse) where ECMP is being done across peers on a packet-by-packet rather than flow-based basis. At that point it is impossible to really profile the path.

So if I were designing such an algorithm, I would try to determine: Is the delay consistent across all flows? Is the delay consistent even within a single flow? When I reduce my rate, does the backlog drop? Exactly what I would do with that information would require more thought.
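
A rough Python sketch of the first two questions, assuming you already have per-flow delay samples in seconds. The `diagnose` name, the use of the coefficient of variation, and the 20% tolerance are all assumptions of mine, picked only to illustrate the idea.

```python
from statistics import mean, pstdev


def spread(samples):
    """Coefficient of variation: standard deviation relative to the mean."""
    m = mean(samples)
    return pstdev(samples) / m if m > 0 else 0.0


def diagnose(flows, tolerance=0.2):
    """flows maps a flow name to a non-empty list of delay samples (seconds).

    tolerance is a made-up cutoff for what counts as "consistent".
    """
    per_flow_mean = [mean(samples) for samples in flows.values()]
    return {
        # Similar delay on every flow suggests the buffering is close to you.
        "consistent_across_flows": spread(per_flow_mean) <= tolerance,
        # Steady delay inside one flow suggests a standing queue, not bursts.
        "consistent_within_flow": {
            name: spread(samples) <= tolerance
            for name, samples in flows.items()
        },
    }


report = diagnose({
    "flow_a": [10.0, 10.1, 9.9],
    "flow_b": [0.1, 0.1, 0.1],
})
print(report["consistent_across_flows"])
```

Here the two flows see very different delays, so the buffering is more likely somewhere on flow_a's particular path than near the sender.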

On Sun, Jun 21, 2015 at 9:19 AM, Benjamin Cronce <bcronce@gmail.com> wrote:

Just a random Sunday morning thought that has probably been thought of before, but I can't recall hearing it.

My understanding of most TCP congestion control algorithms is that they primarily watch for drops, but drops are indicated by the receiving party via ACKs. The issue with this is that TCP keeps pushing more data into the window until a drop is signaled, even if the rate received has not increased. What if the sending TCP also monitored the rate received and backed off cramming more segments into a window when the received rate does not increase?

Two things are needed to measure this: the RTT, which is already part of TCP statistics, and the rate at which bytes are ACKed. If you double the number of segments being sent but, in a time frame relative to the RTT, you do not see a meaningful increase in the rate at which bytes are being ACKed, you may want to back off.
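
As a rough Python sketch of that check: the function name, its parameters, and the 25% `min_gain` cutoff are hypothetical, chosen only to illustrate the idea. The ACK rates are assumed to be bytes ACKed per second, each averaged over a window of a few RTTs.

```python
def should_back_off(inflight_before, inflight_after,
                    ack_rate_before, ack_rate_after, min_gain=0.25):
    """Back off if sending more did not meaningfully raise the ACKed byte rate.

    min_gain is an assumed cutoff: the ACK rate must grow by at least this
    fraction of the growth in segments in flight to count as "meaningful".
    """
    if inflight_after <= inflight_before or ack_rate_before <= 0:
        return False  # not probing upward, or no baseline to compare against
    inflight_growth = inflight_after / inflight_before - 1.0
    rate_growth = ack_rate_after / ack_rate_before - 1.0
    return rate_growth < min_gain * inflight_growth


# Doubled the segments in flight, ACK rate went up by only 2%.
print(should_back_off(10, 20, 1_000_000, 1_020_000))
```

If doubling the segments in flight had nearly doubled the ACK rate instead, the function would return False and the sender could keep probing.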

It just seems to me that if you have a 50ms RTT and 10 seconds of bufferbloat, TCP is cramming data down the path with no care in the world about how quickly data is actually getting ACKed. It's just waiting for the first segment to get dropped, which would never happen in an infinitely buffered network.

TCP should be able to keep state that tracks the minimum RTT and maximum ACK rate. Between these two, it should not be able to go over the max path rate except when attempting to probe for a new max or min. Min RTT is probably a good target because path latency should be relatively static, however path free-bandwidth is not static. The desirable number of segments in flight would need to change but would be bounded by the max.
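
A minimal Python sketch of that state, assuming the cap on data in flight is simply max ACK rate times min RTT (the bandwidth-delay product). The class name, method names, and `probe_gain` parameter are hypothetical.

```python
class PathEstimate:
    """Tracks the minimum RTT and maximum ACK rate seen on one connection."""

    def __init__(self):
        self.min_rtt_s = None      # seconds; None until the first sample
        self.max_ack_rate = 0.0    # bytes ACKed per second

    def update(self, rtt_s, ack_rate):
        """Fold in one RTT sample and one ACK-rate sample."""
        if self.min_rtt_s is None or rtt_s < self.min_rtt_s:
            self.min_rtt_s = rtt_s
        if ack_rate > self.max_ack_rate:
            self.max_ack_rate = ack_rate

    def inflight_cap_bytes(self, probe_gain=1.0):
        """Upper bound on bytes in flight; probe_gain > 1 allows probing."""
        if self.min_rtt_s is None:
            return None  # no samples yet
        return self.max_ack_rate * self.min_rtt_s * probe_gain


path = PathEstimate()
path.update(0.060, 1_000_000)
path.update(0.050, 1_250_000)
path.update(0.500, 900_000)   # bloated sample: changes neither estimate
print(path.inflight_cap_bytes())
```

This sketch never ages out its estimates. Since free bandwidth on the path is not static, a real version would need to expire the max ACK rate over some window.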

Of course, Nagle-type algorithms can mess with this, because when ACKs occur is no longer based entirely on when a segment is received, but also on some additional amount of time. If you assume that Nagle will coalesce N segments into a single ACK, then you need to add to the RTT the time, at the current PPS, until you expect another ACK, assuming N segments will be coalesced. This would be even more important for low-latency, low-bandwidth paths. Coalescing information could be assumed, negotiated, or inferred. Negotiated would be best.

Anyway, just some random Sunday thoughts.

_______________________________________________
Bloat mailing list
Bloat@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/bloat