This *is* commonly a problem. Look up "TCP incast".

The scenario is exactly as you describe. A distributed database makes queries over the same switch to K other nodes in order to verify the integrity of the answer. Data is served from memory, so access times are roughly the same on all K nodes, and the replies come back nearly simultaneously. If the responses are sizable, the switch output port is overwhelmed with traffic, its buffer overflows, and it drops packets. TCP's congestion algorithm then comes into play.
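
To make the effect concrete, here is a rough slot-by-slot sketch in Python. Every number in it is an assumption chosen for illustration, not a measurement:

    # Toy model of TCP incast at one switch output port.
    # Assumptions (illustrative only): K nodes answer at the same
    # instant, each sending a back-to-back burst of BURST frames;
    # the port transmits one frame per slot and queues at most
    # BUFFER frames; anything arriving to a full buffer is dropped.

    K = 40        # responding nodes (assumed)
    BURST = 10    # frames per response, roughly one initial cwnd (assumed)
    BUFFER = 64   # output buffer depth in frames (typical, per below)

    queue = 0
    dropped = 0
    for _ in range(BURST):        # each slot: K frames arrive, 1 departs
        queue += K - 1
        if queue > BUFFER:
            dropped += queue - BUFFER
            queue = BUFFER

    total = K * BURST
    print(f"{dropped} of {total} frames dropped ({dropped / total:.0%})")

Even this crude model drops the large majority of the burst, and in practice the damage is worse than the percentage suggests: flows that lose the tail of their window stall in retransmission timeout, which is what collapses goodput during incast.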

It is almost like resonance in engineering: driven at the wrong "frequency", the bridge/switch resonates and everything goes haywire.
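
Steinar's buffer-depth point below also makes the arithmetic easy to check. Assuming an initial congestion window of 10 segments (IW10, per RFC 6928), a shallow 64-frame buffer is exceeded by just a handful of concurrent flows:

    IW = 10        # segments in one initial window (assumption: IW10)
    BUFFER = 64    # frames of output buffering (typical, per below)

    # Smallest number of simultaneous senders whose first bursts
    # cannot all fit in the buffer at once.
    senders = BUFFER // IW + 1
    print(f"{senders} concurrent flows overflow a {BUFFER}-frame buffer")

Seven simultaneous flows is nothing on a busy internal network, which is why you can see drops well below aggregate link capacity.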


On Sun, Jun 12, 2016 at 11:24 PM, Steinar H. Gunderson <sgunderson@bigfoot.com> wrote:
On Sun, Jun 12, 2016 at 01:25:17PM -0500, Benjamin Cronce wrote:
> Internal networks rarely have bandwidth issues and congestion only happens
> when you don't have enough bandwidth.

I don't think this is true. You might not have an aggregate bandwidth issue,
but given the burstiness of TCP and the typical switch buffer depth
(64 frames is a typical number), it's very, very easy to lose packets in your
switch even on a relatively quiet network with no downconversion. (Witness
the rise of DCTCP, made especially for internal traffic on this kind of
network.)

/* Steinar */
--
Homepage: https://www.sesse.net/

--
J.