I have discovered the source of the confusion. I think this is different from the PUMA6 problems, though similar.
Certain types of traffic are getting de-prioritized, especially small packets under load.
Comparison:
Normal ping:
Minimum = 17ms, Maximum = 734ms, Average = 202ms
Large 1472-byte ping:
Minimum = 18ms, Maximum = 229ms, Average = 47ms
TCP-based ping via psping.exe:
Minimum = 14.15ms, Maximum = 40.39ms, Average = 25.88ms
For more fun:
ICMP in a VPN, ping set to 13 bytes:
Minimum = 36ms, Maximum = 602ms, Average = 208ms
ICMP in a VPN, 1300 bytes:
Minimum = 24ms, Maximum = 72ms, Average = 40ms
This explains why certain synthetic bufferbloat benchmarks were passing with flying colors while my basic ICMP ping was terrible under load. Running the pings through a VPN does not help, so it's not packet inspection. Tiny packets suffer, while larger, normal-sized ones do fine.
The problem only occurs during download stress, not upload stress. Anyway, I guess this is mostly an FYI.
Perhaps a script that does variable packet sizes for the generated small flow traffic could be useful to someone in the future.
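In case it helps, here is a rough sketch of what such a script could look like. It just wraps the system ping command and repeats the test at several payload sizes, so it only covers the ICMP side, not the TCP psping test. The function names, the list of sizes, and the default count are all my own picks for illustration. It assumes English-language ping output containing "time=" (localized Windows builds may print something else), and it needs to be run while a download is saturating the link to show anything interesting.

```python
import platform
import re
import statistics
import subprocess
import sys

# Matches per-reply round-trip times such as "time=17ms", "time<1ms" (Windows)
# or "time=17.3 ms" (Linux/macOS). Assumes English ping output.
RTT_RE = re.compile(r"time[=<]\s*([\d.]+)\s*ms")


def build_ping_command(host, size, count, system=None):
    """Return the ping argv for the given ICMP payload size in bytes."""
    system = system or platform.system()
    if system == "Windows":
        # Windows ping: -n is the echo count, -l is the payload size.
        return ["ping", "-n", str(count), "-l", str(size), host]
    # Linux/macOS ping: -c is the echo count, -s is the payload size.
    return ["ping", "-c", str(count), "-s", str(size), host]


def parse_rtts(output):
    """Extract every per-reply RTT (in ms) from raw ping output."""
    return [float(value) for value in RTT_RE.findall(output)]


def summarize(rtts):
    """Return min/max/avg in ms, or None if no replies were parsed."""
    if not rtts:
        return None
    return {"min": min(rtts), "max": max(rtts), "avg": statistics.mean(rtts)}


def main(host, sizes=(13, 32, 256, 1300, 1472), count=20):
    # The sizes are arbitrary samples from tiny up to a full 1500-byte MTU frame.
    for size in sizes:
        proc = subprocess.run(
            build_ping_command(host, size, count),
            capture_output=True,
            text=True,
        )
        stats = summarize(parse_rtts(proc.stdout))
        if stats is None:
            print(f"{size:>5} bytes: no replies parsed")
        else:
            print(
                f"{size:>5} bytes: Minimum = {stats['min']:.0f}ms, "
                f"Maximum = {stats['max']:.0f}ms, Average = {stats['avg']:.0f}ms"
            )


if __name__ == "__main__" and len(sys.argv) == 2:
    main(sys.argv[1])
```

Save it under any name and pass the target as the only argument, for example the same host used for the pings above. If small sizes show much worse maximum and average times than the 1300 and 1472 byte runs while a download is going, that would match what I am seeing.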