Hi Bob,
Measuring and monitoring WiFi behavior alone is neither necessary nor
sufficient. The same goes for Starlink or whatever else comes along in
the future.
The architecture of the Internet places mechanisms that in past times
were contained in the switching equipment at many different places
along a data path. Much of the mechanism is now in the users' devices
themselves, which make all sorts of decisions about datagram size,
acknowledgements, retransmission, discarding duplicates, et al. Those
mechanisms interact with the decisions being made in network equipment
such as switches. The resulting overall behavior dictates what end
users see as the behavior and reliability of "the net" as they
experience it; the performance of the whole system is shaped by the
interaction of its many pieces.
My point was that to manage network service ("network" being defined
by the users), you have to monitor and measure performance as seen
by the users, as close to the keyboard/mouse/screen/whatever as you
can get. That's why we decided to require a computer of some kind
on each user's LAN, so we could experience and measure what they
were likely experiencing, and use our measurements of switches,
circuits, etc. to analyze and fix problems. It was also helpful to
have a database of the metrics captured during previous "normal"
network activity, to use as comparisons.
As one example, I remember one event when a momentary glitch on a
transpacific circuit would cause a flurry of activity as TCPs in the
users' computers compensated, and would settle back to a steady
state after a few minutes. But users complained that their file
transfers were now taking much longer than usual. After our poking
and prodding, using those remote computers as tools to see what the
users were experiencing, we discovered that everything was operating
as expected, except that every datagram was being transmitted twice
and the duplicates discarded at the destination. The TCP
retransmission mechanisms had settled into a new stable state.
To the network switches, the datagrams all seemed OK, but there was
significantly more traffic than usual. No one was monitoring all
those user devices out on the LANs so no one except the users
noticed anything wrong. Eventually another glitch on the circuit
would cause another flurry of activity and perhaps settle back into
the desired state where datagrams only got sent once.
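With modern tools, one rough way to spot that "everything sent twice"
pathology would be to count repeated TCP sequence numbers per flow in
a packet capture. The sketch below uses scapy and a hypothetical
trace.pcap, and is naive (keepalives and zero-window probes also
repeat sequence numbers), but it conveys the idea:

    from collections import Counter
    from scapy.all import rdpcap, IP, TCP

    seen = Counter()
    for pkt in rdpcap("trace.pcap"):  # hypothetical capture file
        if IP in pkt and TCP in pkt:
            key = (pkt[IP].src, pkt[IP].dst,
                   pkt[TCP].sport, pkt[TCP].dport, pkt[TCP].seq)
            seen[key] += 1

    total = sum(seen.values())
    dupes = sum(n - 1 for n in seen.values() if n > 1)
    # In the steady state described above, this would approach 50%.
    print(f"{dupes} of {total} segments look like retransmissions "
          f"({100 * dupes / max(total, 1):.1f}%)")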
We monitored whatever we could using SNMP to the routers and
computers that had implemented such things, and we used our remote
computers to also collect data from the users' perspective. Often
we could tell a LAN manager that some particular device at his/her
site was having problems, by looking for behavior that differed from
the "normal" historical behavior from a week or so earlier.
It would be interesting, for example, to collect metrics from switches
about "buffer occupancy" and "transit time" (I don't recall whether
any SNMP MIB had such metrics), and correlate them with TCP metrics
such as retransmission behavior and duplicate detection.
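As a toy sketch of that correlation (made-up numbers; I'm not aware of
a standard MIB object for buffer occupancy either):

    import numpy as np

    # Hypothetical time-aligned per-minute samples over the same window.
    buffer_occupancy = np.array([12, 15, 40, 80, 75, 30, 14, 13])  # % full
    tcp_retransmits  = np.array([ 0,  1,  9, 31, 28,  7,  1,  0])  # segs/min

    r = np.corrcoef(buffer_occupancy, tcp_retransmits)[0, 1]
    print(f"Pearson r = {r:.2f}")  # near +1 would support the queueing story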
Jack
On 2/27/24 09:48, rjmcmahon wrote:
Hi Jack,
On LAN probes & monitors: I've been told that 90% of users' devices
are now wirelessly connected, so the concept of connecting to a
common wave guide to measure or observe user information & flow state
isn't viable. A WiFi AP could provide its end state, but wireless
channels' states are non-trivial, and APs prioritize packet forwarding
at L2 over state collection. I suspect a fully capable AP that could
record per-quintuple and RF channels' states would be too expensive.
This is part of the reason why our industry and policy makers need to
define the key performance metrics well.
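To make "per quintuple" concrete, the state an AP would have to keep
is roughly a table like the sketch below, updated at line rate for
every forwarded frame, and that's before any RF channel state
(illustrative only):

    from dataclasses import dataclass

    # (src addr, dst addr, src port, dst port, protocol)
    FiveTuple = tuple[str, str, int, int, str]

    @dataclass
    class FlowState:
        packets: int = 0
        bytes: int = 0

    flows: dict[FiveTuple, FlowState] = {}

    def account(src, dst, sport, dport, proto, length):
        st = flows.setdefault((src, dst, sport, dport, proto), FlowState())
        st.packets += 1
        st.bytes += length

    account("10.0.0.2", "192.0.2.1", 50000, 443, "tcp", 1500)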
Bob
Yes, latency is complicated.... Back when I was involved in the early
Internet (early 1980s), we knew that latency was an issue requiring
much further research, but we figured that meanwhile problems could be
avoided by keeping traffic loads well below capacity while the
appropriate algorithms could be discovered by the engineers (I was
one...). Forty years later, it seems like it's still a research topic.
Years later in the 90s I was involved in operating an international
corporate intranet. We quickly learned that keeping the human users
happy required looking at more than the routers and circuits between
them. With much of the "reliability mechanisms" of TCP et al now
located in the users' computers rather than the network switches,
evaluating users' experience with "the net" required measurements from
the users' perspective.
To do that, we created a policy whereby every LAN attached to the
long-haul backbone had to have a computer on that LAN to which we had
remote access. That enabled us to perform "ping" tests and also
collect data about TCP behavior (duplicates, retransmissions, etc.)
using SNMP, etherwatch, et al. It was not unusual for the users' data
to indicate that "the net", as they saw it, was misbehaving while the
network data, as seen by the operators, indicated that all the routers
and circuits were working just fine.
If the government regulators want to keep the users happy, IMHO they
need to understand this.
Jack Haverty
On 2/26/24 16:25, rjmcmahon wrote:
On top of all that, the latency responses tend to be non-parametric
and may need full PDFs/CDFs along with non-parametric statistical
process controls. Attached is an example from many years ago of a
firmware bug that sometimes delayed packet processing, creating a
second mode in the PDF.
Engineers and their algorithms can be this way, it seems.
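A synthetic illustration of why the full distribution matters: a small
second mode barely moves the mean but shows up clearly in the tail
percentiles and the histogram (made-up numbers):

    import numpy as np

    rng = np.random.default_rng(1)
    typical = rng.normal(2.0, 0.2, 9500)    # ms: the normal path
    delayed = rng.normal(12.0, 0.5, 500)    # ms: occasionally-delayed packets
    latency = np.concatenate([typical, delayed])

    print(f"mean = {latency.mean():.2f} ms")      # looks innocuous
    for p in (50, 90, 99, 99.9):
        print(f"p{p} = {np.percentile(latency, p):.2f} ms")

    hist, edges = np.histogram(latency, bins=50)  # empirical PDF: two peaks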
Bob
I didn't study the whole report, but I didn't notice any metrics
associated with *variance* of latency or bandwidth. It's common for
vendors to play games ("Lies, damn lies, and statistics!") to make
their metrics look good. A metric of latency that says something like
"99% less than N milliseconds" doesn't necessarily translate into
acceptable user performance; at 60 game updates per second, for
example, the remaining 1% still permits dozens of late updates every
minute.
It's also important to look at the specific techniques used for taking
measurements. For example, if a measurement is performed every fifteen
minutes, extrapolating the metric as representative of all the time
between measurements can also lead to a metric judgement which doesn't
reflect the reality of what the user actually experiences.
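A toy numerical illustration of that sampling problem (synthetic
data): ten one-minute stalls in a day disappear completely when
probing every fifteen minutes:

    import numpy as np

    rng = np.random.default_rng(7)
    per_minute = rng.normal(20, 2, 24 * 60)   # a day of "true" latency, ms
    stall_minutes = np.arange(10) * 144 + 7   # ten stalls, none on a probe minute
    per_minute[stall_minutes] = 5000          # each a 5-second latency spike

    sampled = per_minute[::15]                # one probe every fifteen minutes
    print(f"true worst minute:  {per_minute.max():.0f} ms")   # 5000
    print(f"probed worst value: {sampled.max():.0f} ms")      # ~25: stalls unseen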
In addition, there's a lot of mechanism between the ISPs' handling of
datagrams and the end-user. The users' experience is affected by how
all of that mechanism interacts as underlying network behavior
changes. When a TCP running in some host decides it needs to
retransmit, or an interactive audio/video session discards datagrams
because they arrive too late to be useful, the user sees unacceptable
performance even though the network operators may think everything is
running fine. Measurements from the end-users' perspective might
indicate performance is quite different from what measurements at the
ISP level suggest.
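The receiver-side half of that is easy to sketch: an interactive
client only keeps datagrams that make a playout deadline, so a late
packet is as bad as a lost one (the budget below is an assumed figure,
not from any spec):

    PLAYOUT_BUDGET_MS = 150  # assumed end-to-end budget for interactive media

    def usable(sent_ms, received_ms):
        """A datagram helps the user only if it beats the playout deadline."""
        return (received_ms - sent_ms) <= PLAYOUT_BUDGET_MS

    # (sent, received) timestamps in ms; the third packet hit a queue somewhere.
    arrivals = [(0, 40), (20, 70), (40, 400), (60, 110)]
    late = [a for a in arrivals if not usable(*a)]
    print(f"{len(late)} of {len(arrivals)} datagrams discarded as too late")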
Gamers are especially sensitive to variance, but it will also apply to
interactive uses such as might occur in telemedicine or remote
operations. A few years ago I helped a friend do some tests for a
gaming situation and we discovered that the average latency was
reasonably low, but occasionally, perhaps a few times per hour,
latency would increase to 10s of seconds.
In a game, that often means the player loses. In a remote surgery it
may mean horrendous outcomes. As more functionality is performed "in
the cloud" such situations will become increasingly common.
Jack Haverty
On 2/26/24 12:02, rjmcmahon via Nnagain wrote:
Thanks for sharing this. I'm trying to find out which key metrics will
be used for this monitoring. I want to make sure iperf 2 can cover the
technical, traffic-related ones that make sense to a skilled network
operator, including a WiFi BSS manager. I didn't read all 327 pages;
from what I did read, though, I didn't see anything obvious. I assume
these types of KPIs may be in reference docs or something.
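For reference, a recent iperf 2 can already report some latency-related
stats from the client side; roughly the sketch below (flags as in the
2.1.x releases; --trip-times needs synchronized clocks, and your
version's man page is authoritative):

    import subprocess

    def iperf_latency_run(server, seconds=10):
        """Drive an iperf 2 client run that reports latency-related stats."""
        cmd = ["iperf", "-c", server,
               "-t", str(seconds),   # test duration
               "-i", "1",            # per-second interval reports
               "-e",                 # enhanced output (RTT, retransmits, ...)
               "--trip-times"]       # one-way delay; clocks must be synced
        return subprocess.run(cmd, capture_output=True, text=True).stdout

    print(iperf_latency_run("192.0.2.1"))  # hypothetical test server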
Thanks in advance for any help on this.
Bob
And...
Our bufferbloat.net submittal was cited multiple times! Thank you all
for participating in that process!
https://docs.fcc.gov/public/attachments/DOC-400675A1.pdf
It is a long read, and does still start off on the wrong feet (IMHO),
in particular not understanding the difference between idle and
working latency.
It is my hope that by widening awareness of more of the real problems
with latency under load to policymakers and other submitters
downstream from this new FCC document, and more reading what we had to
say, that we will begin to make serious progress towards finally
fixing bufferbloat in the USA.
I do keep hoping that somewhere along the way in the future, the costs
of IPv4 address exhaustion and the IPv6 transition will also get
raised to the national level. [1]
We are still collecting signatures for what the bufferbloat project
members wrote, and have 1200 bucks in the kitty for further articles
and/or publicity. Thoughts appreciated as to where we can go next with
shifting the national debate about bandwidth in a better direction!
Next up would be trying to get a meeting, and to do an ex-parte
filing, I think, and I wish we could do a live demonstration on
television about it as good as Feynman did here:
https://www.youtube.com/watch?v=raMmRKGkGD4
Our original posting is here:
https://docs.google.com/document/d/19ADByjakzQXCj9Re_pUvrb5Qe5OK-QmhlYRLMBY4vH4/edit
Larry's wonderful post is here:
https://circleid.com/posts/20231211-its-the-latency-fcc
[1] How can we get more talking about IPv4 and IPv6, too? Will we have
to wait another year?
https://hackaday.com/2024/02/14/floss-weekly-episode-769-10-more-internet/
--
https://blog.cerowrt.org/post/2024_predictions/
Dave Täht CSO, LibreQos
_______________________________________________
Nnagain mailing list
Nnagain@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/nnagain