Over the past month I had been experiencing issues seemingly related to my ISP. Short periods of increased latency (lag spikes) were starting to have an effect on my sanity. After digging into several tools (a few that I knew about, a few that I didn’t before this endeavor), I was able to determine the cause of my issues (buffer bloat) and configure my router with Smart Queue Management (SQM) to resolve them.
Initial discovery of issues – I have ping alerts that send a Pushover notification to my phone when an external endpoint is unreachable. In this case I’d been pinging CloudFlare (126.96.36.199), and I started getting notifications that pings were dropping. Normally this would tell me that my internet is down, but in this case it was telling me something different. I tried to use pathping to determine why pings were dropping, but I couldn’t get it to see any hops outside my network. I used SS64 to find the Linux equivalent of pathping – MTR (Matt’s traceroute). Here we can see what appears to be packet loss at my ISP’s routing partner. Although this particular issue isn’t resolved by SQM, it led me down the path I am on now…
I reached out to Equinix support, and they told me that my ISP likely has an oversubscribed port. This means that the connection is basically maxed out, and the router drops ICMP packets as “bottom of the barrel” traffic. This StackExchange answer recommends using tcping, which is a wonderful network troubleshooting tool along with tcproute (this tool can help determine which firewall a port needs to be opened on, for example). They also recommended WinMTR, but I found that it’s not that great because it counts “no ICMP reply” hops as packet loss, and it doesn’t have alternate display modes. So basically I’ve found that it’s good to have a mix of both Windows and Linux network troubleshooting tools at one’s disposal. Using tcping to probe DNS port 53, I was able to see that none of my actual data packets to CloudFlare were dropping. I decided to leave it at this, and ping monitor quad9 (188.8.131.52) instead until my ISP upgrades their oversubscribed link.
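To illustrate why a TCP probe can succeed while ICMP pings are being dropped, here is a minimal Python sketch of the idea behind tcping: time a full TCP handshake to a port instead of sending an ICMP echo. This is not how tcping itself is implemented; the function names `tcp_probe` and `probe_many` and their defaults are my own, chosen for illustration.

```python
import socket
import time


def tcp_probe(host, port, timeout=2.0):
    """Return the TCP connect time in milliseconds, or None on failure."""
    start = time.perf_counter()
    try:
        # create_connection performs the full TCP handshake, so this
        # measures real data-path reachability rather than ICMP handling.
        with socket.create_connection((host, port), timeout=timeout):
            return (time.perf_counter() - start) * 1000.0
    except OSError:
        # Covers timeouts, refused connections, and unreachable hosts.
        return None


def probe_many(host, port, count=10, interval=1.0, timeout=2.0):
    """Probe repeatedly and return (loss_fraction, successful_times_ms)."""
    times = []
    for i in range(count):
        rtt = tcp_probe(host, port, timeout)
        if rtt is not None:
            times.append(rtt)
        if i < count - 1:
            time.sleep(interval)
    loss = (count - len(times)) / count
    return loss, times
```

Calling `probe_many(resolver_ip, 53)` against a DNS resolver that accepts TCP on port 53 gives a loss fraction and a list of connect times that can be compared with what ping reports for the same host.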
But now that I’ve seen what can’t be unseen, of course other issues started cropping up. I was starting to see latency and packet loss during periods of high utilization. This would manifest as lag spikes either when I was working remotely, or listening to streaming audio. Using MTR’s alternate display mode (press the D key), I was able to see some pretty wild statistics across all the hops along the route. With this display mode you can see from the scale that this was a period of extreme latency: ? is packet loss and > is latency greater than the labeled scale. I had my ISP come out and take a look; they replaced the modem, and the signal looked fine.
Seeing that even the very first hop, to my router, is affected by extreme latency, I took some steps to rule it out. Pinging my router under heavy load showed latency that would seem to indicate the router is the issue. I looked up “EdgerouterX lag spikes” and found several posts that described the issue as “buffer bloat” and recommended configuring Smart Queue Management. I found the WaveForm buffer bloat test, and ran it on my network. It’s important to run this test directly connected to your router (with an ethernet cable) and with no other heavy load on the internet connection. Bypassing my router and running the test directly from my laptop connected to the modem, I could see that the issue is not caused by my router. There are times when network congestion further down the route appears to back up, and latency is much worse than this.
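The comparison a buffer bloat test makes can be sketched in a few lines of Python: collect round-trip times while the link is idle, collect them again while it is saturated, and look at how much latency the load adds. This is a simplified illustration and not the WaveForm test's actual method; the function name `bufferbloat_ms` and the use of the median are my own assumptions.

```python
from statistics import median


def bufferbloat_ms(idle_rtts_ms, loaded_rtts_ms):
    """Median latency added under load, in milliseconds."""
    if not idle_rtts_ms or not loaded_rtts_ms:
        raise ValueError("need at least one sample in each list")
    # The median keeps a single outlier ping from skewing the result.
    return median(loaded_rtts_ms) - median(idle_rtts_ms)
```

A result near zero means the queues are staying short under load; a result in the hundreds of milliseconds is the kind of lag spike described above.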
Having done some research on SQM, I found that it has an impact on total available bandwidth. Knowing this, I contacted my ISP and had them double my bandwidth to 400/20. In order for my EdgerouterX to pull bandwidth upwards of 300 Mbps, I had to configure my router to enable hwnat hardware offloading. Running a buffer bloat test, bandwidth is looking great, but I’m still seeing latency at times of congestion.
After applying SQM and running another test, I see the desired effect – lower latency with the trade-off of lower bandwidth (for a single connection). I am pretty sure, however, that with multiple connections SQM will still be able to saturate and take full advantage of the data connection. Update: SQM disables hardware offloading, so the full speed of my connection can’t be realized on my router while using SQM. I can either downgrade my internet connection or upgrade my router to a model that can offer more throughput with SQM enabled.
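The bandwidth trade-off comes from how SQM is usually set up: the shaper is given rates a bit below the line rate so that the queue builds in the router, where it can be managed, rather than in the modem or further upstream. A minimal Python sketch of that arithmetic follows. The 90% figure is an assumption, a starting point to tune by re-running the buffer bloat test, and `sqm_rates` is a hypothetical helper, not a router command.

```python
def sqm_rates(down_mbps, up_mbps, percent=90):
    """Return (download, upload) shaper rates in Mbps.

    percent is the share of the measured line rate handed to the
    shaper; 90 is an assumed starting point, not a fixed rule.
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    # Integer math avoids floating-point noise in the printed rates.
    return down_mbps * percent // 100, up_mbps * percent // 100


# Example with the 400/20 plan mentioned above.
down, up = sqm_rates(400, 20)
print(f"download {down} Mbps, upload {up} Mbps")
```

Keep in mind that the router also has to be able to shape at the chosen rate in software, which is the limit described in the update above.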
The takeaway: dropped pings don’t mean dropped data packets. If you have tools such as MTR or tcping at your disposal, they can help demystify network issues. If you’re seeing latency caused by buffer bloat, you need a router that supports Smart Queue Management. The ISP cannot be expected to have this feature enabled on their routers as it would be a bottleneck on the network. If you’re having trouble understanding buffer bloat, check out the FAQ on the buffer bloat test page.