On 2018-03-20 from 11:07 to 11:10 UTC, Icinga started complaining about too many 5xx responses returned for the domain wikimedia.org. It turned out to be related to the v1/pageviews API.
AQS metrics containing the spike:
https://grafana.wikimedia.org/dashboard/db/aqs-elukey?orgId=1&from=1521543209228&to=1521544844913
From an early analysis made by Joseph checking Webrequest data (so from the Varnish viewpoint), it seems that multiple IPs using a python-requests UA made 192125 HTTP 200 requests and 188177 HTTP 404 requests at minute 11:09 (~380k reqs/min, ~6338/s).
From the metrics it seems that Restbase at some point started returning 504s (gateway timeout), failing with the following error:
https://logstash.wikimedia.org/goto/c8f93229a85d7d961df699459347cf6e
Error: connect EADDRNOTAVAIL 10.2.2.12:7232 - Local (10.64.0.223:0)
    at Object.exports._errnoException (util.js:1018:11)
    at exports._exceptionWithHostPort (util.js:1041:20)
    at connect (net.js:880:16)
    at net.js:1009:7
    at /srv/deployment/restbase/deploy-cache/revs/8dbc93c6b3747dbfe90f8f8e56fd6af661cf5e69/node_modules/dnscache/lib/index.js:80:28
    at /srv/deployment/restbase/deploy-cache/revs/8dbc93c6b3747dbfe90f8f8e56fd6af661cf5e69/node_modules/dnscache/lib/cache.js:116:13
    at RawTask.call (/srv/deployment/restbase/deploy-cache/revs/8dbc93c6b3747dbfe90f8f8e56fd6af661cf5e69/node_modules/asap/asap.js:40:19)
    at flush (/srv/deployment/restbase/deploy-cache/revs/8dbc93c6b3747dbfe90f8f8e56fd6af661cf5e69/node_modules/asap/raw.js:50:29)
    at _combinedTickCallback (internal/process/next_tick.js:73:7)
    at process._tickCallback (internal/process/next_tick.js:104:9)
Since I am not seeing a huge spike in latency metrics for AQS, I am wondering whether all these requests saturated Restbase's ephemeral ports for a brief amount of time, causing this Node.js error.
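If that's the case, a quick live check on a Restbase host should show a pile of sockets in TIME-WAIT towards AQS. A minimal sketch, assuming the AQS LVS endpoint is 10.2.2.12:7232 as in the error above:

sudo ss -tan state time-wait dst 10.2.2.12 | wc -l   # sockets in TIME-WAIT towards AQS (minus one header line)
sudo ss -s                                           # overall socket summary, including the timewait total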
On restbase1007:
elukey@restbase1007:~$ cat /proc/sys/net/ipv4/ip_local_port_range
32768	60999
That is ~28k ephemeral ports available. From this graph it seems that TIME-WAIT sockets increased a lot during that timeframe.
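Back-of-envelope math, assuming every AQS request opened a fresh outbound connection and Linux's fixed 60 seconds of TIME-WAIT (TCP_TIMEWAIT_LEN):

echo $((60999 - 32768 + 1))   # 28232 ephemeral ports in the default range
echo $((6338 * 60))           # 380280 ports pinned down if 6338 new conns/s each linger 60s in TIME-WAIT
echo $((28232 / 60))          # ~470 sustained new conns/s is already enough to exhaust the range

So even a fraction of the observed request rate turning into new connections would run the port pool dry.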
From https://vincent.bernat.im/en/blog/2014-tcp-time-wait-state-linux
"If the TIME-WAIT sockets are on the client side, such a situation is easy to detect. The call to connect() will return EADDRNOTAVAIL and the application will log some error message about that."
and
elukey@restbase1007:~$ sudo sysctl net.ipv4.tcp_tw_reuse
net.ipv4.tcp_tw_reuse = 0
"By enabling net.ipv4.tcp_tw_reuse, Linux will reuse an existing connection in the TIME-WAIT state for a new outgoing connection if the new timestamp is strictly bigger than the most recent timestamp recorded for the previous connection: an outgoing connection in the TIME-WAIT state can be reused after just one second."
We are using this option on cache and swift hosts IIRC. So, if everything that I've written vaguely makes sense, I think that the issue was on the Restbase side, even if AQS needs some tweaks to answer 404s more quickly, etc.
Would it make sense to increase the ephemeral ports available on Restbase and/or enable tcp_tw_reuse?
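If so, a minimal sketch of what that could look like (values are purely illustrative and untested on Restbase; in practice this would go through puppet rather than being set by hand):

sudo sysctl -w net.ipv4.ip_local_port_range="15000 60999"   # illustrative: widen the ephemeral range beyond the default ~28k ports
sudo sysctl -w net.ipv4.tcp_tw_reuse=1                      # reuse TIME-WAIT sockets for new outgoing connections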
Awesome explanations for those settings on caching hosts in https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/cacheproxy/manifests/performance.pp