
EventStreams accumulates too much memory on SCB nodes in CODFW
Closed, Resolved · Public

Description

It seems the EventStreams service has started accumulating enormous amounts of memory on SCB nodes in CODFW. Just one example from scb2002:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                                        
 4232 eventst+  20   0 9927.1m  66888  22488 S   2.3  0.2   0:14.11 /usr/bin/nodejs /srv/deployment/eventstreams/deploy-cache/revs/6b013f9a10a7d96b40225ab313b22cb4a19e03d2/node_modules/service-runner/service-r+ 
 4044 eventst+  20   0 9868.6m  65568  20288 S   0.3  0.2   0:05.46 /usr/bin/nodejs /srv/deployment/eventstreams/deploy-cache/revs/6b013f9a10a7d96b40225ab313b22cb4a19e03d2/node_modules/service-runner/service-r+ 
25405 eventst+  20   0 3106056  27476   7228 S   0.3  0.1   9:07.04 /usr/bin/nodejs src/server.js -c /etc/eventstreams/config.yaml                                                                                 
 3545 eventst+  20   0 10.285g 694560  22404 S   0.0  2.1   1:28.05 /usr/bin/nodejs /srv/deployment/eventstreams/deploy-cache/revs/6b013f9a10a7d96b40225ab313b22cb4a19e03d2/node_modules/service-runner/service-r+ 
 3577 eventst+  20   0 10.455g 872428  22636 S   0.0  2.7   3:42.07 /usr/bin/nodejs /srv/deployment/eventstreams/deploy-cache/revs/6b013f9a10a7d96b40225ab313b22cb4a19e03d2/node_modules/service-runner/service-r+ 
 3597 eventst+  20   0 10.276g 685192  22628 S   0.0  2.1   1:34.94 /usr/bin/nodejs /srv/deployment/eventstreams/deploy-cache/revs/6b013f9a10a7d96b40225ab313b22cb4a19e03d2/node_modules/service-runner/service-r+ 
 3624 eventst+  20   0 9866.7m  64108  20132 S   0.0  0.2   0:05.51 /usr/bin/nodejs /srv/deployment/eventstreams/deploy-cache/revs/6b013f9a10a7d96b40225ab313b22cb4a19e03d2/node_modules/service-runner/service-r+ 
 3637 eventst+  20   0 9918.4m  66332  22692 S   0.0  0.2   0:18.12 /usr/bin/nodejs /srv/deployment/eventstreams/deploy-cache/revs/6b013f9a10a7d96b40225ab313b22cb4a19e03d2/node_modules/service-runner/service-r+ 
 3798 eventst+  20   0 9925.2m  75112  22628 S   0.0  0.2   0:36.99 /usr/bin/nodejs /srv/deployment/eventstreams/deploy-cache/revs/6b013f9a10a7d96b40225ab313b22cb4a19e03d2/node_modules/service-runner/service-r+ 
 3908 eventst+  20   0 10.301g 712684  22764 S   0.0  2.2   4:52.33 /usr/bin/nodejs /srv/deployment/eventstreams/deploy-cache/revs/6b013f9a10a7d96b40225ab313b22cb4a19e03d2/node_modules/service-runner/service-r+ 
 3953 eventst+  20   0 9922.2m  69192  22696 S   0.0  0.2   0:20.33 /usr/bin/nodejs /srv/deployment/eventstreams/deploy-cache/revs/6b013f9a10a7d96b40225ab313b22cb4a19e03d2/node_modules/service-runner/service-r+ 
 3985 eventst+  20   0 9922.1m  70896  22396 S   0.0  0.2   0:23.61 /usr/bin/nodejs /srv/deployment/eventstreams/deploy-cache/revs/6b013f9a10a7d96b40225ab313b22cb4a19e03d2/node_modules/service-runner/service-r+ 
 4058 eventst+  20   0 11.166g 1.543g  22592 S   0.0  4.9   3:47.59 /usr/bin/nodejs /srv/deployment/eventstreams/deploy-cache/revs/6b013f9a10a7d96b40225ab313b22cb4a19e03d2/node_modules/service-runner/service-r+ 
 4177 eventst+  20   0 10.291g 700132  22552 S   0.0  2.1   1:34.04 /usr/bin/nodejs /srv/deployment/eventstreams/deploy-cache/revs/6b013f9a10a7d96b40225ab313b22cb4a19e03d2/node_modules/service-runner/service-r+ 
 4186 eventst+  20   0 9919.1m  65492  22756 S   0.0  0.2   0:06.58 /usr/bin/nodejs /srv/deployment/eventstreams/deploy-cache/revs/6b013f9a10a7d96b40225ab313b22cb4a19e03d2/node_modules/service-runner/service-r+ 
 4209 eventst+  20   0 10.901g 1.279g  22568 S   0.0  4.1   5:35.37 /usr/bin/nodejs /srv/deployment/eventstreams/deploy-cache/revs/6b013f9a10a7d96b40225ab313b22cb4a19e03d2/node_modules/service-runner/service-r+ 
 4225 eventst+  20   0 10.658g 1.035g  22560 S   0.0  3.3   3:46.86 /usr/bin/nodejs /srv/deployment/eventstreams/deploy-cache/revs/6b013f9a10a7d96b40225ab313b22cb4a19e03d2/node_modules/service-runner/service-r+ 
25402 eventst+  20   0   18272    240      4 S   0.0  0.0   0:00.00 /usr/bin/firejail --blacklist=/root --blacklist=/home --caps --seccomp /usr/bin/nodejs src/server.js -c /etc/eventstreams/config.yaml          
25403 eventst+  20   0   18272    268      0 S   0.0  0.0   0:00.00 /usr/bin/firejail --blacklist=/root --blacklist=/home --caps --seccomp /usr/bin/nodejs src/server.js -c /etc/eventstreams/config.yaml

Note the processes with 2+% mem (some reaching 5%)! Today alone, this has caused all of the services on scb200[236] to crash. Mysteriously, the same does not occur in EQIAD.

Event Timeline

mobrovac triaged this task as Unbreak Now! priority. Jul 17 2018, 4:07 PM
mobrovac created this task.

I inspected the logs of the worker with the highest amount of memory on scb2002 and it seems that all of the connections have been closed, yet the worker is holding 5% of the node's memory. This suggests a serious memory leak in the service. For posterity, here are the logs I inspected:

{"name":"eventstreams","hostname":"scb2002","pid":404,"level":30,"levelPath":"info","msg":"Worker 404 listening on 0.0.0.0:8092","time":"2018-07-17T18:37:44.734Z","v":0}
{"name":"eventstreams","hostname":"scb2002","pid":404,"id":"bc192713-8a03-11e8-8936-f8d288eb8562","level":30,"msg":"Creating new KafkaSSE instance bc192713-8a03-11e8-8936-f8d288eb8562.","time":"2018-07-17T20:55:28.013Z","v":0}
{"name":"eventstreams","hostname":"scb2002","pid":404,"id":"bc192713-8a03-11e8-8936-f8d288eb8562","level":30,"assignments":[{"topic":"eqiad.mediawiki.recentchange","partition":0,"offset":-1},{"topic":"codfw.mediawiki.recentchange","partition":0,"offset":-1}],"msg":"Subscribing to (some) topic partitions in topics eqiad.mediawiki.recentchange,codfw.mediawiki.recentchange.","time":"2018-07-17T20:55:28.261Z","v":0}
{"name":"eventstreams","hostname":"scb2002","pid":404,"id":"bc192713-8a03-11e8-8936-f8d288eb8562","level":30,"msg":"Initializing sseClient and starting consume loop.","time":"2018-07-17T20:55:28.264Z","v":0}
{"name":"eventstreams","hostname":"scb2002","pid":404,"id":"bc192713-8a03-11e8-8936-f8d288eb8562","level":30,"msg":"Disconnecting.","time":"2018-07-17T20:55:28.996Z","v":0}
{"name":"eventstreams","hostname":"scb2002","pid":404,"id":"bc192713-8a03-11e8-8936-f8d288eb8562","level":30,"msg":"Closing Kafka Consumer.","time":"2018-07-17T20:55:28.996Z","v":0}
{"name":"eventstreams","hostname":"scb2002","pid":404,"id":"bc192713-8a03-11e8-8936-f8d288eb8562","level":30,"msg":"Closing SSE response.","time":"2018-07-17T20:55:28.996Z","v":0}
{"name":"eventstreams","hostname":"scb2002","pid":404,"id":"bc192713-8a03-11e8-8936-f8d288eb8562","level":30,"msg":"Finished. Returning from consume loop.","time":"2018-07-17T20:55:28.997Z","v":0}
{"name":"eventstreams","hostname":"scb2002","pid":404,"id":"ad6c28c0-8a0b-11e8-9af3-613d111baae9","level":30,"msg":"Creating new KafkaSSE instance ad6c28c0-8a0b-11e8-9af3-613d111baae9.","time":"2018-07-17T21:52:19.362Z","v":0}
{"name":"eventstreams","hostname":"scb2002","pid":404,"id":"ad6c28c0-8a0b-11e8-9af3-613d111baae9","level":30,"assignments":[{"topic":"eqiad.mediawiki.recentchange","partition":0,"offset":1003832049},{"topic":"codfw.mediawiki.recentchange","partition":0,"offset":28164410}],"msg":"Subscribing to (some) topic partitions in topics eqiad.mediawiki.recentchange,codfw.mediawiki.recentchange.","time":"2018-07-17T21:52:19.571Z","v":0}
{"name":"eventstreams","hostname":"scb2002","pid":404,"id":"ad6c28c0-8a0b-11e8-9af3-613d111baae9","level":30,"msg":"Initializing sseClient and starting consume loop.","time":"2018-07-17T21:52:19.572Z","v":0}
{"name":"eventstreams","hostname":"scb2002","pid":404,"id":"ad6c28c0-8a0b-11e8-9af3-613d111baae9","level":30,"msg":"Disconnecting.","time":"2018-07-17T21:53:30.287Z","v":0}
{"name":"eventstreams","hostname":"scb2002","pid":404,"id":"ad6c28c0-8a0b-11e8-9af3-613d111baae9","level":30,"msg":"Closing Kafka Consumer.","time":"2018-07-17T21:53:30.289Z","v":0}
{"name":"eventstreams","hostname":"scb2002","pid":404,"id":"ad6c28c0-8a0b-11e8-9af3-613d111baae9","level":30,"msg":"Closing SSE response.","time":"2018-07-17T21:53:30.290Z","v":0}
{"name":"eventstreams","hostname":"scb2002","pid":404,"id":"ad6c28c0-8a0b-11e8-9af3-613d111baae9","level":30,"msg":"Finished. Returning from consume loop.","time":"2018-07-17T21:53:30.290Z","v":0}
{"name":"eventstreams","hostname":"scb2002","pid":404,"id":"2853ef53-8a0c-11e8-af72-2f8ae6df933b","level":30,"msg":"Creating new KafkaSSE instance 2853ef53-8a0c-11e8-af72-2f8ae6df933b.","time":"2018-07-17T21:55:45.568Z","v":0}
{"name":"eventstreams","hostname":"scb2002","pid":404,"id":"2853ef53-8a0c-11e8-af72-2f8ae6df933b","level":30,"assignments":[{"topic":"eqiad.mediawiki.recentchange","partition":0,"offset":-1},{"topic":"codfw.mediawiki.recentchange","partition":0,"offset":-1}],"msg":"Subscribing to (some) topic partitions in topics eqiad.mediawiki.recentchange,codfw.mediawiki.recentchange.","time":"2018-07-17T21:55:45.803Z","v":0}
{"name":"eventstreams","hostname":"scb2002","pid":404,"id":"2853ef53-8a0c-11e8-af72-2f8ae6df933b","level":30,"msg":"Initializing sseClient and starting consume loop.","time":"2018-07-17T21:55:45.805Z","v":0}
{"name":"eventstreams","hostname":"scb2002","pid":404,"id":"2853ef53-8a0c-11e8-af72-2f8ae6df933b","level":30,"msg":"Disconnecting.","time":"2018-07-17T21:55:46.146Z","v":0}
{"name":"eventstreams","hostname":"scb2002","pid":404,"id":"2853ef53-8a0c-11e8-af72-2f8ae6df933b","level":30,"msg":"Closing Kafka Consumer.","time":"2018-07-17T21:55:46.147Z","v":0}
{"name":"eventstreams","hostname":"scb2002","pid":404,"id":"2853ef53-8a0c-11e8-af72-2f8ae6df933b","level":30,"msg":"Closing SSE response.","time":"2018-07-17T21:55:46.147Z","v":0}
{"name":"eventstreams","hostname":"scb2002","pid":404,"id":"2853ef53-8a0c-11e8-af72-2f8ae6df933b","level":30,"msg":"Finished. Returning from consume loop.","time":"2018-07-17T21:55:46.191Z","v":0}
{"name":"eventstreams","hostname":"scb2002","pid":404,"id":"4abcd574-8a0c-11e8-b961-db491ffe5eed","level":30,"msg":"Creating new KafkaSSE instance 4abcd574-8a0c-11e8-b961-db491ffe5eed.","time":"2018-07-17T21:56:43.294Z","v":0}
{"name":"eventstreams","hostname":"scb2002","pid":404,"id":"4abcd574-8a0c-11e8-b961-db491ffe5eed","level":30,"assignments":[{"topic":"eqiad.mediawiki.recentchange","partition":0,"offset":-1},{"topic":"codfw.mediawiki.recentchange","partition":0,"offset":-1}],"msg":"Subscribing to (some) topic partitions in topics eqiad.mediawiki.recentchange,codfw.mediawiki.recentchange.","time":"2018-07-17T21:56:43.481Z","v":0}
{"name":"eventstreams","hostname":"scb2002","pid":404,"id":"4abcd574-8a0c-11e8-b961-db491ffe5eed","level":30,"msg":"Initializing sseClient and starting consume loop.","time":"2018-07-17T21:56:43.481Z","v":0}
{"name":"eventstreams","hostname":"scb2002","pid":404,"id":"4abcd574-8a0c-11e8-b961-db491ffe5eed","level":30,"msg":"Disconnecting.","time":"2018-07-17T22:13:13.681Z","v":0}
{"name":"eventstreams","hostname":"scb2002","pid":404,"id":"4abcd574-8a0c-11e8-b961-db491ffe5eed","level":30,"msg":"Closing Kafka Consumer.","time":"2018-07-17T22:13:13.682Z","v":0}
{"name":"eventstreams","hostname":"scb2002","pid":404,"id":"4abcd574-8a0c-11e8-b961-db491ffe5eed","level":30,"msg":"Closing SSE response.","time":"2018-07-17T22:13:13.683Z","v":0}
{"name":"eventstreams","hostname":"scb2002","pid":404,"id":"4abcd574-8a0c-11e8-b961-db491ffe5eed","level":30,"msg":"Finished. Returning from consume loop.","time":"2018-07-17T22:13:13.951Z","v":0}
{"name":"eventstreams","hostname":"scb2002","pid":404,"id":"02208458-8a1d-11e8-963e-494db3f9bc6b","level":30,"msg":"Creating new KafkaSSE instance 02208458-8a1d-11e8-963e-494db3f9bc6b.","time":"2018-07-17T23:56:22.918Z","v":0}
{"name":"eventstreams","hostname":"scb2002","pid":404,"id":"02208458-8a1d-11e8-963e-494db3f9bc6b","level":30,"assignments":[{"topic":"eqiad.mediawiki.recentchange","partition":0,"offset":-1},{"topic":"codfw.mediawiki.recentchange","partition":0,"offset":-1}],"msg":"Subscribing to (some) topic partitions in topics eqiad.mediawiki.recentchange,codfw.mediawiki.recentchange.","time":"2018-07-17T23:56:23.145Z","v":0}
{"name":"eventstreams","hostname":"scb2002","pid":404,"id":"02208458-8a1d-11e8-963e-494db3f9bc6b","level":30,"msg":"Initializing sseClient and starting consume loop.","time":"2018-07-17T23:56:23.146Z","v":0}
{"name":"eventstreams","hostname":"scb2002","pid":404,"id":"02208458-8a1d-11e8-963e-494db3f9bc6b","level":30,"msg":"Disconnecting.","time":"2018-07-17T23:56:24.054Z","v":0}
{"name":"eventstreams","hostname":"scb2002","pid":404,"id":"02208458-8a1d-11e8-963e-494db3f9bc6b","level":30,"msg":"Closing Kafka Consumer.","time":"2018-07-17T23:56:24.054Z","v":0}
{"name":"eventstreams","hostname":"scb2002","pid":404,"id":"02208458-8a1d-11e8-963e-494db3f9bc6b","level":30,"msg":"Closing SSE response.","time":"2018-07-17T23:56:24.054Z","v":0}
{"name":"eventstreams","hostname":"scb2002","pid":404,"id":"02208458-8a1d-11e8-963e-494db3f9bc6b","level":30,"msg":"Finished. Returning from consume loop.","time":"2018-07-17T23:56:24.055Z","v":0}
{"name":"eventstreams","hostname":"scb2002","pid":404,"id":"b1f430c4-8a35-11e8-b8cc-d5bfb1eb9143","level":30,"msg":"Creating new KafkaSSE instance b1f430c4-8a35-11e8-b8cc-d5bfb1eb9143.","time":"2018-07-18T02:53:05.827Z","v":0}
{"name":"eventstreams","hostname":"scb2002","pid":404,"id":"b1f430c4-8a35-11e8-b8cc-d5bfb1eb9143","level":30,"assignments":[{"topic":"eqiad.mediawiki.recentchange","partition":0,"offset":-1},{"topic":"codfw.mediawiki.recentchange","partition":0,"offset":-1}],"msg":"Subscribing to (some) topic partitions in topics eqiad.mediawiki.recentchange,codfw.mediawiki.recentchange.","time":"2018-07-18T02:53:06.039Z","v":0}
{"name":"eventstreams","hostname":"scb2002","pid":404,"id":"b1f430c4-8a35-11e8-b8cc-d5bfb1eb9143","level":30,"msg":"Initializing sseClient and starting consume loop.","time":"2018-07-18T02:53:06.041Z","v":0}
{"name":"eventstreams","hostname":"scb2002","pid":404,"id":"b1f430c4-8a35-11e8-b8cc-d5bfb1eb9143","level":30,"msg":"Disconnecting.","time":"2018-07-18T03:09:36.318Z","v":0}
{"name":"eventstreams","hostname":"scb2002","pid":404,"id":"b1f430c4-8a35-11e8-b8cc-d5bfb1eb9143","level":30,"msg":"Closing Kafka Consumer.","time":"2018-07-18T03:09:36.319Z","v":0}
{"name":"eventstreams","hostname":"scb2002","pid":404,"id":"b1f430c4-8a35-11e8-b8cc-d5bfb1eb9143","level":30,"msg":"Closing SSE response.","time":"2018-07-18T03:09:36.319Z","v":0}
{"name":"eventstreams","hostname":"scb2002","pid":404,"id":"b1f430c4-8a35-11e8-b8cc-d5bfb1eb9143","level":30,"msg":"Finished. Returning from consume loop.","time":"2018-07-18T03:09:36.351Z","v":0}
{"name":"eventstreams","hostname":"scb2002","pid":404,"id":"05d7eb52-8a46-11e8-a22e-7610f3aa84c0","level":30,"msg":"Creating new KafkaSSE instance 05d7eb52-8a46-11e8-a22e-7610f3aa84c0.","time":"2018-07-18T04:49:58.519Z","v":0}
{"name":"eventstreams","hostname":"scb2002","pid":404,"id":"05d7eb52-8a46-11e8-a22e-7610f3aa84c0","level":30,"assignments":[{"topic":"eqiad.mediawiki.recentchange","partition":0,"offset":1003975219},{"topic":"codfw.mediawiki.recentchange","partition":0,"offset":28164410}],"msg":"Subscribing to (some) topic partitions in topics eqiad.mediawiki.recentchange,codfw.mediawiki.recentchange.","time":"2018-07-18T04:49:58.745Z","v":0}
{"name":"eventstreams","hostname":"scb2002","pid":404,"id":"05d7eb52-8a46-11e8-a22e-7610f3aa84c0","level":30,"msg":"Initializing sseClient and starting consume loop.","time":"2018-07-18T04:49:58.746Z","v":0}
{"name":"eventstreams","hostname":"scb2002","pid":404,"id":"05d7eb52-8a46-11e8-a22e-7610f3aa84c0","level":30,"msg":"Disconnecting.","time":"2018-07-18T04:51:07.807Z","v":0}
{"name":"eventstreams","hostname":"scb2002","pid":404,"id":"05d7eb52-8a46-11e8-a22e-7610f3aa84c0","level":30,"msg":"Closing Kafka Consumer.","time":"2018-07-18T04:51:07.811Z","v":0}
{"name":"eventstreams","hostname":"scb2002","pid":404,"id":"05d7eb52-8a46-11e8-a22e-7610f3aa84c0","level":30,"msg":"Closing SSE response.","time":"2018-07-18T04:51:07.811Z","v":0}
{"name":"eventstreams","hostname":"scb2002","pid":404,"id":"05d7eb52-8a46-11e8-a22e-7610f3aa84c0","level":30,"msg":"Finished. Returning from consume loop.","time":"2018-07-18T04:51:07.812Z","v":0}
{"name":"eventstreams","hostname":"scb2002","pid":404,"id":"67c30038-8a4c-11e8-9244-551601801f40","level":30,"msg":"Creating new KafkaSSE instance 67c30038-8a4c-11e8-9244-551601801f40.","time":"2018-07-18T05:35:39.782Z","v":0}
{"name":"eventstreams","hostname":"scb2002","pid":404,"id":"67c30038-8a4c-11e8-9244-551601801f40","level":30,"assignments":[{"topic":"eqiad.mediawiki.recentchange","partition":0,"offset":1003994171},{"topic":"codfw.mediawiki.recentchange","partition":0,"offset":28164410}],"msg":"Subscribing to (some) topic partitions in topics eqiad.mediawiki.recentchange,codfw.mediawiki.recentchange.","time":"2018-07-18T05:35:39.998Z","v":0}
{"name":"eventstreams","hostname":"scb2002","pid":404,"id":"67c30038-8a4c-11e8-9244-551601801f40","level":30,"msg":"Initializing sseClient and starting consume loop.","time":"2018-07-18T05:35:40.000Z","v":0}
{"name":"eventstreams","hostname":"scb2002","pid":404,"id":"67c30038-8a4c-11e8-9244-551601801f40","level":30,"msg":"Disconnecting.","time":"2018-07-18T05:38:31.727Z","v":0}
{"name":"eventstreams","hostname":"scb2002","pid":404,"id":"67c30038-8a4c-11e8-9244-551601801f40","level":30,"msg":"Closing Kafka Consumer.","time":"2018-07-18T05:38:31.729Z","v":0}
{"name":"eventstreams","hostname":"scb2002","pid":404,"id":"67c30038-8a4c-11e8-9244-551601801f40","level":30,"msg":"Closing SSE response.","time":"2018-07-18T05:38:31.730Z","v":0}
{"name":"eventstreams","hostname":"scb2002","pid":404,"id":"67c30038-8a4c-11e8-9244-551601801f40","level":30,"msg":"Finished. Returning from consume loop.","time":"2018-07-18T05:38:31.731Z","v":0}

Change 446556 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[operations/puppet@production] EventStreams: Consume messages from the local Kafka cluster

https://gerrit.wikimedia.org/r/446556

Change 446577 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[mediawiki/services/eventstreams/deploy@master] rdkafka: Set the maximum buffer size to 65Mb

https://gerrit.wikimedia.org/r/446577

Mentioned in SAL (#wikimedia-operations) [2018-07-18T12:24:05Z] <elukey> manually added queued.max.messages.kbytes: 65535 to eventstreams on scb2002 as test for T199813

Change 446577 merged by Mobrovac:
[mediawiki/services/eventstreams/deploy@master] rdkafka: Set the maximum buffer size to 65Mb

https://gerrit.wikimedia.org/r/446577

Mentioned in SAL (#wikimedia-operations) [2018-07-18T14:48:02Z] <mobrovac@deploy1001> Started deploy [eventstreams/deploy@09f0efe]: Set the maximum rdkafka receive buffer size to 64MB - T199813

Mentioned in SAL (#wikimedia-operations) [2018-07-18T14:50:30Z] <mobrovac@deploy1001> Finished deploy [eventstreams/deploy@09f0efe]: Set the maximum rdkafka receive buffer size to 64MB - T199813 (duration: 02m 28s)

mobrovac lowered the priority of this task from Unbreak Now! to High.

Lowering the priority, as capping the receive buffer size at 64MB should do the trick. Let's monitor it for a while now.
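
For context, queued.max.messages.kbytes is a standard librdkafka property that caps how much data the consumer prefetches per partition, and node-rdkafka passes it straight through in the consumer configuration. A minimal sketch of where such a cap sits (broker list, group id and topic are placeholders, not the production eventstreams config):

```js
const Kafka = require('node-rdkafka');

const consumer = new Kafka.KafkaConsumer({
    'metadata.broker.list': 'localhost:9092',   // placeholder broker
    'group.id': 'eventstreams-example',         // placeholder group id
    // Cap librdkafka's per-partition prefetch queue (the scb2002 test used 65535 KB, ~64MB).
    'queued.max.messages.kbytes': 65536
}, {});

consumer.connect();
consumer.on('ready', () => {
    consumer.subscribe(['eqiad.mediawiki.recentchange']);  // placeholder topic
    consumer.consume();
});
consumer.on('data', (message) => console.log(message.value.toString()));
```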

Change 446556 merged by Elukey:
[operations/puppet@production] EventStreams: Consume messages from the local Kafka cluster

https://gerrit.wikimedia.org/r/446556

Mentioned in SAL (#wikimedia-operations) [2018-07-19T08:08:33Z] <elukey> restart eventstreams on scb2* hosts to pick up new Kafka settings (pointing it to main-codfw) - T199813

A little recap and current status. We tried obtaining heap dumps from memory-fat processes in codfw, but these were only 20MB large even when the process was holding 1.5+ GB of memory. We also compared heap dumps with those obtained from eqiad and could not spot any (obvious) differences. Because the heap dumps were so small, we were led to believe that the majority of the memory was being held by librdkafka. One more thing that seems to confirm this is the fact that, in spite of the huge memory footprint, no worker was killed by service-runner because of it. Therefore we [lowered the queued.max.messages.kbytes to 64MB](https://gerrit.wikimedia.org/r/#/c/mediawiki/services/eventstreams/deploy/+/446577/). That did not help. We then hypothesised it might be due to ES in codfw reading a lot of messages cross-DC, which led us to make the service consume from its local DC, to no avail either: processes in codfw hold onto the memory even after clients disconnect from them.
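
The reasoning above can be checked directly from inside a worker: if the V8 heap accounts for only a few tens of MB while RSS is in the GB range, the difference is held in native allocations (e.g. librdkafka buffers), which a heap dump never sees. A quick illustrative snippet:

```js
// Compare the V8 heap with the process's total resident memory. A large gap between
// rss and heapUsed points at native (non-JS) allocations such as librdkafka's queues.
const mb = (n) => Math.round(n / 1024 / 1024);
const { rss, heapTotal, heapUsed, external } = process.memoryUsage();

console.log(`rss:       ${mb(rss)} MB`);       // total resident set size
console.log(`heapUsed:  ${mb(heapUsed)} MB`);  // JS objects -- what a heap dump captures
console.log(`heapTotal: ${mb(heapTotal)} MB`);
console.log(`external:  ${mb(external)} MB`);  // Buffers and native addon memory known to V8
```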

To confirm that it's a software problem, we ran an experiment in eqiad, where we asked ES for recentchange events dating back a week, in this way effectively asking for 10 million events to be delivered. While we saw an increase in the memory footprint of the worker processing the request (from 0.2% of RAM to 0.7%), everything went back to normal as soon as the client disconnected. In other words, we haven't been able to confirm it's a software issue and our attempts at making eqiad workers swallow memory failed.
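
For reference, the experiment boils down to opening an SSE connection with a Last-Event-ID that points at old offsets, using the same assignment format that appears in the KafkaSSE logs above. A rough sketch (the offsets are copied from the log excerpt in this task, not from the actual test):

```js
const https = require('https');

// KafkaSSE accepts its topic/partition/offset assignments as a JSON array in Last-Event-ID.
const lastEventId = JSON.stringify([
    { topic: 'eqiad.mediawiki.recentchange', partition: 0, offset: 1003832049 },
    { topic: 'codfw.mediawiki.recentchange', partition: 0, offset: 28164410 }
]);

const req = https.get({
    host: 'stream.wikimedia.org',
    path: '/v2/stream/recentchange',
    headers: { 'Accept': 'text/event-stream', 'Last-Event-ID': lastEventId }
}, (res) => {
    let bytes = 0;
    res.on('data', (chunk) => { bytes += chunk.length; });
    // Disconnect after a minute and watch whether the worker's memory drops back to normal.
    setTimeout(() => { res.destroy(); console.log(`received ${bytes} bytes`); }, 60000);
});
req.on('error', (err) => console.error(err));
```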

Perhaps it would be worth disabling the active-active behaviour for ES, having all clients talk to the service in eqiad only, and then monitoring the situation to see if the same issue arises there.

Mentioned in SAL (#wikimedia-operations) [2018-07-19T19:04:02Z] <elukey> roll restart eventstreams on scb2* hosts to prevent OOM issues over the EU night - T199813

Change 447059 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[mediawiki/services/eventstreams/deploy@master] Config: Lower librdkafka setting values for fetching

https://gerrit.wikimedia.org/r/447059

Change 447059 merged by Mobrovac:
[mediawiki/services/eventstreams/deploy@master] Config: Lower librdkafka setting values for fetching

https://gerrit.wikimedia.org/r/447059

Mentioned in SAL (#wikimedia-operations) [2018-07-20T11:58:08Z] <mobrovac@deploy1001> Started deploy [eventstreams/deploy@01fac88]: Lower librdkafka settings related to fetching messages - T199813

Mentioned in SAL (#wikimedia-operations) [2018-07-20T12:00:20Z] <mobrovac@deploy1001> Finished deploy [eventstreams/deploy@01fac88]: Lower librdkafka settings related to fetching messages - T199813 (duration: 02m 11s)

Change 447417 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[mediawiki/services/eventstreams@master] Use node-rdkafka v2.3.4

https://gerrit.wikimedia.org/r/447417

Change 447417 merged by Mobrovac:
[mediawiki/services/eventstreams@master] Use node-rdkafka v2.3.4

https://gerrit.wikimedia.org/r/447417

Mentioned in SAL (#wikimedia-operations) [2018-07-23T13:57:31Z] <mobrovac@deploy1001> Started deploy [eventstreams/deploy@5ed03e0]: Update node-rdkafka to v2.3.4 - T199813

Mentioned in SAL (#wikimedia-operations) [2018-07-23T14:00:00Z] <mobrovac@deploy1001> Finished deploy [eventstreams/deploy@5ed03e0]: Update node-rdkafka to v2.3.4 - T199813 (duration: 02m 29s)

Change 447423 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cache: temporarily return 404 for stream.w.o/socket.io

https://gerrit.wikimedia.org/r/447423

Change 447423 merged by Ema:
[operations/puppet@production] cache: temporarily return 404 for stream.w.o/socket.io

https://gerrit.wikimedia.org/r/447423

Mentioned in SAL (#wikimedia-operations) [2018-07-23T22:53:26Z] <mobrovac@deploy1001> Started restart [eventstreams/deploy@5ed03e0]: A goodnight restart - T199813

Change 447658 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[mediawiki/services/eventstreams@master] Update KafkaSSE to bb7e986

https://gerrit.wikimedia.org/r/447658

Change 447658 merged by Mobrovac:
[mediawiki/services/eventstreams@master] Update KafkaSSE to bb7e986

https://gerrit.wikimedia.org/r/447658

Mentioned in SAL (#wikimedia-operations) [2018-07-24T18:32:00Z] <mobrovac@deploy1001> Started deploy [eventstreams/deploy@690fdad]: Wait for the client to consume the message being sent before consuming the next one - T199813

Mentioned in SAL (#wikimedia-operations) [2018-07-24T18:34:18Z] <mobrovac@deploy1001> Finished deploy [eventstreams/deploy@690fdad]: Wait for the client to consume the message being sent before consuming the next one - T199813 (duration: 02m 18s)

The fix is now deployed everywhere. It has been running on scb2002 for the whole day, and that node has not exhibited any memory issues since. Hence, finally (and happily), resolving.

For posterity, it is easy to spot when Marko deployed the new code and the effect it had:

Screen Shot 2018-07-25 at 8.28.56 AM.png

Screen Shot 2018-07-25 at 8.28.37 AM.png

There seems to be only one weirdness remaining, namely:

Screen Shot 2018-07-25 at 1.32.55 PM.png

For some reason, right after the deployment the number of active clients seems to be steadily increasing over time. From the ES dashboard, it seems that this weird behaviour is present only on scb2* nodes (again).

As Marko pointed out, the increase/decrease logic for these counters is shown in:

Moreover, on several scb2* nodes the following message started to pop up after the deployment:

eventstreams[7702]: (node:1891) Warning: Possible EventEmitter memory leak detected. 11 drain listeners added. Use emitter.setMaxListeners() to increase limit

More info: https://nodejs.org/docs/latest/api/events.html#events_emitter_setmaxlisteners_n

(Not sure if it is related, but it is worth mentioning.)
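
The warning itself is easy to reproduce; the following is a guess at the pattern behind it (not a trace of the actual eventstreams code): registering a fresh 'drain' listener on the output stream for every buffered write, without ever removing it, crosses Node's default limit of 10 listeners per event.

```js
const { Writable } = require('stream');

// A deliberately slow sink whose buffer fills immediately, so every write applies backpressure.
const slow = new Writable({
    highWaterMark: 1,
    write(chunk, encoding, callback) { setTimeout(callback, 100); }
});

for (let i = 0; i < 12; i++) {
    if (!slow.write('event\n')) {
        // Leaks one listener per write; after the 11th, Node prints
        // "Possible EventEmitter memory leak detected. 11 drain listeners added."
        // Using slow.once('drain', ...) instead would avoid the accumulation.
        slow.on('drain', () => {});
    }
}
```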

Change 448108 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[mediawiki/services/eventstreams@master] Do not use compression and provide better stream and err handling

https://gerrit.wikimedia.org/r/448108

Change 448108 merged by Mobrovac:
[mediawiki/services/eventstreams@master] Do not use compression and provide better stream and err handling

https://gerrit.wikimedia.org/r/448108

Mentioned in SAL (#wikimedia-operations) [2018-07-26T19:36:04Z] <mobrovac@deploy1001> Started deploy [eventstreams/deploy@b1f577d]: Provide better stream and error handling and stop using compression - T199813

Mentioned in SAL (#wikimedia-operations) [2018-07-26T19:38:17Z] <mobrovac@deploy1001> Finished deploy [eventstreams/deploy@b1f577d]: Provide better stream and error handling and stop using compression - T199813 (duration: 02m 13s)

Change 448469 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[mediawiki/services/eventstreams@master] Update KafkaSSE to fe0df0e

https://gerrit.wikimedia.org/r/448469

Change 448469 merged by Mobrovac:
[mediawiki/services/eventstreams@master] Update KafkaSSE to fe0df0e

https://gerrit.wikimedia.org/r/448469

Mentioned in SAL (#wikimedia-operations) [2018-07-27T11:31:05Z] <mobrovac@deploy1001> Started deploy [eventstreams/deploy@941e3cf]: Make the main processing loop async - T199813

Mentioned in SAL (#wikimedia-operations) [2018-07-27T11:33:07Z] <mobrovac@deploy1001> Finished deploy [eventstreams/deploy@941e3cf]: Make the main processing loop async - T199813 (duration: 02m 02s)

mobrovac removed a project: Patch-For-Review.

This has finally been resolved for good. Here's a summary/post-mortem for clarity and posterity.

As noted in the task's description, SCB was running out of memory and the culprit was discovered to be the EventStreams service. We identified the cases that were causing this: clients that would request the RC stream starting from far in the past. We took some heap dumps of affected workers, only to find that the heap dumps were only ~20MB for processes that were holding up to 1.5GB of RSS. This indicated that the memory being held was in native components. As a first measure, we tweaked node-rdkafka's receive buffer size and the min and max values for the number of messages fetched. We also updated ES to node-rdkafka v2.3.4. In the process, we also directed ES to read only from the local Kafka cluster so as to minimise the potential impact on the cluster, given that at the time we were still unsure what exactly was going on. While all of these measures did slow down ES' memory growth over time, they did not eliminate it.

After a lengthy investigation, we figured out that the root cause was ES' lack of appropriate back-pressure mechanisms. This had been filed as T191207 but never gotten around to; basically, we were lucky until it bit us. However, locating and resolving the issue took a while because of subtleties in the way the service was coded: it mostly employed asynchronous Promise constructs, but the code path that sent events to clients was synchronous. Alas, streams in Node.js are event-based. Therefore, while not apparent at first, the code employed asynchronous, synchronous and event-based chunks all at once. Once this was identified as the problematic section, the synchronous and event-based parts were turned into asynchronous code paths in bb7e986f39b2 and deployed. That instantly took the pressure off the SCB cluster, things were back under control, and this ticket was resolved.
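
In spirit, the back-pressure fix amounts to consuming one message at a time and not asking Kafka for the next one until the client's output stream has accepted the previous one. A hedged sketch of that pattern follows (an illustration, not the actual KafkaSSE code; consumeOne is a hypothetical promisified wrapper around node-rdkafka's non-flowing consume(1, cb)):

```js
// Hypothetical helper: promisify node-rdkafka's non-flowing consume(number, callback).
function consumeOne(consumer) {
    return new Promise((resolve, reject) => {
        consumer.consume(1, (err, messages) => (err ? reject(err) : resolve(messages[0])));
    });
}

// Consume-send loop with back-pressure: if res.write() reports a full buffer, wait for
// 'drain' (i.e. for the client to catch up) before fetching the next message.
// (A real loop would also pause briefly when no message is available.)
async function consumeLoop(consumer, res) {
    let closed = false;
    res.once('close', () => { closed = true; });

    while (!closed) {
        const message = await consumeOne(consumer);
        if (!message) continue;

        const ok = res.write(`data: ${message.value.toString()}\n\n`);
        if (!ok) {
            await new Promise((resolve) => res.once('drain', resolve));
        }
    }
}
```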

Soon enough two new problems became apparent (cf. T199813#4450115):

  • NodeJS processes started reporting possible memory leaks due to too many event handlers being present
  • the graphs showing the current number of connected clients started showing an increasing number of clients.

After another round of lengthy investigation, it was discovered that ES, by virtue of inheriting code from service-template-node, was using the compression module (which wraps the output stream and actively compresses it). Because the module completely redefines the output stream, there were incompatibilities in event handling. Furthermore, it became apparent that better error and stream handling was needed while delivering events to clients. These problems were resolved with 302e6ffb3a1d and PS 448108 (compression was turned off, since it is not suited for this kind of server-client interaction).
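
The actual fix simply dropped compression, but for illustration, an express app that keeps the compression middleware for ordinary routes could exclude SSE traffic via the module's filter option, so event-stream responses are never wrapped (a sketch, not the eventstreams code):

```js
const express = require('express');
const compression = require('compression');

const app = express();

app.use(compression({
    filter: (req, res) => {
        // Never compress Server-Sent Events: the wrapped/redefined output stream breaks
        // the 'drain'-based flow control that SSE delivery relies on.
        if (req.headers.accept === 'text/event-stream') {
            return false;
        }
        return compression.filter(req, res);  // fall back to the module's default filter
    }
}));
```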

The latter problem turned out to be a continuation of the issue of mixing different styles of execution paths, namely asynchronous (Promise-based) and event-driven ones. This was solved by making the main consume-send loop run outside of the main Promise chain and having the latter wait for the output stream to be disconnected, rather than waiting for the consume-send loop to end. The fix is in fe0df0e409a3.
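
Schematically, and reusing the consumeLoop sketch above, the restructuring looks roughly like this (a hedged illustration of the described change, not the fe0df0e409a3 commit itself):

```js
function handleStreamRequest(consumer, res) {
    // Start the consume-send loop outside the main Promise chain (fire-and-forget);
    // surface loop errors by tearing down the response.
    consumeLoop(consumer, res).catch((err) => res.destroy(err));

    // The main chain resolves when the client disconnects, not when the loop finishes.
    return new Promise((resolve) => res.once('close', resolve))
        .then(() => consumer.disconnect());
}
```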

If other problems arise, we should probably open new tasks, as this particular issue should not come back to bite us again. Special thanks to @elukey for his tireless investigation and search for solutions.

Change 449434 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] Revert "cache: temporarily return 404 for stream.w.o/socket.io"

https://gerrit.wikimedia.org/r/449434

Change 449434 merged by Ema:
[operations/puppet@production] Revert "cache: temporarily return 404 for stream.w.o/socket.io"

https://gerrit.wikimedia.org/r/449434