
Maps Unavailability due to thanos-swift cfssl rollout (14 Aug 2023)
Closed, Resolved · Public

Description

On 14 Aug, from ~13:40 to ~14:50, we had limited maps.wikimedia.org availability: only ~50% of requests were successful.

Background

One of the components of our maps infrastructure is the tegola vector tile server (running on kubernetes under tegola-vector-tiles.svc.$dc.wmnet). Kartotherian (kartotherian.discovery.wmnet) makes a request to tegola, and tegola in turn checks its cache (thanos-swift.discovery.wmnet). If the object is not found, tegola pulls the related data from the maps-postgres hosts, generates the tile, and stores it back in the cache. Tegola-vector-tiles talks directly to thanos-swift.

Happy Path: edge -> kartotherian -> tegola -> thanos-swift
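
For orientation, here is a minimal sketch of the cache-or-generate flow described above. This is an assumed structure with illustrative names, not tegola's actual code:

// Simplified sketch of the flow above; types and names are illustrative,
// not tegola's actual code.
package tileflow

import "errors"

var errCacheMiss = errors.New("cache miss")

// TileCache models the Swift-backed cache (thanos-swift, via the S3 API).
type TileCache interface {
    Get(key string) ([]byte, error)
    Set(key string, tile []byte) error
}

// TileGenerator models tile generation from the maps-postgres data.
type TileGenerator interface {
    Generate(key string) ([]byte, error)
}

// serveTile returns a cached tile if present; on a miss it generates the
// tile from postgres and writes it back to the cache.
func serveTile(cache TileCache, gen TileGenerator, key string) ([]byte, error) {
    tile, err := cache.Get(key)
    if err == nil {
        return tile, nil // cache hit: postgres is not touched
    }
    if !errors.Is(err, errCacheMiss) {
        // Cache errors (e.g. the TLS handshake timeouts seen during this
        // incident) surface here and bubble up to kartotherian as 500s.
        return nil, err
    }
    tile, err = gen.Generate(key)
    if err != nil {
        return nil, err
    }
    _ = cache.Set(key, tile) // best-effort write-back
    return tile, nil
}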

Partial Outage

From ~13:40 onwards, we saw tegola overwhelmed with in-flight requests, and kartotherian in turn started returning 500s due to timeouts. The kartotherian unavailability was picked up by our usual LVS checks and, in turn, by ATS.

Due to the issues listed below, it was not immediately obvious that this outage was related to the cfssl rollout on thanos-swift (change 950072).

Responders: @fgiunchedi @Jgiannelos @jbond, @Joe @JMeybohm @herron

Red Herrings:

There were two issues that delayed us in pinpointing the exact problem:

  • maps2009 (T344110) is dead (maps-postgres primary server)
  • a scap bug/hiccup on the 1st of Aug had the side effect of making kartotherian in eqiad use tegola-vector-tiles.svc.codfw.wmnet as a backend instead of the eqiad one. This was discovered during the incident, though it was unrelated to it

Actions taken so far

  • codfw is depooled (due to T344110)
  • cfssl patch has been rolled back
  • updated wmf-certificates in tegola-vector-tiles
  • enabled tegola's s3 client debug logging, hoping to get more information (see the debug-logging sketch after this list)
  • we rolled out the cfssl patch twice on codfw and pooled it back to get traffic
    • 14 Aug ~13:40 - 14:50 (actual incident - grafana &from=1692018879707&orgId=1&to=1692027162776)
    • 17 Aug 13:36 - 13:40 GMT
    • 17 Aug ~13:50 - 14:00 GMT (grafana: &from=1692279018056&to=1692282274755)
    • 18 Aug 14:24 - 14:29 GMT (grafana: &from=1692367786000&to=1692373532000)
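
For reference, enabling S3-client debug output at the aws-sdk-go (v1) level looks roughly like the sketch below. This is a hedged illustration of the SDK mechanism only; tegola exposes this through its own cache configuration, and the endpoint/region values here are placeholders:

// Illustrative only: how aws-sdk-go (v1) debug logging is switched on.
package s3debug

import (
    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/s3"
)

func newDebuggingS3Client() (*s3.S3, error) {
    sess, err := session.NewSession(&aws.Config{
        Endpoint:         aws.String("https://thanos-swift.discovery.wmnet"),
        Region:           aws.String("us-east-1"), // placeholder; Swift's S3 API does not use it
        S3ForcePathStyle: aws.Bool(true),
        // Log every request and response, including retries, to the SDK logger.
        LogLevel: aws.LogLevel(aws.LogDebugWithHTTPBody | aws.LogDebugWithRequestRetries),
    })
    if err != nil {
        return nil, err
    }
    return s3.New(sess), nil
}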

Information

  • Both CAs are present in the tegola-vector-tiles container
  • a curl call from a codfw tegola-vector-tiles container to thanos-swift is successful
  • when experimenting with codfw, we restarted all pods and tegola's cache tests were successful
    • in other words, it appears that tegola is able to talk to thanos-swift after the cfssl rollout
    • we have not ruled out that TLS is the problem here
  • during codfw tests, we initially see tegola serving requests and talking to its cache; within 1-2 minutes of pooling, tegola fails to reach its cache
  • During the 14 Aug incident, we saw the following:
{"log":"2023-08-14 14:38:31 [ERROR] middleware_tile_cache.go:42: cache middleware: error reading from cache: RequestError: send request failed\n","stream":"stderr","time":"2023-08-14T14:38:31.737858817Z"}
{"log":"caused by: Get https://thanos-swift.discovery.wmnet/tegola-swift-codfw-v002/shared-cache/osm/8/148/74: net/http: TLS handshake timeout\n","stream":"stderr","time":"2023-08-14T14:38:31.737918346Z"}
  • During the 18 Aug test, the s3 debug log didn't yield a TLS error; instead we saw the s3 client making GET requests without getting a response
  • it is worth noting that Go's net/http package is the HTTP client here
  • Cipher suites: TLS_AES_128_GCM_SHA256 is selected by envoy with the puppet cert vs TLS_AES_256_GCM_SHA384 with cfssl (see the sketch below)
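
Since Go's net/http is the client here, a quick standalone check of what a Go TLS client negotiates against thanos-swift looks like the following. This is a diagnostic sketch, not part of tegola, and it assumes the WMF CA bundle is in the system trust store:

// Diagnostic only: print the TLS cipher suite a Go client negotiates with
// the thanos-swift frontend envoy.
package main

import (
    "crypto/tls"
    "fmt"
    "log"
)

func main() {
    conn, err := tls.Dial("tcp", "thanos-swift.discovery.wmnet:443", &tls.Config{})
    if err != nil {
        log.Fatalf("TLS dial failed: %v", err)
    }
    defer conn.Close()

    state := conn.ConnectionState()
    // Expected to print TLS_AES_128_GCM_SHA256 with the puppet cert and
    // TLS_AES_256_GCM_SHA384 after the cfssl rollout, per the observation above.
    fmt.Println("negotiated cipher suite:", tls.CipherSuiteName(state.CipherSuite))
}

Note that both suites are TLS 1.3 suites, and Go's crypto/tls does not let a client restrict the TLS 1.3 cipher suites, so the choice is effectively envoy's; this matches the observation that the suite changed with the certificate rollout.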

Useful Graphs and data

14 Aug

tegola-vector-tiles-app-metrics
lvs codfw
thanos-swift codfw
thanos-fe2004 host overview
thanos-swift envoy telemetry

18 Aug

[Grafana screenshots attached: image.png (918×1 px, 192 KB), image.png (344×300 px, 22 KB)]

cc: @Jgiannelos @Vgutierrez @Joe @JMeybohm @akosiaris (thank you for the help in going through logs, tcpdumps, and helping)

Details

Related patches (repo @ branch: lines +/-):
  • operations/puppet @ production: +8 -48
  • operations/puppet @ production: +16 -0
  • operations/deployment-charts @ master: +10 -31
  • operations/deployment-charts @ master: +9 -0
  • operations/deployment-charts @ master: +2 -3
  • operations/software/tegola @ wmf/v0.19.x: +25 -19
  • operations/deployment-charts @ master: +9 -1
  • operations/software/tegola @ wmf/v0.19.x: +27 -0
  • operations/puppet @ production: +16 -0
  • operations/deployment-charts @ master: +5 -0
  • operations/deployment-charts @ master: +1 -7
  • operations/deployment-charts @ master: +6 -0
  • operations/software/tegola @ wmf/v0.14.x: +3 -0
  • operations/deployment-charts @ master: +1 -8
  • operations/deployment-charts @ master: +3 -0
  • operations/deployment-charts @ master: +5 -1
  • operations/software/tegola @ wmf/v0.14.x: +4 -3
  • maps/kartotherian/deploy @ master: +7 -6

Event Timeline


From the 18 Aug rollout, tcpdumps can be found in my home directory on kubernetes2010 and thanos-fe2004. The pod IP on kubernetes2010 is 10.194.150.55 and thanos' IP is 10.2.1.54.

jijiki changed the task status from Open to In Progress.Aug 22 2023, 4:46 PM
jijiki triaged this task as High priority.
jijiki updated the task description. (Show Details)
jijiki added subscribers: Jgiannelos, jbond, Vgutierrez and 3 others.

Could we just circumvent the problem by enabling the thanos-swift listener in tegola's service-proxy and configuring it to use the plaintext localhost connection instead of connecting directly to thanos-swift?

> Could we just circumvent the problem by enabling the thanos-swift listener in tegola's service-proxy and configuring it to use the plaintext localhost connection instead of connecting directly to thanos-swift?

That's always going to be our last resort, although I fear the s3 SDK will complain about various things and will need tweaking. In any case, we need to understand what's wrong, as we might encounter similar issues elsewhere.

So: yes we could, but no we shouldn't.

Mentioned in SAL (#wikimedia-operations) [2023-08-23T14:23:36Z] <akosiaris> pool kartotherian in codfw for testing T344324

Mentioned in SAL (#wikimedia-operations) [2023-08-23T14:34:28Z] <akosiaris> depool again kartotherian in codfw for testing T344324

Mentioned in SAL (#wikimedia-operations) [2023-08-23T14:57:13Z] <akosiaris> deploy codfw tegola-vector-tiles with high CPU limits to rule out a hunch. T344324

Mentioned in SAL (#wikimedia-operations) [2023-08-23T14:59:17Z] <akosiaris> pool kartotherian in codfw for testing T344324

Change 951957 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/deployment-charts@master] tegola: bump image and cpu limits on codfw

https://gerrit.wikimedia.org/r/951957

Change 951957 merged by jenkins-bot:

[operations/deployment-charts@master] tegola: bump cpu limits

https://gerrit.wikimedia.org/r/951957

Testing procedure:

  • stop puppet on eqiad
  • merge the cfssl puppet change and roll it out on codfw (example: 952156)
  • pool kartotherian on codfw
  • depool eqiad

The goal is that each DC can handle the full traffic without errors.

While looking into this with @akosiaris (cheers Alex!), we found that the pods were being heavily throttled. Bumping the pod CPU limits initially appeared to improve the situation, but when we sent all traffic to codfw, we started seeing errors again (TLS handshake timeout).

Even if throwing more CPU at the problem works, it is not a great solution. Our CPU limit was initially 400m, and 12 replicas were able to handle the full traffic. For the time being we are experimenting with 4 CPUs to see how it goes.

To add some numbers: with full traffic on eqiad without cfssl and a limit of 2 CPUs, a tegola container's max throttling is 50ms-200ms, while on codfw with cfssl and a limit of 4 CPUs, throttling is 10s-20s.

> Could we just circumvent the problem by enabling the thanos-swift listener in tegola's service-proxy and configuring it to use the plaintext localhost connection instead of connecting directly to thanos-swift?
>
> That's always going to be our last resort, although I fear the s3 SDK will complain about various things and will need tweaking. In any case, we need to understand what's wrong, as we might encounter similar issues elsewhere.
>
> So: yes we could, but no we shouldn't.

While the service does not hold any PII, there is an auth key being passed, so sadly we should only do this when we are desperate, and then be ready to rotate the key.

jijiki changed the task status from In Progress to Stalled.Aug 29 2023, 2:26 PM
jijiki moved this task from Incoming 🐫 to this.quarter 🍕 on the serviceops board.
jijiki updated the task description. (Show Details)

Additionally, while the keys for both certificates are the same size, during negotiation envoy chooses the TLS_AES_256_GCM_SHA384 cipher suite, whereas previously it was choosing TLS_AES_128_GCM_SHA256. We will investigate this further when we pick the task up again.

Jgiannelos changed the task status from Stalled to In Progress.Oct 2 2023, 3:10 PM
Jgiannelos moved this task from Backlog to In Progress on the Content-Transform-Team-WIP board.

Major upgrades of tegola + dependencies + base debian image have been live for quite some time now with no errors/complaints.
Should we check whether this improved things with the swift cert switchover?

> Major upgrades of tegola + dependencies + base debian image have been live for quite some time now with no errors/complaints.
> Should we check whether this improved things with the swift cert switchover?

While we discussed this in private, I forgot to update the task. I will have another go at rolling out cfssl as soon as I find some time.

Stalled until T356412 is picked up by Data-Persistence

I am working on the task, but basically just moving the swift proxies (main swift, not thanos-swift) to PKI/CFSSL. What is the requirement for thanos-swift? In theory, if the cabundle shipped by wmf-certificates is used, everything should run fine. In T360414 the Observability team is moving ahead with thanos-swift on CFSSL; it might be good to sync.

> Stalled until T356412 is picked up by Data-Persistence
>
> I am working on the task, but basically just moving the swift proxies (main swift, not thanos-swift) to PKI/CFSSL. What is the requirement for thanos-swift? In theory, if the cabundle shipped by wmf-certificates is used, everything should run fine. In T360414 the Observability team is moving ahead with thanos-swift on CFSSL; it might be good to sync.

As we mention in T344324#9117628, for reasons we don't know (we stopped digging into this), when we switched thanos to cfssl it appeared that tegola needed a lot more horsepower than before to talk to thanos. The only difference I was able to spot was from thanos' POV:

> Cipher suites: TLS_AES_128_GCM_SHA256 is selected by envoy with the puppet cert vs TLS_AES_256_GCM_SHA384 with cfssl

We have added more resources to tegola, but we never revisited the problem, though we could prioritise it again. I will sync with Observability in T360414.

Change #1029544 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] services: move Swift config in staging to local envoy proxy

https://gerrit.wikimedia.org/r/1029544

Change #1029544 merged by Elukey:

[operations/deployment-charts@master] services: move Tegola's Swift config in staging to local envoy proxy

https://gerrit.wikimedia.org/r/1029544

To keep the archives happy: I tried to set up a local sidecar in staging (I think it was attempted before, but I didn't find task entries, sorry) and I got:

  • An error related to using an HTTP endpoint for an HTTPS call (endpoint=localhost:6022)
  • An error like the following when forcing https in the endpoint (endpoint=https://localhost:6022):

SignatureDoesNotMatch: The request signature we calculated does not match the signature you provided. Check your key and signing method.

The problem is likely to be the Host header, see https://github.com/aws/aws-sdk-go/issues/1473. Maybe we could add some custom code to our tegola repo and test it?

I also found this interesting project, which explains the issue very well: https://github.com/Kriechi/aws-s3-reverse-proxy/blob/master/README.md

AWS uses its Signature v4 to protect API access. All requests sent to AWS S3 have to have a valid signature which includes your AWS security credentials (commonly known as AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY). This signature is encoded in the HTTP Authorization header and also contains a SHA256 of various other HTTP headers, including the Host header, which directly corresponds to the endpoint_url. Since we want to reverse-proxy requests, we need to rewrite this header, and therefore need to re-sign the request.

We can avoid the special proxy and instruct tegola about this special use case, I'll send a patch and you'll tell me your thoughts :)
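
To make that concrete, here is a hedged sketch of the idea (aws-sdk-go v1; function names, port and host values are illustrative, and this is not necessarily how the later proxy_host patch or the upstream PR implement it): point the client at the local plaintext listener, but rewrite the Host header before signing so the SigV4 signature is still valid for the real Swift host.

// Sketch only: talk to the local envoy listener while signing for the real
// Swift host. Endpoint, port and host values are illustrative.
package proxysketch

import (
    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/credentials"
    "github.com/aws/aws-sdk-go/aws/request"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/s3"
)

func newSidecarS3Client(accessKey, secretKey string) (*s3.S3, error) {
    const (
        localProxy   = "http://localhost:6022"        // plaintext sidecar listener
        upstreamHost = "thanos-swift.discovery.wmnet" // host the signature must cover
    )
    sess, err := session.NewSession(&aws.Config{
        Endpoint:         aws.String(localProxy),
        Region:           aws.String("us-east-1"), // placeholder
        S3ForcePathStyle: aws.Bool(true),
        Credentials:      credentials.NewStaticCredentials(accessKey, secretKey, ""),
    })
    if err != nil {
        return nil, err
    }
    svc := s3.New(sess)
    // Before the SigV4 signer runs, set the Host header to the upstream host:
    // the TCP connection still goes to localhost, but the signature is
    // computed over the host that Swift verifies on the other side of the proxy.
    svc.Handlers.Sign.PushFront(func(r *request.Request) {
        r.HTTPRequest.Host = upstreamHost
    })
    return svc, nil
}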

I took the time to re-read the whole task, and one thing I had missed is that, after a lot of upgrades, we may be in a different position with the current version of Tegola (namely, the problem may not be present anymore, or may appear in a different form).

So instead of going down the tegola/request-signing road (even if it was a good excuse to write some Go!), I propose to upgrade a single Thanos Swift FE node to CFSSL/PKI and see how Tegola reacts (instead of rolling out all the certs at once). In case of trouble we can depool and fix it, limiting the damage to client requests.

Change #1031439 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Move Swift on thanos-fe1001 to PKI TLS cert

https://gerrit.wikimedia.org/r/1031439

Change #1031439 merged by Elukey:

[operations/puppet@production] Move Swift on thanos-fe1001 to PKI TLS cert

https://gerrit.wikimedia.org/r/1031439

We rolled out the CFSSL/PKI cert to thanos-fe1001, one of the 4 eqiad nodes, and from the CPU graphs usage seems to have gone up by roughly +50ms. There is no constant throttling and the app seems to be working fine. The open question is what happens with all 4 nodes on PKI, and whether we'll need more capacity in the future.

From https://github.com/go-spatial/tegola/releases/tag/v0.20.0 it seems that Tegola bumped the aws-sdk-go library to a newer version, which may improve performance. This seems to be a problem with the aws-sdk-go library, not tegola or anything else.

Upgrading Tegola from 0.12.1 to 0.20 is a big jump, but probably worth it (we are 4 years behind upstream...).

@Jgiannelos ahhh I got fooled by the repo, didn't see that we use one branch for each release.. and the git tags fooled me as well.

If it is not a problem we can definitely try 0.20, but I have zero knowledge about tegola so if you'd have some spare cycles to dedicate to this it would be great.

The main problem, in my opinion, is the following: Tegola uses aws-sdk-go to connect to Thanos Swift, and (afaict from the code) it doesn't tune any client values, so it ends up creating a new TLS connection every time. As Effie pointed out earlier, the TLS cipher suite changed with the new TLS cert on thanos-fe1001, which would explain the increase in CPU usage (more time spent in handshakes, etc.).
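
As a hedged illustration of the "doesn't tune any client values" point (not what tegola ships today, and not verified as a fix on its own): aws-sdk-go v1 accepts a custom *http.Client, so the idle-connection pool can be sized for the request concurrency instead of relying on Go's default of only 2 idle connections per host, which under load forces frequent new connections and TLS handshakes.

// Sketch: hand aws-sdk-go (v1) a transport that keeps enough idle
// connections to thanos-swift to avoid re-handshaking under load.
// Values are illustrative.
package pooling

import (
    "net/http"
    "time"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
)

func newPooledSession(endpoint string) (*session.Session, error) {
    transport := &http.Transport{
        MaxIdleConns:        100,
        MaxIdleConnsPerHost: 100, // Go's default is 2; excess idle conns get closed
        IdleConnTimeout:     90 * time.Second,
    }
    return session.NewSession(&aws.Config{
        Endpoint:         aws.String(endpoint),
        S3ForcePathStyle: aws.Bool(true),
        HTTPClient:       &http.Client{Transport: transport, Timeout: 30 * time.Second},
    })
}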

Ideally we could use the envoy sidecar that SRE offers for k8s, but as written above, request signing would need to be fixed on Tegola's end (see https://github.com/aws/aws-sdk-go/issues/1473). Using a sidecar would allow us to offload all the TLS work to it, and the problem would be solved. From the gh issue (which is old, so the code might be unreliable), it seems that the patch required would be very small; we could test it in our repo and then propose a change upstream. What do you think?

elukey changed the task status from Stalled to Open.May 15 2024, 2:42 PM
elukey changed the status of subtask T356412: Consolidate TLS cert puppetry for ms and thanos swift frontends from Open to Stalled.

Change #1032482 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/software/tegola@wmf/v0.19.x] Add proxy_host setting to the S3 cache.

https://gerrit.wikimedia.org/r/1032482

Change #1032482 merged by jenkins-bot:

[operations/software/tegola@wmf/v0.19.x] Add proxy_host setting to the S3 cache.

https://gerrit.wikimedia.org/r/1032482

Change #1034503 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] services: force Tegola to use the local sidecar for Swift caching

https://gerrit.wikimedia.org/r/1034503

Change #1034503 merged by Elukey:

[operations/deployment-charts@master] services: force Tegola to use the local sidecar for Swift caching

https://gerrit.wikimedia.org/r/1034503

Change #1035006 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/software/tegola@wmf/v0.19.x] cache: fix and improve the code in the s3 module that allows a proxy

https://gerrit.wikimedia.org/r/1035006

The code in https://gerrit.wikimedia.org/r/1035006 should work nicely; I was able to test it on stat1010 with the following setup:

  • locally-built tegola with the new code
  • a minimal config to just connect to s3/swift (no postgres) using the tegola staging credentials
  • mitmproxy listening in localhost:6022 (to mimic the envoy proxy in k8s)

Once reviewed and merged, I'll test it in staging again :)

Change #1035006 merged by jenkins-bot:

[operations/software/tegola@wmf/v0.19.x] cache: fix and improve the code in the s3 module that allows a proxy

https://gerrit.wikimedia.org/r/1035006

Change #1035445 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] services: update Tegola's Docker settings in staging

https://gerrit.wikimedia.org/r/1035445

Change #1035445 merged by Elukey:

[operations/deployment-charts@master] services: update Tegola's Docker settings in staging

https://gerrit.wikimedia.org/r/1035445

The new code is deployed in staging, and it uses the envoy sidecar to contact Thanos. More info in this dashboard.

Opened https://github.com/go-spatial/tegola/pull/992 to upstream.

@Jgiannelos @jijiki is there any specific integration testing that we could do in staging to confirm that everything is working?

I had a chat with @Jgiannelos on IRC, and this was the testing done in staging:

# Sample list with 1000 tiles
# https://phabricator.wikimedia.org/P63005
curl -o 1000-tiles.txt https://phab.wmfusercontent.org/file/data/nwqv7m52fjlaxxovutfq/PHID-FILE-srm6yb6sbz7yp4xkd64g/1000-tiles.txt
# Request each tile from the staging endpoint; the tegola-cache response header shows HIT/MISS
cat 1000-tiles.txt | xargs -I {} curl -v -o /dev/null https://staging.svc.eqiad.wmnet:4105/maps/osm/{}.pbf

Except for some problems with paths starting with 1/X/X (which were reproducible even without the proxy settings), we saw a stream of requests ending up with the response header tegola-cache: MISS, followed by a stream of responses with tegola-cache: HIT.

The envoy sidecar is publishing metrics, so in theory we should be good to go.

Change #1035743 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] services: upgrade tegola in codfw to use the envoy proxy for Swift

https://gerrit.wikimedia.org/r/1035743

Change #1035743 merged by Elukey:

[operations/deployment-charts@master] services: upgrade tegola in codfw to use the envoy proxy for Swift

https://gerrit.wikimedia.org/r/1035743

Change #1036269 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] services: move tegola in eqiad to the Thanos sidecar config for Swift

https://gerrit.wikimedia.org/r/1036269

Change #1036269 merged by Elukey:

[operations/deployment-charts@master] services: move tegola in eqiad to the Thanos sidecar config for Swift

https://gerrit.wikimedia.org/r/1036269

Tegola in staging and production is now connecting to Thanos via the envoy sidecar. The rollout went fine and we observed a small CPU usage reduction in codfw. In eqiad it was difficult to judge the impact, since there seemed to be a rise in traffic that pushed up multiple metrics, so we'll see once things stabilize.

Next steps: move another thanos swift frontend to cfssl and see how Tegola's CPU reacts.

Change #1036284 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Move thanos-fe1002's envoy to CFSSL/PKI

https://gerrit.wikimedia.org/r/1036284

Change #1036284 merged by Elukey:

[operations/puppet@production] Move thanos-fe1002's envoy to CFSSL/PKI

https://gerrit.wikimedia.org/r/1036284

Mentioned in SAL (#wikimedia-operations) [2024-05-28T12:38:24Z] <elukey> move thanos-fe1002's envoy TLS cert to CFSSL/PKI - T344324

Change #1036643 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::thanos::frontend: move all envoy TLS certs to PKI

https://gerrit.wikimedia.org/r/1036643

Change #1036643 merged by Elukey:

[operations/puppet@production] role::thanos::frontend: move all envoy TLS certs to PKI

https://gerrit.wikimedia.org/r/1036643

Mentioned in SAL (#wikimedia-operations) [2024-05-29T12:39:17Z] <elukey> move thanos-fe100[3,4] and thanos-fe2* to PKI TLS certs (envoy, backends for thanos-swift.discovery.wmnet) - T344324

Thanos Swift TLS certs migrated to CFSSL/PKI, no issue on the Tegola side.

The problem:
From the various investigations it seemed that Tegola's aws-sdk-go S3 client didn't use a connection pool when contacting Thanos Swift, ending up doing a TLS handshake for every request (at least, this seems to be the case for the client version we currently use, from the 0.19.x branch). When the Thanos Swift envoys were migrated to the new cert, the cipher suite negotiated between Tegola and Thanos Swift changed, requiring more CPU on Tegola's side.

Fixes:

  • We increased the CPU allowance for Tegola containers.
  • We modified our version of Tegola to support using Envoy as a sidecar/local proxy. It didn't work out of the box due to request signing; see more info in https://github.com/go-spatial/tegola/pull/992 (this PR should be merged upstream soon, so in the future we hopefully won't need local patches in our repos).