
Maps Unavailability due to thanos-swift cfssl rollout (14 Aug 2023)
Closed, Resolved · Public

Description

On 14 Aug, from ~13:40 to ~14:50, we had limited maps.wikimedia.org availability: only ~50% of requests were successful.

Background

One of the components of our maps infrastructure is the tegola vector tile server (running on kubernetes under tegola-vector-tiles.svc.$dc.wmnet). Kartotherian (kartotherian.discovery.wmnet) makes a request to tegola, and tegola in turn checks its cache (thanos-swift.discovery.wmnet). If the object is not found, tegola pulls the related data from the maps-postgres hosts, generates the tile, and stores it back in the cache. Tegola-vector-tiles talks directly to thanos-swift.

Happy Path: edge -> kartotherian -> tegola -> thanos-swift
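
For orientation, here is a minimal sketch of the cache-or-generate flow described above. This is an assumed structure with illustrative names, not tegola's actual code:

// Simplified sketch of the flow above; types and names are illustrative,
// not tegola's actual code.
package tileflow

import "errors"

var errCacheMiss = errors.New("cache miss")

// TileCache models the Swift-backed cache (thanos-swift, via the S3 API).
type TileCache interface {
    Get(key string) ([]byte, error)
    Set(key string, tile []byte) error
}

// TileGenerator models tile generation from the maps-postgres data.
type TileGenerator interface {
    Generate(key string) ([]byte, error)
}

// serveTile returns a cached tile if present; on a miss it generates the
// tile from postgres and writes it back to the cache.
func serveTile(cache TileCache, gen TileGenerator, key string) ([]byte, error) {
    tile, err := cache.Get(key)
    if err == nil {
        return tile, nil // cache hit: postgres is not touched
    }
    if !errors.Is(err, errCacheMiss) {
        // Cache errors (e.g. the TLS handshake timeouts seen during this
        // incident) surface here and bubble up to kartotherian as 500s.
        return nil, err
    }
    tile, err = gen.Generate(key)
    if err != nil {
        return nil, err
    }
    _ = cache.Set(key, tile) // best-effort write-back
    return tile, nil
}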

Partial Outage

From ~13:40 onwards, we saw tegola overwhelmed with in-flight requests, and kartotherian in turn started returning 500s due to timeouts. The kartotherian unavailability was picked up by our usual LVS checks and, in turn, by ATS.

Due to the issues listed below, it was not immediately obvious that this outage was related to the cfssl rollout on thanos-swift (change 950072).

Responders: @fgiunchedi @Jgiannelos @jbond, @Joe @JMeybohm @herron

Red Herrings:

There were two issues that delayed us in pinpointing the exact problem:

  • maps2009 (T344110) is dead (maps-postgres primary server)
  • a scap bug/hiccup on the 1st of Aug had the side effect of making kartotherian in eqiad use tegola-vector-tiles.svc.codfw.wmnet as a backend instead of the eqiad one. This was discovered during the incident, though it was unrelated to it

Actions taken so far

  • codfw is depooled (due to T344110)
  • cfssl patch has been rolled back
  • updated wmf-certificates in tegola-vector-tiles
  • enabled tegola's s3 client debug logging, hoping to get more information (see the debug-logging sketch after this list)
  • we rolled out the cfssl patch twice on codfw and pooled it back to get traffic
    • 14 Aug ~13:40 - 14:50 (actual incident - grafana &from=1692018879707&orgId=1&to=1692027162776)
    • 17 Aug 13:36 - 13:40 GMT
    • 17 Aug ~13:50 - 14:00 GMT (grafana: &from=1692279018056&to=1692282274755)
    • 18 Aug 14:24 - 14:29 GMT (grafana: &from=1692367786000&to=1692373532000)
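
For reference, enabling S3-client debug output at the aws-sdk-go (v1) level looks roughly like the sketch below. This is a hedged illustration of the SDK mechanism only; tegola exposes this through its own cache configuration, and the endpoint/region values here are placeholders:

// Illustrative only: how aws-sdk-go (v1) debug logging is switched on.
package s3debug

import (
    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/s3"
)

func newDebuggingS3Client() (*s3.S3, error) {
    sess, err := session.NewSession(&aws.Config{
        Endpoint:         aws.String("https://thanos-swift.discovery.wmnet"),
        Region:           aws.String("us-east-1"), // placeholder; Swift's S3 API does not use it
        S3ForcePathStyle: aws.Bool(true),
        // Log every request and response, including retries, to the SDK logger.
        LogLevel: aws.LogLevel(aws.LogDebugWithHTTPBody | aws.LogDebugWithRequestRetries),
    })
    if err != nil {
        return nil, err
    }
    return s3.New(sess), nil
}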

Information

  • Both CAs are present in the tegola-vector-tiles container
  • a curl call from a codfw tegola-vector-tiles container to thanos-swift is successful
  • when experimenting with codfw, we restarted all pods and tegola's cache tests were successful
    • in other words, it appears that tegola is able to talk to thanos-swift after the cfssl rollout
    • we have not ruled out that TLS is the problem here
  • during codfw tests, we initially see tegola serving requests and talking to its cache; within 1-2 minutes of pooling, tegola fails to reach its cache
  • During the 14 Aug incident, we saw the following:
{"log":"2023-08-14 14:38:31 [ERROR] middleware_tile_cache.go:42: cache middleware: error reading from cache: RequestError: send request failed\n","stream":"stderr","time":"2023-08-14T14:38:31.737858817Z"}
{"log":"caused by: Get https://thanos-swift.discovery.wmnet/tegola-swift-codfw-v002/shared-cache/osm/8/148/74: net/http: TLS handshake timeout\n","stream":"stderr","time":"2023-08-14T14:38:31.737918346Z"}
  • During the 18 Aug test, the s3 debug log didn't yield a TLS error; instead we saw the s3 client making GET requests without getting a response
  • it is worth noting that Go's net/http package is the HTTP client here
  • Cipher suites: TLS_AES_128_GCM_SHA256 is selected by envoy with the puppet cert vs TLS_AES_256_GCM_SHA384 with cfssl (see the sketch below)
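
Since Go's net/http is the client here, a quick standalone check of what a Go TLS client negotiates against thanos-swift looks like the following. This is a diagnostic sketch, not part of tegola, and it assumes the WMF CA bundle is in the system trust store:

// Diagnostic only: print the TLS cipher suite a Go client negotiates with
// the thanos-swift frontend envoy.
package main

import (
    "crypto/tls"
    "fmt"
    "log"
)

func main() {
    conn, err := tls.Dial("tcp", "thanos-swift.discovery.wmnet:443", &tls.Config{})
    if err != nil {
        log.Fatalf("TLS dial failed: %v", err)
    }
    defer conn.Close()

    state := conn.ConnectionState()
    // Expected to print TLS_AES_128_GCM_SHA256 with the puppet cert and
    // TLS_AES_256_GCM_SHA384 after the cfssl rollout, per the observation above.
    fmt.Println("negotiated cipher suite:", tls.CipherSuiteName(state.CipherSuite))
}

Note that both suites are TLS 1.3 suites, and Go's crypto/tls does not let a client restrict the TLS 1.3 cipher suites, so the choice is effectively envoy's; this matches the observation that the suite changed with the certificate rollout.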

Useful Graphs and data

14 Aug

tegola-vector-tiles-app-metrics
lvs codfw
thanos-swift codfw
thanos-fe2004 host overview
thanos-swift envoy telemetry

18 Aug

[Grafana screenshots attached: image.png (918×1 px, 192 KB), image.png (344×300 px, 22 KB)]

cc: @Jgiannelos @Vgutierrez @Joe @JMeybohm @akosiaris (thank you for the help in going through logs, tcpdumps, and helping)

Details

Related patches (repo @ branch: lines +/-):
  • operations/puppet @ production: +8 -48
  • operations/puppet @ production: +16 -0
  • operations/deployment-charts @ master: +10 -31
  • operations/deployment-charts @ master: +9 -0
  • operations/deployment-charts @ master: +2 -3
  • operations/software/tegola @ wmf/v0.19.x: +25 -19
  • operations/deployment-charts @ master: +9 -1
  • operations/software/tegola @ wmf/v0.19.x: +27 -0
  • operations/puppet @ production: +16 -0
  • operations/deployment-charts @ master: +5 -0
  • operations/deployment-charts @ master: +1 -7
  • operations/deployment-charts @ master: +6 -0
  • operations/software/tegola @ wmf/v0.14.x: +3 -0
  • operations/deployment-charts @ master: +1 -8
  • operations/deployment-charts @ master: +3 -0
  • operations/deployment-charts @ master: +5 -1
  • operations/software/tegola @ wmf/v0.14.x: +4 -3
  • maps/kartotherian/deploy @ master: +7 -6

Event Timeline


From the 18 Aug rollout, tcpdumps can be found in my home directory on kubernetes2010 and thanos-fe2004. The pod IP on kubernetes2010 is 10.194.150.55 and thanos' IP is 10.2.1.54.

jijiki changed the task status from Open to In Progress.Aug 22 2023, 4:46 PM
jijiki triaged this task as High priority.
jijiki updated the task description. (Show Details)
jijiki added subscribers: Jgiannelos, jbond, Vgutierrez and 3 others.

Could we just circumvent the problem by enabling the thanos-swift listener in tegola's service-proxy and configuring it to use the plaintext localhost connection instead of connecting directly to thanos-swift?

> Could we just circumvent the problem by enabling the thanos-swift listener in tegola's service-proxy and configuring it to use the plaintext localhost connection instead of connecting directly to thanos-swift?

That's always going to be our last resort, although I fear the s3 SDK will complain about various things and will need tweaking. In any case, we need to understand what's wrong, as we might encounter similar issues elsewhere.

So: yes we could, but no we shouldn't.

Mentioned in SAL (#wikimedia-operations) [2023-08-23T14:23:36Z] <akosiaris> pool kartotherian in codfw for testing T344324

Mentioned in SAL (#wikimedia-operations) [2023-08-23T14:34:28Z] <akosiaris> depool again kartotherian in codfw for testing T344324

Mentioned in SAL (#wikimedia-operations) [2023-08-23T14:57:13Z] <akosiaris> deploy codfw tegola-vector-tiles with high CPU limits to rule out a hunch. T344324

Mentioned in SAL (#wikimedia-operations) [2023-08-23T14:59:17Z] <akosiaris> pool kartotherian in codfw for testing T344324

Change 951957 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/deployment-charts@master] tegola: bump image and cpu limits on codfw

https://gerrit.wikimedia.org/r/951957

Change 951957 merged by jenkins-bot:

[operations/deployment-charts@master] tegola: bump cpu limits

https://gerrit.wikimedia.org/r/951957

Testing procedure:

  • stop puppet on eqiad
  • merge the cfssl puppet change and roll it out on codfw (example: 952156)
  • pool kartotherian on codfw
  • depool eqiad

The goal is that each DC can handle the full traffic without errors.

While looking into this with @akosiaris (cheers Alex!), we found that the pods were being heavily throttled. Bumping the pod CPU limits initially appeared to improve the situation, but when we sent all traffic to codfw, we started seeing errors again (TLS handshake timeout).

Even if throwing more CPU at the problem works, it is not a great solution. Our CPU limit was initially 400m, and 12 replicas were able to handle the full traffic. For the time being we are experimenting with 4 CPUs to see how it goes.

To add some numbers: with full traffic on eqiad without cfssl and a limit of 2 CPUs, a tegola container's max throttling is 50ms-200ms, while on codfw with cfssl and a limit of 4 CPUs, throttling is 10s-20s.

> Could we just circumvent the problem by enabling the thanos-swift listener in tegola's service-proxy and configuring it to use the plaintext localhost connection instead of connecting directly to thanos-swift?
>
> That's always going to be our last resort, although I fear the s3 SDK will complain about various things and will need tweaking. In any case, we need to understand what's wrong, as we might encounter similar issues elsewhere.
>
> So: yes we could, but no we shouldn't.

While the service does not hold any PII, there is an auth key being passed, so sadly we should only do this when we are desperate, and then be ready to rotate the key.

jijiki changed the task status from In Progress to Stalled.Aug 29 2023, 2:26 PM
jijiki moved this task from Incoming 🐫 to this.quarter 🍕 on the serviceops board.
jijiki updated the task description. (Show Details)

Additionally, while the keys for both certificates are the same size, during negotiation envoy chooses the TLS_AES_256_GCM_SHA384 cipher suite, whereas previously it was choosing TLS_AES_128_GCM_SHA256. We will investigate this further when we pick the task up again.

Jgiannelos changed the task status from Stalled to In Progress.Oct 2 2023, 3:10 PM
Jgiannelos moved this task from Backlog to In Progress on the Content-Transform-Team-WIP board.

Major upgrades of tegola + dependencies + base debian image have been live for quite some time now with no errors/complaints.
Should we check whether this improved things with the swift cert switchover?

> Major upgrades of tegola + dependencies + base debian image have been live for quite some time now with no errors/complaints.
> Should we check whether this improved things with the swift cert switchover?

While we discussed this in private, I forgot to update the task. I will have another go at rolling out cfssl as soon as I find some time.

Stalled until T356412 is picked up by Data-Persistence

I am working on the task, but basically just moving the swift proxies (main swift, not thanos-swift) to PKI/CFSSL. What is the requirement for thanos-swift? In theory, if the cabundle shipped by wmf-certificates is used, everything should run fine. In T360414 the Observability team is moving ahead with thanos-swift on CFSSL; it might be good to sync.

> Stalled until T356412 is picked up by Data-Persistence
>
> I am working on the task, but basically just moving the swift proxies (main swift, not thanos-swift) to PKI/CFSSL. What is the requirement for thanos-swift? In theory, if the cabundle shipped by wmf-certificates is used, everything should run fine. In T360414 the Observability team is moving ahead with thanos-swift on CFSSL; it might be good to sync.

As we mention in T344324#9117628, for reasons we don't know (we stopped digging into this), when we switched thanos to cfssl it appeared that tegola needed a lot more horsepower than before to talk to thanos. The only difference I was able to spot was from thanos' POV:

> Cipher suites: TLS_AES_128_GCM_SHA256 is selected by envoy with the puppet cert vs TLS_AES_256_GCM_SHA384 with cfssl

We have added more resources to tegola, but we never revisited the problem, though we could prioritise it again. I will sync with Observability in T360414.

Change #1029544 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] services: move Swift config in staging to local envoy proxy

https://gerrit.wikimedia.org/r/1029544

Change #1029544 merged by Elukey:

[operations/deployment-charts@master] services: move Tegola's Swift config in staging to local envoy proxy

https://gerrit.wikimedia.org/r/1029544

To keep the archives happy: I tried to set up a local sidecar in staging (I think it was attempted before, but I didn't find task entries, sorry) and I got:

  • An error related to using an HTTP endpoint for an HTTPS call (endpoint=localhost:6022)
  • An error like the following when forcing https in the endpoint (endpoint=https://localhost:6022):

SignatureDoesNotMatch: The request signature we calculated does not match the signature you provided. Check your key and signing method.

The problem is likely to be the Host header, see https://github.com/aws/aws-sdk-go/issues/1473. Maybe we could add some custom code to our tegola repo and test it?

I also found this interesting project, which explains the issue very well: https://github.com/Kriechi/aws-s3-reverse-proxy/blob/master/README.md

AWS uses its Signature v4 to protect API access. All requests sent to AWS S3 have to have a valid signature which includes your AWS security credentials (commonly known as AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY). This signature is encoded in the HTTP Authorization header and also contains a SHA256 of various other HTTP headers, including the Host header, which directly corresponds to the endpoint_url. Since we want to reverse-proxy requests, we need to rewrite this header, and therefore need to re-sign the request.

We can avoid the special proxy and instruct tegola about this special use case, I'll send a patch and you'll tell me your thoughts :)
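
To make that concrete, here is a hedged sketch of the idea (aws-sdk-go v1; function names, port and host values are illustrative, and this is not necessarily how the later proxy_host patch or the upstream PR implement it): point the client at the local plaintext listener, but rewrite the Host header before signing so the SigV4 signature is still valid for the real Swift host.

// Sketch only: talk to the local envoy listener while signing for the real
// Swift host. Endpoint, port and host values are illustrative.
package proxysketch

import (
    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/credentials"
    "github.com/aws/aws-sdk-go/aws/request"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/s3"
)

func newSidecarS3Client(accessKey, secretKey string) (*s3.S3, error) {
    const (
        localProxy   = "http://localhost:6022"        // plaintext sidecar listener
        upstreamHost = "thanos-swift.discovery.wmnet" // host the signature must cover
    )
    sess, err := session.NewSession(&aws.Config{
        Endpoint:         aws.String(localProxy),
        Region:           aws.String("us-east-1"), // placeholder
        S3ForcePathStyle: aws.Bool(true),
        Credentials:      credentials.NewStaticCredentials(accessKey, secretKey, ""),
    })
    if err != nil {
        return nil, err
    }
    svc := s3.New(sess)
    // Before the SigV4 signer runs, set the Host header to the upstream host:
    // the TCP connection still goes to localhost, but the signature is
    // computed over the host that Swift verifies on the other side of the proxy.
    svc.Handlers.Sign.PushFront(func(r *request.Request) {
        r.HTTPRequest.Host = upstreamHost
    })
    return svc, nil
}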

I took the time to re-read the whole task, and one thing I had missed is that, after a lot of upgrades, we may be in a different position with the current version of Tegola (namely, the problem may not be present anymore, or may appear in a different form).

So instead of going down the tegola/request-signing road (even if it was a good excuse to write some Go!), I propose to upgrade a single Thanos Swift FE node to CFSSL/PKI and see how Tegola reacts (instead of rolling out all the certs at once). In case of trouble we can depool and fix it, limiting the damage to client requests.

Change #1031439 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Move Swift on thanos-fe1001 to PKI TLS cert

https://gerrit.wikimedia.org/r/1031439

Change #1031439 merged by Elukey:

[operations/puppet@production] Move Swift on thanos-fe1001 to PKI TLS cert

https://gerrit.wikimedia.org/r/1031439

We rolled out the CFSSL/PKI cert to thanos-fe1001, one of the 4 eqiad nodes, and from the CPU graphs usage seems to have gone up by roughly +50ms. There is no constant throttling and the app seems to be working fine. The open question is what happens with all 4 nodes on PKI, and whether we'll need more capacity in the future.

From https://github.com/go-spatial/tegola/releases/tag/v0.20.0 it seems that Tegola bumped the aws-sdk-go library to a newer version, which may improve performance. This seems to be a problem with the aws-sdk-go library, not tegola or anything else.

Upgrading Tegola from 0.12.1 to 0.20 is a big jump, but probably worth it (we are 4 years behind upstream...).

@Jgiannelos ahhh I got fooled by the repo, didn't see that we use one branch for each release.. and the git tags fooled me as well.

If it is not a problem we can definitely try 0.20, but I have zero knowledge about tegola so if you'd have some spare cycles to dedicate to this it would be great.

The main problem, in my opinion, is the following: Tegola uses aws-sdk-go to connect to Thanos Swift, and (afaict from the code) it doesn't tune any client values, so it ends up creating a new TLS connection every time. As Effie pointed out earlier, the TLS cipher suite changed with the new TLS cert on thanos-fe1001, which would explain the increase in CPU usage (more time spent in handshakes, etc.).
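
As a hedged illustration of the "doesn't tune any client values" point (not what tegola ships today, and not verified as a fix on its own): aws-sdk-go v1 accepts a custom *http.Client, so the idle-connection pool can be sized for the request concurrency instead of relying on Go's default of only 2 idle connections per host, which under load forces frequent new connections and TLS handshakes.

// Sketch: hand aws-sdk-go (v1) a transport that keeps enough idle
// connections to thanos-swift to avoid re-handshaking under load.
// Values are illustrative.
package pooling

import (
    "net/http"
    "time"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
)

func newPooledSession(endpoint string) (*session.Session, error) {
    transport := &http.Transport{
        MaxIdleConns:        100,
        MaxIdleConnsPerHost: 100, // Go's default is 2; excess idle conns get closed
        IdleConnTimeout:     90 * time.Second,
    }
    return session.NewSession(&aws.Config{
        Endpoint:         aws.String(endpoint),
        S3ForcePathStyle: aws.Bool(true),
        HTTPClient:       &http.Client{Transport: transport, Timeout: 30 * time.Second},
    })
}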

Ideally we could use the envoy sidecar that SRE offers for k8s, but as written above, request signing would need to be fixed on Tegola's end (see https://github.com/aws/aws-sdk-go/issues/1473). Using a sidecar would allow us to offload all the TLS work to it, and the problem would be solved. From the gh issue (which is old, so the code might be unreliable), it seems that the patch required would be very small; we could test it in our repo and then propose a change upstream. What do you think?

elukey changed the task status from Stalled to Open.May 15 2024, 2:42 PM
elukey changed the status of subtask T356412: Consolidate TLS cert puppetry for ms and thanos swift frontends from Open to Stalled.

Change #1032482 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/software/tegola@wmf/v0.19.x] Add proxy_host setting to the S3 cache.

https://gerrit.wikimedia.org/r/1032482

Change #1032482 merged by jenkins-bot:

[operations/software/tegola@wmf/v0.19.x] Add proxy_host setting to the S3 cache.

https://gerrit.wikimedia.org/r/1032482

Change #1034503 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] services: force Tegola to use the local sidecar for Swift caching

https://gerrit.wikimedia.org/r/1034503

Change #1034503 merged by Elukey:

[operations/deployment-charts@master] services: force Tegola to use the local sidecar for Swift caching

https://gerrit.wikimedia.org/r/1034503

Change #1035006 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/software/tegola@wmf/v0.19.x] cache: fix and improve the code in the s3 module that allows a proxy

https://gerrit.wikimedia.org/r/1035006

The code in https://gerrit.wikimedia.org/r/1035006 should work nicely; I was able to test it on stat1010 with the following setup:

  • locally-built tegola with the new code
  • a minimal config to just connect to s3/swift (no postgres) using the tegola staging credentials
  • mitmproxy listening in localhost:6022 (to mimic the envoy proxy in k8s)

Once reviewed and merged, I'll test it in staging again :)

Change #1035006 merged by jenkins-bot:

[operations/software/tegola@wmf/v0.19.x] cache: fix and improve the code in the s3 module that allows a proxy

https://gerrit.wikimedia.org/r/1035006

Change #1035445 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] services: update Tegola's Docker settings in staging

https://gerrit.wikimedia.org/r/1035445

Change #1035445 merged by Elukey:

[operations/deployment-charts@master] services: update Tegola's Docker settings in staging

https://gerrit.wikimedia.org/r/1035445

The new code is deployed in staging, and it uses the envoy sidecar to contact Thanos. More info in this dashboard.

Opened https://github.com/go-spatial/tegola/pull/992 to upstream.

@Jgiannelos @jijiki is there any specific integration testing that we could do in staging to confirm that everything is working?

I had a chat with @Jgiannelos on IRC, and this was the testing done in staging:

# Sample list with 1000 tiles
# https://phabricator.wikimedia.org/P63005
curl -o 1000-tiles.txt https://phab.wmfusercontent.org/file/data/nwqv7m52fjlaxxovutfq/PHID-FILE-srm6yb6sbz7yp4xkd64g/1000-tiles.txt
# Request each tile from the staging endpoint; the tegola-cache response header shows HIT/MISS
cat 1000-tiles.txt | xargs -I {} curl -v -o /dev/null https://staging.svc.eqiad.wmnet:4105/maps/osm/{}.pbf

Except for some problems with paths starting with 1/X/X (which were reproducible even without the proxy settings), we saw a stream of requests ending up with the response header tegola-cache: MISS, followed by a stream of responses with tegola-cache: HIT.

The envoy sidecar is publishing metrics, so in theory we should be good to go.

Change #1035743 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] services: upgrade tegola in codfw to use the envoy proxy for Swift

https://gerrit.wikimedia.org/r/1035743

Change #1035743 merged by Elukey:

[operations/deployment-charts@master] services: upgrade tegola in codfw to use the envoy proxy for Swift

https://gerrit.wikimedia.org/r/1035743

Change #1036269 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] services: move tegola in eqiad to the Thanos sidecar config for Swift

https://gerrit.wikimedia.org/r/1036269

Change #1036269 merged by Elukey:

[operations/deployment-charts@master] services: move tegola in eqiad to the Thanos sidecar config for Swift

https://gerrit.wikimedia.org/r/1036269

Tegola in staging and production is now connecting to Thanos via the envoy sidecar. The rollout went fine and we observed a small CPU usage reduction in codfw. In eqiad it was difficult to judge the impact, since there seemed to be a rise in traffic that pushed up multiple metrics, so we'll see once things stabilize.

Next steps: move another thanos swift frontend to cfssl and see how Tegola's CPU reacts.

Change #1036284 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Move thanos-fe1002's envoy to CFSSL/PKI

https://gerrit.wikimedia.org/r/1036284

Change #1036284 merged by Elukey:

[operations/puppet@production] Move thanos-fe1002's envoy to CFSSL/PKI

https://gerrit.wikimedia.org/r/1036284

Mentioned in SAL (#wikimedia-operations) [2024-05-28T12:38:24Z] <elukey> move thanos-fe1002's envoy TLS cert to CFSSL/PKI - T344324

Change #1036643 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::thanos::frontend: move all envoy TLS certs to PKI

https://gerrit.wikimedia.org/r/1036643

Change #1036643 merged by Elukey:

[operations/puppet@production] role::thanos::frontend: move all envoy TLS certs to PKI

https://gerrit.wikimedia.org/r/1036643

Mentioned in SAL (#wikimedia-operations) [2024-05-29T12:39:17Z] <elukey> move thanos-fe100[3,4] and thanos-fe2* to PKI TLS certs (envoy, backends for thanos-swift.discovery.wmnet) - T344324

Thanos Swift TLS certs migrated to CFSSL/PKI, no issue on the Tegola side.

The problem:
From the various investigations it seemed that Tegola's aws-sdk-go S3 client didn't use a connection pool when contacting Thanos Swift, ending up doing a TLS handshake for every request (at least, this seems to be the case for the client version we currently use, from the 0.19.x branch). When the Thanos Swift envoys were migrated to the new cert, the cipher suite negotiated between Tegola and Thanos Swift changed, requiring more CPU on Tegola's side.

Fixes:

  • We increased the CPU allowance for Tegola containers.
  • We modified our version of Tegola to support using Envoy as a sidecar/local proxy. It didn't work out of the box due to request signing; see more info in https://github.com/go-spatial/tegola/pull/992 (this PR should be merged upstream soon, so in the future we hopefully won't need local patches in our repos).