
Sunset search.wikimedia.org service
Closed, Resolved (Public)

Description

Back in March I investigated traffic to and usage of the search.wikimedia.org bridge, and found that the old Apple Dictionary versions still out there (more than a decade old, used by ~330 IP addresses/day) were broken because they hit a non-secure URI and weren't coded to deal with the 301 response. At some point Apple switched to using the MW search API directly in the Dictionary app.

We were maintaining this tech debt as of 2021 (T289224) but we can just shut it down and archive the codebase. Anyone using the legacy, broken Dictionary app isn't getting any results anyway and we can't disable HTTPS just for that domain.

Additional info:

Consequences:

Event Timeline

Ladsgroup subscribed.

I'm adding serviceops. I know it's a bit of a stretch, but that's the one that makes the most sense. Please change to another team if you think there is a better choice.

Ladsgroup triaged this task as Medium priority. (Aug 26 2022, 7:16 AM)

Change 826884 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] trafficserver: remove search.wikimedia.org

https://gerrit.wikimedia.org/r/826884

Change 826885 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] httpbb: drop tests for search.wikimedia.org

https://gerrit.wikimedia.org/r/826885

The first change above would remove it from ATS (trafficserver) config. That would be a one-line change that would result in this not getting any traffic anymore.

Then we could give it one final grace period to see that nothing complains, and after that remove the actual Kubernetes service (which might be something we have yet to document how to do properly, because it might be the first k8s service to be retired).

The first change above would remove it from ATS (trafficserver) config. That would be a one-line change that would result in this not getting any traffic anymore. Then we could give it one final grace period to see that nothing complains

Thank you! And yes, that makes sense. A month would probably be sufficient?

and after that remove the actual Kubernetes service (which might be something we have yet to document how to do properly, because it might be the first k8s service to be retired).

Clearing out tech debt AND doc debt in one sweep – love to see it.

link courtesy of @Urbanecm (thanks): https://w.wiki/5d3s (843 hits in 30 days but sampled)

An alternative way to shut it down would be to remove it first from DNS and later do everything else.

Then potential users would just get nothing / timeout instead of an error message in place of the current content at https://search.wikimedia.org/

If we leave it in DNS and remove it from trafficserver, they would see the error page saying that this is Wikimedia but the domain is unknown.

Change 826885 merged by Dzahn:

[operations/puppet@production] httpbb: drop tests for search.wikimedia.org

https://gerrit.wikimedia.org/r/826885

I don't want to overcomplicate the decommissioning of the service, but I was thinking about depooling the service from confctl first. Depooling should have the lowest rollback time and should be less invasive than removing backends, DNS or helm deployments (except maybe for alerts, so we may need downtimes). Furthermore, users would get a 5xx(?) instead of a timeout. So we could order the decommission steps according to the complexity of the change and potential rollbacks:

  • depool from confctl
  • <grace period 1>
  • remove from LB and DNS
  • <grace period 2>
  • undeploy service from Kubernetes
  • cleanup/archive
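The ordering proposed above can be written down as data, which makes the "cheapest rollback first" idea easy to check. This is only a minimal sketch: the step names come from the list above, while the numeric `rollback_effort` ranks and the helper name are illustrative assumptions, not measured values or existing tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    rollback_effort: int  # hypothetical rank: higher means harder to undo


# Step names are taken from the list above; the ranks are assumptions.
PLAN = [
    Step("depool from confctl", 1),
    Step("remove from LB and DNS", 2),
    Step("undeploy service from Kubernetes", 3),
    Step("cleanup/archive", 4),
]


def is_cheapest_rollback_first(plan):
    """Return True if no step is easier to roll back than the one before it."""
    efforts = [step.rollback_effort for step in plan]
    return all(a <= b for a, b in zip(efforts, efforts[1:]))


if __name__ == "__main__":
    for number, step in enumerate(PLAN, start=1):
        print(f"{number}. {step.name} (rollback effort: {step.rollback_effort})")
    print("ordered by rollback effort:", is_cheapest_rollback_first(PLAN))
```

The grace periods are left out of the data on purpose, since they are waits between steps rather than changes that can be rolled back.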

Then we could give it one final grace period to see that nothing complains, and after that remove the actual Kubernetes service (which might be something we have yet to document how to do properly, because it might be the first k8s service to be retired).

I also like the idea of creating docs for removing a service. Something like /wiki/Kubernetes/Add_a_new_service in reverse order.

Out of curiosity I looked at the httpd logs for the Pods on Kubernetes and found only one "valid" request for the last 10 days. The request was to

http://search.wikimedia.org/?search=test

All other valid (HTTP 200) requests are health checks and metrics. All remaining requests (mostly scraping/bots) got an HTTP 400.

Just leaving my thoughts here, because I was added as a reviewer in https://gerrit.wikimedia.org/r/q/Ie8dce42a8efaca82fe6c7be0a9dd1cb43403ae7d

Needs consulting with Alex / Giuseppe before proceeding.

[...]
Out of curiosity I looked at the httpd logs for the Pods on Kubernetes and found only one "valid" request for the last 10 days. The request was to

http://search.wikimedia.org/?search=test

All other valid (HTTP 200) requests are health checks and metrics. All remaining requests (mostly scraping/bots) got an HTTP 400.

A small correction after playing a bit with Turnilo: most requests are cached, so looking at the Kubernetes services is not representative.
Turnilo shows about 4k requests per day (factoring in sampling). The requests seem to have a "proper" Apple user agent, so no monitoring/health checking is involved.

Turnilo query for requests last 30 days here: https://w.wiki/5jYX

And I can confirm that most requests get an HTTP 301 response: https://w.wiki/5jYg
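For readers wondering how a sampled count turns into a per-day estimate, here is a rough sketch of the arithmetic. The 843 hits in 30 days is the sampled figure quoted earlier in this task. The 1-in-128 sampling factor is an assumption and is not confirmed anywhere in this thread. The function name is made up for illustration.

```python
def estimate_daily_requests(sampled_hits, days, sampling_factor):
    """Scale a sampled hit count up to an estimated number of requests per day."""
    if days <= 0 or sampling_factor <= 0:
        raise ValueError("days and sampling_factor must be positive")
    return sampled_hits * sampling_factor / days


if __name__ == "__main__":
    # 843 sampled hits over 30 days is quoted earlier in this task;
    # the sampling factor of 128 is an assumption, not a confirmed value.
    estimate = estimate_daily_requests(sampled_hits=843, days=30, sampling_factor=128)
    print(f"~{estimate:.0f} requests/day")
```

Under that assumption the estimate lands at roughly 3.6k requests per day, the same order of magnitude as the "about 4k" read off Turnilo. The two numbers come from different queries, so they should not be expected to match exactly.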

Clement_Goubert moved this task from Unused 3 to this.quarter 🍕 on the serviceops board.
Clement_Goubert subscribed.

Just for clarification, we are talking about the service named apple-search in service discovery and not search or search-https, as these two are actually the elasticsearch clusters, right?

Change 826884 abandoned by Dzahn:

[operations/puppet@production] trafficserver: remove search.wikimedia.org

Reason:

per SRE serviceops meeting today

https://gerrit.wikimedia.org/r/826884

Just for clarification, we are talking about the service named apple-search in service discovery and not search or search-https, as these two are actually the elasticsearch clusters, right?

Pinging @Gehel to confirm that search and search-https services should not be touched.

Yes, based on backend.yaml#225, it looks like apple-search is the name of the service that should be shut down.

@mpopov @Gehel This was clarified in our ServiceOps meeting. We are not touching the search and search-https services.

Change 843425 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mwdebug: Disable nutcracker

https://gerrit.wikimedia.org/r/843425

Disregard the above related patch, I fumbled the Bug id.

Change 826884 restored by Clément Goubert:

[operations/puppet@production] trafficserver: remove search.wikimedia.org

https://gerrit.wikimedia.org/r/826884

Restored the trafficserver search.wikimedia.org removal patch.

As I understand it, removing this mapping will stop traffic to the apple-search backend service and will not affect the search and search-https services.

Once that is done, we can decommission the apple-search kubernetes service.

Change 852208 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/dns@master] apple-search: Remove DNS records

https://gerrit.wikimedia.org/r/852208

Change 852210 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] apple-search: Switch lvs state to lvs_setup

https://gerrit.wikimedia.org/r/852210

I have checked with Traffic, and we can indeed start by removing the trafficserver mapping via https://gerrit.wikimedia.org/r/c/operations/puppet/+/826884

It should all be handled by puppet, and it renders the actual service unreachable by sending requests for it down the default path.

Should we define a grace period after which we actually start decommissioning the service?

Mentioned in SAL (#wikimedia-operations) [2022-11-03T14:11:52Z] <claime> Sunsetting search.wikimedia.org, starting a 2 week grace period before decommission - T316296

Change 826884 merged by Clément Goubert:

[operations/puppet@production] trafficserver: remove search.wikimedia.org

https://gerrit.wikimedia.org/r/826884

Starting 2 week grace period from today, full decom to happen after 2022-11-17
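As a quick sanity check of the date above, the end of the grace period can be computed rather than counted by hand. A minimal sketch, with a hypothetical helper name:

```python
from datetime import date, timedelta


def grace_period_end(start, days=14):
    """Return the last day of a grace period that begins on `start`."""
    return start + timedelta(days=days)


if __name__ == "__main__":
    # 2022-11-03 is the day the trafficserver mapping removal was merged.
    print(grace_period_end(date(2022, 11, 3)).isoformat())
```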

Clement_Goubert changed the task status from Open to In Progress. (Nov 3 2022, 2:21 PM)
Clement_Goubert moved this task from this.quarter 🍕 to API Gateway 🥌 on the serviceops board.

Change 857691 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] apple-search: Switch lvs state to service_setup

https://gerrit.wikimedia.org/r/857691

Change 857706 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] apple-search: Remove service from service::catalog

https://gerrit.wikimedia.org/r/857706

Service removal plan:
From https://wikitech.wikimedia.org/wiki/LVS#Remove_a_load_balanced_service

  1. Silence probes: instance=apple-search:4013
  2. Remove discovery DNS stanza: https://gerrit.wikimedia.org/r/c/operations/dns/+/858207
  3. Remove the rest of the DNS records: https://gerrit.wikimedia.org/r/c/operations/dns/+/852208
  4. Switch to lvs_setup: https://gerrit.wikimedia.org/r/c/operations/puppet/+/852210
  5. Remove from LB and backend: https://gerrit.wikimedia.org/r/c/operations/puppet/+/857691
    1. Run puppet on all LVS servers
    2. Use the sre.loadbalancer.restart-pybal cookbook on the backup and active LVS servers (ask #wikimedia-traffic for the right servers)
       sudo cookbook sre.loadbalancer.restart-pybal lvsXXXX --reason "Decommissioning apple-search" --task-id T316296
       sudo cookbook sre.loadbalancer.restart-pybal lvsXXXX --reason "Decommissioning apple-search" --task-id T316296
    3. codfw: ipvsadm --delete-service --tcp-service 10.2.1.68:4013
    4. eqiad: ipvsadm --delete-service --tcp-service 10.2.2.68:4013
  6. Remove from service catalog: https://gerrit.wikimedia.org/r/c/operations/puppet/+/857706
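The two ipvsadm invocations in the plan differ only in the service address, so they can be derived from a small mapping instead of being typed twice. This is a sketch that only builds and prints the command strings. The addresses and port are the ones listed in the plan, and the function name is hypothetical.

```python
# Service addresses and port as listed in the removal plan above.
SERVICE_VIPS = {"codfw": "10.2.1.68", "eqiad": "10.2.2.68"}
SERVICE_PORT = 4013


def ipvsadm_delete_commands(vips, port):
    """Build one 'ipvsadm --delete-service' command string per datacenter."""
    return {
        datacenter: f"ipvsadm --delete-service --tcp-service {address}:{port}"
        for datacenter, address in vips.items()
    }


if __name__ == "__main__":
    # Printing only: nothing is executed against a load balancer here.
    for datacenter, command in ipvsadm_delete_commands(SERVICE_VIPS, SERVICE_PORT).items():
        print(f"{datacenter}: {command}")
```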

Then the service deployment on wikikube can be removed following https://wikitech.wikimedia.org/wiki/Kubernetes/Remove_a_service

Change 858207 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/dns@master] apple-search: remove discovery record

https://gerrit.wikimedia.org/r/858207

Change 858286 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] apple-search: Remove apple-search from conftool

https://gerrit.wikimedia.org/r/858286

Mentioned in SAL (#wikimedia-operations) [2022-11-18T11:27:23Z] <claime> Starting decommission of apple-search service - T316296

Change 858207 merged by Clément Goubert:

[operations/dns@master] apple-search: remove discovery record

https://gerrit.wikimedia.org/r/858207

Change 852210 merged by Clément Goubert:

[operations/puppet@production] apple-search: Switch lvs state to lvs_setup

https://gerrit.wikimedia.org/r/852210

Mentioned in SAL (#wikimedia-operations) [2022-11-18T11:41:28Z] <claime> Switching apple-search to state:lvs_setup - T316296

Mentioned in SAL (#wikimedia-operations) [2022-11-18T11:53:51Z] <claime> Switching apple-search to state:service_setup - T316296

Change 857691 merged by Clément Goubert:

[operations/puppet@production] apple-search: Remove service from lb and backend

https://gerrit.wikimedia.org/r/857691

Mentioned in SAL (#wikimedia-operations) [2022-11-18T12:22:23Z] <claime> apple-search removed from backends - T316296

Change 852208 merged by Clément Goubert:

[operations/dns@master] apple-search: Remove DNS records

https://gerrit.wikimedia.org/r/852208

Change 857706 merged by Clément Goubert:

[operations/puppet@production] apple-search: Remove service from service::catalog

https://gerrit.wikimedia.org/r/857706

Mentioned in SAL (#wikimedia-operations) [2022-11-18T12:30:53Z] <claime> Removing apple-search from service::catalog - T316296

Mentioned in SAL (#wikimedia-operations) [2022-11-18T12:37:13Z] <claime> Removing apple-search from conftool - T316296

Change 858286 merged by Clément Goubert:

[operations/puppet@production] apple-search: Remove apple-search from conftool

https://gerrit.wikimedia.org/r/858286

apple-search removed from DNS, LVS, service::catalog and conftool.
Starting removal from wikikube and deployment-charts.

Mentioned in SAL (#wikimedia-operations) [2022-11-18T12:41:48Z] <claime> Starting apple-search removal from wikikube - T316296

Mentioned in SAL (#wikimedia-operations) [2022-11-18T12:43:10Z] <claime> cgoubert@deploy1002:/apple-search$ helmfile -e staging -i destroy - T316296

Mentioned in SAL (#wikimedia-operations) [2022-11-18T12:45:13Z] <claime> cgoubert@deploy1002:/apple-search$ helmfile -e eqiad -i destroy - T316296

Mentioned in SAL (#wikimedia-operations) [2022-11-18T12:45:58Z] <claime> cgoubert@deploy1002:/apple-search$ helmfile -e codfw -i destroy - T316296
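The three helmfile invocations logged above follow one pattern per environment. A hedged sketch of a dry-run wrapper is shown here. The environment order and the helmfile arguments are copied from the log entries, while the wrapper functions themselves are hypothetical and not part of any existing tooling. By default it only prints what it would run.

```python
import shlex
import subprocess

# Environments in the order they were destroyed in the log entries above.
ENVIRONMENTS = ["staging", "eqiad", "codfw"]


def destroy_commands(environments):
    """Build the helmfile argument list for each environment."""
    return [["helmfile", "-e", env, "-i", "destroy"] for env in environments]


def run_all(commands, dry_run=True):
    """Print each command, and execute it only when dry_run is False."""
    for argv in commands:
        print(shlex.join(argv))
        if not dry_run:
            # check=True stops at the first failing environment.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    run_all(destroy_commands(ENVIRONMENTS))  # dry run: prints only
```

Keeping staging first mirrors the log, so that a problem shows up there before the production environments are touched.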

Change 858575 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] admin-ng: remove apple-search namespace

https://gerrit.wikimedia.org/r/858575

Change 858577 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] wikikube: remove apple-search deployment

https://gerrit.wikimedia.org/r/858577

Change 858578 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] charts: remove apple-search chart

https://gerrit.wikimedia.org/r/858578

Change 858617 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] apple-search: absent kubernetes service

https://gerrit.wikimedia.org/r/858617

Change 858624 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] apple-search: final cleanup

https://gerrit.wikimedia.org/r/858624

Change 858575 merged by jenkins-bot:

[operations/deployment-charts@master] admin-ng: remove apple-search namespace

https://gerrit.wikimedia.org/r/858575

Mentioned in SAL (#wikimedia-operations) [2022-11-18T16:08:49Z] <claime> removing apple-search namespaces - T316296

Change 858577 merged by jenkins-bot:

[operations/deployment-charts@master] wikikube: remove apple-search deployment

https://gerrit.wikimedia.org/r/858577

Change 858617 merged by Clément Goubert:

[operations/puppet@production] apple-search: absent kubernetes service

https://gerrit.wikimedia.org/r/858617

Change 858578 merged by jenkins-bot:

[operations/deployment-charts@master] charts: remove apple-search chart

https://gerrit.wikimedia.org/r/858578

Change 858624 merged by Clément Goubert:

[operations/puppet@production] apple-search: final cleanup

https://gerrit.wikimedia.org/r/858624

Certificates cleaned up. It's dead, Jim.

Mentioned in SAL (#wikimedia-operations) [2022-11-18T16:58:31Z] <claime> apple-search service decommissioned - T316296

Volans subscribed.

FYI the SVC addresses are still allocated in Netbox: https://netbox.wikimedia.org/ipam/ip-addresses/?q=apple-search
I guess they should be removed. When doing so remember to run the sre.dns.netbox cookbook too.