User Details
- User Since
- Oct 3 2014, 8:06 AM (525 w, 1 d)
- Availability
- Available
- IRC Nick
- godog
- LDAP User
- Filippo Giunchedi
- MediaWiki User
- FGiunchedi (WMF)
Sep 24 2024
Nice! Thank you @hnowlan, resolving as we're done
Sep 16 2024
Doh! You are quite right, the relevant alert is MaxConntrack
Thank you @ssingh! In addition to the cloudlb hosts, there are also centrallog hosts running bullseye + 0.8. At any rate, I tried installing the bookworm anycast-healthchecker 0.9 package on bullseye on a test host and it seems to work as expected (i.e. a rebuild for bullseye doesn't seem to be needed).
@ssingh from T370068: Upgrade anycast-healthchecker to 0.9.8 (from 0.9.1-1+wmf12u1) I couldn't find any obvious blocker to having 0.9 on bullseye. Do you remember if there were any blockers for that to happen? Thank you
Also note that prometheus-postgres-exporter has gained support for replication monitoring since 0.12.0. This is IMHO the proper solution, though it will require >= trixie unless we choose to backport the package instead: https://github.com/prometheus-community/postgres_exporter/blob/master/CHANGELOG.md#0120--2023-03-21
I think this option can/should be considered, since it would future-proof the postgresql monitoring infrastructure.
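For illustration, here's a minimal sketch of reading a replication metric from the exporter's /metrics endpoint. The port 9187 is the exporter's default; the metric name is an assumption of mine and should be checked against the actual 0.12.0 output:
```python
# Sketch: read a replication-lag metric from prometheus-postgres-exporter.
# Assumptions: the exporter listens on its default port 9187 and exposes a
# lag metric named pg_replication_lag_seconds (name is illustrative; verify
# against the exporter's actual output).
import urllib.request

EXPORTER_URL = "http://localhost:9187/metrics"
LAG_METRIC = "pg_replication_lag_seconds"

def replication_lag_seconds():
    with urllib.request.urlopen(EXPORTER_URL, timeout=5) as resp:
        for raw_line in resp:
            line = raw_line.decode("utf-8").strip()
            # Match both bare and labeled samples of the metric.
            if line.startswith(LAG_METRIC + " ") or line.startswith(LAG_METRIC + "{"):
                return float(line.rsplit(" ", 1)[-1])
    return None

if __name__ == "__main__":
    lag = replication_lag_seconds()
    print(f"replication lag: {lag}s" if lag is not None else "metric not found")
```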
Resolving since we haven't run into further problems with envoy metrics ingestion
Sep 12 2024
From my investigation so far on IRC:
script is deployed!
Sep 11 2024
The Cloud-Services project tag is not intended to have any tasks. Please check the list on https://phabricator.wikimedia.org/project/profile/832/ and replace it with a project tag more specific to this task. Thanks!
Sep 10 2024
This is done. I've set the service as non-paging since it serves the Prometheus web interface (i.e. humans), whereas the http service is paging since that one is for automated access.
This is done! The only other use of graphite_threshold is 'zuul_gearman_wait_queue', which will be addressed as part of T233089: Export zuul metrics to Prometheus. There are of course alerts for Graphite itself left; those will be removed together with Graphite.
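As an aside, a quick way to sanity-check a ported expression is the Prometheus HTTP API; a minimal sketch follows (the server URL and the PromQL expression are placeholders, not the actual ported alerts):
```python
# Sketch: evaluate a candidate alert expression via the Prometheus HTTP API.
# The server URL and the PromQL expression are placeholders for illustration.
import json
import urllib.parse
import urllib.request

PROMETHEUS = "https://prometheus.example.org/api/v1/query"  # placeholder
EXPR = "sum(rate(some_metric_total[5m])) > 10"              # placeholder PromQL

url = PROMETHEUS + "?" + urllib.parse.urlencode({"query": EXPR})
with urllib.request.urlopen(url, timeout=10) as resp:
    payload = json.load(resp)

# A non-empty result vector means the alert condition currently holds.
print("firing" if payload["data"]["result"] else "not firing")
```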
Sep 9 2024
I have merged the ported mw alerts (thank you @Clement_Goubert !) and changed the referenced dashboard at https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts to show the prometheus metrics.
Also, please note that we're blocked on T374340 before the failover can happen.
This is great, thank you all for your help!
With https://gerrit.wikimedia.org/r/c/operations/puppet/+/1060516 merged, corto's profile::corto::active_host hiera setting now needs to be changed on failover. @andrea.denisse FYI
Sep 6 2024
I'm optimistically resolving this, please reopen if something is amiss!
Thank you for the report, the script indeed breaks on renamed users (I112dcae4bf being the cause). I've manually deleted the old user and the script now completes, so you should have access, @Southparkfan! I've filed T374190: grafana-ldap-users-sync breaks on renamed users as a followup.
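For reference, the manual cleanup amounted to removing the stale user; a minimal sketch using Grafana's admin HTTP API (the host and credentials below are placeholders; lookup and delete require admin basic auth):
```python
# Sketch: delete a stale (renamed) user via Grafana's admin HTTP API.
# GRAFANA_URL, ADMIN_USER and ADMIN_PASS are placeholders; the lookup and
# delete endpoints require basic auth as a Grafana admin.
import base64
import json
import urllib.parse
import urllib.request

GRAFANA_URL = "https://grafana.example.org"  # placeholder
ADMIN_USER, ADMIN_PASS = "admin", "secret"   # placeholders

def _request(method, path):
    req = urllib.request.Request(GRAFANA_URL + path, method=method)
    token = base64.b64encode(f"{ADMIN_USER}:{ADMIN_PASS}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)

def delete_user(login):
    # Look up the internal user id by login, then delete that user.
    user = _request("GET", "/api/users/lookup?loginOrEmail=" + urllib.parse.quote(login))
    return _request("DELETE", f"/api/admin/users/{user['id']}")

# delete_user("old-login")  # the stale login of the renamed user
```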
Sep 5 2024
Thank you to all involved so far with the reboots -- much appreciated!
Sep 2 2024
Thanks for reaching out @Peter! I'm sure the jenkins server can be polled for metrics by prometheus. As long as the workers run in production, they are also able to reach the pushgateway; hope that helps!
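For example, here's a minimal sketch of a worker pushing a metric with the prometheus_client Python library (the gateway address, job name, and metric are placeholders of mine):
```python
# Sketch: push a metric from a CI worker to the Prometheus pushgateway.
# The gateway address, job name and metric are placeholders for illustration.
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
duration = Gauge(
    "job_last_run_duration_seconds",
    "Duration of the last job run in seconds",
    registry=registry,
)
duration.set(42.0)  # e.g. the measured run time

# Pushgateway's default port is 9091.
push_to_gateway("pushgateway.example.org:9091", job="my_ci_job", registry=registry)
```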
Aug 28 2024
The prometheus bookworm upgrade has been completed as part of T326657: Add prometheus-https load balancer, thank you @andrea.denisse for your help with this
Aug 27 2024
I'm tentatively resolving since the silences are in place, please feel free to reopen as needed!
Thank you for filing the task, I was also looking at the same failed probes due to the recent Prometheus Bookworm upgrade. tl;dr is I think it is safe to ack these alerts and I will do so, see also https://phabricator.wikimedia.org/T326657#10090776
Aug 26 2024
First of all, my apologies for the long lead time on this. I have reviewed upstream documentation, issues, and alert logs; the most promising issue I could find that may explain this behavior is https://github.com/prometheus/alertmanager/issues/3808, where I've inquired further.
The two Bookworm hosts (prometheus2006 and prometheus1006) work well. The only problem I could find is that probes for the puppetmaster https endpoints (not puppetserver!) are failing. This is a long-standing issue due to the fact that those endpoints use certs without a SAN, and the Bookworm prometheus-blackbox-exporter has been compiled with newer golang (>= 1.17), which no longer allows ignoring certs without SANs.
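To spot which endpoints are affected, one could check whether a cert presents any SANs; a minimal sketch (requires the third-party "cryptography" package; host and port are placeholders):
```python
# Sketch: report whether a TLS endpoint's certificate carries a SAN extension.
# Requires the third-party "cryptography" package; host and port below are
# placeholders. Verification is disabled on purpose: we only want to inspect
# the cert, which newer Go (>= 1.17) clients would reject outright.
import socket
import ssl

from cryptography import x509

def cert_has_san(host, port=443):
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            der = tls.getpeercert(binary_form=True)
    cert = x509.load_der_x509_certificate(der)
    try:
        cert.extensions.get_extension_for_class(x509.SubjectAlternativeName)
        return True
    except x509.ExtensionNotFound:
        return False

# print(cert_has_san("puppetmaster.example.org", 8140))  # placeholder host
```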
Aug 22 2024
This has been resolved in the meantime
I tested the in-place bookworm upgrade on prometheus2006 and things seem to be working as expected. I did the following:
@bking re: the IRC question about "output a process list in the alert email body": we don't have facilities for that. However, we do have per-unit resource utilization available, which will give you a breakdown of what jupyter users are doing, for example:
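(As a purely hypothetical illustration of such a per-unit query, assuming cadvisor-style systemd unit metrics; the server URL and metric/label names below are my assumptions, not the actual setup:)
```python
# Hypothetical sketch: top per-systemd-unit memory usage via the Prometheus
# HTTP API, assuming cadvisor-style unit metrics are collected. The server
# URL and metric/label names are assumptions, not the actual setup.
import json
import urllib.parse
import urllib.request

PROMETHEUS = "https://prometheus.example.org/api/v1/query"  # placeholder
EXPR = 'topk(10, container_memory_usage_bytes{id=~"/system.slice/.*"})'

url = PROMETHEUS + "?" + urllib.parse.urlencode({"query": EXPR})
with urllib.request.urlopen(url, timeout=10) as resp:
    for sample in json.load(resp)["data"]["result"]:
        print(sample["metric"].get("id"), sample["value"][1])
```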
Aug 21 2024
Update after team meeting: I'll be starting the in-place Bookworm upgrade since it'll unblock this issue, it's something we have to do anyway, and I have a Prometheus host in Pontoon running on Bookworm with no obvious problems.
Graphite is going away
As far as I can tell this is done, boldly resolving
I'm boldly declining since it isn't obvious what the advantages are, versus the obvious disadvantages (doing the work, at the very least).