User Details
- User Since
- Oct 3 2014, 8:06 AM (525 w, 1 d)
- Availability
- Available
- IRC Nick
- godog
- LDAP User
- Filippo Giunchedi
- MediaWiki User
- FGiunchedi (WMF)
Sep 24 2024
Nice! Thank you @hnowlan, resolving as we're done
Sep 16 2024
Doh! You are quite right, the relevant alert is MaxConntrack
Thank you @ssingh! In addition to the cloudlb hosts, there are also centrallog hosts running bullseye + 0.8. At any rate, I tried installing the bookworm anycast-healthchecker 0.9 package on bullseye on a test host and it seems to work as expected (i.e. a rebuild for bullseye doesn't seem to be needed).
@ssingh from T370068: Upgrade anycast-healthchecker to 0.9.8 (from 0.9.1-1+wmf12u1) I couldn't find any obvious blocker to having 0.9 on bullseye. Do you remember if there were any blockers for that to happen? Thank you
Also note that prometheus-postgres-exporter has gained support for replication monitoring since 0.12.0. This is IMHO the proper solution, though it will require >= trixie unless we choose to backport the package instead: https://github.com/prometheus-community/postgres_exporter/blob/master/CHANGELOG.md#0120--2023-03-21
I think this option can/should be considered, since it would future-proof the postgresql monitoring infrastructure.
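For illustration, here's a minimal sketch of reading a replication metric from the exporter's /metrics endpoint. The port 9187 is the exporter's default; the metric name is an assumption of mine and should be checked against the actual 0.12.0 output:
```python
# Sketch: read a replication-lag metric from prometheus-postgres-exporter.
# Assumptions: the exporter listens on its default port 9187 and exposes a
# lag metric named pg_replication_lag_seconds (name is illustrative; verify
# against the exporter's actual output).
import urllib.request

EXPORTER_URL = "http://localhost:9187/metrics"
LAG_METRIC = "pg_replication_lag_seconds"

def replication_lag_seconds():
    with urllib.request.urlopen(EXPORTER_URL, timeout=5) as resp:
        for raw_line in resp:
            line = raw_line.decode("utf-8").strip()
            # Match both bare and labeled samples of the metric.
            if line.startswith(LAG_METRIC + " ") or line.startswith(LAG_METRIC + "{"):
                return float(line.rsplit(" ", 1)[-1])
    return None

if __name__ == "__main__":
    lag = replication_lag_seconds()
    print(f"replication lag: {lag}s" if lag is not None else "metric not found")
```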
Resolving since we haven't run into further problems with envoy metrics ingestion
Sep 12 2024
From my investigation so far on IRC:
script is deployed!
Sep 11 2024
The Cloud-Services project tag is not intended to have any tasks. Please check the list on https://phabricator.wikimedia.org/project/profile/832/ and replace it with a project tag more specific to this task. Thanks!
Sep 10 2024
This is done. I've set the service as non-paging since it serves the Prometheus web interface (i.e. humans), whereas the http service is paging since that one is for automated access.
This is done! The only other use of graphite_threshold is 'zuul_gearman_wait_queue', which will be addressed as part of T233089: Export zuul metrics to Prometheus. There are of course alerts for Graphite itself left; those will be removed together with Graphite.
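As an aside, a quick way to sanity-check a ported expression is the Prometheus HTTP API; a minimal sketch follows (the server URL and the PromQL expression are placeholders, not the actual ported alerts):
```python
# Sketch: evaluate a candidate alert expression via the Prometheus HTTP API.
# The server URL and the PromQL expression are placeholders for illustration.
import json
import urllib.parse
import urllib.request

PROMETHEUS = "https://prometheus.example.org/api/v1/query"  # placeholder
EXPR = "sum(rate(some_metric_total[5m])) > 10"              # placeholder PromQL

url = PROMETHEUS + "?" + urllib.parse.urlencode({"query": EXPR})
with urllib.request.urlopen(url, timeout=10) as resp:
    payload = json.load(resp)

# A non-empty result vector means the alert condition currently holds.
print("firing" if payload["data"]["result"] else "not firing")
```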
Sep 9 2024
I have merged the ported mw alerts (thank you @Clement_Goubert !) and changed the referenced dashboard at https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts to show the prometheus metrics.
Also, please note that we're blocked on T374340 before the failover can happen.
This is great, thank you all for your help!
With https://gerrit.wikimedia.org/r/c/operations/puppet/+/1060516 merged, corto's profile::corto::active_host hiera setting now needs to be changed on failover. @andrea.denisse FYI
Sep 6 2024
I'm optimistically resolving this, please reopen if something is amiss!
Thank you for the report, the script indeed breaks on renamed users (I112dcae4bf being the cause). I've manually deleted the old user and the script now completes, so you should have access, @Southparkfan! I've filed T374190: grafana-ldap-users-sync breaks on renamed users as a followup.
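For reference, the manual cleanup amounted to removing the stale user; a minimal sketch using Grafana's admin HTTP API (the host and credentials below are placeholders; lookup and delete require admin basic auth):
```python
# Sketch: delete a stale (renamed) user via Grafana's admin HTTP API.
# GRAFANA_URL, ADMIN_USER and ADMIN_PASS are placeholders; the lookup and
# delete endpoints require basic auth as a Grafana admin.
import base64
import json
import urllib.parse
import urllib.request

GRAFANA_URL = "https://grafana.example.org"  # placeholder
ADMIN_USER, ADMIN_PASS = "admin", "secret"   # placeholders

def _request(method, path):
    req = urllib.request.Request(GRAFANA_URL + path, method=method)
    token = base64.b64encode(f"{ADMIN_USER}:{ADMIN_PASS}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)

def delete_user(login):
    # Look up the internal user id by login, then delete that user.
    user = _request("GET", "/api/users/lookup?loginOrEmail=" + urllib.parse.quote(login))
    return _request("DELETE", f"/api/admin/users/{user['id']}")

# delete_user("old-login")  # the stale login of the renamed user
```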
Sep 5 2024
Thank you to all involved so far with the reboots -- much appreciated!
Sep 2 2024
Thanks for reaching out @Peter! I'm sure the jenkins server can be polled for metrics by prometheus. As long as the workers run in production, they are also able to reach the pushgateway; hope that helps!
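For example, here's a minimal sketch of a worker pushing a metric with the prometheus_client Python library (the gateway address, job name, and metric are placeholders of mine):
```python
# Sketch: push a metric from a CI worker to the Prometheus pushgateway.
# The gateway address, job name and metric are placeholders for illustration.
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
duration = Gauge(
    "job_last_run_duration_seconds",
    "Duration of the last job run in seconds",
    registry=registry,
)
duration.set(42.0)  # e.g. the measured run time

# Pushgateway's default port is 9091.
push_to_gateway("pushgateway.example.org:9091", job="my_ci_job", registry=registry)
```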
Aug 28 2024
The prometheus bookworm upgrade has been completed as part of T326657: Add prometheus-https load balancer, thank you @andrea.denisse for your help with this
Aug 27 2024
I'm tentatively resolving since the silences are in place, please feel free to reopen as needed!
Thank you for filing the task, I was also looking at the same failed probes due to the recent Prometheus Bookworm upgrade. tl;dr is I think it is safe to ack these alerts and I will do so, see also https://phabricator.wikimedia.org/T326657#10090776
Aug 26 2024
First of all, my apologies for the long lead time on this. I have reviewed upstream documentation, issues, and alert logs; the most promising issue I could find that may explain this behavior is https://github.com/prometheus/alertmanager/issues/3808, where I've inquired further.
The two Bookworm hosts (prometheus2006 and prometheus1006) work well. The only problem I could find is that probes for the puppetmaster https endpoints (not puppetserver!) are failing. This is a long-standing issue due to the fact that those endpoints use certs without a SAN, and the Bookworm prometheus-blackbox-exporter has been compiled with newer golang (>= 1.17), which no longer allows ignoring certs without SANs.
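To spot which endpoints are affected, one could check whether a cert presents any SANs; a minimal sketch (requires the third-party "cryptography" package; host and port are placeholders):
```python
# Sketch: report whether a TLS endpoint's certificate carries a SAN extension.
# Requires the third-party "cryptography" package; host and port below are
# placeholders. Verification is disabled on purpose: we only want to inspect
# the cert, which newer Go (>= 1.17) clients would reject outright.
import socket
import ssl

from cryptography import x509

def cert_has_san(host, port=443):
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            der = tls.getpeercert(binary_form=True)
    cert = x509.load_der_x509_certificate(der)
    try:
        cert.extensions.get_extension_for_class(x509.SubjectAlternativeName)
        return True
    except x509.ExtensionNotFound:
        return False

# print(cert_has_san("puppetmaster.example.org", 8140))  # placeholder host
```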
Aug 22 2024
This has been resolved in the meantime
I tested the in-place bookworm upgrade on prometheus2006 and things seem to be working as expected. I did the following:
@bking re: the IRC question about "output a process list in the alert email body": we don't have facilities for that. However, we do have per-unit resource utilization available, which will give you a breakdown of what jupyter users are doing, for example:
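(As a purely hypothetical illustration of such a per-unit query, assuming cadvisor-style systemd unit metrics; the server URL and metric/label names below are my assumptions, not the actual setup:)
```python
# Hypothetical sketch: top per-systemd-unit memory usage via the Prometheus
# HTTP API, assuming cadvisor-style unit metrics are collected. The server
# URL and metric/label names are assumptions, not the actual setup.
import json
import urllib.parse
import urllib.request

PROMETHEUS = "https://prometheus.example.org/api/v1/query"  # placeholder
EXPR = 'topk(10, container_memory_usage_bytes{id=~"/system.slice/.*"})'

url = PROMETHEUS + "?" + urllib.parse.urlencode({"query": EXPR})
with urllib.request.urlopen(url, timeout=10) as resp:
    for sample in json.load(resp)["data"]["result"]:
        print(sample["metric"].get("id"), sample["value"][1])
```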
Aug 21 2024
Update after team meeting: I'll be starting the in-place Bookworm upgrade since it'll unblock this issue, it's something we have to do anyway, and I have a Prometheus host in Pontoon running on Bookworm with no obvious problems.
Graphite is going away
As far as I can tell this is done, boldly resolving
I'm boldly declining since it isn't obvious what the advantages are, versus the obvious disadvantages (doing the work, at the very least).