event.WikipediaPortal referer modification
Open, Medium, Public

Description

Hello,

Per discussion in a recent Phab ticket, we in FR-Analytics would like to modify the referer field of the event.WikipediaPortal schema to pull in hostname only. This will allow us to whitelist this field for long-term analytics.

Can you assist with this modification, please? As always, please let me know if I can assist in any way with this request.

Event Timeline

fdans triaged this task as Medium priority. Apr 19 2021, 4:15 PM
fdans moved this task from Incoming to Smart Tools for Better Data on the Analytics board.

A couple comments.

  1. The other day, talking with the team, we thought that we in Analytics could take this task, since sanitizing a full URL by applying a mask could be useful for other data sets. This is something we could do; I created a task for it: T281144.
  2. On the other hand, I thought that even if we purge the URL and leave only the hostname, it could still hold privacy-sensitive information that could indicate a user's preferences, interests, or specific situation. I think we should ask the Security/Privacy team about this. Pinging @JFishback_WMF, what do you think?

@mforns thanks for adding me! If this isn't a huge rush, we'll add this to the next Privacy Engineering scrum.

Hello @mforns, sorry for the delayed response. My understanding of T281144 is that, while the URL query strings would be masked, the hostname itself would still be stored in plain.

One of the major privacy concerns around full referrers is that they can reveal sensitive information about users: what article or topic they were reading on the referring website, details about their accounts, tokens, and other intrusive and sensitive information. However, most such details are contained in the referrer’s query strings. As such, masking those strings and keeping only the hostname seems like a decent mitigation. Of course, it doesn’t remove the privacy risk, but it significantly lowers it while still providing useful analytics information. Furthermore, this approach is close to other mainstream alternatives in the industry, such as “strict-origin-when-cross-origin” and ITP policies, where query strings are stripped off and only the hostname is kept.

Additionally, I ran a query and pulled out a sample of what is stored using the WikipediaPortal schema.

sguebo@stat1005: hive -e "SELECT event.referer, event.session_id, event.country FROM event.WikipediaPortal WHERE year = 2021 AND month = 8 AND day = 20 AND hour = 10 AND event.referer IS NOT NULL LIMIT 100;"

Analyzing this sample was helpful in assessing whether the combination of referrers and other fields, including country, would pose any significant privacy risks. So far, it is my understanding that, by stripping the query strings off, the likelihood of origin domains being so specific that they may give away some information about a visitor is quite low. Furthermore, in the event of a malicious actor gaining access to the stored data, the risk would still be considerably low since the remaining data would not provide any certainty about precise location or identity.

With the above in mind, and if I have an accurate understanding, I would not be opposed to the modification proposed by this ticket.

@sguebo_WMF Hi!

I agree with you that masking the referer URL mitigates the privacy risk. For instance, if someone came from a video streaming site and the URL was https://video.streaming.site/?v=dkvbj349, it would become https://video.streaming.site/ and would be fine.

I still think, though, that in some cases it might contain sensitive information. For example, if someone came from a site pertaining to a politically controversial group with the referer URL http://politically.controversial.group/account/?user=24uihd, the masked version would still show http://politically.controversial.group/, which can damage the user's privacy. I'd say this is one of the cases where we have to compare privacy risk vs. analytics benefits. That's why I suggested bringing in @JFishback_WMF; I think he can help us with this.
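To make the two examples above concrete, here is a minimal Python sketch of the kind of mask being discussed. It is only an illustration, not the implementation planned in T281144: the function name `mask_referer` is hypothetical, and the choice to also drop the port, userinfo, path, and fragment (not just the query string) is an assumption about what "hostname only" would mean.

```python
from urllib.parse import urlsplit


def mask_referer(referer):
    """Reduce a referer URL to scheme://hostname/ (hypothetical helper).

    Returns None when the value is empty or cannot be parsed as an
    absolute URL, so that unparseable referers are dropped, not stored.
    """
    if not referer:
        return None
    try:
        parts = urlsplit(referer.strip())
    except ValueError:
        # urlsplit raises ValueError on some malformed inputs
        # (for example, an unbalanced IPv6 bracket).
        return None
    if not parts.scheme or not parts.hostname:
        return None
    # parts.hostname is lowercased and excludes userinfo and port.
    return f"{parts.scheme}://{parts.hostname}/"


print(mask_referer("https://video.streaming.site/?v=dkvbj349"))
print(mask_referer("http://politically.controversial.group/account/?user=24uihd"))
```

Note that the second URL still yields the full hostname after masking, which is exactly the residual risk described above: the mask removes account and query details but not the identity of the referring site.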

Hey @mforns! @sguebo_WMF has been working on this for the Privacy Engineering team and filled me in on the details so far. I concur with his analysis, since the likelihood of http://p.c.g appearing seems pretty low in the first place. And since, AIUI, even with a potentially problematic hostname, there is not a high level of additional detailed information with which to reidentify someone, this seems like a LOW risk to me. @sguebo_WMF is finalizing our risk review sheet right now (he might actually be done already, but I'm not sure yet), but please let us know if you think we've missed something. It seems like even with language and country being included in the schema, the likelihood of being able to track a hostname back to an individual user is pretty low. Are there other properties that concern you that we may have missed?

Hey @mforns! @sguebo_WMF has been working on this for the Privacy Engineering team...

Oh @JFishback_WMF! I mistakenly thought he worked on it representing the data set owners.
That was a bad mistake on my side. I already knew he was working with you in the Security team,
but somehow, I didn't make the obvious connection. I shouldn't have pinged you for another opinion.

@sguebo_WMF I'm very sorry about that. I apologize.

Regarding the risk of this data set, I completely trust your analysis. And I agree with:

Furthermore, in the event of a malicious actor gaining access to the stored data, the risk would still be considerably low since the remaining data would not provide any certainty about precise location or identity.

Hey @mforns, that confusion is totally understandable, so no worries at all. The good thing is that I have now updated my title in Phabricator.

I will bring up this task today at our grooming meeting, so we prioritize it.

@sguebo_WMF & @EYener, we discussed this task and will go ahead and implement this feature.
However, as it is not trivial, we won't be able to do it this quarter.
We're aiming to tackle it next quarter.

This is fine by me. I think it is a sound move to take the appropriate time to work on it.

Hi all, just chiming in to say thank you for adding this to your queue! There is no urgent need to add this to this quarter's work (and wow, this quarter is nearly done), but I look forward to using this data when it becomes available.

Hi all! Pinging @mforns who was helpful with the original whitelist request from phab ticket T273246. I recall that we had some discussion over whether we could keep the referrer data in the sanitized version of this data, which is why we created this new task - to scrub that data.

However, as we start to work with the dataset more, we're also noticing that the event property does not have all the elements we expect from working with the non-sanitized version. Particularly, the event.country property is missing. It seems like this would be okay to leave in the sanitized version, as the geocoded_data is still present.

It would help us out a lot if we didn't have to modify our query structure and we were still able to pull from the event struct in order to get this data. Is there a concern about keeping all elements in the event struct populated?

Hey @EYener :-)
The country_code is indeed already in the geocoded_data field, so I don't think that there are any privacy concerns in allow-listing the event.country field!
Now, it might be that the values of event.country do not match the values in geocoded_data.country_code, since they are collected in different ways.
Also, if you have to combine this dataset at some point with other datasets using the country_code dimension, it might be more robust to use the geocoded_data.country_code field, maybe?
You decide! In case you want to allow-list event.country, please submit a patch to the allow-list, as you did before in: https://gerrit.wikimedia.org/r/c/analytics/refinery/+/666227/5/static_data/eventlogging/whitelist.yaml
and add me as a reviewer :-)

Is there a concern about keeping all elements in the event struct populated?

There is no concern (that I can see) in keeping event.country. But I'd say that the other non-allow-listed fields (destination and referer) might be privacy sensitive.

Awesome, thank you @mforns! @JMando and I can tackle this over the next few weeks.