Page MenuHomePhabricator

Bring Banner History data into Fundraising infrastructure
Open, Needs TriagePublic

Description

Erin has use cases here:
https://docs.google.com/document/d/1Y5uyYJqFZuDBi2GcmFyRZpg_8UAmArYTxxg-QO9BquE/edit#

This task will be rewritten as we talk with analytics engineering and get more info.

----------old description----------

We are interested in storing banner history event data in some form within the Fundraising data ecosystem so that we can blend it with other elements of fundraising data for analysis.

Currently, this data is stored within Hive. It has been whitelisted by the Analytics Engineering team to allow for storage longer than 90 days and is kept in the event_sanitized.centralnoticebannerhistory schema.

Data we are interested in accessing internally is documented here: https://meta.wikimedia.org/wiki/Schema:CentralNoticeBannerHistory Particularly within the "l" property. This information can be combined with FR-internal data to answer interesting questions such as outlined in an old blog post, here https://blog.wikimedia.org/2017/10/03/fundraising-banner-limit/

Additional information related to banner interactions which would be helpful to have internally are banner status codes: https://www.mediawiki.org/wiki/Extension:CentralNotice/Statuses,_reasons_and_status_codes
(It's possible these could/should be two separate asks).

Event Timeline

Hi! A few thoughts on this:

  • AFIK Banner history data is not currently stored any quickly-queriable or aggregated format, only Hive. It comes in via EventLogging, but I think the MySQL store available on the Analytics cluster for other events is disabled.
  • This is in part because the data points that we'd query are in an array, which in turn contains JSON strings. A DB or other type of store for fast, useful querying would probably have to re-organize the data into two tables, parsing the JSON strings into the columns of one of them.
  • I don't know if Analytics' EventLogging infrastructure has features for such specialized munging.
  • Aside from whether or not it'd be easy to do this on the Analytics cluster, it seems it'd be quite useful to have at least a copy on the FR cluster, to do joins with other donor data. Or, the munging could happen directly on the FR cluster.

Also just to recall, in case it's relevant: this event is sent in two different scenarios. First, we get it from 1% of pageviews for all users selected for a campaign with banner history enabled. Second, we get it from 100% of users who click a button to donate in a banner. In the second scenario, if the user does reach the Payments infrastructure, we can associate the data from this event with donor data.

Additional information related to banner interactions which would be helpful to have internally are banner status codes: https://www.mediawiki.org/wiki/Extension:CentralNotice/Statuses,_reasons_and_status_codes

This data is already in the event. :) It's the last data point in the JSON text of entries in the l property. This is mentioned in the schema as "status of the event". Apologies if the wording isn't clear!

Hope this is useful! Thanks so much!!! ;)

  • AFIK Banner history data is not currently stored any quickly-queriable or aggregated format, only Hive. It comes in via EventLogging, but I think the MySQL store available on the Analytics cluster for other events is disabled.

Completely sidestepping the very important "does it make sense?" and "which data exactly?" and "how do we transform the schema?" questions for now, it looks like Sqoop can be used to export data from Hive. I see Analytics is using Sqoop elsewhere, although I don't know if for exporting.

@EYener Hi! Is there a specific reason for this task being private? If not, let's make it public? Thanks much!!!

I made this private. I'm happy to open it up after a quick chat in our Monday meeting.

DStrine changed the visibility from "Custom Policy" to "Public (No Login Required)".Jun 8 2020, 8:06 PM