
Integrate file revisions with description page history
Open, Stalled, MediumPublic

Description

Currently, file revisions are managed separately from revisions of the file description page. That can lead to confusion and inconsistencies, not only on the "surface" of user interaction but also in the internal update and management processes. Evidence: T5498, T2778, T35292, T42178, T28741, T589 (this is from a cursory search; there are likely several more).

This RFC proposes to integrate the upload history with the file edit history. The RFC should be considered in two parts: whether this should be done, and if yes, how it should be done.

When deciding whether upload and edit history should be combined, numerous edge cases are to be considered:

  • what does a diff for an upload look like?
  • how does revision deletion/oversight interact with upload revisions?
  • how would undo and revert work?
  • does an edit always change either the file or the text, or can it change both, so data and meta-data can be kept in sync?

Also, consider that besides the file data and the description text, in the future we may have a third set of data tied to the revision, namely the structured meta-data for the file, managed using Wikibase.

When discussing the how, perhaps the notion of having multiple content blobs per page and revision could be helpful. Such "attachments" or "multipart content" have been discussed several times as a mechanism for managing meta-data such as categories outside the wikitext.

Event Timeline

daniel raised the priority of this task from to Needs Triage.
daniel updated the task description.
daniel added a project: TechCom-RFC.
daniel subscribed.

For the record: I'm very interested in the discussion, but have no resources to invest into implementing any kind of decision. My intention with this RFC is to give guidance to other RFCs such as T589 and future discussions.

One way to think about this is to have multiple "content blobs" associated with each revision (currently we only have one blob, identified by rev_text_id). Edits may change one or more of them. If a blob isn't changed between revisions, both revisions point to the same blob (we already have this kind of "empty" revision for recording page moves in the page history).

We could associate additional content blobs with revisions using an additional table (or by adding a column to the text table that points "back" to the revision it "belongs" to).
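
The blob-sharing idea above can be illustrated with a minimal in-memory sketch. It is not a proposed implementation: the `Revision` class, the `new_revision` helper, the role names "text" and "file", and all blob ids are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Revision:
    rev_id: int
    # Maps a role name (hypothetical, e.g. "text" or "file") to a blob id.
    blobs: Dict[str, int] = field(default_factory=dict)


def new_revision(rev_id: int, parent: Optional[Revision], changed: Dict[str, int]) -> Revision:
    """Create a revision that points at the parent's blobs unless a role was changed."""
    blobs = dict(parent.blobs) if parent else {}
    blobs.update(changed)
    return Revision(rev_id, blobs)


r1 = new_revision(1, None, {"text": 100, "file": 200})  # initial upload with description
r2 = new_revision(2, r1, {"text": 101})                 # description edit only
r3 = new_revision(3, r2, {"file": 201})                 # re-upload only
r4 = new_revision(4, r3, {})                            # "empty" revision, e.g. a page move
```

In this sketch, `r2` still points at file blob 200 and `r4` points at exactly the same blobs as `r3`, which mirrors the existing "empty" revisions recorded for page moves.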

Notes from the IRC meeting (in no particular order):

Daniel: Proposal: multiple blobs per revision, with a role, model, and hash associated. may be a good time to move to timeuuid revision ids. unify content handler with media handler, and file repo with external store / RESTBase.
Daniel: I'm thinking of splitting the revision table into two, one for the actual revision (timestamp, user, id) and one for the blobs (blob-id, url, content-model, format, hash, rev-id)

Daniel: UI questions: file description page, upload page, history view, undo/revert, diff pages
Daniel: unify ContentHandler with MediaHandler, and FileRepository with ExternalStore.
Gabriel: This is similar to what RESTBase is doing:

<DanielK_WMDE> The XML format would need to change to accommodate multiple content blobs per revision.

Tim: need to have more of a product, UI-focused discussion, with user involvement, mockups, then requirements and then architecture

Daniel: "upload" revisions can be identified using a revision tagging mechanism
<bawolff> The scope on this proposal seems humungous...
<gwicke> I see history more as a timeline of events associated with a logical bit of content; it would be nice to store events by something like a timeuuid and then merge them in a history view. Can even select events to show dynamically, add new types of events later.
<bawolff> Altering description on upload would be a lot easier (in terms of presenting a sane UI) when/if wikidata for images actually happens

<TimStarling> I mean the creating/deleting/moving file revisions, there is a lot of code to do that
<TimStarling> mostly it will just go away
<TimStarling> which is nice

One of the main questions is where and how to store the necessary info (blob-id, url, content-model, format, hash, rev-id) to associate multiple revision content data blobs with an entry in the revision table (and edit event).

Tim: encode it in the text table, kind of like what we do for ES
Daniel: have a new revision_blob table split off from the revision table
Gabriel: have a REST service associate blobs with the revision id / uuid.
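
Of the three options above, the split-table one is the easiest to sketch. The following uses Python's built-in sqlite3 module purely as a stand-in for the real database. Column names follow the field list given in the notes; the `role` column, the key choice, the URL schemes, and all sample values are assumptions made for this example, not part of any agreed design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The "actual revision": one row per edit event.
CREATE TABLE revision (
    rev_id        INTEGER PRIMARY KEY,
    rev_timestamp TEXT NOT NULL,
    rev_user      TEXT NOT NULL
);
-- One row per content blob attached to a revision (hypothetical layout).
CREATE TABLE revision_blob (
    rev_id        INTEGER NOT NULL REFERENCES revision(rev_id),
    role          TEXT NOT NULL,
    blob_id       INTEGER NOT NULL,
    url           TEXT NOT NULL,
    content_model TEXT NOT NULL,
    format        TEXT NOT NULL,
    hash          TEXT NOT NULL,
    PRIMARY KEY (rev_id, role)
);
""")

conn.execute("INSERT INTO revision VALUES (1, '20150101000000', 'ExampleUser')")
conn.executemany(
    "INSERT INTO revision_blob VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        # Made-up sample rows: a description blob and a file blob for one revision.
        (1, "text", 100, "es://cluster1/100", "wikitext", "text/x-wiki", "aaa"),
        (1, "file", 200, "repo://local/Example.png", "media", "image/png", "bbb"),
    ],
)

# All blobs belonging to revision 1, ordered by role.
rows = conn.execute(
    "SELECT role, content_model FROM revision_blob WHERE rev_id = ? ORDER BY role",
    (1,),
).fetchall()
```

Keying `revision_blob` on (rev_id, role) is one possible choice. It would need an extra row per revision to share unchanged blobs, whereas the text-table and REST-service options would place that association elsewhere.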

Moved to "draft" state, because this needs more thought in order to become a concrete proposal. It also needs splitting up into manageable tasks, and some sense of resourcing. Right now this is an idea for a major restructuring, with no concrete need to go forward with it, although the "media meta data" effort could benefit from it.

@GWicke: it just occurred to me that hash based addressing of blobs would be very nice for this, especially for uploaded files. That would make it very easy to "rename" files without having to do anything on the file system.

@daniel: Indeed. Another benefit is that we don't need to invalidate caches, and stored content using the older version remains consistent without a need to re-render the HTML. See also T66214 and T1210.
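
To make the hash-based addressing from the two comments above concrete, here is a small sketch of a content-addressed store in which a rename only touches the name-to-hash mapping. The `BlobStore` class and its methods are hypothetical, and SHA-256 is an arbitrary choice for the example rather than something the discussion settled on.

```python
import hashlib
from typing import Dict


class BlobStore:
    """Toy content-addressed store: blobs are keyed by the hash of their bytes."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}  # hash -> content
        self._names: Dict[str, str] = {}    # file name -> hash

    def upload(self, name: str, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        self._blobs.setdefault(digest, data)  # identical content is stored once
        self._names[name] = digest
        return digest

    def rename(self, old: str, new: str) -> None:
        # Only the name mapping changes; the stored blob is not moved or copied.
        self._names[new] = self._names.pop(old)

    def read(self, name: str) -> bytes:
        return self._blobs[self._names[name]]


store = BlobStore()
digest = store.upload("Old_name.png", b"image bytes")
store.rename("Old_name.png", "New_name.png")
```

Because the blob's address never changes, anything that referenced it by hash before the rename stays valid afterwards, which is the cache-consistency benefit mentioned above.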

<spam>https://gerrit.wikimedia.org/r/179402 aims to make both page and file revisions implement a common interface.</spam>

For the record, this would be pretty easy to do with MCR. Outline:

  • define a "media" slot
  • define a "media" content model
  • the media content model is a (JSON?) data structure that contains the following:
    • the name or path of the file, as used by FileRepo
    • basic meta data, like the file size and mime type
    • more advanced meta data extracted from the file, such as image dimensions, video duration, EXIF data, etc.
  • the media content model does not support direct editing. Instead, the editing interface for this model (and slot) is the upload API/UI.
  • old dummy revisions that represent uploads could have a new slot added retroactively (NOTE: this changes the revision hash and size!)
  • older uploads that don't have a dummy revision associated with them would need to have a new revision injected into the page history.
  • The image table would be kept but would no longer serve as a source of truth. It would instead be secondary data for querying files by meta data extracted from their latest revision.
  • The oldimage table could go away entirely.
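
As a rough illustration of the "media" content model in the outline above, the data structure could look something like the following. The outline leaves the serialization open ("JSON?"), so this sketch simply assumes JSON. The `MediaContent` class, its field names, and the sample values are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict


@dataclass
class MediaContent:
    path: str   # name or path of the file, as used by FileRepo
    size: int   # basic meta data: file size in bytes
    mime: str   # basic meta data: MIME type
    # Meta data extracted from the file (dimensions, duration, EXIF, ...).
    extracted: Dict[str, Any] = field(default_factory=dict)

    def serialize(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

    @staticmethod
    def unserialize(blob: str) -> "MediaContent":
        return MediaContent(**json.loads(blob))


content = MediaContent(
    path="Example.png",
    size=12345,
    mime="image/png",
    extracted={"width": 800, "height": 600},
)
blob = content.serialize()
restored = MediaContent.unserialize(blob)
```

Such a structure would only ever be written by the upload API/UI, in line with the point that the model does not support direct editing.
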
Krinkle changed the task status from Open to Stalled. Sep 16 2020, 7:40 PM
Krinkle triaged this task as Medium priority.
Krinkle moved this task from Old to P1: Define on the TechCom-RFC board.