DOM Pass for wrapping bare text found in <body> and other "block" (in html4-parlance) nodes like <blockquote>, <td>, <th>.
Closed, Resolved (Public)

Description

Right now, Parsoid tries to mimic the p-wrapping semantics of the PHP parser + Tidy but doesn't get them fully right.

  1. See T109650.
  2. Parsoid currently leaves behind bare text in <blockquote> in some scenarios (where that text shows up on the same wikitext line as the <blockquote> or </blockquote> tags). However, Tidy wraps these text nodes in <p> tags (probably an HTML4 behavior). In HTML5, it is no longer necessary to wrap text nodes in p-tags. But since line-based processing in doBlockLevels in the PHP parser introduces p-tags in some scenarios and not others, we will likely want <p> tags wrapping all text content in these tags (blockquote, td, th).

So, once T89331 is resolved, we should rip the block-tag behavior (meant to mimic Tidy) out of the p-wrapping token transformer and instead introduce a DOM pass that addresses bare text found in the DOM (either as children of <body> or in "block" nodes such as <blockquote>, <td>, <th>).

If implemented in the PHP parser as well, this could lead to identical output from the PHP parser and Parsoid. It could also remove odd behavior currently found in the output of the PHP+Tidy combo, where wikitext such as "foo\n\nbar\n\nbaz" found in a <td> results in foo being bare text while bar and baz are wrapped in p-tags (which is basically why the PHP parser and Parsoid have different output for T109650).

However, independent of what we do with the PHP parser, it makes sense to do some of the p-wrapping as a post-html-generation dom pass inside Parsoid to make p-wrapping more consistent.
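
To make the proposal concrete, here is a minimal sketch of such a DOM pass in Python, using the standard library's `xml.dom.minidom`. It is not Parsoid's implementation: the function names, the `BLOCK_TAGS` subset, and the rule that a run of consecutive non-block siblings becomes one <p> are assumptions made for illustration. Only the set of containers (body, blockquote, td, th) comes from the description above.

```python
from xml.dom import Node, minidom

# Containers named in the task description whose bare text gets wrapped.
WRAP_PARENTS = {"body", "blockquote", "td", "th"}
# Assumption: an illustrative, incomplete subset of block-level tags.
BLOCK_TAGS = {
    "p", "div", "blockquote", "table", "tbody", "tr", "td", "th",
    "ul", "ol", "li", "dl", "dt", "dd", "pre", "hr",
    "h1", "h2", "h3", "h4", "h5", "h6",
}


def is_block(node):
    return node.nodeType == Node.ELEMENT_NODE and node.tagName.lower() in BLOCK_TAGS


def is_significant(node):
    """True for non-whitespace text and for (inline) elements."""
    if node.nodeType == Node.TEXT_NODE:
        return bool(node.data.strip())
    return node.nodeType == Node.ELEMENT_NODE


def flush(doc, parent, run):
    """Wrap a run of adjacent non-block siblings in a single <p>."""
    if any(is_significant(n) for n in run):
        p = doc.createElement("p")
        parent.insertBefore(p, run[0])
        for n in run:
            p.appendChild(n)  # minidom detaches the node from its old parent


def p_wrap(doc, node):
    """Recursively wrap bare content in the WRAP_PARENTS containers."""
    wrap_here = node.tagName.lower() in WRAP_PARENTS
    run = []
    for child in list(node.childNodes):  # copy: the tree is mutated below
        if is_block(child):
            if wrap_here:
                flush(doc, node, run)
            run = []
            p_wrap(doc, child)
        else:
            run.append(child)
    if wrap_here:
        flush(doc, node, run)


def p_wrap_markup(markup):
    """Parse well-formed markup, run the pass, and serialize the root."""
    doc = minidom.parseString(markup)
    p_wrap(doc, doc.documentElement)
    return doc.documentElement.toxml()


print(p_wrap_markup("<body><td>foo<p>bar</p><p>baz</p></td></body>"))
# <body><td><p>foo</p><p>bar</p><p>baz</p></td></body>
```

In this sketch, whitespace-only runs between block elements are left untouched, and block elements nested inside inline elements are not visited. A real pass would have to decide how to handle both.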

Event Timeline

ssastry raised the priority of this task from to Medium.
ssastry updated the task description.
ssastry added projects: Parsoid, MediaWiki-Parser.
ssastry subscribed.
ssastry set Security to None.

As far as the blockquote element is concerned, it seems more like a usage issue than a Tidy or parser quirk to me. The reported glitch appears to account only for single-paragraph scenarios, which is why the improper usage is not revealed.

Using Blockquote Testbed as a reference, one can see that for a single "paragraph" of text there is no discernible difference in the final rendering, whatever the position of the opening and closing blockquote tags - not even when a paragraph tag is also applied (examples A-1, A-2, and A-3).

The same is not true if more than one paragraph of text is being quoted (see examples B-1, B-2 and B-3).

Since both paragraphs become one in ex. B-1 when finally rendered under wiki mark-up (with the current CSS definitions in play), it's clear that the opening and closing blockquote tags should always reside on their own lines. In other terms: any text, be it a single paragraph or multiple paragraphs, should be "contained" rather than "wrapped" by blockquote tags (see B-2). And since blockquotes are applied through templates more often than through straight HTML, that practice should be easy enough to standardize given some effort.

B-2's final rendering mirrors B-3's - which would be the normal HTML used outside of the MediaWiki / wiki mark-up environment.

The thing about table cells is a whole other matter, mostly because of well-entrenched wiki practice(s) concerning HTML tables in general.

Thanks for this investigation @GOIII. So, with some tweaks to the blockquote-emitting templates that you outlined, the output diff between Parsoid HTML and PHP HTML will go away for the <blockquote> scenario without needing any tweaks to the PHP parser.

But, from a Parsoid code maintenance point of view, I think we will consider simplifying our implementation to use a DOM pass so that some of these weird edge cases are handled uniformly.

Change 443744 had a related patch set uploaded (by Arlolra; owner: Arlolra):
[mediawiki/services/parsoid@master] [WIP] Move to DOM based wrapping of bare text

https://gerrit.wikimedia.org/r/443744

Change 443744 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Move to DOM based wrapping of bare text

https://gerrit.wikimedia.org/r/443744