Unsupervised Parsing by Searching for Frequent Word Sequences among Sentences with Equivalent Predicate-Argument Structures

Junjie Chen; Xiangheng He; Danushka Bollegala; Yusuke Miyao

doi:10.18653/v1/2024.findings-acl.225

Abstract

Unsupervised constituency parsing focuses on identifying word sequences that form syntactic units (i.e., constituents) in target sentences. Linguists identify a constituent by evaluating a set of Predicate-Argument Structure (PAS) equivalent sentences, in which the constituent appears more frequently than non-constituents (i.e., the constituent corresponds to a frequent word sequence within the sentence set). However, such frequency information is unavailable to previous parsing methods, which identify constituents by observing sentences with diverse PAS. In this study, we empirically show that constituents correspond to frequent word sequences in the PAS-equivalent sentence set. We propose a frequency-based parser, span-overlap, that (1) computes the span-overlap score as the word sequence's frequency in the PAS-equivalent sentence set and (2) identifies the constituent structure by finding the constituent tree with the maximum span-overlap score. The parser achieves state-of-the-art level parsing accuracy, outperforming existing unsupervised parsers in eight out of ten languages. Additionally, we discover a multilingual phenomenon: participant-denoting constituents tend to have higher span-overlap scores than equal-length event-denoting constituents, meaning that the former tend to appear more frequently in the PAS-equivalent sentence set than the latter. The phenomenon indicates a statistical difference between the two constituent types, laying the foundation for future labeled unsupervised parsing research.
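The two steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on several assumptions not stated in the abstract: the PAS-equivalent sentences are supplied by hand (the abstract does not say how they are obtained), a span's score is the fraction of those sentences that contain it as an exact contiguous word sequence, a tree's score is the sum of its spans' scores, and trees are binary. The names `contains`, `span_overlap_scores`, and `best_tree` are hypothetical.

```python
from functools import lru_cache


def contains(sentence, words):
    """True if `words` occurs as a contiguous sub-list of `sentence`."""
    m = len(words)
    return any(sentence[i:i + m] == words for i in range(len(sentence) - m + 1))


def span_overlap_scores(target, equivalents):
    """Score each span (i, j) of `target` by the fraction of sentences in
    `equivalents` that contain target[i:j] verbatim (assumed definition)."""
    n = len(target)
    return {
        (i, j): sum(contains(s, target[i:j]) for s in equivalents) / len(equivalents)
        for i in range(n)
        for j in range(i + 1, n + 1)
    }


def best_tree(target, scores):
    """CKY-style search for the binary tree whose spans have the highest
    total score (assumed objective: sum of span scores)."""

    @lru_cache(maxsize=None)
    def solve(i, j):
        if j - i == 1:
            return scores[(i, j)], target[i]
        best_total, best_split = None, None
        for k in range(i + 1, j):
            left_score, left_tree = solve(i, k)
            right_score, right_tree = solve(k, j)
            total = left_score + right_score
            if best_total is None or total > best_total:
                best_total, best_split = total, (left_tree, right_tree)
        return best_total + scores[(i, j)], best_split

    return solve(0, len(target))[1]


# Hand-written toy data; these sentences are illustrative, not from the paper.
target = "the dog chased the cat".split()
equivalents = [
    "the cat was chased by the dog".split(),
    "it was the dog that chased the cat".split(),
    "what the dog chased was the cat".split(),
    "the dog is what chased the cat".split(),
]
scores = span_overlap_scores(target, equivalents)
tree = best_tree(target, scores)
print(tree)
```

In this toy set, "the dog" and "the cat" occur in every rewritten sentence, while "dog chased" occurs in only one, so the search groups the two noun phrases before attaching the verb.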

Anthology ID: 2024.findings-acl.225
Volume: Findings of the Association for Computational Linguistics: ACL 2024
Month: August
Year: 2024
Address: Bangkok, Thailand
Editors: Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 3760–3772
URL: https://aclanthology.org/2024.findings-acl.225
DOI: 10.18653/v1/2024.findings-acl.225
Cite (ACL): Junjie Chen, Xiangheng He, Danushka Bollegala, and Yusuke Miyao. 2024. Unsupervised Parsing by Searching for Frequent Word Sequences among Sentences with Equivalent Predicate-Argument Structures. In Findings of the Association for Computational Linguistics: ACL 2024, pages 3760–3772, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal): Unsupervised Parsing by Searching for Frequent Word Sequences among Sentences with Equivalent Predicate-Argument Structures (Chen et al., Findings 2024)
PDF: https://aclanthology.org/2024.findings-acl.225.pdf
