Poggi, Francesco
(2015)
Structural patterns for document engineering: from an empirical bottom-up analysis to an ontological theory, [Dissertation thesis], Alma Mater Studiorum Università di Bologna.
Dottorato di ricerca in
Informatica, 27 Ciclo. DOI 10.6092/unibo/amsdottorato/7123.
Documenti full-text disponibili:
Abstract
This thesis aims at investigating a new approach to document analysis based on the idea of structural patterns in XML vocabularies. My work is founded on the belief that authors do naturally converge to a reasonable use of markup languages and that extreme, yet valid instances are rare and limited. Actual documents, therefore, may be used to derive classes of elements (patterns) persisting across documents and distilling the conceptualization of the documents and their components, and may give ground for automatic tools and services that rely on no background information (such as schemas) at all.
The central part of my work consists in introducing from the ground up a formal theory of eight structural patterns (with three sub-patterns) that are able to express the logical organization of any XML document, and verifying their identifiability in a number of different vocabularies. This model is characterized by and validated against three main dimensions: terseness (i.e. the ability to represent the structure of a document with a small number of objects and composition rules), coverage (i.e. the ability to capture any possible situation in any document) and expressiveness (i.e. the ability to make explicit the semantics of structures, relations and dependencies).
An algorithm for the automatic recognition of structural patterns is then presented, together with an evaluation of the results of a test performed on a set of more than 1100 documents from eight very different vocabularies. This language-independent analysis confirms the ability of patterns to capture and summarize the guidelines used by the authors in their everyday practice.
Finally, I present some systems that work directly on the pattern-based representation of documents. The ability of these tools to cover very different situations and contexts confirms the effectiveness of the model.
Abstract
This thesis aims at investigating a new approach to document analysis based on the idea of structural patterns in XML vocabularies. My work is founded on the belief that authors do naturally converge to a reasonable use of markup languages and that extreme, yet valid instances are rare and limited. Actual documents, therefore, may be used to derive classes of elements (patterns) persisting across documents and distilling the conceptualization of the documents and their components, and may give ground for automatic tools and services that rely on no background information (such as schemas) at all.
The central part of my work consists in introducing from the ground up a formal theory of eight structural patterns (with three sub-patterns) that are able to express the logical organization of any XML document, and verifying their identifiability in a number of different vocabularies. This model is characterized by and validated against three main dimensions: terseness (i.e. the ability to represent the structure of a document with a small number of objects and composition rules), coverage (i.e. the ability to capture any possible situation in any document) and expressiveness (i.e. the ability to make explicit the semantics of structures, relations and dependencies).
An algorithm for the automatic recognition of structural patterns is then presented, together with an evaluation of the results of a test performed on a set of more than 1100 documents from eight very different vocabularies. This language-independent analysis confirms the ability of patterns to capture and summarize the guidelines used by the authors in their everyday practice.
Finally, I present some systems that work directly on the pattern-based representation of documents. The ability of these tools to cover very different situations and contexts confirms the effectiveness of the model.
Tipologia del documento
Tesi di dottorato
Autore
Poggi, Francesco
Supervisore
Dottorato di ricerca
Scuola di dottorato
Scienze e ingegneria dell'informazione
Ciclo
27
Coordinatore
Settore disciplinare
Settore concorsuale
Parole chiave
XML, descriptive markup, document visualisation, ontology, pattern recognition, structural patterns, EARMARK, document engineering, markup languages, OWL, information visualization
URN:NBN
DOI
10.6092/unibo/amsdottorato/7123
Data di discussione
4 Giugno 2015
URI
Altri metadati
Tipologia del documento
Tesi di dottorato
Autore
Poggi, Francesco
Supervisore
Dottorato di ricerca
Scuola di dottorato
Scienze e ingegneria dell'informazione
Ciclo
27
Coordinatore
Settore disciplinare
Settore concorsuale
Parole chiave
XML, descriptive markup, document visualisation, ontology, pattern recognition, structural patterns, EARMARK, document engineering, markup languages, OWL, information visualization
URN:NBN
DOI
10.6092/unibo/amsdottorato/7123
Data di discussione
4 Giugno 2015
URI
Statistica sui download
Gestione del documento: