1. Introduction
NER is a well-known task in the field of NLP. The NEISS project [1], in which we work in close cooperation with Germanists, is devoted to the automation of diverse processes during the creation of digital editions. One key task in this area is the automatic detection of entities in text corpora, which corresponds to a common NER task. Currently, the best results for NER tasks have been achieved with Transformer-based [2] language models, such as Bidirectional Encoder Representations from Transformers (BERT) [3]. Classically, a BERT is first pre-trained with large amounts of unlabeled text to obtain a robust language model and then fine-tuned to a downstream task. In particular, for the pre-training step, many variants of BERT, such as ALBERT [4], RoBERTa [5], or XLNet [6], have already been investigated. Pre-training is resource-intensive and takes a long time (several weeks). For that reason, online platforms, such as Hugging Face [7], offer a zoo of already pre-trained networks that can be used directly to train a downstream task. However, the available models are not always suitable for a certain task, such as NER in German, because they may have been pre-trained on a different domain (e.g., language, time epoch, or text style).
Furthermore, when philologists create new digital editions, different research priorities can be set, so that a different associated NER task is created each time. That is why philologists, who commonly only have access to limited compute resources, must also be able to train individual NER tasks themselves. For this reason, the aim is to train NER tasks on smaller BERT models as well as possible. Since our focus is NER, we test whether the optimizations work consistently on five different German NER tasks. Due to our project aims, we evaluated our new methods on the German language; we suspect a consistent behavior on similar European languages, such as English, French, and Spanish. Two of the considered tasks rely on new NER datasets that we generated from existing digital text editions.
Therefore, in this article, we examine which techniques are optimal to pre-train and fine-tune a BERT to solve NER tasks in German with limited resources. We investigate this on smaller BERT models with six layers that can be pre-trained on a single GPU (RTX 2080 Ti 11 GB) within 9 days, whereas fine-tuning can be performed on a notebook CPU in a few hours.
We first compared different well-established pre-training techniques, such as Mask Language Model (MLM), Sentence Order Prediction (SOP), and Next Sentence Prediction (NSP), on the final result of the downstream NER task. Furthermore, we investigated the influence of absolute and relative positional encoding, as well as Whole-Word Masking (WWM).
As a second step, we compared various approaches for carrying out fine-tuning, since the tagging rules cannot be learned consistently by classical fine-tuning approaches. In addition to existing approaches, such as the use of Linear Chain Conditional Random Fields (LCRFs), we propose the so-called Class-Start-End (CSE) tagging and a specially modified form of LCRFs for NER, which led to increased performance. Furthermore, for decoding, we introduce a simple rule-based approach, which we call the Entity-Fix rule, to further improve the results.
As already mentioned, the training of a BERT requires many resources. One of the reasons is that the memory consumption of BERT grows quadratically with the sequence length when calculating the energy values (attention scores) in its attention layers, which leads to memory problems for long sequences. In this article, we propose Whole-Word Attention, a new modification of the Transformer architecture that not only reduces the number of energy values to be calculated by about a factor of two but also slightly improves the results.
In summary, the main goal of this article is to enable the training of efficient BERT models for German NER on limited resources. To this end, the article provides different methodologies and makes the following contributions:
- We introduce and share two datasets for German NER formed from existing digital editions.
- We investigate the influence of different BERT pre-training methods, such as pre-training tasks, varying positional encoding, and adding Whole-Word Attention, on a total of five different NER datasets.
- On the same NER tasks, we investigate different approaches to perform fine-tuning. In doing so, we propose two new methods that led to performance improvements: Class-Start-End tagging and a modified form of LCRFs.
- We introduce a novel rule-based decoding strategy achieving further improvements.
- We propose Whole-Word Attention, a modification of the BERT architecture that reduces the memory requirements of the BERT models, especially for processing long sequences, and also leads to further performance improvements.
The remainder of this article is structured as follows: In Section 2, we present our datasets, including the two new German NER datasets. In Section 3, we introduce the different pre-training techniques, and Section 4 describes fine-tuning. Subsequently, in Section 5, we introduce Whole-Word Attention (WWA). In all these sections, we provide an overview of the existing techniques with the corresponding related work that we adopted and also introduce our novel methods. Afterwards, Section 6 presents the conducted experiments and their results. We conclude this article with a discussion of our results and an outlook on future work.
4. Fine-Tuning Techniques for NER
The task of NER is to detect entities, such as persons or places, which possibly consist of several words within a text. As proposed in [3], the traditional approach for fine-tuning a BERT to a classification task, such as NER, is to attach an additional feed-forward layer to a pre-trained BERT that predicts token-wise labels. In order to preserve and obtain information about the grouping of tokens into entities, Inside-Outside-Beginning (IOB) tagging [23] is usually applied. IOB tagging introduces two versions of each entity class, one marking the beginning of the entity and one representing the interior of an entity, and an “other” class, which all together results in a total of 2e + 1 tag classes, where e is the number of entity classes.
Table 3 shows an example in which the beginning token of an entity is prefixed with a “B-” and all other tokens with an “I-”.
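As a small illustration (our own example, not taken from the paper), the IOB tag set for e entity classes can be enumerated as follows:

```python
# e = 2 entity classes, as in the example of Table 3
entity_classes = ["Per", "Loc"]

# One "B-" and one "I-" tag per entity class, plus the "other" class O
iob_tags = ["O"] + [f"{prefix}-{c}" for c in entity_classes for prefix in ("B", "I")]
print(iob_tags)       # ['O', 'B-Per', 'I-Per', 'B-Loc', 'I-Loc']
print(len(iob_tags))  # 2 * len(entity_classes) + 1 == 5 tag classes
```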
In compliance with the standard evaluation scheme of NER tasks in [24], we compute an entity-wise F1 score denoted by E-F1. Instead of computing a token- or word-wise F1 score, E-F1 evaluates a complete entity as true positive only if all tokens belonging to the entity are correct. Our implementation of E-F1 relies on the widely used Python library seqeval [25].
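To illustrate the entity-wise evaluation, the following small sketch computes E-F1 with seqeval on hand-made IOB sequences (the concrete tags are our own example, not taken from the datasets):

```python
from seqeval.metrics import f1_score

# One sentence with gold and predicted IOB tags.
y_true = [["B-Loc", "I-Loc", "O", "B-Per"]]
y_pred = [["B-Loc", "O",     "O", "B-Per"]]

# seqeval groups the tags into entities; an entity only counts as true positive
# if every one of its tokens is tagged correctly. Here, the Loc entity is only
# partially recovered, so precision = recall = 0.5 and E-F1 = 0.5.
print(f1_score(y_true, y_pred))
```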
Usually, IOB tagging is trained with a token-wise softmax cross-entropy loss. However, this setup of one feed-forward layer and a cross-entropy loss does not take the context of the tokens forming an entity into account. In the following, we will call this default approach of fine-tuning the BERT Default-Fine-Tuning. It can lead to inconsistent tagging: for example, an inner tag may only be preceded by an inner or beginning tag of the same entity, and violations of this rule have a devastating impact on the E-F1 score. Therefore, we propose and compare three modified strategies that include context to prevent inconsistent NER tagging during training or decoding. The first approach is a modification of the IOB tagging, the second proposal uses Linear Chain Conditional Random Fields (LCRFs), and the last applies rules to fix a predicted tagging.
Most papers on BERT models dealing with German NER, for example, [13] or [12], do not focus on an investigation of different variants for fine-tuning. However, there are already studies for NER tasks in other languages (e.g., [26,27]) that show that the application of LCRFs can be beneficial for fine-tuning. The authors of [27] also investigated whether it is advantageous for the fine-tuning of BERT models on NER tasks to combine the pre-trained BERT models with LSTM layers. However, these experiments did not prove to be successful.
4.2. Fine-Tuning with Linear Chain Conditional Random Field with NER-Rule (LCRF-NER)
Another approach to tackle inconsistent IOB tagging during fine-tuning of a BERT is based on Linear Chain Conditional Random Fields (LCRFs), a modification of Conditional Random Fields, both proposed in [28]. LCRFs are a common approach to train neural networks that model a sequential task and are therefore well suited for fine-tuning on NER. The basic idea is to take the classification of the neighboring sequence members into account when classifying an element of a sequence.
The output Y = (y_1, …, y_n) of our neural network for the NER task consists of a sequence of n vectors whose dimension corresponds to the number of tag classes c. LCRFs introduce so-called transition values T that form a c × c matrix of trainable weights; in the basic approach, T = W for a trainable weight matrix W. An entry T_{i,j} of this matrix T can be seen as the potential that a tag of class i is followed by a tag of class j. In one of the easiest forms of LCRFs, which we choose, decoding aims to find the tag sequence s* = (s*_1, …, s*_n) with the highest sum of corresponding transition values and elements of the corresponding output vectors, as shown in Equation (6):

s* = argmax_{s_1, …, s_n} ( \sum_{k=1}^{n} (y_k)_{s_k} + \sum_{k=2}^{n} T_{s_{k-1}, s_k} ).   (6)
Equation (6) is efficiently solved by the Viterbi algorithm (see, e.g., [29]). During training, a log-likelihood loss is calculated that takes the transition values T and the network output Y into account. The authors of [29] provide a detailed description of its implementation.
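For illustration, the following NumPy sketch (not taken from the paper's implementation) decodes Equation (6) with the Viterbi algorithm for a given network output Y (emission scores) and transition values T:

```python
import numpy as np

def viterbi_decode(emissions: np.ndarray, transitions: np.ndarray) -> list:
    """Find the tag sequence maximizing the score of Equation (6).

    emissions:   array of shape (n, c) -- the network output Y.
    transitions: array of shape (c, c) -- the transition values T.
    """
    n, c = emissions.shape
    # score[k, j]: best total score of a tag sequence for tokens 0..k ending in tag j
    score = np.zeros((n, c))
    backpointer = np.zeros((n, c), dtype=int)
    score[0] = emissions[0]
    for k in range(1, n):
        # candidate[i, j]: score of tagging token k-1 with i and token k with j
        candidate = score[k - 1][:, None] + transitions + emissions[k][None, :]
        backpointer[k] = candidate.argmax(axis=0)
        score[k] = candidate.max(axis=0)
    # follow the backpointers from the best final tag
    best = [int(score[-1].argmax())]
    for k in range(n - 1, 0, -1):
        best.append(int(backpointer[k, best[-1]]))
    return best[::-1]
```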
Since the IOB tagging does not allow all possible transitions, [30] tried to simply ban these forbidden transitions completely by assigning fixed, non-trainable, highly negative values to the associated entries in T. However, this did not lead to any improvement in performance; they were only able to show that it allows fine-tuning to converge faster when switching from the classic IOB tagging to the more detailed IOBES tagging scheme [30]. In contrast to them, we extend the original LCRF approach by explicitly modeling these forbidden transitions with additional trainable weights when computing the transition values T. In the following, we call our adapted algorithm LCRF-NER.
Assume an NER task comprises the set of entities {X_1, …, X_e}, which results in c = 2e + 1 classes following the IOB tagging scheme. Thus, besides a label O for unlabeled elements, for each entity X_k, there is a begin label B-X_k and an inner label I-X_k. For simplicity, we order these classes as B-X_1, …, B-X_e, I-X_1, …, I-X_e, O; that is, class k corresponds to B-X_k, class e + k corresponds to I-X_k (for k = 1, …, e), and class 2e + 1 corresponds to O. With respect to this ordering, we introduce the matrix F of all forbidden transitions as

F_{i,j} = 1 if e < j ≤ 2e and i ∉ {j − e, j}, and F_{i,j} = 0 otherwise.

Thus, an element is 1 if and only if a tag of class j cannot follow a tag of class i in the given NER task. This maps the constraint that the predecessor of an interior tag with label I-X can only carry the same interior label I-X or the corresponding begin label B-X.
In Figure 1, we illustrate the definition of F.
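As an illustration of this definition, the following hedged NumPy sketch builds F using 0-based class indices (in contrast to the 1-based indices of the text); the function name forbidden_transitions is our own:

```python
import numpy as np

def forbidden_transitions(e: int) -> np.ndarray:
    """Sketch of the matrix F for e entity classes under the ordering
    B-X_1..B-X_e, I-X_1..I-X_e, O (class indices 0..2e, 0-based)."""
    c = 2 * e + 1
    F = np.zeros((c, c))
    for k in range(e):
        j = e + k          # column of the inner label I-X_{k+1}
        F[:, j] = 1.0      # forbid every transition into I-X_{k+1} ...
        F[k, j] = 0.0      # ... except from the begin label B-X_{k+1}
        F[j, j] = 0.0      # ... and from I-X_{k+1} itself
    return F

A = 1.0 - forbidden_transitions(2)  # allowed transitions for e = 2
```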
Likewise, we define the matrix A of all allowed tag transitions by A_{i,j} = 1 − F_{i,j}. LCRF-NER introduces two additional trainable scalar weights a and b besides the weights W and constructs the transition values T by

T = A ⊙ W + F ⊙ (a · W + b),

where ⊙ is the point-wise product and b is added element-wise. If we set a = 1 and b = 0, this defaults to the original LCRF approach (T = W). In this way, the model can learn an absolute penalty by b and a relative penalty by a for forbidden transitions. Note that LCRF-NER is mathematically equivalent to LCRF; the only purpose is to simplify and to stabilize the training.
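The following minimal sketch illustrates this construction in NumPy; it reuses the forbidden_transitions helper from the sketch above, and the variable names a, b, and W are our own choices rather than those of the paper's code:

```python
import numpy as np

c = 2 * 2 + 1                 # e = 2 entity classes -> c = 5 tag classes
W = np.random.randn(c, c)     # trainable transition weights, as in a plain LCRF
a, b = 1.0, 0.0               # the two additional trainable scalars of LCRF-NER

F = forbidden_transitions(2)  # forbidden-transition mask from the sketch above
A = 1.0 - F                   # allowed-transition mask

# T = A * W + F * (a*W + b); with a = 1 and b = 0 this reduces to T = W,
# i.e., the original LCRF. During training, a and b can learn a relative
# and an absolute penalty for the forbidden transitions.
T = A * W + F * (a * W + b)
assert np.allclose(T, W)      # holds for the default a = 1, b = 0
```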
Author Contributions
Conceptualization, J.Z. and K.S.; Data curation, J.Z. and K.S.; Formal analysis, J.Z. and K.S.; Funding acquisition, R.L.; Investigation, J.Z. and K.S.; Methodology, J.Z. and K.S.; Software, J.Z., K.S. and C.W.; Supervision, R.L.; Validation, J.Z. and K.S.; Visualization, J.Z. and K.S.; Writing—original draft, J.Z. and K.S.; Writing—review and editing, J.Z., K.S., C.W. and R.L. All authors have read and agreed to the published version of the manuscript.
Funding
This work was funded by the European Social Fund (ESF) and the Ministry of Education, Science, and Culture of Mecklenburg-Western Pomerania (Germany) within the project NEISS (Neural Extraction of Information, Structure, and Symmetry in Images) under grant no ESF/14-BM-A55-0006/19.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Conflicts of Interest
The authors declare no conflict of interest.
Abbreviations
BERT | Bidirectional Encoder Representations from Transformers |
CSE | Class-Start-End |
IOB | Inside-Outside-Beginning |
LCRF | Linear Chain Conditional Random Field |
LER | Legal Entity Recognition |
MLM | Mask Language Model |
NER | Named Entity Recognition |
NLP | Natural Language Processing |
NSP | Next Sentence Prediction |
DFT | Default-Fine-Tuning |
SOP | Sentence Order Prediction |
WWA | Whole-Word Attention |
WWM | Whole-Word Masking |
References
- NEISS Project Neuronal Extraction of Information, Structures and Symmetries in Images. Available online: https://www.neiss.uni-rostock.de/en/ (accessed on 22 October 2021).
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems; The MIT Press: Cambridge, MA, USA, 2017; pp. 5998–6008. [Google Scholar]
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Volume 1, pp. 4171–4186. [Google Scholar]
- Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; Soricut, R. Albert: A lite bert for self-supervised learning of language representations. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
- Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. Roberta: A robustly optimized bert pretraining approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
- Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R.R.; Le, Q.V. Xlnet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems; The MIT Press: Cambridge, MA, USA, 2019; pp. 5753–5763. [Google Scholar]
- Hugging Face. Available online: https://huggingface.co/ (accessed on 22 October 2021).
- Wick, C.; Kühn, B.; Leifert, G.; Sperfeld, K.; Strauß, T.; Zöllner, J.; Grüning, T. tfaip—A Generic and Powerful Research Framework for Deep Learning based on Tensorflow. J. Open Source Softw. 2021, 6, 3297. [Google Scholar] [CrossRef]
- Attardi, G. WikiExtractor. 2015. Available online: https://github.com/attardi/wikiextractor (accessed on 15 February 2020).
- Hamborg, F.; Meuschke, N.; Breitinger, C.; Gipp, B. news-please: A Generic News Crawler and Extractor. In Proceedings of the 15th International Symposium of Information Science, Berlin, Germany, 13–15 March 2017; pp. 218–223. [Google Scholar] [CrossRef]
- Benikova, D.; Biemann, C.; Kisselew, M.; Pado, S. GermEval 2014 Named Entity Recognition Shared Task: Companion Paper. 2014. Available online: http://nbn-resolving.de/urn:nbn:de:gbv:hil2-opus-3006 (accessed on 10 November 2020).
- Labusch, K.; Neudecker, C.; Zellhöfer, D. BERT for Named Entity Recognition in Contemporary and Historic German. In Proceedings of the 15th Conference on Natural Language Processing, Erlangen, Germany, 8–11 October 2019; pp. 1–9. [Google Scholar]
- Chan, B.; Schweter, S.; Möller, T. German’s Next Language Model. arXiv 2020, arXiv:2010.10906. [Google Scholar]
- Riedl, M.; Padó, S. A Named Entity Recognition Shootout for German. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia, 15–20 July 2018; Volume 2: Short Papers. Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 120–125. [Google Scholar] [CrossRef]
- Leitner, E.; Rehm, G.; Moreno-Schneider, J. A Dataset of German Legal Documents for Named Entity Recognition. arXiv 2020, arXiv:2003.13016. [Google Scholar]
- Hahn, B.; Breysach, B.; Pischel, C. Hannah Arendt Digital. Kritische Gesamtausgabe. Sechs Essays. 2020. Available online: https://hannah-arendt-edition.net/3p.html (accessed on 22 October 2021).
- TEI-Consortium. Guidelines for Electronic Text Encoding and Interchange. 2017. Available online: https://tei-c.org/ (accessed on 22 October 2021).
- Buchholz, S.; Marsi, E. CoNLL-X Shared Task on Multilingual Dependency Parsing. In Proceedings of the Tenth Conference on Computational Natural Language Learning (CoNLL-X), New York, NY, USA, 8–9 June 2006; Association for Computational Linguistics: Stroudsburg, PA, USA, 2006; pp. 149–164. [Google Scholar]
- Schrade, M.T. DER STURM. Digitale Quellenedition zur Geschichte der internationalen Avantgarde. 2018. Available online: https://sturm-edition.de/id/S.0000001 (accessed on 22 October 2021).
- Rosendahl, J.; Tran, V.A.K.; Wang, W.; Ney, H. Analysis of Positional Encodings for Neural Machine Translation. In Proceedings of the IWSLT, Hong Kong, China, 2–3 November 2019. [Google Scholar]
- Shaw, P.; Uszkoreit, J.; Vaswani, A. Self-attention with relative position representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, LA, USA, 1–6 June 2018; Volume 2, pp. 464–468. [Google Scholar]
- Cui, Y.; Che, W.; Liu, T.; Qin, B.; Yang, Z.; Wang, S.; Hu, G. Pre-training with whole word masking for chinese bert. arXiv 2019, arXiv:1906.08101. [Google Scholar]
- Ramshaw, L.A.; Marcus, M.P. Text Chunking Using Transformation-Based Learning. In Natural Language Processing Using Very Large Corpora; Armstrong, S., Church, K., Isabelle, P., Manzi, S., Tzoukermann, E., Yarowsky, D., Eds.; Text, Speech and Language Technology; Springer: Dordrecht, The Netherlands, 1999; pp. 157–176. [Google Scholar] [CrossRef] [Green Version]
- Sang, E.F.; De Meulder, F. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. arXiv 2003, arXiv:cs/0306050. [Google Scholar]
- Nakayama, H. Seqeval: A Python Framework for Sequence Labeling Evaluation. 2018. Available online: https://github.com/chakki-works/seqeval (accessed on 22 October 2021).
- Luoma, J.; Pyysalo, S. Exploring Cross-sentence Contexts for Named Entity Recognition with BERT. arXiv 2020, arXiv:2006.01563. [Google Scholar]
- Souza, F.; Nogueira, R.; Lotufo, R. Portuguese Named Entity Recognition using BERT-CRF. arXiv 2020, arXiv:1909.10649. [Google Scholar]
- Lafferty, J.; McCallum, A.; Pereira, F. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of the International Conference on Machine Learning (ICML), Williamstown, MA, USA, 28 June–1 July 2001. [Google Scholar]
- Sutton, C.; McCallum, A. An Introduction to Conditional Random Fields. 2010. Available online: https://homepages.inf.ed.ac.uk/csutton/publications/crftutv2.pdf (accessed on 22 October 2021).
- Lester, B.; Pressel, D.; Hemmeter, A.; Ray Choudhury, S.; Bangalore, S. Constrained Decoding for Computationally Efficient Named Entity Recognition Taggers. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP, Online. 16–20 November 2020; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 1841–1848. [Google Scholar] [CrossRef]
- Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.S.; Dean, J. Distributed representations of words and phrases and their compositionality. Adv. Neural Inf. Process. Syst. 2013, 26, 3111–3119. [Google Scholar]
- Zaheer, M.; Guruganesh, G.; Dubey, A.; Ainslie, J.; Alberti, C.; Ontanon, S.; Pham, P.; Ravula, A.; Wang, Q.; Yang, L.; et al. Big bird: Transformers for longer sequences. arXiv 2020, arXiv:2007.14062. [Google Scholar]
- Beltagy, I.; Peters, M.E.; Cohan, A. Longformer: The long-document transformer. arXiv 2020, arXiv:2004.05150. [Google Scholar]
- Leitner, E.; Rehm, G.; Moreno-Schneider, J. Fine-Grained Named Entity Recognition in Legal Documents. In Semantic Systems. The Power of AI and Knowledge Graphs; Acosta, M., Cudré-Mauroux, P., Maleshkova, M., Pellegrini, T., Sack, H., Sure-Vetter, Y., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2019; pp. 272–287. [Google Scholar] [CrossRef] [Green Version]
- Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv 2019, arXiv:1910.01108. [Google Scholar]
Figure 1.
Example for the definition of the matrix F of all forbidden transitions for two entities X_1 and X_2. If we follow the IOB tagging scheme, red arrows mark forbidden transitions between two sequence elements that lead to an entry of 1 in F.
Figure 2.
Left: Multi-Head Attention. Right: Our concept of Whole-Word Attention, where classical Multi-Head Attention is applied to words instead of tokens. Orange: members of sequences whose length is the number of tokens n. Red: members of sequences whose length is the number of words m. For a better overview, we denote by t_j the number of tokens the j-th word consists of.
Figure 3.
The E-F1 score for a BERT trained on the Mask Language Model (MLM) task with Whole-Word Masking (WWM), for relative and absolute positional encoding, fine-tuned with Default-Fine-Tuning (see Section 4) and tested on seven parts of the test list, which are sorted by token length, starting with short sequences. Left shows the mean over all datasets, and right shows the results on one dataset (H. Arendt).
Table 1.
Distribution of NER entities in H. Arendt Edition. Column “Original attributes” lists which attributes from the original TEI files were combined into one “Entity” for the NER dataset. On average, an entity consists of 1.36 words.
Entity | # All | # Train | # Test | # Devel | Original Attributes |
---|
person | 1702 | 1303 | 182 | 217 | person, biblicalFigure, ficticiousPerson, deity, mythologicalFigure |
place | 1087 | 891 | 111 | 85 | place, country |
ethnicity | 1093 | 867 | 115 | 111 | ethnicity |
organization | 455 | 377 | 39 | 39 | organization |
event | 57 | 49 | 6 | 2 | event |
language | 20 | 14 | 4 | 2 | language |
unlabeled words | 153,223 | 121,154 | 16,101 | 15,968 | |
Table 2.
Distribution of NER entities in the Sturm Edition. On average, an entity consists of 1.12 words.
Entity | # All | # Train | # Test | # Devel |
---|
person | 930 | 763 | 83 | 84 |
date | 722 | 612 | 59 | 51 |
place | 492 | 374 | 59 | 59 |
unlabeled words | 33,809 | 27,047 | 3306 | 3456 |
Table 3.
IOB tagging example with unlabeled words (O) and the two entities “location” (Loc) and “person” (Per). The first tag of each entity is prefixed with “B-”, while all the following tokens of that entity are marked with an “I-”. The first row contains the words of the sentence, which are split into one or more tokens (second row). The third row shows the tagged tokens based on the given entities (last row). The example sentence can be translated as “Peter lives in Frankfurt am Main”.
Words | Peter | lebt | in | Frankfurt | am | Main |
Tokens | Peter | lebt | in | Frank | _furt | am | Main |
Tagged Tokens | B-Per | O | O | B-Loc | I-Loc | I-Loc | I-Loc |
Entities | Person | | | Location |
Table 4.
CSE tagging example. Rows refer to the tokens and its respective target for start, end, and class.
Tokens | Peter | lebt | in | Frank | _furt | am | Main |
Start | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
End | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
Class | Per | O | O | Loc | Loc | Loc | Loc |
Table 5.
Example for the Entity-Fix rule. Rows refer to the tokens, its respective target, prediction, and the prediction resulting from decoding with the Entity-Fix rule. Changes are emphasized in bold.
Tokens | Peter | lebt | in | Frank | _furt | am | Main |
Target | B-Per | O | O | B-Loc | I-Loc | I-Loc | I-Loc |
Prediction | I-Per | O | O | B-Loc | I-Org | I-Org | I-Loc |
Prediction with Fix-Rule | B-Per | O | O | B-Loc | I-Loc | I-Loc | I-Loc |
Table 6.
Columns refer to the average E-F1 score (cf. [24]) of three fine-tuning runs with Default-Fine-Tuning (see Section 4) for five datasets and its standard deviation σ multiplied by 100. Rows refer to the respective pre-training task, absolute or relative positional encoding (PE), and use of Whole-Word Masking (WWM); best results within σ of the maximum (best) are emphasized.
Pre-Train Task | PE | WWM | GermEval | H. Arendt | Sturm | LER CG | LER FG | Average |
---|
E-F1 | σ | E-F1 | σ | E-F1 | σ | E-F1 | σ | E-F1 | σ | E-F1 |
---|
MLM | abs. | - | 0.7785 | 0.09 | 0.7600 | 1.28 | 0.8236 | 1.20 | 0.9061 | 0.48 | 0.8969 | 0.37 | 0.8330 |
MLM | abs. | √ | 0.7901 | 0.21 | 0.7681 | 0.40 | 0.8102 | 1.73 | 0.9192 | 0.12 | 0.9028 | 0.21 | 0.8381 |
MLM | rel. | - | 0.7852 | 0.16 | 0.7674 | 0.54 | 0.8011 | 1.45 | 0.9198 | 0.34 | 0.9144 | 0.27 | 0.8376 |
MLM | rel. | √ | 0.8086 | 0.24 | 0.7741 | 0.79 | 0.8555 | 0.34 | 0.9347 | 0.30 | 0.9206 | 0.14 | 0.8587 |
MLM, NSP | abs. | - | 0.7609 | 0.10 | 0.7688 | 1.21 | 0.8085 | 0.20 | 0.9067 | 0.36 | 0.8928 | 0.56 | 0.8275 |
MLM, NSP | abs. | √ | 0.7484 | 1.52 | 0.7621 | 0.40 | 0.8110 | 1.38 | 0.9047 | 0.30 | 0.8930 | 0.30 | 0.8238 |
MLM, NSP | rel. | - | 0.7802 | 0.40 | 0.7471 | 0.48 | 0.8172 | 1.37 | 0.9193 | 0.39 | 0.9105 | 0.18 | 0.8348 |
MLM, NSP | rel. | √ | 0.7750 | 0.23 | 0.7564 | 1.15 | 0.8191 | 0.40 | 0.9168 | 0.07 | 0.9044 | 0.38 | 0.8343 |
MLM, SOP | abs. | - | 0.7530 | 0.18 | 0.7669 | 0.71 | 0.7942 | 1.20 | 0.8979 | 0.25 | 0.8817 | 0.14 | 0.8188 |
MLM, SOP | abs. | √ | 0.7355 | 0.38 | 0.7590 | 0.21 | 0.7995 | 2.40 | 0.9047 | 0.17 | 0.8950 | 0.35 | 0.8187 |
MLM, SOP | rel. | - | 0.7745 | 0.58 | 0.7548 | 0.52 | 0.8052 | 0.62 | 0.9176 | 0.24 | 0.9040 | 0.48 | 0.8312 |
MLM, SOP | rel. | √ | 0.7863 | 0.31 | 0.7807 | 0.41 | 0.8550 | 0.15 | 0.9246 | 0.24 | 0.9122 | 0.38 | 0.8518 |
Table 7.
Average E-F1 score over all five NER tasks and three fine-tuning runs. Rows refer to the respective pre-training task, absolute or relative positional encoding (PE), and use of Whole-Word Masking (WWM); columns refer to the fine-tuning methods Default-Fine-Tuning (DFT), Class-Start-End (CSE), Linear Chain Conditional Random Field (LCRF), and LCRF-NER; best results per column are emphasized.
Pre-Training Task | PE | WWM | Avg. DFT | Avg. CSE | Avg. LCRF | Avg. LCRF-NER |
---|
MLM | abs. | - | 0.8330 | 0.8608 | 0.8455 | 0.8469 |
MLM | abs. | √ | 0.8381 | 0.8635 | 0.8443 | 0.8516 |
MLM | rel. | - | 0.8376 | 0.8682 | 0.8477 | 0.8539 |
MLM | rel. | √ | 0.8587 | 0.8785 | 0.8623 | 0.8651 |
MLM, NSP | abs. | - | 0.8275 | 0.8536 | 0.8346 | 0.8418 |
MLM, NSP | abs. | √ | 0.8238 | 0.8543 | 0.8358 | 0.8416 |
MLM, NSP | rel. | - | 0.8348 | 0.8639 | 0.8457 | 0.8471 |
MLM, NSP | rel. | √ | 0.8343 | 0.8595 | 0.8417 | 0.8459 |
MLM, SOP | abs. | - | 0.8188 | 0.8515 | 0.8292 | 0.8325 |
MLM, SOP | abs. | √ | 0.8187 | 0.8597 | 0.8312 | 0.8369 |
MLM, SOP | rel. | - | 0.8312 | 0.8598 | 0.8450 | 0.8473 |
MLM, SOP | rel. | √ | 0.8518 | 0.8714 | 0.8557 | 0.8605 |
Overall Average | | | 0.8340 | 0.8608 | 0.8432 | 0.8476 |
Table 8.
Average E-F1 score of three fine-tuning runs on BERTs pre-trained with Mask Language Model (MLM), relative positional encoding, and Whole-Word Masking (WWM). Rows refer to the fine-tuning methods Default-Fine-Tuning (DFT), Class-Start-End (CSE), Linear Chain Conditional Random Field (LCRF), and LCRF-NER; columns refer to the NER task; best results per column are emphasized.
Fine-Tuning Task | Entity-Fix Rule | GermEval | H. Arendt | Sturm | LER CG | LER FG | Average |
---|
DFT | - | 0.8086 | 0.7741 | 0.8555 | 0.9347 | 0.9206 | 0.8587 |
DFT | √ | 0.8408 | 0.7903 | 0.8706 | 0.9474 | 0.9427 | 0.8783 |
CSE | - | 0.8397 | 0.8048 | 0.8647 | 0.9429 | 0.9401 | 0.8785 |
CSE | √ | 0.8397 | 0.8048 | 0.8647 | 0.9429 | 0.9401 | 0.8785 |
LCRF | - | 0.8216 | 0.7822 | 0.8453 | 0.9365 | 0.9261 | 0.8623 |
LCRF | √ | 0.8422 | 0.7941 | 0.8629 | 0.9477 | 0.9410 | 0.8776 |
LCRF-NER | - | 0.8220 | 0.7857 | 0.8508 | 0.9394 | 0.9278 | 0.8651 |
LCRF-NER | √ | 0.8448 | 0.7999 | 0.8783 | 0.9488 | 0.9455 | 0.8823 |
Table 9.
Average E-F1 score of three fine-tuning runs with LCRF-NER on BERTs pre-trained with Mask Language Model (MLM), relative positional encoding, and Whole-Word Masking (WWM). Rows refer to the use of WWA and the maximal sequence length in pre-training; columns refer to the use of the Entity-Fix rule.
WWA | Max Seq. Length Pre-Train | Avg. E-F1 | Avg. E-F1 with Entity-Fix Rule |
---|
- | 320 token | 0.8651 | 0.8823 |
√ | 320 token | 0.8676 | 0.8832 |
√ | 300 words | 0.8674 | 0.8832 |
Table 10.
Comparison of our best results on the two LER tasks with the previous state-of-the-art models of [34], where T-F1 and E-F1 are the token- and entity-wise F1 scores, respectively.
Task | Model | T-F1 | E-F1 |
---|
LER CG | Previous SoTA [34] | 0.9595 | - |
LER CG | our best | 0.9842 | 0.9488 |
LER FG | Previous SoTA [34] | 0.9546 | - |
LER FG | our best | 0.9811 | 0.9455 |
Table 11.
Comparison of our best results on the GermEval task with other state-of-the-art models, where E-F1 is the entity-wise F1 score. Our shown result is again the average of 3 fine-tuning runs. The score of DistilBERT was taken from https://huggingface.co/dbmdz/flair-distilbert-ner-germeval14, accessed on 19 October 2021.
Model | Params | E-F1 |
---|
BiLSTM-WikiEmb [14] | - | 0.8293 |
DistilBERT [35] | 66 mio | 0.8563 |
[13] | 110 mio | 0.8798 |
[13] | 335 mio | 0.8895 |
our best on small BERTs | 34 mio | 0.8448 |
Table 12.
Comparison of some technical attributes between the 110-million-parameter model from [13] and our small BERTs. The operations per training are a coarse estimation based on the theoretical compute power and pre-training time.
| Model from [13] | Our Small BERTs |
---|
Parameter | 110 mio | 34 mio |
Pre-training hardware | 1×TPU v3 | 1× GPU |
Memory | 128 GB | 11 GB |
Compute power | 420 TFLOPS | 14 TFLOPS |
Tokens seen in pre-training | | |
Pre-training time | ≈7 days | ≈9 days |
Operations per training | ≈254 EFLOP | ≈11 EFLOP |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).