Jannik Strötgen

Domain-Sensitive Temporal Tagging



have to consider when processing news-style, narrative-style, colloquial-style, and so-called autonomic-style documents, the latter covering documents that contain many temporal expressions that cannot be normalized to real points in time, but only according to some local or autonomic time frame. Examples of autonomic-style documents are specific types of scientific texts and literary works.

      We believe that this book provides researchers, practitioners, and developers with a valuable resource for designing and improving temporal tagging techniques and tools, or just for applying them in a useful manner as part of more complex text analysis and exploration pipelines. While publicly available temporal taggers already provide sophisticated output for several application scenarios, there is still a lot of work ahead of us in this area. This book aims to provide a solid foundation on which such work can be built.

      Jannik Strötgen and Michael Gertz

      Saarbrücken, Germany and Heidelberg, Germany

      July 2016

       Acknowledgments

      This book gives an in-depth overview of methods, tools, and techniques of temporal tagging in different domains. Judging by the number of publications and evaluation competitions in the past few years, this field is clearly attracting enormous interest from the research community and industry. We thus would like to thank all researchers who actively contribute new ideas to this field, organize evaluation competitions, and provide temporal tagging tools and resources for other researchers and the public.

      Although this book is about temporal tagging in general and not just about our temporal tagger HeidelTime, we want to take the opportunity to thank all contributors to HeidelTime for their great work and its many users for helpful feedback that further improved the tool. We also would like to thank the many students at Heidelberg University who contributed in the form of student projects and bachelor's and master's theses.

      In particular, we thank Anne-Lyse Minard and Steven Bethard for their valuable reviews of the draft of this book. They put a lot of effort into the reviews and provided numerous valuable comments as well as suggestions to significantly improve the book. Finally, we want to thank the series editor Graeme Hirst for his great support and his instant replies to all our questions. It is time for a big thank you!

      CHAPTER 1

       Introduction

      Temporal tagging is a specific task in natural language processing (NLP), in which temporal expressions are extracted from text documents and normalized to some standard format. Since temporal expressions are prevalent in many types of documents and because temporal information is an important dimension in any information space, applications in several domains can benefit from the output of temporal taggers.

      This book covers the topic of temporal tagging and is structured as follows. In this chapter, we describe the task of temporal tagging, and then present some examples of NLP and NLP-related application scenarios in which temporal information can be exploited to provide more meaningful and useful results. In Chapter 2, we provide background knowledge and cover basic concepts related to temporal information. The foundations of temporal tagging are described in Chapter 3, and temporal tagging of different types of documents, and thus domain-sensitive temporal tagging, is explained in Chapter 4. An overview of existing techniques and tools for temporal tagging, including our own system HeidelTime, is provided in Chapter 5. Finally, future research directions are discussed in Chapter 6. First, however, to ensure that two important terms frequently used in this book are understood correctly, we start by defining the concepts “temporal expression” and “value of a temporal expression”.

      • A temporal expression is either an expression referring to a date or time of any granularity (e.g., “March 11, 2007”, “yesterday”, “June 2016”, “20th century”, “9 pm”), an expression referring to a duration (e.g., “three years”, “several months”), or an expression referring to the periodical aspect of an event (e.g., “every Monday”, “twice a week”).

      • The value (of a temporal expression) covers the (most important) semantics of the temporal expression in a standard format, that is, the normalized information of the expression.
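      The two definitions above can be pictured as a small data structure. The following is only a minimal sketch: the names `ExprKind` and `TemporalExpression` are hypothetical, the three categories simply mirror the wording of the first definition, and the ISO 8601-style values are an assumption made for illustration, since the annotation standards that fix the actual value format are introduced later in the book.

```python
from dataclasses import dataclass
from enum import Enum


class ExprKind(Enum):
    """Hypothetical categories that mirror the definition above."""
    DATE_OR_TIME = "date/time"   # a date or time of any granularity
    DURATION = "duration"        # a length of time
    PERIODIC = "periodic"        # the periodical aspect of an event


@dataclass(frozen=True)
class TemporalExpression:
    text: str        # surface form as it appears in the document
    kind: ExprKind   # illustrative category (an assumption, see lead-in)
    value: str       # normalized value; ISO 8601-style notation is assumed


# Only expressions whose value needs no further context are listed here.
EXAMPLES = [
    TemporalExpression("March 11, 2007", ExprKind.DATE_OR_TIME, "2007-03-11"),
    TemporalExpression("June 2016", ExprKind.DATE_OR_TIME, "2016-06"),
    TemporalExpression("three years", ExprKind.DURATION, "P3Y"),
]

for expr in EXAMPLES:
    print(f"{expr.text!r:18} {expr.kind.value:10} {expr.value}")
```

      Expressions such as “yesterday” or “every Monday” are deliberately left out of the sketch, because their values depend on context or on conventions of a particular annotation standard.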

      Examples of and more details about different types of temporal expressions and annotation standards for temporal expressions will be covered later in this book, but these definitions are crucial to understand the task of temporal tagging, which is defined and explained next.

      Temporal tagging addresses the extraction, classification, and normalization of temporal expressions occurring in text documents. It is a prerequisite of the full task of temporal annotation (temporal information extraction), which concerns the detection and interpretation of temporal expressions, events, and temporal relations between events and between temporal expressions and events [Verhagen et al., 2009]. However, temporal tagging is not only valuable in the context of temporal information extraction, but also in many research areas and application scenarios as will be detailed in Section 1.2.

      In general, temporal tagging can be considered a specific type of named entity recognition and normalization. Although the three standard named entity types are person, organization, and location [Nadeau and Sekine, 2007], “the notion of named entity is commonly extended to include things that are not entities per se, but nevertheless have practical importance and do have characteristic signatures that signal their presence” [Jurafsky and Martin, 2008, p. 762]. Thus, further types of information are sometimes also covered under the named entity umbrella, for example, genes and proteins, numbers, and temporal expressions.

      The classical tasks of named entity recognition (NER) tools are to identify the spans of named entities in texts and to classify the extracted named entities into pre-defined classes of entities. The normalization of entities to a unique identifier or some value in a standard format is only performed if named entity normalization is addressed, too; depending on the type of entity, this step is also referred to as disambiguation, linking, or resolution. In contrast, a temporal tagger identifies the spans of temporal expressions in texts and normalizes the expressions according to some standard format. Depending on the annotation specifications, expressions are sometimes also classified according to their type, e.g., whether an expression is a date (e.g., May 3, 2009) or a duration (e.g., three days). However, this classification of temporal expressions can be considered part of the normalization process, and thus one can specify the two subtasks of temporal tagging as follows.

      • Extraction: given a text, determine the spans of all temporal expressions.

      • Normalization: given a text and a set of extracted temporal expressions, assign the temporal semantics to each expression in the form of normalized values in a standard format that adheres to some annotation specification.
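      The two subtasks can be sketched as two separate functions. This is a toy illustration under stated assumptions, not the method of HeidelTime or of any other tagger: the function names `extract` and `normalize` are hypothetical, the single regular expression covers only dates of the form “March 11, 2007” and the word “yesterday”, the output uses ISO 8601-style dates, and a reference date (here, an assumed document creation date) is passed in to resolve the relative expression.

```python
import re
from datetime import date, timedelta

MONTHS = {
    "January": 1, "February": 2, "March": 3, "April": 4,
    "May": 5, "June": 6, "July": 7, "August": 8,
    "September": 9, "October": 10, "November": 11, "December": 12,
}

# Toy pattern: explicit dates such as "March 11, 2007", or the word "yesterday".
PATTERN = re.compile(
    r"\b(?P<month>" + "|".join(MONTHS) + r") (?P<day>\d{1,2}), (?P<year>\d{4})\b"
    r"|\b(?P<rel>yesterday)\b"
)


def extract(text):
    """Extraction: return (start, end, surface form) for every match."""
    return [(m.start(), m.end(), m.group(0)) for m in PATTERN.finditer(text)]


def normalize(expression, reference_date):
    """Normalization: map one extracted expression to an ISO 8601-style date."""
    m = PATTERN.fullmatch(expression)
    if m is None:
        raise ValueError(f"unsupported expression: {expression!r}")
    if m.group("rel"):
        # Relative expressions are resolved against the reference date.
        return (reference_date - timedelta(days=1)).isoformat()
    return date(
        int(m.group("year")), MONTHS[m.group("month")], int(m.group("day"))
    ).isoformat()


text = "The contract was signed on March 11, 2007, and announced yesterday."
reference = date(2016, 7, 15)  # assumed document creation date

spans = extract(text)
values = [normalize(surface, reference) for _, _, surface in spans]
for (start, end, surface), value in zip(spans, values):
    print(f"[{start}:{end}] {surface!r} -> {value}")
```

      Keeping the two steps apart reflects the task definition: extraction only needs the text, whereas normalization may need additional context such as the reference date.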

      Figure 1.1 illustrates the two tasks of a temporal tagger. Given a text document (left), the tagger determines the temporal expressions (middle) and assigns a normalized value in a standard format to each identified temporal expression (right). In Chapter 3, we will give an overview of existing annotation standards for temporal expressions. These define what should be considered a temporal expression and how temporal expressions are to be normalized. Before that, however, we will first outline some application scenarios in which temporal expressions can be exploited, and then have a closer look at the concept of time in Chapter 2.