4.5 Sense-annotated Corpora
4.6 Chapter Conclusion
5 Advanced Disambiguation Methods
5.1 Automatic Knowledge Base Construction
5.2 Distant Supervision
5.2.1 Method
5.2.2 Overview of Work in this Area
5.3 Continuous Vector Space Models of KBs
5.3.1 Method
5.3.2 Overview of Work in this Area
5.4 Chapter Conclusion
6.1 Multilingual Semantic Relatedness
6.2 Computer-aided Translation
6.2.1 Overview of Work in this Area
6.2.2 Illustrative Example
6.3 Chapter Conclusion
7.2 Curation Interfaces
7.3 Resource APIs for Text Processing
7.4 Chapter Conclusion
8.2 Outlook
Foreword
Lexical semantic knowledge is vital for most tasks in natural language processing (NLP). Such knowledge has been captured through two main approaches. The first is the knowledge-based approach, in which human linguistic knowledge is encoded directly in a structured form, resulting in various types of lexical knowledge bases. The second is the corpus-based approach, in which lexical semantic knowledge is learned from corpora and then represented either explicitly or implicitly.
Historically, the knowledge-based approach preceded the corpus-based one, although the latter has dominated the center stage of NLP research in recent decades. Yet the development and use of lexical knowledge bases (LKBs) has continued to be a major thread. One illustration of this fact is the number of citations for the fundamental 1998 WordNet book [Fellbaum, 1998a], over 12,000 at the time of writing (according to Google Scholar), which somewhat exceeds the number of citations for the primary textbook on statistical NLP from about the same period [Manning and Schütze, 1999]. Despite the overwhelming success of corpus-based methods, whether supervised or unsupervised, their output can be quite noisy, particularly when it comes to modeling fine-grained lexical knowledge such as distinct word senses or concrete lexical semantic relationships. Human encoding, on the other hand, provides more precise knowledge at this fine-grained level. The continued popular use of LKBs, and particularly of WordNet, indicates that they still provide substantial information that is complementary to corpus-based methods (see Shwartz et al. [2015] for a concrete evaluation showing the complementary behavior of corpus-based word embeddings and information from multiple LKBs).
While WordNet has been by far the most widely used lexical resource, it does not provide the full spectrum of needed lexical knowledge, which brings us to the theme of the current book. As reviewed in Chapter 2, additional lexical information has been encoded in quite a few LKBs, either by experts or by web communities through collaborative efforts. In particular, collaborative resources offer the opportunity to obtain much larger and more frequently updated resources than is possible with expert work. Knowledge resources like Wikipedia or Wikidata include vast lexical information about individual entities and domain-specific terminology across many domains, which falls beyond the scope of WordNet. Hence, it would be ideal for NLP technology to utilize, in an integrated manner, the union of information available in a multitude of lexical resources. As an illustrative example, consider an application setting, like a question answering scenario, which requires knowing that Deep Purple was a group of people. We may find in Wikipedia that it was a “band,” map this term to its correct sense in WordNet, and then follow a hypernymy chain to “organization,” whose definition includes “a group of people.”
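The traversal step of this example can be made concrete with NLTK's WordNet interface. The following is a minimal sketch (assuming the nltk package and its wordnet data are installed); it sidesteps sense selection, which is precisely the disambiguation problem this book addresses, by simply testing every noun sense of “band” for a hypernymy path to the organization.n.01 synset.

```python
# Minimal sketch of the Deep Purple example using NLTK's WordNet API.
# Requires: pip install nltk, then nltk.download('wordnet').
from nltk.corpus import wordnet as wn

target = wn.synset('organization.n.01')
print('organization:', target.definition())  # gloss includes "a group of people"

# Which noun senses of "band" reach "organization" via hypernymy?
for synset in wn.synsets('band', pos=wn.NOUN):
    # Transitive closure of the hypernymy relation.
    hypernyms = set(synset.closure(lambda s: s.hypernyms()))
    if target in hypernyms:
        print(synset.name(), '->', synset.definition())
```

Only the musical-group senses of “band” print here; picking that sense automatically, given the Wikipedia context, is the linking and disambiguation task discussed throughout the book.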
As hinted at in the above example, such resource integration requires effective methods for linking, or aligning, the word senses or concepts encoded in the various resources. Accordingly, the main technical focus of this book is existing resource integration efforts, resource linking algorithms, and the utility of such algorithms within disambiguation tasks. Hence, this book should be of high value first for researchers interested in creating or linking LKBs, as well as for developers of NLP algorithms and applications who would like to leverage linked lexical resources. An important aspect is the development and use of linked lexical resources in multiple languages, addressed in Chapter 6.
Looking forward, perhaps the most interesting research prospect for linked lexical knowledge bases is their integration with corpus-based machine learning approaches. A relatively simple form of combining the information in LKBs with corpus-based information is to use the former, via distant supervision, to create training data for the latter (discussed in Section 5.2). A more fundamental research direction is to create a unified knowledge representation framework, which directly integrates the human-encoded information in LKBs with information obtained by corpus-based methods. A promising framework for such integrated representation has emerged recently under the “embedding” paradigm, where dense continuous vectors are used to represent linguistic objects, as reviewed in Section 5.3. Such representations, i.e., embeddings, were initially created separately: from corpus data, based on corpus co-occurrences, and from knowledge bases, leveraging their rich internal structure. Subsequent research has suggested methods for creating unified representations, based on hybrid objective functions that consider both corpus and knowledge base structure. While this line of research is still in its initial phases, it has the potential to truly integrate corpus-based and human-encoded knowledge, and thus unify these two research endeavors, which have mostly been pursued separately in the past. From this perspective, and assuming that human-encoded lexical knowledge can provide useful information on top of corpus-based information, the current book should be useful for any researcher who aims to advance the state of the art in lexical semantics.
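To give the flavor of such combinations, here is a minimal sketch of one simple post-hoc variant: retrofitting corpus-derived word vectors to a lexical knowledge graph, in the spirit of the retrofitting approach of Faruqui et al. (2015). The function name, toy vectors, and uniform weights are illustrative assumptions, not a recipe from this book.

```python
# Sketch: pull each word's vector toward its KB neighbors while keeping
# it anchored to its original corpus-based vector (retrofitting).
import numpy as np

def retrofit(vectors, edges, alpha=1.0, beta=1.0, iterations=10):
    """vectors: dict word -> np.ndarray (corpus embeddings)
    edges:   dict word -> list of KB-related words (e.g., synonyms)"""
    new_vectors = {w: v.copy() for w, v in vectors.items()}
    for _ in range(iterations):
        for word, neighbors in edges.items():
            neighbors = [n for n in neighbors if n in new_vectors]
            if word not in new_vectors or not neighbors:
                continue
            # Weighted average of the original corpus vector and the
            # current vectors of the word's KB neighbors.
            neighbor_sum = sum(new_vectors[n] for n in neighbors)
            new_vectors[word] = (alpha * vectors[word] + beta * neighbor_sum) \
                / (alpha + beta * len(neighbors))
    return new_vectors

# Toy usage: a KB synonymy edge pulls two corpus vectors together.
corpus_vecs = {'band': np.array([1.0, 0.0]), 'ensemble': np.array([0.0, 1.0])}
kb_edges = {'band': ['ensemble'], 'ensemble': ['band']}
for word, vec in retrofit(corpus_vecs, kb_edges).items():
    print(word, vec)
```

Retrofitting adjusts vectors after corpus training; the more fundamental direction sketched above would instead optimize a single hybrid objective over corpus co-occurrences and knowledge base structure jointly during training.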
While considering the integration of implicit corpus-based and explicit human-encoded information, we may notice that the joint embedding approach goes the “implicit way.” While joint embeddings do encode information coming from both types of resources, this information is encoded in opaque continuous vectors, which are not immediately interpretable, thus losing the transparency of the original symbolically encoded human knowledge. Indeed, developing methods for interpreting embedding-based representations is an actively pursued research theme, but it remains to be seen whether such attempts will succeed in preserving the transparency of the original human-encoded knowledge.