Automatic Text Simplification. Horacio Saggion.

Automatic Text Simplification

which is more accessible than plain Wikipedia articles because of the use of simple language and simple grammatical structures. There are also initiatives which aim to give access to easy-to-read material in particular, and web accessibility in general, the status of a legal right.

The number of websites containing manually simplified material listed above clearly indicates a need for simplified texts. However, manual simplification of written documents is very expensive, and manual methods will not be cost-effective, especially given that news is produced constantly and simplification would therefore need to keep the same pace. Nevertheless, there is a growing need for methods and techniques to make texts more accessible. For example, people with learning disabilities who need simplified text constitute 5% of the population. However, according to data from the Easy-to-Read Network,¹⁵ if we consider people who cannot read documents with heavy information load or documents from authorities or governmental sources, the percentage of people in need of simplification jumps to 25% of the population.¹⁶ In addition, the need for simplified texts is becoming more important because the incidence of disability increases as the population ages.

1.5 STRUCTURE OF THE BOOK

Having briefly introduced what automatic text simplification is and the need for such technology, we devote the rest of the book to a number of relevant research methods in the field which have been the object of scientific inquiry for more than 20 years. Needless to say, many relevant works will not be addressed here; however, we have tried to cover most of the techniques which have been used, or are being used, at the time of writing. In Chapter 2, we will provide an overview of the topic of readability assessment given its current relevance in many approaches to automatic text simplification. In Chapter 3, we will address techniques which have been proposed for replacing words and phrases by simpler equivalents: the lexical simplification problem. In Chapter 4, we will cover techniques which can be used to simplify the syntactic structure of sentences and phrases, with special emphasis on rule-based, linguistically motivated approaches. Then in Chapter 5, machine learning techniques, optimization, and other statistical techniques to “learn” simplification systems will be described. Chapters 6 and 7 cover closely related topics: in Chapter 6 we will present fully fledged text simplification systems which have specific target populations as users, while in Chapter 7 we will cover sub-systems or methods specifically based on targeted tasks or user characteristics. In Chapter 8, we will cover two important topics: the available datasets for experimentation in text simplification and the current text simplification evaluation techniques. Finally, in Chapter 9, we close with an overview of the field and a critical view of the current state of the art.

¹ http://literacynet.org/

² http://www.lecturafacil.net/

³ http://www.lattlast.se/

⁴ http://8sidor.lattlast.se

⁵ http://www.klartale.no

⁶ http://www.journal-essentiel.be/

⁷ http://www.wablieft.be

⁸ http://www.dr.dk/Nyheder/Ligetil/Presse/Artikler/om.htm

⁹ http://www.dueparole.it

¹⁰ http://papunet.net/selko

¹¹ http://www.noticiasfacil.es

¹² http://www.literacyworks.org/learningresources/

¹³ http://www.inclusion-europe.org

¹⁴ http://simple.wikipedia.org

¹⁵ http://www.easytoread-network.org/

¹⁶ Bror Tronbacke, personal communication, December 2010.

CHAPTER 2

Readability and Text Simplification

A key question in text simplification research is the identification of the complexity of a given text so that a decision can be made on whether or not to simplify it. Identifying the complexity of a text or sentence can help assess whether the output produced by a text simplification system matches the reading ability of the target reader. It can also be used to compare different systems in terms of the complexity or simplicity of the output they produce. There are a number of very complete surveys on the relevant topic of text readability, which can be understood as “what makes some texts easier to read than others” [Benjamin, 2012; Collins-Thompson, 2014; DuBay, 2004]. Text readability, which has been investigated for a long time in academic circles, is very close to the “to simplify or not to simplify” question in automatic text simplification. Text readability research has often attempted to devise mechanical methods to assess the reading difficulty of a text so that it can be objectively measured. Classical mechanical text readability formulas combine a number of proxies to obtain a numerical score indicative of the difficulty of a text. These scores can be used to place texts in an appropriate grade level or to sort texts by difficulty.
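To make the idea of a classical formula concrete, the following minimal sketch computes two proxies (average words per sentence and average syllables per word) and combines them into a single score. It uses the Flesch-Kincaid Grade Level formula as one well-known example; that formula is not named in the passage above and is chosen here only for illustration. The function names, the regular-expression tokenization, and the vowel-group syllable counter are all assumptions of this sketch, and the syllable counter in particular is a crude heuristic rather than a real syllabifier.

```python
import re


def count_syllables(word: str) -> int:
    """Crude heuristic (an assumption of this sketch): one syllable per vowel group."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def proxies(text: str) -> tuple[float, float]:
    """Return (average words per sentence, average syllables per word)."""
    # Naive segmentation: sentences end at ., ! or ?; words are runs of letters.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = sum(count_syllables(w) for w in words) / len(words)
    return words_per_sentence, syllables_per_word


def flesch_kincaid_grade(text: str) -> float:
    """Combine the two proxies with the Flesch-Kincaid Grade Level weights."""
    words_per_sentence, syllables_per_word = proxies(text)
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


# Illustrative inputs (invented for this sketch), sorted from easiest to hardest.
texts = [
    "Automatic simplification necessitates considerable computational sophistication.",
    "The cat sat. The dog ran.",
]
ranked = sorted(texts, key=flesch_kincaid_grade)
```

Because the score depends only on averages, a text made of short sentences and short words always ranks as easier, whatever its content; this is the kind of limitation that later approaches to readability assessment try to overcome.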

2.1 INTRODUCTION

Collins-Thompson [2014], citing Dale and Chall [1948b], defines text readability as the sum of all elements in textual material that affect a reader’s understanding, reading speed, and level of interest in the material. The ability to quantify the readability of a text has long been a topic of research, but current technology and the availability of massive amounts of text in electronic form have changed research in computational readability assessment considerably. Today’s algorithms take advantage of advances in natural language processing, cognition, education, psycholinguistics, and linguistics (“all elements in textual material”) to model a text in such a way that a machine learning algorithm can be trained to compute readability scores for texts. Traditional readability measures were based on the semantic familiarity of words and the syntactic complexity of sentences. Proxies to measure such elements are, for example, the number of syllables in words or the average number of words per sentence. Most traditional approaches used averages over the set of basic elements (words or sentences) in the text, disregarding order and therefore discourse phenomena. The limitations of early approaches were always clear: words with many syllables are not necessarily complex (e.g., children are probably able to read or understand complex dinosaur names or names of Star Wars characters before more common words are acquired), and short sentences are not necessarily easy to understand (lines of poetry, for example). Also, traditional formulas were usually designed for texts that were well formatted (not web data) and relatively long. Most methods depend on the availability of graded corpora where documents are annotated with grade levels. The grades can be either categorical or ordinal, therefore giving rise to either classification or regression algorithmic approaches.
When classification is applied, precision, recall, F-score, and accuracy can be used to measure classification performance and compare different approaches. When regression is applied, Root Mean Squared Error (RMSE) or a correlation coefficient can be used to evaluate the algorithmic performance. In the case of regression, assigning a grade of 4 to a 5th-grade text
