Horacio Saggion

Automatic Text Simplification



construction, it also might be necessary to apply transformations at the lexical level to keep the text grammatical. Furthermore, because a text is a coherent and cohesive unit, any change at a local level (words or sentences) will affect, in one way or another, textual properties at both the local and the global level. For example, in languages with grammatical gender, replacing a masculine noun with a feminine synonym during lexical simplification requires repairing local elements such as determiners and adjectives, as well as pronouns or definite expressions in following or preceding sentences. Pragmatic aspects of the text, such as the way in which the original text was written to communicate a message to specific audiences, are generally ignored by current systems.

      As we shall see in this book, most approaches treat text simplification as a sequence of transformations at the word or sentence level, disregarding the global textual content (previous and following text units), thereby affecting important properties such as cohesion and coherence.

      Various studies have investigated ways in which a given text is transformed into an easier-to-read version. In order to understand what text transformations would be needed and which of them could be implemented automatically, Petersen and Ostendorf [2007] analyzed a corpus of original and abridged CNN news articles in English (114 pairs), distributed by the Literacyworks organization,1 aimed at adult learners (i.e., native speakers of English with poor reading skills). They first aligned the original and abridged versions of the news articles, looking for each original-version sentence that corresponds to a sentence in the abridged version. After aligning the corpus, they observed that sentences from the original documents can be dropped (around 30%), aligned to one sentence in the abridged version (47% of the sentences), or aligned to more than one sentence (19%), the last case corresponding to splits. The one-to-one alignments cover cases where the original sentence is kept practically untouched, cases where only part of the original sentence is kept, and cases of major rewriting operations. A small fraction of pairs of original sentences were also aligned to a single abridged sentence, accounting for merges. Petersen and Ostendorf's study also tries to automatically identify sentences in the original document that should be split, since those would be good candidates for simplification. Their approach consists of training a decision-tree learning algorithm (C4.5 [Quinlan, 1993]) to classify a sentence as split or nonsplit. They used various features, including sentence length and several statistics on POS tags and syntactic constructions. Cross-validation experiments show that it is difficult to differentiate between the two classes; moreover, sentence length is the most informative feature and explains much of the classification performance. Another interesting contribution is the study of dropped sentences, for which they train a classifier with some features borrowed from summarization research; however, the classifier is only slightly better than a majority baseline (i.e., never drop).
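
      The alignment categories described above (drop, one-to-one, split, merge) can be tallied mechanically once an alignment is available. The following is a minimal sketch, assuming a hypothetical alignment format in which each entry pairs a tuple of original-sentence indices with a tuple of abridged-sentence indices. The function names, the data format, and the toy data are illustrative assumptions, not the representation used by Petersen and Ostendorf.

```python
from collections import Counter


def alignment_type(orig_ids, abridged_ids):
    """Label one alignment by how many sentences it links on each side."""
    if not abridged_ids:
        return "drop"        # original sentence has no abridged counterpart
    if len(orig_ids) > 1 and len(abridged_ids) == 1:
        return "merge"       # several original sentences -> one abridged
    if len(orig_ids) == 1 and len(abridged_ids) > 1:
        return "split"       # one original sentence -> several abridged
    if len(orig_ids) == 1 and len(abridged_ids) == 1:
        return "one-to-one"
    return "other"           # e.g., many-to-many, or an inserted sentence


def alignment_distribution(alignments):
    """Return the share of each alignment type (empty dict for no input)."""
    counts = Counter(alignment_type(o, a) for o, a in alignments)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()} if total else {}


# Toy data (hypothetical): (original indices, abridged indices)
toy_alignments = [
    ((0,), (0,)),      # kept
    ((1,), ()),        # dropped
    ((2,), (1, 2)),    # split in two
    ((3, 4), (3,)),    # merged
    ((5,), (4,)),      # kept or rewritten
]
distribution = alignment_distribution(toy_alignments)
```

      Note that this sketch counts alignment units rather than original sentences, so a merge of two sentences counts once; reproducing per-sentence percentages such as those reported in the study would require counting each original sentence separately.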

      In a similar way, Bott and Saggion [2011b] and Drndarevic and Saggion [2012a,b] identified a series of transformations that trained editors apply to produce simplified versions of documents. Their case is notably different from that of Petersen and Ostendorf [2007] given the language (Spanish) and the target population of the simplified texts: people with cognitive disabilities. Bott and Saggion [2011b] analyzed a sample of sentence-aligned original and simplified documents to identify expected simplification operations such as sentence split, sentence deletion, and various types of change operations (syntactic, lexical, etc.). Additional operations such as insertion and reordering were also documented. Drndarevic and Saggion [2012a,b] concentrate specifically on lexical changes, identifying, in addition to synonym substitution, cases of numerical expression rewriting (e.g., rounding), named entity reformulation, and insertion of simple definitions. As Petersen and Ostendorf [2007] did for dropped sentences, Drndarevic and Saggion train a classifier, in their case a Support Vector Machine (SVM) [Joachims, 1998], to identify sentences which could be deleted, improving over a robust baseline that always deletes the last sentence of the document.
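
      The baseline mentioned above, which always deletes the last sentence of a document, is simple enough to state in a few lines. The following sketch assumes a hypothetical data format in which each document is a list of gold labels (True when the editor deleted the sentence). The function names, the toy labels, and the choice of precision, recall, and F1 as evaluation measures are assumptions made for illustration; the source does not specify how the comparison was scored.

```python
def last_sentence_baseline(num_sentences):
    """Predict deletion (True) only for the final sentence of a document."""
    return [i == num_sentences - 1 for i in range(num_sentences)]


def precision_recall_f1(gold, predicted):
    """Score binary deletion decisions; the positive class is 'deleted'."""
    tp = sum(g and p for g, p in zip(gold, predicted))
    fp = sum((not g) and p for g, p in zip(gold, predicted))
    fn = sum(g and (not p) for g, p in zip(gold, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy gold labels (hypothetical): one inner list per document
toy_gold_docs = [
    [False, False, True],
    [False, True, False, True],
    [False, False],
]
gold = [label for doc in toy_gold_docs for label in doc]
predicted = [label
             for doc in toy_gold_docs
             for label in last_sentence_baseline(len(doc))]
scores = precision_recall_f1(gold, predicted)
```

      Any learned deletion classifier can be scored with the same function, which makes the comparison against the baseline direct.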

      The creation of text simplification tools without a particular target population in mind can be justified in that some aspects of text complexity affect a large range of users with reading difficulties. For example, long and syntactically complex sentences are generally hard to process. Some particular sentence constructions, such as those which do not follow the canonical subject-verb-object order (e.g., passive constructions), may be an obstacle for people with aphasia [Devlin and Unthank, 2006] or an autism spectrum disorder (ASD) [Yaneva et al., 2016b]. The same is true of very difficult or specialized vocabulary and infrequent words, which can also prove difficult to understand for people with aphasia [Carroll et al., 1998, Devlin and Unthank, 2006] and ASD [Norbury, 2005]. Moreover, there are also certain aspects of language that prove difficult for specific groups of readers. Language learners, for example, may have a good capacity to infer information, although they may have a very restricted lexicon and may not be able to understand certain grammatical constructions. Dyslexic readers, in turn, do not have a problem with language understanding per se, but with understanding the written representation of language. In addition, readers with dyslexia were found to read better when texts use more frequent and shorter words [Rello et al., 2013b]. Finally, people with intellectual disabilities may have problems processing and retaining large amounts of information [Fajardo et al., 2014, Feng et al., 2009].

      In order to create adapted versions for specific populations, various initiatives exist which promote accessible texts. An early proposal is Basic English, a language with a reduced vocabulary of just over 800 word forms and a restricted number of grammatical rules. It was conceived after World War I as a tool for international communication or a kind of interlingua [Ogden, 1937]. Other initiatives are Plain English (see “Language for Special Purposes” in Crystal [1987]), for English in the U.S. and U.K., and Rational French, a controlled French language designed to make technical documentation more accessible in the context of the aerospace industry [Barthe et al., 1999]. In Europe, there are associations dedicated to the adaptation of text materials (books, leaflets, laws, official documents, etc.) for people with disabilities or low literacy levels, examples of which are the Easy-to-Read Network in Scandinavian countries, the Asociación Lectura Fácil2 in Spain, and the Centrum för Lättläst in Sweden.3 These associations usually provide guidance or recommendations about how to prepare or adapt textual material. Some such recommendations are as follows:

      • use simple and direct language;

      • use one idea per sentence;

      • avoid jargon and technical terms;

      • avoid abbreviations;

      • structure text in a clear and coherent way;

      • use one word per concept;

      • use personalization; and

      • use active voice.

      These recommendations, although intuitive, are sometimes difficult to operationalize (for both humans and machines) and sometimes even impossible to follow, especially in the case of adapting an existing piece of text.
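
      To illustrate why operationalization is hard, consider how coarse the automatic proxies for even the simplest recommendations are. The sketch below flags sentences that exceed a word limit (a rough stand-in for "one idea per sentence") and sentences containing all-capital tokens (a rough stand-in for "avoid abbreviations"). The threshold, the regular expressions, and the function names are assumptions chosen for illustration; they are not taken from any of the guidelines cited here, and they will both miss real problems and raise false alarms.

```python
import re

MAX_WORDS = 15                                # arbitrary illustrative limit
ABBREVIATION = re.compile(r"\b[A-Z]{2,}\b")   # all-caps tokens such as "ASD"


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def check_guidelines(text, max_words=MAX_WORDS):
    """Return (sentence, issues) pairs using two crude surface heuristics."""
    report = []
    for sentence in split_sentences(text):
        issues = []
        if len(sentence.split()) > max_words:
            issues.append("long sentence")
        if ABBREVIATION.search(sentence):
            issues.append("abbreviation")
        report.append((sentence, issues))
    return report


sample = "The committee met on Monday. The WHO published new guidance."
report = check_guidelines(sample)
```

      Recommendations such as "use one word per concept" or "structure text in a clear and coherent way" have no comparably simple surface proxy, which is part of what makes them difficult to follow automatically.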

      Although adapted texts have been produced for many years, nowadays there is a plethora of simplified material on the Web. The Swedish “easy-to-read” newspaper 8 Sidor4 is published by the Centrum för Lättläst to allow people access to “easy news.” Other examples of similarly oriented online newspapers and magazines are the Norwegian Klar Tale,5 the Belgian l’Essentiel6 and Wablie,7 the Danish Radio Ligetil,8 the Italian Due Parole,9 and the Finnish Selo-Uutiset.10 For Spanish, the Noticias Fácil website11 provides easy-to-read news for people with disabilities. The Literacyworks website12 offers CNN news stories in original and abridged (or simplified) formats, which can be used as learning resources for adults with poor reading skills. At the European level, the Inclusion Europe website13 provides good examples of how full text simplifications and simplified summaries in various European languages can provide improved access to relevant information. The Simple English Wikipedia14 provides