Horacio Saggion

Automatic Text Simplification



      Although these works are interesting because they consider a different user population, they still lack an analysis of the effect that different automatic tools have on readability assessment performance: since parsers, coreference resolution systems, and lexical chainers are imperfect, an important open question is how changes in their performance affect the model outcome.

      Crossley et al. [2007] investigate three Coh-Metrix variables [Graesser et al., 2004] for assessing the readability of texts from the Bormuth corpus, a dataset in which texts are scored based on aggregated answers from informants on cloze tests. The variables entered into a multiple regression analysis were: the number of words per sentence, as an estimate of syntactic complexity; argument overlap, that is, the number of sentences sharing an argument (nouns, pronouns, noun phrases); and word frequencies from the CELEX database [Celex, 1993]. Correlation between the variables used and the text scores was very high.

      Flor and Klebanov [2014] carried out one of the few studies (see Feng et al. [2009]) to assess lexical cohesion [Halliday and Hasan, 1976] for text readability assessment. Since cohesion is related to the way in which elements in the text are tied together to allow text understanding, a more cohesive text may well be perceived as more readable than a less cohesive text. Flor and Klebanov define lexical tightness, a metric based on a normalized form (NPMI) of the pointwise mutual information of Church and Hanks [1990], which measures the strength of association between words in a given document using co-occurrence statistics compiled from a large corpus. The lexical tightness of a text is the average of the NPMI values over all pairs of content words in the text. It is shown that lexical tightness correlates well with grade levels: simple texts tend to be more lexically cohesive than difficult ones.
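
      The computation behind lexical tightness can be sketched as follows. This is a minimal illustration, not Flor and Klebanov's implementation: it assumes the common normalization that divides PMI by the negative log of the joint probability, it averages over pairs of distinct content-word types, and the function names and toy probabilities are hypothetical stand-ins for statistics that would come from a large corpus.

```python
import math
from itertools import combinations


def npmi(word_a, word_b, unigram_prob, pair_prob):
    """Normalized PMI in [-1, 1] for one word pair.

    Assumed normalization: PMI(a, b) / -log p(a, b).
    Pairs never observed together are assigned -1.
    """
    p_ab = pair_prob.get(frozenset((word_a, word_b)), 0.0)
    if p_ab == 0.0:
        return -1.0
    if p_ab >= 1.0:
        return 1.0  # degenerate case: the pair always co-occurs
    pmi = math.log(p_ab / (unigram_prob[word_a] * unigram_prob[word_b]))
    return pmi / -math.log(p_ab)


def lexical_tightness(content_words, unigram_prob, pair_prob):
    """Mean NPMI over all pairs of distinct content-word types."""
    types = sorted(set(content_words))
    pairs = list(combinations(types, 2))
    if not pairs:
        return 0.0
    total = sum(npmi(a, b, unigram_prob, pair_prob) for a, b in pairs)
    return total / len(pairs)


# Hypothetical corpus statistics, invented purely for illustration.
TOY_UNIGRAM_PROB = {"dog": 0.1, "bark": 0.1, "tax": 0.1}
TOY_PAIR_PROB = {
    frozenset(("dog", "bark")): 0.05,  # co-occur more often than chance
    frozenset(("dog", "tax")): 0.01,   # co-occur about as often as chance
    frozenset(("bark", "tax")): 0.01,
}

cohesive = lexical_tightness(["dog", "bark"], TOY_UNIGRAM_PROB, TOY_PAIR_PROB)
loose = lexical_tightness(["dog", "tax"], TOY_UNIGRAM_PROB, TOY_PAIR_PROB)
print(f"cohesive text: {cohesive:.3f}, loose text: {loose:.3f}")
```

      Under these toy statistics, the text whose words are strongly associated receives the higher tightness score, which is the direction of the relationship with text simplicity reported above.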

      There is increasing interest in assessing document readability in the context of web search engines, in particular for the personalization of web search results: search results that, in addition to matching the user’s query, are ranked according to their readability (e.g., from easier to more difficult). One approach is to display search results along with readability levels (Google Search formerly offered the option of filtering search results by reading level) so that users can select material based on its assessed reading level; however, this is limited in that the profile or expertise of the reader (i.e., search behavior) is not taken into consideration when presenting the results. Collins-Thompson et al. [2011] introduced a tripartite approach to personalization of search results by reading level (documents appropriate for the user’s readability level should be ranked higher), which takes advantage of user profiles (to assess their readability level), document difficulty, and a re-ranking strategy so that documents more appropriate for the reader move to the top of the search result list. They use a language-model readability assessment method which leverages word difficulty computed from a web corpus in which pages have been assigned grade levels by their authors [Collins-Thompson and Callan, 2004]. The method departs from traditional readability formulas in that it is based on a probabilistic estimation that models individual word complexity as a distribution across grade levels. The readability of a text is then estimated from the grade-level distributions of the words occurring in it. The authors argue that traditional formulas, which rely on morphological word complexity and sentence complexity (e.g., length) features and sometimes require passages of a certain size (i.e., at least 100 words) to yield an accurate readability estimate, appear inappropriate in a web context where sentence boundaries are sometimes nonexistent and pages can have very little textual content (e.g., images and captions). To estimate the reading proficiency of users, to train some of the model parameters, and to evaluate their approach, they rely on proprietary data on user-interaction behaviors with a web search engine (containing queries, search results, and relevance assessments). With this dataset at hand, the authors can compute, from the web pages that a user visited and read, a distribution of the probability that the reader likes the readability level of a given web page. A re-ranking algorithm, LambdaMART [Wu et al., 2010], is then used to improve the search results and bring the results more appropriate to the user to the top of the search result list. The algorithm is trained using the reading level of pages and snippets (i.e., search result summaries), the user reading level, query characteristics (e.g., length), reading level interactions (e.g., snippet-page, query-page), and confidence values for many of the computed features. Re-ranking experiments across a variety of query types indicate that search results improve by at least one rank for all queries (i.e., the appropriate URL was ranked higher than with the default search engine ranking algorithm).

      Related to work on web document readability is the question of how the different ways in which web pages are parsed (i.e., extracting the text of the document and identifying sentence boundaries) influence the outcome of traditional readability measures. Palotti et al. [2015] study different tools for extracting and sentence-splitting textual content from pages, together with different traditional readability formulas. They found that the ranking of web search results varies considerably depending on the readability formula and text processing method used, and also that some text processing methods produce document rankings with only marginal correlation when a given formula is used.
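
      The idea of modeling word complexity as a distribution across grade levels can be illustrated with a small sketch. It is not the model of Collins-Thompson and Callan [2004]: it assumes a plain unigram model per grade with add-one smoothing and whitespace tokenization, and the function names and the tiny training set are hypothetical. Note that it needs no sentence boundaries, which is the property that makes this family of methods attractive for web pages.

```python
import math
from collections import Counter


def train_grade_models(docs_by_grade):
    """Count word occurrences per grade level from grade-labeled texts."""
    return {
        grade: Counter(word for doc in docs for word in doc.lower().split())
        for grade, docs in docs_by_grade.items()
    }


def grade_log_likelihood(tokens, counts, vocab_size, alpha=1.0):
    """Log-probability of the tokens under one grade's smoothed unigram model."""
    total = sum(counts.values())
    return sum(
        math.log((counts[token] + alpha) / (total + alpha * vocab_size))
        for token in tokens
    )


def predict_grade(text, models):
    """Return the grade whose model assigns the text the highest likelihood."""
    tokens = text.lower().split()
    # Shared vocabulary: every training word plus the words of the input text.
    vocab = set().union(*models.values()) | set(tokens)
    return max(
        models,
        key=lambda grade: grade_log_likelihood(tokens, models[grade], len(vocab)),
    )


# Hypothetical grade-labeled training texts, invented for illustration.
TOY_DOCS_BY_GRADE = {
    1: ["the cat sat on the mat", "the dog ran"],
    8: ["the committee evaluated the proposal", "economic policy affects inflation"],
}
TOY_MODELS = train_grade_models(TOY_DOCS_BY_GRADE)

print(predict_grade("the cat ran", TOY_MODELS))
print(predict_grade("economic policy", TOY_MODELS))
```

      In a personalization setting such as the one described above, a score of this kind would be only one input among several (user reading level, snippet reading level, query characteristics) passed to a learned re-ranker.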

      Given the proliferation of readability formulas, one may wonder how they differ and which one should be used for assessing the difficulty of a given text. Štajner et al. [2012] study the correlation of a number of classic readability formulas and linguistically motivated features using different corpora to identify which formula or linguistic characteristics may be used to select appropriate text for people with an autism-spectrum disorder.

      The corpora included in the study were: 170 texts from Simple Wikipedia, 171 texts from a collection of news texts from the METER corpus, 91 texts from the health section of the British National Corpus, and 120 fiction texts from the FLOB corpus. The readability formulas studied were the Flesch Reading Ease score, the Flesch-Kincaid grade level, the SMOG grading, and the FOG index. According to the authors, the linguistically motivated features were designed to detect possible “linguistic obstacles” in a text that may hinder readability. They include features of structural complexity such as the average number of major POS tags per sentence and the average number of infinitive markers, coordinating and subordinating conjunctions, and prepositions. Features indicative of ambiguity include the average number of senses per word and the average number of pronouns and definite descriptions per sentence. The authors first computed, over each corpus, the average of each readability score to identify which corpora were “easier” according to the formulas. To their surprise, and according to all four formulas, the corpus of fiction texts appears to be the easiest to read, with health-related documents at the same readability level as Simple Wikipedia articles. In another experiment, they study the correlation of each pair of formulas in each corpus; their results show almost perfect correlation, suggesting that the formulas could be used interchangeably. Their last experiment, which studies the correlation between the Flesch-Kincaid formula and the different linguistically motivated features, indicates that although most features are strongly correlated with the readability formula, the strength of the correlation varies from corpus to corpus. The authors suggest that because of the correlation of the readability formula with linguistic indicators of reading difficulty, the Flesch score could be used to assess the difficulty level of texts for their target audience.
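
      For reference, the four formulas compared in the study can be written down compactly. The sketch below uses the coefficients of their standard published forms and takes raw counts as input, since sentence splitting and syllable counting depend on the tools used. The `pearson` helper and the toy document counts are hypothetical additions for illustration and are not data from Štajner et al. [2012].

```python
import math


def flesch_reading_ease(words, sentences, syllables):
    """Higher scores indicate easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Approximate U.S. school grade; higher means harder text."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def gunning_fog(words, sentences, complex_words):
    """FOG index; complex_words are words of three or more syllables."""
    return 0.4 * ((words / sentences) + 100 * (complex_words / words))


def smog_grade(sentences, polysyllables):
    """SMOG grading, normalizing the polysyllable count to 30 sentences."""
    return 1.0430 * math.sqrt(polysyllables * (30 / sentences)) + 3.1291


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical (words, sentences, syllables) counts for three documents.
TOY_DOCS = [(100, 10, 120), (100, 5, 150), (100, 4, 180)]
ease_scores = [flesch_reading_ease(*doc) for doc in TOY_DOCS]
grade_scores = [flesch_kincaid_grade(*doc) for doc in TOY_DOCS]

# Reading Ease runs in the opposite direction to the grade-level formulas,
# so a strong relationship between the two shows up as a negative coefficient.
print(pearson(ease_scores, grade_scores))
```

      Because all four formulas are built from the same few surface counts (sentence length and a syllable-based measure of word length), a strong correlation between any pair of them is to be expected.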

      Most readability studies consider the text as the unit for assessment (although Collins-Thompson et al. [2011] present a study also for text snippets and search queries); however, some authors have recently become interested in assessing readability of short units such as sentences. Dell’Orletta et al. [2014a,b], in addition to presenting a readability study for Italian where they test the value of different features for classification of texts into easy or difficult, also address the problem of classifying sentences as easy-to-read or difficult-to-read. The problem they face is the unavailability of annotated corpora for the task, so they rely on documents from two different providers: easy-to-read documents are sampled from the easy-to-read newspaper Due Parole, while the difficult-to-read documents are sampled from the newspaper La Repubblica. Features for document classification included in their study are: raw text features such as sentence-length and word-length averages, lexical features