Michael J. Paul

Social Monitoring for Public Health



      • NLTK (http://www.nltk.org/) is a Python library for text processing, supporting tokenization and classification; a brief tokenization sketch appears after this list.

      • Stanford Core NLP (https://stanfordnlp.github.io/CoreNLP/) is a set of natural language processing tools, including named entity recognition and dependency parsing.

      • HLTCOE Concrete (http://hltcoe.github.io/) is a data serialization standard for NLP data that includes a variety of “concrete compliant” NLP tools.

      • Twitter NLP (https://github.com/aritter/twitter_nlp) is a Python toolkit that implements some core NLP tools with models specifically trained on Twitter data.

      • TweetNLP (http://www.cs.cmu.edu/~ark/TweetNLP/) is a toolkit of text processing tools specifically for Twitter, implemented in Java and Python.

      • Weka (http://www.cs.waikato.ac.nz/ml/weka/) is a machine learning software package that supports tasks like classification and clustering. It has a graphical interface, making it more user-friendly than the other tools.
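
      As a small illustration of the kind of preprocessing these toolkits provide, the sketch below tokenizes a short health-related post with NLTK. The example text is invented, and the call assumes NLTK is installed with its “punkt” tokenizer models available.

      # Minimal sketch: tokenize a short health-related post with NLTK.
      import nltk

      nltk.download("punkt", quiet=True)  # one-time download of the tokenizer models

      post = "Feeling awful today: fever, sore throat, and a bad cough. #flu"
      tokens = nltk.word_tokenize(post)
      print(tokens)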

      We will now describe methods for extracting trends—levels of interest or activity across time intervals or geographic locations—from social media. First, we discuss how raw volumes of filtered content can be converted to trends by normalizing the counts. Second, we describe how filtered content can be used as predictors in more sophisticated statistical models to produce trend estimates. Examples of these two approaches, as applied to influenza surveillance, are contrasted in Figure 4.3.

       Counting and Normalization

      A simple method for extracting trends is to compute the volume of data filtered for relevance (Section 4.1.1) at each point (e.g., time period or location), for example the number of flu tweets per week [Chew and Eysenbach, 2010, Lamb et al., 2013, Lampos and Cristianini, 2010].

      It is important to normalize the volume counts to adjust for variation over time and location. For example, the system of Lamb et al. [2013] normalizes influenza counts by dividing the volumes by the counts of a random sample of public tweets for the same location and time period. Normalization is especially important for comparing locations, as volumes are affected by regional differences in population and social media usage, but normalization is also important for comparing values across long time intervals, as usage of a social media platform inevitably changes over time.
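
      A minimal sketch of this normalization step, assuming weekly counts are already available, is shown below. The counts of flu-related messages are divided by the counts of a random background sample from the same weeks; all numbers are invented for illustration.

      # Normalize weekly flu-related message counts by the volume of a random
      # background sample from the same weeks (all numbers are hypothetical).
      flu_counts = [120, 150, 310, 280]
      background_counts = [90000, 91500, 88000, 95000]

      normalized = [
          flu / background if background > 0 else 0.0
          for flu, background in zip(flu_counts, background_counts)
      ]
      # Each value is the fraction of sampled messages that were flu related in
      # that week, which adjusts for changes in overall platform volume.
      print(normalized)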

      Note that the search volume counts provided by Google Trends are already normalized, although the normalization is plot dependent, and values cannot be compared between plots without establishing baselines for comparison. See Ayers et al. [2011b] for details.

       Statistical Modeling and Regression

      A more sophisticated approach to trend inference is to represent trends with statistical models. When a model is used to predict numeric values, the task is called regression. Regression models are used to fit data, such as social media volume, to “gold standard” values from an existing surveillance system, such as the influenza-like illness surveillance network of the Centers for Disease Control and Prevention (CDC).

[Figure 4.3: examples of influenza trends produced by the counting/normalization and regression approaches.]

      The simplest type of regression model is a univariate (one predictor) linear model, which has the form y_i = b + βx_i for each point i, where a point is a time period such as a week. For example, y_i could be the CDC’s influenza prevalence at week i and x_i could be the volume of flu-related social media activity in the same week [Culotta, 2010, Ginsberg et al., 2009]. The β value is the regression coefficient, interpreted as the slope of the line in a linear model, while b is an intercept. By plugging social media counts into a fitted regression model, one can estimate the CDC’s values.
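
      As a concrete illustration, the sketch below fits a univariate linear model of this form with NumPy’s least-squares polynomial fit. The weekly values are invented and stand in for a social media signal (x) and CDC prevalence (y).

      # Fit y_i = b + beta * x_i by ordinary least squares (illustrative data only).
      import numpy as np

      x = np.array([0.8, 1.1, 1.9, 2.4, 3.0])  # hypothetical weekly social media signal
      y = np.array([1.2, 1.5, 2.6, 3.1, 3.9])  # hypothetical CDC influenza prevalence (%)

      beta, b = np.polyfit(x, y, deg=1)        # slope (regression coefficient) and intercept
      print(f"estimated model: y = {b:.2f} + {beta:.2f} * x")

      # Estimate the CDC value for a new week from its social media volume alone.
      x_new = 2.0
      print(b + beta * x_new)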

      Other predictors can be included in regression models besides social media volume. A useful predictor is the trend itself: the previous week’s value is a good predictor of the current week’s value, for example. A kth-order autoregressive (AR) model is a regression model whose predictors are the previous k values. For example, a second-order autoregressive model has the form y_i = β_1 y_{i−1} + β_2 y_{i−2}. If predictors are included in addition to the time series data itself, such as the social media estimate x_i, the model is called an autoregressive exogenous (ARX) model. ARX models have been shown to outperform basic regression models for influenza prediction from social media [Achrekar et al., 2012, Paul et al., 2014].
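
      A minimal sketch of a second-order ARX model fit by ordinary least squares is shown below. Each row of the design matrix contains the two previous values of the series plus the social media signal for the current week; the data and the plain least-squares fit are illustrative assumptions rather than the setup of any cited system.

      # Second-order ARX model: y_i is predicted from y_{i-1}, y_{i-2}, and x_i.
      # Fit by ordinary least squares with NumPy (all values are hypothetical).
      import numpy as np

      y = np.array([1.0, 1.3, 1.8, 2.5, 3.4, 4.1, 4.8])  # surveillance series
      x = np.array([0.7, 0.9, 1.4, 2.0, 2.8, 3.3, 3.9])  # social media signal

      # Rows correspond to weeks i = 2, ..., len(y) - 1.
      X = np.column_stack([y[1:-1], y[:-2], x[2:], np.ones(len(y) - 2)])
      target = y[2:]

      coefs, *_ = np.linalg.lstsq(X, target, rcond=None)
      beta1, beta2, beta_x, intercept = coefs

      # One-step-ahead estimate for the next week, given its social media volume.
      x_next = 4.2
      print(beta1 * y[-1] + beta2 * y[-2] + beta_x * x_next + intercept)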

      A commonly used extension to the linear autoregressive model is the autoregressive integrated moving average (ARIMA) model, which assumes an underlying smooth behavior in the time series. These models have also been used for predicting influenza prevalence [Broniatowski et al., 2015, Dugas et al., 2013, Preis and Moat, 2014].
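
      For completeness, the sketch below fits a small ARIMA model with the statsmodels library and produces a one-step-ahead forecast. The series and the (2, 1, 1) order are arbitrary illustrative choices, not the settings used in the cited studies.

      # Fit an ARIMA model to a weekly series and forecast the next week.
      # The data and the model order are illustrative only.
      import numpy as np
      from statsmodels.tsa.arima.model import ARIMA

      y = np.array([1.0, 1.2, 1.5, 2.1, 2.9, 3.8, 4.4, 4.9, 5.1, 4.8, 4.2, 3.5])

      model = ARIMA(y, order=(2, 1, 1))    # AR order 2, first differencing, MA order 1
      fitted = model.fit()

      print(fitted.forecast(steps=1))      # one-step-ahead prediction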
