CHAPTER 2
Analytical Techniques
INTRODUCTION
Data are everywhere. IBM projects that every day we generate 2.5 quintillion bytes of data. In relative terms, this means 90% of the data in the world has been created in the last two years. These massive amounts of data yield an unprecedented treasure of internal knowledge, ready to be analyzed using state-of-the-art analytical techniques to better understand and exploit the behavior of, for example, your customers or employees, and to identify new business opportunities and strategies. In this chapter, we zoom into analytical techniques. As such, the chapter provides the backbone for all subsequent chapters. We build on the analytics process model reviewed in the introductory chapter to structure the discussion, and start by highlighting a number of key activities that take place during data preprocessing. Next, the data analysis stage is elaborated. We turn our attention to predictive analytics and discuss linear regression, logistic regression, decision trees, neural networks, and random forests. A subsequent section elaborates on descriptive analytics such as association rules, sequence rules, and clustering. Survival analysis techniques are also discussed, where the aim is to predict the timing of events instead of only event occurrence. The chapter concludes by zooming into social network analytics, where the goal is to incorporate network information into descriptive or predictive analytical models. Throughout the chapter, we discuss standard approaches for evaluating these different types of analytical techniques, as highlighted in the final stage of the analytics process model.
DATA PREPROCESSING
Data are the key ingredient for any analytical exercise. Hence, it is important to thoroughly consider and gather all data sources that are potentially relevant before starting the analysis. Large-scale experiments as well as broad experience in different fields indicate that when it comes to data, bigger is better. However, real-life data can be (and typically are) dirty because of inconsistencies, incompleteness, duplication, merging, and many other problems. Hence, throughout the analytical modeling steps, various data preprocessing checks are applied to clean up the data and reduce them to a manageable and relevant size. Worth mentioning here is the garbage in, garbage out (GIGO) principle, which essentially states that messy data will yield messy analytical models. It is therefore of utmost importance that every data preprocessing step is carefully justified, carried out, validated, and documented before proceeding with further analysis. Even the slightest mistake can make the data totally unusable for further analysis and completely invalidate the results. In what follows, we briefly zoom into some of the most important data preprocessing activities.
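As an illustration of what such preprocessing checks might look like, the following minimal sketch flags duplicated, incomplete, and inconsistent records in a small table. The field names, the sample records, the allowed gender codes, and the plausible age range of 0 to 120 are all assumptions made for this example; they do not come from a real data set.

```python
from collections import Counter

# Hypothetical records; field names and values are illustrative only.
records = [
    {"ID": "C1", "age": 35, "gender": "F"},
    {"ID": "C2", "age": None, "gender": "M"},     # incomplete: age is missing
    {"ID": "C3", "age": -4, "gender": "female"},  # inconsistent: invalid age and code
    {"ID": "C1", "age": 35, "gender": "F"},       # duplicate of the first record
]


def quality_report(rows, key, allowed_gender=("F", "M"), age_range=(0, 120)):
    """Return the keys of rows that fail some basic data quality checks.

    The allowed codes and the age range are assumed business rules.
    """
    low, high = age_range
    key_counts = Counter(row[key] for row in rows)
    return {
        # Keys that occur more than once point to duplicated observations.
        "duplicate_keys": sorted(k for k, n in key_counts.items() if n > 1),
        # Missing values signal incompleteness.
        "missing_age": [row[key] for row in rows if row["age"] is None],
        # Values outside the plausible range signal inconsistencies.
        "invalid_age": [
            row[key]
            for row in rows
            if row["age"] is not None and not low <= row["age"] <= high
        ],
        "invalid_gender": [
            row[key] for row in rows if row["gender"] not in allowed_gender
        ],
    }


report = quality_report(records, "ID")
for check, offending_keys in report.items():
    print(check, offending_keys)
```

A report of this kind only detects problems. How each flagged record is treated (corrected, removed, or kept) remains a decision that should be justified and documented.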
The application of analytics typically requires or presumes the data to be presented in a single table, containing and representing all the data in some structured way. A structured data table allows straightforward processing and analysis, as briefly discussed in Chapter 1. Typically, the rows of a data table represent the basic entities to which the analysis applies (e.g., customers, transactions, firms, claims, or cases). The rows are also referred to as observations, instances, records, or lines. The columns in the data table contain information about the basic entities. Plenty of synonyms are used to denote the columns of the data table, such as (explanatory or predictor) variables, inputs, fields, characteristics, attributes, indicators, and features, among others. In this book, we will consistently use the terms observation and variable.
Several normalized source data tables have to be merged in order to construct the aggregated, non-normalized data table. Merging tables involves selecting information from different tables related to an individual entity, and copying it to the aggregated data table. The individual entity can be recognized and selected in the different tables by making use of (primary) keys, which are attributes that have been included in the tables specifically to identify and relate observations from different source tables that pertain to the same entity. Figure 2.1 illustrates the process of merging two tables (transaction data and customer data) into a single, non-normalized data table by making use of the key attribute ID, which connects observations in the transactions table with observations in the customer table. The same approach can be followed to merge as many tables as required, but clearly the more tables are merged, the more duplicate data might be included in the resulting table. It is crucial that no errors are introduced during this process, so checks should be applied to verify the resulting table and to make sure that all information is correctly integrated.
Figure 2.1 Aggregating normalized data tables into a non-normalized data table.
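The merging step can be sketched in a few lines of code. The example below is a minimal illustration and not the procedure behind Figure 2.1: the two small tables, their column names other than the key attribute ID, and the helper function merge_on_key are hypothetical. It joins each transaction to its customer through ID and reports transactions for which no customer is found, which is one of the checks that can be applied to the resulting table.

```python
# Hypothetical source tables; columns other than ID are illustrative only.
customers = [
    {"ID": "C1", "age": 35, "region": "North"},
    {"ID": "C2", "age": 52, "region": "South"},
]
transactions = [
    {"ID": "C1", "date": "2015-01-03", "amount": 120.0},
    {"ID": "C1", "date": "2015-02-11", "amount": 35.5},
    {"ID": "C2", "date": "2015-01-20", "amount": 80.0},
    {"ID": "C3", "date": "2015-03-02", "amount": 15.0},  # no matching customer
]


def merge_on_key(left, right, key):
    """Join each row of `left` to the row of `right` that shares its key.

    Returns (merged rows, rows of `left` without a match in `right`).
    The key must be unique in `right`, as a primary key would be.
    """
    index = {}
    for row in right:
        if row[key] in index:
            raise ValueError(f"duplicate key {row[key]!r} in right table")
        index[row[key]] = row

    merged, unmatched = [], []
    for row in left:
        match = index.get(row[key])
        if match is None:
            unmatched.append(row)
        else:
            # Copy the customer fields next to the transaction fields.
            merged.append({**match, **row})
    return merged, unmatched


merged, unmatched = merge_on_key(transactions, customers, "ID")

# Control check: every transaction is either merged or reported as unmatched.
assert len(merged) + len(unmatched) == len(transactions)

for row in merged:
    print(row)
print("unmatched:", unmatched)
```

Note that the age and region of customer C1 appear twice in the merged table, once per transaction. This is the duplication of data that merging introduces into a non-normalized table.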
The aim of sampling is to take a subset of historical data (e.g., past transactions), and use that to build an analytical model. A first obvious question concerns the need for sampling. With the availability of high-performance computing facilities (e.g., grid and cloud computing), one could also try to directly analyze the full dataset. However, a key requirement for a good sample is that it should be representative of the future entities on which the analytical model will be run. Hence, the timing aspect becomes important since, for instance, transactions of today are more similar to transactions of tomorrow than they are to