and how smart each student is, given the data about students, the courses they take, and the grades they obtain. For example, we may learn that s1 is intelligent, s2 is not as intelligent, course c2 is difficult and course c3 is not difficult, etc. This model then allows for the prediction that s3 will do better than s4 in course c4.
Standard textbook supervised learning algorithms that learn, e.g., a decision tree, a neural network, or a support vector machine (SVM) to predict grade are not appropriate; they can handle ordinals, but cannot handle the names of students and courses. It is the relationship among the individuals that provides the generalizations from which to learn. Traditional classifiers are unable to take into account such relations. This also holds for learning standard graphical models, such as Bayesian networks. These approaches make what can be seen as a single-table single-row assumption, which requires that each instance is described in a single row by a fixed set of features and all instances are independent of one another (given the model). This clearly does not hold in this dataset as the information about student s1 is spread over multiple rows, and that about course c1 as well. Furthermore, tests on student = s1 or course = c3 would be meaningless if we want to learn a model that generalizes to new students.
StarAI approaches take into account the relationships among the individuals as well as deal with uncertainty.
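To make the contrast with the single-table single-row assumption concrete, the following is a minimal sketch in Python. It is not a StarAI method from the text: it assumes a toy additive model (grade points ≈ overall mean + student ability − course difficulty) fitted by alternating averages, and the grade records, the `POINTS` mapping, and the function names `fit` and `predict` are all hypothetical. It only illustrates how information about one student, spread over several rows, can be pooled and then used to compare two students in a course with no recorded grades.

```python
# Hypothetical grade records (student, course, grade); illustrative data only.
GRADES = [
    ("s1", "c1", "A"),
    ("s2", "c1", "C"),
    ("s1", "c2", "B"),
    ("s2", "c3", "B"),
    ("s3", "c2", "B"),
    ("s4", "c3", "B"),
]
POINTS = {"A": 4.0, "B": 3.0, "C": 2.0}  # assumed numeric encoding of grades


def fit(records, sweeps=200):
    """Alternately re-estimate student ability and course difficulty.

    Assumed toy model: points ~ mean + ability[student] - difficulty[course].
    """
    mean = sum(POINTS[g] for _, _, g in records) / len(records)
    ability, difficulty = {}, {}
    for _ in range(sweeps):
        # Update each student's ability from all rows mentioning that student.
        per_student = {}
        for s, c, g in records:
            residual = POINTS[g] - mean + difficulty.get(c, 0.0)
            per_student.setdefault(s, []).append(residual)
        for s, values in per_student.items():
            ability[s] = sum(values) / len(values)
        # Update each course's difficulty from all rows mentioning that course.
        per_course = {}
        for s, c, g in records:
            residual = mean + ability[s] - POINTS[g]
            per_course.setdefault(c, []).append(residual)
        for c, values in per_course.items():
            difficulty[c] = sum(values) / len(values)
    return mean, ability, difficulty


def predict(model, student, course):
    """Predicted grade points; unseen courses get difficulty 0 (an assumption)."""
    mean, ability, difficulty = model
    return mean + ability.get(student, 0.0) - difficulty.get(course, 0.0)


model = fit(GRADES)
# s3 and s4 have identical grades (both B), yet the model separates them
# because s3's B was earned in a course that looks harder.
print(predict(model, "s3", "c4") > predict(model, "s4", "c4"))
```

Note that the model never tests on a specific name such as student = s1; the names serve only to link rows, which is why the fitted parameters can be used for a course (here c4) that has no recorded grades.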
1.3 THE BENEFITS OF MASTERING STARAI
The benefits of combining logical abstraction and relations with probability and statistics are manifold.
• When learning a model from data, relational and logical abstraction allows one to reuse experience in that learning about one entity improves the prediction for other entities. This can generalize to objects that have never been observed before.
• Logical variables, which are placeholders for individuals, allow one to make abstractions that apply to all individuals that have some common properties.
• By using logical variables and unification, one can specify and reason about regularities across different situations using rules and templates rather than having to specify them for each single entity separately.
• The employed and/or learned knowledge is often declarative and compact, which potentially makes it easier for people to understand and validate.
• In many applications, background knowledge about the domain can be represented in terms of probability and/or logic. Background knowledge may improve the quality of learning: the logical aspects may focus the search on the relevant patterns, thus restricting the search space, while the probabilistic components may provide prior knowledge that can help avoid overfitting.
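The role of logical variables and unification in the points above can be sketched in a few lines of Python. This is an illustrative assumption, not a representation taken from the text: atoms are encoded as tuples, terms starting with an uppercase letter are treated as logical variables, and the rule, the facts, and the helper names (`unify`, `substitute`, `ground`) are hypothetical. A single template stating that grade(S, C) depends on intelligent(S) and difficult(C) is instantiated for every student-course pair that matches takes(S, C).

```python
# Hypothetical facts, written as (predicate, arg1, arg2) tuples.
FACTS = {
    ("takes", "s1", "c1"),
    ("takes", "s3", "c4"),
    ("takes", "s4", "c4"),
}

# One template with logical variables S and C (uppercase means variable).
# Assumed rule: whenever takes(S, C) holds, grade(S, C) has the parents
# intelligent(S) and difficult(C).
RULE = {
    "condition": ("takes", "S", "C"),
    "child": ("grade", "S", "C"),
    "parents": [("intelligent", "S"), ("difficult", "C")],
}


def is_var(term):
    """A term is a logical variable if it starts with an uppercase letter."""
    return term[:1].isupper()


def unify(pattern, fact):
    """Return a substitution that makes pattern equal to fact, or None."""
    if len(pattern) != len(fact) or pattern[0] != fact[0]:
        return None
    subst = {}
    for p, f in zip(pattern[1:], fact[1:]):
        if is_var(p):
            # A variable must be bound consistently across its occurrences.
            if subst.setdefault(p, f) != f:
                return None
        elif p != f:
            return None
    return subst


def substitute(atom, subst):
    """Replace variables in atom by the individuals they are bound to."""
    return (atom[0],) + tuple(subst.get(t, t) for t in atom[1:])


def ground(rule, facts):
    """Instantiate the rule once for every fact matching its condition."""
    groundings = []
    for fact in sorted(facts):
        theta = unify(rule["condition"], fact)
        if theta is not None:
            child = substitute(rule["child"], theta)
            parents = [substitute(p, theta) for p in rule["parents"]]
            groundings.append((child, parents))
    return groundings


for child, parents in ground(RULE, FACTS):
    print(child, "<-", parents)
```

The template is written once; adding a new fact such as takes(s5, c2) would yield a new ground dependency without changing the rule, which is the sense in which regularities need not be specified for each entity separately.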
Relational and logical abstraction have the potential to make statistical AI more robust and efficient. Incorporating uncertainty makes relational models more suitable for reasoning about the complexity of the real world. This has been witnessed by a number of real-world applications.
1.4 APPLICATIONS OF STARAI
StarAI has been successfully applied to problems in citation analysis, web mining, natural language processing, robotics, medicine, bio- and chemo-informatics, electronic games, and activity recognition, among others. Let us illustrate using a few examples.
Example 1.2 Mining Electronic Health Records (EHRs) As of today, EHRs hold over 50 years of recorded patient information and, with increased adoption and high levels of population coverage, are becoming the focus of public health analyses. Mining EHR data can lead to improved predictions and better disease characterization. For instance, Coronary Heart Disease (CHD) is a major cause of death worldwide. In the U.S., CHD is responsible for approximately 1 in every 6 deaths, with a coronary event occurring every 25 s and about 1 death every minute, based on data current to 2007. Although a multitude of cardiovascular risk factors have been identified, CHD actually reflects complex interactions of these factors over time. Thus, early detection of risks will help in designing effective treatments targeted at youth in order to prevent cardiovascular events in adulthood and to dramatically reduce the costs associated with cardiovascular diseases.
Figure 1.3: Electronic Health Records (EHRs) are relational databases capturing noisy and missing information with probabilistic dependencies (the black arrows) within and across tables.
Doing so, however, calls for StarAI. As illustrated in Fig. 1.3, EHR data consists of several diverse features (e.g., demographics, psychosocial, family history, dietary habits) that interact with each other in many complex ways, making it relational. Moreover, like most data sets from biomedical applications, EHR data contains missing values, i.e., not all data are collected for all individuals. In addition, EHR data is often collected as part of a longitudinal study, i.e., over many different time periods such as 0, 5, 10 years, etc., making it temporal. Natarajan et al. [2013] demonstrated that StarAI can uncover complex interactions of risk factors from EHRs. The learned probabilistic relational model performed significantly better than traditional non-relational approaches and conforms to some known or hypothesized medical facts. For instance, it is believed that females are less prone to cardiovascular issues than males. The relational model identifies sex as the most important feature. Similarly, in men, it is believed that the low- and high-density lipoprotein cholesterol levels are very predictive, and the relational model confirms this. For instance, the risk interacts with a (low-density lipoprotein) cholesterol level in year 0 (of the study) for a middle-aged male in year 7, which can result in a relational conditional probability
Figure 1.4: Populating a knowledge base with probabilistic facts (or assertions) extracted from dark data (e.g., text, audio, video, tables, diagrams, etc.) and background knowledge.
The model also identifies complex interactions between risk factors at different years of the longitudinal study. For instance, smoking in year 5 interacts with cholesterol level in later years in the case of females, and the triglyceride level in year 5 interacts with the cholesterol level in year 7 for males. Finally, using data such as the age of the children, whether the patient owns or rents a home, their employment status, salary range, their food habits, their smoking and alcohol history, etc., revealed striking socio-economic impacts on the health state of the population.
Example 1.3 Extracting value from dark data Many companies’ databases include natural language comments buried in tables and spreadsheets. Similarly, there are often tables and figures embedded in documents and web pages. Like dark matter, dark data is this great mass of data buried in text, tables, figures, and images, which lacks structure and so is essentially unprocessable by traditional methods. StarAI helps bring dark data to light (see, e.g., [Niu et al., 2012, Venugopal et al., 2014]), making knowledge base construction, as illustrated in Fig. 1.4, feasible. The resulting relational probabilistic models are richly structured, with many different entity types and complex interactions.
To carry out such a task, one starts by transforming the dark data, such as text documents, into relational data. For example, one may employ standard NLP tools such as logistic regression, conditional random fields, and dependency parsers to decode the structure (e.g., part-of-speech tags and parse trees) of the text, or run some pattern matching techniques to identify candidate entity mentions and then store them in a database. Then, every tuple in the database or result of a database query is a random variable in a relational probabilistic model. The next step is to determine which of the individuals (entities) are the same as each other and the same as the entities that are already known about. For instance,