Burr Settles

Active Learning



they study Japanese words in their work). The approach not only reduces annotation effort, but also limits the size of the database used in nearest-neighbor learning, which in turn expedites the classification algorithm.

      It is worth noting that some authors (e.g., Moskovitch et al., 2007; Thompson et al., 1999) use the term “selective sampling” to refer to the pool-based scenario described next. Under this interpretation, the term merely signifies that queries are made with a selected subset of instances sampled from a real data distribution. However, in most of the literature selective sampling refers to the stream-based scenario described here.

      Pool-Based Sampling. For many real-world learning problems, large collections of unlabeled data can be gathered at once. This motivates pool-based sampling (Lewis and Gale, 1994), which assumes that there is a small set of labeled data L and a large pool of unlabeled data U available. The approach is illustrated in Figure 1.5(b). Queries are selected from the pool, which is usually assumed to be closed (i.e., static or non-changing), although this is not strictly necessary. Queries are typically chosen in a greedy fashion, according to a utility measure used to evaluate all instances in the pool (or, perhaps if U is very large, a subsample thereof). The binary search algorithm for the alien fruits example in Section 1.1 is a pool-based active learning algorithm.
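The greedy selection step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `select_query`, the `max_candidates` parameter used for subsampling a very large pool, and the toy utility measure are all hypothetical and are not taken from the text.

```python
import random
from typing import Callable, Optional, Sequence


def select_query(pool: Sequence[float],
                 utility: Callable[[float], float],
                 max_candidates: Optional[int] = None,
                 seed: int = 0) -> int:
    """Return the index of the pool instance with the highest utility.

    If the pool is larger than `max_candidates`, only a random subsample
    of that size is scored (the "subsample thereof" case).
    """
    indices = list(range(len(pool)))
    if max_candidates is not None and len(indices) > max_candidates:
        indices = random.Random(seed).sample(indices, max_candidates)
    # Greedy choice: score every candidate and keep the best one.
    return max(indices, key=lambda i: utility(pool[i]))


# Toy unlabeled pool U and a hypothetical utility that prefers
# instances near an assumed decision threshold of 0.5.
pool = [0.1, 0.4, 0.55, 0.9]
closeness_to_half = lambda x: -abs(x - 0.5)

best = select_query(pool, closeness_to_half)
print(pool[best])
```

In a full active learning loop, the chosen instance would be labeled by the oracle, moved from the unlabeled pool to the labeled set, and the model retrained before the next query is selected.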

      The pool-based scenario has been studied for many real-world problem domains in machine learning, such as text classification (Hoi et al., 2006a; Lewis and Gale, 1994; McCallum and Nigam, 1998; Tong and Koller, 2000), information extraction (Settles and Craven, 2008; Thompson et al., 1999), image classification and retrieval (Tong and Chang, 2001; Zhang and Chen, 2002), video classification and retrieval (Hauptmann et al., 2006; Yan et al., 2003), speech recognition (Tür et al., 2005), and cancer diagnosis (Liu, 2004), to name only a few. In fact, pool-based sampling appears to be the most popular scenario for applied research in active learning, whereas query synthesis and stream-based selective sampling are more common in the theoretical literature.

      The main difference between stream-based and pool-based active learning is that the former obtains one instance at a time, sequentially from some streaming data source (or by scanning through the data), and makes each query decision individually. Pool-based active learning, on the other hand, evaluates and ranks the entire collection of unlabeled data before selecting the best query. While the pool-based scenario appears to be much more common among application papers, one can imagine settings where the stream-based approach is more appropriate: for example, when memory or processing power is limited, as with mobile and embedded devices, or when the data set is too large to load into memory and must be scanned sequentially from disk. Unless otherwise noted, however, we will assume a pool-based scenario for the algorithms discussed in the remainder of this book.
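The contrast between the two scenarios can be made concrete with a small sketch. It assumes a hypothetical `uncertainty` score in [0, 1] and, for the stream-based case, a fixed query threshold; neither is prescribed by the text, and other decision rules are possible.

```python
from typing import Callable, Iterable, Iterator, Sequence


def stream_queries(stream: Iterable[float],
                   uncertainty: Callable[[float], float],
                   threshold: float) -> Iterator[float]:
    """Stream-based: see one instance at a time and decide on the spot."""
    for x in stream:
        if uncertainty(x) >= threshold:
            yield x  # query this instance; all others are discarded


def pool_query(pool: Sequence[float],
               uncertainty: Callable[[float], float]) -> float:
    """Pool-based: score the whole collection, then pick the single best."""
    return max(pool, key=uncertainty)


# Hypothetical uncertainty score: 1 at an assumed boundary of 0.5,
# falling linearly to 0 at the ends of the unit interval.
uncertainty = lambda x: 1.0 - 2.0 * abs(x - 0.5)
data = [0.1, 0.45, 0.9, 0.52]

streamed = list(stream_queries(iter(data), uncertainty, threshold=0.8))
pooled = pool_query(data, uncertainty)
print(streamed, pooled)
```

The stream-based generator never holds more than one instance in memory, which is why it suits the limited-memory settings mentioned above, while the pool-based function needs access to every candidate before it can rank them.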

      1 More detail on PAC learning and active learning will be discussed in Chapter 6.

      2 Relaxations of these and other assumptions are discussed in Chapter 7.

      CHAPTER 2

       Uncertainty Sampling

      “Information is the resolution of uncertainty.”

      — Claude Shannon

      Let us revisit the alien fruits problem from Section 1.1, and use it as a running example. Recall that we want to efficiently test fruits for ⊕ safe vs. ⊖ noxious to eat. One solution is to lay out all the fruits in a line from most round to most irregular, and use a binary search algorithm to actively select which fruits to test. This approach is fine for our simple thought experiment, but we want a more general search algorithm for arbitrary problems with many input dimensions, potentially many choices for output (e.g., multiple class labels or output structures), and probably even noisy training labels. As a starting point, recall that we can use supervised learning to select a threshold parameter θ that lies somewhere in the transition from one label to the other.

      One reasonable way to choose this threshold value is to be as non-committal as possible: set θ to be halfway between the known ⊕ and ⊖ fruits which are closest together (this is what so-called max-margin learning algorithms do). Such a classifier should be fairly confident about its predictions for fruits which are far away from the thresholded classification boundary, but as x (the fruit’s irregularity measure) approaches θ the model becomes much less certain. Intuitively, the instances that are least certain would offer the most information about the problem, since the more confident classifications are probably correct. What if we adopt a simple active learning strategy that queries the instance closest to the decision boundary? In fact, we would recover the binary search algorithm from Section 1.1, as illustrated by Figure 2.1.
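To see why querying the instance closest to the boundary recovers binary search, consider the following minimal sketch. It makes several simplifying assumptions that are not in the text: labels are noise-free, the roundest fruit is already known to be safe and the most irregular one is known to be noxious, and the names `learn_threshold` and `oracle` are hypothetical.

```python
from typing import Callable, Sequence, Tuple


def learn_threshold(xs: Sequence[float],
                    oracle: Callable[[float], bool]) -> Tuple[float, int]:
    """Uncertainty sampling on a 1-D, noise-free threshold problem.

    `oracle(x)` returns True for a safe fruit and False for a noxious one.
    Returns the final max-margin threshold and the number of queries made.
    """
    xs = sorted(xs)
    lo, hi = 0, len(xs) - 1  # closest known safe / known noxious fruits
    queries = 0
    while hi - lo > 1:  # unlabeled fruits remain between them
        # Max-margin threshold: halfway between the closest known pair.
        theta = (xs[lo] + xs[hi]) / 2
        # Query the unlabeled fruit closest to the current boundary.
        q = min(range(lo + 1, hi), key=lambda i: abs(xs[i] - theta))
        queries += 1
        if oracle(xs[q]):
            lo = q
        else:
            hi = q
    return (xs[lo] + xs[hi]) / 2, queries


# 100 fruits with irregularity 0..99; the (hidden) true boundary is 62.5.
fruits = [float(i) for i in range(100)]
theta, n_queries = learn_threshold(fruits, lambda x: x < 62.5)
print(theta, n_queries)
```

When the fruits are roughly evenly spaced, each query halves the interval of unlabeled fruits, so the number of queries grows logarithmically with the pool size rather than linearly.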

      This type of active learning strategy is commonly known as uncertainty sampling (Lewis and Catlett, 1994). The basic premise is that the learner can avoid querying the instances it is already confident about, and focus its attention instead on the unlabeled instances it finds confusing. The example in Figure 2.1 quantifies uncertainty by |θ − x|, the distance of instance x from the boundary θ, which is a fine measure for hypothesis classes that provide such a distance measure. However, we are interested in generalizing active learning to more complex problems that go beyond binary classification in a relatively noise-free environment like this one. An elegant way to extend the approach is to use a probabilistic classifier which can output a posterior distribution Pθ(Y|x) over the label variable Y given the input and learned model parameters. Under this interpretation, we would want to query the instance for which Pθ(ŷ|x), where ŷ refers to the classifier’s most likely prediction for x, is closest to a uniform distribution (0.5 in the case of binary classification). While a probabilistic interpretation is not strictly necessary, there has been significant work in machine learning on probabilistic classifiers, and graphical models in particular (for a thorough overview, see Koller and Friedman, 2009). By framing our discussion of uncertainty sampling in the language of probability, we can easily generalize the techniques in this chapter to a variety of interesting cases, including problems with many input features, multiple output labels, and even structured prediction tasks, which we will discuss later in this chapter.
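The probabilistic view can be illustrated with a short sketch that selects the instance whose most likely label has the lowest posterior probability. The one-dimensional logistic model, its hand-set parameters, and the function names `posterior` and `least_confident` are assumptions made for this example, not part of the text.

```python
import math
from typing import Callable, Dict, Sequence


def posterior(x: float, weight: float, bias: float) -> Dict[str, float]:
    """Posterior over labels from a hypothetical 1-D logistic model."""
    p_pos = 1.0 / (1.0 + math.exp(-(weight * x + bias)))
    return {"+": p_pos, "-": 1.0 - p_pos}


def least_confident(pool: Sequence[float],
                    predict_proba: Callable[[float], Dict[str, float]]) -> float:
    """Return the instance whose top prediction has the lowest probability.

    For binary labels this is the instance whose posterior is closest
    to 0.5; the same rule applies unchanged to more than two labels.
    """
    return min(pool, key=lambda x: max(predict_proba(x).values()))


# Assumed parameters placing the decision boundary at x = 0.5.
model = lambda x: posterior(x, weight=-2.0, bias=1.0)
pool = [-1.0, 0.2, 0.6, 3.0]

query = least_confident(pool, model)
print(query, model(query))
```

Because the rule only looks at the posterior distribution, it does not depend on the hypothesis class providing a geometric distance to the boundary, which is what allows it to extend beyond the thresholded one-dimensional case.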

      Figure 2.1: The binary search from Figure 1.3, re-interpreted as an uncertainty sampling approach. The best instance to query is deemed to be the one closest to the threshold θ.

      To visualize the way in which uncertainty sampling generalizes to a noisy, two-dimensional classification problem, consider Figure 2.2. Figure 2.2(a) shows a toy data set constructed from two Gaussians centered at (-2,0) and (2,0) with standard deviation σ = 1. There are 400 instances total, 200 drawn from each class distribution. In a real-world setting, these instances may be available but their labels would not. Figure 2.2(b) illustrates the traditional supervised learning approach of randomly selecting instances for labeling. The line shows the linear decision boundary of a logistic regression model (i.e., where the posterior label probability equals 0.5) trained using 30 points. Notice that most of the labeled instances in this training set are far from zero on the horizontal axis, which is where the Bayes optimal decision boundary should be. As a result, this classifier only achieves 70% accuracy on the remaining unlabeled data. Figure 2.2(c) tells a different story, however: the active learner uses uncertainty sampling to focus on the instances closest to its decision boundary, assuming it can adequately explain the data in other parts of the input space. As a result, it avoids requesting labels for redundant or irrelevant instances, and achieves 90% accuracy using the same budget of