we want to estimate the original signal, i.e. the image x from the observation y. When n is the Gaussian white noise with zero mean and a variance of σ2, the noise has a constant power spectral density in the frequency domain. Equation (2.3) can be written as ∥Ax−y∥2=σ2. The expected solution space of x is on the surface of the hyper-ball, as shown in figure 2.1.
Figure 2.1. Under the data fidelity constraint, the minimized L2 distance corresponds to a non-sparse solution, and the minimized L1 distance most likely gives a sparse solution. Adapted with permission from Hastie et al (2009). Copyright 2009 Springer.
How to find the optimal solution? It is well known that the L1-norm is defined as the summation of absolute component values of a vector, while the L2-norm is the square root of the sum of the squared element values. Therefore, the geometric meanings of the L1- and L2-norm-based regularization are quite different, with the L1-based regularization clearly favoring a sparse solution, as shown in figure 2.1.
Let us consider the optimization problem as expanding the manifold defined by the regularizer until it intersects the manifold defined by ∥Ax−y∥2=σ2 and then returning the touching point or singling out any point in the intersection. When the touching point does not find the optimal solution, the result can be a smooth version or noisy version, as is depicted in figure 2.1. Since our data fidelity term is quadratic, it is very likely that the touching point in the L1 case is at a corner of the square, cube or hyper-square defined by ∥x∥1. On the other hand, in the L2 case, the circle or ball or hyper-ball defined by ∥x∥2 does not have such a property. Since a hyper-ball has no corner at all, the probability for the touching point to arrive at a sparse position becomes very slim. Intuitively, if the square/cubic boundary expands in the L1 case, the only way to reach a non-sparse solution is for the L1 boundary to meet the solution subspace on the ‘side’ of the titled square/cube. But this only happens if the solution subspace is in parallel to the square/cubic boundary, which could happen, such as in an under-determined case, but with a very low probability. Thus, the L1 solution will almost certainly be sparse, which explains why the L1 solution tends to give a sparse solution. Accordingly, the L1-norm minimization is widely performed to satisfy the sparse constraint.
The L1-norm-based regularization shows a high utility in the inverse problem solution. Theoretically, the convergence of this solution is also important. An iterative solution to equation (2.2) with total variation minimization that is measured with an L1-norm has been proved to be convergent (Chambolle 2004). Nevertheless, the best operator for the sparsest solution should be established based on the L0-norm minimization that counts the number of nonzero components of the solution vector, leading to the highest representation efficiency. Unfortunately, the L0-norm solution is an NP problem and computationally impracticable. In 2006, it was demonstrated that the L1-norm prior is equivalent to the L0-semi-norm prior in signal recovery for some important linear systems (Dandes 2006). Thus, the L1-norm-based sparse regularization is popular in solving the optimization problem. Moreover, being capable of extracting structural information efficiently in images, various machine learning-based techniques have been applied in the optimization framework, which exhibit excellent performance, superior to conventional sparse operators such as total variation minimization (Chambolle and Lions 1997).
2.2 Single-layer neural network
In this section, we will introduce two typical dictionary learning algorithms both of which are single-layer neural networks. Although these algorithms are theoretically equivalent to the sparsity constraint method (Olshausen and Field 1996) and ICA method (Bell and Sejnowski 1997) as we introduced in the previous chapter, the algorithms introduced in this chapter are more data-driven and have been widely used in practice.
Previously, we introduced Bayesian reasoning, in which the prior information of images is of great importance and must be analytically expressed. Thus, it is an essential task to find an efficient measure to express different information contents. This points to the field of image representation and sensory coding. The HVS tends to ignore the first-order and second-order features when sensing structures in natural scenes. Since the HVS has an efficient strategy for conveying as much information as possible, features used by the HVS should be statistically independent of each other. Researchers suggested a number of de-correlating strategies to remove statistical redundancy from natural images and represent the original information by a dictionary consisting of over-complete atoms.
Sparse dictionary learning is a representation method aiming at finding a sparse representation of an input signal or image as a linear combination of basic building units. One of the most important applications of sparse dictionary learning is in the field of compressed sensing or signal recovery. In the context of compressed sensing, a high-dimensional input signal can be recovered with only a few dictionary components provided that the signal is sparse. A sparse representation of that input signal is an important assumption since not all signals satisfy the sparsity condition. One of the key principles behind dictionary learning is that the dictionary must be generated from data. In other words, the dictionary is not constructed according to a principle or principles in advance. The dictionary learning methods are developed to represent an input signal with a minimized number of dictionary components and at maximized representation accuracy.
Let a dictionary be denoted as D∈Rn×K, each column of D denoted as dk is a component of the dictionary and also referred to as an atom of the dictionary. Without loss of generality, the dictionary D has K atoms and each atom consists of n elements. A signal or image (after vectorization) y∈Rn×1 can be represented sparsely as a linear combination of atoms in the dictionary D, which is formulated as y=Dα, or approximated as y≈Dα, and ∥y−Dα∥p⩽ε for some small value ε and a certain Lp-norm. In order to seek a sparse solution, a variety of regularization schemes were proposed, which reflect prior information of coding coefficients α∈RK×1, into the optimization problem, such as the L0-norm constraint
minx∥α∥0s.t.y=Dα(2.4)
or
minx∥α∥0s.t.∥y−Dα∥2⩽ε,(2.5)
where ∥a∥0 denotes the total number of nonzero entries in the vector α. The use of the L0-norm as a measure of sparsity makes the problem nonconvex and, as mentioned earlier, solving the equation directly is an NP-hard problem (Elad 2010). The choice of these atoms is crucial for efficient sparse representation, but it is not trivial. Therefore, it is informative to study what kind of dictionary expression or visual coding is accurate and efficient.
Different from the dictionary learning and sparse representation methods presented in chapter 1, in the following sections we will introduce more efficient methods to complete the tasks. In general, the methodology of dictionary learning and sparse representation contains two tasks. The first is to learn an over-complete dictionary D in equations (2.4) and (2.5) from a set of data which captures plenty of essential structures, i.e. the atoms of the dictionary such that any single datum can be sparsely or independently represented by a few numbers of the structures of the dictionary. The second is sparse coding which means finding, for a specific signal, the representation vector α regarding the dictionary D to meet the sparse constraint.