(IR) techniques, which vary in their weight allocation for each term.
Figure 1.3 Term weighting schemes for feature extraction.
IR techniques such as the Vector Space Model (VSM), Latent Semantic Indexing (LSI), topic modeling, and clustering are used for feature extraction from text documents in the term weighting process. The following subsections describe the rationale of the different feature extraction techniques used in text analysis.
1.4.1 Vector Space Model
In the vector space model, documents are represented as vectors through the Bag-of-Words (BoW) model, which treats a document as a “bag” of words and ignores the order in which the words occur. Term weights in this representation may be assigned using either the Boolean model or the vector space model. The Boolean model gives a weight of 1 or 0 based on the presence or absence of a word in the document, whereas the vector space model uses the term frequency as the weight. Term weighting is an important factor in document representation and decides the efficiency of the IR system. It includes three components: Term Frequency (TF), Inverse Document Frequency (IDF), and document length normalization. TF gives the distribution of each word within a document, whereas IDF expresses the importance of each word across the collection; the more documents a word occurs in, the lower its IDF value. Equation 1.1 determines the weight of a word using the TF-IDF scheme:

w_ij = tf_ij × log(N / df_i)    (1.1)

where tf_ij is the term frequency of term “i” in document “j,” N is the total number of documents in the collection, df_i is the document frequency of term “i” in the collection, and w_ij is the weight of term “i” in document “j.” Generally, a Term Document (TD) matrix of size “m × n” is built between words and documents, where “m” is the number of terms (rows), “n” is the number of documents (columns), and each entry w_ij is the weight of the corresponding term.
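As a minimal sketch of the TF-IDF weighting in Equation 1.1, the following Python fragment computes w_ij = tf_ij × log(N / df_i) over a small corpus; the example documents and the whitespace tokenization are assumptions made only for illustration.

```python
import math
from collections import Counter

# A toy corpus; the documents and vocabulary here are illustrative only.
documents = [
    "the bank approved the loan",
    "the bank of the pond",
    "fish swim in the pond",
]

# Term frequency tf_ij: raw count of term i in document j.
tokenized = [doc.split() for doc in documents]
tf = [Counter(tokens) for tokens in tokenized]

# Document frequency df_i: number of documents containing term i.
N = len(documents)
df = Counter()
for tokens in tokenized:
    df.update(set(tokens))

# TF-IDF weight w_ij = tf_ij * log(N / df_i), as in Equation 1.1.
weights = [
    {term: count * math.log(N / df[term]) for term, count in doc_tf.items()}
    for doc_tf in tf
]

for j, w in enumerate(weights):
    print(f"document {j}:", {t: round(v, 3) for t, v in w.items()})
```

Note that a word occurring in every document (such as “the” above) receives an IDF of log(1) = 0, so its weight vanishes, which is exactly the behavior the IDF component is meant to provide.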
1.4.2 Latent Semantic Indexing (LSI)
The Term Document matrix is very sparse when it is built over all words in the document collection. Many terms are unique to individual documents and are not repeated across all documents, which increases the size of the matrix. A disadvantage of the simple vector space model is that it cannot relate two synonymous words present in the documents. In order to reduce the sparseness of the matrix and to address the synonymy issue, the vector space model can be extended and Latent Semantic Indexing (LSI) can be used for document indexing. Figure 1.4 shows the synonymy and polysemy of words in the English language. The LSI technique analyzes text documents to determine their hidden meanings or concepts. For example, when the word “bank” occurs along with words like mortgage and loan, it can be concluded that it is associated with the financial sector; when “bank” occurs along with words like fish and pond, it is associated with a water body. LSI addresses this problem by not merely comparing words in the document space but by comparing both words and documents in the concept space.
Figure 1.4 Synonymy and polysemy issues in English.
LSI uses Singular Value Decomposition (SVD) to reduce the dimensions of the TD matrix, reconstructing an approximation of the matrix from as little information as possible. SVD is a matrix factorization technique that factors the “m × n” matrix into three matrices USV^T, where U represents the term matrix in the concept space, V^T represents the document matrix in the concept space, and S is the matrix of singular values from which the number of dimensions or concepts can be selected. The difficulty in applying SVD lies in deciding how many dimensions or concepts exist in the document collection when approximating the matrix. The original TD matrix can be approximated with “k” dimensions, where k is much smaller than the rank of the TD matrix. The value of “k” is determined empirically; it usually ranges between 100 and 350 for large data collections. Figure 1.5 shows the schematic representation of the truncated TD matrix.
Figure 1.5 Approximated TD matrix by SVD.
LSI indexes words using a low-dimensional representation and word co-occurrence. The association of terms with documents, i.e., the semantic structure, improves the relevancy of query results [56]. A value of “k” in the low hundreds improves precision and recall. LSI has its own disadvantages, such as higher computation time and negative values in the approximated TD matrix.
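As an illustration of the truncated SVD used by LSI, the sketch below factors a toy TD matrix and keeps only the top k singular values to obtain a rank-k concept-space approximation; the use of NumPy and the matrix values are assumptions for illustration only.

```python
import numpy as np

# Toy term-document matrix (m terms x n documents); the values stand in for
# TF-IDF weights and are not taken from a real collection.
td = np.array([
    [2.0, 0.0, 1.0, 0.0],
    [0.0, 1.5, 0.0, 1.0],
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 2.0, 0.0, 1.5],
    [0.5, 0.0, 1.0, 0.0],
])

# SVD: td = U * diag(S) * Vt
U, S, Vt = np.linalg.svd(td, full_matrices=False)

# Keep only the top-k singular values/concepts (k = 2 for this toy data;
# in practice k is chosen empirically, often in the low hundreds).
k = 2
U_k, S_k, Vt_k = U[:, :k], S[:k], Vt[:k, :]

# Rank-k approximation of the term-document matrix.
td_k = U_k @ np.diag(S_k) @ Vt_k

# Documents can now be compared in the k-dimensional concept space:
# the columns of diag(S_k) @ Vt_k are the document vectors.
doc_concepts = np.diag(S_k) @ Vt_k
print(np.round(td_k, 2))
print(np.round(doc_concepts, 2))
```

The approximated matrix td_k may contain negative entries even though the original weights were non-negative, which is one of the disadvantages of LSI mentioned above.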
1.4.3 Clustering Techniques
Clustering methods identify groups of similar data points in a data set. The K-Means algorithm, a centroid model, is an iterative clustering algorithm that groups each data point with its closest centroid. It is important to have prior knowledge of the data set, as this algorithm takes the number of clusters as input. It partitions the “n” data points into “k” clusters such that each data point belongs to the cluster with the nearest mean. Many variations of K-Means exist, such as using the Euclidean distance between the centroid and the data point, fuzzy C-Means clustering, and so on. Like LDA, K-Means is an unsupervised learning algorithm for which the user must specify the number of clusters required. The main difference is that K-Means produces “k” disjoint clusters, whereas LDA assigns a document to a mixture of topics. Problems such as synonymy and polysemy can therefore be better resolved with LDA than with the K-Means algorithm.
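A minimal sketch of K-Means clustering on a handful of toy document vectors is given below; the use of scikit-learn, the example vectors, and the choice of k = 3 are assumptions made only for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy document vectors (e.g., rows of a TF-IDF matrix); values are illustrative.
X = np.array([
    [2.0, 0.1, 0.0],
    [1.8, 0.0, 0.2],
    [0.1, 1.9, 0.0],
    [0.0, 2.1, 0.3],
    [0.2, 0.0, 2.0],
    [0.0, 0.1, 1.8],
])

# K-Means requires the number of clusters k as input (prior knowledge of the data).
k = 3
kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)   # each document is assigned to exactly one cluster

print("cluster labels:", labels)
print("centroids:\n", np.round(kmeans.cluster_centers_, 2))
```

Unlike LDA, each document here ends up in exactly one of the k disjoint clusters rather than in a mixture of topics.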
1.4.4 Topic Modeling
Vector Space Models (VSMs) take the raw text data and index the terms against the documents. The LSI technique shows 30% improved accuracy compared with traditional VSMs [29]. Further, previous research carried out in the IR domain with the LSI technique indicates that LSI improves recall, but precision is not comparably improved [22, 26]. This disadvantage can be overcome by processing the raw data before applying the indexing technique. Hidden concepts in the document collection can be included while indexing the terms and documents, which substantially improves the accuracy [58]. The Latent Dirichlet Allocation (LDA) technique [58] uncovers latent “topics” in a document collection, where the topics act as a kind of feature. It is a language model that models the topics of documents in a probabilistic manner. Each document may contain a mixture of different topics, and each topic is characterized by the occurrences of words related to it in the documents. Figure 1.6 shows the framework of the LDA model for topic (or feature) categorization of text documents.
Figure 1.6 LDA framework.
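The following sketch illustrates LDA topic modeling in the spirit of Figure 1.6; the use of scikit-learn's LatentDirichletAllocation, the toy corpus, and the choice of two topics are illustrative assumptions rather than part of the original framework.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# A toy corpus; the documents below are illustrative assumptions.
documents = [
    "bank approved the mortgage loan",
    "loan interest at the bank",
    "fish swim near the bank of the pond",
    "pond fish and water plants",
]

# LDA operates on raw term counts (a bag-of-words matrix).
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(documents)

# Fit an LDA model with two latent topics (k = 2 chosen for the toy data).
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)   # per-document topic mixture

# The top words per topic show which terms characterize each latent concept.
terms = vectorizer.get_feature_names_out()
for t, topic in enumerate(lda.components_):
    top = topic.argsort()[-4:][::-1]
    print(f"topic {t}:", [terms[i] for i in top])
print("document-topic mixtures:\n", doc_topics.round(2))
```

Each row of doc_topics is a mixture over the two topics, so a document about “bank” can receive probability mass under both the finance-related and the water-related topic, which is how a word shared between topics (such as Word 2 in Figure 1.6) is handled.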
For example, Word 2 is categorized under two different topics, say, “topic 1” and “topic 2.” The context of this word varies and it is determined