Group of authors

Semantic Web for Effective Healthcare Systems



(IR) techniques, which vary in their weight allocation for each term.

Schematic illustration of term weighting schemes for feature extraction.

      IR techniques such as the Vector Space Model (VSM), Latent Semantic Indexing (LSI), topic modeling, and clustering are used to weight terms during feature extraction from text documents. The following subsections describe the rationale behind the different feature extraction techniques used in text analysis.

      1.4.1 Vector Space Model

      1.4.2 Latent Semantic Indexing (LSI)

Schematic illustration of synonymy and polysemy issues in English.

Schematic illustration of approximated TD matrix by SVD.

      LSI indexes words using a low-dimensional representation derived from word co-occurrence. By capturing the association of terms with documents, i.e., the semantic structure, it improves the relevance of the results returned for queries [56]. A value of “k” in the low hundreds improves precision and recall. LSI has its own disadvantages, such as longer computation time and negative values in the approximated TD matrix.
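The low-dimensional representation mentioned above is commonly written as a truncated singular value decomposition (SVD). The following is a sketch in standard LSI notation; the symbols A, U, Σ, V, t, d, and q are assumptions introduced here for illustration and do not come from the text.

```latex
% Assumed notation: A is the t x d TD matrix (t terms, d documents).
A = U \Sigma V^{T}, \qquad
A \approx A_k = U_k \Sigma_k V_k^{T}, \qquad k \ll \min(t, d)
% U_k: t x k, \Sigma_k: k x k diagonal holding the k largest singular values, V_k: d x k.
% A query vector q (length t) can be projected into the same k-dimensional space:
\hat{q} = \Sigma_k^{-1} U_k^{T} q
```

Because the singular vectors in U_k and V_k generally contain entries of mixed sign, the entries of A_k can be negative even when A holds only non-negative term counts. Computing the decomposition is the main source of the extra computation time.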

      1.4.3 Clustering Techniques

      Clustering methods identify groups of similar data points in a data set. K-Means, a centroid-based model, is an iterative clustering algorithm that assigns each data point to its closest centroid. Prior knowledge of the data set is important, because the algorithm takes the number of clusters as input. It partitions the “n” data points into “k” clusters such that each data point belongs to the cluster with the nearest mean. Many variations of K-Means exist, such as using the Euclidean distance between the centroid and the data point, fuzzy C-Means clustering, and so on. Like LDA, K-Means is an unsupervised learning algorithm for which the user must specify the number of clusters. The main difference is that K-Means produces “k” disjoint clusters, whereas LDA assigns a document to a mixture of topics. Problems such as synonymy and polysemy can be resolved better with LDA than with K-Means.
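The assignment and update steps described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function name `kmeans`, the choice to seed the centroids with the first k points, and the toy two-dimensional data are all assumptions made for the example, and Euclidean distance is used as the distance measure.

```python
import math


def kmeans(points, k, iters=100):
    """Minimal K-Means sketch (Lloyd's algorithm). Returns (centroids, labels)."""
    # Assumption: seed the centroids with the first k points (deterministic,
    # for illustration only; real implementations use better seeding).
    centroids = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(iters):
        # Assignment step: each point joins the cluster with the nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids will not move
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, labels


# Hypothetical toy data: two well-separated groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Each point receives exactly one label, which is the disjoint partition the text contrasts with LDA's mixture of topics. The number of clusters `k` has to be supplied by the caller, as noted above.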

      1.4.4 Topic Modeling

Schematic illustration of LDA framework.

      For example, Word 2 is categorized under two different topics, say, “topic 1” and “topic 2.” The context of this word varies and it is determined