Document clustering is a machine learning technique that groups documents into clusters based on their similarity. For text mining, clustering is used for various functions such as document selection, organization, summarization, and visualization.
There are multiple approaches to clustering, and a wide variety of algorithms exist for completing this task.
Clustering algorithms are typically unsupervised (refer to our definition of unsupervised learning in Chapter 1).
The best-known clustering algorithm is the “k-means” algorithm. In this algorithm, each cluster is represented by the mean of the data points assigned to it (here, the documents that the algorithm has grouped together based on a similarity or distance measure). Similar clustering techniques use other measures of central tendency, such as the median or the mode. Other clustering methodologies include density-based clustering and hierarchical clustering.
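The k-means idea can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation. It assumes that documents have already been converted to numeric vectors; the function name `k_means`, the toy two-term count vectors, and the hand-picked starting centroids are all assumptions made for the example.

```python
from math import dist

def k_means(points, centroids, iterations=10):
    """Cluster points around the given starting centroids (plain k-means)."""
    centroids = [tuple(c) for c in centroids]
    assignments = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:
                mean = tuple(sum(x) / len(members) for x in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(old)  # leave an empty cluster's centroid as-is
        if new_centroids == centroids:
            break  # centroids stopped moving, so the clustering has converged
        centroids = new_centroids
    return assignments, centroids

# Hypothetical document vectors: (count of "loan", count of "goal")
docs = [(5, 0), (4, 1), (6, 0), (0, 5), (1, 4), (0, 6)]
labels, centers = k_means(docs, centroids=[(5, 0), (0, 5)])
```

With this toy data, the first three documents end up in one cluster and the last three in the other. In practice the starting centroids are usually chosen at random or by a seeding heuristic, and the result can depend on that choice.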
Document classification is a machine learning technique that assigns predefined classes to documents.
In contrast to clustering algorithms, classification algorithms are typically supervised. Researchers provide the algorithm with training examples that include the correct class (also called classification or category) and the features used to represent each document (such as the vector-based representation). The classification algorithm then constructs a model that best maps the given features to each class.
When the training data have only two classes, a binary classifier is constructed. Where there are more than two classes, a multi-class classifier is required (Blake, 2011).
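To make the supervised workflow concrete, the sketch below trains a binary classifier on labeled examples and then assigns a class to an unseen document. It uses a multinomial naive Bayes model with add-one smoothing as one illustrative choice of algorithm. The function names `train` and `classify`, the bag-of-words features, and the four training sentences are assumptions made for this example.

```python
from collections import Counter, defaultdict
from math import log

def train(examples):
    """Build per-class document and word counts from (text, label) pairs."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        words = text.lower().split()
        class_docs[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return class_docs, word_counts, vocab

def classify(text, model):
    """Return the class with the highest naive Bayes log-probability."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    scores = {}
    for label in class_docs:
        total_words = sum(word_counts[label].values())
        score = log(class_docs[label] / total_docs)  # class prior
        for word in text.lower().split():
            if word in vocab:  # ignore words never seen in training
                # Add-one (Laplace) smoothing avoids zero probabilities.
                count = word_counts[label][word] + 1
                score += log(count / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical training data with two classes (a binary problem).
training = [
    ("win a free prize now", "spam"),
    ("free money claim your prize", "spam"),
    ("meeting agenda for monday", "ham"),
    ("please review the project report", "ham"),
]
model = train(training)
prediction = classify("claim your free prize", model)
```

Here `prediction` is `"spam"`, because every word in the test message appears only in the spam training examples. The same code handles more than two classes without change, since it simply scores every label seen during training.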
Examples of classification algorithms include:
There are several applications of document classification, such as spam filtering, email routing, content tagging (which improves browsing and accelerates searches in extensive unstructured text collections), and customer opinion and sentiment analysis.
Entity and Relation Extraction
Entity extraction algorithms are used to extract entities such as person names, organization names, locations, dates, phone numbers, reference numbers, prices, amounts, and other items from documents.
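For entities with a regular surface form, such as dates, phone numbers, and prices, a very simple extractor can be written with hand-crafted patterns. The sketch below is only an illustration: the `PATTERNS` table, the `extract_entities` function, and the specific formats it recognizes (ISO-style dates, one phone layout, dollar prices) are assumptions, and extracting names of people or organizations generally needs more than regular expressions.

```python
import re

# Hypothetical, deliberately narrow patterns; real documents need broader coverage.
PATTERNS = {
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),    # e.g. 2023-04-15
    "phone": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),   # e.g. 555-123-4567
    "price": re.compile(r"\$\d+(?:\.\d{2})?"),       # e.g. $129.99
}

def extract_entities(text):
    """Return (entity_type, matched_text) pairs found in the text."""
    found = []
    for entity_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((entity_type, match.group()))
    return found

sample = "Invoice dated 2023-04-15 for $129.99. Call 555-123-4567 with questions."
entities = extract_entities(sample)
```

For the sample sentence, `entities` contains one date, one phone number, and one price.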
Relation extraction algorithms are used to identify and characterize relations between entities such as person-organization (e.g., an employee of), person-location (e.g., born in), or organization-location (e.g., headquartered in). Some algorithms focus on event extraction, which is aimed at identifying entities that are related to an event.
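A relation extractor can be sketched in a similar pattern-based way. The example below assumes that entities appear as runs of capitalized words and that each relation is signaled by a fixed phrase; the relation labels, the `NAME` pattern, and the `extract_relations` function are hypothetical and chosen only to mirror the person-organization, person-location, and organization-location examples above.

```python
import re

# Assumption: an entity mention is one or more capitalized words.
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"

# Hypothetical trigger phrases, each mapped to a relation label.
RELATION_PATTERNS = [
    ("employee_of", re.compile(rf"(?P<head>{NAME}) works for (?P<tail>{NAME})")),
    ("born_in", re.compile(rf"(?P<head>{NAME}) was born in (?P<tail>{NAME})")),
    ("headquartered_in", re.compile(rf"(?P<head>{NAME}) is headquartered in (?P<tail>{NAME})")),
]

def extract_relations(text):
    """Return (head_entity, relation, tail_entity) triples found in the text."""
    triples = []
    for relation, pattern in RELATION_PATTERNS:
        for match in pattern.finditer(text):
            triples.append((match.group("head"), relation, match.group("tail")))
    return triples

text = (
    "Ada Lovelace was born in London. "
    "Acme Corp is headquartered in Berlin. "
    "Grace Hopper works for Acme Corp."
)
relations = extract_relations(text)
```

Each sentence in the sample yields one triple, for example `("Grace Hopper", "employee_of", "Acme Corp")`. Fixed phrases like these miss most real-world wording, which is one reason learned approaches are used instead.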
Information extraction (IE) algorithms use various machine learning approaches, including rule learning-based methods, classification-based methods, and sequential labeling-based methods (Tang et al., 2008).
• Rule learning-based systems use predefined instructions for extracting the desired information (i.e., words or text fragments) from the text. They include:
There are many applications of information extraction from documents in today’s digital business environment. Examples include: