Most NLP tasks rely on large bodies of text. Such a body of text is known as a text corpus, and it is usually stored in a structured format. For social media analytics, a text corpus is created from social media and web text. For new problems, one can build a custom corpus. Some of the available text corpora include the following:
• Reuters—It is a collection of news documents.
The emergence of DNNs revolutionized the entire field of NLP. Using DNNs for NLP gave rise to high-performance language models. A DNN learns features automatically from the dataset, which makes the language model efficient and independent of the programmer. Traditional NLP techniques, by contrast, depend mainly on the programmer's ability to create a language model.
DNNs work well for most NLP applications. Text summarization, NLG, next word prediction, question answering, chatbots, etc., can all be handled efficiently using DNNs. Text summarization has a wide range of applications because readers are often short of time. It is the process of extracting the meaningful concepts from a large text and creating a shorter version of that text without losing its meaning. A summary reduces reading time and conveys the document content effectively. Automatic summarization is an active research topic in NLP. Extractive summarization creates the summary by extracting essential sentences or words from the document and combining them into a summary text, whereas abstractive summarization identifies the main concepts in the text and then writes a summary in natural language using NLG techniques.
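To make the extractive side of this distinction concrete, the following is a minimal sketch, assuming a simple word-frequency score per sentence. The function name `extractive_summary` and the scoring rule are illustrative assumptions, not the method used later in this section, which is abstractive.

```python
import re
from collections import Counter

def extractive_summary(text, num_sentences=1):
    """Return the highest-scoring sentences, kept in their original order."""
    # Split into sentences after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Word frequencies over the whole document.
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        # Average document frequency of the words in the sentence.
        words = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[w] for w in words) / max(len(words), 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])  # restore document order
    return " ".join(sentences[i] for i in chosen)

# Hypothetical review text, used only for illustration.
review = "The battery is great. The battery lasts two days. Shipping was slow."
print(extractive_summary(review, 1))
```

The output is always made of sentences copied from the input. An abstractive model, in contrast, may generate words that never appear in the review.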
There are various approaches to the summarization task. However, abstractive summaries are hard to produce with traditional methods because they involve NLG. Abstractive summarization can be done effectively using deep learning techniques. Many architectures exist for implementing DNNs; the main ones include CNNs and RNNs. CNNs work well with image data, and RNNs are suitable for processing sequential data. Simple RNNs are rarely used because they lack a dedicated memory and cannot remember long sequences. LSTMs and GRUs are improved versions of the RNN designed specifically for sequence processing. They capture long-term dependencies with the help of memory units.
ABSTRACTIVE TEXT SUMMARIZATION USING LSTM AND GRU
Text summarization is used to create summaries of product reviews on shopping websites such as Amazon. Product review summarization can be modeled as a sequence to sequence problem: one text sequence is converted to another text sequence with the help of NLG. Typically, sequence to sequence language models are built using an encoder-decoder architecture. An LSTM/GRU encoder encodes the text sequence into a context vector. The decoder takes the context vector and the internal states of the LSTM/GRU encoder as input and generates the summary of the review word by word. The decoder is trained to predict the next word. It starts producing the summary when it receives a “start” token and stops when it produces an “end” token.
Here, summaries of Amazon product reviews are created by building an LSTM model and a GRU model, and the performance of both models is compared using sparse categorical cross-entropy as the loss function. The workflow is given as follows.
Load Dataset
The Amazon product review dataset can be downloaded from http://jmcauley.ucsd.edu/data/amazon/. Reviews of Books, Electronics, and other categories are available on the website. The “Cell Phones and Accessories” category, which contains 194,439 reviews, is used for performing text summarization. The metadata of the dataset used is given as follows.
FIGURE 14.15 Sample preview of Amazon review dataset.
The dataset contains many columns that are not relevant to our problem, so it is preprocessed to keep only the relevant information. In addition, contractions such as [can't, aren't] should be expanded to [can not, are not]. Then, special characters and stop words should be removed. Apply the preprocessing separately to the summary and to the review text. After preprocessing, save cleaned_text and cleaned_summary as separate columns in the data frame. Add a “start” token at the beginning and an “end” token at the end of each summary.
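The cleaning steps above can be sketched as follows. This is a minimal illustration, assuming tiny hand-written contraction and stop-word tables; `CONTRACTIONS`, `STOP_WORDS`, `clean_text`, and `clean_summary` are hypothetical names, and a real pipeline would use much fuller lists.

```python
import re

# Tiny illustrative tables (assumptions); real lists are far longer.
CONTRACTIONS = {"can't": "can not", "aren't": "are not"}
STOP_WORDS = {"a", "an", "the", "is", "it", "and", "this", "i"}

def clean_text(text, remove_stop_words=True):
    """Lowercase, expand contractions, drop special characters and stop words."""
    text = text.lower()
    # Expand contractions before the apostrophes are stripped.
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Replace everything that is not a letter or whitespace with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    words = text.split()
    if remove_stop_words:
        words = [w for w in words if w not in STOP_WORDS]
    return " ".join(words)

def clean_summary(summary):
    """Clean a summary and wrap it in the start and end tokens."""
    return "start " + clean_text(summary) + " end"

print(clean_text("I can't believe it, the case is GREAT!!!"))
print(clean_summary("Good quality!"))
```

Expanding contractions before stripping punctuation matters: if the apostrophe were removed first, "can't" would become "can t" and could no longer be matched.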
Prepare the Dataset for Modeling the Language Model Based on Our Problem
After preprocessing, prepare the dataset for feeding it into an NN. The dataset is split into training data and test data. A random split is done using the train_test_split function in sklearn. About 20% of the data is used for testing and the remaining 80% for training.
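The idea behind the random 80/20 split can be sketched in plain Python. This is a stand-in for illustration, not sklearn's implementation; the function name `split_train_test`, the seed, and the toy data are assumptions.

```python
import random

def split_train_test(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and hold out the last test_fraction for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = int(round(len(shuffled) * test_fraction))
    cut = len(shuffled) - n_test
    return shuffled[:cut], shuffled[cut:]

# Hypothetical (review, summary) pairs standing in for the cleaned data frame.
pairs = [(f"review {i}", f"summary {i}") for i in range(10)]
train, test = split_train_test(pairs)
print(len(train), len(test))
```

Each review stays paired with its own summary because the pairs are shuffled together rather than as two separate lists.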
After splitting, a corpus-specific dictionary is built from the corpus with the help of a tokenizer. The tokenizer first creates a dictionary of the unique words in the corpus: each unique word is a key, and its value is the number of times the word occurs. It then sorts the dictionary by occurrence count in descending order, and the rank of each word after sorting becomes its word index. For example, if the word “start” has the highest occurrence count in the corpus, it is at index 1 in the word index dictionary.
Then, each review, which is a sequence of words, is converted into a list of integer indices. For example, consider the cleaned_summary “good quality” with the “start” token at the beginning and the “end” token at the end, that is, [start good quality end]. It is converted to [1, 7, 26, 2].
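The word index construction and the word-to-integer conversion can be sketched as follows. This is a minimal plain-Python illustration, not the tokenizer's actual implementation; `build_word_index` and `texts_to_sequences` are hypothetical helper names, and the three summaries are made up, so the indices differ from the [1, 7, 26, 2] example above.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to its rank by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.split())
    # most_common() sorts by count, keeping first-seen order for ties.
    # Index 0 is left unused so that it can serve as the padding value.
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index):
    """Replace each known word with its integer index; unknown words are skipped."""
    return [[word_index[w] for w in text.split() if w in word_index] for text in texts]

# Hypothetical cleaned summaries.
summaries = ["start good quality end", "start good end", "start nice end"]
word_index = build_word_index(summaries)
print(word_index)
print(texts_to_sequences(["start good quality end"], word_index))
```

Because every summary carries the “start” and “end” tokens, those two words end up with the smallest indices.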
Next, find the longest review sequence and bring every sequence to that length by padding with 0s. Before feeding the data into the network, standardize the lengths of the review and the summary, since these fixed lengths determine the input size needed by the embedding layers of the encoder and the decoder.
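Padding can be sketched in a few lines. The text does not say whether the 0s go before or after the words, so appending them at the end is an assumption here, as is the helper name `pad_sequences_post`.

```python
def pad_sequences_post(sequences, max_len=None, pad_value=0):
    """Pad (or truncate) every sequence to max_len, adding pad_value at the end."""
    if max_len is None:
        max_len = max(len(seq) for seq in sequences)  # length of the longest sequence
    padded = []
    for seq in sequences:
        seq = list(seq)[:max_len]  # truncate anything longer than max_len
        padded.append(seq + [pad_value] * (max_len - len(seq)))
    return padded

print(pad_sequences_post([[1, 7, 26, 2], [1, 5, 2]]))
```

Index 0 is safe to use as the padding value only because the word index starts at 1.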
FIGURE 14.16 Data representation after cleaning.
FIGURE 14.17 Word_index dictionary.
Build the Model
An encoder-decoder architecture is used for building sequence to sequence models where the input and output are sequences of different lengths. Here, a word-level sequence to sequence model is required for the comparison. The encoder-decoder architecture can be built using SimpleRNN, LSTM, GRU, etc., but SimpleRNN fails to remember lengthy sequences. LSTM and GRU were introduced to capture the dependencies in lengthy sequences. An encoder-decoder architecture is therefore modeled using both LSTM and GRU.
The encoder encodes the sequence and generates a context vector. The output generated by the encoder, along with its internal states, is given as input to the decoder. The decoder decodes the context vector and uses the internal states to generate the output sequence word by word. The encoder and the decoder are separate NNs, each trained for its role in the problem.
Note: The RNN can be an LSTM or a GRU. For an LSTM, the internal state vector consists of the hidden state and the cell state at each time step. For a GRU, the internal state vector consists of the hidden state only. V denotes the vocabulary.
To solve the text summarization problem, the encoder is trained as a classifier network: it classifies each word in the review as important or unimportant. A context vector is created from the important words and emitted as the encoder output.
The decoder takes the summary as input and generates the same summary offset by one position. It is trained to predict the next word given the previous word.
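The one-position offset between decoder input and decoder target can be sketched as follows, using the [start good quality end] example encoded as [1, 7, 26, 2]. The helper name `make_decoder_pairs` is hypothetical.

```python
def make_decoder_pairs(summary_sequence):
    """Split one encoded summary into the decoder input and its shifted target."""
    decoder_input = summary_sequence[:-1]   # every token except the last one
    decoder_target = summary_sequence[1:]   # the same tokens shifted left by one
    return decoder_input, decoder_target

sequence = [1, 7, 26, 2]  # [start good quality end]
decoder_input, decoder_target = make_decoder_pairs(sequence)
print(decoder_input)   # tokens the decoder sees
print(decoder_target)  # tokens the decoder should predict
```

At each position the target is the word that follows the input word, so the decoder sees “start” and should predict “good”, sees “good” and should predict “quality”, and so on until “end”.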
In addition, an attention mechanism can be used to predict a more accurate summary. Without attention, the encoder gives equal importance to all words in the sequence, whereas the goal is an encoder that pays more attention to specific parts of the sequence. In Amazon reviews, for example, the beginning and the end of a review usually convey the most information, so those parts need more attention. The attention mechanism achieves this by assigning a weight to every part of the sequence, with larger weights for the parts that matter more.
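The weighting idea can be sketched numerically. The text does not say which scoring function is used, so this sketch assumes simple dot-product scores followed by a softmax; the function names and the toy vectors are illustrative assumptions.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_states):
    """Weight each encoder state by its dot-product similarity to the decoder state."""
    scores = [sum(d * e for d, e in zip(decoder_state, enc)) for enc in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # The context vector is the weighted sum of the encoder states.
    context = [sum(w * enc[k] for w, enc in zip(weights, encoder_states)) for k in range(dim)]
    return weights, context

# Toy encoder states for a three-word review (hypothetical values).
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attend([1.0, 1.0], encoder_states)
print(weights)
print(context)
```

The third encoder state is the most similar to the decoder state, so it receives the largest weight and contributes most to the context vector.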
Train the Model
FIGURE 14.18 Encoder-decoder architecture.
An encoder-decoder architecture with an attention mechanism is created and fit to the Amazon reviews dataset. Each review from the dataset is processed, and an output is generated. About 80% of the data was used for training and 20% for validation or testing, i.e., the model trains on 155,240 samples and validates on 38,810 samples. An early stopping mechanism stops the training when there is no significant improvement in the metric used for evaluation. Here, training stops when the validation loss increases or no longer decreases significantly.
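The early stopping rule can be sketched as a small function over a list of validation losses. The `patience` and `min_delta` parameters and the loss values are assumptions chosen for illustration; the text does not state the settings actually used.

```python
def early_stopping_epoch(val_losses, patience=2, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None if it never stops."""
    best = float("inf")
    waited = 0  # epochs since the last significant improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return None

# Hypothetical validation losses, one per epoch.
losses = [2.9, 2.5, 2.3, 2.31, 2.35, 2.2]
print(early_stopping_epoch(losses))
```

With a patience of 2, training halts after two consecutive epochs without improvement, so the later drop to 2.2 is never reached. This is the usual trade-off between saving training time and stopping too early.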
Make Predictions Using the Trained Model
The final step is to make predictions using the trained model. A new review is sent to the encoder as input, and the encoder generates a context vector for that input sequence. The output of the encoder, along with the internal states, is sent as input to the decoder. The decoder starts predicting the target sequence word by word when it receives a “start” token along with the encoder output and internal states. It keeps predicting the next word until it produces an “end” token or the maximum summary length is reached. The predicted sequence is a list of word indices such as [1,20,21,2]. Decode the sequence using the dictionary that has already been created. The index sequence [1,20,21,2] is converted to [start really nice end].
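The decoding loop can be sketched as follows. The trained decoder is replaced by a fixed lookup table (`toy_predict_next`), which is purely a stand-in for illustration; the function names and the small `word_index` are assumptions that reuse the indices from the example above.

```python
def greedy_decode(predict_next, word_index, max_len=10):
    """Generate word ids one at a time until 'end' appears or max_len is reached."""
    index_word = {i: w for w, i in word_index.items()}  # reverse lookup: id -> word
    start_id, end_id = word_index["start"], word_index["end"]
    sequence = [start_id]
    while len(sequence) < max_len:
        next_id = predict_next(sequence)
        sequence.append(next_id)
        if next_id == end_id:
            break
    return sequence, [index_word[i] for i in sequence]

word_index = {"start": 1, "end": 2, "really": 20, "nice": 21}

def toy_predict_next(sequence):
    """Hypothetical stand-in for the trained decoder: a fixed next-word table."""
    table = {1: 20, 20: 21, 21: 2}
    return table[sequence[-1]]

ids, words = greedy_decode(toy_predict_next, word_index)
print(ids)
print(words)
```

In a real model, `predict_next` would run the decoder on the encoder output and states and return the most probable word id. The `max_len` guard keeps the loop finite when the model never emits “end”.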
Dataset used: Amazon Cell Phones and Accessories reviews. Number of reviews: 194,439.
Activation function: Softmax.
Batch size: 512.
TABLE 14.7 LSTM and GRU Performance Comparison
To compare LSTM and GRU, one model of each type is created using Keras for Amazon review summarization under similar constraints. The performance of both models is evaluated using the sparse categorical cross-entropy loss on the training and validation data. The LSTM reached a lower loss than the GRU network. However, the GRU reached almost the same loss value in less time. In addition, the words predicted by the LSTM and the GRU differ for the same set of reviews.
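For reference, the loss used in the comparison can be computed by hand. The following is a minimal sketch of sparse categorical cross-entropy for a single sequence; the three-word vocabulary and the probabilities are made up for illustration and are not results from the models.

```python
import math

def sparse_categorical_crossentropy(target_ids, predicted_probs):
    """Mean negative log-probability that the model assigns to the correct word ids."""
    losses = [-math.log(probs[t]) for t, probs in zip(target_ids, predicted_probs)]
    return sum(losses) / len(losses)

# Two decoding steps over a hypothetical three-word vocabulary.
targets = [2, 0]                 # integer ids of the correct words
probs = [[0.1, 0.2, 0.7],        # softmax output at step 1
         [0.5, 0.25, 0.25]]      # softmax output at step 2
loss = sparse_categorical_crossentropy(targets, probs)
print(round(loss, 4))
```

It is called "sparse" because the targets are plain integer word indices rather than one-hot vectors, which fits the integer sequences produced by the tokenizer.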