
Data Extraction using Question Answering Systems



Motivation:


Looking for particular data points of interest in a document or a collection of documents is like looking for a needle in a stack of heterogeneous haystacks. The data point of interest, say the number of employees represented by labor unions in a particular year, is sure to be reported in wildly free-flowing formats. Just take these examples:

  1. As of January 31, 2019, we had 4,650 employees worldwide, of which 1,919 were based in North America, 960 were based in Europe, and 1,771 were based in Asia-Pacific.

  2. At December 31, 2018, we had 4,525 employees. None of our employees are represented by a labor union or are subject to bargaining rights. We consider our relations with our employees to be positive.

  3. Currently, our manufacturing facilities in Canada employ approximately 850 workers, of which 700 are subject to five separate collective bargaining agreements. Two agreements, covering 400 employees, expire in November 2019 and June 2020.


Example 1 has no information about the data point, Example 2 does not use the exact wording required and does not include a number, and Example 3 buries the data point in a sentence with a rather complicated parse structure.


The traditional approaches of dependency parsing, pattern and regex matching, and entity relation extraction didn’t prove to be generic enough for fast onboarding of datasets. Writing custom expressions, patterns, and parses of the dependency tree proved to be quite tedious.


Thus, we started asking ourselves, how can we improve? And voila — the answer to this question was Question Answering.


Question Answering Systems:

For humans, answering questions is the most logical way to infer and learn. Question answering systems aren’t new. I am going to show my age here, but at one point, AskJeeves was my go-to search engine.



With the recent advancements in Transformer-based language models, we are approaching SkyNet levels of intelligence in NLP. So, did it seem natural to try this approach out? Indeed, it did!


A quick overview of BERT:

BERT uses Transformer encoder blocks. The Transformer encoder relies on an attention mechanism (multi-headed self-attention) that learns contextual relations between words (or sub-words) in text.

BERT alleviates the unidirectionality constraint by using a “masked language model” (MLM) pre-training objective. The MLM objective randomly masks some of the tokens in the input, and the goal is to predict each masked word based on its surroundings (the words to its left and right).

As opposed to directional models, which read the text input sequentially (left-to-right or right-to-left), the MLM objective lets the representation draw on both the left and the right context, which makes it possible to pre-train a deep bidirectional Transformer.
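To make the MLM objective concrete, here is a minimal sketch using the Hugging Face transformers fill-mask pipeline and the stock bert-base-uncased checkpoint; this is illustrative only and not part of our pipeline:

```python
# Illustrative MLM sketch: BERT predicts the masked token using both the left
# and the right context of the sentence.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# "[MASK]" is the mask token for bert-base-uncased.
for prediction in fill_mask("As of January 31, 2019, we had 4,650 [MASK] worldwide."):
    print(prediction["token_str"], round(prediction["score"], 3))
```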



In addition to the masked language model, BERT also uses a “next sentence prediction” (NSP) task as a pre-training objective. This makes BERT better at handling relationships between multiple sentences. During training, 50% of the inputs are pairs in which the second sentence is the actual next sentence in the original document; for the other 50%, a random sentence from the corpus is used as the second sentence.
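A similarly hedged sketch of the NSP setup, again with the stock bert-base-uncased checkpoint and arbitrary example sentences:

```python
# Illustrative NSP sketch: the model scores whether sentence_b plausibly
# follows sentence_a. Checkpoint and sentences are arbitrary examples.
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

sentence_a = "At December 31, 2018, we had 4,525 employees."
sentence_b = "None of our employees are represented by a labor union."

inputs = tokenizer(sentence_a, sentence_b, return_tensors="pt")
logits = model(**inputs).logits

# Index 0 = "sentence B follows sentence A", index 1 = "sentence B is random".
print(torch.softmax(logits, dim=-1))
```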



Simply put, BERT makes sure the model builds some sort of associations between individual words in a sentence (à la Dependency Tree) and between individual sentences.


Using BERT in Question Answering Systems:



Building a Question Answering System with BERT: SQuAD 2.0

For the Question Answering task, BERT takes the input question and passage as a single packed sequence. The input embeddings are the sum of the token embeddings and the segment embeddings. The input is processed in the following way before entering the model:

  1. Token embeddings: A [CLS] token is added to the input word tokens at the beginning of the question and a [SEP] token is inserted at the end of both the question and the paragraph.

  2. Segment embeddings: A marker indicating Sentence A or Sentence B is added to each token, which lets the model distinguish between the two segments. All tokens belonging to the question are marked as A, and those belonging to the paragraph are marked as B (see the sketch below).
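Roughly, the packed input looks like the sketch below; the tokenizer, checkpoint name, and example question/passage are illustrative assumptions rather than the exact ones our pipeline uses:

```python
# Sketch of packing a question and a passage into one BERT input sequence.
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

question = "How many employees are represented by labor unions?"
passage = ("Currently, our manufacturing facilities in Canada employ approximately "
           "850 workers, of which 700 are subject to five separate collective "
           "bargaining agreements.")

encoding = tokenizer(question, passage)

# [CLS] question tokens [SEP] passage tokens [SEP]
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))

# token_type_ids are the segment markers: 0 = question (A), 1 = passage (B)
print(encoding["token_type_ids"])
```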




To fine-tune BERT for Question Answering, two new vectors are introduced: a start vector and an end vector. The probability of each word being the start-word is calculated by taking the dot product between the final embedding of the word and the start vector, followed by a softmax over all the words. The word with the highest probability is selected as the start of the answer.

A similar process is followed to find the end-word.


Transformer layers used to predict the start-word and the end-word

Note: The start vector and the end vector will be the same for all the words.
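In NumPy terms, the span-selection head boils down to something like the following sketch; the shapes, names, and random values are purely illustrative:

```python
# Sketch of the span-selection head: dot product of each token's final embedding
# with a shared start vector and a shared end vector, followed by a softmax over
# all tokens. Real values come from the fine-tuned model; here they are random.
import numpy as np

seq_len, hidden = 384, 768
token_embeddings = np.random.randn(seq_len, hidden)  # final BERT layer output
start_vector = np.random.randn(hidden)                # learned during fine-tuning
end_vector = np.random.randn(hidden)                  # learned during fine-tuning

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

start_probs = softmax(token_embeddings @ start_vector)  # one probability per token
end_probs = softmax(token_embeddings @ end_vector)

# The tokens with the highest probabilities mark the predicted answer span.
print(int(start_probs.argmax()), int(end_probs.argmax()))
```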


To train BERT for Question Answering, we took the SQuAD 2.0 dataset, which handles unanswerable questions, and augmented it with pertinent question-answer pairs from our dataset. We trained on two g4dn.8xlarge instances using Keras multi-GPU training.
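A hedged sketch of what such a fine-tuning setup can look like with TensorFlow/Keras and MirroredStrategy for multi-GPU training; the checkpoint, hyperparameters, and train_dataset below are placeholders, not our exact configuration:

```python
# Illustrative multi-GPU fine-tuning setup (placeholder data and hyperparameters).
import tensorflow as tf
from transformers import TFBertForQuestionAnswering

strategy = tf.distribute.MirroredStrategy()  # spreads each batch across the GPUs

with strategy.scope():
    model = TFBertForQuestionAnswering.from_pretrained("bert-base-uncased")
    optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5)
    # Recent transformers versions compute the QA loss internally when
    # start_positions / end_positions are included in the dataset.
    model.compile(optimizer=optimizer)

# `train_dataset` is a placeholder: a tf.data.Dataset of SQuAD-style features
# (input_ids, attention_mask, token_type_ids, start_positions, end_positions).
# model.fit(train_dataset, epochs=2)
```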


Explainability:


The results were pretty good, but for the ones that were incorrect, we wanted to better understand why the model gave the responses it did. We were able to visualize gradients in the trained DNN to infer the relationship between inputs and outputs. The gradient quantifies how much a change in each input dimension would change the predictions in a small neighborhood around the input.
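One way to compute such gradient-based saliency, sketched here with PyTorch and a base BERT checkpoint (in practice you would load your own fine-tuned QA model), is to backpropagate the predicted span score to the input embeddings and take the per-token gradient norm:

```python
# Saliency sketch: gradient of the predicted span score w.r.t. the input
# embeddings, reduced to one score per token. Checkpoint and example are
# illustrative, not our production model.
import torch
from transformers import BertTokenizerFast, BertForQuestionAnswering

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForQuestionAnswering.from_pretrained("bert-base-uncased")
model.eval()

enc = tokenizer("How many employees are represented by labor unions?",
                "At December 31, 2018, we had 4,525 employees. None of our "
                "employees are represented by a labor union.",
                return_tensors="pt")

# Look up word embeddings for the input tokens and track gradients on them.
inputs_embeds = model.bert.embeddings.word_embeddings(enc["input_ids"])
inputs_embeds = inputs_embeds.detach().requires_grad_(True)

outputs = model(inputs_embeds=inputs_embeds,
                attention_mask=enc["attention_mask"],
                token_type_ids=enc["token_type_ids"])

# Score of the predicted span: best start logit plus best end logit.
score = outputs.start_logits.max() + outputs.end_logits.max()
score.backward()

# L2 norm of the gradient per token ~ how much each token influences the answer.
saliency = inputs_embeds.grad.norm(dim=-1).squeeze(0)
for token, s in zip(tokenizer.convert_ids_to_tokens(enc["input_ids"][0]),
                    saliency.tolist()):
    print(f"{token:>12}  {s:.4f}")
```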


Results:


Let’s go back to the original data points and questions.

  1. As of January 31, 2019, we had 4,650 employees worldwide, of which 1,919 were based in North America, 960 were based in Europe, and 1,771 were based in Asia-Pacific.

Request, response, and gradient explanation (screenshot)


The model correctly answered that this question is unanswerable (“is_impossible”: true) and thus paid no attention to any of the words in the context (after the “?” token in the screenshot above).


  2. At December 31, 2018, we had 4,525 employees. None of our employees are represented by a labor union or are subject to bargaining rights. We consider our relations with our employees to be positive.


Request, response, and gradient explanation (screenshot)


As per the screenshot, the model pays the most attention to the token none and selects it as the most likely answer.


  3. Currently, our manufacturing facilities in Canada employ approximately 850 workers, of which 700 are subject to five separate collective bargaining agreements. Two agreements, covering 400 employees, expire in November 2019 and June 2020.


Request, response, and gradient explanation (screenshot)


As per the screenshot, the model pays the most attention to the token 700 and selects it as the most likely answer.

Notice how we did not have to change the question in these 3 very disparate examples. Correct answers in all of them, with minimal processing! Talk about an A+ score.

In our observations, not everything is this simple (sorry robots, you can’t take over yet). The pipeline we have built identifies the relevant context to look for the answer in and tries different questions. We combine those answers with the extractions from our dependency-parsing-based and pattern-matching-based models.


Conclusion:


From our experience, the QA system has been the easiest to work with and the most intuitive. The extraction accuracy is great, and minimal postprocessing is required.


What’s next:

Training the model to extract multiple correct answers. There are some pesky cases, such as:


“In 2019, 2018 and 2017, our revenue was $19 million, $18 million and $16 million respectively.”


When asked a question like How much was the revenue?, the answer selected is the one closest to the word revenue, in this case $19 million. The challenge will be to train a system that returns all three answers and associates each with its year! Stay tuned!


© 2020 GuardX Inc. All rights reserved.