Semantic Analysis of Footnotes using Extractive Summarization


Anyone who has worked with table extraction from documents knows Footnotes (or Footers) are those pesky little things that can take the joy out of a job well done, especially when it comes to making sense of the extracted data. Footnotes can contain a substantial amount of information about the table rows, written in really "natural, free-flowing" text (good-bye semi-structured assumptions). In the case we were handling (extracting tables from company filings and making sense of them), the problem is made more challenging by the fact that there is no standardization across different companies or countries. So, we turned to the Neural Network Gods for help!

After a series of experiments, we arrived at an approach that mixes LSTM-based extractive summarization with good old pattern matching, bag of words, and SVM matching rules to reach a ROUGE-L score of 34.23, which is close to SOTA (State of the Art). We chose ROUGE-L, a statistic based on the Longest Common Subsequence (LCS), as our measure. ROUGE-L takes sentence-level structure into account and hence is the best standardized measure in our opinion.
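To make the metric concrete, here is a minimal sketch of a sentence-level ROUGE-L computation. It assumes whitespace tokenization, lowercasing, and a balanced F-measure (`beta=1.0`); these choices and the function names are ours for illustration, not necessarily the exact configuration behind the score reported above.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.0):
    """Sentence-level ROUGE-L F-score in [0, 1] (assumed beta=1.0)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of the candidate that is in the LCS
    recall = lcs / len(ref)      # share of the reference that is in the LCS
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)


if __name__ == "__main__":
    reference = "Shares issued to Z1"
    candidate = "Shares issued to Z1 and Warrants issued to Z2"
    print(round(rouge_l(candidate, reference), 4))
```

Because the LCS respects word order without requiring the words to be contiguous, a summary that keeps the right words in the right order scores well even when irrelevant words have been dropped in between.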

LSTM explained:

There are many excellent LSTM background resources, so here is only a small introduction. LSTMs are a variation of RNNs (Recurrent Neural Networks), which are meant to consider the whole context while making a decision about the current word in the sentence. Plain RNNs suffer from two problems, vanishing gradients and exploding gradients, which make them impractical for the sentence interpretation problem. The LSTM (Long Short-Term Memory) was invented to address this by explicitly introducing a memory unit, called the cell, into the network. With the help of the Forget and Remember gates built into the cell, an LSTM makes sure that no single rogue feature dominates the prediction and that the contribution of previous cells is not lost.
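For readers who prefer code, here is a toy sketch of a single LSTM step with scalar inputs and states. It follows the standard textbook formulation (where the "Remember" gate is usually called the input gate); the parameter names in `p` are hypothetical, and a real layer would use weight matrices instead of scalars.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell; p maps parameter names to floats."""
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input ("remember") gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate memory
    c = f * c_prev + i * g    # keep part of the old cell, add part of the new
    h = o * math.tanh(c)      # expose a gated view of the cell
    return h, c


if __name__ == "__main__":
    # Forget gate wide open, input gate shut: the cell carries its memory forward.
    params = {k: 0.0 for k in
              ("wf", "uf", "wi", "ui", "wo", "uo", "wg", "ug", "bo", "bg")}
    params["bf"] = 20.0
    params["bi"] = -20.0
    h, c = lstm_step(x=5.0, h_prev=0.0, c_prev=0.7, p=params)
    print(h, c)
```

The additive update `c = f * c_prev + i * g` is the part that lets information survive across many steps, since the old cell value is scaled by a gate rather than squashed through a nonlinearity at every step.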

Text Summarization can be of two types:

1. Extractive Summarization: This approach selects passages from the source text and then arranges them to form a summary. One way of thinking about this is a highlighter marking the important sections. The main idea is that the summarized text is a sub-portion of the source text.

2. Abstractive Summarization: In contrast, the abstractive approach involves understanding the intent and writing the summary in your own words. I think of this as analogous to an overview section.

We dealt with extractive summarization since we wanted to remove the noise in the Footnotes first and then try to make sense of it.

Our Approach at a high level was a continuous learning loop of:


The length of footnotes ranged from 30 characters to 6300 characters (did I mention they were wild?) with an average length of around 3000 characters and a distribution skewed to the left. Just removing the stop words, etc., was not sufficient, as some parts of the footnotes were completely irrelevant to the problem we were trying to solve, in spite of containing words somewhat similar to those in the relevant parts. We can't release the actual data used, but here's an example sentence, with Green parts relevant and Red irrelevant.

The offering consisted of (i) Options issued to X under Y agreement and (ii) Shares issued to Z1 and Warrants issued to Z2 which is a subsidiary of Z3. (iii) Shares issued to a subsidiary of Z4.

We used the Foreseer platform's data annotation tool to get the annotations done, receiving the relevant and irrelevant spans of data, ready to be loaded into a Pandas dataframe.
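As an illustration of how annotated spans can become training labels, here is a minimal sketch that turns character spans into per-word include/exclude labels. The `(start, end)` span format, the function name, and the span chosen in the example are assumptions made for this sketch; the annotation tool's actual export format is not shown here.

```python
def label_words(text, relevant_spans):
    """Return (word, label) pairs; label is 1 when the word lies fully
    inside one of the (start, end) character spans, else 0."""
    labelled = []
    pos = 0
    for word in text.split():
        start = text.index(word, pos)  # locate this occurrence of the word
        end = start + len(word)
        pos = end
        keep = any(s <= start and end <= e for s, e in relevant_spans)
        labelled.append((word, int(keep)))
    return labelled


if __name__ == "__main__":
    text = "Shares issued to Z1 and Warrants issued to Z2"
    spans = [(0, 19)]  # hypothetical annotation covering "Shares issued to Z1"
    for word, label in label_words(text, spans):
        print(label, word)
```

The resulting rows (one per word, with its label) are the kind of records that can be handed to a dataframe constructor for further processing.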


An LSTM, or any encoder-decoder model, expects inputs of a certain maximum length. We were dealing with a highly non-uniform dataset, with 40% of the sentences being less than 500 characters, 32% between 500 and 2000, and 28% over 2000. A common technique is to set the sequence length to the maximum length in the training set (~6300 characters) and pad the shorter sentences to that length. However, in our case, that would mean 40% of the sentences (those under 500 characters) would each consist of at least 92% <pad> characters! Not ideal: intuitively, a model can learn to predict everything as <pad> and still get high accuracy.

Original: I am a warrant.
Padded: I am a warrant <pad><pad><pad><pad><pad><pad>

To overcome this, instead of padding and letting the <pad> token dominate the set, we concatenated the input sentence to itself until it reached the required length.

Original: I am a warrant.
Repeated: I am a warrant I am a warrant I am
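The two strategies can be sketched side by side. This is a minimal illustration that works on word tokens (the lengths quoted above are in characters); the function names are ours, and inputs longer than the target are simply truncated here, which is an assumption of the sketch.

```python
def pad_to_length(words, target_len, pad_token="<pad>"):
    """Conventional padding: fill the tail with a pad token."""
    return (words + [pad_token] * target_len)[:target_len]


def repeat_to_length(words, target_len):
    """Repeat the sentence after itself until target_len tokens are filled."""
    if not words:
        raise ValueError("cannot repeat an empty sentence")
    repeats = -(-target_len // len(words))  # ceiling division
    return (words * repeats)[:target_len]


if __name__ == "__main__":
    sentence = "I am a warrant".split()
    print(" ".join(pad_to_length(sentence, 10)))
    print(" ".join(repeat_to_length(sentence, 10)))
    # Share of padding when a 500-character input is padded to 6300 characters.
    print(round(1 - 500 / 6300, 3))
```

With repetition, every position in the input carries a real token, so the label distribution the model sees is no longer dominated by a single filler class.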

Approach Selected

Summarization problems in the real world today are dominated by abstractive summarization, with only a few approaches, like Pointer Networks, that try to combine both. We focus on extracting report data, hence abstractive summarization does not work for us. We built a comprehensive framework that applies to natural language extractive summarization across multiple use cases. The framework is fairly generic: an end-to-end solution combining annotation, extraction, reinforcement learning, and the flexibility to create a dynamic ensemble solution per use case on the fly. We have baked the framework into Foreseer, our flagship product.

We experimented with 2 approaches for include/exclude summarization based on LSTMs:

1. Sentence level: Consider the average of the word embeddings and the relationship between different sentences (in both directions) to decide whether a sentence needs to be included.

2. Word level: Same approach, but at the word level.

Working with a training set of ~7000 sentences and a held-out set of 1300 sentences, we found that the word-level approach significantly outperforms the sentence-level one. We used MLflow, integrated into the Foreseer platform, to keep track of the experiments, and Talos for hyperparameter optimization.
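To show how the two granularities differ in practice, here is a small sketch of the pieces around the model: a sentence vector built by averaging word embeddings, and the decoding of word-level include/exclude predictions into an extractive summary. The toy embedding table, the 0.5 threshold, and both function names are hypothetical, and the LSTM that would produce the probabilities is deliberately left out.

```python
def average_embedding(words, embeddings, dim):
    """Sentence-level feature: mean of the known word vectors."""
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def extract_summary(words, keep_probs, threshold=0.5):
    """Word-level decoding: keep the words whose predicted probability of
    being relevant reaches the threshold, in their original order."""
    if len(words) != len(keep_probs):
        raise ValueError("need exactly one probability per word")
    return " ".join(w for w, p in zip(words, keep_probs) if p >= threshold)


if __name__ == "__main__":
    toy_embeddings = {"shares": [1.0, 0.0], "issued": [0.0, 1.0]}  # made up
    print(average_embedding(["shares", "issued"], toy_embeddings, dim=2))

    words = "Shares issued to Z1 which is a subsidiary of Z3".split()
    probs = [0.9, 0.8, 0.7, 0.9, 0.2, 0.1, 0.1, 0.3, 0.2, 0.4]  # made up
    print(extract_summary(words, probs))
```

The word-level variant can drop an irrelevant clause from inside a sentence, which the sentence-level variant cannot do because it includes or excludes whole sentences.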

The subsequent summaries were passed through a multiclass classifier and a NER model to arrive at the meaning of the sentence. The main problem we faced, as it turned out, was the summarization (or denoising).


Testing against the summaries provided, we achieved a ROUGE-L score of 35.23 and an overall accuracy of 87% on the whole Footnote analysis task. Our observation is that our hack of repeating the sentence increased the summarization quality significantly, but other approaches might yield better results, such as reversing the word order in the repeated sentence to reduce the causality between the start and end of the sentences. Next, we will experiment with Transformers, which are made amazingly easy to work with by the awesome folks at Hugging Face! Stay Tuned!