An estimate indicates that 1.7 MB of data is generated every second for every person on the planet. By some estimates, 90% of the data that exists today was created in the last two years. In an age where data is produced at incredible speed and volume, how do we generate value from data efficiently?
More often than not, the biggest problem to solve is not our ability to analyze large volumes of data. The biggest challenge lies in our ability to collect or extract relevant data from all the noise out there.
If I am a Wall Street analyst who wants to analyze a stock or a company, I’ll need, among other things, access to the company’s financial data. How do I get my hands on that data?
(a) I download financial reports of the companies I am interested in and I manually collect the data from those reports myself. However, this will be an atrocious waste of my time and I wouldn’t want to do it.
(b) I subscribe to the financials data product of a data company and the data I need is available for me to download in just a few clicks.
Well, now, how does the data company get its hands on the data?
Almost always, data companies collect data from the financial reports published by the companies themselves. Most data companies engage large teams of people to collect data manually from these reports for as many companies as they can. Some use machines to collect part of the data systematically, with humans painstakingly collecting the rest by hand.
Can extraction of data from, let’s say, financial documents be nearly fully automated? Yes. In order to assess the feasibility of automating data extraction, it’s important to understand the nature of data and its reporting patterns.
Broadly, based on structure, presentation, reporting consistency and semantic patterns, data falls into three categories: structured, unstructured and semi-structured.
Structured data usually refers to data that is presented in a rigid, consistent style of reporting and presentation. As an example, precipitation data published on the National Oceanic and Atmospheric Administration’s website is structured data.
Machines can be programmed to extract structured data easily because the data follows a highly consistent structure and style of presentation.
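To make this concrete, here is a minimal Python sketch of extracting structured data. It assumes a hypothetical CSV layout with station, date and precipitation_mm columns; the sample rows are invented for illustration and are not actual NOAA data.

```python
import csv
import io

# Hypothetical sample in a fixed, consistent layout (not real NOAA data).
SAMPLE = """station,date,precipitation_mm
STN001,2020-01-01,0.0
STN001,2020-01-02,12.4
STN002,2020-01-01,3.1
"""


def total_precipitation_by_station(text):
    """Sum the precipitation column per station from CSV text."""
    totals = {}
    for row in csv.DictReader(io.StringIO(text)):
        station = row["station"]
        totals[station] = totals.get(station, 0.0) + float(row["precipitation_mm"])
    return totals


totals = total_precipitation_by_station(SAMPLE)
print(totals)
```

Because every row follows the same layout, a few lines of parsing are enough, and the same code keeps working for as long as the publisher keeps the format unchanged.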
Unstructured data refers to data that is text-heavy, fluid, conversational or descriptive and highly inconsistent in its style of presentation. The press release below reporting Microsoft declaring dividends to its shareholders is a simple case of unstructured data.
Extracting data from unstructured press releases such as the one above is not all that simple. It requires a combination of sound machine learning and natural language processing models that are trained to recognize and extract relevant information from unstructured documents.
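To see why simple rules fall short, consider this rough Python sketch of a hand-written pattern for pulling a dividend amount out of a sentence. It is a deliberately naive rule-based baseline, not a machine learning or NLP model, and both sentences are hypothetical, written only for illustration.

```python
import re

# Hypothetical sentence written for illustration; not a real press release.
TEXT = (
    "Example Corp. today announced that its board of directors declared "
    "a quarterly dividend of $0.51 per share, payable on March 12, 2020."
)

# The same fact phrased differently (also hypothetical).
REPHRASED = "The board approved a payout of 51 cents for each share."

# One hand-written rule: "dividend of $<amount> per share".
DIVIDEND_PATTERN = re.compile(
    r"dividend of \$(?P<amount>\d+(?:\.\d+)?) per share", re.IGNORECASE
)


def extract_dividend(text):
    """Return the dividend per share as a float, or None if the rule misses."""
    match = DIVIDEND_PATTERN.search(text)
    return float(match.group("amount")) if match else None


print(extract_dividend(TEXT))
print(extract_dividend(REPHRASED))
```

The rule handles the first sentence but returns nothing for the rephrased one, even though a human reads both the same way. Closing that gap is where trained language models come in.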
Semi-structured data refers to data that shares a certain semblance of structure at a higher level, but features a great degree of variety in presentation and reporting as you dive deeper. Financial reports such as the 10-Ks that public companies in the US file with the Securities and Exchange Commission (SEC), and the Annual Reports that companies elsewhere publish as PDF files, are examples of semi-structured data.
Presented above is a part of the Income Statement of Swedbank. While banks present Income Statements in more or less the same structure, with more or less the same components that apply to the banking industry, the wording of those components and their detailed sub-components may vary widely from one bank to another.
Extracting semi-structured data can be a challenge depending on the degree of diversity and inconsistency of the presentation. Leveraging machine learning models that are trained to locate and extract relevant information is a tried-and-tested approach to derive value from semi-structured data.
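As a toy illustration of the labeling problem, the Python sketch below maps differently worded line items to a small set of canonical names. It uses plain string similarity from the standard library's difflib as a simple stand-in for a trained model. The canonical names, the sample labels and the 0.6 cutoff are all assumptions chosen for illustration; they are not taken from any particular bank's report.

```python
import difflib

# Hypothetical canonical line items for a bank income statement.
CANONICAL = [
    "net interest income",
    "net commission income",
    "total income",
    "total expenses",
]


def map_label(reported, cutoff=0.6):
    """Return the closest canonical line item, or None if nothing is close."""
    matches = difflib.get_close_matches(
        reported.lower(), CANONICAL, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None


# Hypothetical labels as different banks might word them.
for label in ["Net fee and commission income", "Total operating expenses", "Goodwill"]:
    print(label, "->", map_label(label))
```

String similarity copes with small wording differences but knows nothing about meaning, so it breaks down as the variety grows. That is the gap that models trained on real reports are meant to fill.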
Utopia in the data world is where all data is structured, organized and ready to be used. But that is far from reality. What consumers of data need today are efficient ways to systematically extract relevant data from large, complex documents and other kinds of data sources. In this day and age, when cars drive you around autonomously and robots perform surgical procedures on humans with little supervision, manually extracting data from documents is primitive and unnecessary. It’s beneath the technological capabilities that mankind has brought about.
Our team has spent the last two years developing an intelligent data extraction platform that enterprises are now turning to in order to systematically extract the data they need. If your business extracts data manually today, contact us to learn about our platform. We call it Foreseer, and we believe that Foreseer may be able to help you collect data for a fraction of what it costs you today. Write to us for a demo at firstname.lastname@example.org.