This post summarizes our investigation of the data labeling space and how we arrived at our final choice. We were delivering a solution on our Foreseer platform to extract data from financial documents for a Fortune 500 client. No pre-labeled data was available, and it became clear very quickly that we would have to build a fairly large labeled data set ourselves. We have used Amazon Mechanical Turk in the past, but outsourcing labeling was not an option for this client, which made us take a deeper look into the labeling choices.
Many machine learning techniques require labeled data to train models. This is akin to memorization for a human: practice makes perfect (or good enough). Your model's performance depends significantly on feature engineering and the availability of clean, labeled data. Garbage in, garbage out applies fully to ML models, and investment in data engineering and clean data sourcing usually vastly outperforms investment in complex model building.
Flavors of Labeling
Data Labeling solutions roughly fit into the following categories:
1. Crowdsourced Labeling: Usually on the cloud, but can be done on-prem if needed. These solutions can take on large projects quickly and might offer optional data validation or rudimentary project management capabilities. Most of them chase the lowest-cost human labeler, with no subject matter expertise. Amazon Mechanical Turk is an example of this kind of offering, though there are many players in this space. You get what you pay for: this option fits if the data set is large, doing it in house is not an option (or too expensive), data privacy is not a requirement, the data set needs no domain-specific knowledge and is fairly easy for a layperson to understand, and you can live with errors in the data set. Scale.ai seems to be such a solution, though that is not easy to ascertain from their website.
2. Captive Labeling Team: Can be cloud-based or on-prem. These vendors offer in-house teams that mitigate risks around data privacy and, to some extent, can field subject-matter-aware annotation teams. Lionbridge, for example, offers such services, as do many others. These tend to get pricey rather quickly, and scaling to a large data set or an elastic tagging team is not really feasible; it depends on vendor capacity. Your data labeling quality depends on the vendor and the particular team you have been given.
3. Data Labeling Tools: There is a plethora of these, with a new tool entering the market every day. If you already have a large operational workforce available to tag data, these tools can get you labeled data. The good folks at spaCy have created https://prodi.gy/, which is great for individual data scientists tagging data and ties directly into spaCy. We initially operationalized this and used it with a team of taggers who got the job done. The features typically needed are:
- Project management: create / manage / delete
- Data quality review (client audit)
- Assisted labeling (use ML for ML!)
- API for submission
Everyone makes that last one out to be a big deal; we fail to see game-changing value here. A nice-to-have, sure. API has become the new old buzzword: put an API in front of a standard service and you have an instant unicorn!
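To make the "use ML for ML" bullet concrete, here is a minimal sketch of model-assisted labeling: a model pre-labels every item, and only low-confidence predictions are routed to a human reviewer. All names here (`score_item`, `triage`, the threshold) are illustrative, not taken from any particular tool, and the scoring function is a toy stand-in for a trained model.

```python
# Model-assisted labeling sketch: auto-accept confident predictions,
# queue uncertain ones for human review. Illustrative names only.

CONFIDENCE_THRESHOLD = 0.85  # assumed cutoff; tune per task

def score_item(text):
    """Toy stand-in for a trained model: returns (label, confidence)."""
    # A real system would call something like model.predict_proba here.
    if "invoice" in text.lower():
        return "INVOICE", 0.95
    return "OTHER", 0.55

def triage(items):
    """Split items into auto-labeled and needs-human-review queues."""
    auto, review = [], []
    for text in items:
        label, conf = score_item(text)
        target = auto if conf >= CONFIDENCE_THRESHOLD else review
        target.append((text, label, conf))
    return auto, review

auto, review = triage(["Invoice #123 for Q2", "Meeting notes"])
```

The payoff is that human effort concentrates on the examples the model is least sure about, which is where corrections teach it the most.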
All of these approaches share one fundamental flaw: they are not embedded in the business solution.
What we really needed was continuous learning: training against production-quality data. So we built that and rolled it out to production (with a prayer and major apprehension). The solution was bootstrapped with a tiny set of data labeled manually in prodi.gy, and it achieved roughly 35% accuracy on text comprehension over some really difficult text. Once released to production, users manually fixed the output, and as the system started getting used live, accuracy shot up to roughly 80% within weeks, on the back of a few thousand clean data points. The system is still improving and we retune the model periodically, but by and large it is a stand-alone deployment that improves with continuous production usage.
The next step was to abstract the labeling piece out and make it part of our processing pipeline, all without adding any extra work for users. I am proud to say we have finally achieved that. The Foreseer pipeline now has data labeling built into every validation screen we produce: users simply validate the data, and the system keeps a delta of their changes and produces model-specific labeled data. We also have native support for JupyterLab, including push notifications and scheduled model-retraining triggers.
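The "delta of changes" idea can be illustrated with a small sketch, assuming extracted and validated records are both flat dictionaries (the field names below are made up for the example): diff the model's extraction against the user-validated record and keep only the fields the user touched.

```python
# Sketch: turn a validation screen's edits into labeled data by
# diffing the extracted record against the validated one.
# Field names are illustrative, not a real schema.

def label_delta(extracted, validated):
    """Return {field: (model_value, corrected_value)} for changed fields."""
    return {
        k: (extracted.get(k), validated[k])
        for k in validated
        if extracted.get(k) != validated[k]
    }

delta = label_delta(
    {"issuer": "ACME Corp", "amount": "1,000"},
    {"issuer": "ACME Corp", "amount": "10,000"},
)
# Only the corrected field becomes a new labeled example.
```

Because users were going to validate the data anyway, labels fall out of normal work at zero marginal cost.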
Armed with these key features, a complete end-to-end solution in Foreseer has three actors:
1. End users: Use the UI screens to validate extracted data and fix it where the system went wrong. This data flows downstream to end-client systems or to our UI and dashboards, with full lineage maintained for all changes.
2. System developers: Rapidly roll out end-user solutions using the Foreseer platform. They use our data ingestion, data pipeline, process flow, operational tooling, and UI framework to deliver production-quality, auto-scaling systems in days.
3. Data scientists and model developers: Build models in isolation against tagged data using Python. Support a couple of functions, e.g. train() and fit(), and we take responsibility for model versioning, model training, and production deployment.
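For the third actor, the contract might look something like the sketch below. This is not the actual Foreseer interface: the class, the `predict` method, and the `(text, label)` data shape are assumptions for illustration. The point is only that the model developer implements a narrow surface (here `fit`), and everything else, versioning, retraining, deployment, happens around it.

```python
# Hypothetical sketch of a minimal model contract: the platform calls
# fit() with labeled examples produced by the pipeline; the model code
# knows nothing about versioning or deployment. Not the real interface.

class MajorityLabelModel:
    """Trivial example model: always predicts the most common label."""

    def __init__(self):
        self.label = None

    def fit(self, examples):
        # examples: assumed list of (text, label) pairs.
        counts = {}
        for _, label in examples:
            counts[label] = counts.get(label, 0) + 1
        self.label = max(counts, key=counts.get)
        return self

    def predict(self, text):
        return self.label
```

Keeping the required surface this small is what lets data scientists work in isolation while the platform owns the operational lifecycle.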
Going live was great, sure, but what we have really gained is a repeatable process, clearly defined roles for different teams, and a robust data extraction framework for rolling out end-to-end human-in-the-loop solutions that meet the data extraction needs of diverse domains.
I hope this post has given you an introduction to data labeling and to how much more efficient a solution becomes when data labeling and model training are seamlessly embedded in production-quality systems.
Feel free to contact us at firstname.lastname@example.org if you want detailed reports on labeling providers (we did a rigorous analysis of current offerings) or if you want to see a demo of Foreseer in action.