The Inside Story ft. Igor Tonko, Junior Data Scientist on Project Legal AI.
The Inside Story blog series gives you a sneak peek into how we deliver successful projects at Hexad and shines a light on the superhumans working behind each of them. In this edition, we sat down with Igor Tonko, Junior Data Scientist at Hexad, to chat about his experience working on the Legal AI project. Let’s dive in!
Project name – Legal AI
Timeline – July 2022
Role: Junior Data Scientist, Python Developer
Biggest Learning: The project was very interesting. I learnt a lot and put my knowledge of machine learning algorithms to use. The team consists of highly skilled professionals with many years of experience, a strong coding style and solid principles.
Igor’s Process and Experience: I worked as a Data Scientist on the Legal AI project, where we developed a web application for the automatic evaluation of legal documents. One of the biggest challenges right from the start was reading documents in various formats, such as .doc, .xlsx, .pdf… Reading Microsoft Word and Excel files is not very complicated, but PDF files have hard-to-process formatting.
After trying many different solutions, we chose an external PDF reader combined with a computer vision model for reading tables.
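To illustrate the idea of handling each format with its own reader, here is a minimal sketch that picks a reading strategy from the file extension. The reader labels and the function name pick_reader are hypothetical; the interview does not name the libraries the team actually used.

```python
from pathlib import Path

# Hypothetical reader labels, one per format mentioned above.
# The real project used an external PDF reader plus a computer
# vision model for tables; the concrete tools are not named.
READERS = {
    ".doc": "word_reader",
    ".xlsx": "excel_reader",
    ".pdf": "pdf_reader_with_table_model",
}


def pick_reader(filename: str) -> str:
    """Return the label of the reader to use for a given file."""
    suffix = Path(filename).suffix.lower()  # ".PDF" and ".pdf" are treated alike
    if suffix not in READERS:
        raise ValueError(f"Unsupported document format: {filename!r}")
    return READERS[suffix]
```

Keeping the mapping in one place means a new format only needs one new entry, and unsupported files fail early with a clear error.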
Reading documents is just the first step. The next step is identifying components in the text, such as license information, copyrights, components and versions. We trained Named Entity Recognition (NER) models for this task. Identifying these parts is hard to do even manually, so training an accurate model is very challenging. To increase accuracy, the team came up with several supporting algorithms that are not AI-based. They help to identify patterns within documents and to fix inconsistencies in the model results.
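A supporting, non-AI algorithm of this kind could look like the sketch below, which corrects entity labels using simple text patterns. The label names, the two regular expressions and the function fix_labels are assumptions made for illustration; they are not the project's actual rules.

```python
import re

# Assumed patterns: a dotted version number such as "1.2.3" or "v2.0",
# and a line that starts like a copyright notice.
VERSION_RE = re.compile(r"^v?\d+(\.\d+)+$")
COPYRIGHT_RE = re.compile(r"^(copyright|\(c\))", re.IGNORECASE)


def fix_labels(entities):
    """Correct (text, label) pairs from an NER model using pattern rules."""
    fixed = []
    for text, label in entities:
        cleaned = text.strip()
        if VERSION_RE.match(cleaned):
            label = "VERSION"      # the pattern overrides the model's label
        elif COPYRIGHT_RE.match(cleaned):
            label = "COPYRIGHT"
        fixed.append((cleaned, label))
    return fixed
```

For example, an entity "1.2.3" that the model tagged as a component would be relabelled as a version, while an entity such as "MIT" keeps the label the model gave it.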
The next step is the evaluation of the documents. This is the part in which I was involved the most. Customers have a long list of requirements that the application needs to check. Some of the checks are simple algorithms, while others rely on support from language processing models.
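The simple, rule-based kind of check can be sketched as a list of small functions applied to every extracted component. The Component fields and the two requirements below are hypothetical examples, not the customer's real list.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """One extracted component (fields are assumed for illustration)."""
    name: str
    version: str
    license: str


def has_license(component: Component) -> bool:
    return bool(component.license.strip())


def has_version(component: Component) -> bool:
    return bool(component.version.strip())


# Hypothetical requirements: a title plus the check that implements it.
REQUIREMENTS = {
    "license is stated": has_license,
    "version is stated": has_version,
}


def evaluate(components):
    """Return (component name, failed requirement) pairs."""
    findings = []
    for component in components:
        for title, check in REQUIREMENTS.items():
            if not check(component):
                findings.append((component.name, title))
    return findings
```

Because each requirement is just a function, a check backed by a language model could be added to the same table without changing the evaluation loop.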
The last step is reporting the result. The application provides a .doc file with a report, along with several files that represent the outputs of the individual pipeline steps.