Legal data come in different formats and may be structured, semi-structured, or unstructured; they are too heterogeneous to be directly usable by scientists and must therefore be blended, transformed, and cleaned before new scientific insights can be generated.
Court legal databases primarily contain a large collection of unstructured data (PDF documents).
These documents describe legal cases, citing the relevant legislation and the legal reasoning behind the decisions; they form the backbone of our analytical pipeline, on which the predictive models will be based.
The ETL module prepares the data for analysis by addressing three key harmonization issues: i) minimizing optical character recognition (OCR) errors (mostly layout errors), ii) harmonizing the data structure, and iii) cleaning the data.
First, the ETL module extracts the text from the PDF files using an OCR engine, then standardizes it into the flexible LS-JSON data structure, and finally loads it into the Data Lake.
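As a rough illustration, this extract-transform-load sequence could be sketched as follows; pdf2image and pytesseract stand in for the OCR engine, and both the LS-JSON construction and the Data Lake loader are hypothetical placeholders rather than the module's actual components.

```python
import json

from pdf2image import convert_from_path
import pytesseract

def extract_document(pdf_path: str) -> str:
    """Run OCR page by page and return the raw extracted text."""
    pages = convert_from_path(pdf_path)  # rasterize each PDF page
    return "\n".join(pytesseract.image_to_string(page) for page in pages)

def transform_to_ls_json(raw_text: str, metadata: dict) -> dict:
    """Wrap the extracted text and the original metadata in an
    LS-JSON-like dictionary (structure assumed for illustration)."""
    return {"metadata": metadata, "text": raw_text, "sentences": []}

def load_into_data_lake(document: dict, out_path: str) -> None:
    """Persist the document as JSON; a real loader would write to the Data Lake."""
    with open(out_path, "w", encoding="utf-8") as fh:
        json.dump(document, fh, ensure_ascii=False, indent=2)
```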
LS-JSON is a flexible, JSON-based data format that can capture all aspects of a legal document, including entity labels and relationships across different portions of the text. It has a hierarchical structure and includes both the textual content of the legal document and the original metadata.
The ETL module creates LS-JSON documents by parsing the text and breaking it down into sentences, with which additional metadata can be associated.
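The exact schema of LS-JSON is not reproduced here, but a minimal document consistent with the description above might look like the following Python dictionary; every field name (doc_id, sentences, labels, relations, alterations) is an illustrative assumption, and only the hierarchical layout follows the text.

```python
# Hypothetical LS-JSON document, sketched as a Python dictionary:
# document-level metadata plus per-sentence text, entity labels,
# and relations across portions of the text.
ls_json_doc = {
    "doc_id": "case-0042",                     # assumed identifier
    "metadata": {"court": "...", "date": "...", "source_pdf": "..."},
    "sentences": [
        {
            "id": "s1",
            "text": "The appellant relies on Article 6 ...",
            "labels": [{"span": [24, 33], "type": "LEGISLATION"}],
            "relations": [{"from": "s1", "to": "s3", "type": "CITES"}],
            "alterations": [],                 # filled during preprocessing (see below)
        }
    ],
}
```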
In many cases, preprocessing steps increase the performance of NLP models. Legal text often contains misspellings and domain-specific abbreviations that introduce noise into the model-training phase.
The text-preprocessing phase of the ETL module applies a variety of techniques to convert raw text into standardized sequences (e.g., expanding abbreviations, automated spelling detection and correction, layout-error correction, and special-character removal) in order to improve the models' understanding of the text.
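A minimal sketch of such a standardization pass is shown below; the abbreviation table and the use of pyspellchecker for spelling correction are illustrative assumptions, not the module's actual implementation.

```python
import re

from spellchecker import SpellChecker  # pyspellchecker, an illustrative choice

# Assumed abbreviation table; the real module would use a domain lexicon.
ABBREVIATIONS = {"art.": "article", "sec.": "section"}
spell = SpellChecker()

def standardize(sentence: str) -> str:
    """Convert one raw sentence into a standardized sequence."""
    # Expand domain-specific abbreviations.
    for abbr, full in ABBREVIATIONS.items():
        sentence = re.sub(rf"\b{re.escape(abbr)}", full, sentence,
                          flags=re.IGNORECASE)
    # Remove special characters left over from OCR.
    sentence = re.sub(r"[^\w\s.,;:()'-]", " ", sentence)
    # Detect and correct likely misspellings, token by token.
    fixed = []
    for token in sentence.split():
        if token.isalpha() and token.lower() in spell.unknown([token]):
            fixed.append(spell.correction(token) or token)
        else:
            fixed.append(token)
    return " ".join(fixed)
```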
Finally, the high flexibility of the LS-JSON format makes it possible to track all textual alterations introduced by the ETL module, so that the original data can always be reproduced and the quality of the preprocessing phase can be assessed.
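Under the hypothetical LS-JSON layout sketched earlier, such alteration tracking could work as follows; the alterations log and both helper functions are assumptions made for illustration.

```python
def apply_alteration(sentence: dict, op: str, new_text: str) -> None:
    """Record the edit before overwriting the text, so the preprocessing
    history stays attached to the sentence (hypothetical log format)."""
    sentence["alterations"].append(
        {"op": op, "before": sentence["text"], "after": new_text}
    )
    sentence["text"] = new_text

def original_text(sentence: dict) -> str:
    """Recover the raw text as it was before any preprocessing."""
    log = sentence["alterations"]
    return log[0]["before"] if log else sentence["text"]
```

For example, apply_alteration(sent, "spelling", standardize(sent["text"])) would standardize a sentence while keeping its pre-edit form in the log; because each entry stores the text before and after the edit, the untouched OCR output can always be recovered for quality assessment.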