Descriptive statistics are another important part of the Exploratory Data Analysis (EDA) phase for text data, because they help to understand the data clearly before using them to train a Machine Learning (ML) model. One of the simplest statistics is the average document length, calculated by summing the number of words in each document and dividing that sum by the total number of documents in the database. This statistic helps the data scientist choose suitable tools and techniques for building the ML model and improving its performance, since different techniques suit documents of different lengths.
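The calculation described above can be sketched in a few lines of Python. This is only a minimal illustration: it assumes that a "word" is any whitespace-separated token, and the function names and the toy corpus are hypothetical, not taken from the database.

```python
def word_count(text: str) -> int:
    """Count words by splitting on whitespace (an assumed, simple tokenisation)."""
    return len(text.split())


def average_length(documents: list[str]) -> float:
    """Total number of words divided by the number of documents."""
    if not documents:
        raise ValueError("cannot average over an empty collection")
    total_words = sum(word_count(doc) for doc in documents)
    return total_words / len(documents)


# Hypothetical toy corpus, for illustration only.
sample_docs = [
    "Il tribunale accoglie il ricorso",      # 5 words
    "Decreto di fissazione udienza",         # 4 words
    "La corte rigetta la domanda proposta",  # 6 words
]
print(average_length(sample_docs))  # (5 + 4 + 6) / 3 = 5.0
```

A different tokeniser (for example one that strips punctuation or handles apostrophes) would give slightly different counts, so the reported average depends on that choice.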
A set of bar charts is used to visualise the average length of the legal documents in our Legal Database of Genoa’s Trial Court, by type of legal act and by the subject to which the documents refer. The bar chart below displays the average length of documents grouped by type (horizontal bars) and also reports the overall average length of the legal acts. Over the whole period that the database covers, from 2008 to 2019, the average length is around 588 words per document. To display the average length for a different period, select or type in the years of interest in the data slicer. As expected, the Judgments (‘S’, “Sentenze”) are the longest documents, followed by the Ordinances (‘O’, “Ordinanze”) and by the Decrees (‘D’, “Decreti”), which are the shortest type of document in the database.
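The aggregation behind a chart of this kind, with a year filter playing the role of the data slicer, might look roughly like the sketch below. The record layout, the function name `average_length_by_type` and all the figures are hypothetical and chosen only to make the example easy to follow; they are not values from the database.

```python
from collections import defaultdict

# Hypothetical records: (year, type code, number of words).
type_records = [
    (2008, "S", 900),
    (2010, "S", 1100),
    (2012, "O", 400),
    (2015, "O", 600),
    (2019, "D", 100),
    (2019, "D", 200),
]


def average_length_by_type(rows, first_year=2008, last_year=2019):
    """Average word count per document type, restricted to a range of years."""
    lengths = defaultdict(list)
    for year, doc_type, n_words in rows:
        if first_year <= year <= last_year:  # mimics the year slicer
            lengths[doc_type].append(n_words)
    return {doc_type: sum(v) / len(v) for doc_type, v in lengths.items()}


# Whole period, then a narrower selection of years.
print(average_length_by_type(type_records))
print(average_length_by_type(type_records, first_year=2012, last_year=2019))
```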
The bar graph below introduces a further division: it depicts the average length of documents by type in the two grades of justice. Over the whole period considered, the Judgments of the II° grade of justice are on average slightly longer than those of the I° grade. However, II° grade documents make up only 1% of our database, so their average may not be representative.
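One way to keep this caveat visible is to report the number of documents and their share of the database next to each average. The sketch below assumes a hypothetical record layout and invented figures (99 I° grade Judgments and a single II° grade one); the function name `summarise_by_type_and_grade` is likewise illustrative.

```python
from collections import defaultdict

# Hypothetical records: (type code, grade of justice, number of words).
graded_records = [("S", "I", 1000)] * 99 + [("S", "II", 1200)]


def summarise_by_type_and_grade(rows):
    """Mean length, document count and share of the total for each (type, grade)."""
    groups = defaultdict(list)
    for doc_type, grade, n_words in rows:
        groups[(doc_type, grade)].append(n_words)
    total = len(rows)
    return {
        key: {
            "mean": sum(lengths) / len(lengths),
            "count": len(lengths),
            "share": len(lengths) / total,
        }
        for key, lengths in groups.items()
    }


summary = summarise_by_type_and_grade(graded_records)
for key, stats in summary.items():
    print(key, stats)
```

In this toy data the II° grade average rests on a single document, which is exactly the situation in which a mean on its own can mislead.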
The chart below illustrates the average length of the documents by subject; please refer to the previous article, Exploring Trial Courts Legal Databases, for more on how our database classifies documents by subject and by type. Over the period covered by the database, the longest Judgments on average are those that refer to “Persone e società” (“Corporate Law”), whereas the longest Ordinances are those that refer to “Diritti della persona” (“Individual rights”). Use the data slicer ‘Type of Docs’ to switch between document types and display the corresponding average lengths.
To be continued.