LawStats – Large-Scale German Court Decision Evaluation Using Web Service Classifiers



Introduction
Legal professionals have become accustomed to using digital media and tools in their practice, and Natural Language Processing (NLP) and Information Extraction (IE) offer potential benefits for many domains. However, their application to the legal domain remains very limited to date. The legal profession needs exact and correct decisions. Thus, many struggle to accept Machine Learning (ML) techniques with a reported performance below 100%. In digital systems, rule-based IE is still dominant as it offers high precision. But while rule-based systems give detailed insights into a document collection, a meaningful understanding at the document level is hardly achievable without ML methods. In addition, law is traditionally regarded as a normative and consensus-based science, and only recently have quantitative analysis and empirical methodology become popular [5]. An emerging school of thought classifies law as a complex adaptive system [13] and therefore deems technology absolutely necessary in order to tackle this complexity [14].
The project LawStats is the result of a collaboration between the Language Technology group at the University of Hamburg and the Bucerius Law School.
Combining an entity extraction model trained by law students using the IBM Watson Knowledge Studio and a tool for Aspect-Based Sentiment Analysis (ABSA) [15], the LawStats application analyzes court decisions and offers a faceted search interface to aid law practitioners. Users can explore a database of court decisions from the Bundesgerichtshof (BGH), the Federal Court of Justice. The web application offers facets for searching by judges, senates and lower courts, such as higher regional courts or district courts, as well as by period of time. The user can sort and search in all categories to look up information about court decisions and their components in a database that currently contains more than 50,000 court decisions. Users can upload and analyze additional court decision files to enlarge the database and test the application's analytical performance.

Related Work
The application of NLP tools and analysis to legal problems is a rather young area of research in Germany. In the U.S., empirical and NLP-based analysis of court decisions has led to impressive results such as predictive modeling of Supreme Court decisions [8]. In Germany, however, analysis of court decisions has so far been limited to special jurisdictions, albeit with impressive results where ML techniques were used [20]. Our procedure is not aimed at predicting court decisions and thereby differs from Waltl's approach [20]. We also do not use any pre-existing metadata but extract all entities from the document text using our ML model, and the outcome classification is based solely on a text classifier. In these regards, our approach differs substantially from previous academic ventures in both method and sheer size of the corpus.
Corpus-linguistic approaches to law studies exist as well [19], most notably the JuReKo corpus [4], and enable statistical analysis and evaluation [9]. These works are related to ours in that they use NLP techniques to analyze court decisions. They differ substantially from our work, however, because ML techniques have not played an important role in them so far.
IE on document collections is often performed in journalism. Journalists search for Named Entities (NEs) and their relations in a corpus. Faceted search [16] is then used instead of simple keyword search, as it is more effective for professionals. Even though these frameworks offer impressive visualizations [22,3], they cannot be used for document classification, as this would require training for particular domains. With a fixed legal domain and expert annotators, we are able to perform polarity classification as well. Since the task is similar to Sentiment Analysis (SA) [2,11], we utilize a system originally developed for SA and re-train it on our dataset annotations.
The presented system supports a Human in the Loop (HiL) working style, which is required for domains with 1) a lot of textual data and 2) the need for explainable ML classifications [6]. Professionals explore pre-annotated data using a faceted search interface and can add annotations. For example, HiL is employed in the biomedical domain [21], which needs entity-centric access (bottom-up). In our use case, we concentrate on revision outcomes (top-down classification), which can be explored through different meta information.

Pre-processing
We perform two pre-processing steps to enhance the quality of the training data, which translates into better performance of the resulting ML models. In a normalization step, we replace abbreviations and inconsistently formatted expressions with a standardized form to reduce sparsity in the model. This step is necessary because annotator time is limited and we are striving for high recall in IE. Note that we apply the pre-processing to the annotation set as well as to every other document that later enters the system to ensure consistency.
The second pre-processing step replaces all dots that are not full stops. Since the Watson Knowledge Studio (WKS) has difficulties with German sentence splitting and over-segments documents at abbreviation dots (such as bzw.), we use our own sentence splitter and replace all non-sentence-end dots with underscores. This is necessary because our setup relies heavily on the notion of a sentence and annotation in WKS is currently only possible within sentence boundaries.
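The dot replacement can be illustrated with a minimal sketch. It is not the sentence splitter used in LawStats: the abbreviation list, the function names and the two heuristics (a dot directly followed by a non-space character, and a dot belonging to a listed abbreviation) are assumptions made for illustration only.

```python
import re

# Hypothetical abbreviation list; the list used in the real system is not given.
ABBREVIATIONS = ("bzw.", "Abs.", "Nr.", "vgl.")


def mask_non_final_dots(text):
    """Replace dots that do not end a sentence with underscores."""
    # Heuristic 1: a dot directly followed by a non-space character
    # (e.g. inside a date such as 12.03.2015) does not end a sentence.
    text = re.sub(r"\.(?=\S)", "_", text)
    # Heuristic 2: dots that belong to a known abbreviation.
    for abbr in ABBREVIATIONS:
        text = re.sub(r"(?<!\w)" + re.escape(abbr), abbr.replace(".", "_"), text)
    return text


def split_sentences(text):
    """Mask non-final dots, then split on the remaining full stops."""
    masked = mask_non_final_dots(text)
    return [s.strip() for s in re.split(r"(?<=\.)\s+", masked) if s.strip()]


example = ("Die Revision des Klägers bzw. der Klägerin gegen das Urteil "
           "vom 12.03.2015 wird zurückgewiesen. Der Kläger trägt die Kosten.")
sentences = split_sentences(example)
```

After masking, only genuine sentence-final dots remain, so a downstream tool that splits on dots sees the intended sentence boundaries.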

Information Extraction
We extract and store the information from the BGH decisions. First, we analyze the document to determine the decision outcome, i.e. whether the revision was successful or rejected. Here, we make use of the particular structure of court decisions, as the operative provisions of a decision are typically placed at the beginning of the document. To determine the outcome, we classify the first ten sentences of a decision and use the one with the highest confidence score as the indicator.
Additionally, we extract the entities Gericht (court), Richter (judge), Aktenzeichen (docket number) and dates from the text. These are sorted to determine

System Architecture
Overview
The architecture of the application (Figure 1) consists of a front-end website and a back-end web server using Spring Boot and Spring Data. Document storage is performed by an Apache Solr instance. For text analysis, we use a Java API to send data to and receive data from Watson Natural Language Understanding (NLU) in order to extract relevant entities from the court decisions. The outcome of the decisions is determined by a text classifier (see Section 5.2 for details).

Data Flow
The data flow is presented in Figure 2. Once the PDF verdict document is uploaded, the document text is extracted and sentence boundaries and dates are normalized. Then, we send the document to the Watson NLU API while analyzing the verdict decision in parallel. After analysis, the verdict document is constructed from both sources and stored in the Solr index.

Annotation: Watson Knowledge Studio
We define two different entity sets to be annotated by our team of seven domain experts. The first set contains all entities listed above (see Section 3.2). As courts and judges form finite sets and docket numbers and dates follow a definable pattern, we use dictionaries and regular expressions for pre-annotation. Our annotators have to correct false positives, limit annotations to relevant entities (i.e. not all courts, but only those that were part of the procedural process) and annotate irregular mentions. Only annotations remaining after this manual step were used for training. Further, annotators identify the phrases used to indicate the outcome of the case. This task is done without pre-annotation. Our annotators could perform both tasks, correction and annotation, in a single pass. In total, 1850 court decisions were annotated. The decisions were randomly sampled from the corpus. We set the Inter Annotator Agreement (IAA) threshold at 0.8 and have 20% of all documents annotated by at least two different annotators. Before training the entity extraction model and deploying it to NLU, we remove the phrase-based outcome expressions from the training data to avoid confusing the sequential classifier.

Evaluation
We use the WKS performance tool with a training set of 1260 documents, a dev set of 414 documents and a test set of 126 documents. The results in Table 1 are generally reliable, with the exception of court names, whose recall is only 0.68. Since the document text is normalized, dates and docket numbers can be identified with a pattern feature extractor. Additionally, they mostly occur in very confined contexts, where they are preceded by a few different keywords (e.g. "Aktenzeichen"). Further investigation has shown that courts appear in two entirely different functions in the decisions: as the deciding court (our target) and in lists of courts involved in previous relevant jurisprudence. This problem could be solved by limiting the IE to particular sections of the decisions, such as the beginning or the very end.

Revision Outcome Classification
To evaluate the revision outcome of a court decision, we classify single sentences into the classes "Revisionserfolg" (revision successful), "Revisionsmisserfolg" (revision not successful) and "irrelevant". As described in Section 3.2, we take the first ten sentences of a decision, classify them independently and use the classification with the highest confidence score as the evaluation of the whole document. Here, we use an open-source text classification framework for German [15].
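The selection step can be sketched as follows. The keyword-based `toy_classifier` is a stand-in for the trained sentence classifier and is purely hypothetical. The sketch also assumes that sentences predicted as "irrelevant" are skipped when choosing the decision-bearing sentence, which the text does not state explicitly.

```python
WINDOW = 10  # number of leading sentences considered, as described in the text


def toy_classifier(sentence):
    """Hypothetical stand-in returning (label, confidence) for one sentence."""
    if "zurückgewiesen" in sentence:
        return ("Revisionsmisserfolg", 0.90)
    if "aufgehoben" in sentence:
        return ("Revisionserfolg", 0.85)
    return ("irrelevant", 0.95)


def classify_document(sentences, classify_sentence, window=WINDOW):
    """Pick the most confident non-irrelevant prediction among the first sentences.

    Returns (label, confidence, sentence_index), or None if no sentence
    in the window is predicted as relevant.
    """
    best = None
    for idx, sentence in enumerate(sentences[:window]):
        label, confidence = classify_sentence(sentence)
        if label == "irrelevant":
            continue  # assumption: irrelevant sentences cannot decide the outcome
        if best is None or confidence > best[1]:
            best = (label, confidence, idx)
    return best


decision = [
    "Der VI. Zivilsenat des Bundesgerichtshofs hat für Recht erkannt:",
    "Die Revision des Klägers wird zurückgewiesen.",
    "Der Kläger trägt die Kosten des Revisionsverfahrens.",
]
result = classify_document(decision, toy_classifier)
```

Returning the sentence index alongside the label makes it possible to show the user which sentence determined the outcome.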

Annotation and Training
Training data is obtained from the WKS annotations. We extract the annotated sentences as well as a random set of irrelevant sentences and train a multi-class SVM [1] classifier. For the feature set, we compute TF-IDF (Term Frequency Inverse Document Frequency) scores and word embeddings [12] on an in-domain revision corpus. The corpus contains all BGH court decisions available online. We build a feature vector based on the TF-IDF values and concatenate it with the averaged word vectors of a sentence. Furthermore, we induce features on the training data: we obtain a list of the 30 highest-scoring (TF-IDF) terms per label (positive, negative, irrelevant) and add the relative frequencies of these terms to the feature vector. The training data consists of 2,200 labeled sentences. We use a balanced ratio of sentences for the two classes of successful and non-successful revision and twice as many irrelevant sentences for training. For testing, we use 550 sentences.

Evaluation
A simple baseline of choosing the majority class (irrelevant) scores an F1 of 0.46. When we train the classifier on a standard out-of-domain feature set, we reach an F-score of 0.70. By pre-training the TF-IDF vectors and the word2vec model on the in-domain collection of revision decisions, we reach a score of 0.91 (see Table 2). Error analysis shows that the major factor limiting performance is the strong similarity between sentences indicating a successful and an unsuccessful revision. In most documents, the long sentences follow a rigid structure. Variation in the expression of the final outcome requires additional training data. Especially the edge cases (partially successful) show a lot of variation in the verdict.
Since we classify at the document level, we have performed a document-level evaluation of the revision outcomes to verify that our sentence extraction approach works as expected. Two expert annotators annotated 100 documents each. The possible error cases were a wrong polarity (successful/not successful) and the classifier picking an irrelevant sentence as the decision-bearing sentence. Results are presented in Table 3. With a precision of 0.87, we obtain performance comparable to that on the sentence level. Furthermore, the document selection features the same distribution as the training set (Fisher's test p < 0.0001), making it a representative sample of the complete collection.
Error analysis of the incorrectly classified documents shows an even distribution: 12 documents were wrongly classified as "not successful" versus 11 as "successful". About a quarter of the wrongly classified documents are only partly misclassified, e.g. the revision was partially successful but classified as "not successful". For training, we had added the partly successful class to the positive "successful" class. In about 2% of the documents, the wrong sentence is selected by the classifier. These sentences are often short phrases containing judge names; the classifier learned their co-occurrence in the training data. This could be alleviated by masking entities in this classification task.
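The feature construction described under Annotation and Training (TF-IDF values, averaged word vectors and relative frequencies of label-specific terms, concatenated into one vector) can be sketched in plain Python. The toy corpus, the two-dimensional embeddings and all function names are illustrative assumptions. The real system uses embeddings trained on the BGH corpus and feeds the vectors to an SVM, which is not shown here.

```python
import math
from collections import Counter


def idf_table(corpus):
    """Inverse document frequency per token over a tokenized corpus."""
    n_docs = len(corpus)
    doc_freq = Counter(tok for doc in corpus for tok in set(doc))
    return {tok: math.log(n_docs / df) for tok, df in doc_freq.items()}


def sentence_features(tokens, vocab, idf, embeddings, dim, label_terms):
    """Concatenate TF-IDF, averaged embeddings and label-term frequencies."""
    counts = Counter(tokens)
    total = len(tokens) or 1
    # Part 1: TF-IDF value for every vocabulary term.
    tfidf = [counts[t] / total * idf.get(t, 0.0) for t in vocab]
    # Part 2: average of the word vectors of all tokens with an embedding.
    known = [embeddings[t] for t in tokens if t in embeddings]
    if known:
        avg = [sum(vec[i] for vec in known) / len(known) for i in range(dim)]
    else:
        avg = [0.0] * dim
    # Part 3: relative frequency of the highest-scoring terms per label.
    rel = [counts[t] / total for t in label_terms]
    return tfidf + avg + rel


# Toy data, assumed for illustration only.
corpus = [["revision", "wird", "zurückgewiesen"], ["urteil", "wird", "aufgehoben"]]
idf = idf_table(corpus)
vocab = sorted(idf)
embeddings = {"revision": [1.0, 0.0], "zurückgewiesen": [0.0, 1.0]}
label_terms = ["zurückgewiesen", "aufgehoben"]  # 30 per label in the paper

vector = sentence_features(corpus[0], vocab, idf, embeddings, 2, label_terms)
```

The resulting vector has one slot per vocabulary term, one per embedding dimension and one per label term, so its length is fixed regardless of sentence length.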

User Interface
The publicly available web application can be divided into two main components: a web page where users can upload their locally saved revisions and inspect them, and a section to filter and examine the existing database of revisions. On the upload page, external PDF revisions can be uploaded and analyzed. After the file has been analyzed, the result is added to the database and the user is redirected to a result page.
The application allows faceted search on metadata and automatically extracted information in the document collection. The user can search for judges, senates, the corresponding "Oberlandesgericht" (higher regional court equiv.), "Landesgericht" (state court equiv.), or "Amtsgericht" (district court equiv.) decisions as well as the docket number (see Figure 3). To assess diachronic developments of revision outcomes, users can search for a timespan in which the revisions or their respective previous court decisions were decided. To enable exploratory searches and comparison of verdict decisions by facet, facets can be selected without query terms. The application then returns, for example, revision outcome statistics for all judges, courts, etc. Combinations of fields can be used here as well. The results page contains all extracted information about a decision, such as courts and judges of the verdict file. Additionally, the page contains the classified revision outcome, the confidence score, and the sentence that determined the evaluation.
While the techniques in this work are rather standard, its value lies in enabling a new field of application: substantial added value could be created with a thorough statistical analysis of factors correlating with success before the Federal Court of Justice. For this purpose, the quality of the entity extraction and the classification ought to be improved by different approaches and additional training. But even now, the data set compiled using this application can be structured and analyzed in depth by interdisciplinary teams. Both the confirmation of known influences like procedure types and the discovery of as yet unknown factors, e.g. duration of proceedings or geographical origin of the cases, would be an interesting starting point for substantial, unprecedented large-scale legal analysis.

Fig. 3. Faceted search showing revision outcomes of the Higher Regional Court (OLG) Hamburg, faceted by the deciding senate; time range 2010-2017.
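A query like the one shown in Fig. 3 could be expressed against the Solr index roughly as follows. The parameters `q`, `fq`, `rows`, `facet` and `facet.pivot` are standard Solr request parameters, but the field names `lower_court`, `decision_date`, `senate` and `outcome` are hypothetical, as the actual index schema is not described.

```python
from urllib.parse import urlencode


def build_facet_query(lower_court, start_year, end_year, facet_field="senate"):
    """Build a Solr query string counting revision outcomes per facet value."""
    date_range = (f"decision_date:[{start_year}-01-01T00:00:00Z "
                  f"TO {end_year}-12-31T23:59:59Z]")
    params = [
        ("q", "*:*"),                                # no query terms, facets only
        ("fq", f'lower_court:"{lower_court}"'),      # restrict to one lower court
        ("fq", date_range),                          # restrict to the time range
        ("rows", "0"),                               # statistics only, no documents
        ("facet", "true"),
        ("facet.pivot", f"{facet_field},outcome"),   # outcome counts per senate
    ]
    return urlencode(params)


query = build_facet_query("OLG Hamburg", 2010, 2017)
```

Setting `rows` to 0 reflects the exploratory use described in the User Interface section, where facets are selected without query terms and only outcome statistics are returned.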

Table 1. Evaluation of Entity Recognition

For NER, we use the Watson NLU API with a custom model. Internally, WKS employs the Statistical Information and Relation Extraction (SIRE) classifier for sequential annotation and extraction of entities. It works in a similar way to a standard Conditional Random Field (CRF) [10] by employing symbolic feature combinations.

Table 2. Sentence-level evaluation of revision outcomes

Table 3. Document-level evaluation of revision outcomes