Sentence Boundary Detection for Multilingual Legal Text

Introduction

Research has shown [2] that Sentence Boundary Detection (SBD) is not yet “solved”, particularly for informal language and specialized domains. In the legal domain especially, conventional sentence-splitting methods still perform poorly. Although recent research has worked on mitigating this issue for German [3], SBD remains, to the best of our knowledge, insufficiently researched for many other languages, such as French, Italian, Spanish, and Portuguese. With this project, we plan to close this gap by creating a dataset for multilingual legal SBD, evaluating the performance of current methods on it, and training a model that improves on them.
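
To illustrate why punctuation-based splitting struggles on legal text, the following minimal sketch applies a deliberately naive rule (split after ".", "!" or "?" followed by whitespace) to a citation-style sentence. The example sentence and the helper name `naive_split` are invented for illustration; they are not taken from the project's dataset or from any of the evaluated methods.

```python
import re
from typing import List

# Naive rule (an assumption for illustration): a sentence ends at '.', '!' or '?'
# whenever that character is followed by whitespace.
NAIVE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def naive_split(text: str) -> List[str]:
    """Split text at every position matched by the naive boundary rule."""
    return NAIVE_BOUNDARY.split(text)


# Invented legal-style example containing the abbreviation "Art." (Article).
text = "The claim is based on Art. 5 of the Act. The court dismissed it."

sentences = naive_split(text)
for sentence in sentences:
    print(repr(sentence))
```

The text contains two sentences, but the naive rule returns three segments because it treats the period in "Art." as a boundary. Abbreviations, numbered citations, and enumerations of this kind are common in legal writing, which is one plausible reason general-purpose splitters underperform there.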