Dr. Mihai Surdeanu, Dr. Ramesh Nallapati, and Professor Christopher Manning, all of the Natural Language Processing Group in the Stanford University Department of Computer Science, will present a paper entitled Legal Claim Identification: Information Extraction with Hierarchically Labeled Data at SPLeT 2010: The 3rd Workshop on Semantic Processing of Legal Texts, to be held 23 May 2010 in Malta. The full text of the paper appears in the conference proceedings (PDF), beginning on the page numbered 22.
Here is the abstract of the paper:
This paper introduces a novel Information Extraction problem, where only parts of documents have relevance and linguistic annotations are available only for these segments. The data is hierarchical: the top layer marks the relevant text segments and the bottom layer annotates domain-specific entity mentions, but only in the segments marked as relevant in the top layer. We investigate this problem in the legal domain, where we extract the text corresponding to litigation claims and entity mentions such as patents and laws in each claim. Because entity mentions are not labeled outside claims in training data, a top-down approach that extracts claims first and entity mentions next seems the most natural. However, we show that other models are superior. Using a simple semi-supervised approach we implement a bottom-up Conditional Random Field model; we also implement a joint hierarchical CRF using a combination of pseudo-likelihood and Gibbs sampling. We show that both these models significantly outperform the top-down approach.
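To make the two-layer labeling described in the abstract more concrete, here is a minimal Python sketch of how such data might be represented. It is an illustration only and is not taken from the paper: the `Token` class, the helper functions, the `PATENT` label name, and the example sentence are all hypothetical. The point it tries to show is that entity labels outside a claim are unknown (`None`) rather than negative (`"O"`), which is presumably why a model that learns entity mentions directly cannot simply treat unannotated text as negative examples.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Token:
    text: str
    in_claim: bool          # top layer: is this token inside a relevant (claim) segment?
    entity: Optional[str]   # bottom layer: entity label, or None where no annotation exists


def make_tokens(words: List[str],
                claim_span: Tuple[int, int],
                entities: Dict[int, str]) -> List[Token]:
    """Build hierarchically labeled tokens.

    claim_span is a half-open (start, end) token range for the claim segment.
    entities maps token index -> entity label. Entity labels are recorded only
    inside the claim; outside it the bottom layer is left as None (unannotated).
    """
    start, end = claim_span
    tokens = []
    for i, word in enumerate(words):
        inside = start <= i < end
        # Inside a claim, a token without an entity label is a true negative ("O").
        # Outside a claim, we do not know whether it is an entity or not.
        label = entities.get(i, "O") if inside else None
        tokens.append(Token(word, inside, label))
    return tokens


def tokens_with_entity_labels(tokens: List[Token]) -> List[Token]:
    """Return only the tokens whose bottom-layer label is actually annotated."""
    return [t for t in tokens if t.entity is not None]


# Hypothetical example: the claim segment covers tokens 4..11.
words = ["The", "parties", "met", ".",
         "Plaintiff", "alleges", "infringement", "of", "the", "'123", "patent", "."]
tokens = make_tokens(words, claim_span=(4, 12), entities={9: "PATENT", 10: "PATENT"})
annotated = tokens_with_entity_labels(tokens)
```

In this sketch, only the eight tokens inside the claim carry usable entity supervision; the first four tokens have a top-layer label (not part of a claim) but no bottom-layer label at all.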