Here are excerpts from the text. In the post, examples of code appear after each paragraph:
The U.S. Congress recently released a series of XML documents containing U.S. laws. The structure of these documents allows us to find which sections of the law are most commonly cited. Examining which citations occur most frequently gives us a rough picture of what Congress has spent the most time thinking about.
Citations occur for many reasons: as justification for an addition or omission in subsequent laws, or as clarifications, amendments, or repeals. As we might expect, the most commonly cited sections involve the IRS (income taxes, specifically), Social Security, and military procurement.
To arrive at this result, we must first see how the U.S. Code is laid out. The laws are divided into a hierarchy of units, which allows anything from an entire title to an individual sentence to be cited. Each of these units has an ID and an identifier. The “identifier” is used as a citation reference within the XML documents, and has a different form from the citations used by the legal community, which come in a form like “25 USC Chapter 21 § 1901”. […]
The XML hierarchy defines seventeen different levels which can be cited: ‘title’, ‘subtitle’, ‘chapter’, ‘subchapter’, ‘part’, ‘subpart’, ‘division’, ‘subdivision’, ‘article’, ‘subarticle’, ‘section’, ‘subsection’, ‘paragraph’, ‘subparagraph’, ‘clause’, ‘subclause’, and ‘item’. […]
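As a rough sketch of how these levels might be walked in practice, the snippet below lists every element at one of the seventeen levels along with its identifier. It is not the code from the post: the namespace URI, the tiny sample document, and the identifier values in it are assumptions made for illustration, and `citable_units` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Assumed namespace for the XML schema; check the root element of the real files.
USLM_NS = "http://xml.house.gov/schemas/uslm/1.0"

# The seventeen citable levels listed above.
LEVELS = [
    "title", "subtitle", "chapter", "subchapter", "part", "subpart",
    "division", "subdivision", "article", "subarticle", "section",
    "subsection", "paragraph", "subparagraph", "clause", "subclause", "item",
]

# ElementTree stores tags as "{namespace}localname", so build the qualified
# names once. Matching on the full name avoids picking up same-named elements
# from other namespaces.
QUALIFIED = {f"{{{USLM_NS}}}{name}": name for name in LEVELS}

# Hypothetical, heavily trimmed sample; the identifier values are illustrative.
SAMPLE = """<uscDoc xmlns="http://xml.house.gov/schemas/uslm/1.0">
  <main>
    <title identifier="/us/usc/t25">
      <chapter identifier="/us/usc/t25/ch21">
        <section identifier="/us/usc/t25/s1901">
          <subsection identifier="/us/usc/t25/s1901/a"/>
        </section>
      </chapter>
    </title>
  </main>
</uscDoc>"""


def citable_units(root):
    """Yield (level, identifier) for each element at a citable level."""
    for elem in root.iter():
        level = QUALIFIED.get(elem.tag)
        if level is not None:
            # get() returns None when the attribute is missing.
            yield level, elem.get("identifier")


units = list(citable_units(ET.fromstring(SAMPLE)))
for level, identifier in units:
    print(level, identifier)
```

Elements that lack an `identifier` attribute come back as `None` here, so a real run would need to decide whether to skip or keep them.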
We can use a simple XPath expression to retrieve one of these […]: […]
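The expression itself is elided in this excerpt, so the following is only a minimal sketch of the general approach, using the limited XPath support in Python's standard `xml.etree.ElementTree`. The namespace URI, the `ref` element with its `href` attribute, and the sample identifiers are all assumptions for illustration, not details confirmed by the post.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Assumed namespace; the prefix "uslm" is only a local alias used in the paths.
NS = {"uslm": "http://xml.house.gov/schemas/uslm/1.0"}

# Hypothetical sample: two sections that cite other sections via <ref href="...">.
SAMPLE = """<uscDoc xmlns="http://xml.house.gov/schemas/uslm/1.0">
  <main>
    <section identifier="/us/usc/t26/s1">
      <content>See <ref href="/us/usc/t26/s63">section 63</ref>.</content>
    </section>
    <section identifier="/us/usc/t26/s2">
      <content>As defined in <ref href="/us/usc/t26/s63">section 63</ref>
      and <ref href="/us/usc/t26/s1">section 1</ref>.</content>
    </section>
  </main>
</uscDoc>"""

root = ET.fromstring(SAMPLE)

# Retrieve every element at one level (here: section) that carries an identifier.
sections = root.findall(".//uslm:section[@identifier]", NS)
section_ids = [s.get("identifier") for s in sections]

# Tally how often each identifier is cited anywhere in the document.
citation_counts = Counter(
    ref.get("href") for ref in root.iterfind(".//uslm:ref[@href]", NS)
)

print(section_ids)
print(citation_counts.most_common(1))
```

Swapping `section` for any other level name in the path retrieves that level instead. For full XPath 1.0 support, a library such as `lxml` would be the usual choice over the standard library.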
Future work in this area will involve cleaning up the results to remove some of the “None” entries, building a visualization of the results, and training a tagger to recognize the human-readable versions of citations in court documents. In the long run, I hope these developments help make legal information more accessible to everyone, rather than being locked up in expensive databases.
For more details, please see the complete post.