The Patient Information Leaflet (PIL) Corpus

Version 2.0 (31 Mar 2006)

This directory contains the PIL corpus, derived from the PIL corpus data. It contains the documents originally processed, minus some non-PIL documents (such as product lists) and near-duplicates. This is the recommended version for research purposes. See the base corpus for the complete original corpus in a similar format. For details of the corpus package as a whole, see the PIL corpus home page.

Search the corpus

You can search the corpus online using a query interface.

Download the corpus

All corpus resources, including the search tool, are available for download: PIL-corpus-2.0.tar.gz (50 MB). To unpack the archive, you may need a Unix-like operating system such as Linux or Mac OS X. To run the search tool locally, you will need Perl and a web server.
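
The archive name ends in .tar.gz, so it is presumably a gzip-compressed tar file. As a rough sketch only, it could be unpacked with Python's standard tarfile module. The function name unpack_corpus and the destination directory are hypothetical and not part of the corpus package, and whether the unpacked tree is fully usable on a non-Unix system is not stated here.

```python
import tarfile
from pathlib import Path


def unpack_corpus(archive, dest="."):
    """Extract a .tar.gz archive into dest and return the destination path."""
    dest_path = Path(dest)
    dest_path.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions offer a filter that rejects unsafe members.
            tar.extractall(dest_path, filter="data")
        else:
            tar.extractall(dest_path)
    return dest_path


if __name__ == "__main__":
    # Assumes the downloaded archive sits in the current directory.
    unpack_corpus("PIL-corpus-2.0.tar.gz", "pil-corpus")
```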

Organisation of the corpus

The corpus is organised in the subdirectory 'data' as a three-level directory tree. The top level indicates the company which produced the leaflet, the second level indicates the name of the product, and the third level holds the actual document in various formats. The filename structure is as follows:
    data/<CompanyName>/<ProductName>/<ProductName>.<filetype>
For example, the RTF version of Glaxo's Zantac Syrup leaflet is:
    data/Glaxo/Zantac_Syrup/Zantac_Syrup.rtf
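
The path pattern above can be sketched in Python as follows. This is a minimal illustration, not part of the corpus package: the helper names leaflet_path and iter_leaflets are hypothetical, "PIL-corpus-2.0" is an assumed name for the unpacked directory, and company and product names are taken exactly as they appear on disk, with no attempt to normalise them.

```python
from pathlib import Path


def leaflet_path(corpus_root, company, product, filetype):
    """Return data/<CompanyName>/<ProductName>/<ProductName>.<filetype>.

    company and product are the directory names as they appear on disk
    (for example "Zantac_Syrup"); no name normalisation is attempted.
    """
    return Path(corpus_root) / "data" / company / product / f"{product}.{filetype}"


def iter_leaflets(corpus_root, filetype):
    """Yield (company, product, path) for every file of one type in the tree."""
    data_dir = Path(corpus_root) / "data"
    # Three levels below data/: company, product, then the document itself.
    for path in sorted(data_dir.glob(f"*/*/*.{filetype}")):
        yield path.parent.parent.name, path.parent.name, path


if __name__ == "__main__":
    # "PIL-corpus-2.0" is an assumed name for the unpacked directory.
    print(leaflet_path("PIL-corpus-2.0", "Glaxo", "Zantac_Syrup", "rtf").as_posix())
    for company, product, path in iter_leaflets("PIL-corpus-2.0", "rtf"):
        print(company, product, path)
```

Walking the tree with a glob only finds files that follow the pattern exactly, so any document whose filename departs from its product directory name would need to be located by other means.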
In addition, the corpus contains subdirectories that each hold all the documents of a given file type.
The mapping from company and product names to filenames is fairly consistent, but it is safest to consult the file lists provided with the corpus when accessing the data.

The PIL corpus was initially developed as part of the ICONOCLAST project, supported by the EPSRC (grant no L77102).


Document last modified on 06 June 2006.
Email queries or corrections to .