The Patient Information Leaflet (PIL) Corpus

Version 2.0 (31 Mar 2006)

This directory contains the PIL corpus, derived from the original PIL corpus data. It contains all the documents originally processed, except that some non-PIL documents (such as product lists) and near-duplicates have been removed. This is the recommended version for research purposes. See the base corpus for the complete original corpus in a similar format. For details of the corpus package as a whole, see the PIL corpus home page.

Search the corpus

You can search the corpus online using a query interface.

Download the corpus

All corpus resources, including the search tool, are available for download: PIL-corpus-2.0.tar.gz (50 MB). To unpack the archive, you may need a Unix-like operating system such as Linux or Mac OS X. To run the search tool locally, you will need Perl and a webserver.
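If no Unix-like tools are to hand, a gzipped tar archive can usually be unpacked with Python's standard library as well. The following is a minimal sketch, not part of the corpus package: the function name unpack_corpus is hypothetical, and it assumes the archive has already been downloaded into the current directory. It makes no assumption about the layout inside the archive and simply reports the member names it finds.

```python
import tarfile


def unpack_corpus(archive="PIL-corpus-2.0.tar.gz", dest="."):
    """Extract a gzipped tar archive into dest and return its member names.

    Hypothetical helper: the default archive name is the one given above,
    assumed to be present locally.
    """
    with tarfile.open(archive, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(dest)
    return names
```

Calling unpack_corpus() with no arguments would extract into the current directory; pass dest to unpack elsewhere.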

Organisation of the corpus

The corpus is organised in the subdirectory 'data' as a three-level directory tree. The top level indicates the company which produced the leaflet, the second level indicates the name of the product, and the third level holds the actual document in various formats. The filename structure is as follows:

    data/<CompanyName>/<ProductName>/<ProductName>.<filetype>

So, for example, the RTF version of Glaxo's Zantac Syrup is:

    data/Glaxo/Zantac_Syrup/Zantac_Syrup.rtf
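
The filename structure can be sketched in code as follows. This is an illustration only: the function name leaflet_path is hypothetical, and it assumes the company and product names are already in their on-disk form (for example 'Zantac_Syrup' rather than 'Zantac Syrup').

```python
from pathlib import PurePosixPath


def leaflet_path(company: str, product: str, filetype: str) -> str:
    """Build data/<CompanyName>/<ProductName>/<ProductName>.<filetype>."""
    # The product name appears twice: once as a directory, once as the file stem.
    return str(PurePosixPath("data") / company / product / f"{product}.{filetype}")


print(leaflet_path("Glaxo", "Zantac_Syrup", "rtf"))
# prints data/Glaxo/Zantac_Syrup/Zantac_Syrup.rtf
```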

In addition, the corpus contains the following subdirectories containing all the documents of a given file type:

doc - all documents in Microsoft Word format
html - all documents in HTML format. NB: some documents also have an associated subdirectory containing files required by the HTML version
rtf - all documents in RTF format
sgml - all documents in SGML markup (conforming to the specification in icon.dtd)

The mapping from company/product names to filenames is fairly consistent, but it is safest to refer to the lists provided in the following files to access the data:

manifest.html - a list of all the documents and links to corresponding files in all the formats (sorted by company name).
manifest2.html - a list of all the documents and links to corresponding files in all the formats (sorted by document name).
all-companies.txt - a list of all the companies, in alphabetical order.
all-docs.txt - a list of all the documents, in alphabetical order (there are no document name clashes between companies).
classified-docs.txt - a list of all the documents, sorted by company.
orig-docs.txt - a list of all the documents in the original corpus, before removal of duplicates etc. (same as all-docs.txt in the base corpus), in alphabetical order.
removed-docs.txt - a list of all the documents removed from the original corpus, in alphabetical order.
icon.dtd - the DTD developed in the ICONOCLAST project and used for the SGML markup.
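
The list files above remain the safest index to the data. As a cross-check, the 'data' tree can also be walked directly. The sketch below assumes only the three-level layout described earlier; the function name list_leaflets is hypothetical, and it does not read any of the list files, whose exact format is not described here.

```python
from pathlib import Path


def list_leaflets(corpus_root):
    """Yield (company, product, filetype) for each file in the data tree."""
    data = Path(corpus_root) / "data"
    # Third-level entries look like data/<CompanyName>/<ProductName>/<entry>.
    for path in sorted(data.glob("*/*/*")):
        # Skip the subdirectories that accompany some HTML versions.
        if path.is_file():
            company, product = path.parts[-3], path.parts[-2]
            yield company, product, path.suffix.lstrip(".")
```

For example, sorted({company for company, _, _ in list_leaflets(".")}) gives the company directories actually present, which can then be compared by hand with all-companies.txt.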

The PIL corpus was initially developed as part of the ICONOCLAST project, supported by the EPSRC (grant no L77102).

Document last modified on 06 June 2006.
Email queries or corrections to .