The Patient Information Leaflet (PIL) Corpus

Version 2.0 (31 March 2006)

This directory contains the base corpus of the PIL corpus data, that is, all the documents originally processed. It includes some files that are product lists rather than PILs, and some PILs that are virtually identical to one another. For this reason it is probably not the best version to use for research purposes - see the PIL corpus for a more suitable option. For details of the corpus package as a whole, see the PIL corpus home page.

Organisation of the corpus

The corpus is contained in the subdirectory 'data' and is organised as a three-level directory tree. The top level identifies the company that produced the leaflet, the second level identifies the product, and the third level holds the actual document in various formats. The filename structure is as follows:

    data/<CompanyName>/<ProductName>/<ProductName>.<filetype>

For example, the RTF version of Glaxo's Zantac Syrup leaflet is:

    data/Glaxo/Zantac_Syrup/Zantac_Syrup.rtf
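
The filename structure above can be sketched in Python as follows. This is only an illustration: the function name pil_path and its parameters are hypothetical and not part of the corpus package, and the company and product arguments are assumed to be the directory names exactly as they appear on disk (for instance 'Zantac_Syrup', not 'Zantac Syrup').

```python
from pathlib import PurePosixPath


def pil_path(company: str, product: str, filetype: str,
             root: str = "data") -> PurePosixPath:
    """Build data/<CompanyName>/<ProductName>/<ProductName>.<filetype>.

    'company' and 'product' must already be the on-disk directory names;
    no attempt is made to convert display names into directory names.
    """
    return PurePosixPath(root) / company / product / f"{product}.{filetype}"


# The example from the text: the rtf version of Glaxo's Zantac Syrup.
print(pil_path("Glaxo", "Zantac_Syrup", "rtf"))
# prints: data/Glaxo/Zantac_Syrup/Zantac_Syrup.rtf
```

The sketch does not check that the resulting file exists, which is one reason to prefer the manifest files when accessing the data.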

The mapping from company/product names to filenames is fairly consistent, but it is safest to refer to the lists provided in the following files to access the data:

manifest.html - a list of all the documents and links to corresponding files in all the formats (sorted by company name).
manifest2.html - a list of all the documents and links to corresponding files in all the formats (sorted by document name).
all-companies.txt - a list of all the companies, in alphabetical order.
all-docs.txt - a list of all the documents, in alphabetical order (there are no document name clashes between companies).
classified-docs.txt - a list of all the documents, sorted by company.
icon.dtd - the DTD developed in the ICONOCLAST project and used for the SGML markup.
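
As a cross-check against the lists above, the three-level tree can also be walked directly. The following is a minimal sketch, assuming only the data/<CompanyName>/<ProductName>/<file> layout described earlier; the function name index_corpus is hypothetical, and the sketch does not read or depend on the format of any of the manifest files.

```python
from collections import defaultdict
from pathlib import Path


def index_corpus(root="data"):
    """Map (company, product) -> sorted list of file extensions found.

    Only files exactly three levels below 'root' are considered,
    matching data/<CompanyName>/<ProductName>/<file>.
    """
    index = defaultdict(list)
    for f in Path(root).glob("*/*/*"):
        if f.is_file():
            company, product = f.parts[-3], f.parts[-2]
            index[(company, product)].append(f.suffix.lstrip("."))
    return {key: sorted(exts) for key, exts in index.items()}


if __name__ == "__main__":
    # Print one line per document with the formats available for it.
    for (company, product), exts in sorted(index_corpus().items()):
        print(company, product, ",".join(exts))
```

Comparing the keys of this index with the entries in all-docs.txt or classified-docs.txt is one way to spot cases where the name-to-filename mapping is not consistent.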

The PIL corpus was initially developed as part of the ICONOCLAST project, supported by the EPSRC (grant no L77102).

Document last modified on 31 March 2006 by Roger Evans.
Email queries or corrections to .