The Patient Information Leaflet (PIL) Corpus

Version 2.0 (31 March 2006)

This directory contains version 2.0 of the Patient Information Leaflet corpus, a collection of several hundred documents giving instructions to patients about their medication. The corpus was originally created from the ABPI compendium of patient information leaflets by manual scanning and conversion. Documents are available in RTF, DOC and HTML formats, and are also marked up with logical structure using a specially created SGML DTD. The corpus is organised in the following versions:

The base corpus consisting of all 595 documents originally processed;
The PIL corpus consisting of a subset of 471 documents after removal of non-PIL documents and near duplicates. (This is the corpus recommended for general use.)

The PIL corpus was initially developed as part of the ICONOCLAST project, supported by the EPSRC (grant no L77102).

Release notes

March 2006	Version 2.0 Tidied up for general release by Roger Evans
Nov 2000	Version 1.0 Initial internal release by Nadjet Bouayad-Agha

Projects using this resource

The following is a list of projects we know about that have made use of the PIL corpus. If you know of other uses not in this list, please send an email to .

2000: The ICONOCLAST project (ITRI, Brighton) originated this corpus and used it in work on constraint-based generation in different styles.
2001: The PILLS project (ITRI, Brighton; IMI Freiburg; Berlitz) used the corpus in the development of a multilingual authoring tool for patient information leaflets.
2001: The RAGS project (ITRI, Brighton) used PIL data as the target for its RICHES demonstration generator.
2000-2004: Daniel Paiva's PhD thesis work (ITRI, Brighton) used the corpus for work on stylistic control of generation.
2004: The COGENT project (ITRI, Brighton; Informatics, Sussex) is using the corpus in its work on wide-coverage generation.

Document last modified on 25 April 2007.