This directory contains the base corpus of the PIL corpus data. It
contains all the documents originally processed, including some files which
are product lists rather than PILs, and some PILs which are virtually
identical to one another. For this reason it is probably not the best version
to use for research purposes; see the PIL corpus for a more suitable
option. For details of the corpus package as a whole, see the PIL
corpus home page.
Organisation of the corpus
The corpus is contained in the subdirectory 'data' and is organised as a
three-level directory tree. The top level indicates the company which produced
the leaflet, the second level indicates the name of the product, and the third
level holds the actual document in its various formats. The filename
structure is as follows:
data/<company>/<product>/<document>.<format>
So for example, the rtf version of Glaxo's Zantac Syrup is:
data/Glaxo/Zantac_Syrup/Zantac_Syrup.rtf
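The layout described above can be sketched in Python as a small path builder. This is only an illustration under stated assumptions: the function name pil_path is hypothetical, and the rules that spaces become underscores and that the file shares its name with the product directory are inferred from the single Zantac Syrup example, not from a documented specification.

```python
from pathlib import PurePosixPath


def pil_path(company, product, fmt, root="data"):
    """Build the expected path of one document in the three-level tree.

    Assumptions (inferred from the Zantac Syrup example only):
    spaces in names are replaced by underscores, and the file name
    repeats the product directory name.
    """
    company_dir = company.strip().replace(" ", "_")
    product_dir = product.strip().replace(" ", "_")
    return PurePosixPath(root) / company_dir / product_dir / f"{product_dir}.{fmt}"


print(pil_path("Glaxo", "Zantac Syrup", "rtf"))
# prints data/Glaxo/Zantac_Syrup/Zantac_Syrup.rtf
```

Because the name mapping is only fairly consistent, a path built this way should be checked for existence before use.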
The mapping from company/product names to filenames is fairly consistent, but
it is safest to refer to the lists provided in the following files to access
the data:
manifest.html - a list of all the documents and links to corresponding
files in all the formats (sorted by company name).
manifest2.html - a list of all the documents and links to corresponding
files in all the formats (sorted by document name).
all-companies.txt - a list of all the companies, in alphabetical order.
all-docs.txt - a list of all the documents, in alphabetical order (there
are no document name clashes between companies).
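The list files remain the safest index, but the three-level tree can also be enumerated directly as a cross-check. The following Python sketch does that; the function name index_corpus is hypothetical, and it assumes only the company/product/document layout described earlier, without relying on the internal format of any of the list files.

```python
from pathlib import Path


def index_corpus(root="data"):
    """Map (company, product) to the sorted file names found beneath it.

    Returns an empty dict if the root directory does not exist.
    """
    index = {}
    root_path = Path(root)
    if not root_path.is_dir():
        return index
    # Top level: one directory per company.
    for company_dir in sorted(p for p in root_path.iterdir() if p.is_dir()):
        # Second level: one directory per product.
        for product_dir in sorted(p for p in company_dir.iterdir() if p.is_dir()):
            # Third level: the document in its various formats.
            files = sorted(f.name for f in product_dir.iterdir() if f.is_file())
            index[(company_dir.name, product_dir.name)] = files
    return index
```

Calling index_corpus("data") from this directory returns one entry per product, which can then be compared by hand against manifest.html or all-docs.txt.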