This directory contains the version of the PIL corpus derived from the base
corpus data. It contains all the documents originally processed, but with some
non-PIL documents (such as product lists) and near-duplicates removed.
This is the recommended version for research purposes. See the
base corpus for the complete original corpus in a similar
format. For details of the corpus package as a whole, see the PIL
corpus home page.
All corpus resources, including the search tool, are available for download:
PIL-corpus-2.0.tar.gz (50 MB). To unpack the archive, you may need a Unix-like
operating system such as Linux or Mac OS X. To run the search tool locally, you
will need Perl and a web server.
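Where a Unix-like shell is not to hand, the archive can probably also be unpacked with Python's standard tarfile module. The following is a minimal sketch, not part of the corpus package; the function name unpack_corpus and the destination directory are hypothetical, and it assumes the archive has already been downloaded to a local path.

```python
import tarfile
from pathlib import Path


def unpack_corpus(archive_path, dest_dir):
    """Extract a gzip-compressed tar archive into dest_dir.

    Returns the sorted names of the top-level entries now present in
    dest_dir. The function name and arguments are illustrative only.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(dest)
    return sorted(p.name for p in dest.iterdir())
```

A call such as unpack_corpus("PIL-corpus-2.0.tar.gz", "pil-corpus") would place the contents under a local pil-corpus directory.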
Organisation of the corpus
The corpus is organised in the subdirectory 'data' as a three-level directory
tree. The top level indicates the company which produced the leaflet, the
second level indicates the name of the product, the third level represents the
actual document in various formats. The filename structure is illustrated by
the following example, the RTF version of Glaxo's Zantac Syrup:
data/Glaxo/Zantac_Syrup/Zantac_Syrup.rtf
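The three-level layout can be sketched as a small path-building helper. This is only an illustration inferred from the Zantac Syrup example: it assumes that spaces in names become underscores and that the document file shares its name with the product directory, neither of which is guaranteed for every entry. The function name document_path is hypothetical.

```python
from pathlib import PurePosixPath


def document_path(company, product, fmt, root="data"):
    """Build root/<company>/<product>/<product>.<fmt>.

    Assumption (inferred from one example only): spaces are replaced by
    underscores, and the document is named after its product directory.
    """
    company_dir = company.replace(" ", "_")
    product_dir = product.replace(" ", "_")
    return PurePosixPath(root) / company_dir / product_dir / f"{product_dir}.{fmt}"


print(document_path("Glaxo", "Zantac Syrup", "rtf"))
# data/Glaxo/Zantac_Syrup/Zantac_Syrup.rtf
```

Because the naming is only fairly consistent, a path built this way should be checked against the files actually present before it is relied on.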
In addition, the corpus contains the following subdirectories, each
containing all the documents of a given file type:
doc - all documents in Microsoft Word format
html - all documents in HTML format. NB: some
documents also have an associated subdirectory containing files required by
the HTML version
rtf - all documents in RTF format
sgml - all documents in SGML markup (conforming
to the specification in icon.dtd)
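The per-format subdirectories make it easy to collect every document of one type. The sketch below is an illustration rather than a tool shipped with the corpus; it assumes the four subdirectories sit directly under a corpus root directory, and the names FORMAT_DIRS and documents_by_format are hypothetical.

```python
from pathlib import Path

# Subdirectory names as listed above.
FORMAT_DIRS = ("doc", "html", "rtf", "sgml")


def documents_by_format(corpus_root):
    """Return {format: sorted file names} for each per-format directory found."""
    root = Path(corpus_root)
    listing = {}
    for fmt in FORMAT_DIRS:
        fmt_dir = root / fmt
        if fmt_dir.is_dir():
            # Keep regular files only: the html directory may also hold
            # per-document subdirectories with supporting files.
            listing[fmt] = sorted(p.name for p in fmt_dir.iterdir() if p.is_file())
    return listing
```

Formats whose directory is absent are simply left out of the result, so the function can also be run on a partial copy of the corpus.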
The mapping from company/product names to filenames is fairly consistent, but
it is safest to refer to the lists provided in the following files to access
the data:
manifest.html - a list of all the
documents and links to corresponding files in all the formats (sorted by
company name).
manifest2.html - a list of all the
documents and links to corresponding files in all the formats (sorted by
document name).
all-companies.txt - a list of all the
companies, in alphabetical order.
all-docs.txt - a list of all the
documents, in alphabetical order (there are no document name clashes between
companies).
orig-docs.txt - a list of all the
documents in the original corpus, before removal of duplicates etc.
(same as all-docs.txt in the Base corpus), in alphabetical
order.
removed-docs.txt - a list of all the
documents removed from the original corpus, in alphabetical order.
icon.dtd - the DTD developed in the ICONOCLAST
project and used for the SGML markup.
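One way to use these lists is to read the names from a file such as all-docs.txt and look each one up in an index built by walking the 'data' tree, rather than constructing filenames by hand. The sketch below rests on assumptions that the corpus documentation does not confirm: that the .txt lists hold one name per line, and that a listed name matches the filename without its extension. All function names are hypothetical.

```python
from pathlib import Path


def read_name_list(list_path):
    """Read a plain-text list, assuming one name per line; skip blank lines."""
    with open(list_path, encoding="utf-8", errors="replace") as handle:
        return [line.strip() for line in handle if line.strip()]


def index_by_stem(data_root):
    """Map each file stem under data_root to the sorted paths sharing it."""
    index = {}
    for path in sorted(Path(data_root).rglob("*")):
        if path.is_file():
            index.setdefault(path.stem, []).append(path)
    return index


def missing_documents(names, index):
    """Return the listed names with no file of a matching stem (assumed rule)."""
    return [name for name in names if name not in index]
```

If missing_documents returns a non-empty list, the matching rule assumed here probably does not hold for those entries, and manifest.html is the safer reference.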
The PIL corpus was initially developed as part of the ICONOCLAST
project, supported by the EPSRC (grant no. L77102).
Document last modified on 06 June 2006.