This directory contains the PIL corpus derived from the PIL corpus
data. It includes all the documents originally processed, except that some
non-PIL documents (such as product lists) and near-duplicates have been removed.
This is the recommended version for research purposes. See the
base corpus for the complete original corpus in a similar
format. For details of the corpus package as a whole, see the PIL
corpus home page.
All corpus resources, including the search tool, are available for download: PIL-corpus-2.0.tar.gz (50 MB). To unpack the
archive, you may need a Unix-like operating system such as Linux or Mac OS X. To
run the search tool locally, you will need Perl and a web server.
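If no tar utility is at hand, the archive can probably also be unpacked with Python's standard tarfile module. The following is only a minimal sketch: the function name unpack_corpus and the destination directory are illustrative assumptions, not part of the corpus package, and the layout inside the archive is not assumed here.

```python
import tarfile
from pathlib import Path


def unpack_corpus(archive_path, dest_dir):
    """Extract a gzip-compressed tar archive into dest_dir.

    Hypothetical helper, not part of the corpus distribution.
    Returns the sorted list of member names found in the archive.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions offer extraction filters that reject
            # unsafe members such as absolute paths.
            archive.extractall(dest, filter="data")
        else:
            # Older Python versions: plain extraction without a filter.
            archive.extractall(dest)
    return sorted(names)
```

A call such as unpack_corpus("PIL-corpus-2.0.tar.gz", "pil-corpus") would place the contents in a new directory named pil-corpus (a name chosen here only for illustration) and return the member names for inspection.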
Organisation of the corpus
The corpus is organised in the subdirectory 'data' as a three-level directory
tree. The top level indicates the company which produced the leaflet, the
second level indicates the name of the product, and the third level holds the
actual document in various formats. The filename structure is as follows: