The workshop will be held on April 27, 2014, in conjunction with the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2014).
The PITR workshop series is associated with the ACL Special Interest Group on Speech and Language Processing for Assistive Technologies (SIG-SLPAT).
|(09:00) Session 1 - Keynote|
|09:00||Welcome and opening remarks|
|09:10||Keynote: Choosing Appropriate Words in Generated Texts for Low-Skill Readers Ehud Reiter|
|(11:00) Session 2 - Papers|
|11:00||One Step Closer to Automatic Evaluation of Text Simplification Systems|
Sanja Štajner, Ruslan Mitkov and Horacio Saggion
This study explores the possibility of replacing the costly and time-consuming human evaluation of the grammaticality and meaning preservation of the output of text simplification (TS) systems with some automatic measures. The focus is on six widely used machine translation (MT) evaluation metrics and their correlation with human judgements of grammaticality and meaning preservation in text snippets. As the results show a significant correlation between them, we go further and try to classify simplified sentences into: (1) those which are acceptable; (2) those which need minimal post-editing; and (3) those which should be discarded. The preliminary results, reported in this paper, are promising.
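As an illustration of the kind of correlation analysis described in this abstract, the sketch below computes Pearson and Spearman correlations between per-sentence metric scores and human ratings; the numbers are invented placeholders, not data or code from the paper.

```python
# Minimal sketch: correlating an automatic metric with human judgements.
# The scores below are invented placeholders, not results from the paper.
from scipy.stats import pearsonr, spearmanr

metric_scores = [0.71, 0.45, 0.88, 0.32, 0.64, 0.90, 0.55]   # e.g. a per-sentence MT metric
human_ratings = [4, 2, 5, 1, 3, 5, 3]                         # e.g. grammaticality on a 1-5 scale

pearson_r, pearson_p = pearsonr(metric_scores, human_ratings)
spearman_r, spearman_p = spearmanr(metric_scores, human_ratings)

print(f"Pearson  r = {pearson_r:.3f} (p = {pearson_p:.3f})")
print(f"Spearman r = {spearman_r:.3f} (p = {spearman_p:.3f})")
```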
|11:20||Automatic diagnosis of understanding of medical words|
Natalia Grabar, Thierry Hamon and Dany Amiot
Within the medical field, very specialized terms are commonly used, while their understanding by laymen is not always successful. We propose to study the understandability of medical words by laymen. Three annotators are involved in the creation of the reference data used for training and testing. The features of the words may be linguistic (i.e., number of characters, syllables, number of morphological bases and affixes) and extra-linguistic (i.e., their presence in a reference lexicon, frequency on a search engine). The automatic categorization results show between 0.806 and 0.947 F-measure values. It appears that several features and their combinations are relevant for the analysis of understandability (i.e., syntactic categories, presence in reference lexica, frequency on the general search engine, final substring).
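For readers unfamiliar with the feature types listed above, the following sketch extracts comparable word-level features (character count, a rough syllable estimate, lexicon membership, corpus frequency, final substring); the toy lexicon, frequency counts and helper names are illustrative assumptions, not the authors' resources.

```python
# Minimal sketch of word-level features of the kind listed in the abstract.
# The reference lexicon and frequency list are toy placeholders.
import re

reference_lexicon = {"heart", "attack", "fracture"}        # placeholder reference lexicon
corpus_frequency = {"heart": 5400, "myocardial": 12}       # placeholder frequency counts

def count_syllables(word):
    """Very rough syllable estimate: count vowel groups."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def word_features(word):
    return {
        "n_chars": len(word),
        "n_syllables": count_syllables(word),
        "in_lexicon": word.lower() in reference_lexicon,
        "frequency": corpus_frequency.get(word.lower(), 0),
        "final_trigram": word[-3:].lower(),
    }

print(word_features("myocardial"))
```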
|11:40||Exploring Measures of “Readability” for Spoken Language: Analyzing linguistic features of subtitles to identify age-specific TV programs|
Sowmya Vajjala and Detmar Meurers
We investigate whether measures of readability can be used to identify age-specific TV programs. Based on a corpus of BBC TV subtitles, we employ a range of linguistic readability features motivated by Second Language Acquisition and Psycholinguistics research.
Our hypothesis that such readability features can successfully distinguish between spoken language targeting different age groups is fully confirmed. The classifiers we trained on the basis of these readability features achieve a classification accuracy of 95.9%. Investigating several feature subsets, we show that the authentic material targeting specific age groups exhibits a broad range of linguistic and psycholinguistic characteristics that are indicative of the complexity of the language used.
|12:00||Keyword Highlighting Improves Comprehension for People with Dyslexia|
Luz Rello, Horacio Saggion and Ricardo Baeza-Yates
The use of certain font types and sizes improves the reading performance of people with dyslexia. However, the impact of combining such features with the semantics of the text has not yet been studied. In this eye-tracking study with 62 people (31 with dyslexia), we explore whether highlighting the main ideas of the text in boldface has an impact on readability and comprehensibility. We found that highlighting keywords improved the comprehension of participants with dyslexia. To the best of our knowledge, this is the first result of this kind for people with dyslexia.
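To make the intervention concrete, here is a minimal sketch of how keywords could be wrapped in boldface markup before a text is displayed; the keyword list and example sentence are invented placeholders, and this is not the setup used in the study.

```python
# Minimal sketch: wrapping a given set of keywords in <strong> tags so that
# they are rendered in boldface. In practice the keywords would come from a
# keyphrase extractor or manual annotation; here they are hard-coded.
import re

def highlight_keywords(text, keywords):
    for kw in keywords:
        text = re.sub(rf"\b({re.escape(kw)})\b", r"<strong>\1</strong>",
                      text, flags=re.IGNORECASE)
    return text

sample = "Dyslexia is a reading disorder that affects reading fluency and comprehension."
print(highlight_keywords(sample, ["dyslexia", "reading fluency"]))
```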
|(14:00) Session 3 - Papers|
|14:00||An eye-tracking evaluation of some parser complexity metrics|
Matthew J. Green
Information theoretic measures of incremental parser load were generated from a phrase structure parser and a dependency parser and then compared with incremental eye movement metrics collected for the same temporarily syntactically ambiguous sentences, focussing on the disambiguating word. The findings show that the surprisal and entropy reduction metrics computed over a phrase structure grammar make good candidates for predictors of text readability for human comprehenders. This leads to a suggestion for the use of such metrics in Natural Language Generation (NLG).
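For reference, the two parser-load metrics mentioned, surprisal and entropy reduction, can be sketched as follows using toy probabilities in place of the quantities an incremental parser or language model would supply; this illustrates the standard definitions only, not the paper's implementation.

```python
# Minimal sketch of the two word-level metrics named in the abstract, computed
# from toy probabilities rather than parser output.
import math

def surprisal(p_word_given_prefix):
    """Surprisal of a word: -log2 P(word | prefix)."""
    return -math.log2(p_word_given_prefix)

def entropy(distribution):
    """Entropy over possible continuations/analyses."""
    return -sum(p * math.log2(p) for p in distribution.values() if p > 0)

def entropy_reduction(before, after):
    """Entropy reduction: the non-negative drop in uncertainty after reading a word."""
    return max(0.0, entropy(before) - entropy(after))

print(surprisal(0.05))                                             # an unexpected (disambiguating) word
print(entropy_reduction({"a": 0.5, "b": 0.5}, {"a": 0.9, "b": 0.1}))
```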
|14:20||Syntactic Sentence Simplification for French|
Laetitia Brouwers, Delphine Bernhard, Anne-Laure Ligozat and Thomas Francois
This paper presents a method for the syntactic simplification of French texts. Syntactic simplification aims at making texts easier to understand by simplifying complex syntactic structures that hinder reading. Our approach is based on the study of two parallel corpora (encyclopaedia articles and tales). It aims to identify the linguistic phenomena involved in the manual simplification of French texts and organise them within a typology. We then propose a syntactic simplification system that relies on this typology to generate simplified sentences. The module starts by generating all possible variants before selecting the best subset. The evaluation shows that about 80% of the simplified sentences produced by our system are accurate.
|(16:00) Session 4 - Posters|
|Medical text simplification using synonym replacement: Adapting assessment of word difficulty to a compounding language|
Emil Abrahamsson, Timothy Forni, Maria Skeppstedt and Maria Kvist
Medical texts can be difficult to understand for laymen, due to a frequent occurrence of specialised medical terms. Replacing these difficult terms with easier synonyms can, however, lead to improved readability. In this study, we have adapted a method for assessing difficulty of words to make it more suitable to medical Swedish. The difficulty of a word was assessed not only by measuring the frequency of the word in a general corpus, but also by measuring the frequency of substrings of words, thereby adapting the method to the compounding nature of Swedish. All words having a MeSH synonym that was assessed as easier were replaced in a corpus of medical text. According to the readability measure LIX, the replacement resulted in a slightly more difficult text, while the readability increased according to the OVIX measure and to a preliminary reader study.
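The two Swedish readability measures mentioned, LIX and OVIX, have simple surface-level definitions; below is a small sketch of how they are commonly computed (standard formulas applied to an invented example, not the paper's code or data).

```python
# Minimal sketch of the LIX and OVIX readability measures (standard
# definitions; not the authors' implementation).
import math
import re

def lix(text):
    # LIX = words per sentence + percentage of long words (> 6 characters)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text)
    long_words = [w for w in words if len(w) > 6]
    return len(words) / len(sentences) + 100 * len(long_words) / len(words)

def ovix(text):
    # OVIX = log(tokens) / log(2 - log(types) / log(tokens))
    words = [w.lower() for w in re.findall(r"\w+", text)]
    n, u = len(words), len(set(words))
    return math.log(n) / math.log(2 - math.log(u) / math.log(n))

sample = "Patienten har ont i magen. Patienten får behandling mot ont i magen."
print(f"LIX = {lix(sample):.1f}, OVIX = {ovix(sample):.1f}")
```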
|Segmentation of patent claims for improving their readability|
Gabriela Ferraro, Hanna Suominen and Jaume Nualart
Good readability of text is important to ensure efficiency in communication and eliminate risks of misunderstanding. Patent claims are an example of text whose readability is often poor. In this paper, we aim to improve claim readability by a clearer presentation of its content. Our approach consists of segmenting the original claim content at two levels. First, an entire claim is segmented into the components of preamble, transitional phrase and body, using a rule-based approach. Second, a conditional random field is trained to segment the components into clauses. An alternative approach would have been to modify the claim content, which is, however, prone to also changing the meaning of this legal text. For both segmentation levels, we report results from statistical evaluation of segmentation performance. In addition, a qualitative error analysis was performed to understand the problems underlying the clause segmentation task. Our accuracy in detecting the beginning and end of preamble text is 1.00 and 0.97, respectively. For the transitional phrase, these numbers are 0.94 and 1.00 and for the body text, 1.00 and 1.00. Our precision and recall in the clause segmentation are 0.77 and 0.76, respectively. The results give evidence for the feasibility of automated claim and clause segmentation, which may help not only inventors, researchers, and other laypeople to understand patents but also patent experts to avoid future legal costs due to litigation.
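A minimal sketch of the first, rule-based segmentation level described above might look as follows; the transitional-phrase list and the example claim are simplified placeholders rather than the rules actually used in the paper.

```python
# Minimal rule-based sketch: split a claim into preamble, transitional phrase
# and body by searching for a standard transitional phrase.
import re

TRANSITIONAL_PHRASES = ["comprising", "consisting essentially of", "consisting of"]

def segment_claim(claim):
    for phrase in TRANSITIONAL_PHRASES:
        match = re.search(rf"\b{re.escape(phrase)}\b", claim, flags=re.IGNORECASE)
        if match:
            return {
                "preamble": claim[:match.start()].strip(" ,"),
                "transitional_phrase": match.group(0),
                "body": claim[match.end():].strip(" :,"),
            }
    return {"preamble": claim, "transitional_phrase": "", "body": ""}

claim = ("A device for measuring blood pressure, comprising: a cuff, "
         "a pressure sensor, and a display connected to the sensor.")
print(segment_claim(claim))
```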
|Improving Readability of Swedish Electronic Health Records through Lexical Simplification: First Results|
Gintarė Grigonytė, Maria Kvist, Sumithra Velupillai and Mats Wirén
This paper describes part of an ongoing effort to improve the readability of Swedish electronic health records (EHRs). An EHR contains systematic documentation of a single patient's medical history across time, entered by healthcare professionals with the purpose of enabling safe and informed care. Linguistically, medical records exemplify a highly specialised domain, which can be superficially characterised as having telegraphic sentences involving displaced or missing words, abundant abbreviations, spelling variations including misspellings, and terminology. We report results on lexical simplification of Swedish EHRs, by which we mean detecting the unknown, out-of-dictionary words and trying to resolve them either as compounded known words, abbreviations or misspellings.
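One ingredient of such lexical analysis, resolving an out-of-dictionary word as a compound of known words, can be sketched as follows; the toy lexicon and the naive split (which ignores linking morphemes) are illustrative assumptions, not the authors' method.

```python
# Minimal sketch of resolving an out-of-dictionary word as a compound of known
# words. The dictionary is a toy placeholder; real systems use full lexica and
# handle linking forms, abbreviations and misspellings as well.
dictionary = {"hjärt", "infarkt", "lung", "inflammation"}   # placeholder Swedish lexicon

def split_compound(word):
    """Return a (head, tail) split where both parts are dictionary words, if any."""
    for i in range(2, len(word) - 1):
        head, tail = word[:i], word[i:]
        if head in dictionary and tail in dictionary:
            return head, tail
    return None

print(split_compound("hjärtinfarkt"))       # ('hjärt', 'infarkt')
print(split_compound("lunginflammation"))   # ('lung', 'inflammation')
```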
|An Open Corpus of Everyday Documents for Simplification Tasks|
David Pellow and Maxine Eskenazi
In recent years, interest in creating statistical automated text simplification systems has increased. Many of these systems have used parallel corpora of articles taken from Wikipedia and Simple Wikipedia, or from Simple Wikipedia revision histories, and generate Simple Wikipedia articles. In this work we motivate the need to construct a large, accessible corpus of everyday documents, along with their simplifications, for the development and evaluation of simplification systems that make everyday documents more accessible. We present a detailed description of what this corpus will look like, together with the basic corpus of everyday documents we have already collected. The latter contains everyday documents from many domains, including driver's licensing, government aid and banking, and comprises a total of over 120,000 sentences. We describe our preliminary work evaluating the feasibility of using crowdsourcing to generate simplifications for these documents. This is the basis for our future extended corpus, which will be available to the community of researchers interested in the simplification of everyday documents.
|EACL - Expansion of Abbreviations in CLinical text|
Lisa Tengstrand, Beáta Megyesi, Aron Henriksson, Martin Duneld and Maria Kvist
We present a distributional semantic approach to find candidates for the original, expanded form of an abbreviation, and combine this with Levenshtein distance to choose the correct candidate among the semantically related words. We apply the method to radiology reports and medical journal texts, and compare the results to general Swedish.
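As an illustration of the second step, a minimal Levenshtein-distance ranking over candidate expansions might look like this; the candidate list stands in for the output of a distributional model and is invented for the example.

```python
# Minimal sketch: pick, among semantically related candidate words, the one
# closest to the abbreviation by Levenshtein (edit) distance.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

abbreviation = "pat"
candidates = ["patient", "patologi", "palpation"]   # placeholder nearest neighbours in vector space
best = min(candidates, key=lambda w: levenshtein(abbreviation, w))
print(best, levenshtein(abbreviation, best))
```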
|A Quantitative Insight into the Impact of Translation on Readability|
Alina Maria Ciobanu and Liviu Dinu
In this paper we investigate the impact of translation on readability. We propose a quantitative analysis of several shallow, lexical and morpho-syntactic features that have been traditionally used for assessing readability and have proven relevant for this task. We conduct our experiments on a parallel corpus of transcribed parliamentary sessions and we investigate readability metrics for original segments of text, written in the language of the speaker, and their translations.
|Classifying easy-to-read texts without parsing|
Johan Falkenjack and Arne Jonsson
Document classification using automated linguistic analysis and machine learning (ML) has been shown to be a viable road forward for readability assessment. The best models can be trained to decide if a text is easy to read or not with very high accuracy, e.g. a model using 117 parameters from shallow, lexical, morphological and syntactic analyses achieves 98.9% accuracy. In this paper we compare models created by parameter optimization over subsets of that total model, to find out to what extent different high-performing models tend to consist of the same parameters and whether it is possible to find models that only use features not requiring parsing. We used a genetic algorithm to systematically optimize parameter sets of fixed sizes, using the accuracy of a Support Vector Machine classifier as the fitness function. Our results show that it is possible to find models almost as good as the currently best models while omitting parsing-based features.
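A minimal sketch of the overall idea, genetic search over feature subsets with SVM cross-validation accuracy as the fitness function, is given below; the data, population size and operators are toy choices for illustration and do not reproduce the paper's experiments.

```python
# Minimal sketch of feature-subset selection with a genetic algorithm using
# SVM cross-validation accuracy as fitness. The dataset is random toy data.
import random
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))                       # toy data: 200 samples, 20 features
y = (X[:, 0] + X[:, 3] - X[:, 7] > 0).astype(int)    # only a few features are informative

def fitness(mask):
    cols = [i for i, keep in enumerate(mask) if keep]
    if not cols:
        return 0.0
    return cross_val_score(SVC(), X[:, cols], y, cv=3).mean()

def random_mask(k=5):
    mask = [0] * X.shape[1]
    for i in random.sample(range(X.shape[1]), k):
        mask[i] = 1
    return mask

population = [random_mask() for _ in range(12)]
for generation in range(10):
    population.sort(key=fitness, reverse=True)
    parents = population[:6]
    children = []
    while len(children) < 6:
        a, b = random.sample(parents, 2)
        cut = random.randrange(1, len(a))
        child = a[:cut] + b[cut:]                    # one-point crossover
        flip = random.randrange(len(child))          # single-gene mutation
        child[flip] = 1 - child[flip]
        children.append(child)
    population = parents + children

best = max(population, key=fitness)
print("selected features:", [i for i, keep in enumerate(best) if keep])
print("cv accuracy:", round(fitness(best), 3))
```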
|An Analysis of Crowdsourced Text Simplifications|
Marcelo Amancio and Lucia Specia
We present a study on the text simplification operations undertaken collaboratively by Simple English Wikipedia contributors. The aim is to understand whether complex-simple parallel corpora involving this version of Wikipedia can be used as a data source to induce simplification rules and, if the data needs to be filtered to avoid noisy segment pairs, whether this can be done automatically. A subset of the corpus was first manually analysed to identify the transformation operations present in the corpus. We then built machine learning models to attempt to automatically classify segments based on such operations. Our results show that the most common transformation operations performed by humans are paraphrasing (39.80%) and dropping of information (26.76%), which are some of the most difficult operations to generalise from data. They are also the most difficult operations to identify automatically, with the lowest overall classifier accuracy among all operations (73% and 59%, respectively).
|An evaluation of syntactic simplification rules for people with autism|
Richard Evans, Constantin Orasan and Iustin Dornescu
Syntactically complex sentences constitute an obstacle for some people with Autistic Spectrum Disorders. This paper evaluates a set of simplification rules specifically designed for tackling complex and compound sentences. In total, 127 different rules were developed for the rewriting of complex sentences and 56 for the rewriting of compound sentences. The evaluation assessed the accuracy of these rules individually and revealed that fully automatic conversion of these sentences into a more accessible form is not very reliable.
|+||Guest Poster: a preview of an EACL Main Session Paper: Assessing the Relative Reading Level of Sentence Pairs for Text Simplification. Sowmya Vajjala and Detmar Meurers|
Submissions will be judged on appropriateness, clarity, originality/innovativeness, correctness/soundness, meaningful comparison, thoroughness, significance, contributions to research, and replicability. Each submission will be reviewed by at least three program committee members.
Papers should be prepared in EACL format (see instructions and downloadable style files). They may consist of up to eight (8) pages of content, plus two extra pages for references; final versions should take into account reviewers' comments. Papers will be presented as either (i) a talk and a poster, or (ii) a poster only, as determined by the program committee. Decisions on presentation format will be based on the nature rather than the quality of the work. There will be no distinction in the proceedings between papers presented orally and as posters.
Please submit your paper via the online START Conference Manager system.
January 30, 2014: Deadline for paper submission
February 20, 2014: Notification of acceptance
March 3, 2014: Camera-ready deadline
April 27, 2014: Workshop
Stefan Bott, Universitat Pompeu Fabra, Spain
Kevyn Collins-Thompson, University of Michigan, USA
Siobhan Devlin, University of Sunderland, UK
Micha Elsner, Ohio State University, USA
Richard Evans, University of Wolverhampton, UK
Oliver Ferschke, Technische Universität Darmstadt, Germany
Thomas Francois, University of Louvain, Belgium
Caroline Gasperin, SwiftKey, UK
Albert Gatt, University of Malta, Malta
Raquel Hervas, Universidad Complutense de Madrid, Spain
Veronique Hoste, University College Ghent, Belgium
Matt Huenerfauth, The City University of New York (CUNY), USA
David Kauchak, Middlebury College, USA
Annie Louis, University of Edinburgh, UK
Ruslan Mitkov, University of Wolverhampton, UK
Hitoshi Nishikawa, NTT, Japan
Ehud Reiter, University of Aberdeen, UK
Matthew Shardlow, University of Manchester, UK
Lucia Specia, University of Sheffield, UK
Ivelina Stoyanova, Bulgarian Academy of Sciences, Bulgaria
Irina Temnikova, Qatar Computing Research Institute, Qatar
Sowmya Vajjala, University of Tübingen, Germany
Ielka van der Sluis, University of Groningen, The Netherlands
Jennifer Williams, MIT, USA
Kristian Woodsend, University of Edinburgh, UK