OU Logo

Home | News | Project | People | Key Ideas | Publications | Corpus | Quotes | Contact Us

NumGen Logo


Generating intelligent descriptions of numerical quantities for people with different levels of numeracy


Photo: The Independent, 4 June 2008

The NumGen Project

NumGen was a project funded by the ESRC and led by Dr Sandra Williams of the Natural Language Generation Group, The Computing Department, The Open University.


NumGen was a one-year scoping study that investigated how to express numerical quantities, especially proportions (fractions, percentages and ratios), for different audiences.

Numerical quantities are extremely common in all kinds of documents. Pick up any newspaper and you will find it packed with them - "Red meat increases the risk of cancer by 67 percent" or "More than a quarter of students were awarded A grades". It is surprising, then, that in Natural Language Generation (the study of computer applications that automatically generate documents) numerical quantities have received little attention beyond the decision of whether to output digits (such as 27) or number words (such as twenty-seven).

Another surprise was the lack of research on numerical proportions in Linguistics. What kinds of variations might we expect in proportion expressions? Suppose we asked 50 people to write a proportion in the sentence "[......................] of people believe they are smarter than average" given data 983/1000? How many different proportion phrases would we get? We reported our results on this question in a journal article (currently under review).

It is important to know more about how to express numerical information because different users have different needs. For example, not all users are numerate. In fact, a 2003 UK Government study found that nearly half of adults have problems understanding mathematical concepts such as percentages.

So what has NumGen achieved over its one-year lifetime? Here is a list of major activities and outputs, pdf (56 KB).

Some other project materials:


NumGen ESRC project started in May 2008 and officially ended in April 2009. However, I am still maintaining my interest in this research area. I am still writing papers. I am keen to develop new bids for funding and to work with research students. Please contact me if you are interested.


Principal Investigator and Research Fellow

Dr Sandra Williams, The Open University

Advisory Committee

Dr Richard Power, The Open University
Professor Donia Scott, University of Sussex

Visiting Student

Susana Bautista, Universidad Complutense de Madrid

External Collaborators

Dr Pablo Gervás, Universidad Complutense de Madrid
Dr Raquel Hervás, Universidad Complutense de Madrid


Key Ideas and Applications

The NumGen project has generated a wealth of research questions, new ideas and hypothesis, only some of which we have published, as yet. We also developed a theory of how to plan a specification for numerical proportions (numbers between 0 and 1) to be used in an Natural Language Generation (NLG) system and a working model that uses Artificial Intelligence techniques. Brief outlines follow:

Precision and mathematical form both change with document structure

Often a particular numerical fact is mentioned several times in a document. If the first mention occurs in the title, abstract, or opening sentences, and subsequent mentions occur further on, the first mention tends to be vaguer (less precise) and expressed in a simpler mathematical form than subsequent mentions. Evidence from our corpus supported these ideas, for example, in the first sentence of an article about exam results: “more than a quarter of A-Level papers were marked A” and in the fifth paragraph of the same article: “25.9% of A-Level papers were awarded an A-grade this summer”. We produced a poster to summarise our findings and a conference paper to further explain them. Our results can be used in the generation of documents with referring expressions that contain numerical facts.

Numerical hedges are a vital part of numerical expressions

Hedges such as more than, less than, around and exactly indicate degrees of vagueness or preciseness in descriptions of numerical quantities such as “around 50%”. No account of numerical quantity expressions is complete without an account of numerical hedges. We analysed our corpus to investigate the range of variation in numerical hedges, to find out whether hedges tend to co-occur with round numbers, and to discover how small or large is the difference between the hedged value and the actual value (when available). Our results are currently being compiled into a journal article and are implemented in our generation grammar.

Planning specifications for numerical proportions

Our planner is based on the idea that three fundamental variables are required to specify a numerical proportion such that a generation grammar can subsequently express it in natural language. These are: mathematical form (fraction, ratio, percentage), degree of precision or roundness, and a variable to specify the semantics of a numerical hedge (e.g., greater than, less than, approximately, exactly). The planner is modelled as a Constraint Logic Program. We are currently describing the underlying theory and the design of the working program in a journal article.

NumGen Corpus

We studied variations in the way that people write numerical quantities by collecting sets of texts written by different people about the same numerical facts. For example, we collected a set of articles from different newspapers about the 2008 A-Level results. Each article mentioned certain numerical facts such as the overall pass rate and the proportions of students who had been awarded A and B grades. We built up a collection of 110 articles on ten topics. For each article, we annotated numerical quantities with:

The collection contains nearly two thousand annotated numerical quantities. We hope to make it available to other researchers and are currently seeking permission from publishers of the original articles.

Example of Annotation

<!DOCTYPE w3c-doctype="numgen">
<TITLE>XML for numerical expressions</TITLE>
<ARTICLE id="001" topic="A-Levels" source="AOL">
   <SENTENCE id="1">Another record year for A-levels</SENTENCE>
   <SENTENCE id="2">
      Last Updated:
      	<numex id="001" type="date" format="digits" units="GMT">Thursday, 14 August 2008, 08:28 GMT</numex> 
   <SENTENCE id="3"> 
      The A-level pass rate rose for the 
      	<numex id="002" type="ordinal" format="digits" units="year" value="26">26th year</numex>  
       in a row as record number of teenagers achieved top grades. 
   <SENTENCE id="4">But figures released by the exam boards highlighted startling discrepancies in Grade A 
      pass rates between regions across England.</sentence> 
   <SENTENCE id="5"> 
      Statistics from the exam boards showed greater improvements in students in the South East getting 
      A grades in the past 
      	<numex id="003" type="cardinal" format="words" units="years" value="6">six years</numex>  
      than those in the North East. 
   <SENTENCE id="6"> 
      The South East has seen a 
      	<numex id="004" type="percentage" format="digits" value="0.061">6.1%</numex>  
       increase in A grades to 
      	<numex id="005" type="percentage" format="digits" value="0.291">29.1%</numex>  
      	<numex id="006" type="date" format="digits">2002</numex>  
       but the North East has seen an improvement of only 
      	<numex id="007" type="percentage" format="digits" value="0.021">2.1%</numex>  
      	<numex id="008" type="percentage" format="digits" value="0.198">19.8%</numex>  
      during the same period. 
   <SENTENCE id="7"> 
      But the percentage of pupils gaining passing E grades is rising quicker in the North East - an 
      improvement of 
      	<numex id="009" type="percentage" format="digits" value="0.34">3.4%</numex>  
      	<numex id="010" type="cardinal" format="words" units="years" value="6">six years</numex>  
      compared with 
      	<numex id="011" type="percentage" format="digits" value="0.28">2.8%</numex>  
      in the South East. 
   <SENTENCE id="8"> 
      Overall figures showed the national pass rate soared 
      	<numex id="012" type="percentage" format="digits" hedge="above" hedge-semantics="greaterthan" value="0.97">above 97%</numex>  
      for the first time this year, while 
      	<numex id="013" type="ratio" format="words" value="0.25">one in four</numex>  
      sixth-formers were awarded A grades ( 
      	<numex id="014" type="percentage" format="digits" value="0.259">25.9%</numex>  
      , up from 
      	<numex id="015" type="percentage" format="digits" value="0.253">25.3%</numex>  
       last year). 


Dr. Sandra Williams
Natural Language Generation Group
The Computing Department
The Open University
Walton Hall
Milton Keynes
United Kingdom

Phone: 00 44 (0)  1908 85 87 80
Fax: 00 44 (0) 1908 65 21 40

Some Quotes

“I'd like to emphasize that the calculations we will do are deliberately imprecise. Simplification is a key to understanding. First, by rounding the numbers, we can make them easier to remember. Second, rounded numbers allow quick calculations. For example, in this book, the population of the United Kingdom is 60 million, and the population of the world is 6 billion. I'm perfectly capable of looking up more accurate figures, but accuracy would get in the way of fluent thought.” (David J.C. MacKay, Sustainable Energy - without the hot air, UIT Cambridge Ltd, p.16, 2009)
“Specialist maths teachers are to be introduced in every school in England.
So can you do a simple sum? The BBC's James Westhead has been to find out.” (BBC News, June 17 2008)
“According to latest figures, around 23.8 million adults have numeracy skills below a C grade GCSE.
This includes 6.8m without even the most basic functional maths skills needed to pay household bills, understand wage slips and read train timetables.” (Telegraph, June 5 2008)
“The number of American children gunned down has doubled every year since 1950" - Sometimes junk statistics are caused simply by lazy wording. Perhaps the best (worst) example came in a prospective PhD student's dissertation, published in 1995. It appeared in the first chapter of Damned Lies and Satistics by Joel Best, who called it "the worst social statistic ever". It read: "Every year since 1950, the number of American children gunned down has doubled." Really? Let's do the maths. Say only one child was gunned down in 1950. According to our student, that number would have doubled every year, so two dead in 1951, four in 1952, eight in 1953... that makes 1,024 in 1960, and so on. By 1995, the year of the report, more than - wait for it - 35 trillion children were gunned down. That's really quite a lot. It turns out that the student had taken the figure from a government report, which stated: "The number of American children killed each year by guns has doubled since 1950." So the figure had doubled over 45 years, not every year. By garbling his words, the student came up with a wildly inaccurate statistic. McConway's verdict: "It's so important to be precise when writing about statistics. There are two conflicting pressures - to keep it simple and to tell the story properly. This case is terrible, but sometimes even statisticians get it wrong.” (Simon Usborne, The Independent, April 9 2008)
“Sometimes when I hear that something or other is selling at a fraction of its normal cost, I comment that the fraction is probably 4/3, and am met with a blank stare.” (J.A. Paulos, Innumeracy: Mathematical Illiteracy and its Consequences, p.164)

Sandra Williams, August 2013

NumGen was funded by the Economic and Social Research Council under Small Grant Ref. RES-000-22-2760