Gordon Rugg's Homepage
Updated: May 4, 2003
Retention Study
4: Predicting performance
Gordon Rugg, Brendan d’Cruz, Lorraine Foreman-Peck & David Roberts

4.1 Introduction

Although it is desirable to have a clear understanding of the student’s view, this is not necessarily the best way to approach understanding and prediction of retention. There is a substantial body of research showing that people’s introspections into their own behaviour are of low validity. There is also a substantial body of research on prediction and forecasting, which indicates that quite simple models can perform surprisingly well, once the right predictor variables have been identified.

An important issue to bear in mind here is that the variables which can predict failure are not necessarily able to predict success, and vice versa. For instance, if a student is experiencing financial problems, then they are at increased risk of doing badly on their course; however, if a student is doing well financially, this does not mean that they are more likely to perform at above average academic levels.

This means that studies of retention rates cannot simply draw on the literature on predicting academic success, and extrapolate backwards from it. In some ways, this may be just as well, since there has been unease for some time about the value of some standard pedagogic devices as predictors of later academic success. For instance, in the USA the relationship between Scholastic Aptitude Test (SAT) results at high school and later performance at university is not impressively high.

The study described below shows how a simple mathematical approach can be used to test a widespread belief about predicting the likely quality of a student dissertation. This is used as a demonstration of how tacit knowledge can be modelled and tested using explicit models.

This study is used as a simple demonstration of concept. Prediction of retention rates requires large data sets, and is outside the scope of this study. In addition, there are many methods which can be used for prediction and forecasting; optimal prediction of retention rates and performance depends on choice of the most suitable methods, which will form the basis of further work.

4.2 Background

There is a widespread belief among experienced dissertation supervisors that it is possible to make a fair prediction of the quality of a dissertation simply by reading the title. A dissertation with an unoriginal title, such as “The growth of e-commerce”, is likely to reflect unoriginal work; a dissertation with a more original title, such as “Modelling quality measures with rough set theory”, is likely to reflect more original and better quality work.

If true, this would offer a useful way of identifying potentially weak dissertations in the very early stages – the students’ proposed dissertation titles could be examined before the students began work on the dissertations, and potentially weak students could be offered appropriate support.

Although this topic is in itself of relatively minor importance, it is a useful way of demonstrating the underlying principle of predicting something from variables of which the students themselves are unaware. A more important question relating to this one involves assessment of the academic quality of published work from the title. This has serious implications for on-line searching, where automatically assessing the relevance of records via software is a much-researched field, but automatically assessing the quality of records is much less well understood. From a student’s point of view, it would be extremely helpful to have some method of automatically assessing the quality of a document for purposes such as initial reading about a topic – Internet and library searches on topics such as e-commerce typically find large numbers of potentially relevant documents, but do not give any indication of which documents are worth reading and which are not.

The belief that titles reflect quality is a plausible one, but at first sight concepts such as “unoriginal title” are reflections of tacit knowledge, and of limited use to anyone who has not built up sufficient experience to develop the relevant skill. However, there is one simple method of measuring originality which involves explicit knowledge, and which is well suited to automation.

This method is based on information theory, and is used in many on-line search engines as a basis for deciding which records to display first. The core concept is that the information value of a word is related to its rarity: the rarer the word, the higher the information content. A search engine will typically count the number of occurrences in the database index of each word in the search string, then represent the information value of each word as [1/number of occurrences], then sum the individual scores. The engine then displays the relevant records in order of their information value. Thus, for instance, the word “Internet” in a search string will typically find enormous numbers of hits, and have a negligible information weight as a result; a rarer word such as “stochastic” will typically find many fewer hits, and will have a much higher information value as a result.
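The weighting scheme described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code of any actual search engine: the function name `information_value`, the occurrence counts and the decision to skip words that are absent from the index are all assumptions made for the example.

```python
def information_value(query, index_counts):
    """Sum 1/occurrences over the words of a query.

    Words missing from the index are skipped; this is an assumption,
    since the text does not say how unseen words are handled.
    """
    total = 0.0
    for word in query.lower().split():
        count = index_counts.get(word, 0)
        if count > 0:
            total += 1.0 / count
    return total


# Hypothetical occurrence counts, invented purely for illustration.
index_counts = {"internet": 1_000_000, "stochastic": 2_000, "models": 50_000}

common = information_value("internet models", index_counts)
rare = information_value("stochastic models", index_counts)
print(common < rare)  # the query containing the rarer word scores higher
```

With these made-up counts, the query containing “stochastic” receives the higher score, which is the behaviour the paragraph above describes.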

The argument is plausible, has implications for academic practice, and is empirically testable. The experiment described below tests this hypothesis. It was performed as part of an MSc dissertation on an MSc course at University College Northampton by David Roberts, and we are grateful to him for permission to use his unpublished data and some of his text.

4.3 Method

The sample consisted of a set of 36 dissertations from University College Northampton, which had already been marked.

The marking process involves several stages: the dissertation is marked by a first marker (usually the supervisor) and by an independent second marker, after which a mark is agreed. In a few cases, where there are disagreements about the mark, a third marker may be brought in. Once a mark has been agreed, the dissertation may be scrutinised by one or both of the external examiners on the course, who can recommend a change of mark. The result of this is that the final mark is something agreed by a set of peers, and should (in principle) be relatively immune to the personal bias of an individual marker. From the viewpoint of this study, this has the advantage of providing a set of documents (the dissertations) which have already been assessed for quality by a rigorous process.

Each dissertation title was assessed for information weight in two ways.

The first way involved assessing weight by frequency of occurrence of terms within the titles in the sample – for instance, how often the term “e-commerce” occurred in the title of other dissertations in the sample. This approach was used as a demonstrator of how larger-scale work should be conducted, since the most important factor is how often a term is used within dissertations, as opposed to how often it is used in the world at large. Since abstracts of PhDs in the UK are already stored on-line, these could be used as a source of data for frequency of occurrence of terms.

The second involved assessing weight by frequency of occurrence of terms on the Internet. This approach was used as a demonstrator of how the Internet itself can be used as a source of data, rather than as a way of finding information. This approach to the Internet is still in its infancy, but offers considerable scope for future work.

The amount of publicly available information on the Web is increasing rapidly. The Web is a gigantic digital library, a searchable 15 billion word encyclopedia which has stimulated research and development in information retrieval and dissemination (Lawrence and Giles, 1999). The volume of electronic text available is staggering: the World Wide Web alone contains over 200 million pages of text comprising nearly 500 gigabytes of data. Moreover, growth is exponential, nearly doubling in size every six months (Baeza-Yates & Ribeiro-Neto, 1999) and the number of users increased from 1 million to 25 million in the five years to January 1997 (Schatz, 1997).

A more recent estimate puts the size of the World Wide Web at over 800 million indexable pages (Glover et al, 2001). This wealth of data brings a price, because its sheer size poses problems for search engines. There are various reasons for this, some of which involve technical aspects of how the search engine works, and others of which involve practices in Web site design.

Just how poor Web coverage can be is shown in the table below (Lawrence & Giles, 1999).

Table 4.1: Estimated coverage of each engine with respect to the estimated size of the Web, and the percentage of invalid links returned by each engine (from 575 queries performed December 15-17, 1997).

Search Engine | Hotbot | AltaVista | Northern Light | Excite | Infoseek | Lycos
Coverage with respect to estimated Web size | 34% | 34% | 34% | 14% | 10% | 3%
Percentage of dead links returned | 5.3% | 2.5% | 5.0% | 2.0% | 2.6% | 1.6%

Problems may be caused by keyword spamming or spam indexing where keywords are repeatedly and excessively used in HTML metatags for a site to promote retrieval by search engines. A variation of this is to put keywords into metatags that do not relate to the page’s actual content, such as general keywords that are frequently used in queries, or keywords copied from metatags in popular sites. To make matters worse, some advertisers attempt to gain people’s attention by taking measures meant to mislead automated search engines (Brin & Page, 1998).

Since it is very difficult even for experts to evaluate search engines, search engine bias is particularly insidious. A good example was OpenText, which was reported to be selling companies the right to be listed at the top of the search results for particular queries (Marchiori, 1997).

This contrasts sharply with the situation in information retrieval research, where most work is on small, well-controlled, homogeneous collections such as collections of scientific papers or news stories on a related topic. Indeed, the primary benchmark for information retrieval, the Text Retrieval Conference (TREC), uses a fairly small, well-controlled collection for its benchmarks. Its "Very Large Corpus" benchmark is only 20GB, compared to the 147GB from the initial Google crawl of 24 million web pages. Things that work well on TREC often do not produce good results on the Web (Brin & Page, 1998).

Numerous experiments have been carried out on the TREC collection; in essence, most of them use requests in several forms, namely Long, Medium and Very Short (Sparck-Jones et al, 1998).

The Long forms cover all of the title, description and narrative fields; the Medium forms cover the title plus description fields; and the Very Short forms cover just the title field. Very Short Form requests were considered to be the nearest to the type of brief and perhaps not very well formulated queries that are often encountered in the operational environment, e.g. in the use of Internet search engines (Mendelzon et al, 1997). The advantage of Very Short Form data is that it relates to “real-world” operational scenarios, but the TREC tests have shown that performance as a whole declines with minimal requests.

Our test collection data comprises 36 student dissertation titles and is therefore “very short form data”. Abstracts, by contrast, are compact and information-dense. Most of the (non-stopword) terms in an abstract are salient for retrieval purposes because these terms tend to mirror the most important topics in the full text (Hearst & Plaunt, 1993).

The method chosen to analyse the data here is inverse frequency weighting. This is based upon having a corpus of text (such as dissertation titles). The frequency of any given term in the corpus can then be compared with the total number of occurrences of all terms in the corpus, to produce a frequency measure.

Two approaches have been identified for the full-text indexing of text databases. Inverted files create a list of every unique word in the database together with the location of each word occurrence. Thus, a search for the term “computer” will first locate the term in the inverted file, and then retrieve the relevant documents based on the occurrence list for “computer”. Signature files divide the text database into a series of blocks. A bit-string is constructed for each block. Searching for a term using this approach involves scanning the signatures for a matching bit pattern. Blocks that match the desired bit pattern are then retrieved (Campbell, 1994).
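The signature file approach can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration: the 64-bit signature width, the use of CRC32 to choose a bit, and the simplification of setting one bit per word. Because different words can map to the same bit, a matching signature only marks a block as a candidate, which then has to be checked against its text.

```python
import zlib

SIGNATURE_BITS = 64  # hypothetical signature width


def word_bit(word):
    """Pick one bit for a word; CRC32 is an arbitrary, deterministic choice."""
    return 1 << (zlib.crc32(word.lower().encode("utf-8")) % SIGNATURE_BITS)


def block_signature(words):
    """OR together the bits of every word in a block."""
    signature = 0
    for word in words:
        signature |= word_bit(word)
    return signature


def candidate_blocks(term, signatures):
    """Return indices of blocks whose signature contains the term's bit."""
    bit = word_bit(term)
    return [i for i, signature in enumerate(signatures) if signature & bit]


# Two invented blocks of text, already split into words.
blocks = [["computer", "science"], ["video", "games", "industry"]]
signatures = [block_signature(block) for block in blocks]
print(candidate_blocks("computer", signatures))
```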

An inverted file is a word-oriented mechanism for indexing a text collection in order to speed up searching. The inverted file structure is composed of two elements: the vocabulary and the occurrences. The vocabulary is the set of all different words in the text. For each such word, a list of all the text positions where the word appears is stored; the set of those lists is called the occurrences. Selecting an example at random from our data produces the following result.

Title: An analysis of Sony's domination of the U.K. video games industry
Starting position of each word: 1 4 13 16 23 34 37 41 44 50 56

Table 4.2: Occurrences of terms

Term | Position | Number occurring
Analysis | 4 | 1
Domination | 23 | 1
Games | 50 | 1
Industry | 56 | 1
Sony’s | 16 | 1
U.K. | 41 | 1
Video | 44 | 1

The keywords are stored alphabetically in the index file and each term appears only once; for each keyword, a list of pointers to the qualifying documents is maintained in the postings file. This method is followed by almost all commercial information retrieval systems (Faloutsos & Oard, 1995; Salton & McGill, 1983).
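The vocabulary and occurrences of an inverted file can be built for the example title with a short Python sketch. The function name `build_inverted_file` is illustrative, and the title is written here with “UK” rather than “U.K.” as an assumption, because that spelling reproduces the word positions listed above. Unlike Table 4.2, the sketch does not remove stopwords, so “an”, “of” and “the” also appear in the vocabulary.

```python
import re
from collections import defaultdict


def build_inverted_file(text):
    """Map each word to the 1-based character positions where it starts."""
    occurrences = defaultdict(list)
    for match in re.finditer(r"\S+", text):
        occurrences[match.group().lower()].append(match.start() + 1)
    return dict(occurrences)


title = "An analysis of Sony's domination of the UK video games industry"
inverted = build_inverted_file(title)

# The vocabulary is the alphabetically sorted set of distinct words.
for word in sorted(inverted):
    print(word, inverted[word], len(inverted[word]))
```

A search for a term then needs only a dictionary lookup followed by a walk through the stored positions, rather than a scan of the whole text.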

Method – experiment 1
The basic data used was as described above. Stopwords (i.e. words with minimal semantic content such as “a” or “of”) were removed, and the resulting terms were then stemmed (i.e. truncated so that grammatical variants of the same term would be treated as equivalent – see examples below). A relatively “weak” form of stemming was used, and the following stemmed pairs were identified in the data:
change/changing
competitive/competitiveness
effect/effects
effective/effectiveness
market/marketing
retail/retailing
service/services
strategic/strategies
supermarket/supermarkets
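The preprocessing step can be sketched in Python as follows. This is an illustration under stated assumptions rather than the procedure actually used in the study: the stopword list is hypothetical (the study's own list is not given), the stemming is done by a lookup table built from the pairs listed above, and the choice of which member of each pair serves as the shared stem is arbitrary.

```python
import re

# Hypothetical stopword list; the study's actual list is not given.
STOPWORDS = {"a", "an", "and", "in", "of", "on", "the", "to"}

# Variant -> shared stem, taken from the stemmed pairs listed above.
STEMS = {
    "changing": "change",
    "competitiveness": "competitive",
    "effects": "effect",
    "effectiveness": "effective",
    "marketing": "market",
    "retailing": "retail",
    "services": "service",
    "strategies": "strategic",
    "supermarkets": "supermarket",
}


def title_terms(title):
    """Lowercase a title, drop stopwords and map variants to their stem."""
    words = re.findall(r"[a-z]+(?:-[a-z]+)*", title.lower())
    return [STEMS.get(word, word) for word in words if word not in STOPWORDS]


print(title_terms("The effects of retailing strategies"))
print(title_terms("The growth of e-commerce"))
```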

The next stage was to break the data down into its component terms, arranged in alphabetical order. This enables the basic Robertson-Sparck Jones weighting function to be calculated. Each term is given a weight, and each dissertation title is then given a score that is the sum of the weights of the terms it contains.

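For example, using the formula:

\[ \mathrm{ICF} = \log_{10} \frac{N - n + 0.5}{n + 0.5} \]

where

N is the number of indexed documents in the collection
n is the number of documents containing the term

(the logarithm is taken to base 10, which reproduces the values below for N = 36).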

This produces the following result for title number 1:

Term | No. of titles with term | ICF
brand | 1 | 1.37
decade | 1 | 1.37
environment | 2 | 1.14
evolved | 1 | 1.37
investigation | 4 | 0.86
last | 1 | 1.37
marks | 1 | 1.37
over | 1 | 1.37
retail | 3 | 0.98
spencer | 1 | 1.37
within | 5 | 0.76
Total | | 13.36 (to 2 decimal places)
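The worked example for title number 1 can be checked with a short Python sketch. The function name `icf` and the dictionary of counts are illustrative; the counts are copied from the table above, N = 36 is the sample size given in the Method section, and a base-10 logarithm is assumed because it reproduces the tabulated weights.

```python
import math


def icf(total_titles, titles_with_term):
    """Weight of a term: log10((N - n + 0.5) / (n + 0.5))."""
    return math.log10(
        (total_titles - titles_with_term + 0.5) / (titles_with_term + 0.5)
    )


N = 36  # number of dissertation titles in the sample

# Number of titles containing each term of title number 1 (from the table).
title_1_counts = {
    "brand": 1, "decade": 1, "environment": 2, "evolved": 1,
    "investigation": 4, "last": 1, "marks": 1, "over": 1,
    "retail": 3, "spencer": 1, "within": 5,
}

weights = {term: icf(N, n) for term, n in title_1_counts.items()}
score = sum(weights.values())

for term, weight in weights.items():
    print(f"{term:15s}{weight:.2f}")
print(round(score, 2))  # 13.36
```

The total is computed from the unrounded weights, which is why it comes to 13.36 rather than the 13.33 obtained by adding the rounded figures in the table.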

A sample correlation coefficient, r, was then calculated. The results of experiment 1 are discussed later in the text.

Method – experiment 2
Experiment 2 extends the work carried out in experiment 1 insofar as the Internet itself is used as a corpus of data.

Three search engines were selected. Yahoo! was chosen as it was originally constructed by human indexers and an interesting effect would be to see whether the results it produced would be markedly different from those from automatic indexers. The second search engine chosen was Google, due to its superior ranking algorithm, as discussed earlier in the text. Finally a metasearch engine, Ixquick, was chosen. Ixquick is known for the support for searching methods that it allows. The main advantages of metasearch engines are their ability to combine the results of many sources and the fact that the user can pose the same query to various sources through a single common interface. Anecdotal evidence suggests that the new breed of metasearchers should return more relevant pages.

The procedure undertaken was to enter the data into each search engine and record the number of “hits” for Yahoo!, Google and Ixquick respectively. There was the possible risk that search engine updates could be made that could distort the results, hence the data was captured as quickly as possible on a Sunday evening.

4.4 Results

Research results – experiment 1

We are essentially interested in whether there is a linear relationship between variable x (the ICF score of the title) and variable y (the mark attained for the dissertation).

Therefore we need a number to indicate how much of a linear relationship there is between x and y. This number is the correlation coefficient, denoted by r and is given by the formula:-

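\[ r = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sqrt{\left[\sum x^2 - \dfrac{(\sum x)^2}{n}\right]\left[\sum y^2 - \dfrac{(\sum y)^2}{n}\right]}} \]

where the sums run over the n pairs of sample values of x and y.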

The value of the correlation coefficient is always between –1 and +1. A value equal to –1 indicates a perfect linear relationship between the sample values of x and y, with the value of y decreasing as the value of x increases, i.e. the larger x becomes, the smaller y becomes and the smaller x becomes, the larger y becomes.

A value of r equal to +1 also indicates a perfect linear relationship between the sample values, but one in which the value of y increases as x increases. If there is no linear relationship between sample values of x and y, then r will have a value near zero. As r increases from 0 to +1 (or decreases from 0 to –1) then the linear relationship between the sample values of x and y becomes more pronounced.
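The sample correlation coefficient can be computed directly from sums over the data, as in the following Python sketch. The function name `pearson_r` and the small data sets are invented for illustration and are not the study's data; the sketch simply shows the two extreme cases described above, r = +1 and r = –1.

```python
import math


def pearson_r(xs, ys):
    """Sample correlation coefficient computed from sums of x, y, xy, x^2, y^2."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 values")
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = sum_xy - sum_x * sum_y / n
    denominator = math.sqrt(
        (sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n)
    )
    if denominator == 0:
        raise ValueError("r is undefined when either sample is constant")
    return numerator / denominator


# Invented values: y rises with x in the first case and falls in the second.
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # perfect positive relationship
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))  # perfect negative relationship
```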

Therefore substituting the figures from the data, the correlation coefficient, r, is obtained for experiment 1.

The result, 0.101, is so near zero that it indicates only a very weak linear relationship between the title given to the student’s work and the mark obtained.

Research results – experiment 2

As noted earlier, data was entered into each of the three search engines and the number of “hits” was recorded.

As with experiment 1, the correlation coefficient, r, was calculated for Yahoo!, Google and Ixquick, and the values obtained were –0.033, –0.036 and –0.192 respectively. As with experiment 1, these results are so near to zero that there is effectively no linear relationship between the titles and the marks obtained.

One interesting point is that Ixquick produces a much larger number of “hits” than the other search engines. This is to be expected, due to the nature of a metasearch engine as opposed to an “ordinary” search engine. However, Ixquick did not produce any results whatsoever for titles 3, 5, 19, 22, 23 and 26. Ordinarily, consideration should be given to using another search engine or to investigating why this occurs only with Ixquick. However, the lack of correlation in both experiment 1 and experiment 2 means that there would be little point in this case.


4.5 Discussion and conclusion

The hypothesis that student performance can be predicted from dissertation titles is not proven on the basis of experiments 1 and 2. By the same token, the hypothesis that it is possible to assess the academic quality of a document automatically from the title is not proven.

However there are areas that require further work. Firstly, there is the issue of sample size. It would be interesting to see results obtained not only using larger samples but perhaps samples from different institutions and different student groups. An obvious candidate would appear to be PhD titles; however, PhDs are not given a numeric mark, so this approach would not be usable with them. A more promising approach is to use MSc thesis results, which in many institutions are given a percentage grade initially, with this grade being translated into a pass/fail/distinction later in the process.

Another area which would merit further investigation is the use of abstracts rather than titles as the basis for the calculation. As discussed above, short records (such as titles) are generally poor sources of data for information retrieval; abstracts are significantly longer, are a standard part of MSc theses, and are short enough to be tractable. If this approach found positive results, it could fairly easily be applied to students’ work as discussed earlier (although applying it to evaluation of records found in on-line searching would be more problematic).

The absence of a correlation between title frequency and mark is interesting, given the widespread belief that this correlation exists. One possible explanation is that the belief is simply mistaken; another is that the correlation involves something more subtle than what was investigated here (for instance, the length of the title, or the presence of giveaway terms indicating weak or strong scholarship).

For the purposes of this study, the truth or falsity of this belief is of secondary importance. The main thing is that this study shows how a widespread “craft skill” belief can be translated into a format which can be represented as explicit knowledge, and then tested for validity. It is tempting to speculate about how many other beliefs about student performance could be tested using the same generic approach, and then used to improve our practice in helping students.

 
