Retention Study
4: Predicting performance
Gordon Rugg, Brendan d’Cruz, Lorraine Foreman-Peck & David Roberts
4.1 Introduction
Although
it is desirable to have a clear understanding of the student’s view,
this is not necessarily the best way to approach understanding and prediction
of retention. There is a substantial body of research showing that people’s
introspections into their own behaviour are of low validity. There is
also a substantial body of research on prediction and forecasting, which
indicates that quite simple models can perform surprisingly well, once
the right predictor variables have been identified.
An important
issue to bear in mind here is that the variables which can predict failure
are not necessarily able to predict success, and vice versa. For instance,
if a student is experiencing financial problems, then they are at increased
risk of doing badly on their course; however, if a student is doing well
financially, this does not mean that they are more likely to perform at
above average academic levels.
This means
that studies of retention rates cannot simply draw on the literature on
predicting academic success, and extrapolate backwards from it. In some
ways, this may be just as well, since there has been unease for some time
about the value of some standard pedagogic devices as predictors of later
academic success. For instance, in the USA the relationship between Scholastic
Aptitude Test (SAT) results at high school and later performance at university
is not impressively high.
The study
described below shows how a simple mathematical approach can be used to
test a widespread belief about predicting the likely quality of a student
dissertation. This is used as a demonstration of how tacit knowledge can
be modelled and tested using explicit models.
This study
is used as a simple demonstration of concept. Prediction of retention
rates requires large data sets, and is outside the scope of this study.
In addition, there are many methods which can be used for prediction and
forecasting; optimal prediction of retention rates and performance depends
on choice of the most suitable methods, which will form the basis of further
work.
4.2 Background
There is
a widespread belief among experienced dissertation supervisors that it
is possible to make a fair prediction of the quality of a dissertation
simply by reading the title. A dissertation with an unoriginal title,
such as “The growth of e-commerce”, is likely to reflect unoriginal
work; a dissertation with a more original title, such as “Modelling
quality measures with rough set theory”, is likely to reflect more
original and better quality work.
If true,
this would offer a useful way of identifying potentially weak dissertations
in the very early stages – the students’ proposed dissertation
titles could be examined before the students began work on the dissertations,
and potentially weak students could be offered appropriate support.
Although
this topic is in itself of relatively minor importance, it is a useful
way of demonstrating the underlying principle of predicting something
from variables of which the students themselves are unaware. A more important
question relating to this one involves assessment of the academic quality
of published work from the title. This has serious implications for on-line
searching, where automatically assessing the relevance of records via
software is a much-researched field, but automatically assessing the quality
of records is much less well understood. From a student’s point
of view, it would be extremely helpful to have some method of automatically
assessing the quality of a document for purposes such as initial reading
about a topic – Internet and library searches on topics such as
e-commerce typically find large numbers of potentially relevant documents,
but do not give any indication of which documents are worth reading and
which are not.
The belief
that titles reflect quality is a plausible one, but at first sight concepts
such as “unoriginal title” are reflections of tacit knowledge,
and of limited use to anyone who has not built up sufficient experience
to develop the relevant skill. However, there is one simple method of
measuring originality which involves explicit knowledge, and which is
well suited to automation.
This method
is based on information theory, and is used in many on-line search engines
as a basis for deciding which records to display first. The core concept
is that the information value of a word is related to its rarity: the
rarer the word, the higher the information content. A search engine will
typically count the number of occurrences in the database index of each
word in the search string, then represent the information value of each
word as [1/number of occurrences], then sum the individual scores. The
engine then displays the relevant records in order of their information
value. Thus, for instance, the word “Internet” in a search
string will typically find enormous numbers of hits, and have a negligible
information weight as a result; a rarer word such as “stochastic”
will typically find many fewer hits, and will have a much higher information
value as a result.
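The weighting scheme described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code of any actual search engine: the function name `information_value` and the occurrence counts in `index_counts` are invented for the example.

```python
def information_value(query, index_counts):
    """Sum 1/occurrences over the words of a query.

    Words absent from the index are skipped, since 1/0 is undefined.
    """
    total = 0.0
    for word in query.lower().split():
        occurrences = index_counts.get(word, 0)
        if occurrences > 0:
            total += 1.0 / occurrences
    return total


# Hypothetical occurrence counts, invented purely for illustration.
index_counts = {"internet": 1_000_000, "stochastic": 500, "models": 20_000}

common = information_value("internet models", index_counts)
rare = information_value("stochastic models", index_counts)
print(rare > common)  # the query containing the rarer word scores higher
```

Under these assumed counts the rare word dominates the score, which is the behaviour the paragraph above describes.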
The argument
is plausible, has implications for academic practice, and is empirically
testable. The experiment described below tests this hypothesis. It was
performed as part of an MSc dissertation on an MSc course at University
College Northampton by David Roberts, and we are grateful to him for permission
to use his unpublished data and some of his text.
4.3 Method
The sample
consisted of a set of 36 dissertations from University College Northampton,
which had already been marked.
The marking
process involves several stages: the dissertation is marked by a first
marker (usually the supervisor) and by an independent second marker, after
which a mark is agreed. In a few cases, where there are disagreements
about the mark, a third marker may be brought in. Once a mark has been
agreed, the dissertation may be scrutinised by one or both of the external
examiners on the course, who can recommend a change of mark. The result
of this is that the final mark is something agreed by a set of peers,
and should (in principle) be relatively immune to the personal bias of
an individual marker. From the viewpoint of this study, this has the advantage
of providing a set of documents (the dissertations) which have already
been assessed for quality by a rigorous process.
Each dissertation
title was assessed for information weight in two ways.
The first
way involved assessing weight by frequency of occurrence of terms within
the titles in the sample – for instance, how often the term “e-commerce”
occurred in the title of other dissertations in the sample. This approach
was used as a demonstrator of how larger-scale work should be conducted,
since the most important factor is how often a term is used within dissertations,
as opposed to how often it is used in the world at large. Since abstracts
of PhDs in the UK are already stored on-line, these could be used as a
source of data for frequency of occurrence of terms.
The second
involved assessing weight by frequency of occurrence of terms on the Internet.
This approach was used as a demonstrator of how the Internet itself can
be used as a source of data, rather than as a way of finding information.
This approach to the Internet is still in its infancy, but offers considerable
scope for future work.
The amount
of publicly available information on the Web is increasing rapidly. The
Web is a gigantic digital library, a searchable 15 billion word encyclopedia
which has stimulated research and development in information retrieval
and dissemination (Lawrence and Giles, 1999). The volume of electronic
text available is staggering: the World Wide Web alone contains over 200
million pages of text comprising nearly 500 gigabytes of data. Moreover,
growth is exponential, nearly doubling in size every six months (Baeza-Yates
& Ribeiro-Neto, 1999) and the number of users increased from 1 million
to 25 million in the five years to January 1997 (Schatz, 1997).
A more recent
estimate puts the size of the World Wide Web at over 800
million indexable pages (Glover et al, 2001). This wealth of data brings
a price, because its sheer size poses problems for search engines. There
are various reasons for this, some of which involve technical aspects
of how the search engine works, and others of which involve practices
in Web site design.
Just how
poor search engine coverage of the Web can be is shown in the table below
(Lawrence & Giles, 1999).
Table 4.1: Estimated coverage of each engine with respect to the estimated size of
the Web, and the percentage of invalid links returned by each engine (from
575 queries performed December 15-17, 1997).
Search engine    Coverage of estimated Web size   Dead links returned
Hotbot           34%                              5.3%
AltaVista        34%                              2.5%
Northern Light   34%                              5.0%
Excite           14%                              2.0%
Infoseek         10%                              2.6%
Lycos            3%                               1.6%
Problems
may be caused by keyword spamming or spam indexing where keywords are
repeatedly and excessively used in HTML metatags for a site to promote
retrieval by search engines. A variation of this is to put keywords into
metatags that do not relate to the page’s actual content, such as
general keywords that are frequently used in queries, or keywords copied
from metatags in popular sites. To make matters worse, some advertisers
attempt to gain people’s attention by taking measures meant to mislead
automated search engines (Brin & Page, 1998).
Since it
is very difficult even for experts to evaluate search engines, search
engine bias is particularly insidious. A good example was OpenText, which
was reported to be selling companies the right to be listed at the top
of the search results for particular queries (Marchiori, 1997).
This contrasts
sharply with the situation in information retrieval research, where most
work is on small well controlled homogeneous collections such as collections
of scientific papers or news stories on a related topic. Indeed, the primary
benchmark for information retrieval, the Text Retrieval Conference (TREC), uses
a fairly small, well controlled collection for its benchmarks. Its
"Very Large Corpus" benchmark is only 20GB compared to the 147GB
from the initial Google crawl of 24 million web pages. Things that work
well on TREC often do not produce good results on the Web (Brin &
Page, 1998).
There have
been numerous experiments carried out on the TREC collection; in essence,
however, most experiments are carried out in several forms, namely Long,
Medium and Very Short (Sparck-Jones et al, 1998).
The Long
forms cover all of the title, description and narrative fields, the Medium
forms cover the title plus description field and the Very Short forms
cover just the title field. Very Short Form requests were considered to
be the nearest to the type of brief and perhaps not very well formulated
queries that are often encountered in the operational environment, e.g.
in Internet search engines (Mendelzon et al, 1997). The advantage
of Very Short Form data is that it relates to “real-world”
operational scenarios, but the TREC tests have shown that performance
as a whole declines with minimal requests.
Our test collection data comprises 36 student dissertation titles and
is therefore “very short form data”. Abstracts are compact
and information-dense. Most of the (non-stopword) terms in an abstract
are salient for retrieval purposes because these terms tend to mirror
the most important topics in the full text (Hearst & Plaunt, 1993).
The method
chosen to analyse the data here is inverse frequency weighting. This is
based upon having a corpus of text (such as dissertation titles). The
frequency of any given term in the corpus can then be compared with the
total number of occurrences of all terms in the corpus, to produce a frequency
measure.
Two approaches
have been identified for the full-text indexing of text databases. Inverted
files create a list of every unique word in the database together with
the location of each word occurrence. Thus, a search for the term “computer”
will first locate the term in the inverted file, and then retrieve the
relevant documents based on the occurrence list for “computer”.
Signature files divide the text database into a series of blocks. A bit-string
is constructed for each block. Searching for a term using this approach
involves scanning the signatures for a matching bit pattern. Blocks that
match the desired bit pattern are then retrieved (Campbell, 1994).
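The signature file approach can be sketched as follows. This is a hedged illustration only: the source does not say how the bit-strings are built, so the sketch assumes a simple superimposed-coding scheme in which each word sets a few bits chosen by a CRC32 hash. The names `word_signature`, `block_signature` and `candidate_blocks`, the signature width and the sample blocks are all assumptions made for the example.

```python
import zlib

SIGNATURE_BITS = 64  # assumed signature width
BITS_PER_WORD = 3    # assumed number of bits set per word


def word_signature(word):
    """Map a word to a bit pattern with up to BITS_PER_WORD bits set."""
    signature = 0
    for seed in range(BITS_PER_WORD):
        digest = zlib.crc32(f"{seed}:{word.lower()}".encode("utf-8"))
        signature |= 1 << (digest % SIGNATURE_BITS)
    return signature


def block_signature(block):
    """Superimpose (bitwise OR) the signatures of all words in a block."""
    signature = 0
    for word in block.split():
        signature |= word_signature(word)
    return signature


def candidate_blocks(term, blocks):
    """Return indices of blocks whose signature contains the term's bits.

    Matches are candidates only: different words can set the same bits,
    so each candidate block still has to be checked against its text.
    """
    wanted = word_signature(term)
    return [i for i, block in enumerate(blocks)
            if (block_signature(block) & wanted) == wanted]


# Hypothetical blocks, invented for illustration.
blocks = ["the growth of e-commerce",
          "modelling quality measures",
          "rough set theory"]
print(candidate_blocks("theory", blocks))
```

A block that really contains the term is always among the candidates; the cost of the compact signatures is the occasional false match, which is why the retrieved blocks are verified afterwards.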
An inverted
file is a word-oriented mechanism for indexing a text collection
in order to speed up the searching task. The inverted file structure is
composed of two elements: the vocabulary and the occurrences. The vocabulary
is the set of all different words in the text. For each such word a list
of all the text positions of where the word appears is stored and the
set of those lists is called the occurrences. From our data, selecting
an example at random produces the following result.
Title: An analysis of Sony's domination of the U.K. video games industry
Word start positions: 1, 4, 13, 16, 23, 34, 37, 41, 44, 50, 56
Table 4.2: Occurrences of terms
Term         Position   Number occurring
Analysis     4          1
Domination   23         1
Games        50         1
Industry     56         1
Sony’s       16         1
U.K.         41         1
Video        44         1
The keywords
are stored alphabetically in the index file and each term appears only
once: for each keyword a list of pointers is maintained to the qualifying
documents in the postings file. This method is followed by almost all
commercial information retrieval systems (Faloutsos & Oard, 1995;
Salton & McGill, 1983).
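The vocabulary and occurrences of an inverted file can be sketched as below. This is a minimal illustration, not the software used in the study: the function name `build_inverted_file` is invented, words are split on single spaces, and positions are 1-based character offsets. Unlike Table 4.2, the sketch keeps every word, including stopwords.

```python
from collections import defaultdict


def build_inverted_file(text):
    """Map each word to the 1-based character positions where it starts."""
    occurrences = defaultdict(list)
    position = 1
    for word in text.split(" "):
        if word:
            occurrences[word.lower()].append(position)
        position += len(word) + 1  # skip the word and the following space
    # Sorting the keys gives the alphabetically ordered vocabulary.
    return dict(sorted(occurrences.items()))


title = "An analysis of Sony's domination of the U.K. video games industry"
inverted = build_inverted_file(title)
for term, positions in inverted.items():
    print(term, positions, len(positions))
```

Each vocabulary entry carries its list of positions, and the length of that list is the number of occurrences. Positions for the words after “U.K.” depend on whether the abbreviation is written with full stops, so they may differ from those shown in Table 4.2.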
Method – experiment 1
The basic data used were as described above. Stopwords (i.e. words
with minimal semantic content such as “a” or “of”)
were removed, and the resulting terms were then stemmed (i.e. truncated
so that grammatical variants of the same term would be treated as equivalent
– see examples below). A relatively “weak” form of stemming
was used and the following stemmed pairs were identified in the data:-
change/changing
competitive/competitiveness
effect/effects
effective/effectiveness
market/marketing
retail/retailing
service/services
strategic/strategies
supermarket/supermarkets
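The preprocessing step can be sketched as follows. This is an illustration under stated assumptions: the source does not give its stopword list, so `STOPWORDS` is a short invented one, and the choice of the first member of each pair as the stem is also an assumption. The function name `index_terms` and the sample title are hypothetical.

```python
# Assumed stopword list; the source does not give the full list used.
STOPWORDS = {"a", "an", "and", "in", "of", "on", "the", "to"}

# Variant -> stem, taken from the stemmed pairs listed above.
STEMS = {
    "changing": "change",
    "competitiveness": "competitive",
    "effects": "effect",
    "effectiveness": "effective",
    "marketing": "market",
    "retailing": "retail",
    "services": "service",
    "strategies": "strategic",
    "supermarkets": "supermarket",
}


def index_terms(title):
    """Lower-case a title, drop stopwords, and merge listed variants."""
    terms = []
    for word in title.lower().split():
        word = word.strip(".,:;?!\"'()")
        if word and word not in STOPWORDS:
            terms.append(STEMS.get(word, word))
    return terms


# Hypothetical title, invented for illustration.
print(index_terms("The effects of marketing strategies on supermarkets"))
# → ['effect', 'market', 'strategic', 'supermarket']
```

Using an explicit lookup table rather than a rule-based stemmer mirrors the “weak” stemming described above: only the listed variants are merged, and all other words are left untouched.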
The next
stage was to break the data down into its component terms, listed in alphabetical
order. This enables the traditional Robertson-Sparck
Jones weight, as noted earlier in the text, to be calculated. Here each
term is given a weight and hence each dissertation title is given a score
that is the sum of the weights of its terms.
For example,
using the formula
\[ ICF = \log \frac{N - n + 0.5}{n + 0.5} \]
where
N is the number of indexed documents in the collection, and
n is the number of documents containing the term.
This produces
the following result for title number 1:
Term            No. of titles with term   ICF
brand           1                         1.37
decade          1                         1.37
environment     2                         1.14
evolved         1                         1.37
investigation   4                         0.86
last            1                         1.37
marks           1                         1.37
over            1                         1.37
retail          3                         0.98
spencer         1                         1.37
within          5                         0.76
Total                                     13.36 (to 2 decimal places)
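The calculation for title number 1 can be reproduced with the short sketch below. It assumes a base-10 logarithm, which is the base that matches the tabulated weights, and N = 36, the number of titles in the sample. The function name `icf` and the variable names are chosen for the example.

```python
import math


def icf(n_total, n_with_term):
    """Weight of a term: log10((N - n + 0.5) / (n + 0.5))."""
    return math.log10((n_total - n_with_term + 0.5) / (n_with_term + 0.5))


N_TITLES = 36  # size of the sample

# Number of titles containing each term of title 1, from the table above.
title_1 = {
    "brand": 1, "decade": 1, "environment": 2, "evolved": 1,
    "investigation": 4, "last": 1, "marks": 1, "over": 1,
    "retail": 3, "spencer": 1, "within": 5,
}

weights = {term: icf(N_TITLES, n) for term, n in title_1.items()}
score = sum(weights.values())

for term, weight in weights.items():
    print(f"{term:<15}{title_1[term]:>3}{weight:>7.2f}")
print(f"{'total':<15}{'':>3}{score:>7.2f}")
```

The total of 13.36 is obtained by summing the unrounded weights; adding the weights after rounding each to two decimal places gives a slightly smaller figure.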
A sample
correlation coefficient, r, was then calculated. The results of experiment
1 are discussed later in the text.
Method – experiment 2
Experiment 2 extends the work carried out in experiment 1 insofar as the
Internet itself is used as a corpus of data.
Three search
engines were selected. Yahoo! was chosen as it was originally constructed
by human indexers and an interesting effect would be to see whether the
results it produced would be markedly different from those from automatic
indexers. The second search engine chosen was Google, due to its superior
ranking algorithm, as discussed earlier in the text. Finally a metasearch
engine, Ixquick, was chosen. Ixquick is known for the range of searching
methods that it supports. The main advantages of metasearch engines are
their ability to combine the results of many sources and the fact that
the user can pose the same query to various sources through a single common
interface. Anecdotal evidence suggests that the new breed of metasearchers
should return more relevant pages.
The procedure
undertaken was to enter the data into each search engine and record the
number of “hits” for Yahoo!, Google and Ixquick respectively.
There was the possible risk that search engine updates could be made that
could distort the results, hence the data was captured as quickly as possible
on a Sunday evening.
4.4 Results
Research results – experiment 1
We are essentially
interested in whether there is a linear relationship between variable
x (the ICF score for the title) and variable y (the mark attained for the dissertation).
Therefore
we need a number to indicate how much of a linear relationship there is
between x and y. This number is the correlation coefficient, denoted by
r and is given by the formula:-
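\[ r = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sqrt{\left[ \sum x^2 - \frac{(\sum x)^2}{n} \right] \left[ \sum y^2 - \frac{(\sum y)^2}{n} \right]}} \]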
The value
of the correlation coefficient is always between –1 and +1. A value
equal to –1 indicates a perfect linear relationship between the
sample values of x and y, with the value of y decreasing as the value
of x increases, i.e. the larger x becomes, the smaller y becomes and the
smaller x becomes, the larger y becomes.
A value of
r equal to +1 also indicates a perfect linear relationship between the
sample values, but one in which the value of y increases as x increases.
If there is no linear relationship between sample values of x and y, then
r will have a value near zero. As r increases from 0 to +1 (or decreases
from 0 to –1) then the linear relationship between the sample values
of x and y becomes more pronounced.
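The correlation coefficient can be computed directly from the sums in the formula given earlier, as in the sketch below. Here n is the number of (x, y) pairs in the sample. The function name `correlation` and the sample values are invented for illustration and are not the study's data.

```python
import math


def correlation(xs, ys):
    """Sample correlation coefficient r, computed from sums over the data."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples of at least 2 values")
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = sum_xy - sum_x * sum_y / n
    denominator = math.sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
    return numerator / denominator


# Hypothetical values, invented for illustration (not the study's data).
scores = [1.0, 2.0, 3.0, 4.0]
rising = [50.0, 55.0, 60.0, 65.0]
falling = [65.0, 60.0, 55.0, 50.0]

print(correlation(scores, rising))   # y rises in step with x
print(correlation(scores, falling))  # y falls in step with x
```

The two invented samples illustrate the limiting cases described above: a perfectly rising line gives r = +1 and a perfectly falling line gives r = –1. The function is undefined when either variable is constant, because the denominator is then zero.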
Therefore
substituting the figures from the data, the correlation coefficient, r,
is obtained for experiment 1.
The result
of 0.101 is so near zero that it indicates only a very
weak relationship between the title given to the student’s work and the
mark obtained.
Research results – experiment 2
As noted
earlier, data was entered into each of the three search engines and the number
of “hits” was recorded.
As with experiment
1, the correlation coefficient, r, was calculated for Yahoo!, Google and
Ixquick and the values obtained were –0.033, –0.036 and –0.192 respectively.
As with experiment 1, these results are so near to zero that there is
effectively no linear relationship between the titles and the marks obtained.
One interesting
point that should be noted is that Ixquick produces a much larger number
of “hits” than the other search engines. This is to be expected,
due to the nature of a metasearch engine as opposed to an “ordinary”
search engine. However, Ixquick did not produce any results whatsoever
for title numbers 3, 5, 19, 22, 23 and 26. Ordinarily, consideration should
be given to using another search engine or to investigating why this only
occurs with Ixquick. However, the lack of correlation in either experiment
1 or 2 means that there would be little point in this case.
4.5 Discussion and conclusion
The hypothesis
that student performance can be predicted from dissertation titles is not proven
on the basis of experiments 1 and 2. By the same token, the hypothesis
that it is possible to assess the academic quality of a document automatically
from the title is not proven.
However there
are areas that require further work. Firstly, there is the issue of sample
size. It would be interesting to see results obtained not only using larger
samples but perhaps samples from different institutions and different
student groups. An obvious candidate would appear to be PhD titles; however,
PhDs are not given a numeric mark, so this approach would not be usable
with them. A more promising approach is to use MSc thesis results, which
in many institutions are given a percentage grade initially, with this
grade being translated into a pass/fail/distinction later in the process.
Another area
which would merit further investigation is the use of abstracts rather
than titles as the basis for the calculation. As discussed above, short
records (such as titles) are generally poor sources of data for information
retrieval; abstracts are significantly longer, are a standard part of
MSc theses, and are short enough to be tractable. If this approach found
positive results, it could fairly easily be applied to students’
work as discussed earlier (although applying it to evaluation of records
found in on-line searching would be more problematic).
The absence
of a correlation between title frequency and mark is interesting, given
the widespread belief that this correlation exists. One possible explanation
is that the belief is simply mistaken; another is that the correlation
involves something more subtle than what was investigated here (for instance,
the length of the title, or the presence of giveaway terms indicating
weak or strong scholarship).
For the purposes
of this study, the truth or falsity of this belief is of secondary importance.
The main thing is that this study shows how a widespread “craft
skill” belief can be translated into a format which can be represented
as explicit knowledge, and then tested for validity. It is tempting to
speculate about how many other beliefs about student performance could
be tested using the same generic approach, and then used to improve our
practice in helping students.