query
problems
effective search
precision and recall
anomalies
example complexity
compare term frequencies per document -- O(M*N)
where M is the number of terms and N is the number
of documents.
Since both M and N can become very large we need
to make an effort to reduce the size of
the frequency table.
reduction
reduction of complexity
Alternatively, the choice of which words are
considered relevant may be determined by
taking into account the area of application
or the interest of a particular group of users.
user-oriented measures
draft version 1 (16/5/2003)
But these must be considered anomalies
(that is, sick cases),
and so the problem is to find an algorithm
that performs optimally with respect to both
precision and recall.
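To make the trade-off concrete, here is a minimal sketch using the standard set-based definitions of precision and recall. The document ids and the relevance judgments are made up for illustration, and the degenerate strategy of returning the whole collection is an assumed example of such an anomaly.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved documents."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Guard against division by zero for empty sets.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical collection and relevance judgments.
collection = {"d0", "d1", "d2", "d3"}
relevant = {"d0", "d2"}

# Returning everything: perfect recall, poor precision.
everything = precision_recall(collection, relevant)
# Returning a single relevant document: perfect precision, poor recall.
single = precision_recall({"d0"}, relevant)

print(everything)  # (0.5, 1.0)
print(single)      # (1.0, 0.5)
```

Neither extreme is useful on its own, which is why both measures have to be considered together.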
frequency tables
term/document   d0   d1   d2
snacks          1    0    0
drinks          1    0    3
rock-roll       0    1    1
As the name implies, a frequency table gives
a frequency count for particular words or phrases
across a number of documents.
In effect, a complete document database may be summarized
in a frequency table.
In other words, the frequency table may be considered as an
index to facilitate the search for similar documents.
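As an illustration, a frequency table could be built along the following lines. This is only a sketch: the document texts are hypothetical, chosen so that the counts match the example table, and whitespace splitting stands in for real tokenization.

```python
from collections import Counter

# Hypothetical toy documents; the texts are invented for this sketch.
documents = {
    "d0": "snacks drinks",
    "d1": "rock-roll",
    "d2": "drinks rock-roll drinks drinks",
}
terms = ["snacks", "drinks", "rock-roll"]

def frequency_table(documents, terms):
    """Return {term: {doc_id: count}} for the given terms."""
    counts = {doc_id: Counter(text.split())
              for doc_id, text in documents.items()}
    # A Counter returns 0 for terms that do not occur in a document.
    return {term: {doc_id: counts[doc_id][term] for doc_id in documents}
            for term in terms}

table = frequency_table(documents, terms)
for term in terms:
    print(term, [table[term][doc_id] for doc_id in documents])
```

The nested loop over terms and documents is where the O(M*N) size of the table shows up.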
We can, for example, introduce a stop list
to prevent irrelevant words from entering the table,
and we may restrict ourselves to
including word stems only,
to reduce multiple entries to
one canonical form. With some additional effort
we could even deal with synonymy and polysemy
by introducing, respectively, equivalence classes
and alternatives (although we then need a suitable
way of disambiguation).
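A rough sketch of the first two reductions, a stop list and stemming, might look as follows. The stop list is a made-up sample, and the suffix stripping is a deliberately crude stand-in for a real stemming algorithm.

```python
from collections import Counter

# Hypothetical stop list; a real one would be much longer.
STOP_WORDS = {"the", "a", "and", "of", "with"}

def naive_stem(word):
    """Crude suffix stripping, for illustration only (not a real stemmer)."""
    for suffix in ("ing", "es", "s"):
        # Keep at least three characters of the word as its stem.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def reduced_counts(text):
    """Count word stems in text, skipping words on the stop list."""
    words = text.lower().split()
    return Counter(naive_stem(w) for w in words if w not in STOP_WORDS)

counts = reduced_counts("the drinks and a drink with snacks")
print(dict(counts))  # {'drink': 2, 'snack': 1}
```

Seven input words collapse into two table entries, which is the kind of reduction the frequency table needs.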
By the way, did you notice that frequency tables
may be regarded as feature vectors for documents?
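Reading each column of the example table as a vector, similar documents can be found by comparing vectors. The sketch below uses cosine similarity, which is one common choice and is an assumption here, not something the text prescribes.

```python
import math

# Columns of the example table, in the order (snacks, drinks, rock-roll).
vectors = {
    "d0": [1, 1, 0],
    "d1": [0, 0, 1],
    "d2": [0, 3, 1],
}

def cosine(u, v):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # Treat an all-zero vector as similar to nothing.
    return dot / norm if norm else 0.0

for other in ("d1", "d2"):
    print("d0", other, round(cosine(vectors["d0"], vectors[other]), 3))
```

With these counts d0 shares no terms with d1, so their similarity is zero, while d0 and d2 overlap on drinks.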
research directions -- user-oriented measures
Consider a reference collection,
an example information request
and a retrieval strategy to be evaluated.
Then the coverage ratio
may be defined as the fraction of the documents
known to be relevant that is actually retrieved,
or more precisely the number
of (known) relevant documents retrieved divided by the
total number of documents known by the user to be relevant.
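The definition translates directly into a small function. This is a sketch only; the function name and the document ids in the usage example are hypothetical.

```python
def coverage_ratio(retrieved, known_relevant):
    """Fraction of the documents known to be relevant that were retrieved."""
    known_relevant = set(known_relevant)
    if not known_relevant:
        raise ValueError("the user must know at least one relevant document")
    return len(set(retrieved) & known_relevant) / len(known_relevant)

# Hypothetical example: the user knows four relevant documents,
# and the retrieval strategy returns two of them plus one other.
known = {"d0", "d1", "d3", "d4"}
retrieved = {"d0", "d3", "d5"}
print(coverage_ratio(retrieved, known))  # 0.5
```

Note that the measure depends on what this particular user already knows, not on the full set of relevant documents in the collection.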
eliens@cs.vu.nl