Telechargé par udlfse

10-EvaluationMetrics

publicité
Ranked Retrieval
precision and recall
•
[email protected]: proportion of retrieved
top-K documents that are
relevant
•
[email protected]: proportion of relevant
documents that are retrieved
in the top-K
•
Assumption: the user will only
examine the top-K results
K=
10
..
1
5
Monday, March 25, 13
Ranked Retrieval
precision and recall: exercise
•
Assume 20 relevant documents
K
1
2
3
4
5
6
7
8
9
10
Monday, March 25, 13
[email protected]
(1/1) = 1.0
[email protected]
(1/20) = 0.05
K=
1
1
6
Ranked Retrieval
precision and recall: exercise
•
Assume 20 relevant documents
K
1
2
3
4
5
6
7
8
9
10
Monday, March 25, 13
[email protected]
(1/1) = 1.0
(1/2) = 0.5
[email protected]
(1/20) = 0.05
(1/20) = 0.05
K=
2
1
7
Ranked Retrieval
precision and recall: exercise
•
Assume 20 relevant documents
K
1
2
3
4
5
6
7
8
9
10
Monday, March 25, 13
[email protected]
(1/1) = 1.0
(1/2) = 0.5
(2/3) = 0.67
[email protected]
(1/20) = 0.05
(1/20) = 0.05
(2/20) = 0.10
K=
3
1
8
Ranked Retrieval
precision and recall: exercise
•
Assume 20 relevant documents
K
1
2
3
4
5
6
7
8
9
10
Monday, March 25, 13
[email protected]
(1/1) = 1.0
(1/2) = 0.5
(2/3) = 0.67
(3/4) = 0.75
[email protected]
(1/20) = 0.05
(1/20) = 0.05
(2/20) = 0.10
(3/20) = 0.15
K=
4
1
9
Ranked Retrieval
precision and recall: exercise
•
Assume 20 relevant documents
K
1
2
3
4
5
6
7
8
9
10
Monday, March 25, 13
[email protected]
(1/1) = 1.0
(1/2) = 0.5
(2/3) = 0.67
(3/4) = 0.75
(4/5) = 0.80
[email protected]
(1/20) = 0.05
(1/20) = 0.05
(2/20) = 0.10
(3/20) = 0.15
(4/20) = 0.20
K=
5
2
0
Ranked Retrieval
precision and recall: exercise
•
Assume 20 relevant documents
K
1
2
3
4
5
6
7
8
9
10
Monday, March 25, 13
[email protected]
(1/1) = 1.0
(1/2) = 0.5
(2/3) = 0.67
(3/4) = 0.75
(4/5) = 0.80
(5/6) = 0.83
[email protected]
(1/20) = 0.05
(1/20) = 0.05
(2/20) = 0.10
(3/20) = 0.15
(4/20) = 0.20
(5/20) = 0.25
K=
6
2
1
Ranked Retrieval
precision and recall: exercise
•
Assume 20 relevant documents
K
1
2
3
4
5
6
7
8
9
10
Monday, March 25, 13
[email protected]
(1/1) = 1.0
(1/2) = 0.5
(2/3) = 0.67
(3/4) = 0.75
(4/5) = 0.80
(5/6) = 0.83
(6/7) = 0.86
[email protected]
(1/20) = 0.05
(1/20) = 0.05
(2/20) = 0.10
(3/20) = 0.15
(4/20) = 0.20
(5/20) = 0.25
(6/20) = 0.30
K=
7
2
2
Ranked Retrieval
precision and recall: exercise
•
Assume 20 relevant documents
K
1
2
3
4
5
6
7
8
9
10
Monday, March 25, 13
[email protected]
(1/1) = 1.0
(1/2) = 0.5
(2/3) = 0.67
(3/4) = 0.75
(4/5) = 0.80
(5/6) = 0.83
(6/7) = 0.86
(6/8) = 0.75
[email protected]
(1/20) = 0.05
(1/20) = 0.05
(2/20) = 0.10
(3/20) = 0.15
(4/20) = 0.20
(5/20) = 0.25
(6/20) = 0.30
(6/20) = 0.30
K=
8
2
3
Ranked Retrieval
precision and recall: exercise
•
Assume 20 relevant documents
K
1
2
3
4
5
6
7
8
9
10
Monday, March 25, 13
[email protected]
(1/1) = 1.0
(1/2) = 0.5
(2/3) = 0.67
(3/4) = 0.75
(4/5) = 0.80
(5/6) = 0.83
(6/7) = 0.86
(6/8) = 0.75
(7/9) = 0.78
[email protected]
(1/20) = 0.05
(1/20) = 0.05
(2/20) = 0.10
(3/20) = 0.15
(4/20) = 0.20
(5/20) = 0.25
(6/20) = 0.30
(6/20) = 0.30
(7/20) = 0.35
K=
9
2
4
Ranked Retrieval
precision and recall: exercise
•
Assume 20 relevant documents
K
1
2
3
4
5
6
7
8
9
10
Monday, March 25, 13
[email protected]
(1/1) = 1.0
(1/2) = 0.5
(2/3) = 0.67
(3/4) = 0.75
(4/5) = 0.80
(5/6) = 0.83
(6/7) = 0.86
(6/8) = 0.75
(7/9) = 0.78
(7/10) = 0.70
[email protected]
(1/20) = 0.05
(1/20) = 0.05
(2/20) = 0.10
(3/20) = 0.15
(4/20) = 0.20
(5/20) = 0.25
(6/20) = 0.30
(6/20) = 0.30
(7/20) = 0.35
(7/20) = 0.35
K=
10
2
5
Ranked Retrieval
precision-recall curves
rank (K)
1
2
3
4
5
ranking
[email protected]
[email protected]
0.10
0.10
0.20
0.30
0.40
1.00
0.50
0.67
0.75
0.80
6
7
8
9
10
11
12
13
0.50
0.60
0.60
0.70
0.70
0.80
0.80
0.80
0.83
0.86
0.75
0.78
0.70
0.73
0.67
0.62
14
15
16
17
18
19
0.90
0.90
0.90
0.90
0.90
0.90
0.64
0.60
0.56
0.53
0.50
0.47
20
1.00
0.50
5
1
Monday, March 25, 13
Ranked Retrieval
precision-recall curves
11
0.9
0.9
Precision
0.8
0.8
0.7
0.7
0.6
Precision 0.4
0.5
0.4
0.6
0.3
0.5
0.2
0.3
0.2
0.1
0.1
0
0
0.10.20.30.40.50.60.70.80.9
0.1 0.20.30.40.50.60.70.80.91
1
Recall
Recall
5
2
Monday, March 25, 13
Ranked Retrieval
precision-recall curves
11
0.9
0.9
Precision
0.8
0.8
0.7
0.7
0.6
0.4
0.4
Precision
Precision
0.5
Wrong! When plotting a
PRcurve, we use the
precision
best for a level of
recall R or
greater!
0.4
0.6
0.6
0.3
0.5
0.5
0.2
0.3
0.3
0.2
0.1
0.2
0.1
0.1
0
0
0
0.10.20.30.40.50.60.70.80.9
0.1 0.20.30.40.50.60.70.80.91
0.1 0.20.30.40.50.60.70.80.91
Recall
Recall
Recall
Monday, March 25, 13
1
5
3
Ranked Retrieval
precision-recall curves
11
0.9
0.9
Precision
0.8
0.8
0.7
0.7
0.6
Precision 0.4
0.5
0.4
0.6
0.3
0.5
0.2
0.3
0.2
0.1
0.1
0
0
0.10.20.30.40.50.60.70.80.9
0.1 0.20.30.40.50.60.70.80.91
1
Recall
Recall
5
4
Monday, March 25, 13
Ranked Retrieval
precision-recall curves
•
•
For a single query, a PR curve looks like a step-function
For multiple queries, we can average these curves
‣
•
Average the precision values for different values of
recall (e.g., from 0.01 to 1.0 in increments of 0.01)
This forms a smoother function
5
5
Monday, March 25, 13
Ranked Retrieval
precision-recall curves
•
PR curves can be averaged across multiple queries
1
0.9
0.8
Precision
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
0.10.20.30.40.50.60.70.80.9
1
Recall
5
6
Monday, March 25, 13
Ranked Retrieval
precision-recall curves
1
0.9
0.8
Precision
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
0.10.20.30.40.50.60.70.80.9
1
Recall
5
7
Monday, March 25, 13
Ranked Retrieval
precision-recall curves
1
which
system is
better?
0.9
0.8
Precision
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
0.10.20.30.40.50.60.70.80.9
1
Recall
5
8
Monday, March 25, 13
Ranked Retrieval
precision-recall curves
1
for what type of task
would the blue
system be better?
0.9
0.8
Precision
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
0.10.20.30.40.50.60.70.80.9
1
Recall
5
9
Monday, March 25, 13
Ranked Retrieval
precision-recall curves
1
for what type of task
would the blue
system be better?
0.9
0.8
Precision
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
0.10.20.30.40.50.60.70.80.9
1
Recall
6
0
Monday, March 25, 13
Ranked Retrieval
discounted-cumulative gain
•
In some retrieval tasks, we really want to focus on
precision at the top of the ranking
•
A classic example is web-search!
‣
users rarely care about recall
users rarely navigate beyond the first page of results
‣
users may not even look at results below the “fold”
‣
•
Are any of the metrics we’ve seen so far appropriate for
web-search?
6
1
Monday, March 25, 13
Ranked Retrieval
discounted-cumulative gain
6
2
Monday, March 25, 13
Ranked Retrieval
discounted-cumulative gain
•
•
•
We could potentially evaluate using [email protected] with several
small values of K
But, this has some limitations
What are they?
6
3
Monday, March 25, 13
Ranked Retrieval
discounted-cumulative gain
•
Which retrieval is better?
6
4
Monday, March 25, 13
Ranked Retrieval
discounted-cumulative gain
•
Evaluation based on [email protected] can be too coarse
K=
K=
1
1
K=
K=
5
5
K=1
0
K=1
0
6
5
Monday, March 25, 13
Ranked Retrieval
discounted-cumulative gain
•
[email protected] (and all the metrics we’ve seen so far) assume
binary relevance
K=
K=
1
1
K=
K=
5
5
K=1
0
K=1
0
6
6
Monday, March 25, 13
Ranked Retrieval
discounted-cumulative gain
•
•
DCG: discounted cumulative gain
Assumptions:
‣
There are more than two levels of relevance (e.g.,
perfect, excellent, good, fair, bad)
‣
A relevant document’s usefulness to a user decreases
rapidly with rank (more rapidly than linearly)
6
7
Monday, March 25, 13
Ranked Retrieval
discounted-cumulative gain
•
Let RELi be the relevance associated with the document
at rank i
‣ perfect ➔ 4
‣
‣
excellent ➔ 3
good ➔ 2
‣
fair ➔ 1
‣
bad ➔ 0
6
8
Monday, March 25, 13
Ranked Retrieval
discounted-cumulative gain
•
DCG: discounted cumulative gain
K
[email protected] =
Â
i=1
RELi
log2 (max(i, 2))
6
9
Monday, March 25, 13
Ranked Retrieval
discounted-cumulative gain
a relevant
document’s
usefulness
1
decreases
log2(i)
exponentially
with
rank
1.2
1
Discount
0.8
0.6
0.4
0.2
0
2
3
4
5
6
7
8
9
10 11 12 13 14 15 16 17 18 19
20
Rank
7
0
Monday, March 25, 13
Ranked Retrieval
discounted-cumulative gain
K
[email protected] =
Â
i=1
rank (i) REL_i
1
2
3
4
5
6
7
8
9
10
Monday, March 25, 13
4
3
4
2
0
0
0
1
1
0
RELi
log2(max(i, 2))
This is given!
the result at rank 1 is
perfect the result at rank 2
is excellent the result at
rank 3 is perfect
...
the result at rank 10 is bad
7
1
Ranked Retrieval
discounted-cumulative gain
K
[email protected] =
Â
i=1
rank (i)
1
2
3
4
5
6
7
8
9
10
Monday, March 25, 13
REL_i
4
3
4
2
0
0
0
1
1
0
RELi
log2(max(i, 2))
discount factor
1.00
1.00
0.63
0.50
0.43
0.39
0.36
0.33
0.32
0.30
Each rank is
associated with a
discount factor
log 2 (max
(
i,
2
))
1
rank 1 is a special
case!
7
2
Ranked Retrieval
discounted-cumulative gain
K
[email protected] =
rank (i)
1
2
3
4
5
6
7
8
9
10
Monday, March 25, 13
REL_i
4
3
4
2
0
0
0
1
1
0
 i=l1
RELi
og2(max(i, 2))
discount factor
1.00
1.00
0.63
0.50
0.43
0.39
0.36
0.33
0.32
0.30
gain
4.00
3.00
2.52
1.00
0.00
0.00
0.00
0.33
0.32
0.00
multiply
RELi by the
discount
factor
associated
with the
rank!
7
3
Ranked Retrieval
discounted-cumulative gain
K
[email protected] =
Â
i=1
rank (i)
1
2
3
4
5
6
7
8
9
10
Monday, March 25, 13
REL_i
4
3
4
2
0
0
0
1
1
0
RELi
log2(max(i, 2))
discount factor
1.00
1.00
0.63
0.50
0.43
0.39
0.36
0.33
0.32
0.30
gain
4.00
3.00
2.52
1.00
0.00
0.00
0.00
0.33
0.32
0.00
DCG_i
4.00
7.00
9.52
10.52
10.52
10.52
10.52
10.86
11.17
11.17
74
Ranked Retrieval
discounted-cumulative gain
DCG10 = 11.17
rank (i)
1
2
3
4
5
6
7
8
9
10
Monday, March 25, 13
REL_i
4
3
4
2
0
0
0
1
1
0
discount factor
1.00
1.00
0.63
0.50
0.43
0.39
0.36
0.33
0.32
0.30
gain
4.00
3.00
2.52
1.00
0.00
0.00
0.00
0.33
0.32
0.00
DCG_i
4.00
7.00
9.52
10.52
10.52
10.52
10.52
10.86
11.17
11.17
75
Ranked Retrieval
discounted-cumulative gain
DCG10 = 11.17
rank (i)
REL_i discount factorgain
1
3
1.00
3.00
2
3
1.00
3.00
3
4
0.63
2.52
4
2
0.50
1.00
5
0
0.43
0.00
6
0
0.39
0.00
7
0
0.36
0.00
8
1
0.33
0.33
9
1
0.32
0.32
10
0
0.30
0.00
changed top result from perfect instead of
excellent
Monday, March 25, 13
DCG_i
3.00
6.00
8.52
9.52
9.52
9.52
9.52
9.86
10.17
10.17
7
6
Ranked Retrieval
discounted-cumulative gain
DCG10 = 11.17
rank (i)
1
2
3
4
5
6
7
8
9
10
Monday, March 25, 13
REL_i discount factorgain
4
1.00
4.00
3
1.00
3.00
4
0.63
2.52
2
0.50
1.00
0
0.43
0.00
0
0.39
0.00
0
0.36
0.00
1
0.33
0.33
1
0.32
0.32
3
0.30
0.90
changed 10th result from bad to
excellent
DCG_i
4.00
7.00
9.52
10.52
10.52
10.52
10.52
10.86
11.17
12.08
7
7
Ranked Retrieval
normalized discounted-cumulative gain
•
•
•
•
•
DCG is not ‘bounded’
In other words, it ranges from zero to ..... Makes it
problematic to average across queries NDCG:
normalized discounted-cumulative gain
“Normalized” is a fancy way of saying, we change it so
that it ranges from 0 to 1
7
8
Monday, March 25, 13
Ranked Retrieval
normalized discounted-cumulative gain
•
NDCGi: normalized discounted-cumulative gain
•
For a given query, measure DCGi
•
Then, divide this DCGi value by the best possible DCGi for
that query
•
Measure DCGi for the best possible ranking for a given
value i
7
9
Monday, March 25, 13
Ranked Retrieval
normalized discounted-cumulative gain
•
•
•
Given: a query has two 4’s, one 3, and the rest are 0’s
Question: What is the best possible ranking for i = 1 All
these are equally good:
‣
4, 4, 3, ....
‣
4, 3, 4, ....
‣
4, 0, 0, ....
‣
... anything with a 4 as the top-ranked result
8
0
Monday, March 25, 13
Ranked Retrieval
normalized discounted-cumulative gain
•
•
•
Given: the query has two 4’s, one 3, and the rest are 0’s
Question: What is the best possible ranking for i = 2 All
these are equally good:
‣
4, 4, 3, ....
‣
4, 4, 0, ....
8
1
Monday, March 25, 13
Ranked Retrieval
normalized discounted-cumulative gain
•
•
•
Given: the query has two 4’s, one 3, and the rest are 0’s
Question: What is the best possible ranking for i = 3 All
these are equally good:
‣
4, 4, 3, ....
8
2
Monday, March 25, 13
Ranked Retrieval
normalized discounted-cumulative gain
•
NDCGi: normalized discounted-cumulative gain
•
For a given query, measure DCGi
•
Then, divide this DCGi value by the best possible DCGi for
that query
•
Measure DCGi for the best possible ranking for a given
value i
8
3
Monday, March 25, 13
Metric Review
•
set-retrieval evaluation: we want to evaluate the set of
documents retrieved by the system, without considering
the ranking
•
ranked-retrieval evaluation: we want to evaluate the
ranking of documents returned by the system
8
4
Monday, March 25, 13
Metric Review
set-retrieval evaluation
•
precision: the proportion of retrieved documents that are
relevant
•
recall: the proportion of relevant documents that are
retrieved
•
f-measure: harmonic-mean of precision and recall
‣
a difficult metric to “cheat” by getting very high
precision and abysmal recall (and vice-versa)
8
5
Monday, March 25, 13
Metric Review
ranked-retrieval evaluation
•
[email protected]: precision under the assumption that the top-K
results is the ‘set’ retrieved
•
[email protected]: recall under the assumption that the top-K results
is the ‘set’ retrieved
•
average-precision: considers precision and recall and
focuses mostly on the top results
•
DCG: ignores recall, considers multiple levels of
relevance, and focuses very must on the top ranks
•
NDCG: trick to make DCG range between 0 and 1
8
6
Monday, March 25, 13
Which Metric Would You Use?
8
7
Monday, March 25, 13
Téléchargement
Explore flashcards