We are aware of no large bilingual collection for which
relevance judgments are available, and large collections are needed
for evaluation of adaptive text filtering systems. Furthermore,
construction of topics and relevance judgments would have been well
beyond our resources, so creation of such a test collection using the
Linguistic Data Consortium's United Nations collection would have been
impractical. Fortunately, over the course of five years the Text
Retrieval Conferences have developed large monolingual collections
with associated topics and relevance judgments that can be used to
augment the United Nations collection and create a useful (although
somewhat awkward) test collection with the required
characteristics.
Because none of the three partitions must be both bilingual and scored, it is possible to use three separate collections to approximate the results that would be achieved using an ideal test collection. The collections we have chosen are shown in Table 1. TREC generates approximately 50 new topics each year, and the evaluation process produces a set of known relevant documents from parts of the TREC collection which vary from year to year. There are presently 300 TREC topics. One part of the TREC collection (a set of Wall Street Journal articles from what is known as TIPSTER disk 2) has been used each year, so there are relevance judgments for that ``Wall Street Journal collection'' for every available topic. These Wall Street Journal articles are mostly from 1990 and 1991, with a small number from 1992. We chose to restrict our language training collection to a similar portion of the United Nations collection in order to maximize the chances that terms associated with temporally specific events would be common to the two collections.
Table 1: Evaluation using existing collections.
The most recent three TREC evaluations have included a ``multilingual'' evaluation in which monolingual text retrieval in languages other than English is evaluated. There are presently 50 topics available for a collection of 1992 Spanish language articles from the Mexican newspaper El Norte. We chose this collection because it was in Spanish (a language for which we had machine translation available) and because the time period overlapped with the temporal span of the other two collections we are using. Both the Wall Street Journal and the El Norte collections were preprocessed by Rank Xerox in the same manner as the United Nations collection.
The 50 Spanish (El Norte) topics were not a subset of the 300 English
(Wall Street Journal) topics, but we were able to identify sufficient
overlap between the formal TREC topic descriptions in several cases.
We performed topic alignment manually, examining each of the 50
Spanish topics and then scanning the list of 300 English topics in
order to identify possible matches. The detailed topic descriptions
were then compared, and a set of topic pairs that appeared to be
closely aligned was selected. Table 2 shows the four
Spanish topics for which we have found closely corresponding English
topics.
Although the topic
descriptions in each pair have some differences, there is sufficient
apparent overlap to suggest that a minimal adjustment to the sets of
relevant documents would result in comparable sets of documents in the
two languages. In fact, our experimental results confirm that it is
possible to use the relevance judgments without any adjustment when
the goal is to compare different cross-language mapping techniques.
Table 2: Closely related English and Spanish TREC topics.
Two difficulties can arise when three existing collections are used in place of an ideal test collection. The first is that the subjects addressed by the UN, the Wall Street Journal and El Norte would be expected to differ significantly. We refer to this problem as a ``domain shift'' between the collections, since it is caused by differences in the domain of discourse of the three collections. A potentially even more serious problem is that the Wall Street Journal and El Norte articles were judged against topics that are similar but not identical. We call this second problem ``topic shift.''
The domain shift between the UN documents and the El Norte articles is fairly easy to evaluate by running the Text Translation experiment a second time. In the second run we simply use the El Norte documents for language training instead of the Spanish UN documents. Since the rank-reducing mapping constructed using the left singular vectors will then be better suited to the way language is used in the El Norte articles, the difference in filtering effectiveness will reveal the effect of the domain shift between the UN collection and the El Norte collection. We have not developed any similar technique to reveal the effect of the domain shift between either of those collections and the Wall Street Journal collection, however.
We can also estimate the severity of the topic shift effect, but the procedure is considerably more complex. Given the nature of the available test collections, the topic shift is an unavoidable consequence of introducing a second language. So the key to evaluating the impact of the topic shift is to compare the cross-language and within-language filtering effectiveness of the same adaptive text filtering technique. Again we base our approach on a modification to the standard Text Translation experiment. We first partition the El Norte collection into a training collection and an evaluation collection and then perform a monolingual evaluation. That removes the effect of the topic shift completely, but it also removes the effect of errors introduced by the machine translation step. So the second thing we must do is to measure the effect of these translation errors in isolation. We use a modification of the standard CL-LSI experiment to do this. Recall that with CL-LSI, LSI feature vectors can be produced from either English or Spanish documents. If the Wall Street Journal articles are translated into Spanish before being used for CL-LSI profile training, the observed drop in filtering effectiveness would be entirely attributable to errors introduced by the machine translation step. Since these are exactly the same errors that affect the Text Translation experiment, this result will indicate how much of the difference between Text Translation and the monolingual run is attributable to translation errors.
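The bookkeeping behind this comparison can be sketched as follows. This is only an illustration. It assumes that the translation-error and topic-shift effects combine additively in whatever effectiveness measure is used, which the procedure above implies but does not establish. The function name, the argument names and the scores in the usage lines are hypothetical placeholders, not results.

```python
def attribute_topic_shift(monolingual, text_translation,
                          cl_lsi_english, cl_lsi_translated):
    """Split the cross-language loss into translation and topic-shift parts.

    All arguments are scores on a single effectiveness measure for which
    higher is better. Additivity of the two effects is an assumption.
    """
    # Total loss of the cross-language run relative to the monolingual run.
    total_loss = monolingual - text_translation
    # Loss caused by machine translation alone, measured with the two
    # CL-LSI runs (English training versus translated training).
    translation_loss = cl_lsi_english - cl_lsi_translated
    # Whatever remains is attributed to the topic shift.
    topic_shift_loss = total_loss - translation_loss
    return {
        "total": total_loss,
        "translation": translation_loss,
        "topic_shift": topic_shift_loss,
    }


# Hypothetical scores, chosen only to exercise the function.
parts = attribute_topic_shift(monolingual=0.50, text_translation=0.25,
                              cl_lsi_english=0.375, cl_lsi_translated=0.25)
```

Under the additivity assumption, the `topic_shift` entry is the portion of the gap between the monolingual run and the Text Translation run that translation errors do not account for.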
Performing the complete topic shift experiment as we have described would require that both runs be evaluated using only half of the El Norte collection. We have not yet configured our software to accommodate this, so the results we report below were collected using the entire El Norte collection for both training and evaluation when attempting to measure the topic shift. Those results overstate the effect of the topic shift because they evaluate memory rather than prediction accuracy (in this one case we are performing a filtering effectiveness evaluation on the training set). But the results do provide an upper bound on the magnitude of the topic shift, and that upper bound proved to be adequate to recognize one case in which an extreme topic shift made an apparently well-aligned topic pair unusable. In every other case we have been careful to provide separate training and evaluation sets, so the limitations of our topic shift experiments do not extend to the other aspects of our evaluation.
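The partition that the complete topic shift experiment would require can be sketched as follows. The sketch assumes that documents are identified by string IDs and uses a seeded shuffle so that both runs see the same two halves. The name `split_collection`, the seed and the placeholder IDs are illustrative and are not part of our software.

```python
import random


def split_collection(doc_ids, seed=0):
    """Partition document IDs into disjoint training and evaluation halves."""
    ids = sorted(doc_ids)             # fixed order so the split is reproducible
    random.Random(seed).shuffle(ids)  # seeded shuffle, independent of global state
    half = len(ids) // 2
    return ids[:half], ids[half:]


docs = [f"EN-{i:04d}" for i in range(10)]  # placeholder document IDs
train, evaluation = split_collection(docs)
```

Because the two halves are disjoint, effectiveness measured on `evaluation` reflects prediction rather than recall of documents already seen during training.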