TREC 2002 Arabic/English Cross-Language IR track

Overview

The National Institute of Standards and Technology (NIST) will conduct an evaluation of Cross-Language Information Retrieval (CLIR) technology in conjunction with the Text Retrieval Conference (TREC-2002). As it was in 2001, the focus for 2002 will be retrieval of Arabic language newswire documents from topics in English. Additional information is available in the track guidelines.

Resources Designed for Track Participants

Track Guidelines
The track guidelines contain information about obtaining the test collection, running the experiments, and submitting the results.
The Document Collection
A description of the documents, which are available from the Linguistic Data Consortium (LDC). Information about obtaining the collection is also available. Participants in TREC-2002 may also be able to obtain an evaluation-only license -- see the track guidelines for details.
Standard Resources
A set of linguistic resources that participants are asked to use (to the extent that they are compatible with their system design) for at least one run in order to facilitate cross-site comparisons. Use of these resources is encouraged, but not required.
TREC-2001 Topic Descriptions
These topic descriptions from the TREC-2001 CLIR track can be used by participants for system development and by anyone for post-hoc experiments with the collection. This is a link to the public TREC Web site. Topic descriptions for the TREC-2002 track are distributed to participants on a password-protected site.
TREC-2001 Relevance Judgments
The relevance judgments that match the TREC-2001 topic descriptions, from NIST.
LCDR TREC Web Site
This site includes an annotation guide and a table that together describe how the relevance judgments were created.

Other Useful Resources

Arabic/English Parallel Web Pages
An automatically assembled corpus of 2,190 translation-equivalent Web page pairs from the Internet Archive.
Character set conversion
A Perl script provided by the LDC for converting the Arabic topic descriptions from ISO 8859-6 encoding to UTF-8, the encoding used by the document collection. The conversion is needed for monolingual Arabic retrieval experiments. This script is freely redistributable.
TDT-2002
The Topic Detection and Tracking evaluation will also include Arabic in 2002. Arabic documents will be made available for both the TDT-3 collection (for system training) and the TDT-4 collection (for the TDT-2002 evaluation). These collections are time-constrained subsets of the TREC Arabic collection. Two features make them interesting. First, machine-produced English translations are available for the Arabic documents. Second, relevance judgments are available for topics in the TDT-3 collection using an event-oriented definition for relevance. An email from Dave Graff explains how to obtain the TDT collections. Dave has agreed to provide the collection to TREC CLIR track participants as well under the same conditions as TDT participants.
Other Arabic IR and Computational Linguistics Resources
A collection of links to resources that participants in the track might find useful.
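
For readers without the LDC Perl script at hand, the character set conversion described above (ISO 8859-6 topic descriptions to the UTF-8 encoding of the document collection) can be sketched in a few lines of Python. This is only an illustrative sketch, not the LDC script: the function names and file paths are hypothetical, and it assumes the input really is plain ISO 8859-6 with no mixed encodings.

```python
def iso8859_6_to_utf8(raw: bytes) -> bytes:
    """Decode ISO 8859-6 bytes and re-encode them as UTF-8.

    Decoding is strict, so a byte that has no ISO 8859-6 mapping raises
    UnicodeDecodeError instead of being silently replaced.
    """
    return raw.decode("iso8859_6").encode("utf-8")


def convert_file(src_path: str, dst_path: str) -> None:
    """Convert one topic file; both paths are hypothetical placeholders."""
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(iso8859_6_to_utf8(data))
```

Calling `convert_file("topics.iso", "topics.utf8")` (hypothetical file names) would write a UTF-8 copy of the topics. ASCII characters, such as any English markup around the topic text, pass through unchanged because ISO 8859-6 and UTF-8 agree on that range.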

Fred Gey
Doug Oard
Last modified: Fri Jul 19 19:23:12 2002