TREC 2002 Arabic/English Cross-Language IR track
Overview
The National Institute of Standards and Technology (NIST) will conduct
an evaluation of Cross-Language Information Retrieval (CLIR)
technology in conjunction with the Text Retrieval Conference
(TREC-2002). As in 2001, the focus for 2002 will be retrieval of
Arabic-language newswire documents using topics written in English.
Additional information is available in the track guidelines.
Resources Designed for Track Participants
- Track Guidelines
- The track guidelines contain information about obtaining the test
collection, running the experiments, and submitting the results.
- The Document Collection
- A description of the documents, which are available from the
Linguistic Data Consortium (LDC). Information about obtaining the
collection is also available. Participants in TREC-2002
may also be able to obtain an evaluation-only license -- see
the track guidelines for details.
- Standard Resources
- A set of linguistic resources that participants are asked to
use (to the extent that they are compatible with their system design)
for at least one run in order to facilitate cross-site
comparisons. Use of these resources is encouraged, but not
required.
- TREC-2001 Topic Descriptions
- These topic descriptions from the TREC-2001 CLIR track can be
used by participants for system development and by anyone for
post-hoc experiments with the collection. This is a link to
the public TREC Web site. Topic descriptions for the TREC-2002
track are distributed to participants on a password-protected site.
- TREC-2001 Relevance Judgments
- The relevance judgments that match the TREC-2001
topic descriptions, from NIST.
- LCDR TREC Web Site
- This site includes an annotation guide and a table that
together describe how the relevance judgments were created.
Other Useful Resources
- Arabic/English Parallel Web Pages
- An automatically assembled corpus of 2,190 translation-equivalent
Web page pairs from the Internet Archive.
- Character set conversion
- A Perl script provided by the LDC that converts the Arabic
topic descriptions from ISO 8859-6 to the UTF-8 encoding used
by the document collection, so that the topics can be used in
monolingual Arabic retrieval experiments. This script is
freely redistributable.
- TDT-2002
- The Topic Detection and Tracking evaluation will also include
Arabic in 2002. Arabic documents will be made available for
both the TDT-3 collection (for system training) and the TDT-4
collection (for the TDT-2002 evaluation). These collections
are time-constrained subsets of the TREC Arabic collection.
Two features make them interesting. First, machine-produced English
translations are available for the Arabic documents. Second,
relevance judgments are available for topics in the TDT-3 collection
using an event-oriented definition of relevance. An email from
Dave Graff explains how to obtain the TDT collections. Dave has
agreed to provide the collections to TREC CLIR track participants
under the same conditions as TDT participants.
- Other Arabic IR and Computational Linguistics Resources
- A collection of links to resources that participants
in the track might find useful.
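For participants who want to see what the character set conversion
listed above involves, the following is a minimal Python sketch of the
same ISO 8859-6 to UTF-8 step. It is not the LDC Perl script and makes
no claim about how that script works; the function names
iso8859_6_to_utf8 and convert_file are hypothetical, and the sketch
assumes the input is plain ISO 8859-6 text.

```python
def iso8859_6_to_utf8(data: bytes) -> bytes:
    """Decode ISO 8859-6 bytes and re-encode the text as UTF-8.

    Hypothetical helper for illustration; not the LDC script.
    """
    return data.decode("iso8859_6").encode("utf-8")


def convert_file(src_path: str, dst_path: str) -> None:
    """Convert a whole file from ISO 8859-6 to UTF-8 (assumed file layout)."""
    with open(src_path, "rb") as src:
        raw = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(iso8859_6_to_utf8(raw))


# Six ISO 8859-6 bytes: alef, lam, seen, lam, alef, meem.
sample = bytes([0xC7, 0xE4, 0xD3, 0xE4, 0xC7, 0xE5])
converted = iso8859_6_to_utf8(sample)
```

Each Arabic letter occupies one byte in ISO 8859-6 and two bytes in
UTF-8, so the converted topics are larger than the originals even
though the text is unchanged.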
Fred Gey
Doug Oard
Last modified: Fri Jul 19 19:23:12 2002