TREC-2002 Cross Language Information Retrieval (CLIR) Track Guidelines
|
|---|
|
|
The National Institute of Standards and
Technology (NIST) will conduct an evaluation of Cross-Language
Information Retrieval (CLIR) technology in conjunction with the Text
Retrieval Conference (TREC-2002). The focus this year will be
retrieval of Arabic language newswire documents from topics in
English. Participation is open to all TREC participants (information
on joining TREC is available here).

Corpus: 383,872 Arabic documents (896 MB), AFP newswire, in Unicode (encoded as UTF-8), with SGML markup. The corpus is available now from the Linguistic Data Consortium (LDC), Catalog Number LDC2001T55, using one of three arrangements:

Fifty topics are being developed in English by the Linguistic Data Consortium, in the same format as typical TREC topics (title, description, and narrative). Arabic translations of the topics will also be available for use in monolingual runs. We do not plan to provide French topics this year.

Result submission: Results will be submitted to NIST for pooling, relevance assessment, and scoring in the standard TREC format (top 1000 documents in rank order for each query). Participants may submit up to 5 runs, and may score additional runs locally using the relevance judgments that will be provided after relevance assessment is completed. It may not be possible to include all submitted runs in the document pools that serve as a basis for relevance assessment, so participants submitting more than one run should specify the order of preference for scoring that would result in the most diverse possible pools.

Categories of runs: Participants will submit results for runs in one or more of the following categories. The principal focus of CLIR track discussions at TREC-2002 will be on results in the Automatic CLIR and Manual CLIR categories, but submission of results in the Monolingual category is also welcome, since such runs both enrich the relevance assessment pools and provide the opportunity for comparison to CLIR approaches.

Automatic CLIR: Automatic CLIR systems formulate queries from the English topic content (Title, Description, Narrative fields) with no human intervention, and produce ranked lists of documents completely automatically based on those queries. In general, any portion of the topic description may be used by automatic systems, but to facilitate cross-system comparison under similar conditions, participants who submit any automatic CLIR run are required to submit at least one automatic CLIR run in which only terms from the title and description fields are used.
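
The title-and-description condition and the result submission format described above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the track's tooling: the sample topic, its identifier, the document identifiers, the run tag, and all function names are hypothetical. The six-column line layout (topic, the literal Q0, document identifier, rank, score, run tag) is the one conventionally used for TREC run files and is assumed here to be what "standard TREC format" refers to. No translation or retrieval step is shown.

```python
import re

# Hypothetical topic in the classic TREC layout. The identifier and text are
# invented for illustration and are not taken from the track's topic set.
SAMPLE_TOPIC = """<top>
<num> Number: X01
<title> Example topic title
<desc> Description:
Find documents about the example subject.
<narr> Narrative:
A relevant document will contain details of the example subject.
</top>"""

# Each field runs from its tag to the next field tag or the closing </top>.
FIELD_RE = re.compile(
    r"<(num|title|desc|narr)>(.*?)(?=<(?:num|title|desc|narr)>|</top>)", re.S
)
# Leading labels such as "Description:" are layout, not query content.
LABEL_RE = re.compile(r"^\s*(Number|Description|Narrative):\s*")


def parse_topic(text):
    """Return a dict mapping field name to whitespace-normalized text."""
    fields = {}
    for name, body in FIELD_RE.findall(text):
        body = LABEL_RE.sub("", body)
        fields[name] = " ".join(body.split())
    return fields


def title_desc_query(fields):
    """Build a query from the title and description only (no narrative)."""
    return f"{fields['title']} {fields['desc']}"


def run_lines(topic_id, scored_docs, run_tag, limit=1000):
    """Format (docno, score) pairs as run-file lines, best score first.

    Assumes the conventional layout: topic Q0 docno rank score tag.
    At most `limit` documents are kept per topic.
    """
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)[:limit]
    return [
        f"{topic_id} Q0 {docno} {rank} {score:.4f} {run_tag}"
        for rank, (docno, score) in enumerate(ranked, start=1)
    ]


if __name__ == "__main__":
    topic = parse_topic(SAMPLE_TOPIC)
    print(title_desc_query(topic))
    # Scores would come from a retrieval system; these are placeholders.
    for line in run_lines(topic["num"], [("DOC_B", 0.5), ("DOC_A", 0.9)], "demo"):
        print(line)
```

Keeping the field selection in one small function makes it easy to produce the required title-and-description run alongside any run that also uses the narrative.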

Participants are encouraged (but not required) to use a set of standard resources (specified on the track Web page) to the extent practical given their system design, as a way of further enhancing comparability.

Manual CLIR: Manual CLIR runs are any runs in which a user who has no practical knowledge of Arabic intervenes in any way in the process of query formulation and/or production of the ranked list for one or more topics. The intervention might be as simple as manual removal of stop structure ("a relevant document will contain...") or as complex as manual query reformulation after examining translations of documents retrieved using an initial query. A "practical knowledge of Arabic" is defined for this purpose as the ability to understand the gist of an Arabic news story or to carry on a simple conversation in Arabic. Knowledge of a few Arabic words, or an understanding of Arabic linguistic characteristics such as morphology or grammar, does not constitute a "practical knowledge of Arabic" for this purpose.

Monolingual Arabic: Monolingual runs are any runs in which use is made of the Arabic version of the topic description, or in which a user who has a practical knowledge of Arabic intervenes in the process of query formulation and/or production of the ranked list. Monolingual runs can be either automatic (no human intervention in the process of query development and no changing of system structure or parameters after examining the topics) or manual (any other human intervention) and should be tagged as such upon submission.

Resources: Links to resources for track participants can be found on the track Web site at http://www.glue.umd.edu/~dlrg/clir/trec2002. Participants are invited to submit additional resources that they wish to have linked from that page (by email to oard@glue.umd.edu).

Communications: All communications between participants are conducted by email. The track mailing list (xlingual@nist.gov) is open to anyone with an interest in the track, regardless of whether they plan to participate in 2002. To join the list, send email to listproc@nist.gov with a single line in the body (not the subject) beginning "subscribe xlingual".

Track Meeting: Track results will be discussed at three sessions during the TREC-2002 meeting in Gaithersburg, MD:

Now: Documents available from the LDC
ASAP: Join the xlingual@nist.gov mailing list
Jun. 10: English and Arabic Topics available from NIST
Aug. 1: Results due to NIST
Oct. 1: Relevance judgments available from NIST
Oct. 1: Scored results returned to participants
Nov. 19-22: TREC-2002 Meeting, Gaithersburg, MD

Track Coordinators:
Fred Gey (gey@ucdata.berkeley.edu)
Doug Oard (oard@glue.umd.edu) |
|
Last updated: Wednesday, 22-Apr-02 21:44:36 Date created: Wednesday, 22-April-02 trec@nist.gov |