TREC-2002 Cross Language Information Retrieval (CLIR) Track Guidelines


The National Institute of Standards and Technology (NIST) will conduct an evaluation of Cross-Language Information Retrieval (CLIR) technology in conjunction with the Text Retrieval Conference (TREC-2002). The focus this year will be retrieval of Arabic language newswire documents from topics in English. Participation is open to all TREC participants (information on joining TREC is available here).

Corpus:
383,872 Arabic documents (896 MB), AFP newswire, in Unicode (encoded as UTF-8), with SGML markup. The corpus is available now from the Linguistic Data Consortium (LDC) as Catalog Number LDC2001T55 under one of three arrangements:

  1. Organizations with membership in the Linguistic Data Consortium (for 2001) may order the corpus at no additional charge. If your research group is not a member, the LDC can check and tell you if another part of your organization already has a membership for this year. If so (and if you are geographically colocated), it may be possible for that group to order the corpus without additional charge through their membership. Membership in the Linguistic Data Consortium costs $2,000 per year for nonprofit organizations (profit-making organizations that are not currently members will likely prefer the next option) and provides rights to research use (that do not expire) for all materials released by the LDC during that year.

  2. Non-members may purchase rights to use the corpus for research purposes for $800. These rights do not expire, and are described in more detail here.

  3. The Linguistic Data Consortium can negotiate an evaluation-only license at no cost for research groups that are unable to pay the $800 fee. An evaluation-only license permits use of the data only for the duration of the TREC-2002 CLIR evaluation. Please contact ldc@ldc.upenn.edu if you need further information on evaluation-only licenses.

Topics:
Fifty topics are being developed in English by the Linguistic Data Consortium, in the same format as typical TREC topics (title, description, and narrative). Arabic translations of the topics will also be available for use in monolingual runs. We do not plan to provide French topics this year.

Result submission:
Results will be submitted to NIST for pooling, relevance assessment, and scoring in the standard TREC format (top 1000 documents in rank order for each query). Participants may submit up to 5 runs, and may score additional runs locally using the relevance judgments that will be provided after relevance assessment is completed. It may not be possible to include all submitted runs in the document pools that serve as a basis for relevance assessment, so participants submitting more than one run should specify the order of preference for scoring that would result in the most diverse possible pools.
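
As an illustration of the submission format, the following minimal sketch writes ranked results in the six-column layout commonly used for TREC run files (topic number, the literal Q0, document ID, rank, score, run tag) and truncates each topic to 1000 documents. The function name, topic number, document IDs, scores, and run tag are all hypothetical, and starting ranks at 1 is an assumption; check the NIST submission instructions for the authoritative details.

```python
from typing import Dict, List, Tuple

MAX_DOCS_PER_TOPIC = 1000  # the guidelines ask for the top 1000 documents per query


def format_run(
    scores: Dict[str, List[Tuple[str, float]]], run_tag: str
) -> List[str]:
    """Return run-file lines: topic, the literal Q0, docno, rank, score, run tag."""
    lines = []
    for topic_id in sorted(scores):
        # Sort by descending score; break ties by document ID for determinism.
        ranked = sorted(scores[topic_id], key=lambda pair: (-pair[1], pair[0]))
        for rank, (docno, score) in enumerate(ranked[:MAX_DOCS_PER_TOPIC], start=1):
            lines.append(f"{topic_id} Q0 {docno} {rank} {score:.4f} {run_tag}")
    return lines


# Hypothetical topic number, document IDs, and scores, for illustration only.
example = {"26": [("DOC_B", 1.5), ("DOC_A", 2.25)]}
run_lines = format_run(example, "myrun1")
print("\n".join(run_lines))
```

Each of the (up to 5) submitted runs would carry its own run tag so that the runs can be told apart during pooling and scoring.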

Categories of runs:
Participants will submit results for runs in one or more of the following categories. The principal focus of CLIR track discussions at TREC-2002 will be on results in the Automatic CLIR and Manual CLIR categories, but submission of results in the Monolingual category is also welcome, since such runs both enrich the relevance assessment pools and provide the opportunity for comparison to CLIR approaches.

Automatic CLIR:
Automatic CLIR systems formulate queries from the English topic content (Title, Description, Narrative fields) with no human intervention, and produce ranked lists of documents completely automatically based on those queries. In general, any portion of the topic description may be used by automatic systems, but participants that submit any automatic CLIR run are required to submit at least one automatic CLIR run in which only terms from the title and description fields are used to facilitate cross-system comparison under similar conditions. Participants are encouraged (but not required) to use a set of standard resources (specified on the track Web page) to the extent practical given their system design as a way of further enhancing comparability.
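
The required title-and-description condition can be sketched as follows. This is a minimal example that assumes the topics follow the usual TREC layout with <num>, <title>, <desc>, and <narr> tags; the sample topic text and the function names are hypothetical, and the resulting English query would still have to be translated or otherwise mapped to Arabic by the CLIR system itself.

```python
import re

# Hypothetical topic text in the usual TREC layout; not an actual track topic.
SAMPLE_TOPIC = """<top>
<num> Number: 1
<title> Sample title words
<desc> Description:
Sample description sentence.
<narr> Narrative:
Sample narrative sentence.
</top>"""

# A field runs from its tag up to the next field tag or the closing </top>.
FIELD_PATTERN = re.compile(
    r"<(num|title|desc|narr)>(.*?)(?=<(?:num|title|desc|narr)>|</top>)",
    re.DOTALL,
)
# Leading labels such as "Number:" or "Description:" are not query terms.
LABEL_PATTERN = re.compile(r"^\s*(?:Number|Topic|Description|Narrative):\s*")


def parse_topic(text):
    """Split one topic into a dict keyed by field tag, with whitespace normalized."""
    fields = {}
    for tag, body in FIELD_PATTERN.findall(text):
        body = LABEL_PATTERN.sub("", body)
        fields[tag] = " ".join(body.split())
    return fields


def title_desc_query(fields):
    """Build the query text from the title and description fields only."""
    return f"{fields['title']} {fields['desc']}".strip()


topic = parse_topic(SAMPLE_TOPIC)
query = title_desc_query(topic)
print(topic["num"], "->", query)
```

Leaving the narrative out of `title_desc_query` is what keeps the run within the title-and-description condition; a run that also uses the narrative would be built from all three fields.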

Manual CLIR:
Manual CLIR runs are any runs in which a user who has no practical knowledge of Arabic intervenes in any way in the process of query formulation and/or production of the ranked list for one or more topics. The intervention might be as simple as manual removal of stop structure ("a relevant document will contain...") or as complex as manual query reformulation after examining translations of retrieved documents using an initial query. A "practical knowledge of Arabic" is defined for this purpose as the ability to understand the gist of an Arabic news story or to carry on a simple conversation in Arabic. Knowledge of a few Arabic words or an understanding of Arabic linguistic characteristics such as morphology or grammar does not constitute a "practical knowledge of Arabic" for this purpose.

Monolingual Arabic:
Monolingual runs are any runs in which use is made of the Arabic version of the topic description or in which a user who has a practical knowledge of Arabic intervenes in the process of query formulation and/or production of the ranked list. Monolingual runs can be either automatic (no human intervention in the process of query development and no changing of system structure or parameters after examining the topics) or manual (any other human intervention) and should be appropriately tagged as such upon submission.
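
The category definitions above can be summarized as a small decision function. This is only a sketch of one reading of the rules, with hypothetical function and parameter names; here "human intervention" is taken to include changing system structure or parameters after examining the topics.

```python
def classify_run(uses_arabic_topic, human_intervened, intervener_knows_arabic=False):
    """Return (category, mode) for a run under the category definitions above."""
    # Using the Arabic topics, or intervention by someone with a practical
    # knowledge of Arabic, makes the run monolingual.
    if uses_arabic_topic or (human_intervened and intervener_knows_arabic):
        category = "Monolingual Arabic"
    # Any other human intervention makes it a Manual CLIR run.
    elif human_intervened:
        category = "Manual CLIR"
    # English topics with no human intervention at all.
    else:
        category = "Automatic CLIR"
    mode = "manual" if human_intervened else "automatic"
    return category, mode


print(classify_run(uses_arabic_topic=False, human_intervened=False))
print(classify_run(uses_arabic_topic=False, human_intervened=True))
print(classify_run(uses_arabic_topic=True, human_intervened=False))
print(classify_run(False, True, intervener_knows_arabic=True))
```

The second element of the result corresponds to the automatic/manual tag that monolingual runs should carry on submission.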

Resources:
Links to resources for track participants can be found on the track Web site at http://www.glue.umd.edu/~dlrg/clir/trec2002. Participants are invited to submit additional resources that they wish to have linked from that page (by email to oard@glue.umd.edu).

Communications:
All communication among participants is conducted by email. The track mailing list (xlingual@nist.gov) is open to anyone with an interest in the track, regardless of whether they plan to participate in 2002. To join the list, send email to listproc@nist.gov with the single line in the body (not the subject) "subscribe xlingual " (note: please send this to listproc, not to xlingual!). The track coordinators can help out if you have trouble subscribing.

Track Meeting:
Track results will be discussed at three sessions during the TREC-2002 meeting in Gaithersburg, MD:

  • Plenary session: (time TBA) Presentation of a track summary by the organizers and a few presentations by track participants that are selected for their potential interest to all conference attendees. Participants who are interested in presenting at the plenary session should respond to the call for presentation proposals from the TREC program committee when it is released.

  • Poster Session: (time TBA) An opportunity for all track participants to present their work in poster form. A "boaster session" will provide an opportunity to introduce the subject of your poster to the conference attendees. Track participants may present a poster even if they present a plenary session talk.

  • Track Planning Session: (time TBA, near the end of the conference) This will provide an opportunity to informally discuss what has been learned and to discuss plans for future evaluations. There will not be a CLIR track at TREC-2003, but there will likely be opportunities for continued work with Arabic in other venues. There may also be opportunities to integrate cross-language techniques in other TREC tracks at some point in the future.

Schedule:
Now: Documents available from the LDC
ASAP: Join the xlingual@nist.gov mailing list
Jun. 10: English and Arabic Topics available from NIST
Aug. 1: Results due to NIST
Oct. 1: Relevance judgments available from NIST
Oct. 1: Scored results returned to participants
Nov. 19-22: TREC-2002 Meeting, Gaithersburg, MD

Track Coordinators:
Fred Gey (gey@ucdata.berkeley.edu)
Doug Oard (oard@glue.umd.edu)

Last updated: Wednesday, 22-Apr-02 21:44:36
Date created: Wednesday, 22-April-02
trec@nist.gov