
Future Work

We have good reason to believe that the results reported here can be substantially improved upon by applying some of the lessons that are emerging from current research in cross-language text retrieval. Several teams working on that problem have reported dramatic improvements in performance when phrases are also used for building document vectors, presumably because translation ambiguity is limited when working with phrases rather than just isolated words (cf. [Hull96, Radw95]). In some initial experiments with the SP22/008 topic pair we have increased the precision at 0.1 recall for Vector Translation from 0.12 to 0.20 by using phrases and by constraining the possible alignments using part-of-speech information.
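
For readers unfamiliar with the measure quoted above, the following is a minimal sketch of how precision at a fixed recall level can be computed from a ranked list. It assumes the common interpolated definition (the best precision at any rank where recall has reached the target); the text does not state which variant was used, and the function name and the example ranking are hypothetical.

```python
def precision_at_recall(ranked_relevance, total_relevant, recall_level):
    """Interpolated precision at a recall level (an assumed definition).

    ranked_relevance: list of booleans, one per ranked document, best first.
    total_relevant:   number of relevant documents in the whole collection.
    recall_level:     target recall, e.g. 0.1.
    """
    best = 0.0
    found = 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            found += 1
        # Only ranks at which recall has reached the target contribute.
        if total_relevant and found / total_relevant >= recall_level:
            best = max(best, found / rank)
    return best


# Hypothetical ranking: relevant documents at ranks 2 and 5, 10 relevant overall.
example_ranking = [False, True, False, False, True]
example_precision = precision_at_recall(example_ranking, 10, 0.1)
```

In this made-up ranking, recall first reaches 0.1 at rank 2, where one of the two documents seen so far is relevant.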

We are particularly interested in further investigating the performance of Vector Translation because it offers the potential to integrate the corpus-based and dictionary-based approaches. We have observed that many of the candidate alignments produced by statistical techniques make no semantic sense, so we believe that it would be productive to use the set of known translations in an existing bilingual dictionary (and perhaps any information that is available about the relative predominance of those translations) as a stronger constraint on the alignment process. We call this idea ``seeding'' the distribution with the dictionary since the probability mass is constrained to accumulate only on the seeds that we have provided. While seeding the distribution would be expected to drive the translation matrix from one tailored to a domain towards one suitable for more general application, the improvement in alignment accuracy (and hence in the effectiveness of the VT technique) could be significant.
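
The effect of seeding can be illustrated with a small sketch. The version below is a post-hoc approximation under stated assumptions, not the procedure proposed above: it takes an already estimated translation matrix, discards probability mass on any pair that the dictionary does not list, and renormalizes each row, whereas the idea described in the text constrains the alignment process itself. The data layout (nested dictionaries), the function name, and the example terms are all hypothetical.

```python
def seed_translation_matrix(matrix, dictionary):
    """Restrict each row of a translation matrix to dictionary seeds.

    matrix:     dict mapping source term -> {target term: probability}.
    dictionary: dict mapping source term -> set of known translations.
    Returns a new matrix; rows with no dictionary entry are copied unchanged.
    """
    seeded = {}
    for source, row in matrix.items():
        seeds = dictionary.get(source)
        if not seeds:
            # No dictionary entry: keep the statistical estimate as it is.
            seeded[source] = dict(row)
            continue
        kept = {t: p for t, p in row.items() if t in seeds}
        total = sum(kept.values())
        if total == 0.0:
            # The corpus gave no mass to any seed: spread it uniformly.
            seeded[source] = {t: 1.0 / len(seeds) for t in seeds}
        else:
            # Renormalize so that mass accumulates only on the seeds.
            seeded[source] = {t: p / total for t, p in kept.items()}
    return seeded


# Hypothetical row in which the statistical alignment puts mass on "the".
example_matrix = {"banco": {"bank": 0.5, "bench": 0.2, "the": 0.3}}
example_dictionary = {"banco": {"bank", "bench"}}
example_seeded = seed_translation_matrix(example_matrix, example_dictionary)
```

In the example row the spurious candidate "the" is dropped and its mass is redistributed over the two dictionary translations in proportion to their original probabilities.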

Another way of adding linguistic knowledge to the translation matrix would be to adjust the matrix by hand after it has been constructed. If performance analysis were to reveal unusually poor performance for the cross-language component of a Vector Translation system on a particular set of topics, the translation probabilities for terms associated with those topics could be examined by a domain expert who is fluent in both languages. If the values in the matrix appear to be counterintuitive, it would be possible to adjust them manually. Such a process is not likely to prove economically feasible for many applications, however, unless automated tools are developed to identify potentially poor translation probabilities and either suggest improvements or apply those improvements without human intervention.

Other techniques for term alignment have been proposed as well, so an experiment in which different term alignment techniques were applied might yield some interesting insights. Brown et al. at IBM have collected term translation statistics using considerably more sophisticated techniques that directly handle word-to-phrase translation and take advantage of information encoded in word order [Brow93]. Their technique produces a translation matrix that is conditioned on sequences of English words rather than on a single word. Such a distribution is easily converted to one conditioned on the final term in the sequence by summing across the possible prefixes of that term, although it is not clear whether the result would be any more accurate than Shen and Dorr's simpler technique.
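
The conversion by summing across prefixes can be sketched as follows. This is an illustration under assumptions rather than a description of the technique in [Brow93]: it assumes that each sequence is split into a prefix and a final term, and that each prefix is weighted by its probability given the final term so that the collapsed rows still sum to one. The function name, the data layout, and the example values are hypothetical.

```python
from collections import defaultdict


def collapse_to_final_term(sequence_probs, prefix_weights):
    """Marginalize a sequence-conditioned distribution over prefixes.

    sequence_probs: dict mapping (prefix, final_term) ->
                    {target: P(target | prefix, final_term)}.
    prefix_weights: dict mapping (prefix, final_term) ->
                    P(prefix | final_term), assumed to be available.
    Returns a dict mapping final_term -> {target: P(target | final_term)}.
    """
    collapsed = defaultdict(lambda: defaultdict(float))
    for (prefix, final_term), row in sequence_probs.items():
        weight = prefix_weights[(prefix, final_term)]
        for target, probability in row.items():
            # Sum across prefixes, weighting each by how likely it is.
            collapsed[final_term][target] += weight * probability
    return {term: dict(row) for term, row in collapsed.items()}


# Hypothetical two-word English sequences ending in "bank".
example_sequences = {
    (("central",), "bank"): {"banco": 0.9, "orilla": 0.1},
    (("river",), "bank"): {"banco": 0.2, "orilla": 0.8},
}
example_weights = {
    (("central",), "bank"): 0.75,
    (("river",), "bank"): 0.25,
}
example_collapsed = collapse_to_final_term(example_sequences, example_weights)
```

The example also shows what is lost in the conversion: the two prefixes favor different translations, and the collapsed row for "bank" can only report their weighted average.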

Another issue that we would like to investigate is automatic construction of comparable document collections. The documents arriving in a multilingual document stream have all of the characteristics of a comparable document collection except for the known alignment of documents on similar topics. If such alignments could be discovered automatically and then used to construct a translation matrix, periodic updates to the translation matrix could be performed using a collection with extremely similar characteristics to the documents which are likely to arrive in the near future. Sheridan and Ballerini have demonstrated a technique for accomplishing this when structured topic labels are associated with news articles arriving on a newswire [Sher96]. If dictionary-based techniques can be used to do this when such tags are lacking (and if corpus-based techniques which exploit comparable corpora prove practical), the resulting ease of updates could dramatically improve the performance and practicality of adaptive multilingual text filtering.

The other critical need is for better test collections. Although we have been able to estimate (or at least bound) the effects of topic and domain shifts, it would clearly be better if what we have described as an ideal test collection were available, with relevance judgments for documents in several languages with respect to an identical set of topics. There will be a new cross-language text retrieval ``pre-track'' in the 1997 TREC-6 evaluation, and in future years that track may produce such a collection. TREC provides an excellent venue for developing such a collection because documents which are good candidates for relevance assessment (those with a significant likelihood of being relevant) can be identified by a wide variety of systems that apply a broad array of techniques in both monolingual and multilingual settings. This can significantly reduce the costs associated with relevance assessment while minimizing the likelihood that systematic errors will result in a large set of relevant documents being missed.






Douglas W. Oard
Tue May 13 20:29:24 EDT 1997