We have described three approaches to adaptive multilingual text filtering and characterized the relative effectiveness of each. Of the three techniques, CL-LSI alone develops a language-independent representation for each document. Both Text Translation and the more practical Vector Translation require that the user (or system designer) choose a preferred language into which the documents or document vectors will be translated. Each technique introduces different sources of error, so it is interesting to compare their relative performance. In Text Translation the principal sources of error are failure to recognize a Spanish word and incorrect resolution of translation ambiguity. In Cross-Language Latent Semantic Indexing the principal source of error is the inability to separate polysemous uses of a term in the training collection, which makes it difficult to conflate terms whose sets of senses differ across languages into a single concept representation. In Vector Translation the largest source of error is the use of inaccurate alignments when constructing the translation matrix.
Another interesting basis for comparison is the source of the information that is used to perform the translation. In CL-LSI this information is extracted automatically from documents that are aligned only at the document (or passage) level. Because the components of the CL-LSI translation matrix have no understandable individual interpretation, human assistance with its construction is precluded. The VT translation matrix, on the other hand, is constructed from bilingual document collections in which individual terms have been aligned. In addition to exploiting more fine-grained information, this approach gives the elements of the VT translation matrix a natural interpretation: each element is the probability that a specific English word will be translated to a specific Spanish word for documents in the domain of interest. Such information can be collected automatically from bilingual document collections, but it can also be constrained and corrected using additional information that is available from dictionaries. Since corpus-based and dictionary-based approaches introduce different types of errors, this joint construction approach could substantially improve filtering effectiveness.
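The joint construction idea can be illustrated with a minimal sketch. It is not the implementation used in our experiments: the function names, the toy word pairs, and the dictionary are hypothetical, the probabilities are assumed to be simple relative frequencies over term-aligned pairs, and the dictionary constraint is assumed to be a hard filter followed by renormalization. The direction in which the matrix is applied (English term weights mapped into Spanish term space) is likewise chosen only for illustration.

```python
from collections import defaultdict


def build_translation_matrix(aligned_pairs):
    """Estimate P(spanish | english) as relative frequencies over
    term-aligned (english, spanish) pairs (an assumed estimator)."""
    counts = defaultdict(lambda: defaultdict(int))
    for en, es in aligned_pairs:
        counts[en][es] += 1
    matrix = {}
    for en, row in counts.items():
        total = sum(row.values())
        matrix[en] = {es: c / total for es, c in row.items()}
    return matrix


def constrain_with_dictionary(matrix, dictionary):
    """Drop translations the (hypothetical) dictionary does not list,
    then renormalize each row so it still sums to one."""
    constrained = {}
    for en, row in matrix.items():
        allowed = dictionary.get(en)
        kept = row if allowed is None else {
            es: p for es, p in row.items() if es in allowed
        }
        total = sum(kept.values())
        # If the dictionary rules out every candidate, keep the corpus estimate.
        constrained[en] = (
            {es: p / total for es, p in kept.items()} if total > 0 else dict(row)
        )
    return constrained


def translate_vector(vector, matrix):
    """Map English term weights into Spanish term space by spreading each
    weight over the word's translations in proportion to their probability."""
    translated = defaultdict(float)
    for en, weight in vector.items():
        for es, p in matrix.get(en, {}).items():
            translated[es] += weight * p
    return dict(translated)


# Toy data: ("bank", "de") stands in for an inaccurate alignment.
pairs = [("bank", "banco"), ("bank", "banco"), ("bank", "orilla"),
         ("bank", "de"), ("rate", "tasa"), ("rate", "tasa")]
dictionary = {"bank": {"banco", "orilla"}}

corpus_matrix = build_translation_matrix(pairs)
joint_matrix = constrain_with_dictionary(corpus_matrix, dictionary)
print(translate_vector({"bank": 2.0, "rate": 1.0}, joint_matrix))
```

In this toy case the dictionary removes the spurious alignment of "bank" with "de", and the probability mass it had received is redistributed over the remaining translations. A softer combination, such as interpolating corpus and dictionary evidence, would be an equally plausible design.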
We have also presented an evaluation methodology that can be used to gain insight into the performance of adaptive multilingual text filtering techniques using existing document collections. Since the domain shift effect we have described is inherent in any corpus-based technique (unless precisely the right language training collections are available), the ability we have demonstrated to characterize the magnitude of that effect will be important whenever dictionary-based and corpus-based techniques are being compared. The topic shift effect, on the other hand, is strictly an artifact of our experiment design, and it can be eliminated by investing in the construction of test collections tailored for evaluating the performance of adaptive multilingual text filtering techniques.
While it would be unreasonable to try to draw broadly applicable conclusions from the three aligned topic pairs that we had available, our results have demonstrated that adaptive multilingual text filtering techniques that perform well enough for some applications are presently available. As additional lessons from the growing body of cross-language text retrieval research are combined with further advances in corpus linguistics and improved test collections, the range of applications to which these techniques can be productively applied can be expected to expand significantly. There is clearly a need for adaptive multilingual text filtering techniques, and our work has led us to conclude that there are a number of promising avenues to be explored that offer the potential to meet that need.