
Background

We are not aware of any prior work on adaptive multilingual text filtering systems, but the closely related problem of cross-language text retrieval has an extensive research heritage [Oard97a]. In text retrieval the goal is to respond to an unforeseen information need (a ``query'') with information from a relatively static document collection. In the cross-language text retrieval problem the query need not be expressed in the same language as the documents.

The first practical approach to cross-language text retrieval required that the documents be manually indexed using a controlled vocabulary and that the user express the query using terms drawn from that same vocabulary. In such systems a multilingual thesaurus is used to relate the selected terms from each language to a common set of language-independent concept identifiers, and document selection is based on concept identifier matching. In the hands of a skilled user who is familiar with controlled vocabulary search techniques, such systems can be remarkably effective. Of particular note, well-designed controlled vocabulary cross-language text retrieval systems can be just as effective as similar techniques would be in monolingual applications. Controlled vocabulary cross-language text retrieval systems are presently widely used in commercial and government applications for which the number of concepts (and hence the size of the indexing vocabulary) is manageable. Unfortunately, the requirement to manually index the document collection makes controlled vocabulary text retrieval techniques unsuitable for large-volume applications in which the documents are generated from diverse sources that are not easily standardized.
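The concept identifier matching described above can be illustrated with a minimal sketch. The thesaurus entries, concept identifiers, document names, and function names below are all hypothetical and are not drawn from any particular system; the sketch assumes only that each controlled-vocabulary term maps to one language-independent identifier.

```python
# Hypothetical multilingual thesaurus: (language, term) -> concept identifier.
# All entries and identifiers are invented for illustration.
THESAURUS = {
    ("en", "water pollution"): "C001",
    ("de", "Wasserverschmutzung"): "C001",
    ("en", "fisheries"): "C002",
    ("de", "Fischerei"): "C002",
}


def to_concepts(lang, terms):
    """Map controlled-vocabulary terms to language-independent concept ids."""
    return {THESAURUS[(lang, t)] for t in terms if (lang, t) in THESAURUS}


def select(query_concepts, indexed_docs):
    """Return ids of documents whose manual index shares a concept with the query."""
    return sorted(
        doc_id
        for doc_id, concepts in indexed_docs.items()
        if concepts & query_concepts
    )


# German documents, manually indexed with German vocabulary terms.
DOCS = {
    "doc_de_1": to_concepts("de", ["Wasserverschmutzung"]),
    "doc_de_2": to_concepts("de", ["Fischerei"]),
}

# An English query selects a German document through the shared identifier.
matches = select(to_concepts("en", ["water pollution"]), DOCS)
```

Because matching happens entirely on identifiers, the language of the query never enters the selection step, which is consistent with the observation that such systems can match monolingual effectiveness.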

This limitation has motivated the search for approaches which are amenable to less well structured situations. Two types of techniques have been investigated: dictionary-based approaches and corpus-based approaches. Dictionary-based approaches essentially seek to extend the fundamental idea of a multilingual thesaurus by using bilingual dictionaries to translate the query into every language in which a document might be found. Two factors limit the performance of this approach. The first is that many words do not have a unique translation, and sometimes the alternate translations have very different meanings. Monolingual text retrieval systems face similar challenges from polysemy (multiple meanings for a single word), but this translation ambiguity significantly exacerbates the problem. This problem is particularly severe in view of the observed tendency of untrained users to enter such short queries (often a single word) that it would not even be possible for a human to determine the intended meaning (and hence the proper query translation) from the available context.

The second problem with a dictionary-based approach is that the dictionary may lack some terms that are essential for a correct interpretation of the query. This may occur either because the query deals with a technical topic which is outside the scope of the dictionary or because the user has entered some form of abbreviation or slang which is not included in the dictionary. As dictionaries specifically designed for query translation are developed, the effect of this limitation may be reduced. But it is unlikely to be eliminated completely, because language use is a creative activity, with new terms entering the lexicon all the time. There will naturally be a lag between the introduction of a term and its incorporation into a standard reference work such as a dictionary.
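Both limitations of dictionary-based query translation can be seen in a small sketch. The English-to-German dictionary and the function name `translate_query` are hypothetical, and keeping every alternate translation is only one possible policy, assumed here for simplicity.

```python
# Hypothetical English-to-German dictionary; entries are illustrative only.
BILINGUAL_DICT = {
    "bank": ["Bank", "Ufer"],  # financial institution vs. edge of a river
    "river": ["Fluss"],
}


def translate_query(query, dictionary):
    """Translate each query term, keeping every alternate translation.

    Returns (translated_terms, untranslated_terms). Terms missing from the
    dictionary are reported separately rather than silently dropped.
    """
    translated, untranslated = [], []
    for term in query.lower().split():
        candidates = dictionary.get(term)
        if candidates:
            translated.extend(candidates)
        else:
            untranslated.append(term)
    return translated, untranslated


# A single-word query gives no context for choosing between the two senses.
terms, missing = translate_query("bank", BILINGUAL_DICT)

# A term outside the dictionary (here a made-up abbreviation) cannot be translated.
terms2, missing2 = translate_query("river bank rvr", BILINGUAL_DICT)
```

In the first call both senses of "bank" survive into the translated query, which is the translation ambiguity problem; in the second call "rvr" is returned as untranslated, which is the coverage problem.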

Corpus-based approaches seek to overcome these limitations by constructing query translation techniques which are appropriate for the way language is used in a specific application. Because it would be impractical to construct large tailored bilingual dictionaries manually, corpus-based approaches instead analyze large collections of existing text and automatically extract the information needed to construct these application-specific translation techniques. The collections which are analyzed may contain existing translations and the documents that were translated (a ``parallel'' collection), or they may be composed of documents on similar subjects which are written in different languages (a ``comparable'' collection).

Present corpus-based approaches are limited by two factors. The most significant limitation is that a parallel document collection which uses language in a manner similar to that found in the application may not be available in a suitable form. Techniques based on comparable document collections may eventually overcome this limitation, but research on the use of comparable document collections for text retrieval is presently at a very early stage [Picc96]. While a translation technique developed from a parallel document collection can be used for unrelated applications, significant reductions in retrieval effectiveness should be expected.

The other limitation of corpus-based techniques is that even when a suitable document collection is available, the methods presently used to extract the information on which the translation technique will be based introduce errors of their own. Much of the initial research on corpus-based techniques has emphasized statistical analysis and made little use of linguistic theory. This approach has led to remarkable success. In machine translation, for example, statistical approaches have demonstrated performance equal to that achieved by linguistically motivated approaches, and they have done so with considerably less manual effort [Brow93]. But the statistical approach also introduces errors that no human translator would make. The statistical approaches which have been applied are based on word cooccurrence, and sometimes words which are not translations of each other exhibit the same patterns of cooccurrence as words which are translations of each other. There is some evidence that the incorporation of relatively simple linguistic information can significantly improve the performance of corpus-based techniques, and this appears to be a promising direction for future research.
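The cooccurrence error described above can be reproduced on a toy parallel collection. The sketch below scores word pairs with the Dice coefficient over aligned sentence pairs; this measure is chosen here only for illustration and is not claimed to be the method used in the cited work. The three sentence pairs and the function name `dice_scores` are likewise hypothetical.

```python
from collections import Counter
from itertools import product

# Toy parallel collection of aligned (English, German) sentence pairs.
PARALLEL = [
    ("red car", "rotes Auto"),
    ("old house", "altes Haus"),
    ("big house", "grosses Haus"),
]


def dice_scores(pairs):
    """Score every cooccurring (source, target) word pair with the Dice coefficient."""
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        s_words, t_words = set(src.split()), set(tgt.split())
        src_count.update(s_words)   # sentence pairs containing each source word
        tgt_count.update(t_words)   # sentence pairs containing each target word
        joint.update(product(s_words, t_words))  # pairs containing both words
    return {
        (s, t): 2 * n / (src_count[s] + tgt_count[t])
        for (s, t), n in joint.items()
    }


scores = dice_scores(PARALLEL)
```

Here "house" and "Haus" receive the maximum score, as intended, but so do both ("red", "rotes") and ("red", "Auto"): in this collection "red" never appears without "car", so cooccurrence statistics alone cannot tell the translation from the non-translation.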

Dictionary-based approaches and corpus-based approaches can both be applied to large unstructured document collections. One technique, Cross-Language Latent Semantic Indexing (CL-LSI), has even demonstrated cross-language text retrieval effectiveness that is on a par with the within-language performance of that same technique [Duma96]. This result is significant because an adaptive text filtering system based on Latent Semantic Indexing achieved a selection effectiveness nearly equal to that of the best participating systems at the third Text Retrieval Conference (TREC-3) [Duma95]. But the reported retrieval effectiveness results for CL-LSI were achieved with an experiment design that matched the retrieval application to the characteristics of the parallel document collection that was used to develop the translation technique.

No corpus-based system that we know of has yet demonstrated cross-language text retrieval effectiveness on a par with the within-language effectiveness of the same underlying retrieval techniques in the absence of a perfectly matched parallel document collection. We have thus chosen CL-LSI as the foundation for a corpus-based adaptive multilingual text filtering approach and sought to measure the effect of introducing a mismatch between the parallel document collection and the way language is used in the document stream which must be filtered.

In order to provide a basis for comparison, we have also experimentally determined the performance of a dictionary-based adaptive multilingual text filtering system. Our results show that in this application CL-LSI was able to achieve filtering effectiveness measures that are competitive with those achieved by a dictionary-based system for the same application. Since the types of errors made by a corpus-based system may differ significantly from those made by dictionary-based systems, we have developed a new corpus-based technique which improves on CL-LSI, achieving similar performance when used by itself, but designed with an internal representation that is better suited to integration with existing bilingual dictionaries. We believe that this is a first step towards combining corpus-based and dictionary-based approaches to produce an adaptive multilingual text filtering system that can achieve performance closer to that achieved by monolingual systems than would be possible using either technique alone.






Douglas W. Oard
Tue May 13 20:29:24 EDT 1997