next up previous
Next: Ideal Experiment Design Up: Adaptive Filtering of Previous: Background

Adaptive Multilingual Text Filtering

For consistency we have based all three of our approaches a technique developed by Dumais for adaptive monolingual text filtering in which Latent Semantic Indexing (LSI) is used to develop relatively short feature vectors that describe the relevant training documents, and the mean of the relevant documents' feature vectors is used as the profile [Duma95]. LSI feature vectors describing newly arrived documents are then used to rank order the newly arrived documents in order of decreasing similarity with the profile using the cosine similarity measure.

LSI feature vectors are constructed by counting the frequency with which each term occurs in a document and then using those values as input to a function which reduces the number of features by accounting for similarities in word usage. This function is automatically constructed using a static document collection which is representative of the documents which are expected to arrive. Initially, a matrix is formed in which the frequency with which a single term (word, phrase, etc.) occurs in a document is stored at the row representing the term and the column representing the document. These weights are then adjusted to estimate the importance of each term for retrieval by applying row and column operations that emphasize terms which occur frequently in an individual document but rarely in the representative collection. gif A rank revealing matrix transformation (the singular value decomposition) is then used to construct a low-rank approximation to this matrix in which it has been observed by Deerwester, et al. that the vector components which remain in each column appear to represent the conceptual content of the documents fairly well while suppressing the adverse effects of variations in the author's choice of terms to represent those concepts in any particular document [Deer90]. This is accomplished by retaining (in our application) only the largest 200 singular values. A useful side effect of the singular value decomposition is that matrix formed from 200 retained singular values and their corresponding left singular vectors is a linear function which maps any column of the original matrix to the corresponding low-rank (200 element) representation that we call a LSI feature vector. Additional details of the technique can be found in [Berr96].

With this matrix it is possible to produce LSI feature vectors for previously unseen documents by simply forming a surrogate for the column that would have been constructed for original matrix if the document had been available when the singular value decomposition was computed and then performing a single matrix multiplication. This is important in text filtering applications, because even approximate incremental computation of the singular value decomposition is an expensive process. Dumais has shown that explicit positive feedback about some of the documents in a collection can be used to form a useful profile by simply forming the arithmetic mean of the feature vectors which represent those documents [Duma95]. This ``LSI-mean'' profile can then be used to rank order newly arrived documents so that those which are most similar to the documents which were used to build the profile will appear near the top of the list. This is done by creating LSI feature vectors for the newly arrived documents, computing the cosine of the angle between the profile and each document, and then sorting the cosines (and their associated documents) in decreasing order. Each cosine computation requires about 400 arithmetic operations (in general, about twice the number of retained singular values), and this is the slowest part of the LSI-mean filtering operation that must be performed for each newly arrived document. With the unoptimized code that we used in our experiments a SPARC 20 can compute about 50,000 cosines per hour for vectors of this length.

Extending LSI to CL-LSI is extremely straightforward. The only change which must be made is that the representative documents used to form the original matrix are formed by adjoining a translation of each document to the document itself. This requires that a parallel document collection be used for the initial step, so CL-LSI clearly satisfies our definition of a corpus-based technique. For consistency with our other approaches, we call this the ``language training'' step. The ``profile training'' and ``filtering'' steps that we described above can accept documents written in either a single language or in a combination of the two languages since no distinction is made between terms in the two languages when constructing the matrix of left singular vectors. Language-independent profile learning and text filtering is a natural consequence of combining CL-LSI with the LSI-mean adaptive text filtering technique in this way because LSI feature vectors are an inherently language-independent representation. Figure 1 illustrates adaptive multilingual text filtering using CL-LSI and the LSI-mean technique.

  
Figure 1: Adaptive multilingual text filtering using Cross-Language LSI.

We used a more modular approach to construct our dictionary-based approach, but we were careful to retain as many features from our corpus-based design as possible in order to maximize the comparability of the results. In particular, we used a monolingual version of the LSI-mean adaptive filtering technique, training on only a single-language version of the same representative document collection. The language-independent nature of LSI feature vectors makes the query translation approach that is used in cross-language text retrieval impractical for text filtering applications based on the LSI-mean technique, so we chose instead to translate the documents themselves. In many real-world applications this approach would require that language identification be performed so that appropriate documents could be submitted for translation. This is not a significant obstacle, however, since language identification techniques with better than 95% accuracy are available [Gref95].

In our experiments it was not necessary to perform language identification because each document used only a single language and the language of every document was known in advance. We used a fully-automatic machine translation system provided by the Logos corporation for our experiments.gif This system contains a broad-coverage dictionary that can be augmented with application specific dictionaries, and software tools are provided to assist with their creation. In our experiments we used only the dictionaries delivered with release 7.0 of the Logos machine translation system. The Logos system also makes extensive use of linguistic information when performing translations. While some of this processing is unnecessary for text retrieval applications (e.g., word order choices), the effect of this information on word choice is significant. Logos generates only a single candidate translation regardless of the degree of ambiguity encoded in the dictionary, both syntactic information discovered during parsing and semantic information available in the dictionary (or ``lexicon'') is exploited when making these choices.

  
Figure 2: Adaptive multilingual text filtering using Text Translation.

While such sophisticated processing is quite resource expensive, our experiment design required that we translate a fairly small number of documents. The within-language performance of the LSI-mean adaptive text filtering filtering technique has been well characterized, so we designed our experiments to measure the cross-language performance of our techniques. To do this we chose a monolingual profile training collection in one language and a separate monolingual evaluation collection in another language. A third, bilingual, language training collection was also used. With this design it was only necessary to translate documents in the monolingual profile training collection. Furthermore, since the LSI-mean technique exploits only positive feedback, it was only necessary to translate documents in that collection which were known to be relevant to the information need. For our profile training collection and the available relevance information this amounted to approximately 1000 documents. Because the Logos system we used was configured to translate documents from English into Spanish, we used an English language collection for profile training, a Spanish language collection for evaluation, and a bilingual English/Spanish collection for language training. The same collections and (where appropriate) relevance information were used in all of our experiments experiments.

In practice, the required translation effort would be much more extensive and it would need to be performed under tight time constraints. Thus, we believe that the approach we have chosen, which we call ``Text Translation'' (TT) is best viewed as an upper bound on the performance of practical dictionary-based approaches which exploit linguistic knowledge to select a single translation. It is possible that better integrated approaches such as the partial-translation approach used in the European Multilingual Information Retrieval project (EMIR) [Radw95] may eventually better this performance, but we believe our TT results presently provide a useful benchmark for the performance of a dictionary-based adaptive multilingual text filtering systems. Figure 2 illustrates adaptive multilingual text filtering using CL-LSI and the LSI-mean technique.

We developed our third technique, ``Vector Translation'' (VT), in order to overcome the computational bottleneck that is inherent in the TT approach. Our goal was to construct an effective corpus-based technique that is amenable to the introduction of linguistic knowledge as well. Unlike TT and CL-LSI, the VT technique is not inspired by an existing multilingual text retrieval technique. For that reason, we describe its motivation and operation in some detail.

Like TT, in VT every document is used to produce a Spanish vector. But in VT it is the document vector, rather than the document itself, that is translated. VT is essentially term-by-term translation applied to the vectors which represent documents (the columns in the term-document matrix that was described above). Since each element of a document vector is associated with a single term in a single language, term-by-term translation can be applied to vectors as easily as it can be applied to documents. Document vectors typically encode no term-order information, so deeper analysis is precluded by the representation. Term-by-term translation is quite fast, but as we observed above except in narrow domains where polysemy effects can be suppressed the resulting translations are generally not suitable for display to users.

  
Figure 3: Assignment of English term weights to Spanish terms.

Fortunately, the vector representation has two features which mitigate the adverse effects of this problem. The first is that it is not necessary to select a single translation target for each term, since the vector representation is based on the frequency with which a term occurs. For example, if there are two possible Spanish translations for some English term, it would be possible (although perhaps not wise) to simply divide the weight associated with the English term equally to produce a weight for each of the Spanish terms. The left side of Figure 3 illustrates this for two senses of the English word ``bank,'' one of which (a financial institution) translates to ``banco'' and the other (a river bank) translates to ``orilla.''

The second helpful feature of vector representations is that a kind of ``reverse polysemy'' effect reduces the adverse impact of associating some of the weight with the wrong term in the other language. Consider the case in which two English terms both translate to the same Spanish term. The weights contributed by each English term can simply be added together to find the weight that should be assigned to the Spanish term. Word choice variation is a common stylistic device in many types of documents, typically introduced to avoid the monotony of repeated use of a single term. When different English terms are used for the same concept, it is likely that the intersection of their Spanish translations will be quite small, perhaps even a single word. Thus, term weight will tend to accumulate on the ``consensus translation,'' and that consensus is likely to be correct. The right side of Figure 3 illustrates this consensus translation effect for the English terms ``credit union'' and ``bank.''

Figure 4 shows how this translation matrix is used for text filtering. Separate modules are used to construct vectors for English and Spanish documents because term lists and collection-wide statistics differ for the two languages. Vector translation is performed for vectors based originally on English documents so that all of the vectors passed to the translation module approximate those which would have been constructed if the original documents had been in Spanish. The remainder of the filtering process then proceeds as in the other two techniques. Vector Translation is considerably more efficient than Text Translation, requiring only a single matrix multiplication for each document that is not already in the preferred language. So once the translation matrix has been constructed, Vector Translation can be applied just as quickly as Cross-Language Latent Semantic Indexing

The translation matrix itself could be extracted from a bilingual dictionary if translation probabilities were associated with each word. Such information is lacking in presently available bilingual dictionaries, but in some cases the translations are listed in the order of their relative predominance. Even if this information could be interpreted in a useful way, however, this general order of predominance may not be very informative in specific applications for which only a few of the translations for any particular word would be appropriate. For this reason we have chosen to begin our work on vector translation using a corpus-based approach, and defer for future work the integration of information contained in bilingual dictionaries.

The corpus-based approach we use to build the translation matrix is based on the observed frequency of aligned terms that are extracted from a parallel bilingual document collection. Term alignment in bilingual document collections is a challenging problem which has been extensively studied as one important aspect of what is known as ``corpus linguistics.'' Term alignment typically proceeds in three stages:

  1. Document alignment. Corresponding documents in each language are identified.
  2. Sentence alignment. Sentences and other similar units in each text are identified and corresponding pairs of sentences are associated.
  3. Term alignment. Corresponding terms (which may be words, word stems, morphological roots appearing in a dictionary, and/or multi-word phrases) in aligned sentence pairs are identified.

In our experiments we use a parallel bilingual collection of United Nations documents that has been aligned at the document level by the Linguistic Data Consortium at the University of Pennsylvania using document numbers assigned when each document was originally prepared.gif. David Hull of the Rank Xerox Research Corporation then used a developmental version of morphological analysis software that is being developed at Rank Xerox to preprocess each document used in this experiment, converting each word to its morphological root and identifying commonly appearing phrases.

  
Figure 4: Adaptive multilingual text filtering using Vector Translation.

Shen and Dorr have developed a term alignment technique which Shen has used to produce aligned pairs from these preprocessed documents [Shen96]. They first use dynamic programming to optimize the alignment of sentences based on their length. Once the documents are aligned to at the sentence level, term-level alignment is performed. Their technique is based on the cooccurrence frequency of terms in aligned sentence pairs, with greater weight placed on cooccurrences that appear at similar locations within each sentence. For example, two words would be assigned a greater cooccurrence value if they both occur as the first word in a pair of aligned sentences than if one appears first and the other appears last. Every term pair with a cumulative cooccurrence value that exceeds a specified threshold is considered to be aligned. The number of such cooccurrences is then used to compute an empirical distribution on the Spanish translation of every term appearing in the English versions of the United Nation documents, and those distributions are stored as a translation matrix.gif

It is the use of a threshold on the correlation values which induces some measure of term alignment, and that is what distinguishes this approach from the vector translation technique of Davis and Dunning which was based solely on sentence alignment [Davi95]. A high threshold results in a sparse translation matrix and highly focused vector translations, a lower threshold produces more translation targets for each term and hence a somewhat more diffused translation in which the same amount of term weight is spread across a larger number of terms. One important goal of our experiments was to find a threshold value which produces enough diffusion to exploit the consensus translation effect, but which remains sufficiently focused to avoid introducing an overwhelming number of spurious translations.



next up previous
Next: Ideal Experiment Design Up: Adaptive Filtering of Previous: Background



Douglas W. Oard
Tue May 13 20:29:24 EDT 1997