The explosive growth of the Internet and other sources of networked information has made automatic mediation of access to networked information sources an increasingly important problem. Much of this information is expressed as electronic text, and it is becoming practical to automatically convert some printed documents and recorded speech into electronic text as well. Thus, automated systems capable of detecting useful documents are finding widespread application.
One important type of automated text detection system is what we call a text filtering system. As described by Belkin and Croft, information filtering systems seek to sift through large volumes of newly generated information, passing on to the user only those items which might be useful [Belk92]. This is essentially the same concept that Luhn earlier called ``Selective Dissemination of Information'' (SDI) [Luhn58], but the term ``information filtering'' is now more commonly used when the information in question is arriving over a computer network. The vast majority of information filtering research has focused on filtering electronic text, but interesting work has been done with music, home videos, and other media as well [Oard97b].
Many of the existing text filtering systems require that the user provide an explicit ``profile'' which specifies their information needs. What we call ``adaptive'' text filtering systems seek to minimize or eliminate this burden by learning the profile automatically. In many research systems, users are allowed to provide ratings for documents that they have examined. We have adopted this approach for our experiments because it allows for straightforward implementation and it suits our evaluation methodology well. In the future, adaptive text filtering systems will likely also exploit the sort of ``over the shoulder'' observations of user behavior investigated by Morita and Shinoda [Mori94]. Regardless of the approach chosen, present adaptive text filtering systems are most effective when used to satisfy relatively stable and specific information needs because a substantial quantity of consistent training data can be accumulated over time.
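To make the idea of learning a profile from user ratings concrete, the following is a minimal sketch, not the technique used in our experiments. It assumes binary ratings (+1 for useful, -1 for not useful), whitespace tokenization, and a simple additive term-weight profile; the names AdaptiveProfile, rate, and score are hypothetical and chosen only for illustration.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (an illustrative simplification)."""
    return text.lower().split()


class AdaptiveProfile:
    """Hypothetical term-weight profile learned from rated documents."""

    def __init__(self):
        # Maps each term to its accumulated evidence of usefulness.
        self.weights = Counter()

    def rate(self, text, rating):
        """Update the profile with one rating: +1 (useful) or -1 (not useful)."""
        for term, count in Counter(tokenize(text)).items():
            self.weights[term] += rating * count

    def score(self, text):
        """Return the average profile weight of the terms in a new document."""
        terms = tokenize(text)
        if not terms:
            return 0.0
        return sum(self.weights[t] for t in terms) / len(terms)


profile = AdaptiveProfile()
profile.rate("machine translation of french text", +1)
profile.rate("football match results", -1)

# Documents sharing terms with positively rated material score above zero.
print(profile.score("french translation systems") > 0)
```

Because each rating only nudges the weights, a profile of this kind becomes reliable only after many consistent ratings, which is why stable and specific information needs suit adaptive filtering best.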
By ``multilingual'' text filtering systems, we mean systems which can select useful documents from document streams that may contain several languages (English, French, Chinese, ...). This formulation allows for the possibility that individual documents contain more than one language, a common occurrence in many applications. Multilingual text filtering systems can be useful even if the user is able to read only a single language. When sufficient resources are available to translate selected documents, for example, filtering before translation can be significantly more economical than translating before filtering, because only the selected documents need to be translated. But even when translation is not available, there are circumstances in which multilingual text filtering could be useful to a monolingual user. A researcher, for example, might find a research paper published in an unfamiliar language useful if that paper contains references to works by the same author that are in the researcher's native language. But the most significant applications of multilingual text filtering will undoubtedly be those which involve multilingual users.
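The economy of filtering before translation can be illustrated with a small worked calculation. This is only a sketch under stated assumptions: a fixed translation cost per document, a negligible cost for filtering itself, and stream sizes that are invented purely for illustration. The function name translation_cost is hypothetical.

```python
def translation_cost(num_docs, num_selected, cost_per_doc, filter_first):
    """Total translation cost for a document stream.

    Assumes a flat cost per translated document and ignores the cost of
    filtering itself (both are simplifying assumptions).
    """
    docs_translated = num_selected if filter_first else num_docs
    return docs_translated * cost_per_doc


# Hypothetical stream: 10,000 arriving documents, of which 200 are selected.
filter_then_translate = translation_cost(10_000, 200, 1, filter_first=True)
translate_then_filter = translation_cost(10_000, 200, 1, filter_first=False)

print(filter_then_translate)  # 200
print(translate_then_filter)  # 10000
```

Under these assumptions the saving is proportional to the fraction of the stream that the filter rejects, so the advantage grows as the information need becomes more specific.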
Text filtering systems for which users specify the profiles manually can easily accommodate multilingual filtering if the character set used for text representation is appropriate for the desired languages. All that is needed is either manual or semi-automatic facilities to translate the user-provided profiles into each language in which documents might be detected. This is the approach used by Paracel's Fast Data Finder system.
Development of an effective adaptive multilingual text filtering system is considerably more challenging, however. The most straightforward approach, providing a separate adaptive monolingual text filtering system for each language, would only provide acceptable performance in languages for which an adequate quantity of training data could be observed or provided by the user. The techniques we have investigated, by contrast, are all capable of using a profile that was learned from material in any language (or from documents that contain several languages) to select documents in any language.
In summary, applications for which adaptive multilingual text filtering systems are appropriate can be characterized by the following four features: