Machine translation

From Free net encyclopedia

Machine translation, sometimes referred to by the acronym MT, is a sub-field of computational linguistics that investigates the use of computer software to translate text or speech from one natural language to another.

At its basic level, MT performs simple substitution of atomic words in one natural language for words in another. Using corpus techniques, more complex translations can be performed, allowing for better handling of differences in linguistic typology, phrase recognition, and translation of idioms, as well as the isolation of anomalies. However, current systems are unable to produce output of the same quality as a human translator, particularly where the text to be translated uses casual language.
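The word-for-word substitution described above can be sketched in a few lines of Python. The tiny English-to-Spanish lexicon and the function name `substitute_words` are assumptions made purely for illustration; they do not come from any real MT system.

```python
# Hypothetical toy lexicon (English -> Spanish); real systems use far larger dictionaries.
LEXICON = {
    "the": "el",
    "cat": "gato",
    "drinks": "bebe",
    "milk": "leche",
}

def substitute_words(sentence, lexicon):
    """Replace each word by its dictionary entry; unknown words pass through unchanged."""
    return " ".join(lexicon.get(word, word) for word in sentence.lower().split())

print(substitute_words("The cat drinks milk", LEXICON))  # el gato bebe leche
```

The sketch also shows why pure substitution falls short: "the milk" would come out as "el leche", although Spanish requires "la leche", because nothing in the lookup models gender agreement, word order or idiom.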

Modern machine translation software, such as that produced by SYSTRAN or IBM, allows for customization by domain or profession (such as weather reports) — improving output by limiting the scope of allowable substitutions. This technique is particularly effective in domains where formal or formulaic language is used. Improved output quality can also be achieved by human intervention: for example, some systems are able to translate more accurately if the user has unambiguously identified which words in the text are names. With the assistance of these techniques, MT has proven useful as a tool to assist human translators, and in some cases can even produce output that can be used "as is".
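How user-identified names might improve output can be illustrated with a small Python sketch. It assumes a toy English-to-French lexicon and a hypothetical function `translate_marked` that receives the positions of the words the user has marked as names; real systems expose this kind of markup in their own ways.

```python
# Hypothetical toy lexicon (English -> French); "bill" is ambiguous with the name "Bill".
LEXICON = {"bill": "facture", "paid": "a payé", "the": "la"}

def translate_marked(tokens, name_positions, lexicon):
    """Translate word by word, copying user-marked names through verbatim."""
    out = []
    for i, tok in enumerate(tokens):
        if i in name_positions:
            out.append(tok)  # marked as a name by the user: do not translate
        else:
            out.append(lexicon.get(tok.lower(), tok))
    return " ".join(out)

tokens = "Bill paid the bill".split()
print(translate_marked(tokens, set(), LEXICON))  # facture a payé la facture
print(translate_marked(tokens, {0}, LEXICON))    # Bill a payé la facture
```

Without the marking, the name is wrongly translated as a common noun; with it, the ambiguity disappears before translation starts.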

In the words of the European Association for Machine Translation (EAMT):

Machine translation (MT) is the application of computers to the task of translating texts from one natural language to another. One of the very earliest pursuits in computer science, MT has proved to be an elusive goal, but today a number of systems are available which produce output which, if not perfect, is of sufficient quality to be useful in a number of specific domains. (1997)

1 Introduction
2 Approaches
3 History
4 Users
5 Evaluation
6 See also
7 Notes
8 References
9 External links


Introduction

The translation process can be stated simply as:

Decoding the meaning of the source text, and
Re-encoding this meaning in the target language.

Behind this simple procedure lies a complex cognitive operation. For example, to decode the meaning of the source text in its entirety, the translator must interpret and analyse all the features of the text, a process which requires in-depth knowledge of the grammar, semantics, syntax, idioms and the like of the source language, as well as of the culture of its speakers. The translator needs the same in-depth knowledge to re-encode the meaning in the target language.

Therein lies the challenge in machine translation: how to program a computer to "understand" a text as a human being does and also to "create" a new text in the target language that "sounds" as if it has been written by a human.

This problem can be tackled in a number of ways.


Approaches

Machine translation can use a method based on linguistic rules, which means that words are translated in a linguistic way: the most suitable (orally speaking) words of the target language replace the ones in the source language.

It is often argued that the success of machine translation requires the problem of natural language understanding to be solved first.

A number of heuristic methods are also used for machine translation, including:

Rule-based methods:
- Lexical lookup methods
- Grammar based methods
- Semantics based methods — Knowledge-based machine translation
Statistical methods — Statistical machine translation
Example-based methods — Example-based machine translation
Dictionary-entry based methods
Linguistic rules based methods

Generally, rule-based methods parse a text, usually creating an intermediary, symbolic representation, from which the text in the target language is generated. According to the nature of the intermediary representation, an approach is described as interlingual machine translation or transfer-based machine translation. These methods require extensive lexicons with morphological, syntactic, and semantic information, and large sets of rules.
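The analysis, intermediary representation and generation stages described above can be sketched as a toy transfer-based pipeline in Python. Everything here is an assumption made for illustration: the lexicons, the single reordering rule (adjective before noun in English, after it in French) and the function names are hypothetical, and a real system would need the extensive lexicons and rule sets mentioned above.

```python
# Hypothetical lexicons for a toy English -> French transfer system.
POS = {"the": "DET", "red": "ADJ", "yellow": "ADJ", "car": "NOUN", "book": "NOUN"}
BILEX = {"red": "rouge", "yellow": "jaune", "car": "voiture", "book": "livre"}
GENDER = {"voiture": "f", "livre": "m"}

def analyse(sentence):
    """Analysis: tag each source word, giving a simple symbolic representation."""
    return [(w, POS.get(w, "UNK")) for w in sentence.lower().split()]

def transfer(tagged):
    """Transfer: map lemmas, then apply one structural rule (ADJ NOUN -> NOUN ADJ)."""
    out = [(BILEX.get(w, w), pos) for w, pos in tagged]
    i = 0
    while i < len(out) - 1:
        if out[i][1] == "ADJ" and out[i + 1][1] == "NOUN":
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2
        else:
            i += 1
    return out

def generate(structure):
    """Generation: pick the article form that agrees with the first noun's gender."""
    gender = next((GENDER.get(w, "m") for w, pos in structure if pos == "NOUN"), "m")
    article = "la" if gender == "f" else "le"
    return " ".join(article if pos == "DET" else w for w, pos in structure)

def translate(sentence):
    return generate(transfer(analyse(sentence)))

print(translate("the red car"))      # la voiture rouge
print(translate("the yellow book"))  # le livre jaune
```

Unlike plain word substitution, the intermediate list of (word, part-of-speech) pairs gives the rules something to operate on, which is what makes reordering and agreement possible.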

Statistical and example-based methods eschew manual lexicon building and rule-writing and instead try to generate translations based on bilingual text corpora, such as the Canadian Hansard corpus, the English-French record of the Canadian parliament. Where such corpora are available, impressive results can be achieved translating texts of a similar kind, but such corpora are still very rare.
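As an illustration of how translation knowledge can be learned from a bilingual corpus rather than written by hand, the following Python sketch estimates word translation probabilities with expectation-maximisation in the style of IBM Model 1, one well-known statistical model chosen here as an example. The three-sentence corpus is invented, the NULL word of the full model is omitted for brevity, and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy parallel corpus (English, French); real corpora are vastly larger.
CORPUS = [
    ("the house", "la maison"),
    ("the flower", "la fleur"),
    ("a flower", "une fleur"),
]

def train_ibm1(pairs, iterations=20):
    """Estimate t[(f, e)] = P(f | e) by EM, simplified (no NULL word)."""
    pairs = [(e.split(), f.split()) for e, f in pairs]
    f_vocab = {f for _, fs in pairs for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))  # uniform initialisation
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: distribute each French word over the English words in its sentence.
        for es, fs in pairs:
            for f in fs:
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: renormalise the expected counts into probabilities.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return dict(t)

def best_translation(t, e_word):
    """Most probable French word for an English word seen in training."""
    candidates = {f: p for (f, e), p in t.items() if e == e_word}
    return max(candidates, key=candidates.get)

table = train_ibm1(CORPUS)
for word in ["the", "house", "flower", "a"]:
    print(word, "->", best_translation(table, word))
```

No dictionary is supplied: the pairing of "house" with "maison" emerges only because "la" is better explained by "the", which occurs with it in two sentences. This is also why such methods depend so heavily on having enough parallel text.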

Given enough data, machine translation programs often work well enough for a native speaker of one language to get the approximate meaning of what is written by a native speaker of the other. The difficulty is getting enough data of the right kind to support the particular method. For example, the large multilingual corpus of data needed for statistical methods to work is not necessary for the grammar-based methods. But then, the grammar-based methods need a skilled linguist to carefully design the grammar that they use.

To translate between closely related languages, a technique referred to as shallow-transfer machine translation may be used.


History

The first attempts at machine translation were conducted after World War II. It was assumed at this time that the newly invented computers would have no trouble in translating texts. The reasoning was that computers were able to do complex mathematics quickly, something that humans did with more difficulty, whereas even young children were able to learn to understand human language; therefore, computers should be able to do the same. This belief was soon shown to be incorrect.

On 7 January 1954, the Georgetown-IBM experiment, the first public demonstration of an MT system, was held in New York at the head office of IBM. The demonstration was widely reported in the newspapers and received much public interest. The system itself, however, was no more than what today would be called a "toy" system, having just 250 words and translating just 49 carefully selected Russian sentences into English, mainly in the field of chemistry. Nevertheless it encouraged the view that MT was imminent, and in particular stimulated the financing of MT research, not just in the US but worldwide.

The first serious MT systems were used during the Cold War to parse texts in Russian scientific journals. The rough translations produced were sufficient to understand the "gist" of the articles. If an article discussed a subject deemed to be of security interest, it was sent to a human translator for a complete translation; if not, it was discarded. Government support was, however, cut back in 1966 after the report of ALPAC, a committee established to review the investments, which concluded that machine translation, despite the expense, was not likely to reach the quality of a human translator.

Although the ALPAC report had a tremendous impact on research in machine translation, there were notable exceptions; SYSTRAN, for example, managed to attract commercial and defence/security customers and survived the decrease of direct governmental funding. Systems with a limited field of use have also been successful in a number of specialized applications. For instance, the METEO System has been used in Canada since 1977 to translate weather forecasts from English to French and now translates close to 80,000 words a day, or about 30 million words a year.

The advent of low-cost and more powerful computers towards the end of the 20th century brought MT to the masses, as did the availability of MT sites on the Internet. Such systems are of particular interest to countries in East Asia wishing to export to the North American and European markets.

Much of the effort previously spent on MT research, however, has shifted to the development of computer-assisted translation (CAT) systems, such as translation memories, which are seen to be more successful and profitable. Although the two concepts are similar, machine translation (MT) should not be confused with computer-assisted translation (CAT) (also known as machine-assisted translation (MAT)).

In machine translation, the translator supports the machine, that is to say that the computer or program translates the text, which is then edited by the translator, whereas in computer-assisted translation, the computer program supports the translator, who translates the text himself, making all the essential decisions involved.
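The supporting role of a translation memory can be sketched as a fuzzy lookup over previously translated segments. The example below is a minimal Python illustration using the standard library's `difflib.SequenceMatcher`; the stored segments, the 0.7 threshold and the function name `suggest` are assumptions, and commercial CAT tools use their own matching schemes.

```python
from difflib import SequenceMatcher

# Hypothetical translation memory: previously translated (English, French) segments.
MEMORY = [
    ("Press the red button.", "Appuyez sur le bouton rouge."),
    ("Close the door.", "Fermez la porte."),
]

def suggest(segment, memory, threshold=0.7):
    """Return (score, stored source, stored target) for the closest match, or None."""
    def similarity(pair):
        return SequenceMatcher(None, segment, pair[0]).ratio()
    best = max(memory, key=similarity)
    score = similarity(best)
    return (score, best[0], best[1]) if score >= threshold else None

# A near match is offered to the translator, who still decides what to keep.
print(suggest("Press the green button.", MEMORY))
# An unrelated segment yields no suggestion.
print(suggest("Good morning.", MEMORY))
```

The program only proposes an earlier translation; adapting "rouge" to the new sentence remains the translator's decision, which is the division of labour that distinguishes CAT from MT.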


Users

Despite their inherent limitations, MT programs are currently used by various organisations around the world. Probably the largest institutional user is the European Commission, which uses a highly customised version of the commercial MT system SYSTRAN to handle the automatic translation of a large volume of preliminary drafts of documents for internal use.

A Danish translation agency, Lingtech A/S, has been translating patent applications from English to Danish since 1993 using a proprietary rule-based machine translation system, PaTrans, working together with the translation memory based Trados commercial CAT tool. The system requires manual pre- and post-editing, but the monthly output is still approx. 400,000 words per operator.

In Spain, the daily newspaper Periódico de Catalunya is translated into Catalan or Spanish by an MT system.

Google has reported that promising results were obtained using a proprietary statistical machine translation engine. However, it seems that the machine translation system it is currently using is still based on the SYSTRAN engine.

It has been reported that in April 2003 Microsoft began using a hybrid MT system for the translation of a database of technical support documents from English to Spanish. The system was developed internally by Microsoft's Natural Language Research group. The group is currently testing an English-Japanese system as well as bringing English-French and English-German systems online. The latter two systems use a learned language generation component, whereas the first two have manually developed generation components. The systems were developed and trained using translation memory databases with over a million sentences each.

With the recent focus on terrorism, military sources in the US are investing significant amounts of money in natural language engineering. In-Q-Tel (a venture capital fund, largely funded by the US Intelligence Community, to stimulate new technologies through private sector entrepreneurs) has supported companies like Language Weaver. The Information Processing Technology Office in DARPA hosts programs like TIDES and Babylon. The US Air Force has awarded a $1 million contract to develop a language translation technology.

Currently the military community is interested in translation and processing of languages like Arabic, Pashto, and Dari.


Evaluation

There are various methods for evaluating the performance of machine translation systems. The oldest is to use human judges to assess the quality of a translation. Newer, automated methods include BLEU, NIST and METEOR.
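To give a feel for how an automated metric works, the following Python sketch computes a simplified BLEU-style score for one candidate against one reference: clipped n-gram precisions combined by a geometric mean and multiplied by a brevity penalty. It is an illustration under stated assumptions, not the reference implementation: it uses unigrams and bigrams only (BLEU is normally computed with n-grams up to length 4 over a whole test corpus), a single reference, and whitespace tokenisation.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=2):
    """Simplified sentence-level BLEU-style score in the range 0..1."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: punish candidates shorter than the reference.
    brevity = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(sum(log_precisions) / max_n)

reference = "the cat sat on the mat"
print(round(bleu("the cat sat on the mat", reference), 3))  # 1.0
print(round(bleu("the cat is on the mat", reference), 3))   # 0.707
```

In the second call, five of six unigrams and three of five bigrams match, so the score is the square root of 5/6 times 3/5, about 0.707. The clipping step is what stops a candidate such as "the the the" from scoring well merely by repeating a common word.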

Currently, the product of machine translation is sometimes called a "gisting translation": for a reader who is not proficient in both languages, MT will often produce only a rough translation that will at best allow the reader to "get the gist" of the source text, but is unlikely to convey a complete understanding of it. Even so, the user may find the raw translation sufficiently useful as it is.
