Over the past year, it has been difficult not to come across content created by generative AI. You may have seen a LinkedIn post where OpenAI’s ChatGPT has been asked for tips on how to get a salary raise. Perhaps you have encountered a chatbot when booking an appointment online. You may even have laughed at images created by DALL·E, where the faces of familiar celebrities have been digitally added to classic works of art.
Generative AI is developing at a tremendous pace, and language technology is benefiting from these leaps as well. From the perspective of language technology, the most important role is played by large language models (LLMs).
Large language models need massive amounts of data
Generative AI solutions created for producing and processing language are called large language models. They are statistical models that calculate the probabilities of the occurrence of words or parts of words. The simplest language models estimate this probability simply by counting how many times a word appears in the given text material, without taking the surrounding words and their meanings into account.
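As a minimal sketch of this counting approach (purely illustrative, not any particular product’s implementation), such a simple word-frequency model fits in a few lines of Python:

```python
from collections import Counter

def unigram_probabilities(text: str) -> dict[str, float]:
    """Estimate word probabilities purely from occurrence counts."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}

probs = unigram_probabilities("the cat sat on the mat")
print(probs["the"])  # 2 occurrences out of 6 words -> ~0.33
```

Note that this model assigns “the” the same probability everywhere, regardless of context, which is exactly the limitation the paragraph above describes.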
Large language models, on the other hand, are advanced language models that are based on neural networks and, as their name suggests, are trained with large amounts of data. The language models behind earlier neural machine translation engines often had hundreds of millions of parameters; large language models have billions of them.
Current neural network-based language models, including large language models, are most often based on transformers. Transformers are neural network architectures that consist of many stacked layers. Earlier machine learning models processed the individual words in their training data one at a time, whereas transformers can process all the words in a sentence, or even a longer text, simultaneously and analyse, among other things, broader relationships within the sentence.
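For readers who want a peek under the bonnet, the mechanism that lets a transformer relate every word to every other word is called self-attention. The following is a heavily simplified sketch of its core computation; real transformer layers add learned projections, multiple attention heads and much more:

```python
import numpy as np

def self_attention(q, k, v):
    """Scaled dot-product attention: every word attends to every other word."""
    scores = q @ k.T / np.sqrt(k.shape[-1])        # pairwise similarity between words
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ v                             # weighted mix of all word vectors

# Four words, each represented by an 8-dimensional vector (toy values).
x = np.random.rand(4, 8)
print(self_attention(x, x, x).shape)  # (4, 8): each word's vector now reflects the others
```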
Large language models use huge text materials to learn which words appear in which contexts and which words often follow each other. This allows large language models to predict which words and phrases would appear in a suitable answer to a certain type of question, how the beginning of a poem could continue, or how a certain matter would be expressed in another language in the context in question.
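To make “which words often follow each other” concrete, here is a toy predictor that only counts word pairs. Real large language models learn far richer patterns with neural networks, but the basic idea of predicting a likely next word is the same:

```python
from collections import Counter, defaultdict

def train_bigrams(corpus: str) -> dict:
    """Count which word follows which in the training text."""
    followers = defaultdict(Counter)
    words = corpus.lower().split()
    for current, nxt in zip(words, words[1:]):
        followers[current][nxt] += 1
    return followers

def predict_next(followers: dict, word: str) -> str:
    """Return the most frequent follower of the given word."""
    return followers[word].most_common(1)[0][0]

model = train_bigrams("the cat sat on the mat and the cat slept")
print(predict_next(model, "the"))  # 'cat' follows 'the' most often in this corpus
```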
A one-size-fits-all solution is not suitable for all cases
Since the amount of data needed for training large language models is enormous, a lot of generic content will inevitably have to be used. Large language models can, however, be customised, or personalised, to make them more suitable for different tasks and needs. This can be done by adding training data from a more limited subject area and adjusting the model’s weights. It is also important to choose a suitable customisation strategy, such as fine-tuning a ready-made model or steering the model during use with the aid of prompt engineering.
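As an illustration of the prompt engineering route, a few example translations can be placed directly in the prompt so the model picks up the desired style and terminology without any retraining. This is a generic sketch; the example sentences and language pair are invented:

```python
def build_translation_prompt(examples: list[tuple[str, str]], new_text: str) -> str:
    """Build a few-shot prompt: example translations teach style and terminology."""
    lines = ["Translate from English to German, matching the style of these examples:", ""]
    for source, target in examples:
        lines.append(f"English: {source}")
        lines.append(f"German: {target}")
        lines.append("")
    lines.append(f"English: {new_text}")
    lines.append("German:")
    return "\n".join(lines)

# Invented example pair, for illustration only.
examples = [("Power off the device.", "Schalten Sie das Gerät aus.")]
print(build_translation_prompt(examples, "Restart the device."))
```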
Especially when large language models are used for translation, it is useful to personalise the language model with high-quality, well-matched bilingual materials. It is best if the texts used in personalisation come from subject areas as similar as possible to the new texts to be translated. For this purpose, training language models with existing translation memories is a good idea: all of the customer’s translations are saved in translation memories by language pair, so these texts are always bilingual, are good matches, and use the company’s desired style and terminology.
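As a rough sketch of how a translation memory could feed model personalisation, the snippet below reads aligned segments from a simplified TMX file (the standard translation memory exchange format) and writes them out as training pairs. The file names are hypothetical and this is not LanguageWire’s actual pipeline:

```python
import json
import xml.etree.ElementTree as ET

def tmx_to_training_pairs(tmx_path: str, src_lang: str, tgt_lang: str, out_path: str):
    """Extract aligned segments from a TMX translation memory into JSON lines."""
    tree = ET.parse(tmx_path)
    # Assumes TMX 1.4-style files, where each <tuv> carries an xml:lang attribute.
    lang_attr = "{http://www.w3.org/XML/1998/namespace}lang"
    with open(out_path, "w", encoding="utf-8") as out:
        for tu in tree.iter("tu"):  # each <tu> holds one aligned translation unit
            segs = {tuv.get(lang_attr): tuv.findtext("seg") for tuv in tu.iter("tuv")}
            if segs.get(src_lang) and segs.get(tgt_lang):
                pair = {"source": segs[src_lang], "target": segs[tgt_lang]}
                out.write(json.dumps(pair, ensure_ascii=False) + "\n")

# Hypothetical file names, for illustration only.
# tmx_to_training_pairs("memory.tmx", "en", "de", "training_pairs.jsonl")
```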
In this way, the language model learns what the customer’s texts are like and can imitate their style better in the future. This improves, among other things, how well machine translation produced by large language models aligns with the company’s style. This, in turn, speeds up translation, as higher-quality machine translation customised to the customer’s needs reduces the time spent on any post-editing required.
The downsides of AI
Although the continuous learning and adaptability associated with large language models are in many respects beneficial, they also have their downsides. Publicly available language model-based solutions learn as they are used, so they can also be intentionally misled and taught false information. If you repeatedly teach a language model that a pike is a bird instead of a fish, it will start believing this to be true. Large language models also suffer from so-called hallucinations, which you can read more about in our article on getting the most out of machine translation.
In addition to invented facts, another risk associated with language models trained on large amounts of data is that confidential information may end up in the wrong hands. Large language models in general use are open to everyone, and when using them it is not possible to be sure whether your data will be made available to others. It is therefore important to choose a reliable partner for whom data security is a priority.
When training large language models with translation memories, it is necessary to take care of data security and ensure that customer-specific data is used only in that customer’s own model. This prevents a situation in which the language model introduces one customer’s confidential information into another customer’s translations. We can build a customer-specific language model directly as part of LanguageWire’s highly secure ecosystem. Read more about LanguageWire’s insights into the data security of large language models here.
Large language models as part of the future
Training large language models for translation needs is a giant leap forward, as they can provide more accurate and natural translations into multiple languages. These models make it possible to produce translations quickly while taking the desired style and terminology into account. In the future, this development can break down language barriers and foster interaction between people around the world by strengthening understanding and cooperation on a global scale.