[UE4] Localisation and Internationalisation


Introduction

Localisation and Internationalisation (L10N and I18N) are two concepts that often get lumped together as simply “localisation” (and even I’m guilty of that), but they are in fact two distinct things, and UE4 handles them in different ways.

The localisation system in UE4 is entirely home-grown and centred around our ‘text’ type, whereas our internationalisation support makes use of the International Components for Unicode (ICU) library. While they are separate, in UE4 you cannot have localisation at runtime without the appropriate internationalisation support.

What is text?

Text in UE4 can be thought of as the primitive component for localisation. It is in essence a very specialised string, represented by the FText type in C++, and should be used whenever you have user-facing text that needs to be localised. If you take nothing else from this, remember that you use FText for all user-facing text.

Internally, FText is implemented as a TSharedRef to an ITextData, which makes an FText very cheap to copy (unlike an FString). The FTextSnapshot utility provides an efficient way to detect if a cached FText value has changed; this type is used to great effect in Slate to avoid the costly process of recreating a text layout until the user-facing text (which may be bound to a delegate) actually changes.
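
To make the cheap-copy and snapshot ideas concrete, here is a minimal standalone sketch in standard C++. It is not engine code: Text, TextData, TextSnapshot, and LayoutCache are hypothetical stand-ins that I have made up for illustration, with std::shared_ptr playing the role of the shared reference. The real FTextSnapshot tracks more than pointer identity, so treat this as a model of the idea only.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for the shared data behind a text value.
struct TextData {
    std::string displayString;
};

// Simplified model of a text handle: copying it copies one shared pointer,
// not the underlying string.
class Text {
public:
    explicit Text(std::string s)
        : data_(std::make_shared<const TextData>(TextData{std::move(s)})) {
        assert(data_ && "a text handle always points at some data");
    }

    const std::string& toString() const { return data_->displayString; }
    const std::shared_ptr<const TextData>& data() const { return data_; }

private:
    std::shared_ptr<const TextData> data_;
};

// Simplified model of a snapshot: remembers which shared data a text pointed
// at, so a cache can cheaply ask "is this still the same text?".
class TextSnapshot {
public:
    TextSnapshot() = default;
    explicit TextSnapshot(const Text& t) : data_(t.data()) {}

    // Compares identity of the shared data, not the characters.
    bool identicalTo(const Text& t) const { return data_ == t.data(); }

private:
    std::shared_ptr<const TextData> data_;
};

// Hypothetical cache that only redoes expensive work when the text changes.
class LayoutCache {
public:
    // Returns how many times the (pretend) layout has been rebuilt so far.
    int update(const Text& t) {
        if (!snapshot_.identicalTo(t)) {
            snapshot_ = TextSnapshot(t);
            ++rebuilds_;
        }
        return rebuilds_;
    }

private:
    TextSnapshot snapshot_;
    int rebuilds_ = 0;
};
```

Passing the same text (or a copy of it) to LayoutCache::update repeatedly triggers only one rebuild, because every copy shares the same data. A separately constructed text with identical characters still counts as a change in this model, which is the price of comparing identity rather than content.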

The data held within an FText varies depending upon how the FText was created. This variance is handled by the internal “text history” (FTextHistory, a slightly odd name that has nothing to do with translation history, but instead tells the text where it came from and how it can be rebuilt if needed). Text histories support the culture-correct rebuilding of text, and are the key component that allows live culture switching, sending FText over the network, and the creation of culture invariant sources.

While FText takes a lot of the pain of localisation away from you, there are still some ‘gotchas’ with text formatting that you should be aware of.

  • When injecting a number that affects the sentence, handle these variances via “Plural Forms” rather than branching in code.
    • This allows the sentence to be correctly translated for languages that don’t share the plural rules of your source language.
  • When injecting a personal noun, make sure you include an argument for the gender of the person.
    • This is important for languages with grammatical gender as it allows your translators to switch their translation based on the gender (see “Gender Forms”).
  • Avoid injecting non-personal nouns (or be prepared to do a lot of work to make them localisable).
    • Unlike personal nouns (which have a fixed gender between languages), non-personal nouns may have different genders in different languages. This makes the format pattern string impossible to accurately localise without a lot of per-culture meta-data (of which gender is only one part).
    • This per-culture meta-data would be used to build up the correct formatting pattern to use when formatting, and would be custom for your particular problem domain. I’m not aware of any text formatting system that handles this out-of-the-box, so you’d have to implement this part yourself.
    • Ideally you should prefer to provide full sentences rather than inject non-personal nouns (even if it’s redundant in your source language), as this will ensure that you get accurate translations out-of-the-box.
  • Avoid concatenating partial sentences.
    • This has many of the same pitfalls as injecting non-personal nouns. Each part of the sentence could be localised, but the combination of them may be incorrect.
    • As before, you should prefer to provide full sentences to ensure that you get accurate translations out-of-the-box.
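
To show why the first point matters, here is a small standalone C++ sketch (not UE4 code) of what goes wrong when you branch on a number in code. The names PluralForm, englishPlural, russianPlural, PluralTable, and formatCount are hypothetical helpers made up for this example; the Russian rule follows the CLDR cardinal rule for whole numbers as I understand it. In UE4 itself this choice lives inside the format pattern (see “Plural Forms”) rather than in code like this.

```cpp
#include <cassert>
#include <map>
#include <string>

// Plural categories, named after the CLDR keywords.
enum class PluralForm { One, Few, Many, Other };

// English cardinal rule for whole numbers: 1 is "one", the rest are "other".
PluralForm englishPlural(long n) {
    assert(n >= 0);
    return n == 1 ? PluralForm::One : PluralForm::Other;
}

// Russian cardinal rule for whole numbers: the choice depends on the last
// one or two digits, so 21 behaves like 1 but 11 does not.
PluralForm russianPlural(long n) {
    assert(n >= 0);
    const long mod10 = n % 10;
    const long mod100 = n % 100;
    if (mod10 == 1 && mod100 != 11) return PluralForm::One;
    if (mod10 >= 2 && mod10 <= 4 && (mod100 < 12 || mod100 > 14)) return PluralForm::Few;
    return PluralForm::Many;
}

// The words a translator supplies for each plural form of one noun.
using PluralTable = std::map<PluralForm, std::string>;

// Picks the word using the culture's rule instead of an "n == 1" branch.
std::string formatCount(long n, PluralForm (*rule)(long), const PluralTable& forms) {
    const auto it = forms.find(rule(n));
    assert(it != forms.end() && "the table must cover every form the rule can return");
    return std::to_string(n) + " " + it->second;
}
```

With an English table of {One: "cat", Other: "cats"}, formatCount(21, englishPlural, ...) gives "21 cats". With a Russian table of {One: "кошка", Few: "кошки", Many: "кошек"}, the count 21 selects the One form and 11 selects the Many form, which a hard-coded singular/plural branch in code could never get right.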

There are also a couple of ‘gotchas’ that you should be aware of with FString (as FText ultimately stores its display string as an FString).

  • FString in UE4 is an array of TCHAR. TCHAR is, by default, wchar_t. wchar_t is not a fixed size between platforms.
    • On Microsoft platforms TCHAR is 2 bytes (for UTF-16).
    • On all other currently supported platforms TCHAR is 4 bytes (for UTF-32).
  • FString assumes that a TCHAR always contains a complete character, and that strings can be split at any TCHAR boundary.
    • There are two minor exceptions to this; ICU (which always stores and processes strings as UTF-16), and complex text shaping (which uses ICU internally to iterate the string).
    • This means that the UTF-16 support on Microsoft platforms is essentially only UCS-2, since we can’t guarantee that characters outside the Basic Multilingual Plane (BMP) will be handled correctly. You should therefore limit your text to characters within the BMP.
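
The BMP limitation is easier to see with a concrete case. The sketch below is plain standard C++ rather than engine code, and uses char16_t as a stand-in for a 2-byte TCHAR. The helper names (isHighSurrogate, isLowSurrogate, countCodePoints, isSafeSplit) are my own for this example. It shows that a character outside the BMP takes two code units in UTF-16, so code that treats every unit as a complete character will miscount it and can cut it in half.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True if the code unit is the first half of a UTF-16 surrogate pair.
bool isHighSurrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }

// True if the code unit is the second half of a UTF-16 surrogate pair.
bool isLowSurrogate(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Counts Unicode code points, treating a well-formed surrogate pair as one.
std::size_t countCodePoints(const std::u16string& s) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (isHighSurrogate(s[i]) && i + 1 < s.size() && isLowSurrogate(s[i + 1])) {
            ++i;  // skip the low surrogate; the pair is a single code point
        }
        ++count;
    }
    return count;
}

// True if cutting the string just before index `pos` keeps every surrogate
// pair intact.
bool isSafeSplit(const std::u16string& s, std::size_t pos) {
    assert(pos <= s.size());
    if (pos == 0 || pos == s.size()) return true;
    return !(isHighSurrogate(s[pos - 1]) && isLowSurrogate(s[pos]));
}
```

For the string u"a\U0001F600b", size() reports 4 code units while countCodePoints reports 3, and isSafeSplit returns false at index 2 because that cut would land between the two halves of the pair. Text that stays within the BMP never hits this case, which is why the restriction above keeps things simple.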

What is ICU?

ICU is a mature and robust internationalisation library (and is probably the de facto internationalisation library for C/C++).

UE4 uses it to deal with anything involving culture specific data or processing, including the following:

  • Obtaining the current culture for the platform/OS.
  • Handling the prioritised fallback of cultures.
  • Handling the culture correct formatting of numbers (including percentages and currency), and dates and times (including timezone data).
  • Handling the culture correct pluralisation of numbers (during text formatting).
  • Handling Unicode compliant transformation of text (e.g. ToUpper, ToLower).
  • Handling Unicode compliant comparison and collation of text.
  • Handling Unicode compliant boundary analysis (characters, words, and line-breaks).
  • Handling Unicode compliant bi-directional (BiDi) text detection.

We originally used icu::DecimalFormat for number formatting, but it was far too slow for our needs, so instead we just extract the per-culture number formatting rules from ICU and pass them to our own FastDecimalFormat functions. A similar thing happened to icu::MessageFormat, which was replaced by our own FTextFormatter.
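
The “extract the rules once, then format with your own code” approach can be sketched in a few lines of standalone C++. This is only an illustration of the idea and not the FastDecimalFormat API: NumberRules and formatInteger are hypothetical names, the rules are hard-coded here rather than read from ICU, and only integer grouping is handled.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical per-culture rules. In the approach described above these would
// be read once from the culture data rather than hard-coded.
struct NumberRules {
    char groupSeparator;  // e.g. ',' for en-US, '.' for de-DE
    int primaryGroup;     // number of digits in the right-most group
    int secondaryGroup;   // number of digits in every group after that
};

// Formats a non-negative integer using the supplied grouping rules.
std::string formatInteger(std::uint64_t value, const NumberRules& rules) {
    assert(rules.primaryGroup > 0 && rules.secondaryGroup > 0);
    const std::string digits = std::to_string(value);
    std::string reversed;
    int groupSize = rules.primaryGroup;
    int inGroup = 0;
    // Walk the digits from least to most significant, inserting a separator
    // each time the current group is full.
    for (auto it = digits.rbegin(); it != digits.rend(); ++it) {
        if (inGroup == groupSize) {
            reversed.push_back(rules.groupSeparator);
            groupSize = rules.secondaryGroup;
            inGroup = 0;
        }
        reversed.push_back(*it);
        ++inGroup;
    }
    return std::string(reversed.rbegin(), reversed.rend());
}
```

With rules of {',', 3, 3} the value 1234567 formats as "1,234,567"; with {'.', 3, 3} it becomes "1.234.567"; and with {',', 3, 2} (the grouping style used in India) it becomes "12,34,567". The formatter itself knows nothing about cultures, which is what keeps the hot path cheap.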

The culture specific data that ICU needs to function is stored outside of ICU itself, and UE4 provides some coarse sets that you can use to minimise your project size:

  • English (~1.77MB)
  • EFIGS - English, French, Italian, German, and Spanish (~2.38MB)
  • EFIGSCJK - English, French, Italian, German, Spanish, Chinese, Japanese, and Korean (~5.99MB)
  • CJK - Chinese, Japanese, and Korean (~5.16MB)
  • All (~15.3MB)

Which one of these you pick depends on what languages you need to localise your game for, and that is the topic we will cover in the next post.

 