Skip to main content

Using UTF-8 encoding in GNAT

Nowadays, UTF-8 is the de facto standard for source code representation. Section 2.1 Character Set of the reference manual says that the compiler must understand texts in UTF-8 encoding:

An Ada implementation shall accept Ada source code in UTF-8 encoding, with or without a BOM (see A.4.11), where every character is represented by its code point.

This does not mean that the compiler must do so by default. In the case of the GNAT compiler, the compiler by default uses the Latin-1 (ISO-8859-1) encoding and a build switch is needed to enable UTF-8 encoding of string literals and identifiers.

String literals

For example, the following program has a string literal containing UTF-8 characters:

with Ada.Text_IO;

procedure Main is
begin
Ada.Text_IO.Put_Line ("Привіт");
end Main;

In some IDEs, such as GNAT Studio, building or even saving the file may generate an error like:

This buffer contains UTF-8 characters which could not be translated to ISO-8859-1.

Some data may be missing in the saved file: check the Locations View.

You may change the character set of this file through the "Properties..." contextual menu.

If you change the encoding of the file to UTF-8, the program can be build and may seem to work as expected (assuming that the terminal understands UTF-8).

tip

The default encoding of all files can be changed in GNAT Studio by going to "Edit" -> "Preferences...". Then go to "General" and change "Character set" to "Unicode UTF-8". Re-create the file "main.adb" and it will save successfully.

UTF-8 encoding of string literals

However, from the compiler's point of view the string "Привіт" contains not 6, but 12 characters! The reason for this is that, according to Section 3.5.2 Character Types of the reference manual, the type Character includes only 256 values from the Latin-1 set:

The predefined type Character is a character type whose values correspond to the 256 code points of Row 00 (also known as Latin-1) of the ISO/IEC 10646:2017 Basic Multilingual Plane (BMP).

As far as the compiler is concerned, the string literal does not contain any Cyrillic characters. The compiler sees the string literal as "Ð_Ñ_ивеÑ_" (_ indicates non-printable characters in Latin-1). The compiler does not know that you have set the encoding in the IDE to UTF-8. It still uses Latin-1 as its default encoding.

You can tell the compiler the encoding of the source code is UTF-8 by adding the -gnatW8 option to the list of compiler switches. If your project is an Alire crate, edit the .gpr file in the root of your project and add -gnatW8 to the default switches in the package Compiler:

package Compiler is
for Default_Switches ("Ada") use Main_Config.Ada_Compiler_Switches & ("-gnatW8");
end Compiler;

Now the compiler will see the Cyrillic alphabet and correctly refuses to build the program:

     5.    Ada.Text_IO.Put_Line ("Привіт");
|
>>> literal out of range of type Standard.Character

6 lines: 1 error
tip

If your project does not use Alire, the switch can be added in GNAT Studio as follows: select in the menu "Edit" -> "Project Properties...". Then go to "Build" -> "Switches" -> "Ada" and write -gnatW8 in the options bar at the bottom of the window.

Printing text containing UTF-8 characters

To work with text containing UTF-8 characters, the type String is insufficient. A different type is needed.

Before the first version of the Ada language, only 127 characters of the ASCII set could fit in the type Character. This was quickly corrected by expanding Character to 256 Latin-1 values. The next version of the standard, around the time of Java's appearance (whose character size is 16 bits) introduced the type Wide_Character, which contains 65536 characters. Later the 32-bit type Wide_Wide_Character was added for Unicode with its repertoire of 1114112 "code points".

The types String and Wide_String were never deprecated. In fact, the type String is still widely used in the standard library. For example, in file names in packages related to I/O and packages related to environment variables.

To print the text with the UTF-8 characters in the program above, the type Wide_String should be used instead. In order to print text of this type, the package Ada.Wide_Text_IO is needed:

with Ada.Wide_Text_IO;

procedure Main is
Hello : constant Wide_String := "Привіт";
begin
Ada.Wide_Text_IO.Put_Line (Hello);
end Main;

Now the program will compile and print a string of 6 characters to the screen.

Identifiers

By default the GNAT compiler recognizes only the Latin-1 character set in identifiers. The current Ada standard, however, defines the identifier lexical element in Unicode terms, not in Latin-1 characters. That is, the standard allows the use of non-Latin-1 characters in identifiers. With -gnatW8 the compiler will follow the standard with respect to identifiers and allow the use of UTF-8 characters:

with Ada.Text_IO;

procedure Main is
π : constant := 3.14;
begin
Ada.Text_IO.Put_Line (π'Image);
end Main;

The use of such names is not particularly encouraged, but can be found useful in certain cases.

caution

Keep compilation units and file names in ASCII to avoid problems with the compiler and tools.

When -gnatW8 isn't allowed

Some style guides prohibit the use of non-ASCII characters in source code. Without -gnatW8, a Wide_String containing UTF-8 characters (in this case "Привіт") can be constructed as follows:

with Ada.Wide_Text_IO;

procedure Main is
Hello : constant Wide_String :=
Wide_Character'Val (1055) &
Wide_Character'Val (1088) &
Wide_Character'Val (1080) &
Wide_Character'Val (1074) &
Wide_Character'Val (1110) &
Wide_Character'Val (1090);
begin
Ada.Wide_Text_IO.Put_Line (Hello);
end Main;

The result is, well, non-obvious:

["041F"]["0440"]["0438"]["0432"]["0456"]["0442"]

The output is the so-called "brackets encoding", invented by the GNAT authors in the early days of Wide_Character and Wide_String. To avoid brackets encoding and make the compiler use UTF-8 even without -gnatW8, add -W8 to the list of switches in package Binder in the .gpr file of your project:

package Binder is
for Switches ("ada") use ("-W8");
end Binder;

Using this switch, the program will correctly print the string "Привіт" instead of producing output using brackets encoding.

tip

In GNAT Studio, go to the tab "Build" -> "Switches" -> "Binder" and add -W8 to the options bar.

note

When you use -gnatW8, the binder will use the -W8 switch automatically. But you can specify both of them, it won't hurt.

info

Brackets encoding was used, for example, in GNAT's implementation of package Ada.Numerics to provide the constant Pi as the greek letter π:

["03C0"] : constant := Pi;

(Chapter A.5 of the reference manual does not use bracket encoding and uses the actual UTF-8 character π)

Opening files using UTF-8

Files using UTF-8 can be opened using the parameter Form => "WCEM=8":

with Ada.Wide_Text_IO;

procedure Main is
Hello : constant Wide_String :=
Wide_Character'Val (1055) &
Wide_Character'Val (1088) &
Wide_Character'Val (1080) &
Wide_Character'Val (1074) &
Wide_Character'Val (1110) &
Wide_Character'Val (1090);
Output : Ada.Wide_Text_IO.File_Type;
begin
Ada.Wide_Text_IO.Create (Output, Name => "hello.txt", Form => "WCEM=8");
Ada.Wide_Text_IO.Put_Line (Output, Hello);
end Main;

Environment variables

Do not expect your program to pay attention to locale settings (like LANG), but watch out for environment variables GNAT_CCS_ENCODING, GNAT_CODE_PAGE on Windows.

Enabling UTF-8 in other tools

GNAT programs

Other GNAT programs like gnatpp and gnatstub can use UTF-8 by using the flag --wide-character-encoding=8.

Ada Language Server

The Ada Language Server (part of the Ada extension for Visual Studio Code) uses iso-8859-1 by default. This can be changed to UTF-8 via the parameter defaultCharset in the settings:

"defaultCharset": "UTF-8"

Alire crates for handling Unicode strings

For some time, the Ada standard was unambiguous in that the types Character and String could only contain Latin-1 characters. But at some point it faltered under the onslaught of "lovers of simple solutions" and there appeared functions to convert Wide_String/Wide_Wide_String to UTF-8, which use String type instead of an array of bytes to represent UTF-8.

The GNAT authors take full advantage of this, allowing, for example, to pass to Ada.Text_IO.Create the file name in UTF-8 encoding via a parameter of type String.

Introducing the type Wide_Wide_String does not really solve the problem of using Unicode, since this standard does not manipulate "characters", but combinations of characters. There are variants where several "code points" form a single glyph when printed or displayed. It is often more convenient for the user to work in such concepts. This is logical for specifying the line or column position in a text. The type Wide_Wide_String does not help here.

Various Alire crates exist which provide proper support for Unicode strings:

  • vss. Introduces its own type for Unicode strings with handy methods to work with. It can be used to find character boundaries, grapheme clusters, character offsets in UTF-8/UTF-16 encoding, etc.

  • matreshka_league. Allows you to operate on strings in terms of Unicode "code points". It has a set of transcoders to different encoding systems (like Windows-1251, KOI8-R), support for JSON, XML, databases, regular expressions, XML template engine, etc.