UTF-8 encoding in GNAT
Nowadays, UTF-8 is the de facto standard for source code representation. Section 2.1 Character Set of the reference manual says that the compiler must understand texts in UTF-8 encoding:
An Ada implementation shall accept Ada source code in UTF-8 encoding, with or without a BOM (see A.4.11), where every character is represented by its code point.
This does not mean that the compiler must do so by default. In the case of the GNAT compiler, the compiler by default uses the Latin-1 (ISO-8859-1) encoding and a build switch is needed to enable UTF-8 encoding of string literals and identifiers.
String literals
For example, the following program has a string literal containing UTF-8 characters:
with Ada.Text_IO;

procedure Main is
begin
   Ada.Text_IO.Put_Line ("Привіт");
end Main;
In some IDEs, such as GNAT Studio, building or even saving the file may generate an error like:
This buffer contains UTF-8 characters which could not be translated to ISO-8859-1.
Some data may be missing in the saved file: check the Locations View.
You may change the character set of this file through the "Properties..." contextual menu.
If you change the encoding of the file to UTF-8, the program can be built and may seem to work as expected (assuming that the terminal understands UTF-8).
The default encoding of all files can be changed in GNAT Studio by going to "Edit" -> "Preferences...". Then go to "General" and change "Character set" to "Unicode UTF-8". Re-create the file "main.adb" and it will save successfully.
UTF-8 encoding of string literals
However, from the compiler's point of view the string "Привіт" contains not 6, but 12 characters!
The reason for this is that, according to Section 3.5.2 Character Types of the reference manual, the type Character includes only 256 values from the Latin-1 set:
The predefined type Character is a character type whose values correspond to the 256 code points of Row 00 (also known as Latin-1) of the ISO/IEC 10646:2017 Basic Multilingual Plane (BMP).
As far as the compiler is concerned, the string literal does not contain any Cyrillic characters. The compiler sees the string literal as "Ð_Ñ_Ð¸Ð²Ñ_Ñ_" (_ indicates non-printable characters in Latin-1): each Cyrillic letter occupies two bytes in UTF-8, and each byte is read as a separate Latin-1 character.
The compiler does not know that you have set the encoding in the IDE to UTF-8. It still uses Latin-1 as its default encoding.
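The count of 12 can be checked without relying on the encoding of the source file. The following is a minimal sketch, not part of the original example: it spells the word with code points, converts it to UTF-8 with the standard package Ada.Strings.UTF_Encoding.Wide_Strings, and prints the byte values that the compiler would read as Latin-1 characters. The procedure name Show_Bytes is made up for illustration.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding.Wide_Strings;

procedure Show_Bytes is
   package UTF_Wide renames Ada.Strings.UTF_Encoding.Wide_Strings;

   --  The word from the example, spelled with code points so that
   --  this source file stays pure ASCII.
   Hello : constant Wide_String :=
     Wide_Character'Val (16#041F#) & Wide_Character'Val (16#0440#) &
     Wide_Character'Val (16#0438#) & Wide_Character'Val (16#0432#) &
     Wide_Character'Val (16#0456#) & Wide_Character'Val (16#0442#);

   --  The same text as a UTF-8 byte sequence, one byte per Character.
   Bytes : constant Ada.Strings.UTF_Encoding.UTF_8_String :=
     UTF_Wide.Encode (Hello);
begin
   --  Prints 6: the number of wide characters.
   Ada.Text_IO.Put_Line
     ("Wide characters:" & Natural'Image (Hello'Length));

   --  Prints 12: every Cyrillic letter takes two bytes in UTF-8.
   Ada.Text_IO.Put_Line
     ("Latin-1 characters:" & Natural'Image (Bytes'Length));

   --  Prints 208 159 209 128 208 184 208 178 209 150 209 130
   for C of Bytes loop
      Ada.Text_IO.Put (Natural'Image (Character'Pos (C)));
   end loop;
   Ada.Text_IO.New_Line;
end Show_Bytes;
```

The values 208 and 209 are the Latin-1 letters Ð and Ñ; 159, 128, 150 and 130 are control characters, and 184 and 178 are the symbols ¸ and ².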
You can tell the compiler that the source code is encoded in UTF-8 by adding the -gnatW8 option to the list of compiler switches.
If your project is an Alire crate, edit the .gpr file in the root of your project and add -gnatW8 to the default switches in the package Compiler:
package Compiler is
   for Default_Switches ("Ada") use Main_Config.Ada_Compiler_Switches & ("-gnatW8");
end Compiler;
Now the compiler sees the Cyrillic characters and correctly refuses to build the program:
5. Ada.Text_IO.Put_Line ("Привіт");
|
>>> literal out of range of type Standard.Character
6 lines: 1 error
If your project does not use Alire, the switch can be added in GNAT Studio as follows: select in the menu "Edit" -> "Project Properties...". Then go to "Build" -> "Switches" -> "Ada" and write -gnatW8 in the options bar at the bottom of the window.
Printing text containing UTF-8 characters
To work with text containing characters outside Latin-1, the type String is insufficient. A different type is needed.
In the first version of the Ada language, the type Character could hold only the 128 characters of the ASCII set. This was quickly corrected by expanding Character to 256 Latin-1 values. The next version of the standard, around the time of Java's appearance (whose character size is 16 bits), introduced the type Wide_Character, which contains 65536 characters. Later the 32-bit type Wide_Wide_Character was added for Unicode with its repertoire of 1114112 "code points".
The types String and Wide_String were never deprecated. In fact, the type String is still widely used in the standard library, for example for file names in the packages related to I/O and in the packages related to environment variables.
To print the text with the UTF-8 characters in the program above, the type Wide_String should be used instead. In order to print text of this type, the package Ada.Wide_Text_IO is needed:
with Ada.Wide_Text_IO;

procedure Main is
   Hello : constant Wide_String := "Привіт";
begin
   Ada.Wide_Text_IO.Put_Line (Hello);
end Main;
Now the program will compile and print a string of 6 characters to the screen.
Identifiers
By default the GNAT compiler recognizes only the Latin-1 character set in identifiers. The current Ada standard, however, defines the identifier lexical element in Unicode terms, not in Latin-1 characters. That is, the standard allows the use of non-Latin-1 characters in identifiers. With -gnatW8 the compiler will follow the standard with respect to identifiers and allow the use of UTF-8 characters:
with Ada.Text_IO;

procedure Main is
   π : constant := 3.14;
begin
   Ada.Text_IO.Put_Line (π'Image);
end Main;
The use of such names is not particularly encouraged, but can be useful in certain cases. Keep the names of compilation units and files in ASCII to avoid problems with the compiler and tools.
When -gnatW8 isn't allowed
Some style guides prohibit the use of non-ASCII characters in source code. Without -gnatW8, a Wide_String containing UTF-8 characters (in this case "Привіт") can be constructed as follows:
with Ada.Wide_Text_IO;

procedure Main is
   Hello : constant Wide_String :=
     Wide_Character'Val (1055) &
     Wide_Character'Val (1088) &
     Wide_Character'Val (1080) &
     Wide_Character'Val (1074) &
     Wide_Character'Val (1110) &
     Wide_Character'Val (1090);
begin
   Ada.Wide_Text_IO.Put_Line (Hello);
end Main;
The result is, well, non-obvious:
["041F"]["0440"]["0438"]["0432"]["0456"]["0442"]
The output is the so-called "brackets encoding", invented by the GNAT authors in the early days of Wide_Character and Wide_String. Each bracketed group holds the hexadecimal code point of one character; for example, 1055 is 16#041F#.
To avoid brackets encoding and make the program use UTF-8 for its output even without -gnatW8, add -W8 to the list of switches in package Binder in the .gpr file of your project:
package Binder is
   for Switches ("ada") use ("-W8");
end Binder;
Using this switch, the program will correctly print the string "Привіт" instead of producing output using brackets encoding.
In GNAT Studio, go to the tab "Build" -> "Switches" -> "Binder" and add -W8 to the options bar.
When you use -gnatW8, the binder will use the -W8 switch automatically. You can still specify both of them; it does no harm.
Brackets encoding was used, for example, in GNAT's implementation of package Ada.Numerics to provide the constant Pi as the Greek letter π:
["03C0"] : constant := Pi;
(Section A.5 of the reference manual does not use brackets encoding and uses the actual character π.)
Opening files using UTF-8
Files using UTF-8 can be opened using the parameter Form => "WCEM=8":
with Ada.Wide_Text_IO;

procedure Main is
   Hello : constant Wide_String :=
     Wide_Character'Val (1055) &
     Wide_Character'Val (1088) &
     Wide_Character'Val (1080) &
     Wide_Character'Val (1074) &
     Wide_Character'Val (1110) &
     Wide_Character'Val (1090);
   Output : Ada.Wide_Text_IO.File_Type;
begin
   --  WCEM=8 selects UTF-8 as the wide character encoding of this file.
   Ada.Wide_Text_IO.Create (Output, Name => "hello.txt", Form => "WCEM=8");
   Ada.Wide_Text_IO.Put_Line (Output, Hello);
   --  Close the file explicitly instead of relying on program termination.
   Ada.Wide_Text_IO.Close (Output);
end Main;
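The same Form parameter applies when reading. The following is a minimal sketch, assuming that the file "hello.txt" written by the program above exists in the current directory; the procedure name Read_Hello is made up for illustration.

```ada
with Ada.Wide_Text_IO;

procedure Read_Hello is
   package WIO renames Ada.Wide_Text_IO;

   Input : WIO.File_Type;
begin
   --  Open the existing file and decode its contents as UTF-8.
   WIO.Open (Input, WIO.In_File, "hello.txt", Form => "WCEM=8");

   while not WIO.End_Of_File (Input) loop
      declare
         Line : constant Wide_String := WIO.Get_Line (Input);
      begin
         --  Report how many wide characters were decoded from the line.
         WIO.Put_Line
           ("Read" & Natural'Wide_Image (Line'Length) & " characters");
      end;
   end loop;

   WIO.Close (Input);
end Read_Hello;
```

For the file created above, the line should be reported as 6 characters long. If the Form parameter is left out and the default encoding is not UTF-8, the count is likely to differ, because the bytes are then not decoded as UTF-8.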
Environment variables
Do not expect your program to pay attention to locale settings (like LANG), but watch out for the environment variables GNAT_CCS_ENCODING and GNAT_CODE_PAGE on Windows.
Enabling UTF-8 in other tools
GNAT programs
Other GNAT programs like gnatpp and gnatstub can use UTF-8 by using the flag --wide-character-encoding=8.
Ada Language Server
The Ada Language Server (part of the Ada extension for Visual Studio Code) uses iso-8859-1 by default. This can be changed to UTF-8 via the parameter defaultCharset in the settings:
"defaultCharset": "UTF-8"
Alire crates for handling Unicode strings
For some time, the Ada standard was unambiguous in that the types Character and String could only contain Latin-1 characters. But at some point it faltered under the onslaught of "lovers of simple solutions" and there appeared functions to convert Wide_String/Wide_Wide_String to UTF-8, which use the String type instead of an array of bytes to represent UTF-8. The GNAT authors take full advantage of this, allowing you, for example, to pass a file name in UTF-8 encoding to Ada.Text_IO.Create via a parameter of type String.
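The conversion functions mentioned above are declared in Ada.Strings.UTF_Encoding and its child packages. The following is a minimal sketch of a round trip between Wide_String and UTF-8; the procedure name Round_Trip and the object names are made up for illustration.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding.Wide_Strings;

procedure Round_Trip is
   package UTF_Wide renames Ada.Strings.UTF_Encoding.Wide_Strings;

   Hello : constant Wide_String :=
     Wide_Character'Val (1055) & Wide_Character'Val (1088) &
     Wide_Character'Val (1080) & Wide_Character'Val (1074) &
     Wide_Character'Val (1110) & Wide_Character'Val (1090);

   --  UTF_8_String is a subtype of String, so the encoded bytes can be
   --  stored in, or passed as, a plain String.
   Encoded : constant String := UTF_Wide.Encode (Hello);

   --  Decoding the bytes yields the original wide characters again.
   Decoded : constant Wide_String := UTF_Wide.Decode (Encoded);
begin
   --  Prints TRUE
   Ada.Text_IO.Put_Line (Boolean'Image (Decoded = Hello));
end Round_Trip;
```

Nothing in the type of Encoded records that it holds UTF-8 rather than Latin-1 text, which is exactly the ambiguity described above.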
Introducing the type Wide_Wide_String does not really solve the problem of using Unicode, since Unicode text is made up not of isolated "characters" but of combinations of characters. There are cases where several "code points" form a single glyph when printed or displayed. It is often more convenient for the user to work in terms of such glyphs, for example when specifying a line or column position in a text. The type Wide_Wide_String does not help here.
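A small illustration of this point, as a sketch only: the letter "e" followed by the combining acute accent (U+0301) is displayed as the single glyph "é", yet a Wide_Wide_String holds it as two elements. The procedure name Combining is made up for illustration.

```ada
with Ada.Text_IO;

procedure Combining is
   --  LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT.
   E_Acute : constant Wide_Wide_String :=
     Wide_Wide_Character'Val (16#0065#) &
     Wide_Wide_Character'Val (16#0301#);
begin
   --  Prints 2, although the user sees one character on the screen.
   Ada.Text_IO.Put_Line
     ("Code points:" & Natural'Image (E_Acute'Length));
end Combining;
```

An editor that counted columns with 'Length would place the cursor one position too far to the right after this glyph.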
Various Alire crates exist which provide proper support for Unicode strings:
- vss. Introduces its own type for Unicode strings with handy methods to work with. It can be used to find character boundaries, grapheme clusters, character offsets in UTF-8/UTF-16 encoding, etc.
- matreshka_league. Allows you to operate on strings in terms of Unicode "code points". It has a set of transcoders to different encoding systems (like Windows-1251, KOI8-R), support for JSON, XML, databases, regular expressions, XML template engine, etc.