write an intro for all OS?
Since Windows 2000, Windows offers a nice Unicode API and supports non-BMP characters. It uses Unicode strings implemented as wchar_t* strings (LPWSTR). wchar_t is 16 bits long on Windows and so it uses UTF-16: non-BMP characters are stored as two wchar_t (a surrogate pair), and the length of a string is the number of UTF-16 units and not the number of characters.
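The surrogate pair arithmetic can be sketched in portable C. This is a minimal illustration, not a Windows API: the function name utf16_encode is hypothetical, and uint16_t is used instead of wchar_t because wchar_t is only 16 bits long on Windows.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-16.
 * Writes 1 or 2 units to out (which must have room for 2) and returns
 * the number of units written, or 0 if cp cannot be encoded.
 * utf16_encode is a hypothetical helper, not a system function. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;               /* out of range, or a lone surrogate */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;  /* BMP character: a single unit */
        return 1;
    }
    cp -= 0x10000;              /* 20 bits left, split in two halves */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

With this sketch, U+00E9 takes one unit whereas U+1F600 takes two (0xD83D, 0xDE00), so a string containing only U+1F600 has a length of 2 in UTF-16 units.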
Windows 95, 98 and Me also had Unicode strings, but were limited to BMP characters: they used UCS-2 instead of UTF-16.
And Windows CE?
A Windows application has two encodings, called code pages (abbreviated "cp"): ANSI and OEM code pages. The ANSI code page, CP_ACP, is used for the ANSI version of the Windows API to decode byte strings to character strings and has a number between 874 and 1258. The OEM code page or "IBM PC" code page, CP_OEMCP, comes from MS-DOS, is used for the Windows console, contains glyphs to create text interfaces (draw boxes) and has a number between 437 and 874. Example of a French setup: ANSI is cp1252 and OEM is cp850.
There are code page constants:
- CP_ACP: Windows ANSI code page
- CP_MACCP: Macintosh code page
- CP_OEMCP: OEM code page of the system
- CP_SYMBOL (42): Symbol code page
- CP_THREAD_ACP: ANSI code page of the current thread
- CP_UTF7 (65000): UTF-7
- CP_UTF8 (65001): UTF-8
Functions.
Wikipedia article: Windows code page.
Encode and decode functions of <windows.h>.
Note
The MultiByteToWideChar and WideCharToMultiByte functions are similar to the mbstowcs and wcstombs functions.
Document NormalizeString()
Document the replacement character?
Windows has two versions of each function of its API: the ANSI version (A suffix) using byte strings and the ANSI code page, and the wide version (W suffix) using character strings. There are also functions without suffix using TCHAR* strings: if the C define _UNICODE is defined, TCHAR is replaced by wchar_t and the Unicode functions are used; otherwise TCHAR is replaced by char and the ANSI functions are used. Example:
- CreateFileA(): bytes version, uses byte strings encoded to the ANSI code page
- CreateFileW(): Unicode version, uses wide character strings
- CreateFile(): TCHAR version, depending on the _UNICODE define
Always prefer the Unicode version to avoid encoding/decoding errors, and use the W suffix directly to avoid compilation issues.
Note
There is a third version of the API: the MBCS API (multibyte character string). Use the TCHAR functions and define _MBCS to use the MBCS functions. For example, _tcsrev is replaced by _mbsrev if _MBCS is defined, by _wcsrev if _UNICODE is defined, or by _strrev otherwise.
- LPSTR (LPCSTR): byte string, char* (const char*)
- LPWSTR (LPCWSTR): wide character string, wchar_t* (const wchar_t*)
- LPTSTR (LPCTSTR): byte or wide character string depending on the _UNICODE define, TCHAR* (const TCHAR*)
Windows stores filenames as Unicode in the filesystem. Filesystem wide character POSIX-like API:
POSIX functions, like fopen(), use the ANSI code page to encode/decode strings.
Console functions.
document ReadConsoleW()?
To improve the Unicode support of the console, set the console font to a TrueType font (e.g. "Lucida Console") and use the wide character API.
If the console is unable to render a character, it tries to use a character with a similar glyph. For example, with OEM code page 850, Ł (U+0141) is replaced by L (U+004C). If no replacement character can be found, "?" (U+003F) is displayed instead.
In a console (cmd.exe), the chcp command can be used to display or to change the OEM code page (and console code page). Changing the console code page is not a good idea because the ANSI API of the console still expects characters encoded to the previous console code page.
See also "Conventional wisdom is retarded, aka What the @#%&* is _O_U16TEXT?" (Michael S. Kaplan, 2008) and the Python bug report #1602: "windows console doesn't print or input Unicode".
Note
Setting the console code page to cp65001 (UTF-8) doesn't improve Unicode support, it is the opposite: non-ASCII characters are not rendered correctly and typing non-ASCII characters (e.g. using the keyboard) doesn't work correctly, especially using raster fonts.
_setmode and _wsopen are special functions to set the encoding of a file:
- _O_U8TEXT: UTF-8 without BOM
- _O_U16TEXT: UTF-16 without BOM
- _O_WTEXT: UTF-16 with BOM
fopen can use these modes using ccs= in the file mode:
- ccs=UNICODE: _O_WTEXT
- ccs=UTF-8: _O_UTF8
- ccs=UTF-16LE: _O_UTF16
Consequences on TTY and pipes?
Mac OS X uses UTF-8 for the filenames. If a filename is an invalid UTF-8 byte string, Mac OS X returns an error. The filenames are decomposed to an incompatible variant of the Normal Form D (NFD). Extract of the Technical Q&A QA1173: "For example, HFS Plus uses a variant of Normal Form D in which U+2000 through U+2FFF, U+F900 through U+FAFF, and U+2F800 through U+2FAFF are not decomposed."
To support different languages and encodings, UNIX and BSD operating systems have "locales". Locales are process-wide: if a thread or a library changes the locale, the whole process is impacted.
Locale categories:
- LC_COLLATE: compare and sort strings
- LC_CTYPE: decode byte strings and encode character strings
- LC_MESSAGES: language of messages
- LC_MONETARY: monetary formatting
- LC_NUMERIC: number formatting (e.g. thousands separator)
- LC_TIME: time and date formatting
LC_ALL is a special category: if you set a locale using this category, it sets the locale for all categories.
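A minimal sketch of this behaviour, assuming the implementation reports the locale name unchanged when a category is queried (true for "C", but other names may be canonicalized). The function name set_all_categories is hypothetical.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>

/* Set every category to an explicit locale name through LC_ALL, then
 * check that LC_CTYPE and LC_NUMERIC both report that name.
 * Returns 1 if both categories follow LC_ALL, 0 otherwise.
 * set_all_categories is a hypothetical helper. */
int set_all_categories(const char *name)
{
    assert(name != NULL);
    if (setlocale(LC_ALL, name) == NULL)
        return 0;   /* the locale name was rejected */
    /* A NULL locale argument only reads the current value. */
    if (strcmp(setlocale(LC_CTYPE, NULL), name) != 0)
        return 0;
    if (strcmp(setlocale(LC_NUMERIC, NULL), name) != 0)
        return 0;
    return 1;
}
```

Calling set_all_categories("C") is expected to return 1, since a single call with LC_ALL changes LC_CTYPE and LC_NUMERIC together.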
Each category has its own environment variable with the same name. For example, LC_MESSAGES=C displays error messages in English. To get the value of a locale category, the LC_ALL, LC_xxx (e.g. LC_CTYPE) and LANG environment variables are checked in this order: the first non-empty variable is used. If all variables are unset or empty, the C locale is used as a fallback.
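The lookup order can be sketched as follows. This only mimics what setlocale() does internally with an empty locale name; locale_from_env is a hypothetical helper, not a libc function.

```c
#include <assert.h>
#include <stdlib.h>

/* Return the locale name selected by the environment for one category.
 * category_var is the name of the category variable, e.g. "LC_CTYPE".
 * Checks LC_ALL, then the category variable, then LANG: the first
 * variable that is set and non-empty wins. Falls back to "C".
 * locale_from_env is a hypothetical helper. */
const char *locale_from_env(const char *category_var)
{
    assert(category_var != NULL);
    const char *names[3] = { "LC_ALL", category_var, "LANG" };
    for (int i = 0; i < 3; i++) {
        const char *value = getenv(names[i]);
        if (value != NULL && value[0] != '\0')
            return value;
    }
    return "C";
}
```

For example, with LANG=en_US.UTF-8 and LC_CTYPE=fr_FR.UTF-8, locale_from_env("LC_CTYPE") returns the French locale whereas locale_from_env("LC_MESSAGES") returns the English one.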
Note
The gettext library reads the LANGUAGE, LC_ALL and LANG environment variables (and some others) to get the user language. The LANGUAGE variable is specific to gettext and is not related to locales.
When a program starts, it does not directly get the user locale: it uses the default locale, which is called the "C" locale or the "POSIX" locale. It is also used if no locale environment variable is set. For LC_CTYPE, the C locale usually means ASCII, but not always (see the locale encoding section). For LC_MESSAGES, the C locale means the original language of the program, which is usually English.
For Unicode, the most important locale category is LC_CTYPE: it is used to set the "locale encoding".
To get the locale encoding:
- Copy the current locale: setlocale(LC_CTYPE, NULL)
- Set the current locale encoding to the user preference: setlocale(LC_CTYPE, "")
- Use nl_langinfo(CODESET) if available
- or setlocale(LC_CTYPE, NULL)
write a full example in C
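The steps above can be sketched as follows, assuming a POSIX system where nl_langinfo(CODESET) is available (the setlocale(LC_CTYPE, NULL) fallback is not covered). The names get_locale_encoding and copy_string are hypothetical, and restoring the previous locale before returning is a choice of this sketch.

```c
#include <langinfo.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Duplicate a string with malloc(); returns NULL if memory runs out. */
static char *copy_string(const char *s)
{
    char *copy = malloc(strlen(s) + 1);
    if (copy != NULL)
        strcpy(copy, s);
    return copy;
}

/* Return a malloc'ed copy of the user's locale encoding name, or NULL
 * on error. The LC_CTYPE locale in effect before the call is restored.
 * get_locale_encoding is a hypothetical helper; the caller frees the
 * result. */
char *get_locale_encoding(void)
{
    /* 1. Copy the current locale: the string returned by setlocale()
     *    may be overwritten by the next setlocale() call. */
    const char *current = setlocale(LC_CTYPE, NULL);
    if (current == NULL)
        return NULL;
    char *saved = copy_string(current);
    if (saved == NULL)
        return NULL;

    char *encoding = NULL;
    /* 2. Set LC_CTYPE to the user preference (environment variables). */
    if (setlocale(LC_CTYPE, "") != NULL) {
        /* 3. Read the encoding name of the LC_CTYPE locale. */
        encoding = copy_string(nl_langinfo(CODESET));
    }

    /* 4. Restore the previous locale. */
    setlocale(LC_CTYPE, saved);
    free(saved);
    return encoding;
}
```

The returned name depends on the platform and on the environment: with a UTF-8 locale it is usually "UTF-8", and with the C locale it is ASCII or one of its aliases.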
For the C locale, nl_langinfo(CODESET) returns ASCII, or an alias to this encoding (e.g. "US-ASCII" or "646"). But on FreeBSD, Solaris and Mac OS X, codec functions (e.g. mbstowcs) use ISO-8859-1 even if nl_langinfo(CODESET) announces ASCII encoding. AIX uses ISO-8859-1 for the C locale (and nl_langinfo(CODESET) returns "ISO8859-1").
<locale.h> functions.
setlocale("") means user preference
<langinfo.h> functions.
<stdlib.h> functions.
mbstowcs() and wcstombs() are strict and don't support error handlers.
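A minimal sketch of what strict decoding means for the caller: the only error signal is the (size_t)-1 return value, with no way to replace or skip undecodable bytes. The function name decode_locale is hypothetical, and calling mbstowcs() with a NULL destination to get the length is a POSIX (XSI) behaviour assumed here.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* Decode a byte string using the current LC_CTYPE locale encoding.
 * Returns a malloc'ed wide character string, or NULL if the input is
 * not valid in the locale encoding or if memory runs out.
 * decode_locale is a hypothetical helper; the caller frees the result. */
wchar_t *decode_locale(const char *bytes)
{
    assert(bytes != NULL);
    /* First pass: a NULL destination only counts the wide characters
     * (POSIX behaviour, assumed here). */
    size_t count = mbstowcs(NULL, bytes, 0);
    if (count == (size_t)-1)
        return NULL;    /* undecodable byte sequence: no error handler */
    wchar_t *wide = malloc((count + 1) * sizeof(wchar_t));
    if (wide == NULL)
        return NULL;
    /* Second pass: real conversion, including the terminating L'\0'. */
    if (mbstowcs(wide, bytes, count + 1) == (size_t)-1) {
        free(wide);
        return NULL;
    }
    return wide;
}
```

An application that needs a "replace" or "ignore" behaviour has to implement it itself, for example by decoding character by character.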
Note
"mbs" stands for "multibyte string" (byte string) and "wcs" stands for "wide character string".
On Windows, the "locale encodings" are the ANSI and OEM code pages. A Windows program uses the user preferred code pages at startup, whereas a program starts with the C locale on UNIX.
CD-ROM uses the ISO 9660 filesystem, which stores filenames as byte strings. This filesystem is very restrictive: only A-Z, 0-9, _ and "." are allowed. Microsoft has developed the Joliet extension: it stores filenames as UCS-2, up to 64 characters (BMP only). It was first supported by Windows 95. Today, all operating systems are able to read it.
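The ISO 9660 character restriction can be sketched as a small check. It only tests the character set listed above (A-Z, 0-9, _ and "."), not the length or structure rules of the standard, and it assumes an ASCII-compatible execution character set. The function name iso9660_chars_ok is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Return 1 if name is non-empty and only contains A-Z, 0-9, '_' and
 * '.', 0 otherwise. Length and structure rules are not checked.
 * iso9660_chars_ok is a hypothetical helper. */
int iso9660_chars_ok(const char *name)
{
    assert(name != NULL);
    if (name[0] == '\0')
        return 0;
    for (size_t i = 0; name[i] != '\0'; i++) {
        char c = name[i];
        int allowed = (c >= 'A' && c <= 'Z')
                   || (c >= '0' && c <= '9')
                   || c == '_' || c == '.';
        if (!allowed)
            return 0;   /* lowercase, space, non-ASCII byte, ... */
    }
    return 1;
}
```

For example, "README.TXT" passes the check whereas "readme.txt" and "café.txt" do not.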
UDF (Universal Disk Format) is the filesystem of DVD: it stores filenames as character strings.
UDF encoding?
MS-DOS uses the FAT filesystems (FAT 12, FAT 16, FAT 32): filenames are stored as byte strings. Filenames are limited to 8+3 characters (8 for the name, 3 for the extension) and displayed differently depending on the code page (mojibake issue).
Microsoft extended its FAT filesystem in Windows 95: the Virtual FAT (VFAT) supports "long filenames", which are stored as UCS-2, up to 255 characters (BMP only). Starting with Windows 2000, non-BMP characters can be used: UTF-16 replaces UCS-2 and the limit is now 255 UTF-16 units.
The NTFS filesystem stores filenames using UTF-16 encoding.
HFS stores filenames as byte strings.
HFS+ stores filenames as UTF-16: the maximum length is 255 UTF-16 units.
JFS and ZFS also use Unicode.
The ext family (ext2, ext3, ext4) stores filenames as byte strings.
Linux: mount options (FAT, NFSv3)
USB keys, camera, memory cards
Network filesystems like NFS (NFS4 supports Unicode?)