Details
-
Task
-
Resolution: Unresolved
-
Major
-
None
-
Windows 7 x64, system language: zh-HK, non-unicode program codepage: 950 (Big5)
Description
This issue is an attempt to replace CORE-7415.
MSVC uses the system ANSI character set when compiling non-UTF-8 source files. This is usually a non-issue on SBCS (single-byte character set) systems. The code itself is in the ASCII range, and any non-ASCII characters only appear within string/char literals and comments. Most non-ASCII characters map to valid chars in any SBCS codepage, but even if one doesn't map, it is either simply ignored (in comments) or compiled as-is (string literals), emitting a mostly harmless C4819 warning.
However, on DBCS (double-byte character set) systems, the situation gets much more complicated. With DBCS, part of the non-ASCII range (0x80 - 0xFF) is mapped as lead bytes, each indicating that the following byte is also part of the same character. The main problem is that when the source contains non-ASCII bytes intended for use with SBCS, they can be treated as lead bytes on DBCS systems, which causes the byte immediately following each of them to be "eaten". In cases where, for example, the "eaten" byte is the quote that ends a string literal, the result is a broken source file which fails to compile.
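The "eaten" closing quote can be illustrated with a small scanner. This is only a simplified model, not MSVC's actual lexer: the function names are hypothetical, and the scanner assumes that any byte in the cp950 lead-byte range 0x81 - 0xFE unconditionally swallows the byte after it.

```c
#include <stddef.h>

/* Simplified lead-byte test for codepage 950 (Big5): 0x81..0xFE.
 * Assumption: the trail byte is not validated, it is simply consumed. */
static int is_cp950_lead_byte(unsigned char c)
{
    return c >= 0x81 && c <= 0xFE;
}

/* Hypothetical scanner: src[0] is the opening quote of a string literal.
 * Returns the index of the closing quote, or -1 if none is found.
 * With dbcs == 0 every byte is a character of its own (SBCS view);
 * with dbcs != 0 a lead byte takes the next byte with it (DBCS view). */
static long find_literal_end(const unsigned char *src, size_t len, int dbcs)
{
    for (size_t i = 1; i < len; i++) {
        if (dbcs && is_cp950_lead_byte(src[i])) {
            i++;            /* the following byte is "eaten" as a trail byte */
            continue;
        }
        if (src[i] == '\\') {
            i++;            /* skip the character after a backslash */
            continue;
        }
        if (src[i] == '"')
            return (long)i;
    }
    return -1;
}
```

For the four source bytes `"`, 0xE9, `"`, `;` the SBCS view finds the closing quote at index 2, while the DBCS view pairs 0xE9 with the quote and reports -1, i.e. an unterminated literal.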
Therefore, in order to allow successful compilation across all systems, source files containing non-ASCII chars should be fixed.
From my observation, there are several cases where this situation can occur:
1. Comments in the header containing the author name with accented characters, e.g. Herv*é* Poussineau in `base/system/autochk/autochk.c`
Most of these cases won't cause any errors. They can be ignored, but should be fixed if C4819 is to be avoided.
2. Character literals in keyboard layout files, as mentioned in CORE-7417, e.g. line 178 of `dll/keyboard/kbdhu/kbdhu.c`
This can be fixed by simply replacing the char literal with an integer literal.
3. Translations for the command-line setup utility, e.g. `base/setup/usetup/lang/bg-BG.h`
These files are tricky because the string literals in them are intended to be compiled as-is, bit-exact, and are intended to be displayed with the corresponding SBCS codepage (note that there are no translations for DBCS codepages, except for Japanese, which only supports the single-byte chars of cp932). For this reason, these files cannot simply be converted to Unicode, since the compiler wouldn't convert them back into their corresponding codepages. The only way to ensure proper compilation is to convert the non-ASCII bytes into escape sequences, but that will cause serious annoyance when doing translations.
4. Resource files, which should have been fixed completely with CORE-9021
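The fixes suggested for cases 2 and 3 can be sketched as follows. Everything here is illustrative: `E_ACUTE` and `escape_high_bytes` are hypothetical names, the value 0xE9 is just an example byte, and nothing is taken from the actual contents of `kbdhu.c` or the usetup language headers. The helper shows one way the escaping could be automated so that translators would not have to write the escapes by hand.

```c
#include <stddef.h>
#include <stdio.h>

/* Case 2: an integer literal keeps the source pure ASCII, where a
 * character literal would contain the raw byte 0xE9. */
enum { E_ACUTE = 0xE9 };

/* Case 3: rewrite the body of a string literal so that every byte
 * >= 0x80 becomes a three-digit octal escape. ASCII bytes, including
 * any escapes already present, are copied through unchanged.
 * Returns the number of characters written (excluding the terminator),
 * or -1 if the output buffer is too small. */
static int escape_high_bytes(const unsigned char *in, size_t n,
                             char *out, size_t cap)
{
    size_t used = 0;

    if (cap == 0)
        return -1;
    for (size_t i = 0; i < n; i++) {
        if (in[i] >= 0x80) {
            if (used + 4 >= cap)
                return -1;
            /* An octal escape ends after three digits, so a following
             * 'a' or '1' cannot be absorbed into it the way it would be
             * by an open-ended \x escape. */
            snprintf(out + used, cap - used, "\\%03o", (unsigned)in[i]);
            used += 4;
        } else {
            if (used + 1 >= cap)
                return -1;
            out[used++] = (char)in[i];
        }
    }
    out[used] = '\0';
    return (int)used;
}
```

Octal escapes are used instead of `\x` because a hex escape in C has no length limit: `"\xE9a"` would be read as the single escape `\xE9a` rather than 0xE9 followed by `a`.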
Additional References:
All codepages used in Windows: https://msdn.microsoft.com/en-us/goglobal/bb964654