Description
While installing ReactOS in Russian, warnings such as the one below appear in the debug log during the 2nd-stage installation:
(sdk\lib\rtl\unicode.c:556) \u0434 is not valid for OEM
This happens when RtlIsValidOemCharacter() is invoked. This function can be called, for example, by RtlGenerate8dot3Name() from FatSelectNames(), when new directories/files are created on a FAT (12/16/32) volume and the filesystem driver tries to generate an 8.3 short file name.
In this example, RtlIsValidOemCharacter() is invoked with *Char == 0x0434 (note: 16-bit WCHAR), and the global NlsMbOemCodePageTag is FALSE. Execution then goes to:
/* Receive Unicode character from the table */
UnicodeChar = RtlpUpcaseUnicodeChar(NlsOemToUnicodeTable[(UCHAR)NlsUnicodeToOemTable[*Char]]);

/* Receive OEM character from the table */
OemChar = NlsUnicodeToOemTable[UnicodeChar];
After some investigation, I can confirm that both the NlsUnicodeToOemTable and NlsOemToUnicodeTable arrays appear to contain valid values, corresponding to the CP866 codepage generated from this file (which in turn comes from http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP866.TXT ), and the conversions appear to be correct:
nt!RtlIsValidOemCharacter+0xdd:
8058c20d 0fb755fc movzx edx,word ptr [ebp-4]
kd> ??*Char
wchar_t 0x434 'д'
kd> ??NlsUnicodeToOemTable[*Char]
char 0xa4 ''
kd> ??NlsOemToUnicodeTable[0xa4]
wchar_t 0x434 'д'
kd> ??NlsOemToUnicodeTable[(UCHAR)NlsUnicodeToOemTable[*Char]]
wchar_t 0x434 'д'
corresponding to this line:
0xa4 0x0434 #CYRILLIC SMALL LETTER DE
Furthermore, earlier in this same file we can note:
0x84 0x0414 #CYRILLIC CAPITAL LETTER DE
i.e., that the uppercase of Cyrillic small letter De (0x434) should be Cyrillic capital letter De (0x0414).
However, the upper-casing: UnicodeChar = RtlpUpcaseUnicodeChar(...) returns the WRONG result:
kd> ??UnicodeChar
wchar_t 0x40c 'Ќ'
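To illustrate why a wrong upper-case result leads to the warning, here is a minimal C model of the single-byte path quoted above. It is only a sketch: the helper names are hypothetical, the tiny lookup functions stand in for the real NLS tables (only the two CP866 entries discussed here are modelled), and it assumes that the part of RtlIsValidOemCharacter() not quoted in this report rejects a character whose upcased form maps to the OEM default character.

```c
#include <stdbool.h>
#include <stdint.h>

#define OEM_DEFAULT_CHAR 0x3F /* '?', assumed default character for this model */

/* Hypothetical stand-in for NlsUnicodeToOemTable (two CP866 entries only). */
static uint8_t unicode_to_oem(uint16_t wc)
{
    switch (wc) {
    case 0x0414: return 0x84;             /* CYRILLIC CAPITAL LETTER DE */
    case 0x0434: return 0xA4;             /* CYRILLIC SMALL LETTER DE */
    default:     return OEM_DEFAULT_CHAR; /* no mapping in this model */
    }
}

/* Hypothetical stand-in for NlsOemToUnicodeTable. */
static uint16_t oem_to_unicode(uint8_t oc)
{
    switch (oc) {
    case 0x84: return 0x0414;
    case 0xA4: return 0x0434;
    default:   return oc;
    }
}

typedef uint16_t (*upcase_fn)(uint16_t);

/* Upper-casing as the CP866 mapping file suggests it should behave. */
static uint16_t upcase_expected(uint16_t wc) { return wc == 0x0434 ? 0x0414 : wc; }

/* Upper-casing as observed in the debugger: 0x0434 becomes 0x040C. */
static uint16_t upcase_observed(uint16_t wc) { return wc == 0x0434 ? 0x040C : wc; }

/* Simplified model of the round trip: Unicode -> OEM -> Unicode -> upcase -> OEM. */
static bool is_valid_oem_char_model(uint16_t *ch, upcase_fn upcase)
{
    uint16_t upper = upcase(oem_to_unicode(unicode_to_oem(*ch)));
    uint8_t oem = unicode_to_oem(upper);

    if (oem == OEM_DEFAULT_CHAR)
        return false; /* this is where the warning would be printed */

    *ch = upper;
    return true;
}
```

With upcase_expected, the model accepts 0x0434 and rewrites it to 0x0414. With upcase_observed, 0x040C has no OEM mapping in the model, so the check fails, which matches the symptom in the log.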
The upper-case data comes from the l_intl.nls file, which is MANUALLY generated with our create_nls host-tool from Unicode's http://www.unicode.org/Public/12.0.0/ucd/UnicodeData.txt
(similar auto-generated code from Wine can be found there).
Looking in UnicodeData.txt for how this returned "0x40c" character came to be, the only explanation I have is that the generator code for l_intl.nls parsed the following lines incorrectly:
1040C;DESERET CAPITAL LETTER AY;Lu;0;L;;;;;N;;;;10434;
...
10434;DESERET SMALL LETTER AY;Ll;0;L;;;;;N;;;1040C;;1040C
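This can happen if the numerical values (which are LARGER than the NT 16-bit WCHAR) are truncated: 0x1040C --> 0x040C and 0x10434 --> 0x0434. The upper-case mapping 0x10434 --> 0x1040C would then be stored as 0x0434 --> 0x040C, overwriting the correct entry for Cyrillic small letter De.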
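The suspected truncation can be reproduced in isolation. The following C sketch is not the create_nls code; the function names are hypothetical and the 64K-entry table is only an assumed layout. It shows what a cast to a 16-bit type does to the two Deseret code values, and how a later record could clobber an earlier, correct one.

```c
#include <stdint.h>
#include <stdlib.h>

/* Parse a hexadecimal field and keep only the low 16 bits,
 * which is what a cast to a 16-bit type does. */
static uint16_t parse_code_truncated(const char *field)
{
    return (uint16_t)strtol(field, NULL, 16);
}

/* Record "code -> uppercase" in a 64K-entry table (assumed layout)
 * using the truncating parser for both fields. */
static void record_upper_truncated(uint16_t *table,
                                   const char *code_field,
                                   const char *upper_field)
{
    table[parse_code_truncated(code_field)] = parse_code_truncated(upper_field);
}
```

Feeding the record for 0434 (upper-case 0414) and then the record for 10434 (upper-case 1040C) leaves table[0x0434] holding 0x040C, the same value seen in the debugger.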
A hint appears to be here in create_nls.c, where suspicious 16-bit (WCHAR/WORD) casts such as this one appear:
/* 0. Code value */
code = (WORD)strtol(p, &p, 16);
etc., and similarly for the upper/lower-case mappings (variable case_mapping).
Converting these variables to full 32-bit unsigned int, removing these casts, AND performing boundary checks (the table does not appear to store Unicode code points > 0xFFFF) would appear to be a valid solution.
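A possible shape for such a fix is sketched below in C. This is an illustration under assumptions, not a patch for create_nls.c: the helper names and the flat 64K-entry table are hypothetical, and records whose code value or mapping lies above 0xFFFF are simply skipped.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_BMP_CODE 0xFFFFul

/* Parse a hexadecimal field at full width and accept it only if it fits
 * in 16 bits. Returns false for empty fields and for values > 0xFFFF. */
static bool parse_bmp_code(const char *field, uint16_t *out)
{
    char *end;
    unsigned long value = strtoul(field, &end, 16);

    if (end == field)         /* empty or non-hexadecimal field */
        return false;
    if (value > MAX_BMP_CODE) /* does not fit in a 16-bit table index */
        return false;

    *out = (uint16_t)value;   /* safe: range was checked above */
    return true;
}

/* Store "code -> uppercase" only when both values fit in the table. */
static bool store_upper_mapping(uint16_t *table,
                                const char *code_field,
                                const char *upper_field)
{
    uint16_t code, upper;

    if (!parse_bmp_code(code_field, &code) ||
        !parse_bmp_code(upper_field, &upper))
        return false;

    table[code] = upper;
    return true;
}
```

With this check, the record for 10434 is rejected instead of being folded onto 0434, so the mapping 0x0434 --> 0x0414 stays intact.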
The similar code for the Wine-generated files appears to be in this Perl script.
(IMPORTANT NOTE: Updated script can be found at: https://github.com/wine-mirror/wine/blob/master/tools/make_unicode )
A final question is how all this code can support multi-unit UTF-16 values (surrogate pairs) in the future...
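For context on that question, code points above 0xFFFF are represented in UTF-16 as a pair of 16-bit units, so a table indexed by a single WCHAR cannot address them directly. The C sketch below shows the standard UTF-16 surrogate-pair computation; the function name is hypothetical and nothing like it is claimed to exist in the code discussed here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Split a supplementary code point (0x10000..0x10FFFF) into its
 * UTF-16 high and low surrogates. Returns false for other values. */
static bool encode_utf16_pair(uint32_t cp, uint16_t *high, uint16_t *low)
{
    if (cp < 0x10000 || cp > 0x10FFFF)
        return false;

    cp -= 0x10000;                            /* 20 significant bits remain */
    *high = (uint16_t)(0xD800 + (cp >> 10));  /* top 10 bits */
    *low  = (uint16_t)(0xDC00 + (cp & 0x3FF)); /* bottom 10 bits */
    return true;
}
```

For example, 0x10434 becomes the pair 0xD801 0xDC34 and 0x1040C becomes 0xD801 0xDC0C, so a case mapping between them would have to operate on two WCHARs at once.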