Details

Type: Bug
Resolution: Unresolved
Priority: Major
Fix Version/s: None
Component/s: Tools
Labels: None

Module:
- nls
- rtl

Description

While installing ReactOS in Russian, warnings such as the one below appear in the debug log during the 2nd-stage installation:
(sdk\lib\rtl\unicode.c:556) \u0434 is not valid for OEM

This happens when invoking RtlIsValidOemCharacter(). This function can be invoked, for example, by RtlGenerate8dot3Name() from FatSelectNames(), when new directories/files are created on a FAT(12/16/32) volume and the filesystem driver tries to generate an 8.3 short file name.

In this example, RtlIsValidOemCharacter() is invoked with *Char == 0x0434 (note: 16-bit WCHAR), and the global NlsMbOemCodePageTag is FALSE. Execution then goes to:

        /* Receive Unicode character from the table */
        UnicodeChar = RtlpUpcaseUnicodeChar(NlsOemToUnicodeTable[(UCHAR)NlsUnicodeToOemTable[*Char]]);

        /* Receive OEM character from the table */
        OemChar = NlsUnicodeToOemTable[UnicodeChar];

After some investigation, I can confirm that both the NlsUnicodeToOemTable and NlsOemToUnicodeTable arrays appear to contain valid values corresponding to the CP866 code page, generated from this file (which in turn comes from http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP866.TXT ),
and that the conversions appear to be correct:

nt!RtlIsValidOemCharacter+0xdd:
8058c20d 0fb755fc        movzx   edx,word ptr [ebp-4]
kd> ??*Char
wchar_t 0x434 'д'
kd> ??NlsUnicodeToOemTable[*Char]
char 0xa4 ''
kd> ??NlsOemToUnicodeTable[0xa4]
wchar_t 0x434 'д'
kd> ??NlsOemToUnicodeTable[(UCHAR)NlsUnicodeToOemTable[*Char]]
wchar_t 0x434 'д'

corresponding to this line:

0xa4	0x0434	#CYRILLIC SMALL LETTER DE

Furthermore, we can note, above in this same file:

0x84	0x0414	#CYRILLIC CAPITAL LETTER DE

i.e., the uppercase of Cyrillic small letter De (0x0434) should be Cyrillic capital letter De (0x0414).

However, the upper-casing: UnicodeChar = RtlpUpcaseUnicodeChar(...) returns the WRONG result:

kd> ??UnicodeChar
wchar_t 0x40c 'Ќ'

The upper-case data comes from the l_intl.nls file, which is MANUALLY generated with our create_nls host tool from Unicode's http://www.unicode.org/Public/12.0.0/ucd/UnicodeData.txt
(similar auto-generated code from Wine can be found there).

Looking in UnicodeData.txt for where this returned "0x40c" character could have come from, the only explanation I have is that the generator code for l_intl.nls parsed the following lines incorrectly:

1040C;DESERET CAPITAL LETTER AY;Lu;0;L;;;;;N;;;;10434;

...

10434;DESERET SMALL LETTER AY;Ll;0;L;;;;;N;;;1040C;;1040C

This can happen if the numerical values (which are LARGER than the NT 16-bit WCHAR) are truncated: 0x1040C --> 0x040C and 0x10434 --> 0x0434. The truncated Deseret entry then overwrites the correct mapping for 0x0434.
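The truncation is easy to reproduce in isolation. The sketch below uses a hypothetical helper, parse_truncated(), that mimics a 16-bit cast applied to the result of strtol(); it is not the actual create_nls code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: parse a hexadecimal field and truncate the result
 * to 16 bits, as a (WORD) cast on the return value of strtol() would. */
static uint16_t parse_truncated(const char *field)
{
    assert(field != NULL);
    return (uint16_t)strtol(field, NULL, 16);
}
```

Here parse_truncated("10434") yields 0x0434 and parse_truncated("1040C") yields 0x040C, so the Deseret pair becomes indistinguishable from a mapping between 0x0434 and 0x040C.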

A hint appears here in create_nls.c, which contains suspicious 16-bit casts such as:

        /* 0. Code value */
        code = (WORD)strtol(p, &p, 16);

The same pattern is used for the upper/lower-case mappings (variable case_mapping).
Converting these variables to full 32-bit unsigned int, removing these casts, AND performing boundary checks (the table doesn't appear to store Unicode code points > 0xFFFF) would appear to be a valid solution.
The similar code for the Wine-generated files appears to be in this Perl script.
(IMPORTANT NOTE: Updated script can be found at: https://github.com/wine-mirror/wine/blob/master/tools/make_unicode )
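A minimal sketch of the proposed fix could look like the following. The helper names, the BMP_MAX constant and the flat 65536-entry table are assumptions made for illustration; the real l_intl.nls layout and the create_nls parsing loop differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define BMP_MAX 0xFFFFu /* highest code point a 16-bit table entry can hold */

/* Hypothetical helper: parse one hexadecimal field into a full 32-bit value.
 * Returns true only if the field is non-empty and the value fits in 16 bits. */
static bool parse_code_point(const char *field, uint32_t *code)
{
    char *end;
    unsigned long value;

    assert(field != NULL && code != NULL);
    value = strtoul(field, &end, 16);
    if (end == field)
        return false; /* empty field: no mapping present */
    *code = (uint32_t)value;
    return value <= BMP_MAX;
}

/* Store an upper-case mapping only when both code points lie in the BMP.
 * 'table' is assumed to have 65536 entries indexed by code point. */
static bool add_upcase_mapping(uint16_t *table, const char *code_field,
                               const char *upper_field)
{
    uint32_t code, upper;

    if (!parse_code_point(code_field, &code) ||
        !parse_code_point(upper_field, &upper))
        return false; /* skip entries the table cannot represent */
    table[code] = (uint16_t)upper;
    return true;
}
```

With this check, the pair ("0434", "0414") is stored, while ("10434", "1040C") is skipped instead of silently overwriting the entry for 0x0434.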

A final open question is how all this code could support code points above 0xFFFF, which UTF-16 encodes as two 16-bit units, in the future...

People

Assignee: Unassigned

Reporter: hbelusca

Votes: 0

Watchers: 4

Dates

Created: 2025-09-09 16:09

Updated: 2025-09-09 16:25