IsTextUnicode
The IsTextUnicode
function determines whether a buffer probably contains a form of Unicode text.
The function uses various statistical and deterministic methods to make its
determination, under the control of flags passed via lpi. When the
function returns, the results of such tests are reported via lpi. If all
specified tests are passed, the function returns TRUE; otherwise, it returns
FALSE.
DWORD IsTextUnicode(
    CONST LPVOID lpBuffer,  // pointer to an input buffer to be examined
    int cb,                 // the size in bytes of the input buffer
    LPINT lpi               // pointer to flags that condition text examination and receive results
);
Parameters
lpBuffer
Pointer to the input buffer to be examined.
cb
Specifies the size, in bytes, of the input buffer pointed to by lpBuffer.
lpi
Pointer to an int that, upon entry to the function, contains a set of flags that
specify the tests to be applied to the input buffer text. Upon exit from the
function, that same int contains a set of bit flags indicating the results of
the specified tests: each bit is 1 if the contents of the buffer pass the
corresponding test, and 0 if they fail it. Only flags that are set upon entry
to the function are significant upon exit.
If lpi is NULL, the function uses all available tests to
determine whether the data in the buffer is probably Unicode text.
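The in/out use of lpi can be sketched as follows. This is a minimal illustration rather than a definitive usage pattern: the wrapper name probe_text and its -1 "unavailable" result on non-Windows builds are assumptions made so the sketch compiles anywhere; only IsTextUnicode itself comes from the Windows headers.

```c
#include <assert.h>
#include <stddef.h>

#ifdef _WIN32
#include <windows.h>   /* IsTextUnicode (link with Advapi32.lib) */
#endif

/*
 * Hypothetical wrapper: applies the tests selected in `tests` to `buf`.
 * Returns 1 if IsTextUnicode returned nonzero, 0 if it returned zero,
 * and -1 if the API is not available on this platform.
 * When `results` is non-NULL it receives the per-test result bits.
 */
int probe_text(const void *buf, int cb, int tests, int *results)
{
    assert(buf != NULL && cb >= 0);         /* caller must supply a real buffer */
#ifdef _WIN32
    {
        int flags = tests;                  /* on entry: the tests to apply */
        int ok = IsTextUnicode(buf, cb, &flags) ? 1 : 0;
        if (results != NULL)
            *results = flags;               /* on exit: results of those tests */
        return ok;
    }
#else
    (void)tests;
    if (results != NULL)
        *results = 0;
    return -1;                              /* assumption: no Windows API here */
#endif
}
```

A caller would pass, for example, IS_TEXT_UNICODE_SIGNATURE in tests and then check the same bit in *results after the call.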
Here are the constants used with *lpi's bit flags:
Value | Meaning
IS_TEXT_UNICODE_ASCII16 | The text is Unicode and contains nothing but zero-extended ASCII values/characters.
IS_TEXT_UNICODE_REVERSE_ASCII16 | Same as the preceding, except that the Unicode text is byte-reversed.
IS_TEXT_UNICODE_STATISTICS | The text is probably Unicode, with the determination made by applying statistical analysis. Absolute certainty is not guaranteed. See the note in the following Remarks section.
IS_TEXT_UNICODE_REVERSE_STATISTICS | Same as the preceding, except that the probably-Unicode text is byte-reversed.
IS_TEXT_UNICODE_CONTROLS | The text contains Unicode representations of one or more of these non-printing characters: RETURN, LINEFEED, SPACE, CJK_SPACE, TAB.
IS_TEXT_UNICODE_REVERSE_CONTROLS | Same as the preceding, except that the Unicode characters are byte-reversed.
IS_TEXT_UNICODE_BUFFER_TOO_SMALL | There are too few characters in the buffer for meaningful analysis (fewer than two bytes).
IS_TEXT_UNICODE_SIGNATURE | The text contains the Unicode byte-order mark (BOM) 0xFEFF as its first character.
IS_TEXT_UNICODE_REVERSE_SIGNATURE | The text contains the Unicode byte-reversed byte-order mark (Reverse BOM) 0xFFFE as its first character.
IS_TEXT_UNICODE_ILLEGAL_CHARS | The text contains one of these Unicode-illegal characters: embedded Reverse BOM, UNICODE_NUL, CRLF (packed into one WORD), or 0xFFFF.
IS_TEXT_UNICODE_ODD_LENGTH | The number of bytes in the buffer is odd. A buffer of odd length cannot (by definition) be Unicode text.
IS_TEXT_UNICODE_NULL_BYTES | The text contains null bytes, which indicate non-ASCII text.
IS_TEXT_UNICODE_UNICODE_MASK | This flag constant is a combination of IS_TEXT_UNICODE_ASCII16, IS_TEXT_UNICODE_STATISTICS, IS_TEXT_UNICODE_CONTROLS, and IS_TEXT_UNICODE_SIGNATURE.
IS_TEXT_UNICODE_REVERSE_MASK | This flag constant is a combination of IS_TEXT_UNICODE_REVERSE_ASCII16, IS_TEXT_UNICODE_REVERSE_STATISTICS, IS_TEXT_UNICODE_REVERSE_CONTROLS, and IS_TEXT_UNICODE_REVERSE_SIGNATURE.
IS_TEXT_UNICODE_NOT_UNICODE_MASK | This flag constant is a combination of IS_TEXT_UNICODE_ILLEGAL_CHARS, IS_TEXT_UNICODE_ODD_LENGTH, and two currently unused bit flags.
IS_TEXT_UNICODE_NOT_ASCII_MASK | This flag constant is a combination of IS_TEXT_UNICODE_NULL_BYTES and three currently unused bit flags.
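Several of the tests in this table are deterministic and simple enough to sketch in portable C. The following is a rough illustration of what such checks could look like, not the actual Windows implementation: the function sketch_deterministic_tests is hypothetical, the numeric flag values are assumed to match winnt.h (they are defined here only so the sketch compiles without the Windows headers), and the statistical, control-character, and illegal-character tests are omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed to match winnt.h; defined only if the header is absent. */
#ifndef IS_TEXT_UNICODE_ASCII16
#define IS_TEXT_UNICODE_ASCII16            0x0001
#define IS_TEXT_UNICODE_SIGNATURE          0x0008
#define IS_TEXT_UNICODE_REVERSE_ASCII16    0x0010
#define IS_TEXT_UNICODE_REVERSE_SIGNATURE  0x0080
#define IS_TEXT_UNICODE_ODD_LENGTH         0x0200
#define IS_TEXT_UNICODE_NULL_BYTES         0x1000
#endif

/*
 * Hypothetical helper: applies a few deterministic checks from the table
 * and returns the flags whose conditions hold. The buffer is read as
 * little-endian 16-bit units, the native order on Windows.
 */
int sketch_deterministic_tests(const unsigned char *buf, size_t cb)
{
    int flags = 0;
    size_t i, units = cb / 2;
    int ascii16 = units > 0, rev_ascii16 = units > 0;

    assert(buf != NULL || cb == 0);

    if (cb % 2 != 0)
        flags |= IS_TEXT_UNICODE_ODD_LENGTH;        /* odd byte count */

    for (i = 0; i < cb; i++) {
        if (buf[i] == 0) {
            flags |= IS_TEXT_UNICODE_NULL_BYTES;    /* at least one 0x00 byte */
            break;
        }
    }

    if (units > 0) {
        unsigned first = (unsigned)buf[0] | ((unsigned)buf[1] << 8);
        if (first == 0xFEFF) flags |= IS_TEXT_UNICODE_SIGNATURE;
        if (first == 0xFFFE) flags |= IS_TEXT_UNICODE_REVERSE_SIGNATURE;
    }

    for (i = 0; i < units; i++) {
        unsigned lo = buf[2 * i], hi = buf[2 * i + 1];
        if (hi != 0 || lo > 0x7F) ascii16 = 0;      /* not zero-extended ASCII */
        if (lo != 0 || hi > 0x7F) rev_ascii16 = 0;  /* not byte-reversed ASCII */
    }
    if (ascii16)     flags |= IS_TEXT_UNICODE_ASCII16;
    if (rev_ascii16) flags |= IS_TEXT_UNICODE_REVERSE_ASCII16;

    return flags;
}
```

For the bytes 'H', 0x00, 'i', 0x00 this sketch reports IS_TEXT_UNICODE_ASCII16 and IS_TEXT_UNICODE_NULL_BYTES; the real function may weigh the same evidence differently.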
Return Values
The function returns nonzero if the data in the buffer passes the specified tests.
The function returns zero if the data in the buffer does not pass the specified tests.
In either case, the int pointed to by lpi contains the results of the
specific tests the function applied to make its determination.
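Because the result bits are grouped by the mask constants, a caller can summarize *lpi after the call without inspecting every flag. The helper below is a minimal sketch of one way to do that: the name summarize_result and its enum codes are hypothetical, the policy for mixed evidence is an arbitrary choice, and the mask values are assumed to match winnt.h.

```c
#include <assert.h>

/* Assumed to match winnt.h; defined only if the header is absent. */
#ifndef IS_TEXT_UNICODE_UNICODE_MASK
#define IS_TEXT_UNICODE_UNICODE_MASK      0x000F
#define IS_TEXT_UNICODE_REVERSE_MASK      0x00F0
#define IS_TEXT_UNICODE_NOT_UNICODE_MASK  0x0F00
#define IS_TEXT_UNICODE_NOT_ASCII_MASK    0xF000
#endif

/* Hypothetical summary codes, not part of the Windows API. */
enum text_guess {
    GUESS_NOT_UNICODE,      /* an illegal-character or odd-length bit is set */
    GUESS_UNICODE,          /* only native-order evidence is present */
    GUESS_REVERSE_UNICODE,  /* only byte-reversed evidence is present */
    GUESS_UNDECIDED         /* no evidence, or evidence for both orders */
};

/* Collapses the result bits written through lpi into one coarse verdict. */
enum text_guess summarize_result(int result)
{
    int native, reversed;

    assert(result >= 0);    /* all documented flags fit in the low 16 bits */

    if (result & IS_TEXT_UNICODE_NOT_UNICODE_MASK)
        return GUESS_NOT_UNICODE;

    native   = (result & IS_TEXT_UNICODE_UNICODE_MASK) != 0;
    reversed = (result & IS_TEXT_UNICODE_REVERSE_MASK) != 0;

    if (native && !reversed) return GUESS_UNICODE;
    if (reversed && !native) return GUESS_REVERSE_UNICODE;
    return GUESS_UNDECIDED;
}
```

Treating any bit under IS_TEXT_UNICODE_NOT_UNICODE_MASK as decisive is a design choice of this sketch; an application with different tolerance for false positives might weigh the bits differently.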
Remarks
As noted in
the preceding table of flag constants, the IS_TEXT_UNICODE_STATISTICS and
IS_TEXT_UNICODE_REVERSE_STATISTICS tests use statistical analysis. These tests
are not foolproof. The statistical tests assume certain amounts of variation
between low and high bytes in a string, and some ASCII strings can slip
through. For example, if lpBuffer points to the 4-byte ASCII string 0x41, 0x0A,
0x0D, 0x1D ('A', line feed, carriage return, and the control character 0x1D),
the string passes the IS_TEXT_UNICODE_STATISTICS test, though failure would be
preferable.
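To see how a Unicode reader would view that buffer, the four bytes can be regrouped into two little-endian 16-bit units. The sketch below performs only that regrouping; it does not reproduce the statistical test, whose details are not given here. The function name utf16le_unit is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns the index-th little-endian 16-bit unit of buf. */
unsigned utf16le_unit(const unsigned char *buf, size_t cb, size_t index)
{
    assert(buf != NULL && 2 * index + 1 < cb);   /* unit must lie inside the buffer */
    return (unsigned)buf[2 * index] | ((unsigned)buf[2 * index + 1] << 8);
}
```

For the bytes 0x41, 0x0A, 0x0D, 0x1D, the two units are 0x0A41 and 0x1D0D, neither of which is a zero-extended ASCII value.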