Original

About extended character sets

Windows and Linux both support single-byte and multibyte character sets as well as Unicode. With a single-byte character set (SBCS), each byte in a string represents one character. The ANSI character set used by many Western operating systems is a single-byte character set.
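
The one-byte-per-character property can be sketched outside Object Pascal. The following Python fragment is an illustration only: the helper name sbcs_chars and the choice of code page 1252 as the "ANSI" character set are assumptions, not part of the original text.

```python
def sbcs_chars(data, codec="cp1252"):
    """Decode a single-byte string one byte at a time."""
    # In an SBCS every byte is a complete character, so per-byte decoding is safe.
    return [bytes([b]).decode(codec) for b in data]

text = "caf\u00e9"              # four characters; the last is e with acute accent
data = text.encode("cp1252")    # four bytes: 63 61 66 E9
print(len(text), len(data))     # both lengths are 4
print(sbcs_chars(data))
```

Because byte count and character count always agree, indexing a single-byte string by position is safe.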

In a multibyte character set (MBCS), some characters are represented by one byte and others by more than one byte. The first byte of a multibyte character is called the lead byte. In general, the lower 128 characters of a multibyte character set map to the 7-bit ASCII characters, and any byte whose ordinal value is greater than 127 is the lead byte of a multibyte character. Only single-byte characters can contain the null value (#0). Multibyte character sets, especially double-byte character sets (DBCS), are widely used for Asian languages; the UTF-8 character set used by Linux is a multibyte encoding of Unicode.
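
As a rough illustration of the lead-byte rule, the Python sketch below splits a double-byte string into characters. It assumes the simplified rule stated above (every byte above 127 is a lead byte followed by exactly one trail byte); real code pages define narrower lead-byte ranges. The function name split_dbcs and the use of code page 932 (Japanese) are choices made for this example only.

```python
def split_dbcs(data):
    """Split a double-byte string into characters.

    Simplifying assumption: any byte above 127 is a lead byte and is
    followed by exactly one trail byte.
    """
    chars = []
    i = 0
    while i < len(data):
        width = 2 if data[i] > 127 else 1
        chars.append(data[i:i + width])
        i += width
    return chars

# "A", HIRAGANA A (82 A0) and KATAKANA SO (83 5C) in code page 932.
sample = "A\u3042\u30bd".encode("cp932")
print(len(sample))              # 5 bytes
print(len(split_dbcs(sample)))  # 3 characters
```

Note that the trail byte of the last character is 5C, the same value as an ASCII backslash. A trail byte can only be told apart from a single-byte character by scanning from the start of the string.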

In the Unicode encoding described here (UCS-2), each character is represented by two bytes. Thus a Unicode string is a sequence not of individual bytes but of two-byte words. Unicode characters and strings are also called wide characters and wide character strings. The first 256 Unicode characters map to the ANSI character set. The Windows operating system supports Unicode (UCS-2). The Linux operating system supports UCS-4, a superset of UCS-2. Delphi/Kylix supports UCS-2 on both platforms.
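
The two-byte layout can be sketched as follows. This is a Python illustration, not Delphi code: the helper to_ucs2_words is hypothetical, and little-endian byte order is assumed. Characters beyond U+FFFF are rejected because a single two-byte word cannot hold them.

```python
import struct

def to_ucs2_words(text):
    """Return the 16-bit words that represent text in UCS-2 (little-endian)."""
    if any(ord(ch) > 0xFFFF for ch in text):
        raise ValueError("UCS-2 cannot represent characters beyond U+FFFF")
    raw = text.encode("utf-16-le")          # two bytes per character here
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))

words = to_ucs2_words("A\u00e9\u4e2d")
print([hex(w) for w in words])              # one word per character

# Characters below 256 keep their single-byte (Latin-1) value.
print("\u00e9".encode("latin-1")[0] == words[1])
```

Each element of the result is one wide character, so indexing a wide string by position selects a whole character under this encoding.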

Object Pascal supports single-byte and multibyte characters and strings through the Char, PChar, AnsiChar, PAnsiChar, and AnsiString types. Indexing of multibyte strings is not reliable, since S[i] represents the ith byte (not necessarily the ith character) in S. However, the standard string-handling functions have multibyte-enabled counterparts that also implement locale-specific ordering for characters. (Names of multibyte functions usually start with Ansi-. For example, the multibyte version of StrPos is AnsiStrPos.) Multibyte character support is operating-system dependent and based on the current locale.
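
The difference between a byte-wise search and a multibyte-aware search can be sketched as below. This Python fragment only imitates the idea behind routines such as AnsiStrPos; it is not their implementation. The names char_starts and mb_find are hypothetical, and the same simplified lead-byte rule is assumed (a byte above 127 starts a two-byte character).

```python
def char_starts(data):
    """Byte offsets at which a character begins (simplified DBCS rule)."""
    starts = []
    i = 0
    while i < len(data):
        starts.append(i)
        i += 2 if data[i] > 127 else 1
    return starts

def mb_find(haystack, needle):
    """Offset of the first match that begins on a character boundary, or -1."""
    for start in char_starts(haystack):
        if haystack.startswith(needle, start):
            return start
    return -1

# KATAKANA SO in code page 932 is 83 5C; a real backslash (5C) follows it.
text = "\u30bd".encode("cp932") + b"\\"
print(text.find(b"\\"))     # 1: the byte-wise search hits the trail byte
print(mb_find(text, b"\\")) # 2: the boundary-aware search finds the backslash
```

The byte at index 1 is the second half of a character, which is the same reason S[i] cannot be trusted to return the ith character of a multibyte string.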

Object Pascal supports Unicode characters and strings through the WideChar, PWideChar, and WideString types.

Topic groups

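Translation

About extended character sets

Windows and Linux both support single-byte and multibyte character sets, as well as Unicode. In a single-byte character set (SBCS), each byte in a string represents one character. The standard (ANSI) character set used by many Western operating systems is a single-byte character set.

In a multibyte character set (MBCS), some characters are represented by one byte and others by more than one byte. The first byte of a multibyte character is called the lead byte. Usually, the lowest 128 characters of a multibyte character set correspond to the 7-bit ASCII characters, and any other byte whose ordinal value is greater than 127 is the lead byte of a multibyte character. Only single-byte characters can contain the null value (#0). Multibyte character sets, especially double-byte character sets (DBCS), are widely used for Asian languages; the UTF-8 character set used by Linux, for example, is a multibyte encoding of Unicode.

In the Unicode character set, each character is represented by two bytes. A Unicode string is therefore a sequence of two-byte words rather than a sequence of single bytes. Unicode characters and Unicode strings are also called wide characters and wide strings. The first 256 Unicode characters correspond to the standard (ANSI) character set. The Windows operating system supports Unicode (UCS-2). The Linux operating system supports UCS-4 (a superset of UCS-2). Delphi/Kylix supports UCS-2 on both platforms.

Object Pascal supports single-byte and multibyte characters and strings through the Char, PChar, AnsiChar, PAnsiChar, and AnsiString types. Indexing a multibyte string is not reliable, because S[i] denotes the ith byte of S (not necessarily the ith character). However, Borland provides counterparts to the standard string-handling functions that handle multibyte characters and strings and implement locale-specific ordering. (Names of multibyte functions usually start with Ansi-. For example, the multibyte counterpart of StrPos is AnsiStrPos.) Support for multibyte characters depends on the operating system and is based on the current locale.

Object Pascal supports Unicode characters and strings through the WideChar, PWideChar, and WideString types.

Topic groups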
