Windows and Linux both support single-byte and multibyte character sets
as well as Unicode. With a single-byte character set (SBCS), each byte
in a string represents one character. The ANSI character set used by many
Western operating systems is a single-byte character set.
In a multibyte character set (MBCS), some characters are represented by one byte and
others by more than one byte. The first byte of a multibyte character is called
the lead byte. In general, the lower 128 characters of a
multibyte character set map to the 7-bit ASCII characters, and any byte whose
ordinal value is greater than 127 is the lead byte of a multibyte character.
Only single-byte characters can contain the null value (#0). Multibyte
character sets, especially double-byte character sets (DBCS), are widely used for
Asian languages, while the UTF-8 character set used by Linux is a multibyte
encoding of Unicode.
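The difference between bytes and characters in a multibyte string can be sketched as follows. The example uses UTF-8 because its byte layout is fixed by the standard rather than by the current locale. Note that in UTF-8 the bytes $80..$BF are trail bytes, not lead bytes, so the general rule above is only an approximation for that encoding. Utf8CharCount is a hypothetical helper written for this illustration, not a library routine.

```pascal
program Utf8CountDemo;
{$IFDEF FPC}{$mode delphi}{$ENDIF}

// Hypothetical helper (not part of the RTL): counts the characters in a
// UTF-8 encoded byte string. Bytes of the form 10xxxxxx ($80..$BF) are
// trail bytes; every other byte starts a new character.
function Utf8CharCount(const S: AnsiString): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(S) do
    if (Ord(S[I]) and $C0) <> $80 then
      Inc(Result);
end;

var
  S: AnsiString;
begin
  // 'A' followed by the UTF-8 encoding of U+4E2D (bytes $E4 $B8 $AD).
  S := AnsiString('A') + AnsiChar($E4) + AnsiChar($B8) + AnsiChar($AD);
  WriteLn('Bytes: ', Length(S));              // 4
  WriteLn('Characters: ', Utf8CharCount(S));  // 2
end.
```

Length reports four bytes, while the string holds only two characters.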
In the Unicode character set, each character is represented by two bytes. Thus a
Unicode string is a sequence not of individual bytes but of two-byte words.
Unicode characters and strings are also called wide characters
and wide character strings. The first 256 Unicode
characters map to the ANSI character set. The Windows operating system supports
Unicode (UCS-2). The Linux operating system supports UCS-4, a superset of
UCS-2. Delphi/Kylix supports UCS-2 on both platforms.
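As a minimal sketch of the two-byte representation, the following program builds a short wide string and prints its element count, its size in bytes, and the ordinal values of its elements. WideByteSize is a hypothetical helper introduced only for this example.

```pascal
program WideStringDemo;
{$IFDEF FPC}{$mode delphi}{$ENDIF}

// Hypothetical helper: storage taken by the characters of W, in bytes.
function WideByteSize(const W: WideString): Integer;
begin
  Result := Length(W) * SizeOf(WideChar);
end;

var
  W: WideString;
begin
  W := 'A';
  W := W + WideChar($4E2D);             // append one non-ASCII character
  WriteLn('Elements: ', Length(W));     // 2
  WriteLn('Bytes: ', WideByteSize(W));  // 4
  WriteLn('Ord(W[1]): ', Ord(W[1]));    // 65, same value as in ASCII/ANSI
  WriteLn('Ord(W[2]): ', Ord(W[2]));    // 20013 ($4E2D)
end.
```

Each element of W is one two-byte word, so indexing a wide string yields whole UCS-2 characters rather than single bytes.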
Object Pascal supports single-byte and multibyte characters and strings through the Char,
PChar, AnsiChar, PAnsiChar, and AnsiString types.
Indexing of multibyte strings is not reliable, since S[i] represents the
i-th byte (not necessarily the i-th character) in S. However, the
standard string-handling functions have multibyte-enabled counterparts that
also implement locale-specific ordering for characters. (Names of multibyte
functions usually start with Ansi-. For example, the multibyte version
of StrPos is AnsiStrPos.) Multibyte character support is
operating-system dependent and based on the current locale.
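The following sketch illustrates why byte-wise searching can give wrong results in a double-byte string, and what a character-aware search has to do instead. It is not the implementation of AnsiStrPos or of any other library routine. IsLeadByte and DbcsCharPos are hypothetical helpers, and the Shift-JIS lead-byte ranges are hard-coded as an assumption so that the example does not depend on the current locale.

```pascal
program DbcsSearchDemo;
{$IFDEF FPC}{$mode delphi}{$ENDIF}

// Assumption for illustration: Shift-JIS lead-byte ranges, hard-coded.
// Real code would consult the current locale instead.
function IsLeadByte(B: Byte): Boolean;
begin
  Result := ((B >= $81) and (B <= $9F)) or ((B >= $E0) and (B <= $FC));
end;

// Hypothetical helper: 1-based byte index of the first occurrence of the
// single-byte character C that is a whole character in S, or 0 if none.
function DbcsCharPos(C: AnsiChar; const S: AnsiString): Integer;
var
  I: Integer;
begin
  Result := 0;
  I := 1;
  while I <= Length(S) do
  begin
    if IsLeadByte(Ord(S[I])) then
      Inc(I, 2)                 // skip the lead byte and its trail byte
    else
    begin
      if S[I] = C then
      begin
        Result := I;
        Exit;
      end;
      Inc(I);
    end;
  end;
end;

var
  S: AnsiString;
begin
  // $83 $5C is a single double-byte character in Shift-JIS; its trail
  // byte has the same value as the backslash.
  S := AnsiString('a') + AnsiChar($83) + AnsiChar($5C) + AnsiString('\b');
  WriteLn('Byte-wise Pos: ', Pos('\', S));            // 3 (a trail byte)
  WriteLn('Character-aware: ', DbcsCharPos('\', S));  // 4 (the real backslash)
end.
```

The byte-wise search stops inside the double-byte character, which is the kind of false match the multibyte-enabled functions are meant to avoid.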
Object Pascal supports Unicode characters and strings through the WideChar, PWideChar, and WideString types.
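The three wide types can be used together much like their single-byte counterparts. The sketch below walks a null-terminated wide string through a PWideChar; WideLen is a hypothetical helper written for this example only.

```pascal
program PWideCharDemo;
{$IFDEF FPC}{$mode delphi}{$ENDIF}

// Hypothetical helper: counts WideChar elements up to the terminating #0.
function WideLen(P: PWideChar): Integer;
begin
  Result := 0;
  while P^ <> #0 do
  begin
    Inc(Result);
    Inc(P);        // advances by one WideChar, that is, two bytes
  end;
end;

var
  W: WideString;
begin
  W := 'wide';
  WriteLn(WideLen(PWideChar(W)));  // 4
end.
```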
Handling null-terminated strings
Editor's note
Unicode is the unified character encoding standard; as originally established in 1988-1991, it encodes each character in two bytes.