Note: This archive is from the project's previous web site, ethereal.com. This list is no longer active.
Hi all,
I studied Unicode some time ago, and here are some important things to know:
Unicode or UCS
A standard representation of almost all known glyphs used by mankind. Every glyph is assigned a number called a Unicode code point, which fits in 32 bits (code points range from U+0000 to U+10FFFF).
UTF
Unicode (or UCS) Transformation Format: a way of writing and reading Unicode text as a sequence of fixed-size units. The three UTFs in common use are UTF-32, UTF-16 and UTF-8. They share the following properties:
- UTF-n uses n-bit units as its basic data type. One Unicode glyph, represented by a Unicode code point, is encoded as one or more n-bit units in UTF-n (more than one unit is only ever needed for n = 16 or 8).
- The basic UTF is UTF-32, where there is a 1-to-1 identity mapping of Unicode code points to their binary representation.
- Multi-unit sequences (n = 16 or 8) are encoded in such a way that it is always possible to find the start of a glyph's code. UTF-16 does this with "surrogate" code units (0xD800-0xDFFF), a range reserved so that it never represents a glyph on its own; UTF-8 does it by giving lead bytes and continuation bytes (10xxxxxx) distinct bit patterns.
- Byte ordering in UTF-16 and UTF-32 can be recorded by means of a leading "Byte Order Mark" (BOM = U+FEFF; its byte-swapped form U+FFFE is purposely not a valid character, which allows byte order detection).
- Unicode character code space (32-bit)
- 8-bit representation of the UCS glyphs (some glyphs require multiple bytes)
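To make the unit counts above concrete, here is a minimal Python sketch. The helper names `code_units` and `is_utf8_start` are made up for illustration; the only library calls used are Python's built-in `str.encode` codecs.

```python
def code_units(ch: str) -> dict:
    """Count how many n-bit units one code point needs in each UTF."""
    return {
        "UTF-32": len(ch.encode("utf-32-be")) // 4,  # 32-bit units
        "UTF-16": len(ch.encode("utf-16-be")) // 2,  # 16-bit units
        "UTF-8": len(ch.encode("utf-8")),            # 8-bit units
    }


def is_utf8_start(byte: int) -> bool:
    """True unless the byte is a UTF-8 continuation byte (bit pattern 10xxxxxx)."""
    return (byte & 0xC0) != 0x80


if __name__ == "__main__":
    # U+0041, U+00E9, U+20AC and U+1D11E (the last one lies outside the BMP).
    for ch in ("A", "\u00e9", "\u20ac", "\U0001D11E"):
        print(f"U+{ord(ch):04X}", code_units(ch))
```

UTF-32 always needs exactly one unit, UTF-16 needs two only for code points above U+FFFF, and UTF-8 needs between one and four bytes. `is_utf8_start` shows why a scanner can resynchronize in the middle of a UTF-8 stream: it only has to skip bytes of the form 10xxxxxx.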
-----Original Message-----
From: Guy Harris
Subject: Re: [Ethereal-dev] Syntax for frame contains
On Wednesday, August 27, 2003, at 11:55 AM, Gilbert Ramirez wrote:
> I don't know Unicode very well, so I don't know all the different types
> of Unicode encodings, so I won't even guess as to what the names for
> those "functions" would be, but they would follow the above example.
(For now, we don't support non-ASCII characters very well in Ethereal, so I'll assume only ASCII in search strings for now.)
The encodings we'll probably have to deal with are:
1) little-endian UCS-2 - 2-byte characters, with the lower 8 bits first and the upper 8 bits after that (used in SMB and various DCE RPC protocols from Microsoft)
2) big-endian UCS-2 - (I don't know whether there are any protocols that do that - perhaps some DCE RPC-based protocols if the sender is big-endian);
3) UTF-8 - ASCII characters map to 1 byte containing the character, other characters map to multiple bytes (note that UTF-8 can encode 4-byte characters, so it gets ISO 10646 in its entirety, not just the Basic Multilingual Plane subset that's handled by UCS-2).
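As a rough illustration of what searching for one string under the three encodings listed above could look like, here is a Python sketch. It is not Ethereal code: `search_patterns` and `find_in_frame` are hypothetical names. It also assumes the search string stays within the Basic Multilingual Plane, where Python's `utf-16-le` and `utf-16-be` codecs produce the same bytes as UCS-2.

```python
def search_patterns(text: str) -> dict:
    """Encode a search string in each encoding a protocol might use on the wire."""
    return {
        "ucs2-le": text.encode("utf-16-le"),  # low byte first (assumes BMP-only text)
        "ucs2-be": text.encode("utf-16-be"),  # high byte first (assumes BMP-only text)
        "utf-8": text.encode("utf-8"),        # ASCII stays 1 byte per character
    }


def find_in_frame(frame: bytes, text: str) -> list:
    """Return the names of the encodings under which `text` occurs in `frame`."""
    return [name for name, pattern in search_patterns(text).items()
            if pattern in frame]


if __name__ == "__main__":
    # A made-up frame carrying "foo" as little-endian UCS-2.
    frame = b"\x10\x20" + "foo".encode("utf-16-le") + b"\xff"
    print(find_in_frame(frame, "foo"))
```

For an ASCII-only search string the three byte patterns differ only in where the zero bytes fall, which is why a plain byte-wise match for "foo" misses the same string sent as UCS-2.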
Unicode has a "byte order mark", which is a character that's a "zero width no-break space" (i.e., a space character that takes no space :-)) - the byte-swapped version of it is not a legal Unicode character (and never will be, as far as I know), so a Unicode string can start with a byte order mark, and something scanning it can infer the byte order from that byte order mark. Not all Unicode strings necessarily begin with a byte order mark, however; Microsoft don't use it in SMB or their RPCs, for example. (The byte order is implicitly little-endian for SMB; it's presumably the byte order from the DCE RPC header in the RPCs, although, in practice, little-endian might even be used on big-endian machines, at least for the Microsoft RPCs.)
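The byte order inference described above can be sketched in Python as follows. The function names are hypothetical, and the little-endian fallback is only an assumption mirroring the SMB case mentioned above; a real dissector would take the default from the protocol.

```python
def detect_utf16_byte_order(data: bytes, default: str = "little") -> tuple:
    """Return (byte_order, payload), stripping a leading byte order mark if present.

    U+FEFF is serialized as FE FF in big-endian and FF FE in little-endian.
    Without a BOM, fall back to `default` (an assumption the caller must supply).
    """
    if data[:2] == b"\xfe\xff":
        return "big", data[2:]
    if data[:2] == b"\xff\xfe":
        return "little", data[2:]
    return default, data


def decode_utf16(data: bytes, default: str = "little") -> str:
    """Decode 16-bit Unicode text using the detected or defaulted byte order."""
    order, payload = detect_utf16_byte_order(data, default)
    codec = "utf-16-le" if order == "little" else "utf-16-be"
    return payload.decode(codec)


if __name__ == "__main__":
    print(decode_utf16(b"\xff\xfeh\x00i\x00"))          # BOM says little-endian
    print(decode_utf16(b"\x00h\x00i", default="big"))   # no BOM, caller's default
```

The sketch only looks at the first two bytes, so it cannot tell a UTF-16 little-endian BOM from a UTF-32 little-endian one (FF FE 00 00); that is acceptable only when the protocol already guarantees 16-bit text.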