Ethereal-dev: RE: [Ethereal-dev] Syntax for frame contains

Note: This archive is from the project's previous web site, ethereal.com. This list is no longer active.

From: Biot Olivier <Olivier.Biot@xxxxxxxxxxx>
Date: Thu, 28 Aug 2003 13:03:31 +0200

Hi all,

I studied Unicode some time ago, and here are some important things to know:

Unicode or UCS

A standard that assigns a number to almost every character used in the world's writing systems. Each character's number is called a Unicode code point; code points run from U+0000 to U+10FFFF, so each fits comfortably in 32 bits.

UTF

Unicode (or UCS) Transformation Format, that is, a way of writing and reading Unicode text. Currently there are 3 UTFs: UTF-32, UTF-16 and UTF-8. They all share the following properties:

  1. UTF-n uses n-bit structures as its basic data type. One Unicode character, identified by a Unicode code point, is represented by one or more n-bit units in UTF-n (more than one is possible only for n = 16 or 8).
  2. The basic UTF is UTF-32, where there is a 1-to-1 identity mapping of Unicode code points to their binary representation.
  3. Multi-unit sequences (n = 16 or 8) are encoded in such a way that it is always possible to find the start of a character's code. UTF-16 does this with so-called "surrogate code points": a code point above U+FFFF is written as two 16-bit units taken from the reserved range U+D800 to U+DFFF, which never stand for characters on their own. UTF-8 does it with distinct bit patterns for lead bytes and continuation bytes (a continuation byte always has the form 10xxxxxx).
  4. Byte ordering in UTF-16 and UTF-32 can be recorded by means of a "Byte Order Mark" at the start of the text (BOM = U+FEFF; the byte-swapped value U+FFFE is purposely not a valid character, which allows byte order detection).
So UTF-8 means:
  1. The Unicode character code space (code points fit in 32 bits)
  2. An 8-bit representation of the UCS characters (some characters require multiple bytes)
For UTF-16 and UTF-32, byte ordering is important (b0 b1 is different from b1 b0)!
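
To make the properties above concrete, here is a minimal Python sketch (Python is my choice for illustration; it is not part of Ethereal). It counts how many n-bit units each UTF needs for a few sample code points and shows that byte order changes the bytes of a 16-bit unit. The helper name code_units and the sample characters are made up for this example.

```python
# Sample code points: U+0041, U+00E9, U+20AC, U+1F600 (chosen for illustration).
SAMPLES = ["A", "\u00e9", "\u20ac", "\U0001F600"]

def code_units(ch: str) -> dict:
    """Return how many 8-, 16- and 32-bit units one code point needs."""
    return {
        "utf-8": len(ch.encode("utf-8")),            # 1 to 4 bytes
        "utf-16": len(ch.encode("utf-16-be")) // 2,  # 1 or 2 16-bit units
        "utf-32": len(ch.encode("utf-32-be")) // 4,  # always one 32-bit unit
    }

for ch in SAMPLES:
    print(f"U+{ord(ch):04X}", code_units(ch))

# Byte order matters for the 16- and 32-bit forms: same code point, different bytes.
print("\u20ac".encode("utf-16-be").hex())  # 20ac
print("\u20ac".encode("utf-16-le").hex())  # ac20
```

Only the last sample (outside the Basic Multilingual Plane) needs two 16-bit units in UTF-16, which is the surrogate case from point 3.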
 
I highly recommend reading at least section 2.5 of the latest Unicode specification (http://www.unicode.org/book/preview/ch02.pdf). It's a pity the PDF version can't be printed (it is protected against printing and text selection).
 
Hope this helps to clarify Unicode and UTF!
 
Regards,
 
Olivier



-----Original Message-----
From: Guy Harris
Subject: Re: [Ethereal-dev] Syntax for frame contains


On Wednesday, August 27, 2003, at 11:55 AM, Gilbert Ramirez wrote:

> I don't know Unicode very well, so I don't know all the different types
> of Unicode encodings, so I won't even guess as to what the names for
> those "functions" would be, but they would follow the above example.

(For now, we don't support non-ASCII characters very well in Ethereal,
so I'll assume only ASCII in search strings for now.)

The encodings we'll probably have to deal with are:

        1) little-endian UCS-2 - 2-byte characters, with the lower 8 bits
first and the upper 8 bits after that (used in SMB and various DCE RPC
protocols from Microsoft);

        2) big-endian UCS-2 - (I don't know whether there are any protocols
that do that - perhaps some DCE RPC-based protocols if the sender is
big-endian);

        3) UTF-8 - ASCII characters map to 1 byte containing the character,
other characters map to multiple bytes (note that UTF-8 can encode
4-byte characters, so it gets ISO 10646 in its entirety, not just the
Basic Multilingual Plane subset that's handled by UCS-2).
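
As a rough illustration of what matching a search string against these three encodings could look like, here is a Python sketch. It is not Ethereal code: the names search_patterns and frame_contains and the sample frame bytes are hypothetical, and only ASCII search strings are accepted, in line with the assumption above. For ASCII text, Python's "utf-16-le" and "utf-16-be" codecs produce the same bytes as little-endian and big-endian UCS-2.

```python
def search_patterns(text: str) -> dict:
    """Byte patterns for an ASCII search string in the three encodings listed above."""
    if not text.isascii():
        raise ValueError("only ASCII search strings are handled in this sketch")
    return {
        "ucs2-le": text.encode("utf-16-le"),  # low byte first, e.g. 49 00
        "ucs2-be": text.encode("utf-16-be"),  # high byte first, e.g. 00 49
        "utf-8": text.encode("utf-8"),        # one byte per ASCII character
    }

def frame_contains(frame: bytes, text: str) -> list:
    """Return the names of the encodings whose pattern occurs in the frame bytes."""
    return [name for name, pat in search_patterns(text).items() if pat in frame]

# Made-up frame: two leading bytes, then "IPC$" in little-endian UCS-2, then a terminator.
frame = b"\x00\x01" + "IPC$".encode("utf-16-le") + b"\x00\x00"
print(frame_contains(frame, "IPC$"))  # ['ucs2-le']
```

A plain byte search like this ignores character alignment, so a real implementation would probably want to check that a 2-byte match starts on a character boundary.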

Unicode has a "byte order mark", which is a character that's a "zero
width no-break space" (i.e., a space character that takes no space :-))
- the byte-swapped version of it is not a legal Unicode character (and
never will be, as far as I know), so a Unicode string can start with a
byte order mark, and something scanning it can infer the byte order
from that byte order mark.  Not all Unicode strings necessarily begin
with a byte order mark, however; Microsoft don't use it in SMB or their
RPCs, for example.  (The byte order is implicitly little-endian for
SMB; it's presumably the byte order from the DCE RPC header in the
RPCs, although, in practice, little-endian might even be used on
big-endian machines, at least for the Microsoft RPCs.)
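
A small Python sketch of the inference described above, for 16-bit strings: use a leading byte order mark if there is one, otherwise fall back to a byte order supplied by the caller. The function name detect_utf16_order and the little-endian default (matching the SMB case) are assumptions made for this example, not existing Ethereal behaviour.

```python
import codecs

def detect_utf16_order(data: bytes, default: str = "little"):
    """Return (byte_order, payload) for a 16-bit Unicode string.

    A leading U+FEFF decides the order; without one, fall back to `default`
    (little-endian here, the implicit order for SMB).
    """
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "little", data[2:]
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "big", data[2:]
    return default, data                       # no mark: order comes from context

order, payload = detect_utf16_order(b"\xfe\xff\x00H\x00i")
print(order, payload.decode("utf-16-be"))  # big Hi
```

For protocols that never send the mark, the caller would pass the order taken from the protocol itself, for example from the DCE RPC header.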