This page describes the Unicode support in Object Icon. This provides an additional string type, which behaves like a conventional string, but can contain any unicode character. Csets have also been enhanced in Object Icon.
ucs (standing for Unicode character string) is a new builtin type, whose behaviour closely mirrors that of the conventional Icon string. It operates by providing a wrapper around a conventional Icon string, which must be in utf-8 format. This has several advantages, and only one serious disadvantage, namely that a utf-8 string is not randomly accessible, in the sense that one cannot say where the representation for unicode character i begins without scanning the string. To alleviate this disadvantage, the ucs type maintains an index of offsets into the utf-8 string to make random access faster. The size of the index is only a few percent of the total allocation for the ucs.
Another potential disadvantage of utf-8 as an internal format, namely that it is awkward to edit a string (since a new character can have a different length to the one it's replacing), happily doesn't apply in Object Icon, since strings are immutable.
Two new escape sequences are provided to represent the utf-8 character sequences of unicode characters. These are \u, which is followed by up to four hex digits, and \U, which is followed by up to six hex digits. Each expands to between one and four characters, depending on the unicode character concerned. Note that a literal written with these escapes is still just an ordinary string, rather than a ucs.
A ucs string can be created at compile-time, as a literal, or at runtime, via the builtin ucs function. To create a literal, prefix a u to an ordinary string literal, which must be valid utf-8; for example :-
s := u"\u0001*\u00ff*\u1234*\U10ffff"
To use ucs, just call it like any other function.
s := ucs(x)
The parameter to ucs must be something which can be converted to a string, and that string must be valid utf-8 (otherwise ucs fails). Note that all plain ascii strings (ie, those with only characters less than 128) are in utf-8 format.
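The difference between an ordinary string built with the new escapes and a ucs can be seen by comparing their sizes. The following is a minimal sketch, assuming only the behaviour described above; the variable names are illustrative.

```icon
import io

procedure main()
   local s, u
   # An ordinary string: each escape expands to its utf-8 bytes.
   s := "\u0001*\u00ff*\u1234*\U10ffff"
   # Runtime conversion; this succeeds because s is valid utf-8.
   u := ucs(s)
   write("characters in string: ", *s)
   write("characters in ucs: ", *u)
   # A lone "\xff" is not valid utf-8, so ucs fails.
   if ucs("\xff") then
      write("converted")
   else
      write("conversion failed")
end
```

The four escapes occupy 1, 2, 3 and 4 bytes respectively, so together with the three asterisks the string should have size 13, while the ucs should have size 7.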
The ucs type supports all of the familiar string operations, with the same semantics as the conventional string type.
String operations which take two parameters can usually mix string and ucs types, although some care is needed. The general rule is: if either parameter is a ucs, then the other parameter must be convertible to a ucs. For example, consider string catenation. The expression
"abc" || u"\u1234"
is valid, and its result is a ucs, because "abc" is valid utf-8 and hence convertible to a ucs. However, the expression
"\xff" || u"\u1234"
is invalid and will cause a runtime error, because "\xff" is not valid utf-8 and hence cannot be converted to a ucs.
Converting a ucs back to a normal string produces the utf-8 representation. This is the internal representation, so this operation is very fast.
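As a hedged illustration of mixing the two types and converting back, the sketch below catenates a plain string with a ucs and then applies the builtin string conversion. It assumes that string(u) is the way to request the conversion described above.

```icon
import io

procedure main()
   local u, s
   # "abc" is valid utf-8, so it is converted and the result is a ucs.
   u := "abc" || u"\u1234"
   write("ucs size: ", *u)
   # Converting back yields the utf-8 bytes of the same text.
   s := string(u)
   write("string size: ", *s)
end
```

Since \u1234 needs three bytes in utf-8, the ucs should have size 4 and the converted string size 6.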
In Icon, csets can only represent characters in the range 0 to 255. Object Icon extends this range to cover all possible unicode characters (0 up to 0x10FFFF).
The \u and \U escape sequences can be used to specify characters greater than 255. For example, '\x01\xff\u1234\U10ffff' specifies a cset with four characters. You can also specify a range of characters by using a hyphen. Thus 'a-zA-Z' has the lower and upper case letters, '0-9' has the digits, and so on. A new keyword, &uset, contains all of the possible characters and is equivalent to a cset covering the whole range 0 to 0x10FFFF.
Generating the elements of a cset with ! will produce one-character strings for those elements less than 256, and one-character ucs strings for those elements greater than or equal to 256. So for example the expression
!'\x01\xff\u1234\U10ffff'
produces the following four results
"\x01" "\xff" u"\u1234" u"\U10ffff"
Indexing a cset will produce either a normal string or a ucs string, depending on whether any of the elements in the range are greater than or equal to 256. For example the expression
'\x01\xff\u1234\U10ffff'[1:1 to 5]
produces the following results
"" "\x01" "\x01\xff" u"\x01\u00ff\u1234" u"\x01\u00ff\u1234\U10ffff"
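The rules above for generating cset elements can be checked with a short program. This is only a sketch; it assumes that the builtin type and image functions report a ucs distinctly from a string.

```icon
import io

procedure main()
   local c, e
   c := '\x01\xff\u1234\U10ffff'
   write("size of cset: ", *c)
   # Elements below 256 come out as strings, the rest as ucs.
   every e := !c do
      write(image(e), " has type ", type(e))
end
```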
The ord() function can be used to access the numerical values of some or all of the characters in a cset (see below for a full explanation of ord).
Any cset can be converted to a ucs string, but only one containing only characters less than 256 can be converted to a normal string.
The ord function expands on its Icon predecessor. The first parameter, x, can be a string, a ucs string, or a cset. The optional parameters i and j specify a range within x, and default to 1 and 0 respectively. The result sequence is the integer character values of the specified range. For example, ord applied to the four characters \x01, \xff, \u1234 and \U10ffff generates
1 255 4660 1114111
ord(&ucase, 5, 10)
69 70 71 72 73
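Because ord is a generator, its results are normally consumed with every. The sketch below repeats the ranged call shown above and adds one call with the range omitted; the ucs literal is just an illustrative value.

```icon
import io

procedure main()
   local n
   # Positions 5 to 10 of &ucase cover the letters E to I.
   every n := ord(&ucase, 5, 10) do
      write(n)
   # With i and j omitted, every character is visited in turn.
   every write(ord(u"\u1234*"))
end
```

The first loop should write 69 to 73, and the second 4660 followed by 42.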
uchar is the ucs equivalent of the char function. It produces a one-character ucs string containing the character with the given number.
This function will try to convert x to either a ucs or a conventional string as appropriate. If x is a string or ucs, it is just returned. If x is a cset then it is converted to a string if its highest char is < 256; otherwise it is converted to a ucs. For any other type, normal string conversion is attempted.
The Text class (in the lang package) has some static methods which may prove useful.
has_ord(c, x) tests whether character number x is in cset c.
utf8_seq(i) produces the utf-8 string representation of character i. This is useful for building up a utf-8 string which can then be passed to ucs. For example, consider the problem of converting an iso-8859-1 format string to a ucs. One way to do this would be :-
procedure iso8859_to_ucs(s)
   local t
   t := u""
   every t ||:= uchar(ord(s))
   return t
end
The drawback with this method is that it creates lots of temporary ucs values in the every loop (uchar produces one, and the old value of t is thrown away).
A quicker way is to create a utf-8 string first, and then create the ucs result at the end :-
import lang(Text)

procedure iso8859_to_ucs(s)
   local t
   t := ""
   every t ||:= Text.utf8_seq(ord(s))
   return ucs(t)
end
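The two static methods described above can also be tried on their own. The following is a small sketch; the character number 4660 (0x1234) is chosen only for illustration, and has_ord is assumed to succeed or fail like an ordinary test.

```icon
import io
import lang(Text)

procedure main()
   # 4660 is 0x1234, which is a member of this one-character cset.
   if Text.has_ord('\u1234', 4660) then
      write("4660 is in the cset")
   # The utf-8 form of this character is three bytes long.
   write("utf-8 length: ", *Text.utf8_seq(4660))
end
```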
Source code files can be edited in non-ASCII format.
To specify a file's encoding, a preprocessor directive, $encoding, is used. The directive is followed by the encoding name, which at present can take one of three possible values :-
ASCII
ISO-8859-1
UTF-8
Each source file is processed as a sequence of codepoints, which are converted from the input bytes, based on the encoding. For ASCII encoding and ISO-8859-1 encoding, each codepoint is the same as each input byte. The only difference is that ASCII restricts the range of codepoints to 0-127, as opposed to 0-255 for ISO-8859-1. For UTF-8 encoding each codepoint may correspond to several input bytes, and may be any valid Unicode codepoint.
Other than escape sequences, each codepoint within a string, ucs or cset literal will correspond to exactly one character in that literal. For a string, the codepoint must be in the range 0-255; otherwise a compile-time error is signalled.
import io

$encoding UTF-8

procedure main()
   local s
   s := u"Министры иностранных дел Европейского союза утвердили"
   s ? every write(upto('ив'))
end
This program produces the output
2 4 10 27 47 51 53