Unicode support in Object Icon

Introduction

This page describes the Unicode support in Object Icon. Object Icon provides an additional string type, which behaves like a conventional string but can contain any Unicode character. Csets have also been extended to cover the full Unicode range.

The `ucs` type

ucs (standing for Unicode character string) is a new builtin type whose behaviour closely mirrors that of the conventional Icon string. It operates by providing a wrapper around a conventional Icon string, which must be in utf-8 format. This has several advantages and only one serious disadvantage: a utf-8 string is not randomly accessible, in the sense that one cannot say directly where the representation of Unicode character i begins. To alleviate this, the ucs type maintains an index of offsets into the utf-8 string to make random access faster. The size of the index is only a few percent of the total allocation for the ucs object.

Another potential disadvantage of utf-8 as an internal format, namely that it is awkward to edit a string (since a new character can have a different length to the one it’s replacing), happily doesn’t apply in Object Icon, since strings are immutable.

utf-8 escape sequences

Two new escape sequences are provided to represent the utf-8 character sequences of Unicode characters. These are \u, which is followed by up to four hex digits, and \U, which is followed by up to six hex digits. Each expands to between one and four characters, depending on the Unicode character concerned. So, for example, the line

write(image("\u0001*\u00ff*\u1234*\U10ffff"))

writes

"\x01*\xc3\xbf*\xe1\x88\xb4*\xf4\x8f\xbf\xbf"

Note that this is still just an ordinary string, rather than a ucs string.
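As a small illustration of this point, the following sketch measures a plain string built from these escapes. It uses only write from the io package, the builtin image function and the * size operator; the values in the comments follow from the utf-8 expansion shown above.

```icon
import io

procedure main()
   local s
   # A plain string: each escape has already been expanded to utf-8 bytes
   s := "\u00ff*\u1234"
   write(*s)          # 6: two bytes, then "*", then three bytes
   write(image(s))    # "\xc3\xbf*\xe1\x88\xb4"
end
```

The size is a count of bytes, not of Unicode characters, because no ucs value is involved.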

Creating a `ucs` string

A ucs string can be created at compile-time, as a literal, or at runtime, via the builtin ucs function. To create a literal, prefix a u to an ordinary string literal, which must be valid utf-8; for example :-

s := u"\u0001*\u00ff*\u1234*\U10ffff"

To use ucs, just call it like any other function.

s := ucs(x)

The parameter to ucs must be convertible to a string, and that string must be valid utf-8; otherwise ucs fails. Note that all plain ascii strings (i.e. those containing only characters less than 128) are already valid utf-8.
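Because ucs fails on invalid input rather than raising an error, it can be used directly as a validity test. The following is a minimal sketch of that idea; the three sample strings are chosen purely for illustration.

```icon
import io

procedure main()
   local s, t
   every s := "abc" | "\u1234" | "\xff" do {
      # ucs(s) fails if s is not valid utf-8, so the else branch is taken
      if t := ucs(s) then
         write(image(s), " -> ", *t, " character(s)")
      else
         write(image(s), " is not valid utf-8")
   }
end
```

Here "abc" should yield a three-character ucs, "\u1234" (three bytes as a plain string) a one-character ucs, and "\xff" should be rejected.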

Operations on `ucs` strings

The ucs type supports all of the familiar string operations, with the same semantics as the conventional string type.

String operations which take two parameters can usually mix string and ucs types, although some care is needed. The general rule is: if either parameter is a ucs, then the other parameter must be convertible to a ucs. For example, consider string catenation. The expression

"abc" || u"\u1234"

is valid, and has the result u"abc\u1234", because "abc" is valid utf-8 and hence convertible to a ucs. However, the expression

"\xff" || u"\u1234"

is invalid and will cause a runtime error, because "\xff" is not valid utf-8 and hence cannot be converted to a ucs.

Converting a ucs back to a normal string produces the utf-8 representation. This is the internal representation, so this operation is very fast.
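The sketch below puts these rules together: a mixed catenation, conversion back to a plain string, and one way to guard against the runtime error when the plain operand might not be valid utf-8. It assumes that the builtin string function is used for the conversion back; the guard relies on ucs failing, as described earlier.

```icon
import io

procedure main()
   local a, b
   a := "abc" || u"\u1234"    # "abc" is valid utf-8, so the result is a ucs
   write(*a)                  # 4 characters
   b := string(a)             # the utf-8 bytes of the same text
   write(*b)                  # 6 bytes: 3 for "abc", 3 for \u1234

   # Convert explicitly first; ucs fails instead of raising a runtime error
   if a := ucs("\xff") || u"\u1234" then
      write(*a)
   else
      write("left operand is not valid utf-8")
end
```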

Csets

In Icon, csets can only represent characters in the range 0 to 255. Object Icon extends this range to cover all possible Unicode characters (0 up to 0x10FFFF).

The \u and \U escape sequences can be used to specify characters greater than 255. For example

'\x01\xff\u1234\U10ffff'

specifies a cset with four characters. You can also specify a range of characters by using a hyphen. Thus 'a-zA-Z' has the lower and upper case characters, '0-9' has the digits, and so on. A new keyword, &uset, contains all of the possible characters and is equivalent to '\x00-\U10ffff'.

Generating the elements of a cset with ! will produce one-character strings for those elements less than 256, and one-character ucs strings for those elements greater than or equal to 256. So for example the expression

!'\x01\xff\u1234\U10ffff'

produces the following four results

"\x01"
"\xff"
u"\u1234"
u"\U10ffff"
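A short sketch of the same behaviour in a complete program follows. It assumes that the builtin type function reports "ucs" for ucs values, which is not stated on this page.

```icon
import io

procedure main()
   local c, x
   c := '\x01\xff\u1234\U10ffff'
   write(*c)    # 4, the number of characters in the cset
   # Elements below 256 arrive as strings, the rest as ucs values
   every x := !c do
      write(image(x), " has type ", type(x))
end
```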

Indexing

Indexing a cset will produce either a normal string or a ucs string, depending on whether any of the elements in the range are greater than or equal to 256. For example the expression

'\x01\xff\u1234\U10ffff'[1:1 to 5]

produces the following results

""
"\x01"
"\x01\xff"
u"\x01\u00ff\u1234"
u"\x01\u00ff\u1234\U10ffff"

The builtin ord() function can be used to access the numerical values of some or all of the characters in a cset (see below for a full explanation of ord).

Conversion

Any cset can be converted to a ucs string, but only one containing only characters less than 256 can be converted to a normal string.

Builtin functions

ord(x, i, j)

The ord function expands on its Icon predecessor. The first parameter can be a string, a ucs string, or a cset. The optional parameters i and j specify a range within x, and default to 1 and 0 respectively. The result sequence is the integer character values of the specified range. For example

ord(u"\x01\u00ff\u1234\U10ffff")

generates

1
255
4660
1114111

whilst

ord(&ucase, 5, 10)

generates

69
70
71
72
73
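Since ord is a generator, it combines naturally with every and with comparison operators. The following sketch counts the characters above the ascii range in a ucs string; count_non_ascii is a hypothetical helper name introduced only for this illustration.

```icon
import io

# Hypothetical helper: count the characters of s whose value exceeds 127
procedure count_non_ascii(s)
   local n
   n := 0
   # ord(s) generates every character value; the comparison filters them
   every ord(s) > 127 do
      n +:= 1
   return n
end

procedure main()
   write(count_non_ascii(u"na\u00efve caf\u00e9"))    # 2
end
```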

uchar(x)

This is the ucs equivalent of the char function. It produces a one-character ucs string containing character number x.
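As a brief illustration, the sketch below builds a ucs string from a range of character numbers using uchar. The range 16r3b1 to 16r3b5 (Greek alpha to epsilon) is an arbitrary choice for the example.

```icon
import io

procedure main()
   local t, i
   t := u""
   # Append one single-character ucs per code point
   every i := 16r3b1 to 16r3b5 do
      t ||:= uchar(i)
   write(*t)    # 5 characters
end
```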

text(x)

This function will try to convert x to either a ucs or a conventional string as appropriate. If x is a string or ucs, it is just returned. If x is a cset then it is converted to a string if its highest character is less than 256; otherwise it is converted to a ucs. For any other type, normal string conversion is attempted.
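The rules above can be observed with a sketch like the following, which prints the type of each value before and after text is applied. It assumes that the builtin type function reports "ucs" for ucs values. By the rules just described, one would expect the two csets to give a string and a ucs respectively, and the integer to give a string.

```icon
import io

procedure main()
   local x
   # A string, a ucs, a cset below 256, a cset above 255, and an integer
   every x := "abc" | u"abc" | 'cba' | 'a\u1234' | 42 do
      write(type(x), " -> ", type(text(x)))
end
```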

lang.Text class

This class has some static methods which may prove useful.

The method has_ord(c, x) tests whether character number x is in cset c.
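A minimal sketch of has_ord in use follows. The cset and the sample string are illustrative only; the method is assumed to succeed or fail in the usual Icon manner, as the word "tests" suggests.

```icon
import io
import lang(Text)

procedure main()
   local c, i
   # Lower case ascii letters plus the lower case Greek letters
   c := 'a-z\u03b1-\u03c9'
   every i := ord(u"a\u03b2Z") do
      if Text.has_ord(c, i) then
         write(i, " is in the cset")
      else
         write(i, " is not in the cset")
end
```

With these inputs, 97 and 946 should be reported as members and 90 as a non-member.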

The method utf8_seq(i) produces the utf-8 string representation of character i. This is useful for building up a utf-8 string which can then be passed to ucs. For example, consider the problem of converting an iso-8859-1 format string to a ucs. One way to do this would be :-

procedure iso8859_to_ucs(s)
   local t
   t := u""
   every t ||:= uchar(ord(s))
   return t
end

The drawback with this method is that it creates many temporary ucs values in the every loop (uchar produces one, and the old value of t is thrown away on each iteration).

A quicker way is to create a utf-8 string first, and then create the ucs result at the end :-

import lang(Text)

procedure iso8859_to_ucs(s)
   local t
   t := ""
   every t ||:= Text.utf8_seq(ord(s))
   return ucs(t)
end

Editing non-ASCII source code

Source code files may contain non-ASCII characters, provided the file's encoding is declared.

To specify a file’s encoding, the preprocessor directive $encoding is used. The directive is followed by the encoding name, which at present can take one of three possible values :-

ASCII (the default)
ISO-8859-1
UTF-8

Each source file is processed as a sequence of codepoints, which are converted from the input bytes, based on the encoding. For ASCII encoding and ISO-8859-1 encoding, each codepoint is the same as each input byte. The only difference is that ASCII restricts the range of codepoints to 0-127, as opposed to 0-255 for ISO-8859-1. For UTF-8 encoding each codepoint may correspond to several input bytes, and may be any valid Unicode codepoint.

Literals

Other than escape sequences, each codepoint within a string, ucs or cset literal will correspond to exactly one character in that literal. For a string, the codepoint must be in the range 0-255; otherwise a compile-time error is signalled.

Example

import io

$encoding UTF-8

procedure main()
   local s
   s := u"Министры иностранных дел Европейского союза утвердили"
   s ? every write(upto('ив'))
end

This program produces the output

2
4
10
27
47
51
53
