
Unicode issues



Jim Balter-2
[moved to scala-debate per Tony's suggestion]

Thanks, Erik. I agree with all of that, except for the term "glyph",
which isn't quite right -- the correct term is "character". There's a
good glossary, as well as a discussion of the changes made to Java to
support Unicode supplementary characters, and the issues involved, at

http://java.sun.com/developer/technicalArticles/Intl/Supplementary/

It's highly recommended reading for anyone interested in this discussion.
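To illustrate the kind of pitfall that article covers, here is a small sketch (U+10400, DESERET CAPITAL LETTER LONG I, is just an arbitrary supplementary character chosen because it is a letter):

```scala
// A supplementary character (outside the Basic Multilingual Plane)
// occupies two UTF-16 code units -- a surrogate pair -- in a String.
val s = "\uD801\uDC00"                 // U+10400, stored as a surrogate pair

s.length                               // 2 -- counts UTF-16 code units
s.codePointCount(0, s.length)          // 1 -- counts Unicode code points
Character.isLetter(s.charAt(0))        // false -- charAt(0) sees only half the pair
Character.isLetter(s.codePointAt(0))   // true -- the Int-taking overload decodes it
```

This is why naive per-Char iteration over a String gives wrong answers for supplementary characters.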

At the moment, my concern is not so much with UTF-8 or UTF-32 as with
UTF-16, because that is the encoding of java.lang.String and, as so
many people have pointed out, Java-Scala interoperability is
essential. I think perhaps the lowest-impact approach would be to
provide a Unicode view of Strings, so for instance one could do

s.unicodeView.exists(Character.isLetter(_))

All that would take is an implicit conversion from a String to a
UnicodeView, and I have written code to do that. It would be nice to
be able to do

s.unicodeView.exists(_.isLetter)

and I have written that code too, but it makes isLetter an implicit
method for any Int, which may not be desirable.
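A minimal sketch of how such a view might be built (the names UnicodeView, unicodeView and RichCodePoint are hypothetical, not an existing API, and this is not Jim's actual code):

```scala
object UnicodeViews {
  // Iterates Unicode code points (as Ints), not UTF-16 code units,
  // so surrogate pairs are decoded rather than split.
  class UnicodeView(s: String) extends Iterable[Int] {
    def iterator: Iterator[Int] = new Iterator[Int] {
      private var i = 0
      def hasNext = i < s.length
      def next(): Int = {
        val cp = s.codePointAt(i)      // decodes a surrogate pair if present
        i += Character.charCount(cp)   // advance by 1 or 2 code units
        cp
      }
    }
  }

  implicit def stringToUnicodeView(s: String): UnicodeView =
    new UnicodeView(s)

  // For the second form, enrich Int with character predicates --
  // with the caveat above that this applies to every Int:
  class RichCodePoint(cp: Int) {
    def isLetter: Boolean = Character.isLetter(cp)
  }
  implicit def intToRichCodePoint(cp: Int): RichCodePoint =
    new RichCodePoint(cp)
}

// import UnicodeViews._
// "\uD801\uDC00abc".unicodeView.exists(Character.isLetter(_))
// "\uD801\uDC00abc".unicodeView.exists(_.isLetter)
```

The view is lazy (no intermediate collection is built), and because it only wraps the String it leaves java.lang.String itself untouched, which keeps Java interoperability intact.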

-- Jim

On Thu, Dec 23, 2010 at 9:56 PM, Erik Osheim <[hidden email]> wrote:

> On Thu, Dec 23, 2010 at 11:50:00PM -0500, Arya Irani wrote:
>> What are some performance considerations that should be kept in mind when
>> implementing a UTF string library?
>>
>> Must a UTF8 string be stored as an Array[Byte]?  What about Seq[Seq[Byte]]
>> or Array[Seq[Byte]] or Array[Array[Byte]], where each element represents a
>> code point?
>
> Just to be clear: I think Jim wants a Unicode string library rather
> than just a UTF-8 string library. Unicode is a specification which
> assigns numbers to glyphs, whereas UTF-8 is a particular method of
> storing strings of Unicode glyphs as bytes.
>
> You can use Seq[Int] (which corresponds to the UTF-32 encoding) to
> correctly represent all existing Unicode glyphs at the cost of
> increased memory usage. For instance, "cat" takes 12 bytes when
> represented as Array[Int] (UTF-32) but 3 bytes when represented as
> Array[Byte] (UTF-8).
>
> UTF-8 uses a variable number of bits to reduce memory usage (in essence
> a simple form of compression), but this complicates code which wants to
> handle the string in terms of glyphs (for instance the number of bytes
> in a UTF-8 string will often differ from the number of Unicode glyphs).
> With the Seq[Int] representation the length of the sequence is the same
> as the number of glyphs.
>
> I think the ideal would be to have a library which can deal with (at
> least) two representations of a Unicode string: UTF-32 (Seq[Int]) and
> UTF-8 (Seq[Byte]). One could use the former for simplicity and speed,
> and the latter when trying to conserve memory and for I/O.
>
> None of this is profound, but I thought it would be useful to make the
> distinction between Unicode and UTF-8 (the two are often conflated).
>
> -- Erik
>