[moved to scala-debate per Tony's suggestion]
Thanks, Erik. I agree with all of that, except for the term "glyph", which isn't quite right -- the correct term is "character". There's a good glossary, as well as a discussion of the changes made to Java to support Unicode supplementary characters, and the issues involved, at

    http://java.sun.com/developer/technicalArticles/Intl/Supplementary/

It's highly recommended reading for anyone interested in this discussion.

At the moment, my concern is mostly with neither UTF-8 nor UTF-32, but rather with UTF-16, because that is the encoding for java.lang.String and, as so many people have pointed out, Java-Scala interoperability is essential.

I think perhaps the lowest-impact approach would be to provide a Unicode view of Strings, so for instance one could do

    s.unicodeView.exists(Character.isLetter(_))

All that would take is an implicit conversion from a String to a UnicodeView, and I have written code to do that. It would be nice to be able to do

    s.unicodeView.exists(_.isLetter)

and I have written that code too, but it makes isLetter an implicit method for any Int, which may not be desirable.

-- Jim

On Thu, Dec 23, 2010 at 9:56 PM, Erik Osheim <[hidden email]> wrote:
> On Thu, Dec 23, 2010 at 11:50:00PM -0500, Arya Irani wrote:
>> What are some performance considerations that should be kept in mind when
>> implementing a UTF string library?
>>
>> Must a UTF8 string be stored as an Array[Byte]? What about Seq[Seq[Byte]]
>> or Array[Seq[Byte]] or Array[Array[Byte]], where each element represents a
>> code point?
>
> Just to be clear: I think Jim wants a Unicode string library rather
> than just a UTF-8 string library. Unicode is a specification which
> assigns numbers to glyphs, whereas UTF-8 is a particular method of
> storing strings of Unicode glyphs as bytes.
>
> You can use Seq[Int] (which corresponds to the UTF-32 encoding) to
> correctly represent all existing Unicode glyphs at the cost of
> increased memory usage. For instance, "cat" takes 12 bytes when
> represented as Array[Int] (UTF-32) but 3 bytes when represented as
> Array[Byte] (UTF-8).
>
> UTF-8 uses a variable number of bits to reduce memory usage (in essence
> a simple form of compression), but this complicates code which wants to
> handle the string in terms of glyphs (for instance the number of bytes
> in a UTF-8 string will often differ from the number of Unicode glyphs).
> With the Seq[Int] representation the length of the sequence is the same
> as the number of glyphs.
>
> I think the ideal would be to have a library which can deal with (at
> least) two representations of a Unicode string: UTF-32 (Seq[Int]) and
> UTF-8 (Seq[Byte]). One could use the former for simplicity and speed,
> and the latter when trying to conserve memory and for I/O.
>
> None of this is profound, but I thought it would be useful to make the
> distinction between Unicode and UTF-8 (the two are often conflated).
>
> -- Erik
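
To make the Unicode-view idea above concrete, here is a minimal sketch of what such a view could look like. It is not necessarily the code referred to in the message: the names UnicodeViewSketch, UnicodeView and UnicodeViewOps are hypothetical, and it assumes a Scala version that supports implicit classes. It only relies on java.lang.String.codePointAt and Character.charCount to step through the UTF-16 data one code point at a time.

```scala
object UnicodeViewSketch {
  // Hypothetical wrapper: presents a UTF-16 String as a sequence of
  // Unicode code points (Ints) rather than 16-bit Chars.
  final class UnicodeView(s: String) extends Iterable[Int] {
    def iterator: Iterator[Int] = new Iterator[Int] {
      private var i = 0
      def hasNext: Boolean = i < s.length
      def next(): Int = {
        val cp = s.codePointAt(i)     // combines a surrogate pair if present
        i += Character.charCount(cp)  // advance by 1 or 2 Chars
        cp
      }
    }
  }

  // Hypothetical implicit conversion that adds `unicodeView` to String.
  implicit class UnicodeViewOps(s: String) {
    def unicodeView: UnicodeView = new UnicodeView(s)
  }
}
```

With `import UnicodeViewSketch._` in scope, a string holding only the supplementary character U+1D538, built as `new String(Character.toChars(0x1D538))`, has `length` 2 but `unicodeView.size` 1. Calling `s.exists(_.isLetter)` on it tests the two surrogate Chars separately, while `s.unicodeView.exists(Character.isLetter(_))` tests the single code point.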
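
The size comparison in the quoted message ("cat" as 12 bytes in UTF-32 versus 3 bytes in UTF-8) can be checked with a few lines. This is only a sketch: the object and method names are made up for illustration, and the UTF-16 and UTF-32 sizes are computed arithmetically (2 bytes per Char, 4 bytes per code point, no byte-order mark) rather than by running an encoder.

```scala
import java.nio.charset.StandardCharsets

object EncodingSizes {
  // Bytes needed to store s as UTF-8 (1 to 4 bytes per code point).
  def utf8Bytes(s: String): Int = s.getBytes(StandardCharsets.UTF_8).length

  // Bytes needed as UTF-16: two bytes per Char, as in java.lang.String.
  def utf16Bytes(s: String): Int = s.length * 2

  // Bytes needed as UTF-32: four bytes per code point.
  def utf32Bytes(s: String): Int = s.codePointCount(0, s.length) * 4
}
```

For "cat" these give 3, 6 and 12 bytes respectively. For a string holding only a supplementary character such as U+1D538, all three come out at 4 bytes, which is where the Char count and the code point count of a String part ways.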
