I started a pure Lua module to support operations on UTF-8 data.
The first goals have been reached:
- get the correct length
- extract substrings
- automatic conversion
Pleased to see someone else interested in unicode. I have had a look at your online repo. However, are you aware the following example you give will not work in general:
Sample of use:
Code: Select all
local data = "àbcdéêèf"
local u = require("utf8")
local udata = u(data)
print(type(data), data)   -- the original
print(type(udata), udata) -- automatic conversion to string
print(#data)  -- not the number of characters printed on screen
print(#udata) -- the number of characters printed on screen
print(udata:sub(4,5)) -- be able to use the sub() like a string
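The length behaviour above rests on one property of UTF-8: every code point starts with a byte outside the continuation range 0x80-0xBF, so counting non-continuation bytes gives the code-point length. Here is a minimal sketch of that idea (not the module's actual code, and written in Python rather than Lua only so the byte values are easy to inspect):

```python
def utf8_len(data: bytes) -> int:
    # A byte starts a code point unless it is a continuation byte
    # (0b10xxxxxx, i.e. 0x80..0xBF).
    return sum(1 for b in data if b < 0x80 or b >= 0xC0)

s = "àbcdéêèf".encode("utf-8")
print(len(s))       # 12 -> byte length, what Lua's # gives on the raw string
print(utf8_len(s))  # 8  -> code-point length, what #udata reports
```

The four accented letters each take two bytes, which is why the raw byte length (12) disagrees with the code-point count (8).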
I will not give you a Lua example, because you cannot even type
Unicode escapes in Lua source, but here is the same point shown in Python:
Code: Select all
s = u"\u0041\u0302\u0020\u0041\u032D"
print(s) # "Â A̭" (3 chars!)
print(repr(s)) # u'A\u0302 A\u032d'
print(len(s))  # 5
The point is that what Unicode folks call "abstract characters", the things represented by Unicode code points, are not what you, I, or anyone else would call a "character"; they are just what the standard chooses to list in its set.

In particular, a composite character like Â can be represented by two code points: one for the base 'A' and one for the combining '^'. Which is a very good thing, imo: simple, informative, efficient. But there are also "precomposed" characters, single code points that stand for a whole composite character. These are the ones most (if not all) Unicode-aware editors and other text-producing software actually emit, so everyone ends up believing that "abstract characters" are just characters and that each code point represents one character (even programmers working on Unicode). But this is not true.
A single character is represented by a sequence of code points (1 or more; there is no formal limit). And each code point is one number in UTF-32 and 1 to 4 (originally up to 6) bytes in UTF-8, as you know. Thus decoding UTF-8 gives you an array of code points, but not an array of character representations in the everyday or programming sense of "character". As a consequence, your #udata on my example will give 5, not 3.
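The 5-versus-3 claim is easy to check. Here is the same string from the Python 2 example above, rewritten for Python 3; note that skipping combining marks (via `unicodedata.combining`) is only a rough approximation of user-perceived characters, since full grapheme segmentation is more involved:

```python
import unicodedata

s = "\u0041\u0302\u0020\u0041\u032D"  # "Â A̭"
print(len(s))                 # 5 -> code points, what a plain length reports
print(len(s.encode("utf-8"))) # 7 -> bytes (each combining mark is 2 bytes)

# Count only base characters (combining class 0), a crude stand-in
# for "characters a reader sees":
print(sum(1 for c in s if not unicodedata.combining(c)))  # 3
```

So a code-point length, a byte length, and a "printed characters" count are three different numbers for the same string.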
Anyway, it's still very, very nice to have UTF-8 <--> Unicode encoding and decoding routines, and I may reuse them if you don't mind.