Will string-based RTL functions (such as StrReplace(), Format(), Trim(), UpperCase(), etc.) now correctly handle characters with a variable byte length?
I have more answers now. For details see:
http://wiki.freepascal.org/Better_LCL_Unicode_Support
There is no StrReplace() in the FPC libs. The other functions work as expected. UpperCase() still works only in the ASCII range, just like in Delphi.
The Ansi...() versions of string functions are also Delphi compatible.
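To illustrate the ASCII-only behaviour of UpperCase(), here is a minimal sketch. It assumes a Lazarus installation so that the LazUTF8 unit (part of LazUtils) is available; the program name is made up for the example. The non-ASCII letter is written as raw UTF-8 bytes so the result does not depend on the encoding of the source file.

```pascal
program UpperCaseDemo;
{$mode objfpc}{$H+}
uses
  SysUtils, LazUTF8;  // LazUTF8 ships with Lazarus (LazUtils package)
var
  S: string;
begin
  // 'a' followed by U+00E4 (a-umlaut), given as its UTF-8 bytes C3 A4.
  S := 'a'#$C3#$A4;
  // SysUtils.UpperCase only changes 'a'..'z'; the umlaut bytes are left alone.
  WriteLn(UpperCase(S) = 'A'#$C3#$A4);
  // UTF8UpperCase is Unicode-aware: U+00E4 becomes U+00C4 (bytes C3 84).
  WriteLn(UTF8UpperCase(S) = 'A'#$C3#$84);
end.
```

Both comparisons should hold, which shows that UpperCase() never corrupts UTF-8 data; it simply ignores everything outside ASCII.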
And will functions like Pos(), Length() and Copy() be byte-oriented or character-oriented?
Byte-oriented.
In Delphi they are not character-oriented either; they are UnicodeChar(*) (WideChar, Word, 2 bytes)-oriented. One Unicode codepoint in UTF-16 encoding can consist of 2 UnicodeChars(*). There is a lot of sloppy Delphi code that forgets this fact.
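A small sketch of the UTF-16 case, using FPC's UnicodeString. The codepoint U+1F600 is just an example picked from outside the Basic Multilingual Plane; the two surrogate values are filled in by hand so nothing depends on a conversion routine.

```pascal
program SurrogateDemo;
{$mode objfpc}{$H+}
var
  U: UnicodeString;
begin
  // U+1F600 does not fit into 16 bits, so UTF-16 stores it as a
  // surrogate pair: high surrogate D83D, low surrogate DE00.
  SetLength(U, 2);
  U[1] := WideChar($D83D);
  U[2] := WideChar($DE00);
  // Length counts 16-bit units, so one codepoint is reported as 2.
  WriteLn(Length(U));
end.
```

Code that treats U[1] on its own as "the first character" would split the pair, which is the kind of mistake meant above.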
Please read this carefully:
http://wiki.lazarus.freepascal.org/UTF8_strings_and_characters
and you will understand that Pos(), Length() and Copy() are very useful with UTF-8.
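A minimal sketch of why the byte-oriented functions still work well with UTF-8 (the sample text and program name are made up for the example). The bytes of a multi-byte UTF-8 sequence are never in the ASCII range, so searching for a valid substring always lands on a codepoint boundary, and the byte index returned by Pos() can be passed straight to Copy().

```pascal
program BytePosDemo;
{$mode objfpc}{$H+}
var
  S, Tail: string;
  P: Integer;
begin
  // "haeuser = houses" with the a-umlaut (U+00E4) as UTF-8 bytes C3 A4.
  S := 'h'#$C3#$A4'user = houses';
  // Length is a byte count: 16 here, although there are only 15 codepoints.
  WriteLn(Length(S));
  // Pos returns a byte index (9 here), not a codepoint index.
  P := Pos('=', S);
  // The byte index is exactly what Copy expects: skip '= ' and take the rest.
  Tail := Copy(S, P + 2, Length(S));
  WriteLn(Tail);  // houses
end.
```

As long as indexes come from Pos() or similar searches and are not computed by counting "characters", nothing needs to know how many bytes each codepoint takes.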
Is the LazUTF8 unit not needed anymore?
It is still needed. There will be CodePoint...() functions but they are not implemented yet. With UTF-8 they will be wrappers for the functions in LazUTF8, so you can test with those now.
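For testing in the meantime, here is a small sketch with the existing LazUTF8 functions, next to their byte-oriented counterparts. It assumes LazUTF8 from the LazUtils package is on the unit path; the sample word is arbitrary.

```pascal
program CodePointDemo;
{$mode objfpc}{$H+}
uses
  LazUTF8;  // ships with Lazarus (LazUtils package)
var
  S: string;
begin
  // "haeuser" with the a-umlaut (U+00E4) as UTF-8 bytes C3 A4.
  S := 'h'#$C3#$A4'user';
  WriteLn(Length(S));           // 7 (bytes)
  WriteLn(UTF8Length(S));       // 6 (codepoints)
  WriteLn(Pos('user', S));      // 4 (byte index)
  WriteLn(UTF8Pos('user', S));  // 3 (codepoint index)
  // UTF8Copy takes a codepoint index and a codepoint count.
  WriteLn(UTF8Copy(S, 3, 4));   // user
end.
```

Note that the UTF8...() functions have to scan the string to find codepoint positions, so the plain byte-oriented versions are the cheaper choice whenever a byte index is good enough.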
Things are happening in a slightly wrong order. This new UTF-8 support is for Lazarus 2.0 and it works well, but Lazarus 1.4 has not been released yet and does not even have an RC1.
Anyway, please test with the new UTF-8 support and try to find problems. The only known problem is still the Char type with TFormatSettings.
(*) The type names "UnicodeString" and "UnicodeChar" were a very unfortunate choice by Borland.
A Unicode codepoint is the "real" character definition in Unicode. It can be encoded in different ways, and its length depends on the encoding.
A Unicode character as the user sees it is either one codepoint or a sequence of several codepoints (for example a base letter followed by combining marks). Yes, this is complex ...