
Author Topic: SelLength incorrect value for text containing characters > $FFFF  (Read 28537 times)

fedkad

  • Full Member
  • ***
  • Posts: 176
Re: SelLength incorrect value for text containing characters > $FFFF
« Reply #30 on: December 08, 2017, 11:01:31 am »
Quote
fedkad used it for .. I am not sure what.
...
I think the best meanings for "character" are:
1. Pascal Char, for historical reasons. In Unicode terms it represents a codeunit and is useful in many situations.
2. User perceived character. This involves combining codepoints, glyphs, ligatures and whatever.

By character I mean any one of the 1,112,064 valid code points in Unicode, each of which is represented in 1, 2, 3, or 4 bytes in UTF-8. In other words, the "character" that I am talking about is whatever the UTF8Length function counts as "1" unit, or what the UTF8Copy function returns when we pass 1 as its third argument.
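To make that definition concrete, here is a minimal Python sketch. `utf8_length` and `utf8_copy` are hypothetical stand-ins that only mimic the behaviour described above (count in codepoints, 1-based start index); they are not the LazUTF8 implementations.

```python
def utf8_length(s: str) -> int:
    """Number of codepoints in s (a Python str is a sequence of codepoints)."""
    return len(s)


def utf8_copy(s: str, start_char_index: int, char_count: int) -> str:
    """Return char_count codepoints starting at the 1-based start_char_index."""
    begin = start_char_index - 1
    return s[begin:begin + char_count]


# U+0061 (a), U+20AC (euro sign), U+1D11E (musical G clef)
text = "a\u20ac\U0001d11e"

# Each codepoint counts as one unit, whatever its size in UTF-8 bytes.
byte_lengths = [len(ch.encode("utf-8")) for ch in text]
third = utf8_copy(text, 3, 1)

print(utf8_length(text), byte_lengths, third.encode("utf-8").hex())
```

The 4-byte codepoint is still one "character" in this sense, which is exactly the case where a UTF-16 based count would differ.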

The Lazarus documentation itself is using the same terminology:

function UTF8Copy(
  const s: string;
  StartCharIndex: PtrInt;
  CharCount: PtrInt
):string;


See also: http://wiki.freepascal.org/UTF8_strings_and_characters#Accessing_the_Nth_UTF8_character
Lazarus 2.2.6 / FPC 3.2.2 on x86_64-linux-gtk2 (Ubuntu/GNOME) and x86_64-win64-win32/win64 (Windows 11)

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4467
  • I like bugs.
Re: SelLength incorrect value for text containing characters > $FFFF
« Reply #31 on: December 08, 2017, 11:46:24 am »
Quote
By character I mean every possible of 1,112,064 valid code points in Unicode that can be represented in 1, 2, 3, or 4 bytes in UTF-8. In other words the "character" that I am talking about is something that is counted as "1" unit by the UTF8Length function or is returned by the UTF8Copy function when we give 1 as its third argument.
OK, that is a codepoint. BTW, the valid codepoints in Unicode can be represented in any of the Unicode encodings (UTF-8, UTF-16 or UTF-32). Conversions between Unicode encodings are always lossless.
Only codepoints are encoded. People should remember that when arguing about the supremacy of a certain encoding. The true complexity of Unicode lies beyond codepoints and encodings.
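A small Python sketch of the lossless-conversion point, assuming only the standard codecs; the helper name `transcode` is made up for this illustration.

```python
def transcode(data: bytes, src: str, dst: str) -> bytes:
    """Re-encode bytes from one Unicode encoding to another via codepoints."""
    return data.decode(src).encode(dst)


# A, euro sign, musical G clef, heavy black heart
text = "A\u20ac\U0001d11e\u2764"

# The same codepoints occupy a different number of bytes in each encoding.
sizes = {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

# UTF-8 -> UTF-16 -> UTF-32 -> UTF-8 returns the original bytes.
utf8 = text.encode("utf-8")
utf16 = transcode(utf8, "utf-8", "utf-16-le")
utf32 = transcode(utf16, "utf-16-le", "utf-32-le")
round_trip = transcode(utf32, "utf-32-le", "utf-8")

print(sizes, round_trip == utf8)
```

Only the byte layout changes along the way; the sequence of codepoints is identical at every step.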

Quote
The Lazarus documentation itself is using the same terminology:
...
Good point. Actually, it also bothered me when I edited the pages. I think they must all be changed to "codepoint" to prevent extra confusion.
« Last Edit: December 08, 2017, 12:20:50 pm by JuhaManninen »
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

munair

  • Hero Member
  • *****
  • Posts: 798
  • compiler developer @SharpBASIC
    • SharpBASIC
Re: SelLength incorrect value for text containing characters > $FFFF
« Reply #32 on: December 08, 2017, 02:04:31 pm »
Quote
Only codepoints are encoded. People should remember it when arguing about the supremacy of a certain encoding. The true complexity of Unicode lies beyond codepoints and encodings.
True, but it's the codepoints we work with when representing or manipulating text. It is not just happenstance that UTF-8 has become the predominant encoding scheme on the internet and with Unix-like operating systems.
keep it simple

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4467
  • I like bugs.
Re: SelLength incorrect value for text containing characters > $FFFF
« Reply #33 on: December 08, 2017, 03:42:14 pm »
Quote
True, but it's the codepoints we work with when representing or manipulating text.
No, we also work with codeunits, combining codepoints, normalization of "characters" using decomposed or precomposed codepoint forms, locale dependent rules for upper-/lower-casing codepoints and "characters", locale dependent rules for sorting them ... etc.
A programmer of an advanced text layout application must also work with glyphs, graphemes and so on, which I don't know well. I only know it is complicated!
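Two of the items in that list can be illustrated with a short Python sketch. It uses only built-in string methods, which are locale-independent, so it does not cover locale-specific rules (for example the Turkish dotted and dotless i); those need a dedicated library.

```python
# Case mapping can change the number of codepoints.
word = "stra\u00dfe"            # "strasse" spelled with U+00DF, 6 codepoints
upper = word.upper()            # full Unicode case mapping, no locale involved
print(upper, len(word), len(upper))

# Plain codepoint order is not dictionary order.
words = ["zebra", "Apple", "\u00e9clair"]   # the last word starts with U+00E9
by_codepoint = sorted(words)
print(by_codepoint)
```

Upper-casing lengthens the string by one codepoint, and sorting by codepoint puts the accented word after "zebra", which is not where a dictionary would place it.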

Quote
It is not just happenstance that UTF-8 has become the predominant encoding scheme on the internet and with Unix-like operating systems.
Yes, UTF-8 has nice features, but it is not a holy grail that makes Unicode's complexity go away. No encoding can do that, because the complexity is not in encodings or codepoints in the first place.
Some people here and in mailing lists have suggested turning all string data into a fixed-width array of "characters", I guess something like what UTF-32 does. In their opinion all problems would be solved then. Unfortunately it is not so easy. Clearly those people do not understand the complexity of Unicode, and don't know what the word "character" can mean there.
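A quick Python sketch of why a fixed-width array does not settle the matter, using a decomposed O with acute accent as the sample text:

```python
# One user-perceived character: O followed by U+0301 COMBINING ACUTE ACCENT.
decomposed = "O\u0301"

# In UTF-32 every codepoint is exactly one 4-byte unit...
utf32_units = len(decomposed.encode("utf-32-le")) // 4

# ...but the visible character still spans two array elements,
# so taking element 0 alone strips the accent.
first_element = decomposed[0]

print(utf32_units, first_element)
```

Fixed width removes the variable number of code units per codepoint, but not the variable number of codepoints per visible character.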
« Last Edit: December 08, 2017, 06:32:51 pm by JuhaManninen »

tomitomy

  • Sr. Member
  • ****
  • Posts: 251
Re: SelLength incorrect value for text containing characters > $FFFF
« Reply #34 on: December 08, 2017, 04:08:22 pm »
Is this right:

Code:
❤(Character) = #E2#9D#A4(UTF8 code) = #E2(Byte) & #9D(Byte) & #A4(Byte) = 10084(Code Point)
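The chain above can be checked with a few lines of Python. The bit masks follow the standard 3-byte UTF-8 layout (1110xxxx 10xxxxxx 10xxxxxx); the variable names are just for this sketch.

```python
heart = "\u2764"                      # HEAVY BLACK HEART
encoded = heart.encode("utf-8")       # the three bytes E2 9D A4

# Decode the 3-byte sequence by hand: keep the payload bits of each byte.
b0, b1, b2 = encoded
code_point = ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)

print(encoded.hex(), code_point, hex(code_point))
```

So three bytes (three UTF-8 code units) carry one codepoint, U+2764, which is 10084 in decimal.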

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4467
  • I like bugs.
Re: SelLength incorrect value for text containing characters > $FFFF
« Reply #35 on: December 08, 2017, 06:40:03 pm »
Quote
Code:
❤(Character) = #E2#9D#A4(UTF8 code) = #E2(Byte) & #9D(Byte) & #A4(Byte) = 10084(Code Point)
Maybe "UTF8 code" could be "UTF-8 encoded codepoint" or something.
Anyway you picked an easy example consisting of one codepoint only. A visible character can consist of many codepoints.

munair

  • Hero Member
  • *****
  • Posts: 798
  • compiler developer @SharpBASIC
    • SharpBASIC
Re: SelLength incorrect value for text containing characters > $FFFF
« Reply #36 on: December 08, 2017, 06:51:32 pm »
Quote
No, we also work with codeunits, combining codepoints, normalization of "characters" using decomposed or precomposed codepoint forms, locale dependent rules for upper-/lower-casing codepoints and "characters", locale dependent rules for sorting them ... etc.
Yes, it is a complicated subject. My understanding is that a code unit is a particular sequence of bits. In ASCII a code unit is 7 bits, in UTF-8 a code unit is 8 bits and in UTF-16 a code unit is 16 bits. A codepoint may consist of one or more code units. In ASCII a code unit and a codepoint are the same. In UTF-8 a codepoint can be one, two or three code units.
When speaking of characters, most people refer to the glyphs that we see, like 'A' or '€'.
In the days of ASCII a code unit, codepoint and character referred to the same 7 bits, or simply a byte.
Today a character or codepoint may occupy one or more bytes (code units) depending on the encoding. It would have been simpler if the world could agree on a single encoding covering all glyphs we use for communication.
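For comparison, a short Python sketch that counts code units per codepoint in the three Unicode encodings. The helper `code_units` is hypothetical; note that the third sample, a codepoint outside the Basic Multilingual Plane, needs four UTF-8 code units.

```python
def code_units(ch: str, encoding: str, unit_size: int) -> int:
    """Number of code units of unit_size bytes that ch occupies in encoding."""
    return len(ch.encode(encoding)) // unit_size


samples = ["A", "\u20ac", "\U0001d11e"]   # A, euro sign, musical G clef

# For each codepoint: (UTF-8 units, UTF-16 units, UTF-32 units)
table = {
    ch: (
        code_units(ch, "utf-8", 1),
        code_units(ch, "utf-16-le", 2),
        code_units(ch, "utf-32-le", 4),
    )
    for ch in samples
}

for ch, counts in table.items():
    print(f"U+{ord(ch):04X}", counts)
```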
« Last Edit: December 09, 2017, 12:19:38 am by Munair »

munair

  • Hero Member
  • *****
  • Posts: 798
  • compiler developer @SharpBASIC
    • SharpBASIC
Re: SelLength incorrect value for text containing characters > $FFFF
« Reply #37 on: December 08, 2017, 06:52:18 pm »
Quote
Code:
❤(Character) = #E2#9D#A4(UTF8 code) = #E2(Byte) & #9D(Byte) & #A4(Byte) = 10084(Code Point)
Maybe "UTF8 code" could be "UTF-8 encoded codepoint" or something.
Anyway you picked an easy example consisting of one codepoint only. A visible character can consist of many codepoints.
I think you mistake codepoints for code units. This example shows one codepoint but three code units.
« Last Edit: December 08, 2017, 06:55:24 pm by Munair »

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4467
  • I like bugs.
Re: SelLength incorrect value for text containing characters > $FFFF
« Reply #38 on: December 08, 2017, 09:09:48 pm »
Quote
Yes, it is a complicated subject. My understanding is that a code unit is a particular sequence of bits. In ASCII a code unit is 7 bits, in UTF-8 a code unit is 8 bits and in UTF-16 a code unit is 16 bits. A codepoint may consist of one or more code units. In ASCII a code unit and a codepoint are the same. In UTF-8 a codepoint can be one, two or three code units.
... or even four codeunits.
Yes, except that the terms "codeunit" and "codepoint" are used only with Unicode, AFAIK. They were not needed with ASCII.

Quote
When speaking of characters, most people refer to the glyphs that we see, like 'A' or '€' which would be identical to a codepoint.
Those 2 happen to have only one codepoint, yes.

Quote
In the days of ASCII a code unit, codepoint and character referred to the same 7 bits, or simply a byte.
The words codeunit and codepoint were not used then. A "character" meant a byte. Yes, everything was simple in the old days.  :)

Quote
Today a character or codepoint may occupy one or more bytes (code units) depending on the encoding. It would have been simpler if the world could agree on a single encoding covering all glyphs we use for communication.
Clearly you still don't understand the concepts of Unicode.
A codepoint occupies one or more codeunits depending on encoding, yes.
A "character" occupies one or more codepoints. I put the word "character" in quotes because its meaning is so ambiguous.
See these 2 characters: ÓÓ
They look the same, don't they? However:
Ó (U+004F U+0301, decomposed) has 2 codepoints and 3 codeunits.
Ó (U+00D3, precomposed) has 1 codepoint and 2 codeunits.
Amazing, ha!
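The two forms can be converted into each other with Unicode normalization. A minimal Python sketch using the standard `unicodedata` module:

```python
import unicodedata

decomposed = "O\u0301"     # O + COMBINING ACUTE ACCENT
precomposed = "\u00d3"     # LATIN CAPITAL LETTER O WITH ACUTE

# They render alike but compare unequal as codepoint sequences.
equal_raw = decomposed == precomposed

# NFC composes, NFD decomposes; after normalizing both compare equal.
to_nfc = unicodedata.normalize("NFC", decomposed)
to_nfd = unicodedata.normalize("NFD", precomposed)

print(equal_raw, to_nfc == precomposed, to_nfd == decomposed)
```

Comparing or searching text reliably therefore usually means normalizing both sides to the same form first.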

Quote
I think you mistake codepoints for code units. This example shows one codepoint but three code units.
No mistake. That example had one codepoint but some other example could have more. Note also that encodings make no difference for that. Only codepoints are encoded!

More examples (codepoints are encoded with UTF-8):
ch=Ở (U+01A0 U+0309) has 2 codepoints and 4 codeunits.
ch=Ở (U+1EDE) has 1 codepoint and 3 codeunits.
ch=Ć̲ (U+0043 U+0301 U+0332) has 3 codepoints and 5 codeunits.
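These counts can be reproduced in Python. The helper `describe` and the labels are made up for this sketch; the strings are written with escapes so that no editor or browser can silently normalize them.

```python
def describe(s: str) -> tuple[int, int]:
    """Return (number of codepoints, number of UTF-8 code units) for s."""
    return len(s), len(s.encode("utf-8"))


examples = {
    "O with horn + combining hook above": "\u01a0\u0309",
    "precomposed O with horn and hook above": "\u1ede",
    "C + combining acute + combining low line": "C\u0301\u0332",
}

results = {label: describe(s) for label, s in examples.items()}
for label, (codepoints, codeunits) in results.items():
    print(f"{label}: {codepoints} codepoints, {codeunits} codeunits")
```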

See also:
 http://www.alanwood.net/unicode/combining_diacritical_marks.html
« Last Edit: December 08, 2017, 09:37:48 pm by JuhaManninen »

munair

  • Hero Member
  • *****
  • Posts: 798
  • compiler developer @SharpBASIC
    • SharpBASIC
Re: SelLength incorrect value for text containing characters > $FFFF
« Reply #39 on: December 09, 2017, 12:09:21 am »
What you describe as a character "ch=Ở has 2 codepoints and 4 codeunits" are in fact two characters: "Ơ" and "  ̉".

What you describe as a character ch=Ć̲ has 3 codepoints and 5 codeunits are in fact three characters: 43h - CCh 81h - CCh B2h.

When feeding that sequence to my UTF-8 library like
Code: FreeBasic
print UMid(text, 2, 1)
the output will be CCh 81h, which is one character or codepoint. It has nothing to do with the glyphs we recognize as a single character. So I believe we understand each other.
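The same experiment can be sketched in Python. `u_mid` is a hypothetical analogue of the UMid call above (1-based, counting codepoints); it is not the FreeBasic library itself.

```python
def u_mid(text: str, start: int, length: int) -> str:
    """Return length codepoints of text starting at the 1-based index start."""
    return text[start - 1:start - 1 + length]


# The byte sequence 43h - CCh 81h - CCh B2h from the post above.
text = bytes([0x43, 0xCC, 0x81, 0xCC, 0xB2]).decode("utf-8")

second = u_mid(text, 2, 1)
print(len(text), second.encode("utf-8").hex())
```

The result is the combining acute accent on its own (U+0301, bytes CC 81): a valid codepoint, but not something a reader would call a character.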

But keep in mind that even the Linux console has difficulty displaying these glyphs properly. For example, when trying to build the first glyph from its two codepoints with chr($C6)+chr($A0)+chr($CC)+chr($89), a rectangle is displayed instead. Likewise, decomposing the two codepoints into two strings and concatenating them back also displays the rectangle, even though the byte sequence is exactly the same and the UTF-8 library decodes the string correctly.
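A Python sketch of the same byte sequence, assuming nothing beyond the standard library. It cannot reproduce the console behaviour, but it does show that splitting and rejoining leaves the bytes unchanged, which suggests the rectangle is a rendering or font issue rather than a string-handling one.

```python
import unicodedata

# chr($C6)+chr($A0)+chr($CC)+chr($89) from the post above.
raw = bytes([0xC6, 0xA0, 0xCC, 0x89])
glyph = raw.decode("utf-8")           # U+01A0 followed by U+0309

# Split into the two codepoints and concatenate them again.
parts = [glyph[0], glyph[1]]
rejoined = "".join(parts)
same_bytes = rejoined.encode("utf-8") == raw

# NFC replaces the pair with the single precomposed codepoint U+1EDE,
# which some renderers may handle better (an assumption, not tested here).
composed = unicodedata.normalize("NFC", glyph)

print(same_bytes, [f"U+{ord(c):04X}" for c in glyph], f"U+{ord(composed):04X}")
```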
« Last Edit: December 09, 2017, 12:48:36 am by Munair »

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4467
  • I like bugs.
Re: SelLength incorrect value for text containing characters > $FFFF
« Reply #40 on: December 09, 2017, 01:04:12 am »
Quote
What you describe as a character "ch=Ở has 2 codepoints and 4 codeunits" are in fact two characters: "Ơ" and "  ̉".
What you describe as a character ch=Ć̲ has 3 codepoints and 5 codeunits are in fact three characters: 43h - CCh 81h - CCh B2h.
Well, they are codepoints which are combined to form a "user perceived character".
In my opinion, calling a "codepoint" a "character" is very confusing. What do you call the visual combined thingy then?

Quote
Keep in mind that even the Linux console has difficulty displaying these glyphs properly.
...
Yes, both SW applications and fonts have issues displaying complex and rare Unicode "characters". Still a long way to go...

I guess you were right about the terms (codeunit, codepoint) being used already before Unicode. There were many encoding systems before the global Unicode standard was created. Anyway I am happy it was created. It is complex, yes, but at least it is a global standard.

BTW, I changed "character" into "codepoint" where appropriate in this page:
 http://wiki.lazarus.freepascal.org/UTF8_strings_and_characters
The wrong "character" term is still used in some function names. I am not sure if it is worth the trouble to change them. The current names should be deprecated and kept as aliases for some time.

howardpc

  • Hero Member
  • *****
  • Posts: 4144
Re: SelLength incorrect value for text containing characters > $FFFF
« Reply #41 on: December 09, 2017, 09:10:42 am »
Quote
The wrong "character" term is still used in some function names. I am not sure if it is worth the trouble to change them. The current names should be deprecated and kept as aliases for some time.

I am convinced it is worth the trouble to change them. A comprehensively consistent naming convention for Lazarus functions is nowhere more applicable than in the confusing world of unicode and text encoding functionality.
Some of this confusion arises (apart from the inherent complexities of the topic) because people in discussion use the same terms such as "character" with different or overlapping or imprecise meanings, often without being aware of the ambiguities, or different interpretations other readers may assume. Surely the LCL can embrace informed and intelligent consistency and abandon well-meant but (with hindsight) flawed naming.

munair

  • Hero Member
  • *****
  • Posts: 798
  • compiler developer @SharpBASIC
    • SharpBASIC
Re: SelLength incorrect value for text containing characters > $FFFF
« Reply #42 on: December 09, 2017, 09:59:34 am »
Quote
Well, they are codepoints which are combined to form a "user perceived character".
In my opinion calling "codepoint" as "character" is very confusing. How do you call the visual combined thingy then?
The visual combined thingy or glyph is nothing because one cannot do string operations on it without breaking it. For example, if a glyph consisting of five code units were the first one in a string, then ULeft(text, 1) should isolate all five code units, but that does not happen because nothing in the first byte indicates that the next four belong to it too. This is confirmed in a text editor supporting UTF-8. When going over the glyph with the cursor, it sort of sticks to it three times before moving to the next. In short, it isn't a real glyph or character because it isn't recognized as such. One can do a lot of funny things in Unicode. It's like a palette of over a million colors, but it doesn't guarantee proper handling of all glyphs that could be constructed.
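The information needed to keep such a glyph together is not in the first byte; it comes from the properties of the following codepoints. Below is a deliberately simplified Python sketch that attaches combining marks to the preceding base codepoint. `split_clusters` and `left_clusters` are hypothetical helpers, and this is not the full Unicode grapheme cluster algorithm (UAX #29), only an illustration of the idea.

```python
import unicodedata


def split_clusters(s: str) -> list[str]:
    """Group each base codepoint with the combining marks that follow it."""
    clusters: list[str] = []
    for ch in s:
        # A non-zero canonical combining class marks a combining codepoint.
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def left_clusters(s: str, n: int) -> str:
    """Return the first n clusters of s (a cluster-aware 'left' operation)."""
    return "".join(split_clusters(s)[:n])


text = "C\u0301\u0332at"              # C + acute + low line, then "at"

by_codepoint = text[:1]               # breaks the glyph: plain "C"
by_cluster = left_clusters(text, 1)   # keeps all five UTF-8 code units

print(len(by_codepoint.encode("utf-8")), len(by_cluster.encode("utf-8")))
```

A text editor that moves the cursor over the whole glyph in one step would need something along these lines, implemented against the real segmentation rules.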

munair

  • Hero Member
  • *****
  • Posts: 798
  • compiler developer @SharpBASIC
    • SharpBASIC
Re: SelLength incorrect value for text containing characters > $FFFF
« Reply #43 on: December 09, 2017, 10:08:06 am »
Quote
The wrong "character" term is still used in some function names. I am not sure if it is worth the trouble to change them. The current names should be deprecated and kept as aliases for some time.

I am convinced it is worth the trouble to change them. A comprehensively consistent naming convention for Lazarus functions is nowhere more applicable than in the confusing world of unicode and text encoding functionality.
Some of this confusion arises (apart from the inherent complexities of the topic) because people in discussion use the same terms such as "character" with different or overlapping or imprecise meanings, often without being aware of the ambiguities, or different interpretations other readers may assume. Surely the LCL can embrace informed and intelligent consistency and abandon well-meant but (with hindsight) flawed naming.
The trouble goes back to the 70s and 80s when programming languages were developed using improper names. For example, the function ASC should only allow bytes in the range 0h - 7Fh strictly speaking. The name should be ASCE (extended ASCII) to allow all unsigned byte values. Same for CHR -> CHRE. Especially since the range 80h to FFh had different characters for each code page.
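The code page dependence of the upper range can be shown with Python's standard codecs. The two code pages picked here (Windows-1252 and the original IBM PC code page 437) are only examples chosen for this sketch.

```python
byte = bytes([0x80])                  # first value above the ASCII range

# The same byte is a different character in each code page.
in_cp1252 = byte.decode("cp1252")     # euro sign
in_cp437 = byte.decode("cp437")       # capital C with cedilla

# Strict ASCII accepts 00h-7Fh only and rejects the byte outright.
try:
    byte.decode("ascii")
    ascii_accepts = True
except UnicodeDecodeError:
    ascii_accepts = False

print(f"U+{ord(in_cp1252):04X}", f"U+{ord(in_cp437):04X}", ascii_accepts)
```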
« Last Edit: December 09, 2017, 10:16:02 am by Munair »

tomitomy

  • Sr. Member
  • ****
  • Posts: 251
Re: SelLength incorrect value for text containing characters > $FFFF
« Reply #44 on: December 09, 2017, 12:40:49 pm »
I think it's a bad idea to represent a single character with more than one codepoint: it breaks the uniformity of the encoding (using the same rule to encode all the characters). I do not know why there are multiple-codepoint characters, or who is using them. Do they have no trouble? Does no one object to this rule?

 
