Author Topic: Unicode, and IntToStr and PosEx: How to make it work?  (Read 8794 times)

EganSolo

  • Sr. Member
  • ****
  • Posts: 290
Unicode, and IntToStr and PosEx: How to make it work?
« on: February 19, 2017, 03:33:53 am »
So, I'm trying to use unicode... yeah...

I've set {$modeswitch UnicodeStrings} in my units, but I'm using IntToStr and the compiler complains about the implicit type conversion from AnsiString to UnicodeString.

I can get rid of this warning by explicit typecasting as in
Code: Pascal
  UnicodeString(IntToStr())

Is this the right thing to do?

More importantly, when I set {$modeswitch UnicodeStrings} it sets String to UnicodeString in my unit but not globally across all units.
Therefore SysUtils remains AnsiString.

Do I have to worry about every method in SysUtils now, such as Trim, TrimLeft, Pos, PosEx etc?

The other option would be to remove this switch and reinstate {$H+} and then convert UnicodeString into String...

Also, can I use {$modeswitch UnicodeStrings} without {$mode Delphi}?

This entire Unicode / ANSI code business smells like a messed-up marriage.

We need a counselor...
« Last Edit: February 19, 2017, 09:55:54 am by EganSolo »
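
For readers following along, here is a minimal sketch of the explicit cast described in this post. It assumes FPC 3.0.x, where SysUtils.IntToStr returns an AnsiString, and it pairs the modeswitch with {$mode objfpc} instead of {$mode Delphi}. The program name and the variable are made up for illustration.

```pascal
program CastDemo;
{$mode objfpc}{$H+}
{$modeswitch unicodestrings}   // String now means UnicodeString, in this file only
uses
  SysUtils;
var
  U: String;                   // a UnicodeString because of the modeswitch
begin
  // Assumption: IntToStr returns an AnsiString here (FPC 3.0.x RTL).
  // The explicit cast performs the conversion and avoids the
  // implicit-conversion warning.
  U := UnicodeString(IntToStr(2017));
  WriteLn(Length(U));          // digits only, so one UTF-16 code unit per digit
end.
```

The cast only changes how the conversion is written; the result of IntToStr is plain ASCII digits, so nothing can be lost in the conversion either way.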

EganSolo

Re: Unicode and IntToStr: Why the warning?
« Reply #1 on: February 19, 2017, 09:55:10 am »
OK, so I'm doing a bit more digging...
Consider the following simple program
Code: Pascal
  program Project1;
  {$modeswitch UnicodeStrings}
  uses sysutils,Classes,strutils;
  var S1, S2 : String;
      S3     : String;
      i,j    : integer;
  begin
    S1 := 'Bonjour Sérénità';
    S3 := 'à';
    i  := Pos('Bonjour', S1);
    J  := PosEx(S3,S1,i);
    WriteLn('i = ', i, ' j = ', j);
    Readln();
  end.

When you run it, j = 18, when in fact, it should be 15.

Also, consider this other bit of code:
Code: Pascal
  program Project1;
  {$modeswitch UnicodeStrings}
  uses sysutils,Classes,strutils;
  var S1, S2 : String;
      C      : Char  ;
      i,j    : integer;
  begin
    S1 := 'Bonjour Sérénità';
    C  := WideChar('à');
    i  := Pos('Bonjour', S1);
    J  := PosEx(C,S1,i);
    WriteLn('i = ', i, ' j = ', j);
    Readln();
  end.

This code does not compile. I get an error because the compiler can't find a suitable PosEx function for the widechar:
project1.lpr(11,9) Error: Can't determine which overloaded function to call

Any suggestions on how to fix this?

Thaddy

  • Hero Member
  • *****
  • Posts: 14373
  • Sensorship about opinions does not belong here.
Re: Unicode, and IntToStr and PosEx: How to make it work?
« Reply #2 on: February 19, 2017, 10:30:14 am »
That code compiles with trunk.
I seem to remember that the particular posex was also backported to 3.0.2.
Which version of FPC are you using? Try upgrading to 3.0.2 first.

[edit]

I verified that it is indeed back-ported to 3.0.2.
So upgrade to 3.0.2 and your code works as expected.
« Last Edit: February 19, 2017, 10:50:53 am by Thaddy »
Object Pascal programmers should get rid of their "component fetish" especially with the non-visuals.

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4468
  • I like bugs.
Re: Unicode and IntToStr: Why the warning?
« Reply #3 on: February 19, 2017, 12:08:49 pm »
Quote from EganSolo: "When you run it, j = 18, when in fact, it should be 15."
No, Pos() returns a byte position and 18 is correct. UTF8Pos() would return 15.
Often you can use byte positions also with UTF-8 data. See examples here:
 http://wiki.freepascal.org/UTF8_strings_and_characters

Please also consider unit LazUnicode for truly portable code dealing with strings :
 http://wiki.freepascal.org/Better_Unicode_Support_in_Lazarus#CodePoint_functions_for_encoding_agnostic_code
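
The difference between a byte position and a UTF-16 code unit position can be sketched with plain RTL calls. This is only an illustration under stated assumptions: the strings are spelled with explicit byte and code unit values so the result does not depend on how the source file is saved, and the program and variable names are invented for the example.

```pascal
program BytePosVsCharPos;
{$mode objfpc}{$H+}
var
  A, NeedleA: RawByteString;   // raw UTF-8 bytes, no code page conversion
  U, NeedleU: UnicodeString;   // UTF-16 code units
begin
  // 'Bonjour Sérénità' as UTF-8, with the accented letters given as bytes.
  A := 'Bonjour S'#$C3#$A9'r'#$C3#$A9'nit'#$C3#$A0;
  NeedleA := #$C3#$A0;         // the two UTF-8 bytes of U+00E0

  // The same text as UTF-16, built from explicit code units.
  U := 'Bonjour S';
  U := U + WideChar($00E9) + 'r' + WideChar($00E9) + 'nit' + WideChar($00E0);
  NeedleU := WideChar($00E0);

  // Pos counts in the element type of the string it searches:
  // bytes for A, 16-bit code units for U.
  WriteLn('UTF-8  : Pos = ', Pos(NeedleA, A), ', Length = ', Length(A));
  WriteLn('UTF-16 : Pos = ', Pos(NeedleU, U), ', Length = ', Length(U));
end.
```

Counting by hand, the byte string is 19 bytes long with the two bytes of the last letter starting at byte 18, while the UTF-16 string has 16 code units with the same letter at index 16.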
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

wp

  • Hero Member
  • *****
  • Posts: 11916
Re: Unicode, and IntToStr and PosEx: How to make it work?
« Reply #4 on: February 19, 2017, 12:42:38 pm »
Juha, but he has {$modeswitch unicodestrings}. Should this make a string to a unicodestring?

JuhaManninen

Re: Unicode, and IntToStr and PosEx: How to make it work?
« Reply #5 on: February 19, 2017, 01:47:54 pm »
Quote from wp: "Juha, but he has {$modeswitch unicodestrings}. Should this make a string to a unicodestring?"
I think the first example was edited and {$modeswitch unicodestrings} was added. My answer is valid only if the code is without the modeswitch and is used with the Lazarus Unicode system.
For example LCL does not work well with {$modeswitch unicodestrings} currently. With pure FPC programs it is OK.

Thaddy

Re: Unicode, and IntToStr and PosEx: How to make it work?
« Reply #6 on: February 19, 2017, 02:19:54 pm »
Juha, the provided examples ARE FPC only and I have verified that they only work with 3.0.2 or trunk and not with 3.0.0. Lazarus has nothing to do with it.
It's plain UTF16 here.

The only thing I did not check here is a run against Delphi 2010+ to see if the results are the same. (But I did so after the initial bug fix)
The index returned is 16, which is correct. It is a character index and NOT a byte index.
One minor point: the cast to WideChar is not necessary (although it compiles as well), because the literal is already a UnicodeChar, of course.
Code: Pascal
  program untitled;
  {$mode delphiunicode}  // I would do that...
  uses sysutils,Classes,strutils;
  var S1, S2 : String;
      C      : Char  ;
      i,j    : integer;
  begin
    S1 := 'Bonjour Sérénità';
    C  := 'à';
    i  := Pos('Bonjour', S1);
    J  := PosEx(C,S1,i);
    WriteLn('i = ', i, ' j = ', j);
    Readln;
  end.

« Last Edit: February 19, 2017, 03:03:17 pm by Thaddy »

EganSolo

Re: Unicode, and IntToStr and PosEx: How to make it work?
« Reply #7 on: February 20, 2017, 12:01:52 am »

Thanks everyone for your feedback.
I downloaded FPC 3.0.2 and recompiled Lazarus 1.6.2 with it.
I've also extended my test code a bit more and reran it. You can copy this code, drop it into a simple command-line project, compile and run.


Code: Pascal
  program Unicode_and_pos_ex;
  {$modeswitch UnicodeStrings} //Switching String to UnicodeString
  uses sysutils,Classes,strutils;
  var S1,S2  : String;
      S3     : String;
      C1,C2  : Char  ;
      i,j,k,l: integer;
  begin
    S1 := 'Bonjour Sérénitàa';
    S3 := 'à';
  //  C  := 'à';       //<== Error: Incompatible types: got "Constant String" expected "WideChar"
    C1 := 'a';         //This compiles just fine. Why the difference?
    C2 := Char('à');   //this compiles.
    i  := Pos('Bonjour', S1);
    J  := PosEx(S3,S1,i);
    K  := PosEx(C1,S1,i);
    L  := PosEx(C2,S1,i);
    WriteLn('i = ', i, ', j = ', j, ', k = ' , k, ', l = ', l);

    S2 := '     ' + S1 + '    ';
    S2 := Trim(S2);
    If S1 = S2
    then Writeln('Trim works')
    else Writeln('Trim is broken');

    S2 := UpperCase(S1);
    Writeln('Uppercase S1 = ', S2);
    S2 := Lowercase(S2);
    Writeln('Lowercase S2 = ', S2);
    If S2 = S1
    then Writeln('Uppercase / Lowercase work')
    else Writeln('Uppercase / Lowercase broken');

    i   := Pos(' ',S1);
    S2  := Copy(S1,i+1,Length(S1));
    If S2 = 'Sérénitàa'
    then writeln('copy works')
    else writeln('copy broken');
    Readln();

   {returns:
      i = 1, j = 18, k = 20, l = 0
      Trim works
      Uppercase S1 = BONJOUR SAcRAcNITA A
      Lowercase S2 = bonjour sAcrAcnitA a
      Uppercase / Lowercase broken
      copy works
   }
  end.


Observations:

    • The compiler chokes on C := 'à' but not on C1 := 'a'?
    • PosEx for S3 returns 16 (correctly :) )
    • PosEx for C1 returns 20 for the position of 'a' but there aren't 20 characters in the string
    • PosEx returns 0 for the position of 'à' with C2
    • It would seem that UpperCase is broken? Perhaps I'm not using it right?


    So, 3.0.2 is a definite improvement over 3.0.0 but I think the PosEx for Char is still broken. As a workaround, we should be using PosEx with strings only until the PosEx with Char is fixed. Also, UpperCase is having issues or maybe I'm not understanding how to use it appropriately?








    EganSolo

    Re: Unicode, and IntToStr and PosEx: How to make it work?
    « Reply #8 on: February 20, 2017, 08:57:14 am »
    Here is another update.
    I've recompiled the RegExp unit with {$modeswitch UnicodeStrings}. To do this, I copied the unit and inserted the mode switch inside the copy. The unit has a few defines to turn Unicode support on, such as FPS_OS_UNICODE and UNICODE, but even with these switches turned on I was still receiving warnings about implicit type conversions between AnsiString and WideString. By switching UnicodeStrings on, these warnings went away.

    To run this test, include the RegExp unit first, then open it and save it as UnicodeRegExp. Remove RegExp and keep UnicodeRegExp. Add {$modeswitch UnicodeStrings} to UnicodeRegExp and then compile your program as below.
     
    The code below works as expected with Unicode strings, which is great news.
    Code: Pascal
    program TryUnicode;
    {$modeswitch UnicodeStrings}
    {$codepage utf-8} //We need this otherwise the test will fail spectacularly.
    uses
     UnicodeRegExp
     ;

    var S1, S2: String;
        RegExp : TRegExpr;
        c      : Char;
        Success: integer;
        failure: integer;
        aCount : integer;
    begin
      //First a straightforward test.
      S1 := '[o]+';
      S2 := 'book';
      RegExp := TRegExpr.Create(S1);
      RegExp.Compile;
      If RegExp.Exec(S2)
      then writeln('test works')
      else writeln('test fails');
      //A second, simple unicode enabled test
      S1 := '[à]+';
      S2 := 'bààk';
      RegExp.Expression := S1;
      RegExp.Compile;
      If RegExp.Exec(S2)
      then writeln('test works')
      else writeln('test fails');

      //A more elaborate unicode test
      {From the unicode table:
         U+00E0     à      c3 a0   LATIN SMALL LETTER A WITH GRAVE
         U+00E1     á      c3 a1   LATIN SMALL LETTER A WITH ACUTE
         U+00E2     â      c3 a2   LATIN SMALL LETTER A WITH CIRCUMFLEX
         U+00E3     ã      c3 a3   LATIN SMALL LETTER A WITH TILDE
         U+00E4     ä      c3 a4   LATIN SMALL LETTER A WITH DIAERESIS
         U+00E5     å      c3 a5   LATIN SMALL LETTER A WITH RING ABOVE
         U+00E6     æ      c3 a6   LATIN SMALL LETTER AE
         U+00E7     ç      c3 a7   LATIN SMALL LETTER C WITH CEDILLA
         U+00E8     è      c3 a8   LATIN SMALL LETTER E WITH GRAVE
         U+00E9     é      c3 a9   LATIN SMALL LETTER E WITH ACUTE
      }
      S1 := '[à-é]+';
      RegExp.Expression := S1;
      RegExp.Compile;
      Success := 0;
      failure := 0;
      aCount  := 0;
      For c := Char('à') to Char('è') do
      begin
         Write(c, ' -- ');
         System.Inc(aCount);
         If aCount mod 20 = 0
         then Writeln;
         S2 := 'b' + c + 'k';
         If RegExp.Exec(S2)
         then System.Inc(Success)
         else System.Inc(Failure);
      end;
      Writeln('Successful match(es): ' , success);
      Writeln('Failed matches: ', failure);
      RegExp.free;
      Readln();
    end.

    Using Unicode strings is far more involved than using straight ANSI strings. Coders using non-English languages may already be familiar with all of this, but for some of us, using Unicode will prove arduous until we figure out exactly what needs to be done. For instance, if you do not include {$codepage utf-8} in the code above, it will fail. Figuring out that the French accented characters I chose require utf-8 is neither intuitive nor straightforward. If I could specify the code page by language, that is, if there were a directive such as {$Unicodelanguage French}, then that would be far easier to handle, but as it is, it seems like a hit-or-miss experience.

    Still, I'm happy this is working.

    More to come as I continue to explore the art of the possible with unicode.
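
    What the {$codepage} directive changes can be checked with a very small program. This is a sketch under assumptions: the source file must really be saved as UTF-8, the program name is invented, and the directive is written as utf8 here, one of the spellings the compiler accepts.

```pascal
program CodepageDemo;
{$mode objfpc}{$H+}
{$codepage utf8}   // tells the compiler that this source file is UTF-8 encoded
var
  U: UnicodeString;
begin
  U := 'à';        // assumes the file really is saved as UTF-8
  // With the directive in place, the literal should arrive as a single
  // UTF-16 code unit. Printing its length and its first unit in hex shows
  // what the compiler actually stored.
  WriteLn('Length = ', Length(U), ', first unit = $', HexStr(Ord(U[1]), 4));
end.
```

    If the length is reported as 2, the compiler most likely read the two UTF-8 bytes of the letter as two separate characters, which is the kind of failure described above when the directive is missing.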

    Thaddy

    Re: Unicode, and IntToStr and PosEx: How to make it work?
    « Reply #9 on: February 20, 2017, 09:35:58 am »
    One remark: Use {$mode delphiunicode} and NOT {$modeswitch unicodestrings}
    There is more involved. A {$mode} is a whole set of {$modeswitches}

    You want 16 bit unicode UTF16 (delphi unicode), not UTF8 (Lazarus unicode).

    Once you do that, your problems disappear, mostly. (Like the C variable)

    Also: you can read the documentation and the source code to see whether a UTF16 overload is already available, as there is for Trim.
    All of that will come in time. If it doesn't work, check the docs and check the sourcecode.

    EganSolo

    Re: Unicode, and IntToStr and PosEx: How to make it work?
    « Reply #10 on: February 20, 2017, 10:02:30 pm »
    Thaddy,

    Thanks for the feedback. Please run the program below. First, enable {$modeswitch UnicodeStrings} and {$codepage utf-8}, then comment them out and enable {$mode delphiunicode}. You will readily see that the result with delphiunicode is far worse than the result with unicodestrings and utf-8. I may still be doing something wrong, though.

    Code: Pascal
    program Unicode_and_pos_ex;
    //First run this program with UnicodeStrings and utf-8 enabled, then comment them out and enable {$mode delphiunicode}
    //and run again.
    {$modeswitch UnicodeStrings}
    {$codepage utf-8}
    //{$mode delphiunicode}
    uses sysutils,strutils;

    var S1,S2  : String;
        S3     : String;
        C1,C2  : Char  ;

    Procedure InitVars;
    begin
              {0        1       }
              {12345678901234567}
       S1 := 'bonjour sérénitàa';
       S3 := 'à';
       //  C  := 'à';       //<== Error: Incompatible types: got "Constant String" expected "WideChar"
       C1 := 'a';         //This compiles just fine. Why the difference?
       C2 := Char('à');   //this compiles.
    end;

    Procedure TestUnicodePos;
    var i : integer;
    begin
       Writeln('searching for the substring ''jour'' in ''bonjour sérénitàa''');
       i  := Pos('jour', S1);
       If i = 4
       then writeln('Pos works for non-accented searches')
       else writeln('Pos broken for non-accented searches. Got ', i, ' when expecting 4');

       Writeln('searching for the substring ''à'' in ''bonjour sérénitàa''');
       i  := Pos(S3 , S1);
       If i = 16
       then writeln('Pos works for string accented searches')
       else writeln('Pos broken for string accented searches. Got ', i, ' when expecting 16');

       Writeln('searching for the char ''à'' in ''bonjour sérénitàa''');
       i  := Pos(C2 , S1);
       If i = 16
       then writeln('Pos works for accented character searches')
       else writeln('Pos broken for accented character searches. Got ', i, ' when expecting 16');
    end;

    Procedure TestUnicodePosEx;
    var i : integer;
    begin
       Writeln('searching for the substring ''à'' in ''bonjour sérénitàa'' after pos 4');
       i  := PosEx(S3 , S1, 4);
       If i = 16
       then writeln('PosEx works for string accented searches')
       else writeln('PosEx broken for string accented searches. Got ', i, ' when expecting 16');

       Writeln('searching for the char ''à'' in ''bonjour sérénitàa'' after post 4.');
       i  := PosEx(C2 , S1, 4);
       If i = 16
       then writeln('PosEx works for accented character searches')
       else writeln('PosEx broken for accented character searches. Got ', i, ' when expecting 16');
    end;

    Procedure TestUnicodeTrimming;
    const BlankPad = '      ';
    begin
       S2 := BlankPad + S1 + BlankPad;
       S2 := Trim(S2);
       If S1 = S2
       then Writeln('Trim works')
       else Writeln('Trim is broken');

       S2 := BlankPad + S1;
       S2 := TrimLeft(S2);
       If S1 = S2
       then Writeln('TrimLeft works')
       else Writeln('TrimLeft is broken');

       S2 := S1 + BlankPad;
       S2 := TrimRight(S2);
       If S1 = S2
       then Writeln('RightTrim works')
       else Writeln('RightTrim is broken');
    end;

    Procedure TestUnicodeCopying;
    var i: integer;
    begin
       i   := Pos(' ',S1);
       S2  := Copy(S1,i+1,Length(S1));
       If S2 = 'sérénitàa'
       then writeln('copy works')
       else writeln('copy broken');
    end;

    Procedure TestUnicodeUpperLower;
    begin
       Writeln('original string: ' , S1);
       S2 := UpperCase(S1);
       Writeln('Uppercase S1 = ', S2);
       If S2 = 'BONJOUR SÉRÉNITÀA'
       then Writeln('Uppercase works')
       else Writeln('Uppercase broken: got ', S2, ' instead of BONJOUR SÉRÉNITÀA');
       S2 := Lowercase(S2);
       Writeln('Lowercase S2 = ', S2);
       If S2 = S1
       then Writeln('Lowercase work')
       else Writeln('Lowercase broken: got ', S2, ' instead of bonjour sérénitàa');
    end;

    begin
     InitVars;
     TestUnicodePos;
     TestUnicodePosEx;
     TestUnicodeTrimming;
     TestUnicodeCopying;
     TestUnicodeUpperLower;
     Readln();
    end.
    « Last Edit: February 21, 2017, 12:49:53 am by EganSolo »

    howardpc

    • Hero Member
    • *****
    • Posts: 4144
    Re: Unicode, and IntToStr and PosEx: How to make it work?
    « Reply #11 on: February 20, 2017, 11:32:54 pm »
    You're going to have to get used to the idea that Unicode text held in an AnsiString, such as the UTF8 encoding used by default in the Lazarus IDE editor, is stored in a multi-byte encoding.
    Your equation
     1 character = 1 byte
    is a wrong assumption, except for the ASCII characters (the first 128 code points), which UTF8 encodes in one byte each.

    Pos() is not broken at all. It copes with multibyte-encoded strings, and it returns the byte position of the given 'character', which is only identical to its apparent 'character' position in the string if all characters are one-byte.
    For instance, if you check Length(S1) in your example, you'll see that it is not 17 bytes, but 20 bytes, since three (accented) characters occupy two bytes in the string (not one byte like the other low-value characters). The visual representation of the string gives no clue as to the underlying storage requirement of the string encoding.
    UTF8-encoded codepoints may require 1, 2, 3 or 4 bytes for each 'character' displayed. You can't tell by looking at a string display how many bytes are needed for each codepoint, you have to use the functions provided in LazUTF8 such as UTF8CharacterLength().
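
    The byte count versus codepoint count can also be checked without LazUTF8. The sketch below uses a hypothetical helper, Utf8CodePointCount, written for this illustration only; it assumes well-formed UTF8 input and simply skips continuation bytes. The test string is spelled with explicit byte values so the result does not depend on the encoding of the source file.

```pascal
program Utf8Count;
{$mode objfpc}{$H+}

// Hypothetical helper, not part of the RTL or LazUTF8.
// Every UTF-8 codepoint has exactly one byte that is not a continuation
// byte (bit pattern 10xxxxxx), so counting those bytes counts codepoints.
// Assumes S holds well-formed UTF-8.
function Utf8CodePointCount(const S: RawByteString): SizeInt;
var
  i: SizeInt;
begin
  Result := 0;
  for i := 1 to Length(S) do
    if (Ord(S[i]) and $C0) <> $80 then
      Inc(Result);
end;

var
  S: RawByteString;
begin
  // 'bonjour sérénitàa' with the accented letters given as UTF-8 bytes.
  S := 'bonjour s'#$C3#$A9'r'#$C3#$A9'nit'#$C3#$A0'a';
  WriteLn('bytes = ', Length(S), ', codepoints = ', Utf8CodePointCount(S));
end.
```

    For this string the two numbers differ by three, one extra byte for each of the three accented letters, which matches the 20 versus 17 described above.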

    EganSolo

    Re: Unicode, and IntToStr and PosEx: How to make it work?
    « Reply #12 on: February 21, 2017, 12:13:08 am »
    Hi Howard,

    Thanks for your reply. I amended my code to include a test for length. It actually works as expected, returning 17 and not 20.
    You might want to run this little program below to see what it does. Did you run your code with or without {$codepage utf-8}? Please see the code below.

    I get that we need to use different codepages. The theory is simple. The practice is not: Here's what I am struggling with as I go through this gyration:

    • Should I use {$modeswitch unicodestrings} or {$mode delphiunicode} like Thaddy suggested?
    • Which string methods are supported out-of-the-box in 3.0.2. for unicode? Clearly Uppercase is not. Which other methods fail?
    • In either {$mode delphiunicode} or {$modeswitch unicodestrings}, can I rely on TCharacter to figure out if a string is an identifier, a symbol, etc. regardless of the actual code page?
    • What about collations? Do I need to use them if I'm using utf-8?

    As you can see, the details are where there's a bump on the road, and I'd wager to say that I'm not the only one :)

    Code: Pascal
    program Unicode_and_pos_ex;
    {$modeswitch UnicodeStrings}
    {$codepage utf-8}
    //{$mode delphiunicode}
    uses sysutils,strutils;

    var S1,S2  : String;
        S3     : String;
        C1,C2  : Char  ;

    Procedure InitVars;
    begin
      {0        1       }
      {12345678901234567}
       S1 := 'bonjour sérénitàa';
       S3 := 'à';
       //  C  := 'à';       //<== Error: Incompatible types: got "Constant String" expected "WideChar"
       C1 := 'a';         //This compiles just fine. Why the difference?
       C2 := Char('à');   //this compiles.
    end;

    Procedure TestUnicodeLength;
    var len : integer;
    begin
      len := Length(S1);
      If len = 17
      then writeln('length works')
      else writeln('length is broken: got ', len, ' expected 17');
    end;

    Procedure TestUnicodePos;
    var i : integer;
    begin
       Writeln('searching for the substring ''jour'' in ''bonjour sérénitàa''');
       i  := Pos('jour', S1);
       If i = 4
       then writeln('Pos works for non-accented searches')
       else writeln('Pos broken for non-accented searches. Got ', i, ' when expecting 4');

       Writeln('searching for the substring ''à'' in ''bonjour sérénitàa''');
       i  := Pos(S3 , S1);
       If i = 16
       then writeln('Pos works for string accented searches')
       else writeln('Pos broken for string accented searches. Got ', i, ' when expecting 16');

       Writeln('searching for the char ''à'' in ''bonjour sérénitàa''');
       i  := Pos(C2 , S1);
       If i = 16
       then writeln('Pos works for accented character searches')
       else writeln('Pos broken for accented character searches. Got ', i, ' when expecting 16');
    end;

    Procedure TestUnicodePosEx;
    var i : integer;
    begin
       Writeln('searching for the substring ''à'' in ''bonjour sérénitàa'' after pos 4');
       i  := PosEx(S3 , S1, 4);
       If i = 16
       then writeln('PosEx works for string accented searches')
       else writeln('PosEx broken for string accented searches. Got ', i, ' when expecting 16');

       Writeln('searching for the char ''à'' in ''bonjour sérénitàa'' after post 4.');
       i  := PosEx(C2 , S1, 4);
       If i = 16
       then writeln('PosEx works for accented character searches')
       else writeln('PosEx broken for accented character searches. Got ', i, ' when expecting 16');
    end;

    Procedure TestUnicodeTrimming;
    const BlankPad = '      ';
    begin
       S2 := BlankPad + S1 + BlankPad;
       S2 := Trim(S2);
       If S1 = S2
       then Writeln('Trim works')
       else Writeln('Trim is broken');

       S2 := BlankPad + S1;
       S2 := TrimLeft(S2);
       If S1 = S2
       then Writeln('TrimLeft works')
       else Writeln('TrimLeft is broken');

       S2 := S1 + BlankPad;
       S2 := TrimRight(S2);
       If S1 = S2
       then Writeln('RightTrim works')
       else Writeln('RightTrim is broken');
    end;

    Procedure TestUnicodeCopying;
    var i: integer;
    begin
       i   := Pos(' ',S1);
       S2  := Copy(S1,i+1,Length(S1));
       If S2 = 'sérénitàa'
       then writeln('copy works')
       else writeln('copy broken');
    end;

    Procedure TestUnicodeUpperLower;
    begin
       Writeln('original string: ' , S1);
       S2 := UpperCase(S1);
       Writeln('Uppercase S1 = ', S2);
       If S2 = 'BONJOUR SÉRÉNITÀA'
       then Writeln('Uppercase works')
       else Writeln('Uppercase broken: got ', S2, ' instead of BONJOUR SÉRÉNITÀA');
       S2 := Lowercase(S2);
       Writeln('Lowercase S2 = ', S2);
       If S2 = S1
       then Writeln('Lowercase work')
       else Writeln('Lowercase broken: got ', S2, ' instead of bonjour sérénitàa');
    end;

    begin
     InitVars;
     TestUnicodeLength;
     TestUnicodePos;
     TestUnicodePosEx;
     TestUnicodeTrimming;
     TestUnicodeCopying;
     TestUnicodeUpperLower;
     Readln();
    end.
    « Last Edit: February 21, 2017, 12:50:51 am by EganSolo »

    EganSolo

    Re: Unicode, and IntToStr and PosEx: How to make it work?
    « Reply #13 on: February 21, 2017, 03:41:42 am »
    Alright, one more update :)
    • Using {$mode DelphiUnicode} does not work. In fact, I can't even find a way to assign a constant char to a char. C := 'à' does not work, nor does C := Char('à');
    • Using {$modeswitch UnicodeStrings} in conjunction with {$codepage utf-8} works... almost. If you run the code below, you will see that all the tests succeed but...
    • Console display for utf-8 is lacking. The console displays some of the accented letters but not all. I'm still trying to figure out why. I'm on Windows, by the way. Note from the code below that UTF8ToConsole, UTF8ToWinCP and UTF8ToSys don't do anything over and beyond what the rest of the code does. By the way, to make this work, manually add the package LazUtils to your project.
    • SetMultiByteConversionCodePage does nothing either, which is expected since the code page is set to utf-8. By the way, if you're hoping to replace the switch {$codepage utf-8} with the more dynamic call to SetMultiByteConversionCodePage, you will have to contend with the compiler errors if you're using char. To see what I am talking about, simply comment out the {$codepage utf-8} at the start of the program and uncomment SetMultiByteConversionCodePage in the InitVars method. You won't be able to compile the program.
    • Another suggestion was to use SetConsoleOutputCP, which I have commented out in my code. It actually degrades the output. I may not be using this right, but it doesn't help. If you wish to understand why, please check this excellent explanation here: http://forum.lazarus.freepascal.org/index.php?topic=26562.30 In fact, it doesn't seem possible to create a console application in Lazarus with full Unicode support. See the program below to understand what I mean.
    • I am hopeful though that for most string operations I need to perform including parsing, hashing, and regexp search, that there won't be any major issues. I will post back here what I find after I run a battery of regression tests to see if something is amiss.

    Code: Pascal
    program Unicode_and_pos_ex;
    {$modeswitch UnicodeStrings}
    {$codepage utf-8}
    //{$mode delphiunicode}
    uses Lazutf8, SysUtils, StrUtils, Windows, character;

    var S1,S2  : String;
        S3     : String;
    //  C      : WideChar; You will need to uncomment this line if you switch to delphiunicode.
        C      : Char;

    Procedure InitVars;
    begin
      {0        1       }
      {12345678901234567}
       S1 := 'bonjour sérénitàa';
       S3 := 'à';
       C  := 'à';      //Comment this out if you switch to delphiunicode.
    // C := Char('à');   You will have to uncomment this line if you switch to delphi unicode.

       {

         None of these calls affect the console or get it to render Unicode appropriately.

         SetMultiByteConversionCodePage(CP_UTF8);
         SetMultiByteRTLFileSystemCodePage(CP_UTF8);
         SetConsoleOutputCP(CP_UTF8); Degrades output to console. Result is worse when this is invoked.
         SetTextCodePage(Output, CP_UTF8); //Degrades output as well.

       }
    end;

    Procedure TestUnicodeLength;
    var len : integer;
    begin
      len := Length(S1);
      If len = 17
      then writeln('length works')
      else writeln('length is broken: got ', len, ' expected 17');
    end;

    Procedure TestUnicodePos;
    var i : integer;
    begin
       Writeln('searching for the substring ''jour'' in ''bonjour sérénitàa''');
       i  := Pos('jour', S1);
       If i = 4
       then writeln('Pos works for non-accented searches')
       else writeln('Pos broken for non-accented searches. Got ', i, ' when expecting 4');

       Writeln('searching for the substring ''à'' in ''bonjour sérénitàa''');
       i  := Pos(S3 , S1);
       If i = 16
       then writeln('Pos works for string accented searches')
       else writeln('Pos broken for string accented searches. Got ', i, ' when expecting 16');

       Writeln('searching for the char ''à'' in ''bonjour sérénitàa''');
       i  := Pos(C , S1);
       If i = 16
       then writeln('Pos works for accented character searches')
       else writeln('Pos broken for accented character searches. Got ', i, ' when expecting 16');
    end;

    Procedure TestUnicodePosEx;
    var i : integer;
    begin
       Writeln('searching for the substring ''à'' in ''bonjour sérénitàa'' after pos 4');
       i  := PosEx(S3 , S1, 4);
       If i = 16
       then writeln('PosEx works for string accented searches')
       else writeln('PosEx broken for string accented searches. Got ', i, ' when expecting 16');

       Writeln('searching for the char ''à'' in ''bonjour sérénitàa'' after post 4.');
       i  := PosEx(C , S1, 4);
       If i = 16
       then writeln('PosEx works for accented character searches')
       else writeln('PosEx broken for accented character searches. Got ', i, ' when expecting 16');
    end;

    Procedure TestUnicodeTrimming;
    const BlankPad = '      ';
    begin
       S2 := BlankPad + S1 + BlankPad;
       S2 := Trim(S2);
       If S1 = S2
       then Writeln('Trim works')
       else Writeln('Trim is broken');

       S2 := BlankPad + S1;
       S2 := TrimLeft(S2);
       If S1 = S2
       then Writeln('TrimLeft works')
       else Writeln('TrimLeft is broken');

       S2 := S1 + BlankPad;
       S2 := TrimRight(S2);
       If S1 = S2
       then Writeln('RightTrim works')
       else Writeln('RightTrim is broken');
    end;

    Procedure TestUnicodeCopying;
    var i: integer;
    begin
       i   := Pos(' ',S1);
       S2  := Copy(S1,i+1,Length(S1));
       If S2 = 'sérénitàa'
       then writeln('copy works')
       else writeln('copy broken');
    end;

    Procedure TestUnicodeUpperLower;
    const lcAconst : WideChar = 'à';
          ucAconst : WideChar = 'À';
    var
      ucA : char;
    begin
       Writeln('original string: ' , S1);
       S2 := TCharacter.ToUpper(S1);
       Writeln('Uppercase S1 = ', S2);
       If S2 = 'BONJOUR SÉRÉNITÀA'
       then Writeln('Uppercase works')
       else Writeln('Uppercase broken: got ', S2, ' instead of BONJOUR SÉRÉNITÀA');
       S2 := TCharacter.ToLower(S2);
       Writeln('Lowercase S2 = ', S2);
       If S2 = S1
       then Writeln('Lowercase work')
       else Writeln('Lowercase broken: got ', S2, ' instead of sérénitàa');
       ucA := TCharacter.ToUpper(lcAconst);
       If ucA = ucAConst
       then writeln('character uppercase worked')
       else writeln('character uppercase broken');
    end;

    Procedure TestIdentifier(Const S: String);
    var i : integer;
    begin
       Write('String ' , S);
       If length(S) = 0 then exit;
       With TCharacter do
         For i := 1 to length(S) do
         If Not (IsLetterOrDigit(S[i]) or (S[i] = '_'))
         then begin
           Writeln(' is not an identifier');
           exit;
         end;
       Writeln(' is an identifier');
    end;

    Procedure TestIdentifiers;
    const French_Id1   = '___Éternité133'   ;
          French_Id2   = 'Pérénial_Témérité';
          Croatian_Id1 = 'Vječnost'         ;
          Russian_Id1  = 'Вечность'         ;
          Arabic_Id1   = 'خلود123'          ;
          NonId        = 'J''usqu''à demain';
    begin
       TestIdentifier(French_Id1);
       TestIdentifier(French_Id2);
       TestIdentifier(Croatian_Id1);
       TestIdentifier(Russian_Id1);
       TestIdentifier(Arabic_Id1);
       TestIdentifier(NonId);
    end;

    begin
     InitVars;
     TestUnicodeLength;
     TestUnicodePos;
     TestUnicodePosEx;
     TestUnicodeTrimming;
     TestUnicodeCopying;
     TestUnicodeUpperLower;
     TestIdentifiers;
     Readln();
    end.

     
