
Topic: Is this string function efficient?

AlexTP

  • Hero Member
  • *****
  • Posts: 2365
    • UVviewsoft
Re: Is this string function efficient?
« Reply #15 on: January 10, 2018, 11:08:11 am »
Time test.

Results in milliseconds (lower is faster): mse 8, alextp 13, thaddy 18

 
Code: Pascal
unit Unit1;

{$mode objfpc}{$H+}

interface

uses
  Classes, SysUtils, Forms, Controls, Graphics, Dialogs;

type

  { TForm1 }

  TForm1 = class(TForm)
    procedure FormCreate(Sender: TObject);
  private

  public

  end;

var
  Form1: TForm1;

implementation

{$R *.lfm}

{ TForm1 }

// "mse" variant: widens each byte to a 16-bit unit through raw pointers.
// Only valid when S holds nothing but ASCII bytes (0..127).
function SConvertUtf8ToWideForAscii(const S: string): UnicodeString;
var
  ps, pe: PByte;
  pd: PWord;
  NLen: integer;
begin
  NLen:= Length(S);
  SetLength(Result, NLen);
  ps:= pointer(S);
  pe:= ps + NLen;
  pd:= pointer(Result);
  while ps < pe do begin
    pd^:= ps^;
    Inc(ps);
    Inc(pd);
  end;
end;

// "alextp" variant: the same widening, done with indexed string access.
function SConv1(const s: string): Unicodestring;
var
  i: integer;
begin
  setlength(Result, Length(s));
  for i:= 1 to length(s) do
    result[i]:= widechar(ord(s[i]));
end;

// "thaddy" variant: lets the RTL convert UTF-8 to UTF-16 on assignment.
function SConv2(const s: UTF8string): Unicodestring;
begin
  result:= s;
end;

procedure TForm1.FormCreate(Sender: TObject);
var
  t1,t2,t3: QWord;
  i: integer;
  su: unicodestring;
  sl: tstringlist;
begin
  sl:= tstringlist.create;
  try
    sl.LoadFromFile('/home/user/test/big/3M.xml');

    // Time each variant over every line of the file (milliseconds).
    t1:= GetTickCount64;
    for i:= 0 to sl.count-1 do
      su:= SConvertUtf8ToWideForAscii(sl[i]);
    t1:= GetTickCount64-t1;

    t2:= GetTickCount64;
    for i:= 0 to sl.count-1 do
      su:= sconv1(sl[i]);
    t2:= GetTickCount64-t2;

    t3:= GetTickCount64;
    for i:= 0 to sl.count-1 do
      su:= SConv2(sl[i]);
    t3:= GetTickCount64-t3;

    caption:= format('mse %d/ alextp %d/ thaddy %d', [t1, t2, t3]);
  finally
    sl.Free; // the original listing never released the list
  end;
end;

end.

Thaddy

  • Hero Member
  • *****
  • Posts: 14157
  • Probably until I exterminate Putin.
Re: Is this string function efficient?
« Reply #16 on: January 10, 2018, 11:12:59 am »
What is the file content?
Specialize a type, not a var.

AlexTP

« Reply #17 on: January 10, 2018, 11:19:04 am »
File content is https://ufile.io/pagmk

munair

  • Hero Member
  • *****
  • Posts: 798
  • compiler developer @SharpBASIC
    • SharpBASIC
Re: Is this string function efficient?
« Reply #18 on: January 10, 2018, 11:31:16 am »
Simply put, pure ASCII covers only the byte range 0 to 127, which causes no trouble in UTF-8. Bytes 128 to 255 are considered extended ASCII, and the characters they represent depend on the code page (ISO 8859, for example). BUT UTF-8 uses bytes in that range as the lead and continuation bytes of multi-byte sequences, so extended ASCII is NOT compatible with UTF-8.

Also, Unicode by itself says nothing about how characters are stored. It only defines code points, which UTF-8, UTF-16 or UTF-32 then encode as bytes.

While UTF-8 is ASCII compatible, UTF-16 and UTF-32 are not: UTF-16 uses at least two bytes per code point (four for a surrogate pair), and UTF-32 always uses four bytes.

So unlike what I read in previous posts here, UTF-16 ASCII doesn't exist. Why would you want to convert UTF-8 to a wide string in order to get ASCII?
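
A minimal sketch of the byte-range point above, not taken from any post in this thread: it scans a string and reports whether every byte is in 0..127. The name IsPureAscii is hypothetical, and the input is assumed to be raw bytes (RawByteString) so that no code page conversion happens on the way in.

```pascal
program AsciiCheck;
{$mode objfpc}{$H+}

// Hypothetical helper (an assumption, not an RTL function): returns True
// when every byte of S lies in the ASCII range 0..127.
function IsPureAscii(const S: RawByteString): Boolean;
var
  i: SizeInt;
begin
  for i := 1 to Length(S) do
    if Ord(S[i]) > 127 then
      Exit(False);
  Result := True;
end;

var
  Plain, Accented: RawByteString;
begin
  Plain := 'keep it simple';

  // 'caf' followed by the UTF-8 encoding of U+00E9 (bytes $C3 $A9).
  // The two placeholder bytes are overwritten so the source file's own
  // encoding does not matter.
  Accented := 'caf??';
  Accented[4] := #$C3;
  Accented[5] := #$A9;

  Writeln(IsPureAscii(Plain));     // all bytes below 128
  Writeln(IsPureAscii(Accented));  // contains bytes above 127
end.
```

A check like this is what decides whether a byte-widening shortcut is safe for a given input.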
« Last Edit: January 10, 2018, 11:44:24 am by Munair »
keep it simple

munair

« Reply #19 on: January 10, 2018, 11:39:08 am »
Quote
For ASCII length(utf8) = length(utf16).

Wrong. See my previous post.
keep it simple

munair

« Reply #20 on: January 10, 2018, 11:40:16 am »
Quote
    Quote
    For ASCII length(utf8) = length(utf16).
Just for ASCII and ONLY for ASCII... >:( >:( ;D ;D You know that... >:D

Thaddy, you fell into the same trap too.
keep it simple

Thaddy

« Reply #21 on: January 10, 2018, 12:05:14 pm »
Quote from: munair
    Quote
        Quote
        For ASCII length(utf8) = length(utf16).
    Just for ASCII and ONLY for ASCII... >:( >:( ;D ;D You know that... >:D
Thaddy, you fell into the same trap too.

Nope. Where? I proposed the generic case of converting between UTF-8 and UTF-16. That is a plain assignment and can never go wrong.
Assigning ASCII (actually both mean ANSI) can go wrong, because ANSI strings can have different code pages, and then the code is broken.
Look at my signature: it is very easy for them to prove they are right in many cases, but I need just one case to prove them wrong. Which I did.
« Last Edit: January 10, 2018, 12:06:50 pm by Thaddy »
Specialize a type, not a var.

munair

« Reply #22 on: January 10, 2018, 12:23:39 pm »
You went wrong by confirming that "For ASCII length(utf8) = length(utf16)". By definition, ASCII is a one-byte encoding, so it is not a wide string.

Again, ASCII doesn't have code pages. It is only the first 128 byte values, a fixed one-byte character table shared by most code pages, and that is where any compatibility ends.

If Unicode encodings are to be converted to a wide string, then the conversion function shouldn't have ASCII in its name, because that's misleading. Even ANSI is misleading, because it is an extension of ASCII and can hold different characters depending on the code page. Therefore ANSI is not UTF-8 compatible.
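
To make the disputed length claim concrete, here is a small sketch (not from the thread) that compares the byte count of a UTF-8 string with the number of UTF-16 code units after decoding. It assumes the input bytes are valid UTF-8 and uses the RTL's UTF8Decode. The variable names are made up for illustration.

```pascal
program LengthDemo;
{$mode objfpc}{$H+}

var
  Utf8Bytes: RawByteString;
  Wide: UnicodeString;
begin
  // ASCII only: one byte per character in UTF-8 and one 16-bit unit
  // per character in UTF-16, so both lengths agree.
  Utf8Bytes := 'abc';
  Wide := UTF8Decode(Utf8Bytes);
  Writeln(Length(Utf8Bytes), ' bytes -> ', Length(Wide), ' UTF-16 units');

  // 'a' followed by U+00E9, which UTF-8 encodes as the two bytes $C3 $A9.
  // The placeholder bytes are overwritten so the source file's encoding
  // plays no role.
  Utf8Bytes := 'a??';
  Utf8Bytes[2] := #$C3;
  Utf8Bytes[3] := #$A9;
  Wide := UTF8Decode(Utf8Bytes);
  Writeln(Length(Utf8Bytes), ' bytes -> ', Length(Wide), ' UTF-16 units');
end.
```

The first line should report 3 bytes and 3 units, the second 3 bytes and 2 units. The equality holds only while every character is in the 0..127 range, which is the narrow case the original poster asked about.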
« Last Edit: January 10, 2018, 12:40:12 pm by Munair »
keep it simple

AlexTP

« Reply #23 on: January 10, 2018, 12:24:21 pm »
Thaddy, I needed a fast function for ASCII [0..127] only.

munair

« Reply #24 on: January 10, 2018, 12:37:40 pm »
Quote from: AlexTP
Thaddy, I needed a fast function for ASCII [0..127] only.

You can iterate through a UTF-8 string and test for one-byte code points using the UTF8CodepointSize() function. That leaves you with plain ASCII and drops any multi-byte characters. However, this requires a UTF-8 encoded string, not an undefined raw or Unicode string.
« Last Edit: January 10, 2018, 12:41:24 pm by Munair »
keep it simple

Thaddy

« Reply #25 on: January 10, 2018, 12:41:24 pm »
Quote from: munair
    Quote from: AlexTP
    Thaddy, I needed a fast function for ASCII [0..127] only.
You can iterate through a UTF-8 string and test for one-byte code points using the UTF8CodepointSize() function. That leaves you with plain ASCII and drops any multi-byte characters.

and will be the slowest... :D
Specialize a type, not a var.

AlexTP

« Reply #26 on: January 10, 2018, 12:42:13 pm »
@Munair,
that doesn't make sense for a fast function.

munair

« Reply #27 on: January 10, 2018, 12:51:17 pm »
Quote from: AlexTP
@Munair,
that doesn't make sense for a fast function.

UTF-8 and UTF-16 are not fast by definition, because their code points have variable byte lengths. There's no telling where the next single-byte (ASCII) character is without iterating through the string.

Thaddy's example, i.e.
Code: Pascal
program untitled;
{$ifdef fpc}{$mode delphi}{$H+}{$I-}{$endif}
var
  s1: UTF8String = 'Whatever'; // or a Lazarus "string"...Juha...  FPC strings are either ShortString, AnsiString or UnicodeString, never UTF8: that's only the case in Lazarus. Confused???
  s2: UnicodeString;
begin
  s2 := s1;  // conversion is lossless and automatic and fast. Faster than you can do by hand...
  writeln(s2);
end.

will leave you with a string of wide characters. It's not ASCII.
« Last Edit: January 10, 2018, 01:31:23 pm by Munair »
keep it simple

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4458
  • I like bugs.
Re: Is this string function efficient?
« Reply #28 on: January 10, 2018, 02:52:56 pm »
Quote from: munair
You can iterate through a UTF-8 string and test for one-byte codepoints using the UTF8CodepointSize() function. It will leave you with plain ASCII but will drop any multibyte characters. However, this requires a UTF-8 encoded string, not an undefined raw or Unicode string.

UTF-8 is backwards compatible with plain ASCII. You can iterate over the data using the good old byte offsets. Just ignore anything outside ASCII.
That is why old XML parsers keep working: all tags are in the plain ASCII range. The same goes for parsers for most programming languages, like Codetools for Pascal.
UTF8CodepointSize() is needed only when you want to study code points outside of plain ASCII.
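
A rough sketch of the byte-offset approach described above, not code from Lazarus or Codetools: it walks the UTF-8 data one byte at a time, keeps bytes below 128, and ignores every lead or continuation byte. The function name AsciiOnlyToUnicode is hypothetical.

```pascal
program AsciiFilter;
{$mode objfpc}{$H+}

// Hypothetical helper: copies only bytes 0..127 of a UTF-8 string into
// a UnicodeString. Every byte of a multi-byte sequence is >= 128 in
// UTF-8, so those sequences are skipped whole without decoding them.
function AsciiOnlyToUnicode(const S: RawByteString): UnicodeString;
var
  i, Count: SizeInt;
begin
  SetLength(Result, Length(S));   // upper bound, trimmed below
  Count := 0;
  for i := 1 to Length(S) do
    if Ord(S[i]) < 128 then
    begin
      Inc(Count);
      Result[Count] := WideChar(Ord(S[i]));
    end;
  SetLength(Result, Count);
end;

var
  Input: RawByteString;
  Filtered: UnicodeString;
begin
  // 'a', then U+00E9 as the UTF-8 bytes $C3 $A9, then 'b'.
  Input := 'a??b';
  Input[2] := #$C3;
  Input[3] := #$A9;

  Filtered := AsciiOnlyToUnicode(Input);
  Writeln(Length(Filtered));   // only 'a' and 'b' survive
end.
```

No call to UTF8CodepointSize() is needed here, because the test on each byte is enough to tell ASCII from everything else.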
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

ASerge

  • Hero Member
  • *****
  • Posts: 2212
Re: Is this string function efficient?
« Reply #29 on: January 10, 2018, 03:31:26 pm »
Quote from: AlexTP
Time test
result: mse 8, alextp 13, thaddy 18

This test also shows that the @mse function is the fastest:
Code: Pascal
{$APPTYPE CONSOLE}
{$MODE OBJFPC}
{$LONGSTRINGS ON}
program Project1;

uses SysUtils;

type
  TTestFunction = function(const S: string): UnicodeString;

// Calls Func on a 100-character ASCII string Times times and prints
// the elapsed time in milliseconds.
procedure Measure(Func: TTestFunction; const Description: string;
  Times: Integer = 1000000);
var
  Start, Elapsed: Int64;
  i: Integer;
  TestString: string;
  // On Windows, WideString is a COM BSTR rather than a UnicodeString,
  // so this assignment adds an extra copy to every call there.
  Unused: WideString;
begin
  TestString := StringOfChar('8', 100);
  Start := GetTickCount64;
  for i := 1 to Times do
    Unused := Func(TestString);
  Elapsed := GetTickCount64 - Start;
  Writeln(Description, ' - ', Elapsed);
end;

// Indexed widening of each byte.
function AlextpFunc(const S: string): UnicodeString;
var
  i: Integer;
begin
  SetLength(Result, Length(S));
  for i := 1 to Length(S) do
    Result[i] := Widechar(Ord(S[i]));
end;

// Implicit RTL conversion on assignment.
function ThaddyFunc(const S: string): UnicodeString;
begin
  Result := S;
end;

// Pointer-based widening of each byte; valid for ASCII input only.
function mseFunc(const S: string): UnicodeString;
var
  PStart, PEnd: PByte;
  PDest: PWord;
  Len: SizeInt;
begin
  Len := Length(S);
  SetLength(Result, Len);
  PStart := Pointer(S);
  PEnd := PStart + Len;
  PDest := Pointer(Result);
  while PStart < PEnd do
  begin
    PDest^ := PStart^;
    Inc(PStart);
    Inc(PDest);
  end;
end;

// Explicit UTF-8 decoding through the RTL.
function UTF8DecodeWrap(const S: string): UnicodeString;
begin
  Result := UTF8Decode(S);
end;

begin
  Measure(@AlextpFunc, 'AlextpFunc');
  Measure(@ThaddyFunc, 'ThaddyFunc');
  Measure(@mseFunc, 'mseFunc');
  Measure(@UTF8DecodeWrap, 'UTF8DecodeWrap');
  Readln;
end.

On my computer (milliseconds for 1,000,000 calls on a 100-character string):
Quote
AlextpFunc - 1201
ThaddyFunc - 2138
mseFunc - 249
UTF8DecodeWrap - 2231
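
Since the fastest variant is only correct for ASCII input, one possible compromise is sketched below. This is an assumption of mine rather than anything proposed in the thread: widen bytes directly while they stay below 128, and hand the whole string to UTF8Decode as soon as a larger byte shows up. The name Utf8ToUnicodeFastAscii is hypothetical, and the input is assumed to be UTF-8.

```pascal
program SafeFastConvert;
{$mode objfpc}{$H+}

// Hypothetical helper: pointer-based widening for the ASCII case, with
// a fallback to the RTL decoder when a non-ASCII byte is found.
function Utf8ToUnicodeFastAscii(const S: RawByteString): UnicodeString;
var
  PSrc, PEnd: PByte;
  PDest: PWord;
begin
  SetLength(Result, Length(S));
  PSrc := Pointer(S);
  PEnd := PSrc + Length(S);
  PDest := Pointer(Result);
  while PSrc < PEnd do
  begin
    if PSrc^ > 127 then
      Exit(UTF8Decode(S));   // not pure ASCII: decode properly instead
    PDest^ := PSrc^;
    Inc(PSrc);
    Inc(PDest);
  end;
end;

var
  Input: RawByteString;
begin
  Input := 'abc';
  Writeln(Length(Utf8ToUnicodeFastAscii(Input)));   // fast path

  // 'a' followed by U+00E9 as the UTF-8 bytes $C3 $A9.
  Input := 'a??';
  Input[2] := #$C3;
  Input[3] := #$A9;
  Writeln(Length(Utf8ToUnicodeFastAscii(Input)));   // fallback path
end.
```

The extra comparison per byte costs something, so this would need to be timed with the same Measure harness before drawing conclusions about its speed.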

 
