Author Topic: UTF8PosEx function?  (Read 10166 times)

typo

  • Hero Member
  • *****
  • Posts: 3051
UTF8PosEx function?
« on: May 14, 2013, 01:31:30 am »
Is there any UTF8PosEx function?

Leledumbo

  • Hero Member
  • *****
  • Posts: 8757
  • Programming + Glam Metal + Tae Kwon Do = Me
Re: UTF8PosEx function?
« Reply #1 on: May 14, 2013, 04:28:03 am »
It seems not, judging from a quick grep of the LazUTF8 unit in LazUtils.

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11453
  • FPC developer.
Re: UTF8PosEx function?
« Reply #2 on: May 14, 2013, 09:09:09 am »
Quote from: typo
Is there any UTF8PosEx function?

How is that different from the normal Pos?

theo

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 1927
Re: UTF8PosEx function?
« Reply #3 on: May 14, 2013, 10:56:46 am »
Quote from: marcov
How is that different from the normal Pos?

Quote
PosEx

Search for the occurrence of a character (or substr) in a string, starting at a certain position.

http://www.freepascal.org/docs-html/rtl/strutils/posex.html

Bart

  • Hero Member
  • *****
  • Posts: 5290
    • Bart en Mariska's Webstek
Re: UTF8PosEx function?
« Reply #4 on: May 14, 2013, 03:13:44 pm »
Quote from: marcov
How is that different from the normal Pos?

Pos('ä','ëä') -> 3 (guessing here; if I recall correctly, ë is a 2-byte codepoint in UTF-8)
Utf8Pos('ä','ëä') -> 2

Mutatis mutandis for PosExUtf8

Bart

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11453
  • FPC developer.
Re: UTF8PosEx function?
« Reply #5 on: May 14, 2013, 03:24:37 pm »
Quote from: Bart
Pos('ä','ëä') -> 3 (guessing here; if I recall correctly, ë is a 2-byte codepoint in UTF-8)
Utf8Pos('ä','ëä') -> 2

AFAIK, no UTF-8 routines currently use indexes based on codepoints, and that usually isn't necessary anyway (as long as you pass valid codepoint sequences).

Bart

  • Hero Member
  • *****
  • Posts: 5290
    • Bart en Mariska's Webstek
Re: UTF8PosEx function?
« Reply #6 on: May 14, 2013, 11:32:22 pm »
Quote from: marcov
AFAIK, no UTF-8 routines currently use indexes based on codepoints, and that usually isn't necessary anyway (as long as you pass valid codepoint sequences).

Looks like you're wrong?

Code:
  writeln('Utf8Pos = ',Utf8Pos('ä','ëïä'));
  writeln('Pos = ',Pos('ä','ëïä'));

Output:

Code:
Utf8Pos = 3
Pos = 5

Quote from: LazUtf8
function UTF8Pos(const SearchForText, SearchInText: string;
  StartPos: SizeInt = 1): PtrInt;
// returns the character index, where the SearchForText starts in SearchInText
// an optional StartPos can be given (in UTF-8 codepoints, not in byte)
// returns 0 if not found

And this also answers the original question.
Utf8Pos is "Utf8PosEx", just use the StartPos parameter.

Bart
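
As a rough illustration of the point above, the StartPos parameter lets UTF8Pos step through every match the way PosEx would. This is only a sketch: it assumes the UTF8Pos signature quoted from LazUtf8 above, and the program name and the constants A_UML, E_UML and I_UML are made up for the demo. The UTF-8 bytes are spelled out so the result does not depend on the encoding of the source file.

```pascal
program Utf8PosExDemo;
{$mode objfpc}{$H+}

uses
  LazUTF8; // from the LazUtils package

const
  // Illustrative constants (not part of LazUTF8), given as raw UTF-8 bytes:
  A_UML = #$C3#$A4; // ä
  E_UML = #$C3#$AB; // ë
  I_UML = #$C3#$AF; // ï

var
  Haystack: string;
  Idx: PtrInt;
begin
  Haystack := E_UML + A_UML + I_UML + A_UML; // 'ëäïä'
  // The first search starts at codepoint 1 (the default StartPos).
  Idx := UTF8Pos(A_UML, Haystack);
  while Idx > 0 do
  begin
    WriteLn('found at codepoint index ', Idx);
    // Resume one codepoint after the previous match.
    Idx := UTF8Pos(A_UML, Haystack, Idx + 1);
  end;
end.
```

Both the StartPos argument and the result are counted in codepoints, so the loop never has to deal with byte offsets.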

CM630

  • Hero Member
  • *****
  • Posts: 1091
  • I am not sure that I understand you.
    • http://sourceforge.net/u/cm630/profile/
Re: UTF8PosEx function?
« Reply #7 on: June 29, 2017, 10:41:12 am »
I could not find a function that searches backwards in UTF-8 strings.
Is there a UTF8RPos (or some other equivalent of RPosEx) that ships with Lazarus, or should I use an external solution?
Lazarus 3.2 32 bit (sometimes 64 bit); FPC 3.2.2; rev: Lazarus_3_0 on Win10 64bit.

howardpc

  • Hero Member
  • *****
  • Posts: 4144
Re: UTF8PosEx function?
« Reply #8 on: June 29, 2017, 12:12:46 pm »
You could use this function (uses LazUTF8):

Code: Pascal
function UTF8RPos(const SearchForText, SearchInText: string): PtrInt;
// returns the character index at which the last occurrence of SearchForText starts in SearchInText
// the search works backwards from the end of SearchInText
// returns 0 if not found
var
  r: String;
begin
  r:=Utf8ReverseString(SearchInText);
  // byte index of the reversed search text within the reversed string
  Result:=Pos(Utf8ReverseString(SearchForText), r);
  if Result > 0 then
    // character index (in the original string) of the last character of the match,
    // moved back by the match length so that it points at the first character
    Result:=UTF8Length(r) - UTF8Length(PChar(r), Pred(Result)) - UTF8Length(SearchForText) + 1;
end;

CM630

  • Hero Member
  • *****
  • Posts: 1091
  • I am not sure that I understand you.
    • http://sourceforge.net/u/cm630/profile/
Re: UTF8PosEx function?
« Reply #9 on: June 30, 2017, 09:12:25 am »

Thanks,
I will modify it to add an offset.
But won't Utf8ReverseString decrease performance significantly?
« Last Edit: June 30, 2017, 02:36:34 pm by CM630 »

Thaddy

  • Hero Member
  • *****
  • Posts: 14373
  • Sensorship about opinions does not belong here.
Re: UTF8PosEx function?
« Reply #10 on: June 30, 2017, 09:31:30 am »
Yes, it does affect performance (big time!). In most cases it is cheaper to iterate forward through all matches until the last occurrence (but not on very long strings).
The basic algorithm is exactly the same from left to right as from right to left; only the direction changes. But that is not implemented.
OTOH: this works.
Object Pascal programmers should get rid of their "component fetish" especially with the non-visuals.

howardpc

  • Hero Member
  • *****
  • Posts: 4144
Re: UTF8PosEx function?
« Reply #11 on: June 30, 2017, 09:42:34 am »
Quote from: CM630
won't Utf8ReverseString decrease performance significantly?

Decrease performance compared to what? Compared to a handcrafted assembler routine? Probably.
Compared to not having a UTF8RPos? That would be a significant increase in performance.

If you find it too slow, then by all means apply optimisations... but here on simple strings it returns results instantaneously. Or, as Thaddy suggests, use a different algorithm that avoids UTF8ReverseString if you find a significant lag in using it. However, a different algorithm would probably duplicate some of the code underlying UTF8ReverseString anyway.
Working with multibyte encodings is necessarily an order of magnitude or more slower than working with single-byte ANSI strings, however you do it.
Increasing string encoding complexity means more operations and more data pushed through your CPU, i.e. a longer time taken (whether microseconds or milliseconds) for the routine to complete.


Thaddy

  • Hero Member
  • *****
  • Posts: 14373
  • Sensorship about opinions does not belong here.
Re: UTF8PosEx function?
« Reply #12 on: June 30, 2017, 09:51:49 am »
Reversing the string is necessarily slow. I did not complain about the speed but about the choice of algorithm. It works. It is not a smart solution.
I explained that the algorithm for forward scanning is the same as for backward scanning. Get it? It is just not implemented for silly mode (UTF8 hacks based on AnsiString).
But hey if it works you have my blessing.... >:D O:-) And it works... O:-)

An incremental approach (also not ideal) is a trade-off between string length and the number of occurrences. In most cases, say for strings up to High(Word) in length, it is also much faster than reversing the string, unless the search string occurs very often (more than roughly 25% of the total length). What should be done is to implement RPosEx like PosEx, starting from Length and iterating in descending order. That is the only really efficient implementation: instead of reversing the string, you just reverse the direction. All extremely basic stuff.
« Last Edit: June 30, 2017, 10:00:40 am by Thaddy »
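
A minimal sketch of the "reverse the direction, not the string" idea described above, using only the RTL. UTF8RPosSketch is a hypothetical name and is not the Utf8RPos that ships with Lazarus; it assumes both arguments are valid UTF-8. The needle and haystack in the demo are spelled out as raw UTF-8 bytes so the result does not depend on the source file encoding.

```pascal
program Utf8RPosSketchDemo;
{$mode objfpc}{$H+}

// Hypothetical helper, not part of LazUTF8: finds the last occurrence of
// Needle in Haystack by trying byte offsets from right to left, then
// converts the byte offset of the match into a 1-based codepoint index.
function UTF8RPosSketch(const Needle, Haystack: string): SizeInt;
var
  BytePos, i: SizeInt;
begin
  Result := 0;
  if (Needle = '') or (Length(Needle) > Length(Haystack)) then
    Exit;
  // A byte-wise match between valid UTF-8 strings always starts on a
  // codepoint boundary, because the needle begins with a lead byte.
  for BytePos := Length(Haystack) - Length(Needle) + 1 downto 1 do
    if CompareByte(Haystack[BytePos], Needle[1], Length(Needle)) = 0 then
    begin
      // Count the codepoints up to and including the match start: every
      // byte that is not a continuation byte (10xxxxxx) starts a codepoint.
      for i := 1 to BytePos do
        if (Ord(Haystack[i]) and $C0) <> $80 then
          Inc(Result);
      Exit;
    end;
end;

const
  // Illustrative constants, given as raw UTF-8 bytes:
  A_UML = #$C3#$A4; // ä
  E_UML = #$C3#$AB; // ë
  I_UML = #$C3#$AF; // ï

begin
  // 'ëäïä': the last 'ä' is the 4th codepoint
  WriteLn(UTF8RPosSketch(A_UML, E_UML + A_UML + I_UML + A_UML));
end.
```

Note that no reversed copy of the string is allocated, but converting the byte offset into a codepoint index still costs one forward pass over the bytes before the match.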

wp

  • Hero Member
  • *****
  • Posts: 11916
Re: UTF8PosEx function?
« Reply #13 on: June 30, 2017, 10:14:41 am »
Quote from: CM630
I could not find a function that searches backwards in UTF-8 strings.
Is there a UTF8RPos (or some other equivalent of RPosEx) that ships with Lazarus, or should I use an external solution?
There IS a Utf8RPos in Trunk and in Laz 1.8RC2.

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11453
  • FPC developer.
Re: UTF8PosEx function?
« Reply #14 on: June 30, 2017, 10:15:44 am »
Any -Ex routine that takes its index in UTF-8 characters is already dead slow by definition, since it requires a scan to convert the incoming position in UTF-8 characters to a position in bytes.

It should only be used for GUI purposes (where the number of characters matters), and those are not that speed-dependent.
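
To make the cost concrete, here is a sketch of the scan that any codepoint-indexed routine has to perform before it can start its real work. CodepointToByteIndex is a hypothetical helper written for this illustration (it is not a LazUTF8 function) and assumes valid UTF-8 input.

```pascal
program CodepointToByteSketch;
{$mode objfpc}{$H+}

// Hypothetical helper for illustration: returns the 1-based byte index at
// which the CharIndex-th codepoint of S starts, or 0 if S has fewer
// codepoints than that.
function CodepointToByteIndex(const S: string; CharIndex: SizeInt): SizeInt;
var
  i, Seen: SizeInt;
begin
  Result := 0;
  Seen := 0;
  // No shortcut is possible: codepoints occupy one to four bytes, so each
  // byte has to be inspected in turn.
  for i := 1 to Length(S) do
    if (Ord(S[i]) and $C0) <> $80 then // ASCII or lead byte
    begin
      Inc(Seen);
      if Seen = CharIndex then
        Exit(i);
    end;
end;

begin
  // 'ëïä' as raw UTF-8 bytes: the 3rd codepoint starts at byte 5
  WriteLn(CodepointToByteIndex(#$C3#$AB#$C3#$AF#$C3#$A4, 3));
end.
```

The work grows with the position being asked for, which is why calling such routines repeatedly in a loop over a long string gets expensive.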

 
