UTF8PosEx function?

typo

Hero Member
Posts: 3051

UTF8PosEx function?

« on: May 14, 2013, 01:31:30 am »

Is there any UTF8PosEx function?

Logged

LazarusBrasil.Org

Leledumbo

Hero Member
Posts: 8757
Programming + Glam Metal + Tae Kwon Do = Me

Re: UTF8PosEx function?

« Reply #1 on: May 14, 2013, 04:28:03 am »

It seems not, judging from a quick grep of the LazUtils.LazUTF8 unit.

Logged

Follow this if you want me to answer: http://wiki.lazarus.freepascal.org/Lazarus_Faq#What_is_the_correct_way_to_ask_questions_in_the_forum.3F

http://pascalgeek.blogspot.com
https://bitbucket.org/leledumbo
https://github.com/leledumbo
Code first, think later - Natural programmer B)

marcov

Administrator
Hero Member
Posts: 11453
FPC developer.

Re: UTF8PosEx function?

« Reply #2 on: May 14, 2013, 09:09:09 am »

Quote from: typo on May 14, 2013, 01:31:30 am

Is there any UTF8PosEx function?

Different how from normal pos?

Logged

theo

Global Moderator
Hero Member
Posts: 1927

Re: UTF8PosEx function?

« Reply #3 on: May 14, 2013, 10:56:46 am »

Quote from: marcov on May 14, 2013, 09:09:09 am

Different how from normal pos?

Quote

PosEx

Search for the occurrence of a character (or substr) in a string, starting at a certain position.

http://www.freepascal.org/docs-html/rtl/strutils/posex.html

Logged

Bart

Hero Member
Posts: 5290

Re: UTF8PosEx function?

« Reply #4 on: May 14, 2013, 03:13:44 pm »

Quote from: marcov on May 14, 2013, 09:09:09 am

Different how from normal pos?

Pos('ä','ëä') -> 3 (guessing here, if I recall correctly ë is a 2 byte UTF8 codepoint)
Utf8Pos('ä','ëä') -> 2

Mutatis mutandis for PosExUtf8

Bart

Logged

marcov

Administrator
Hero Member
Posts: 11453
FPC developer.

Re: UTF8PosEx function?

« Reply #5 on: May 14, 2013, 03:24:37 pm »

Quote from: Bart on May 14, 2013, 03:13:44 pm

Pos('ä','ëä') -> 3 (guessing here, if I recall correctly ë is a 2 byte UTF8 codepoint)
Utf8Pos('ä','ëä') -> 2

Afaik currently no utf8 routines use indexes based on codepoints, and usually that isn't necessary anyway (as long as you pass valid codepoint sequences).
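To illustrate the "usually not necessary" part: a byte index returned by Pos can be fed straight back into other byte-based routines such as Copy, and the result is still valid UTF-8 as long as both inputs are. A minimal sketch using only the RTL; TailFromMatch is a hypothetical helper made up for this example, and the UTF-8 bytes are written out as character constants so the result does not depend on the source file encoding.

```pascal
program ByteIndexRoundTrip;
{$mode objfpc}{$H+}

const
  // UTF-8 byte sequences: a-umlaut = C3 A4, e-umlaut = C3 AB.
  AUml = #$C3#$A4;
  EUml = #$C3#$AB;

// Hypothetical helper: everything from the first match of Needle to the
// end of Haystack, or '' if there is no match.
function TailFromMatch(const Needle, Haystack: string): string;
var
  P: SizeInt;
begin
  P := Pos(Needle, Haystack);   // byte index of the match
  if P = 0 then
    Exit('');
  // Copy is byte-based as well, so no index conversion is needed.
  Result := Copy(Haystack, P, Length(Haystack) - P + 1);
end;

begin
  // Compares the tail with the expected bytes for a-umlaut followed by 'z'.
  WriteLn(TailFromMatch(AUml, EUml + AUml + 'z') = AUml + 'z');
end.
```

The codepoint index only becomes relevant when the number has to mean "the n-th character" to a user, for example a caret position.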

Logged

Bart

Hero Member
Posts: 5290

Re: UTF8PosEx function?

« Reply #6 on: May 14, 2013, 11:32:22 pm »

Quote from: marcov on May 14, 2013, 03:24:37 pm

Afaik currently no utf8 routines use indexes based on codepoints, and usually that isn't necessary anyway (as long as you pass valid codepoint sequences).

Looks like you're wrong?

Code:

  writeln('Utf8Pos = ',Utf8Pos('ä','ëïä'));
  writeln('Pos = ',Pos('ä','ëïä'));

Output:

Utf8Pos = 3
Pos = 5

Quote from: LazUtf8

function UTF8Pos(const SearchForText, SearchInText: string;
StartPos: SizeInt = 1): PtrInt;
// returns the character index, where the SearchForText starts in SearchInText
// an optional StartPos can be given (in UTF-8 codepoints, not in byte)
// returns 0 if not found

And this also answers the original question.
Utf8Pos is "Utf8PosEx", just use the StartPos parameter.

Bart
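A small sketch of how the StartPos parameter quoted above could be used to walk through all matches, PosEx-style. It assumes the LazUTF8 unit from LazUtils is on the unit path and relies only on the UTF8Pos signature quoted from that unit; the program name and constants are made up for the example.

```pascal
program FindAllUtf8;
{$mode objfpc}{$H+}

uses
  LazUTF8;  // assumption: LazUtils (shipped with Lazarus) is available

const
  // UTF-8 bytes written out explicitly so the example does not depend on
  // the encoding of the source file: a-umlaut = C3 A4, e-umlaut = C3 AB.
  AUml = #$C3#$A4;
  EUml = #$C3#$AB;

var
  Haystack: string;
  Idx: PtrInt;
begin
  Haystack := AUml + EUml + AUml + EUml + AUml;
  Idx := UTF8Pos(AUml, Haystack);            // StartPos defaults to 1
  while Idx > 0 do
  begin
    WriteLn('match at codepoint index ', Idx);
    // StartPos is given in codepoints, so resume one codepoint further on.
    Idx := UTF8Pos(AUml, Haystack, Idx + 1);
  end;
end.
```

Both the result and StartPos are counted in codepoints here, so the loop never has to deal with byte offsets.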

Logged

CM630

Hero Member
Posts: 1091
I'm not sure I understand you.

Re: UTF8PosEx function?

« Reply #7 on: June 29, 2017, 10:41:12 am »

I could not find a function that searches backwards in UTF8 strings.
Is there UTF8RPos (or whatever alternative of RPosEx) that comes with Lazarus or shall I use an external solution?

Logged

Lazarus 3,2 32 bit (sometimes 64 bit); FPC3,2,2; rev: Lazarus_3_0 on Win10 64bit.

CM630

Hero Member
Posts: 1091
I'm not sure I understand you.

Re: UTF8PosEx function?

« Reply #9 on: June 30, 2017, 09:12:25 am »

Thanks,
I will make some modifications to add an offset.
But won't Utf8ReverseString decrease performance significantly?

« Last Edit: June 30, 2017, 02:36:34 pm by CM630 »

Logged

Lazarus 3,2 32 bit (sometimes 64 bit); FPC3,2,2; rev: Lazarus_3_0 on Win10 64bit.

Thaddy

Hero Member
Posts: 14373
Sensorship about opinions does not belong here.

Re: UTF8PosEx function?

« Reply #10 on: June 30, 2017, 09:31:30 am »

Yes, it does affect performance (big time!). In most cases it is cheaper to do a full iteration up to the last occurrence. (But not on very long strings.)
The basic algorithm is exactly the same left-to-right as right-to-left; only the direction changes. But that is not implemented.
OTOH: This works.

Logged

Object Pascal programmers should get rid of their "component fetish" especially with the non-visuals.

howardpc

Hero Member
Posts: 4144

Re: UTF8PosEx function?

« Reply #11 on: June 30, 2017, 09:42:34 am »

Quote from: CM630 on June 30, 2017, 09:12:25 am

won't Utf8ReverseString decrease performance significantly?

Decrease performance compared to what? Compared to a handcrafted assembler routine? Probably.
Compared to not having a UTF8RPos? That would be a significant increase in performance.

If you find it too slow, then by all means apply optimisations... but here on simple strings it returns results instantaneously. Or, as Thaddy suggests, use a different algorithm that avoids UTF8ReverseString if you find a significant lag in using it. However, a different algorithm would probably duplicate some of the code underlying UTF8ReverseString anyway.
Working with multibyte encodings is necessarily an order of magnitude or more slower than working with single-byte ANSI strings, however you do it.
Increasing string encoding complexity means more operations, more data pushed through your cpu..., i.e. a longer time taken (whether microseconds or milliseconds) for the routine to complete.

Logged

Thaddy

Hero Member
Posts: 14373
Sensorship about opinions does not belong here.

Re: UTF8PosEx function?

« Reply #12 on: June 30, 2017, 09:51:49 am »

ReverseString is necessarily slow. I did not complain about the speed but about the choice of algorithm. It works. It is not a smart solution.
I described that the algorithm for forward scanning is the same as for backward scanning. Get it? It is just not implemented for silly mode (UTF8 hacks based on AnsiString).
But hey if it works you have my blessing....

And it works...

An incremental approach (also not ideal) is a trade-off between string length and number of occurrences. In most cases, say for strings up to High(Word) in length, it is also much faster than reversing the string, unless occurrences of the search string make up more than roughly 25% of the total length. What should be done is to implement RPosEx like PosEx, starting from Length and stepping downwards. That is the only really efficient implementation: instead of reversing the string, you just reverse the direction. All -extremely- basic stuff.
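For what it's worth, the "reverse the direction, not the string" idea might look roughly like this. It is only a sketch with made-up helper names (ByteRPos, ByteToCodepointIndex), not the Utf8RPos that ships with Lazarus. It scans bytes from the end, which is safe because a valid UTF-8 needle can only match a valid UTF-8 haystack on a codepoint boundary, and it converts the byte index to a codepoint index only once at the end.

```pascal
program Utf8RPosSketch;
{$mode objfpc}{$H+}

const
  // UTF-8 byte sequences: a-umlaut = C3 A4, e-umlaut = C3 AB.
  AUml = #$C3#$A4;
  EUml = #$C3#$AB;

// Hypothetical helper: 1-based byte index of the last occurrence of Needle
// in Haystack, or 0 if there is none. Scans from the end towards the start.
function ByteRPos(const Needle, Haystack: string): SizeInt;
var
  i, NLen: SizeInt;
begin
  Result := 0;
  NLen := Length(Needle);
  if (NLen = 0) or (NLen > Length(Haystack)) then
    Exit;
  for i := Length(Haystack) - NLen + 1 downto 1 do
    if CompareByte(Haystack[i], Needle[1], NLen) = 0 then
      Exit(i);
end;

// Hypothetical helper: convert a 1-based byte index into a 1-based
// codepoint index by counting every byte that is not a continuation byte
// (10xxxxxx). Returns 0 for an out-of-range byte index.
function ByteToCodepointIndex(const S: string; ByteIdx: SizeInt): SizeInt;
var
  i: SizeInt;
begin
  Result := 0;
  if (ByteIdx < 1) or (ByteIdx > Length(S)) then
    Exit;
  for i := 1 to ByteIdx do
    if (Ord(S[i]) and $C0) <> $80 then
      Inc(Result);
end;

var
  S: string;
  B: SizeInt;
begin
  S := AUml + EUml + AUml + EUml;   // 4 codepoints, 8 bytes
  B := ByteRPos(AUml, S);
  WriteLn('byte index: ', B);
  WriteLn('codepoint index: ', ByteToCodepointIndex(S, B));
end.
```

The backward scan stops at the first hit, so the only full-length work left is the single pass that counts codepoints up to the match.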

« Last Edit: June 30, 2017, 10:00:40 am by Thaddy »

Logged

Object Pascal programmers should get rid of their "component fetish" especially with the non-visuals.

wp

Hero Member
Posts: 11916

Re: UTF8PosEx function?

« Reply #13 on: June 30, 2017, 10:14:41 am »

Quote from: CM630 on June 29, 2017, 10:41:12 am

I could not find a function that searches backwards in UTF8 strings.
Is there UTF8RPos (or whatever alternative of RPosEx) that comes with Lazarus or shall I use an external solution?

There IS a Utf8RPos in Trunk and in Laz 1.8RC2.

Logged

marcov

Administrator
Hero Member
Posts: 11453
FPC developer.

Re: UTF8PosEx function?

« Reply #14 on: June 30, 2017, 10:15:44 am »

Any -Ex routine that takes its index in UTF-8 characters is already dead slow by definition, since it requires a scan to convert the incoming position in UTF-8 characters to a position in bytes.

It should only be used for GUI purposes (where the number of characters matters), and those are not that speed dependent.
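The scan in question can be sketched as follows. CodepointToByteIndex is a hypothetical helper written for this post, not a LazUTF8 routine; it just shows that finding where the n-th codepoint starts means walking the bytes from the beginning.

```pascal
program CodepointToByteSketch;
{$mode objfpc}{$H+}

const
  // UTF-8 byte sequences: a-umlaut = C3 A4, e-umlaut = C3 AB.
  AUml = #$C3#$A4;
  EUml = #$C3#$AB;

// Hypothetical helper: 1-based byte index at which the CpIdx-th codepoint
// of S starts, or 0 if S has fewer codepoints. Cost is linear in the
// number of bytes before that codepoint.
function CodepointToByteIndex(const S: string; CpIdx: SizeInt): SizeInt;
var
  i, Seen: SizeInt;
begin
  Result := 0;
  if CpIdx < 1 then
    Exit;
  Seen := 0;
  for i := 1 to Length(S) do
    if (Ord(S[i]) and $C0) <> $80 then   // ASCII or lead byte
    begin
      Inc(Seen);
      if Seen = CpIdx then
        Exit(i);
    end;
end;

var
  S: string;
begin
  S := EUml + 'x' + AUml;   // bytes: C3 AB 78 C3 A4
  WriteLn('codepoint 3 starts at byte ', CodepointToByteIndex(S, 3));
end.
```

Calling something like this on every step of a search loop is what makes codepoint-indexed searching quadratic on long strings; keeping byte indexes internally and converting once for display avoids that.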

Logged
