Recent

Author Topic: I'm seeking (beta-)testers for my fast regular expression engine FLRE  (Read 28297 times)

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11383
  • FPC developer.
Re: I'm seeking (beta-)testers for my fast regular expression engine FLRE
« Reply #15 on: September 05, 2015, 02:07:53 pm »
This is the only benchmark that you need to excel in:

http://benchmarksgame.alioth.debian.org/u32/performance.php?test=regexdna

:-)
« Last Edit: September 05, 2015, 02:34:31 pm by marcov »

Leledumbo

  • Hero Member
  • *****
  • Posts: 8747
  • Programming + Glam Metal + Tae Kwon Do = Me
Re: I'm seeking (beta-)testers for my fast regular expression engine FLRE
« Reply #16 on: September 05, 2015, 04:45:38 pm »
Has just got a time to do the benchmark, here's on:
Intel Core i5 4200u
Manjaro Linux 64-bit
kernel 4.1.6
FPC trunk svn revision 31532
GCC 5.2.0
Need to adjust your benchmark code a little due to IsDebuggerPresent being called without {$ifdef windows} guard (while the definition has that guard).

EDIT:
I update capture a HTML view image with the fastest (green) and slowest (red)
« Last Edit: September 05, 2015, 05:36:35 pm by Leledumbo »

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11383
  • FPC developer.
Re: I'm seeking (beta-)testers for my fast regular expression engine FLRE
« Reply #17 on: September 05, 2015, 05:29:59 pm »
Windows 8.1 64-bit  i7-3770:

DXE4:
Code: [Select]
D:\src\flre-master\examples\benchmark>benchmark
                                                        Time     | Match count
==============================================================================
FLRE:
                                        /Twain/ :        8.14 ms |        2388
                                    /(?i)Twain/ :        8.27 ms |        2657
                                   /[a-z]shing/ :        8.62 ms |        1877
                   /Huck[a-zA-Z]+|Saw[a-zA-Z]+/ :       11.46 ms |         396
                                    /\b\w+nn\b/ :       33.98 ms |         359
                             /[a-q][^u-z]{13}x/ :      105.11 ms |        4929
                  /Tom|Sawyer|Huckleberry|Finn/ :       19.84 ms |        3015
              /(?i)Tom|Sawyer|Huckleberry|Finn/ :       28.76 ms |        4820
          /.{0,2}(Tom|Sawyer|Huckleberry|Finn)/ :       35.81 ms |        3015
          /.{2,4}(Tom|Sawyer|Huckleberry|Finn)/ :       35.02 ms |        2220
            /Tom.{10,25}river|river.{10,25}Tom/ :       34.59 ms |           2
                                 /[a-zA-Z]+ing/ :       48.74 ms |       95863
                        /\s[a-zA-Z]{0,12}ing\s/ :       43.02 ms |       67810
                /([A-Za-z]awyer|[A-Za-z]inn)\s/ :       16.12 ms |         313
                    /["'][^"']{0,30}[?!\.]["']/ :       39.14 ms |        9857

Done!
FPC trunk, about a week old: (-O4, win32SEH on, which is not default)
Code: [Select]

                                                        Time     | Match count
==============================================================================
FLRE:
                                        /Twain/ :       13.40 ms |        2388
                                    /(?i)Twain/ :       13.84 ms |        2657
                                   /[a-z]shing/ :       13.46 ms |        1877
                   /Huck[a-zA-Z]+|Saw[a-zA-Z]+/ :       18.77 ms |         396
                                    /\b\w+nn\b/ :       33.89 ms |         359
                             /[a-q][^u-z]{13}x/ :      211.99 ms |        4929
                  /Tom|Sawyer|Huckleberry|Finn/ :       30.65 ms |        3015
              /(?i)Tom|Sawyer|Huckleberry|Finn/ :       42.53 ms |        4820
          /.{0,2}(Tom|Sawyer|Huckleberry|Finn)/ :       35.36 ms |        3015
          /.{2,4}(Tom|Sawyer|Huckleberry|Finn)/ :       34.95 ms |        2220
            /Tom.{10,25}river|river.{10,25}Tom/ :       34.87 ms |           2
                                 /[a-zA-Z]+ing/ :       53.39 ms |       95863
                        /\s[a-zA-Z]{0,12}ing\s/ :       47.04 ms |       67810
                /([A-Za-z]awyer|[A-Za-z]inn)\s/ :       22.68 ms |         313
                    /["'][^"']{0,30}[?!\.]["']/ :       43.65 ms |        9857

Done!
« Last Edit: September 05, 2015, 05:32:28 pm by marcov »

BeRo

  • New Member
  • *
  • Posts: 45
    • My site
Re: I'm seeking (beta-)testers for my fast regular expression engine FLRE
« Reply #18 on: September 05, 2015, 05:48:45 pm »
Is it possible to use a function as second argument ?

FLRE has TFLRE.PtrReplaceCallback  TFLRE.ReplaceCallback and  TFLRE.UTF8ReplaceCallback now, and  TFLRE.PtrReplaceAll  TFLRE.ReplaceAll and  TFLRE.UTF8ReplaceAll are now renamed to  TFLRE.PtrReplace  TFLRE.Replace and  TFLRE.UTF8Replace. And there are also new another API functions now, for example: ExtractAll Test Find and Split.
 

BeniBela

  • Hero Member
  • *****
  • Posts: 905
    • homepage
Re: I'm seeking (beta-)testers for my fast regular expression engine FLRE
« Reply #19 on: September 05, 2015, 06:02:40 pm »
Can add memory usage to the benchmark, too?


Especially for something  like ^xyz.*
It would just need to store xyz, and a start-with-flag.


So Unicode capable means x.y would match an UTF-8 string xäy or x♥y ?.




BeRo

  • New Member
  • *
  • Posts: 45
    • My site
Re: I'm seeking (beta-)testers for my fast regular expression engine FLRE
« Reply #20 on: September 05, 2015, 07:02:20 pm »
Can add memory usage to the benchmark, too?

This is a bit problematic, because the memory usage of FLRE (like Google's RE2 also) is more or less highly dependent on the input, since the DFA is built on-demand where the NFA states will be cached (including a DFA cache reset mechanism for to avoid running out of memory), see https://swtch.com/~rsc/regexp/regexp3.html which elucidated the NFA/DFA-hybrid design of Google's RE2 (and therefore also indirectly the design of FLRE) including the memory usage topic sketchy.

In other words:

Such memory usage infos would be correct only for the each specific individual inputs then. Moreover, it is then dynamically also, due to the DFA cache resets.

Therefore a FLRE instance should be also cached and then used several times, which is no problem also, because it is thread safe. I think, I'll implement a TFLRECache class for it, so that one no longer needs to manually to implement this yourself.

Especially for something  like ^xyz.*
It would just need to store xyz, and a start-with-flag.

Such optimizations (in addition besides dozens of other even further optimizations) are already implemented. It's not my first regular expression engine ;) It's is my fourth regexp engine or so. My first regexp engine was the regexp engine inside my ECMAScript/JavaScript engine BESEN, my second was BRRE and my third was the so far still unreleased BeRoMicroRegExp.

So Unicode capable means x.y would match an UTF-8 string xäy or x♥y ?.

If a regexp is compiled with the rfUTF8 flag, then the dot matches UTF8-codepoint-wise, otherwise, if without rfUTF8 flag, then purely bytewise/codeunit-wise.

If a regexp is compiled with rfUTF8, then UTF8 decoding automations will be compiled directly into the regexp itself, so that the underlying algorithms are still almost pure bytewise working algorithms, so that speed optimizations are easier to implement and where the code is overall less error-prone regarding bugs. Google's RE2 do here the same trick, for to keep the underlying matching algorithms simple and maintainable as possible.





Jurassic Pork

  • Hero Member
  • *****
  • Posts: 1228
Re: I'm seeking (beta-)testers for my fast regular expression engine FLRE
« Reply #21 on: September 06, 2015, 12:21:08 am »
hello,
i try to use the new Functions Split and ExtractAll but it doesn't work  :( . BeRo can you show me how to use these functions or examples ? 
Jurassic computer : Sinclair ZX81 - Zilog Z80A à 3,25 MHz - RAM 1 Ko - ROM 8 Ko

BeRo

  • New Member
  • *
  • Posts: 45
    • My site
Re: I'm seeking (beta-)testers for my fast regular expression engine FLRE
« Reply #22 on: September 06, 2015, 05:10:37 am »
hello,
i try to use the new Functions Split and ExtractAll but it doesn't work  :( . BeRo can you show me how to use these functions or examples ?

for Split:

Code: [Select]
var SplittedStrings:TFLREStrings;

  FLREInstance:=TFLRE.Create('\s+',[]);
  try
   if FLREInstance.Split('Hello world    this is a test',SplittedStrings) then begin
    for i:=0 to length(SplittedStrings)-1 do begin
     writeln(SplittedStrings[i]);
    end;
   end;
  finally
   SetLength(SplittedStrings,0);
   FLREInstance.Free;
  end;

for ExtractAll (which had a bug, which I've fixed now):

Code: [Select]
var  MultiExtractions:TFLREMultiStrings;

  FLREInstance:=TFLRE.Create('(\w+)\s(\w+)',[]);
  try
   if FLREInstance.ExtractAll('Word pair - Pair Words',MultiExtractions) then begin
    for i:=0 to length(MultiExtractions)-1 do begin
     for j:=0 to length(MultiExtractions[i])-1 do begin
      write('"',MultiExtractions[i,j],'" ');
     end;
     writeln;
    end;
   end;
   SetLength(MultiExtractions,0);
  finally
   FLREInstance.Free;
  end;

BeRo

  • New Member
  • *
  • Posts: 45
    • My site
Re: I'm seeking (beta-)testers for my fast regular expression engine FLRE
« Reply #23 on: September 06, 2015, 09:22:38 am »
So, a thread-safe TFLRECache is also implemented now, for to simplify the multiple usage of single thread-safe TFLRE instances at different code locations and/or threads.

Jurassic Pork

  • Hero Member
  • *****
  • Posts: 1228
Re: I'm seeking (beta-)testers for my fast regular expression engine FLRE
« Reply #24 on: September 06, 2015, 09:39:19 am »
hello BeRO,
thanks for your quick response. This code doesn't work with your last release :
Code: [Select]
const stringInFile = '79324817350003235658981449032';
var
  regexFLRE : TFLRE;
  i: integer;
  pattern: TFLRERawByteString;
  SplittedStrings:TFLREStrings; 
begin       
pattern := '^.*([0-9]{10,10})([0-9]{14,14})([0-9]{5,5}).*';
regexFLRE := TFLRE.Create(pattern, []);
regexFLRE.MaximalDFAStates:=65536;
  if regexFLRE.Split(stringInFile,SplittedStrings) then begin
     for i:=0 to length(SplittedStrings)-1 do begin
      writeln(SplittedStrings[i]);
     end;
  end;
end;

What's wrong ? with this string and this pattern it works with Tregexpr
Jurassic computer : Sinclair ZX81 - Zilog Z80A à 3,25 MHz - RAM 1 Ko - ROM 8 Ko

BeRo

  • New Member
  • *
  • Posts: 45
    • My site
Re: I'm seeking (beta-)testers for my fast regular expression engine FLRE
« Reply #25 on: September 06, 2015, 09:55:41 am »
hello BeRO,
thanks for your quick response. This code doesn't work with your last release :
What's wrong ? with this string and this pattern it works with Tregexpr

Your regex pattern is whole-string-anchored, where you do need the rfMULTILINE flag, for to make it no more whole-string-anchored. But I should maybe change that in the API, since \A and \Z is also supported, hmmmmm. I think I will make some inquiries, how the majority of other engines handle it.   

Try:

Code: [Select]
pattern := '^.*([0-9]{10,10})([0-9]{14,14})([0-9]{5,5}).*';
regexFLRE := TFLRE.Create(pattern, [rfMULTILINE]);

And instead Split, you do want ExtractAll, because you do want "extracting", not "splitting", right? FLRE's Split is modelled after PHP's preg_split. So try:

Code: [Select]
const stringInFile = '79324817350003235658981449032';
var
  regexFLRE : TFLRE;
  i, j: integer;
  pattern: TFLRERawByteString;
  MultiExtractions:TFLREMultiStrings;
begin
pattern := '^.*([0-9]{10,10})([0-9]{14,14})([0-9]{5,5}).*';
regexFLRE := TFLRE.Create(pattern, [rfMULTILINE]);
regexFLRE.MaximalDFAStates:=65536;
  if regexFLRE.ExtractAll(stringInFile,MultiExtractions) then begin
    for i:=0 to length(MultiExtractions)-1 do begin
     for j:=0 to length(MultiExtractions[i])-1 do begin
      write('"',MultiExtractions[i,j],'" ');
     end;
     writeln;
    end;
  end;
end;

which has followed output:

Code: [Select]
"79324817350003235658981449032" "7932481735" "00032356589814" "49032"

FLRE has different API design concepts than TRegExpr ;) 

« Last Edit: September 06, 2015, 10:12:04 am by BeRo »

BeRo

  • New Member
  • *
  • Posts: 45
    • My site
Re: I'm seeking (beta-)testers for my fast regular expression engine FLRE
« Reply #26 on: September 06, 2015, 10:22:46 am »
Hm ok, it seems that the almost another regex engines (all except some few engines) does handle it like FLRE, so that I'll change nothing now. In other words, if you do want that ^ $ are line anchors, and not string anchors, you must use the rfMULTILINE flag.

Source: http://www.regular-expressions.info/refanchors.html


Jurassic Pork

  • Hero Member
  • *****
  • Posts: 1228
Re: I'm seeking (beta-)testers for my fast regular expression engine FLRE
« Reply #27 on: September 06, 2015, 10:55:53 am »
thanks BeRo, it's OK now , even without rfMultiline flag. I didn't understand what Split function was for  :-[  . 

another question : with the ExtractAll Function the first extracted string is the input string ?
Jurassic computer : Sinclair ZX81 - Zilog Z80A à 3,25 MHz - RAM 1 Ko - ROM 8 Ko

BeRo

  • New Member
  • *
  • Posts: 45
    • My site
Re: I'm seeking (beta-)testers for my fast regular expression engine FLRE
« Reply #28 on: September 06, 2015, 11:12:35 am »
the Split function is to split for example given this string "abc,def;123" with the  regex [,;] where result string array is then ['abc','def','123']

And ExtractAll is like MatchAll, just with the difference, that ExtractAll returns the extracted strings of the found match capture groups, instead Start/Length integer pairs, i.e.

Code: [Select]
[[whole-match-string of the first whole match, first-capture-group-string, second-capture-group-string, ...],
 [whole-match-string of the second whole match, first-capture-group-string, second-capture-group-string, ...],
 [whole-match-string of the third whole match, first-capture-group-string, second-capture-group-string, ...],
 ...
]
« Last Edit: September 06, 2015, 11:14:17 am by BeRo »

Jurassic Pork

  • Hero Member
  • *****
  • Posts: 1228
Re: I'm seeking (beta-)testers for my fast regular expression engine FLRE
« Reply #29 on: September 06, 2015, 11:24:10 am »
Ok , it is clear now  :)
Jurassic computer : Sinclair ZX81 - Zilog Z80A à 3,25 MHz - RAM 1 Ko - ROM 8 Ko

 

TinyPortal © 2005-2018