Recent

Author Topic: TRegistry - cleanup and fixes  (Read 16097 times)

CCRDude

  • Hero Member
  • *****
  • Posts: 596
TRegistry - cleanup and fixes
« on: January 30, 2019, 09:30:51 pm »
I’ve spend a few days rewriting an old application of mine. Back then, it was written in Delphi 7, Ansi. These days, it needs to be Unicode. And that’s where the things get complicated 😉
A registry editor needs to support all existing registry entry types. I remember many years ago I supplied a patch that adds more types (issue 32114). Meanwhile, a few things have changed:

Issue 1: REG_QWORD
There are 8 bit integers as well. I’ve supplied a patch to add rdInt64 to the enum, REG_QWORD to the associated array, and ReadInt64/WriteInt64 functions. Any idea how I can speed this up? Working around this in every registry handling operation is ugly.

Issue 2: *A vs *W functions for ReadString/WriteString
Back when I first supplied a patch for more types, functions were using the *A api calls. Now they use the *W calls. ReadString and WriteString do convert the PWideChar using UTF8Encode before returning them as string. According to this, this is wrong, since the FCL should not use UTF8. How would I be able to get Unicode values from the registry at all otherwise?

Issue 3: *W function for ReadStringList/WriteStringList
Same issue as above, ReadStringList / WriteStringList do use the *W functions, but do not use any conversion from PWideChar to String, resulting in broken TStringList content with #0 inserted between each character (since it’s latin unicode characters). My patch was to use UTF8 here as well, but since UTF8 is not wanted, should I supply a patch that converts it to plain AnsiString, loosing content, or is there any other way to retrieve Unicode registry content through TRegistry?

Issue 4: Extending existing TRegDataType without the patch applied above
I do have written my own TRegistry descendant to handle ReadInt64/WriteInt64, for example. I need it anyway for a bunch of advanced functions (moving keys, duplicating keys, &c), I could implement all the string functions as Unicode functions with my own data reading/writing backend there - but adding a custom extended TRegDataType clone type would make things unnecessarily complex. Is there any soft way to extend it without FCL modification? One issue is that even if I could „extend“ the enum using code, the associated array is not in the interface, but an implementation, and would need to be overridden as well.

Any recommendations on how patches that both do comply with the FCL AnsiString idea and at the same time provide Unicode support should look like are welcome.
I'm willing to spend more time on this, just need directions how this time should be spent to have a chance to improve fcl-registry :)

Remy Lebeau

  • Hero Member
  • *****
  • Posts: 1312
    • Lebeau Software
Re: TRegistry - cleanup and fixes
« Reply #1 on: February 02, 2019, 12:33:26 am »
Issue 2: *A vs *W functions for ReadString/WriteString
Back when I first supplied a patch for more types, functions were using the *A api calls. Now they use the *W calls. ReadString and WriteString do convert the PWideChar using UTF8Encode before returning them as string. According to this, this is wrong, since the FCL should not use UTF8. How would I be able to get Unicode values from the registry at all otherwise?

I would make ReadString()/WriteString() return/take UnicodeString explicitly, to match the encoding of the Registry's W functions. Let the caller deal with any conversions between String<->UnicodeString as needed. When String is UnicodeString, this would be a no-op both ways. And when String is a codepaged AnsiString, converting from String to UnicodeString would use whatever codepage is stored in the String, and converting from UnicodeString to String would use the RTL's default codepage, which can be customized at runtime.

Issue 3: *W function for ReadStringList/WriteStringList
Same issue as above, ReadStringList / WriteStringList do use the *W functions, but do not use any conversion from PWideChar to String, resulting in broken TStringList content with #0 inserted between each character (since it’s latin unicode characters). My patch was to use UTF8 here as well, but since UTF8 is not wanted, should I supply a patch that converts it to plain AnsiString, loosing content, or is there any other way to retrieve Unicode registry content through TRegistry?

In this case, you are stuck dealing with a list of String values and not UnicodeString values, but really that is not a problem because of what I said above.
Remy Lebeau
Lebeau Software - Owner, Developer
Internet Direct (Indy) - Admin, Developer (Support forum)

ASerge

  • Hero Member
  • *****
  • Posts: 2222
Re: TRegistry - cleanup and fixes
« Reply #2 on: February 02, 2019, 10:41:01 am »
As far as I can see now in the trunk is an implementation with Utf8Encode conversions and a W version API call. Unfortunately, if you do not use LCL, this results in encoding errors for the returned names.

CCRDude

  • Hero Member
  • *****
  • Posts: 596
Re: TRegistry - cleanup and fixes
« Reply #3 on: February 08, 2019, 09:58:37 am »
@ASerge: mostly, yes, expect for REG_MULTISZ, where it's broken.

@Remy Lebeau: I would favor a WideString version as well - but that would break the Delphi 7 compatibility.
I'm indeed stuck with a list of string values - the issue is actually is a problem because it's not a list of string values, but due to missing string type conversion, a list of characters (for standard latin characters, every second byte is #0, which for REG_MULTISZ is a line separator, so I get one character per list item).

My main question is how to do updates to it that won't go wasted but comply with what follows conventions.

Remy Lebeau

  • Hero Member
  • *****
  • Posts: 1312
    • Lebeau Software
Re: TRegistry - cleanup and fixes
« Reply #4 on: February 09, 2019, 03:07:22 am »
@Remy Lebeau: I would favor a WideString version as well - but that would break the Delphi 7 compatibility.

Not really.  Delphi 7 has WideString, and conversions between AnsiString and WideString.

for standard latin characters, every second byte is #0

That is true only if you store characters U+0000..U+00FF in a Unicode UTF-16 string.  That is not true at all for ANSI/UTF-8 strings.

which for REG_MULTISZ is a line separator, so I get one character per list item

You misunderstand how REG_MULTI_SZ actually works.

List items are separated by a NULL CHARACTER, with an extra NULL CHARACTER at the end of the list.  The size of each character is 1 or 2 bytes, determined by whether you use an ANSI or Unicode API to write/read the REG_MULTI_SZ data.

ANSI uses 1-byte character elements, so the item separator is a 1-byte NULL CHARACTER #00.  As such, it is not possible for ANSI list items to have embedded NULL BYTES/CHARACTERS.

UTF-16 uses 2-byte character elements, so the item separator is a 2-byte NULL CHARACTER #0000.  As such, it is perfectly acceptable for Unicode list items to have embedded NULL BYTES, such as for Latin characters, they just can't contain embedded NULL CHARACTERS.

If you need to store string lists with embedded NULL CHARACTERS, you can always use REG_BINARY instead.

My main question is how to do updates to it that won't go wasted but comply with what follows conventions.

There is no problem if you stick with Unicode UTF-16 strings at the API layer.  You just have to worry about possible conversions between ANSI/UTF-8 and UTF-16 at the input/output layer.  Which I think you should let the RTL handle for you.
Remy Lebeau
Lebeau Software - Owner, Developer
Internet Direct (Indy) - Admin, Developer (Support forum)

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11383
  • FPC developer.
Re: TRegistry - cleanup and fixes
« Reply #5 on: February 09, 2019, 02:58:18 pm »
Any idea how I can speed this up?

Whine about it on the forum, apparently. r41267
« Last Edit: February 09, 2019, 02:59:51 pm by marcov »

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: TRegistry - cleanup and fixes
« Reply #6 on: February 09, 2019, 06:22:44 pm »
Issue 2: *A vs *W functions for ReadString/WriteString
Back when I first supplied a patch for more types, functions were using the *A api calls. Now they use the *W calls. ReadString and WriteString do convert the PWideChar using UTF8Encode before returning them as string. According to this, this is wrong, since the FCL should not use UTF8. How would I be able to get Unicode values from the registry at all otherwise?
Instead of:
Code: Pascal  [Select][+][-]
  1.        AList.Text := UTF8Encode(Data);
use:
Code: Pascal  [Select][+][-]
  1.        AList.Text := String(Data);

Issue 3: *W function for ReadStringList/WriteStringList
Same issue as above, ReadStringList / WriteStringList do use the *W functions, but do not use any conversion from PWideChar to String, resulting in broken TStringList content with #0 inserted between each character (since it’s latin unicode characters). My patch was to use UTF8 here as well, but since UTF8 is not wanted, should I supply a patch that converts it to plain AnsiString, loosing content, or is there any other way to retrieve Unicode registry content through TRegistry?
Instead of:
Code: Pascal  [Select][+][-]
  1.   Data := UTF8Decode(StringReplace(List.Text, LineEnding, #0, [rfReplaceAll]) + #0#0);
use
Code: Pascal  [Select][+][-]
  1.   Data := UnicodeString(StringReplace(List.Text, LineEnding, #0, [rfReplaceAll]) + #0#0);

ASerge

  • Hero Member
  • *****
  • Posts: 2222
Re: TRegistry - cleanup and fixes
« Reply #7 on: February 09, 2019, 09:19:20 pm »
use
Code: Pascal  [Select][+][-]
  1.   Data := UnicodeString(StringReplace(List.Text, LineEnding, #0, [rfReplaceAll]) + #0#0);
May be better
Code: Pascal  [Select][+][-]
  1.   Data := UnicodeString(StringReplace(List.Text, List.LineEnding, #0, [rfReplaceAll]) + #0#0);

CCRDude

  • Hero Member
  • *****
  • Posts: 596
Re: TRegistry - cleanup and fixes
« Reply #8 on: February 09, 2019, 09:50:56 pm »
Not really.  Delphi 7 has WideString, and conversions between AnsiString and WideString.

My apologies, in my memory it was a pre-String=WideString version and the RTL was still AnsiString back then. Should've checked first though.

That is true only if you store characters U+0000..U+00FF in a Unicode UTF-16 string.  That is not true at all for ANSI/UTF-8 strings.

I was speaking about TRegistry, which used the *W functions, which return UTF-16 ;)

You misunderstand how REG_MULTI_SZ actually works. ...
List items are separated by a NULL CHARACTER, with an extra NULL CHARACTER at the end of the list.  The size of each character is 1 or 2 bytes, determined by whether you use an ANSI or Unicode API to write/read the REG_MULTI_SZ data.

Fully understood here :) Since TRegistry was using *W, I was just ignoring the *A case.

If you need to store string lists with embedded NULL CHARACTERS, you can always use REG_BINARY instead.

I don't need to do that at all :)

But if you want to have some fun, find programs that use REG_MULTI_SZ and use the binary editor of regedit to remove the trailing zeros... you'll be astonished how many fail of correctly handling the data then (e.g.  the start page setting of many older IE versions was just interpreting the stuff in memory behind as "start pages").

Any idea how I can speed this up?

Whine about it on the forum, apparently. r41267

Great, so I'll just stop trying to find out how to submit useful patches, and start whining more?

Come on, I wasn't asking for someone to do the job for me (at least that was not my intention), but for instructions on how to better supply qualified patches.

Remy Lebeau

  • Hero Member
  • *****
  • Posts: 1312
    • Lebeau Software
Re: TRegistry - cleanup and fixes
« Reply #9 on: February 09, 2019, 10:29:24 pm »
My apologies, in my memory it was a pre-String=WideString version and the RTL was still AnsiString back then. Should've checked first though.

It is, actually.  But it does have WideString available.  WideString was first introduced in Delphi 3.

You misunderstand how REG_MULTI_SZ actually works. ...
List items are separated by a NULL CHARACTER, with an extra NULL CHARACTER at the end of the list.  The size of each character is 1 or 2 bytes, determined by whether you use an ANSI or Unicode API to write/read the REG_MULTI_SZ data.

Fully understood here :) Since TRegistry was using *W, I was just ignoring the *A case.

Apparently not, since you thought the presence of NULL BYTES in a UTF-16 string would cause items to be separated incorrectly.  IT DOES NOT.  NULL BYTES are fine in a UTF-16 string.  NULL CHARACTERS are not fine.  There is a difference!
Remy Lebeau
Lebeau Software - Owner, Developer
Internet Direct (Indy) - Admin, Developer (Support forum)

ASerge

  • Hero Member
  • *****
  • Posts: 2222
Re: TRegistry - cleanup and fixes
« Reply #10 on: February 10, 2019, 07:19:00 am »
I saw a new implementation of TRegistry in the trunk (revision 41273) and could not resist to fix it (new version attached).

===In all===
Fixed: Excluded annoying hint for unused parameters

===regini.inc===
Fixed: TRegIniFile.ReadSectionValues check value type before read it as string to exclude runtime errors

===registry.pp===
Fixed: TRegistry LoadKey, ReplaceKey, RestoreKey, SaveKey, UnloadKey, MoveKey tag as unimplemented
Fixed: TRegistry.WriteBinaryData, append const to Buffer param, now accept const buffers too
Fixed: TRegistry WriteDate and WriteTime set inline to code WriteDateTime, because it do the same
Fixed: TRegistry.CloseKey set as class and static, because it's not used Self
Fixed: TRegistryIniFile ReadBinaryStream, WriteBinaryStream, UploadFile tag as unimplemented
Fixed: TRegistry.Create micro optimization: exclude zeroing fields (already done in constructor and documented)
Fixed: TRegistry Create, GetData, PutData, ReadString move raise to internal procedure - speed optimization
Fixed: TRegistry GetDataType, GetDataSize call SysGetData directly, exclude unnecessary checks - speed optimization
Fixed: TRegistry ReadBinaryData, ReadInteger, ReadInt64, ReadStringList move raise to separate procedure - speed and size (used more than ones) optimization
Fixed: TRegistry ReadDate and ReadTime call ReadDateTime with Trunc or Frac - size optimization
Fixed: TRegistry.ReadString use local var to exclude multiple call of Length
Fixed: TRegistry.ReadString exclude / and round for integers value, use div
Fixed: TRegistry.ReadString use string(U) instead of Utf8Encode(U) to work with or without LCL
Fixed: TRegistry.ReadStringList unicode support
Fixed: TRegistry ReadStringList, WriteStringList use AList.LineBreak, not global LineEnding
Fixed: TRegistry.WriteExpandString use UnicodeString(Value) instead of Utf8Decode(U) to work with or without LCL
Fixed: TRegistry.WriteStringList unicode support
Fixed: TRegistry.WriteString use UnicodeString(Value) instead of Utf8Decode(U) to work with or without LCL
Fixed: TRegistry DeleteValue, SysGetData, SysPutData, ReadString, ReadStringList API support nil, so change typecast from UnicodeChar(U), to Pointer(U) - speed optimization
Fixed: internal PrepKey rename to Normalize and return UnicodeString, not PChar, and change anywhere Utf8Encode(PChar(PrepKey(...))) to Normalize(...) - speed optimization and work with or without LCL

Code style refactor according coding style from  http://wiki.freepascal.org/DesignGuidelines

If I post in bugtracker, the admins, as usual, will ignore, because a lot of changes  :'(

Bart

  • Hero Member
  • *****
  • Posts: 5275
    • Bart en Mariska's Webstek
Re: TRegistry - cleanup and fixes
« Reply #11 on: February 10, 2019, 01:40:27 pm »
If I post in bugtracker, the admins, as usual, will ignore, because a lot of changes  :'(

It will surely go unnoticed if you leave it here.
Further more: you did not attach a patch but complete sources, this is a sure way to have it rejected.
If you create a proper patch (either using svn or git) it makes it a lot easier to review.

Also, devels often ask to split big changes into seperate patches, so thing can be more easily studied and/or reverted when necessary.

And specific to this topic: hardly any fpc devel uses the registry, so response time may be slow.
Also there are no tests w.r.t. fcl-registry, so it is very hard to test regressions, which in turn makes devels reluctant to apply changes...
I know this because I recently wrote some fixes for the TRegIniFile wrapper around the registry.
(I don't use the registry myself, and never used TRegIniFile in my Delphi days, and I wonder why anybody would nowadays. Just learn to use TRegistry directly, it's not that difficult).

Making tests for all basic functionality of registry would be a first step (not one big test, but separate tests for each functionality and also tests for solved bugs).
Please also notice that, while the registry is a Windows thing, the code is supposed to work in other platforms as well (it stores data in some xml file), so you should not break that.
(IMO they shold not have done that, but then again, who am I to judge?)

And FWIW: I think that the more patches for the registry appear in the bugtracker, the higher the chance is that we will get feedback.

Bart

Bart

  • Hero Member
  • *****
  • Posts: 5275
    • Bart en Mariska's Webstek
Re: TRegistry - cleanup and fixes
« Reply #12 on: February 10, 2019, 02:29:11 pm »
Fixed: TRegistry.ReadString use string(U) instead of Utf8Encode(U) to work with or without LCL
Why? This already works without LCL. The string returned has codepage UTF8 and assigning it to another string will trigger the conversion from UTF8 to current codepage.

Here's a snippet from a pure fpc program (no LCL/LazUtils dependancy):
(DefaultSystemCodePage=1252)
Code: Pascal  [Select][+][-]
  1.     S := Reg.ReadString('StringUtf8');
  2.     if (S <> '') then
  3.     begin
  4.       writeln('ReadString(''StringUtf8'')');
  5.       writeln('StringCodePage=',StringCodePage(S));
  6.       for i := 1 to length(S) do write(IntToHex(Ord(S[i]),2),#32);
  7.       writeln;
  8.       writeln(S);
  9.     end;

This is the output:
Code: Pascal  [Select][+][-]
  1. ReadString('StringUtf8') //value should be 'äëï'. This is of course inside my codepage, if it were outside at any stage conversion would loose characters
  2. StringCodePage=65001
  3. C3 A4 C3 AB C3 AF  //(bytes are UTF8 encoded)
  4. äëï //output in console

Bart

ASerge

  • Hero Member
  • *****
  • Posts: 2222
Re: TRegistry - cleanup and fixes
« Reply #13 on: February 10, 2019, 04:46:36 pm »
Why? This already works without LCL. The string returned has codepage UTF8 and assigning it to another string will trigger the conversion from UTF8 to current codepage.
This show wrong names with trunc, but right with my patch.
Code: Pascal  [Select][+][-]
  1. program RegTest;
  2. {$APPTYPE CONSOLE}
  3. {$MODE OBJFPC}
  4. {$LONGSTRINGS ON}
  5.  
  6. uses SysUtils, Classes, Registry;
  7.  
  8. var
  9.   R: TRegistry;
  10.   L: TStringList;
  11.   Name, Value: string;
  12.   Info: TRegKeyInfo;
  13. begin
  14.   try
  15.     R := TRegistry.Create;
  16.     try
  17.       if R.OpenKeyReadOnly('...Some not ansi key') then
  18.       begin
  19.         if R.GetKeyInfo(Info) then
  20.           Writeln(DateTimeToStr(Info.FileTime)); // Forgot, this also patched, because original return in UTC not local time
  21.         L := TStringList.Create;
  22.         try
  23.           R.GetValueNames(L);
  24.           for Name in L do
  25.             if R.GetDataType(Name) in [rdString, rdExpandString] then
  26.             begin
  27.               Value := R.ReadString(Name);
  28.               Writeln(Name, '=', Value);
  29.             end;
  30.         finally
  31.           L.Free;
  32.         end;
  33.       end;
  34.     finally
  35.       R.Free;
  36.     end;
  37.   except
  38.     on E: Exception do
  39.       ShowException(E, nil);
  40.   end;
  41.   Readln;
  42. end.

Bart

  • Hero Member
  • *****
  • Posts: 5275
    • Bart en Mariska's Webstek
Re: TRegistry - cleanup and fixes
« Reply #14 on: February 10, 2019, 07:03:47 pm »
This may or may not applye. Please don't shoot me.

The "some not ansi key" string must be defined somewhere in your sourcecode.
If you write that code in Lazarus IDE then the sourcecode will be stored as UTF8, without BOM.
If you then compile this from commandline, the string literal will contains UTF encoded bytes, but it will be interpreted as local codepage encoded string, so it will have a totally other content than what you expected.
As a result you will not be able to open the key.

Bart

 

TinyPortal © 2005-2018