To start with, Lazarus maps String to AnsiString through the omnipresent default directive {$H+}, at least on Linux-64.
Perhaps. Just note that Indy also supports configurations where 'string' maps to 'UnicodeString', via {$ModeSwitch UnicodeStrings} or {$Mode DelphiUnicode}. Various properties and function signatures in Indy change depending on whether AnsiString or UnicodeString is being used.
IIRC, doesn't Lazarus use UTF-8 encoded AnsiStrings, though? Would that also apply to strings from the UI? You need to take a string's encoding into account when calling Indy's ToBytes() and BytesToString() functions when dealing with AnsiStrings. You need to tell them that input/output strings are encoded in UTF-8 and not in the OS's default locale, via one of:
- the ASrcEncoding parameter of ToBytes(), and the ADestEncoding parameter of BytesToString().
- the ASrcEncoding parameter of TIdIOHandler.WriteLn(), and the ADestEncoding parameter of TIdIOHandler.ReadLn().
- the DefAnsiEncoding property of TIdIOHandler, which WriteLn() and ReadLn() default to if no encoding is passed in explicitly.
- the global GIdDefaultTextEncoding variable in the IdGlobal.pas unit.
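As a rough sketch, the per-call option and the global option look like this (parameter order as I recall it from Indy 10's IdIOHandler.pas, where the byte/wire encoding precedes the Ansi string encoding - double-check against your installed version):

[code]
uses
  IdGlobal, IdIOHandler;

procedure Demo(AIO: TIdIOHandler);
var
  S: String;
begin
  // Per-call: second parameter is the wire encoding, third is how the
  // Ansi input/output string itself is encoded (UTF-8 under Lazarus).
  AIO.WriteLn('αβ', IndyTextEncoding_UTF8, IndyTextEncoding_UTF8);
  S := AIO.ReadLn(IndyTextEncoding_UTF8, IndyTextEncoding_UTF8);

  // Process-wide: make nil encodings resolve to UTF-8 instead of ASCII
  // for the rest of the run.
  GIdDefaultTextEncoding := encUTF8;
end;
[/code]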
For some reason, GDB would not step into the ToBytes() function with F7 during the previous days' runs.
Again, did you recompile Indy with debug info enabled?
This time I set an explicit breakpoint, and it was the first time I saw this code execute. I suspect this is where the problem lies, since IndyTextEncoding_UTF8() seems to work fine with the syntax of example 1.
As I've already explained earlier, I don't expect any problems with example 1, because the IIdTextEncoding.GetBytes() method takes a UnicodeString as input, not a generic String. So, if you pass an AnsiString as input to GetBytes(), the compiler/RTL - not Indy - will convert the string to Unicode before GetBytes() is even entered. And in the context of Lazarus, with its UTF-8 encoded AnsiStrings, I would expect that conversion to take UTF-8 into account, thus no data loss. Same with the IIdTextEncoding.GetString() method, which returns a UnicodeString, which the compiler/RTL - not Indy - will convert when assigned to an AnsiString.
But example 1 is not what TIdIOHandler does internally, example 2 is, and that is the one that is having issues, likely because of the use of IndyTextEncoding_OSDefault as an intermediate conversion.
This time I set an explicit breakpoint, and it was the first time I saw this code execute.
OK, now we are finally getting somewhere useful...
{$IFDEF STRING_IS_ANSI}
LBytes := nil; // keep the compiler happy // (0)
{$ENDIF}
We've established that 'string' is 'AnsiString'...
LLength := IndyLength(AValue, ALength, AIndex); // (1), LLength becomes 4
And that 'string' is UTF-8 encoded ('αβ' is 4 bytes when encoded in UTF-8)...
In which case, you need to either:
- set the ASrcEncoding parameter of ToBytes(), and the ADestEncoding parameter of BytesToString(), to IndyTextEncoding_UTF8 instead of their default values of nil.
- set the IdGlobal.GIdDefaultTextEncoding variable to encUTF8 instead of its default value of encASCII.
In the context of TIdIOHandler, you can set its DefAnsiEncoding property to IndyTextEncoding_UTF8, and leave off any encodings when calling TIdIOHandler.WriteLn() and TIdIOHandler.ReadLn().
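For example (a sketch assuming a TIdTCPClient and Indy 10; note that DefAnsiEncoding only exists in builds where 'string' is AnsiString):

[code]
uses
  IdGlobal, IdTCPClient;

var
  Client: TIdTCPClient;
begin
  Client := TIdTCPClient.Create(nil);
  try
    Client.Connect('example.com', 8080); // hypothetical host/port
    // Tell the IOHandler once that Ansi strings passed in/out are
    // UTF-8; WriteLn()/ReadLn() then need no per-call encodings.
    Client.IOHandler.DefAnsiEncoding := IndyTextEncoding_UTF8;
    Client.IOHandler.WriteLn('αβ');
  finally
    Client.Free;
  end;
end;
[/code]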
EnsureEncoding(ADestEncoding); // (3) ADestEncoding = IUNKNOWN, GDB won't let me F7 this
EnsureEncoding() is also in IdGlobal.pas; I don't know why the debugger won't let you step into it.
If the VEncoding parameter is nil (which it is not in this example), it gets set to the encoding specified by the ADefEncoding parameter, which is encIndyDefault by default. In that case, VEncoding is set to IndyTextEncoding_Default, which returns an encoding determined by the IdGlobal.GIdDefaultTextEncoding variable. That variable is encASCII by default, so IndyTextEncoding_ASCII ends up being used by default.
But, in this example, ADestEncoding is being set to IndyTextEncoding_UTF8 by the caller, so EnsureEncoding() is a no-op.
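For reference, EnsureEncoding() boils down to roughly this (paraphrased from memory, not the verbatim IdGlobal.pas source):

[code]
procedure EnsureEncoding(var VEncoding: IIdTextEncoding;
  ADefEncoding: IdTextEncodingType = encIndyDefault);
begin
  // Only nil encodings are replaced; a caller-supplied encoding such
  // as IndyTextEncoding_UTF8 passes through untouched.
  if VEncoding = nil then begin
    VEncoding := IndyTextEncoding(ADefEncoding);
  end;
end;
[/code]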
EnsureEncoding(ASrcEncoding, encOSDefault); // (4) ASrcEncoding = IUNKNOWN
Following the above logic, ASrcEncoding is initially nil, so it gets set to IndyTextEncoding_OSDefault. Its implementation is TIdMBCSEncoding, which uses 'char' or 'ASCII' as the charset when calling into iconv, depending on the IdGlobal.GIdIconvUseLocaleDependantAnsiEncoding variable. That variable is False by default, so 'ASCII' is the default charset.
LBytes := RawToBytes(AValue[AIndex], LLength); // (5) LBytes becomes (206,177,206,178)
Those bytes are the correct UTF-8 encoded form of 'αβ'...
CheckByteEncoding(LBytes, ASrcEncoding, ADestEncoding); // (6) LBytes becomes ()
And this is where data loss occurs, because ASrcEncoding is set to IndyTextEncoding_OSDefault instead of IndyTextEncoding_UTF8, so the bytes will not be interpreted as UTF-8 correctly. There is a TODO comment inside of IndyTextEncoding_OSDefault() to have it use UTF-8 on POSIX systems (which includes Linux), but that has not been enabled yet.
Internally, CheckByteEncoding() looks like this:
[code]
procedure CheckByteEncoding(var VBytes: TIdBytes; ASrcEncoding, ADestEncoding: IIdTextEncoding);
begin
  if ASrcEncoding <> ADestEncoding then begin
    VBytes := ADestEncoding.GetBytes(ASrcEncoding.GetChars(VBytes));
  end;
end;
[/code]
We know what the output of ADestEncoding.GetBytes() is (no bytes), but what is the output of ASrcEncoding.GetChars(VBytes)? Is it also empty? IOW, is the loss of data happening because TIdMBCSEncoding.GetChars() returns no chars at all when given UTF-8 encoded bytes, or is the loss because TIdUTF8Encoding.GetBytes() can't process the chars that TIdMBCSEncoding.GetChars() returned? You should be able to put breakpoints in those method implementations.
Even though the input UTF-8 bytes are not being interpreted as UTF-8, I would expect TIdMBCSEncoding.GetChars() to still be able to return SOME chars. Incorrect chars perhaps, maybe even $FFFD chars, but not zero chars. That would imply a logic bug inside of the IdGlobal.DoIconvBytesToChars() function.
You might need to set the IdGlobal.GIdIconvUseLocaleDependantAnsiEncoding variable to True, or the IdGlobal.GIdIconvIgnoreIllegalChars variable to True, to work around that.
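A minimal sketch of that workaround, set once at startup (variable names as in IdGlobal.pas; the effect on your system is an assumption to verify):

[code]
uses
  IdGlobal;

begin
  // Make IndyTextEncoding_OSDefault ask iconv for the locale's actual
  // charset (e.g. 'UTF-8' under a UTF-8 locale) instead of 'ASCII'...
  GIdIconvUseLocaleDependantAnsiEncoding := True;
  // ...and/or tell iconv to skip bytes it cannot convert rather than
  // failing outright, which can otherwise yield zero output chars.
  GIdIconvIgnoreIllegalChars := True;
end;
[/code]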
I have F7'ed into CheckByteEncoding() as well.
[code]
procedure CheckByteEncoding(var VBytes: TIdBytes; ASrcEncoding, ADestEncoding: IIdTextEncoding);
begin
  if ASrcEncoding <> ADestEncoding then begin // according to GDB, both ASrcEncoding and ADestEncoding are valued IUNKNOWN
    VBytes := ADestEncoding.GetBytes(ASrcEncoding.GetChars(VBytes)); // (8) this statement is executed, should it?
  end;
end;
[/code]
Yes, line (8) should be executed: since ASrcEncoding and ADestEncoding are pointing at different objects, the '<>' comparison evaluates as True.
Both ASrcEncoding and ADestEncoding are of the IIdTextEncoding type, which I understand is an interface.
Yes.
I have no clue how interfaces compare.
For purposes of the '=' and '<>' comparison operators, they are simply raw pointer comparisons.
However, for Indy 11, I'm expanding on IIdTextEncoding comparisons to take codepages and charsets into account, so even if two IIdTextEncoding variables point at different objects in memory, byte conversions can be skipped if both objects logically represent the same character encoding.
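A quick standalone FPC illustration of why raw pointer comparison can report two logically identical objects as unequal (not Indy code, just the language behavior):

[code]
{$mode objfpc}{$H+}
program IntfCompare;

uses
  Classes;

var
  A, B: IInterface;
begin
  A := TInterfacedObject.Create;
  B := TInterfacedObject.Create;
  // Two distinct objects: '=' compares the underlying pointers, so
  // this prints FALSE even though both objects are the same kind.
  WriteLn(A = B);
  B := A;
  // Same underlying pointer now: prints TRUE.
  WriteLn(A = B);
end.
[/code]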
I would expect statement (8) not to be executed.
In the situation where you pass in a UTF-8 encoded AnsiString, and ask for it to be output as a UTF-8 byte array, then you would be correct ONLY WHEN the ASrcEncoding parameter of ToBytes() is set to IndyTextEncoding_UTF8, which it is not in this example. Had it been, both parameters would have pointed at the same cached UTF-8 encoding object, the '<>' comparison in CheckByteEncoding() would have evaluated as False instead of True, and the conversion would have been skipped.
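Concretely, the corrected call would look like this sketch (parameter order as in Indy 10's ToBytes() overload, where ADestEncoding precedes ASrcEncoding - verify against your version):

[code]
uses
  IdGlobal;

var
  S: String;      // UTF-8 encoded AnsiString under Lazarus
  LBytes: TIdBytes;
begin
  S := 'αβ';
  // Both encodings resolve to the same UTF-8 encoding instance, so
  // CheckByteEncoding() skips the lossy intermediate conversion and
  // the raw string bytes (206,177,206,178) pass through unchanged.
  LBytes := ToBytes(S, IndyTextEncoding_UTF8, IndyTextEncoding_UTF8);
end;
[/code]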
Of course, I cannot F7 into statement (8).
Which is odd, since those methods are also in IdGlobal.pas. You should be able to put breakpoints in the implementations, though.