
Author Topic: Unicode Filename issue on Windows  (Read 3806 times)

snorkel

  • Hero Member
  • *****
  • Posts: 817
Unicode Filename issue on Windows
« on: January 12, 2017, 01:39:26 am »
Hi,
I have a function that monitors a directory for changes, and it's based on this example:
http://forum.codecall.net/topic/76318-monitor-a-folder-for-changes/

Most of the time this works perfectly, but once in a great while someone puts a file in the directory whose name contains an em dash character, which is Unicode. This causes a ton of issues with SetLength(FFolderItemInfo.Name, vFileInfo^.FileNameLength div 2);
The div 2 gets the correct length when the filename is ASCII, but not when it is Unicode.
If the filename has the em dash character in it, removing the div 2 works.

How can I get this to work for both ASCII and Unicode filenames?

Thanks in advance.
***Snorkel***
If I forget, I always use the latest stable 32bit version of Lazarus and FPC. At the time of this signature that is Laz 3.0RC2 and FPC 3.2.2
OS: Windows 10 64 bit

snorkel

Re: Unicode Filename issue on Windows
« Reply #1 on: January 12, 2017, 01:47:35 am »
OK, I might be answering my own question, but maybe this would work?

http://www.freepascal.org/docs-html/rtl/system/stringelementsize.html

Sanem

  • Full Member
  • ***
  • Posts: 173
Re: Unicode Filename issue on Windows
« Reply #2 on: January 21, 2019, 04:08:14 pm »
Hi, I have two questions for you; thanks in advance for any help.

1: What exactly did you write instead of that div 2 in procedure TFolderMonWorker.Execute?
As you can see in the attached screenshot, I have tried your way, but it is still incorrect for Arabic names.

2: When I add a file to the monitored folder, there are 2 notifications, with new and modify and modify values. Do you know why that is?
 
Regards

Thaddy

  • Hero Member
  • *****
  • Posts: 14214
  • Probably until I exterminate Putin.
Re: Unicode Filename issue on Windows
« Reply #3 on: January 21, 2019, 04:27:57 pm »
That's just bad - or very old - code. The div 2 assumes old school UTF16, nowadays known as UCS2.
UTF16 is not a two byte encoding anymore. It is basically a 2-4 byte encoding. Lazarus has the (opinionated!! even more problematic) UTF8 which is a 1 to 4 byte encoding.
That code is only suitable for UCS2 and is not supported.
« Last Edit: January 21, 2019, 04:30:59 pm by Thaddy »
Specialize a type, not a var.

Remy Lebeau

  • Hero Member
  • *****
  • Posts: 1312
    • Lebeau Software
Re: Unicode Filename issue on Windows
« Reply #4 on: January 21, 2019, 08:52:59 pm »
Quote from Thaddy: "The div 2 assumes old school UTF16, nowadays known as UCS2."

You mean the other way around - old school UCS2, nowadays replaced with UTF16.

Quote from Thaddy: "That code is only suitable for UCS2 and is not supported."

Not true.  The FileNameLength specifies the number of total bytes regardless of encoding.  Using 'div 2' works just fine with UCS2 and UTF16 alike, as it is counting encoded WideChar elements, not actual Unicode characters.  The EM Dash character (Unicode U+2014) takes up only 1 WideChar element in both UCS2 and UTF16.

And, FWIW, ReadDirectoryChangesW() provides Unicode filenames in UTF16 format.

So the OP's problem is something else.
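
To illustrate the point about elements versus characters, here is a minimal sketch (the program name and the Report helper are made up for this example, and U+1F600 is just an arbitrary character outside the BMP). It builds one string holding the EM Dash and one holding a surrogate pair, then shows that dividing the byte count by 2 gives the number of WideChar elements in both cases:

```pascal
program Utf16Elements;
{$mode objfpc}{$H+}

// Hypothetical helper: prints the byte count of S and the element
// count obtained by dividing that byte count by 2.
procedure Report(const ALabel: string; const S: UnicodeString);
var
  ByteCount: SizeInt;
begin
  // Length() of a UnicodeString counts 2-byte WideChar elements.
  ByteCount := Length(S) * SizeOf(WideChar);
  WriteLn(ALabel, ': ', ByteCount, ' bytes, ',
          ByteCount div 2, ' WideChar element(s)');
end;

var
  EmDash, Pair: UnicodeString;
begin
  // U+2014 (EM Dash) fits in a single WideChar.
  SetLength(EmDash, 1);
  EmDash[1] := WideChar($2014);

  // U+1F600 needs a surrogate pair: high $D83D, low $DE00.
  SetLength(Pair, 2);
  Pair[1] := WideChar($D83D);
  Pair[2] := WideChar($DE00);

  Report('U+2014', EmDash);
  Report('U+1F600', Pair);
end.
```

The EM Dash string reports 2 bytes and 1 element, the surrogate pair 4 bytes and 2 elements, so 'div 2' gives the right length for SetLength() either way.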
Remy Lebeau
Lebeau Software - Owner, Developer
Internet Direct (Indy) - Admin, Developer (Support forum)

Remy Lebeau

Re: Unicode Filename issue on Windows
« Reply #5 on: January 21, 2019, 09:04:49 pm »
Quote from snorkel: "Most of the time this works perfectly, but once in a great while someone will put a file in the directory that has an EM Dash char in it which is unicode, this causes a ton of issues with SetLength(FFolderItemInfo.Name, vFileInfo^.FileNameLength div 2);
the div 2 gets the correct length when the filename is ascii, but not Unicode.
If the filename has the EM char in it removing the div 2 works."

I seriously doubt that, given that

- dividing the FileNameLength by 2 assumes the string data consists of 2-byte elements. Indeed, Unicode on Windows is handled using UTF-16, which does use 2-byte elements.

- the EM Dash char (U+2014) takes up only one 2-byte WideChar in UTF-16.

Even for Unicode characters that require two 2-byte WideChars (aka, surrogate pairs), dividing the FileNameLength by 2 would still work just fine, since it counts the total number of bytes used for encoded elements, not the number of Unicode characters.  So whether a given character takes up 1 or 2 elements, the FileNameLength will account for that as expected.

Quote from snorkel: "How can I get this to work for ascii and Unicode filenames?"

The code shown works perfectly fine with Unicode filenames.  Especially the EM Dash character, which takes up only 1 WideChar element.  So the problem has to be something else.  Please provide a hex dump of the raw FileName data, and the corresponding FileNameLength.  Also, make sure that FFolderItemInfo.Name is a 16bit (Wide|Unicode)String and not an 8bit (Ansi|UTF8)String.

If FFolderItemInfo.Name is an 8bit string then passing the FileNameLength as-is to SetLength() is wrong, it would need to be translated from a 16bit length to an 8bit length first.  But why is the code even calling SetLength() at all?  WideCharLenToString() is already returning a String that took the FileNameLength into account, so the SetLength() afterwards is completely unnecessary.
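
As a rough sketch of what I mean (not the OP's actual code: TNotifyInfo is a hypothetical stand-in that only mirrors the field layout of FILE_NOTIFY_INFORMATION, with a fixed-size FileName array for the demo, and NotifyName is a made-up helper), the name can be copied into a 16bit UnicodeString in a single step and converted to UTF-8 afterwards only if an 8bit string is really needed:

```pascal
program NotifyNameDemo;
{$mode objfpc}{$H+}

type
  // Hypothetical stand-in for FILE_NOTIFY_INFORMATION (demo only).
  TNotifyInfo = record
    NextEntryOffset: LongWord;
    Action: LongWord;
    FileNameLength: LongWord;            // length of FileName in BYTES
    FileName: array[0..15] of WideChar;  // not null-terminated in the real API
  end;
  PNotifyInfo = ^TNotifyInfo;

// Copies the UTF-16 name out of the record. FileNameLength is in bytes,
// so dividing by SizeOf(WideChar) yields the number of elements to copy.
function NotifyName(Info: PNotifyInfo): UnicodeString;
begin
  SetString(Result, PWideChar(@Info^.FileName[0]),
            Info^.FileNameLength div SizeOf(WideChar));
end;

var
  Info: TNotifyInfo;
  WideName: UnicodeString;
  Utf8Name: UTF8String;
begin
  // Simulate a notification for a file named 'a', EM Dash, 'b'.
  FillChar(Info, SizeOf(Info), 0);
  Info.FileName[0] := WideChar(Ord('a'));
  Info.FileName[1] := WideChar($2014);
  Info.FileName[2] := WideChar(Ord('b'));
  Info.FileNameLength := 3 * SizeOf(WideChar);

  WideName := NotifyName(@Info);
  // Convert explicitly if the rest of the program uses 8bit UTF-8 strings.
  Utf8Name := UTF8Encode(WideName);

  WriteLn('UTF-16 elements: ', Length(WideName));
  WriteLn('UTF-8 bytes:     ', Length(Utf8Name));
end.
```

Here the name is 3 elements long in UTF-16 but 5 bytes long in UTF-8, which is why a length taken from FileNameLength cannot be reused as-is for an 8bit string.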
« Last Edit: January 21, 2019, 09:08:41 pm by Remy Lebeau »

Sanem

Re: Unicode Filename issue on Windows
« Reply #6 on: January 22, 2019, 12:25:56 pm »
Thank you so much, your solution solved the problem:
"make sure that FFolderItemInfo.Name is a 16bit (Wide|Unicode)String and not an 8bit (Ansi|UTF8)String."

Regards

 
