Recent

Author Topic: Autodetect file format  (Read 5025 times)

SeregaKR

  • New Member
  • *
  • Posts: 35
Autodetect file format
« on: July 10, 2018, 06:21:58 am »
Hello.
I would like to know how autodetect file format works in fpspreadsheet. For now it seems it just analyse file extension like xls - Excel 8, xlsx - Office XML and so on. But I have a problem when our partners send us xls files,which are actually xlsx files with manually changed extension. The same problem is with Excel 5 files when fpspreadsheet see them as normal Excel 8 files.
Is there a way to detect the format regardless of extension?

Thaddy

  • Hero Member
  • *****
  • Posts: 14214
  • Probably until I exterminate Putin.
Re: Autodetect file format
« Reply #1 on: July 10, 2018, 08:46:33 am »
Well xlsx is a text format afaik, and xls is a binary format afaik.
So maybe this helps:
Code: Pascal  [Select][+][-]
  1. function IsBinaryFile(const Filename:AnsiString):Boolean;
  2. var
  3.  f:File of byte;
  4.  b:byte = 0;
  5. begin
  6.   Result := false;
  7. {$push}{$I-}
  8.   Assign(f,Filename);
  9.   reset(f);
  10.   if (IOResult = 0) and (Filename <>'') then
  11.     while not eof(f) do
  12.     begin
  13.       Read(f,b);
  14.       // test for non-printable characters
  15.       if ((b > 0) and (b < 8)) or ((b > 13) and (b < 26)) then exit(true);
  16.     end;
  17.   Close(f);
  18. {$pop}
  19. end;

See http://www.differencebetween.net/technology/difference-between-xls-and-xlsx/.
It seems to work pretty good. OTOH it is slow for real text files and fast for real BIFF files (xls) , so maybe just test the header for xml presence.
And xmlsx can be zipped, so also test for zip info.

I suspect there is something much simpler than the above!
« Last Edit: July 10, 2018, 09:20:01 am by Thaddy »
Specialize a type, not a var.

wp

  • Hero Member
  • *****
  • Posts: 11858
Re: Autodetect file format
« Reply #2 on: July 10, 2018, 10:41:21 am »
Hello.
I would like to know how autodetect file format works in fpspreadsheet. For now it seems it just analyse file extension like xls - Excel 8, xlsx - Office XML and so on. But I have a problem when our partners send us xls files,which are actually xlsx files with manually changed extension. The same problem is with Excel 5 files when fpspreadsheet see them as normal Excel 8 files.
Is there a way to detect the format regardless of extension?
First of all: Tell your customers never to do this. Even Excel cannot read an xls file renamed to xlsx! (LibreOffice Calc on the other hand can...)

Yes, fpspreadsheet relies on the extension to distinguish between the main file formats. Only in case of xls (which may contain either BIFF2, BIFF5 or BIFF8 binary formats) it reads the header and makes a decision based on characteristic differences.  But since it is easy to check the signature of zipped files ('PK' in the first two bytes) I also added xlsx and ods to this test now. However, to distinguish between these two fpspreadsheet must try to read the files and catch the exception of the reader (while your program is running in the debugger, the IDE will generate an exception which will not occur when the program is run outside the IDE. But you can avoid this by adding the exception type EFPSpreadsheetReader to the "Language Exceptions" under "Tools" > "Options" > "Debugger").

So, update your fpspreadsheet to the current svn version and fpspreadsheet will accept xls files "disguised" as ".xlsx" or ".ods", or ".abc" or whatever.

RayoGlauco

  • Full Member
  • ***
  • Posts: 176
  • Beers: 1567
Re: Autodetect file format
« Reply #3 on: July 10, 2018, 11:15:10 am »
If you only need to distinguish XLS or XLSX formats, you can read the first bytes of the file:

D0 CF ... = XLS
50 4B ... = XLSX

There is info about filetype signatures here: https://en.wikipedia.org/wiki/List_of_file_signatures

edit: if you know your file can only be XLS or XLSX, you can read the first byte: if it is D0 then you have an XLS file, else you have a XLSX file. If the file can be another type, you need more tests, see the wikipedia link.
« Last Edit: July 10, 2018, 11:23:51 am by RayoGlauco »
To err is human, but to really mess things up, you need a computer.

wp

  • Hero Member
  • *****
  • Posts: 11858
Re: Autodetect file format
« Reply #4 on: July 10, 2018, 11:39:56 am »
Not necessarily:
  • FpSpreadsheet supports also the old Excel2 format which has the extension .xls, too, but begins with $09, $00, $04, $00.
  • And the first bytes 'PK' (your $50 $4B) are common to all zipped files. Since both xlsx and ods are zipped xml files your xlsx file might be an ods as well.

RayoGlauco

  • Full Member
  • ***
  • Posts: 176
  • Beers: 1567
Re: Autodetect file format
« Reply #5 on: July 10, 2018, 11:50:14 am »
Not necessarily:
  • FpSpreadsheet supports also the old Excel2 format which has the extension .xls, too, but begins with $09, $00, $04, $00.
  • And the first bytes 'PK' (your $50 $4B) are common to all zipped files. Since both xlsx and ods are zipped xml files your xlsx file might be an ods as well.

You are right, that's why I said "if you know the file only can be XLS or XLSX".
But, if there are XLS files not beginning by $D0, you can compare the first byte against $50: it is $50 -> it's XLSX; else, it's XLS.
To err is human, but to really mess things up, you need a computer.

SeregaKR

  • New Member
  • *
  • Posts: 35
Re: Autodetect file format
« Reply #6 on: July 10, 2018, 11:57:25 am »
I need to open the file and check if it fits the template (e.g. certain cell have a number inside).
But sometimes one of our partners send us wrong files or files like I described before and as a result  they are processed incorrectly. That's why I'm trying to test the format so I will process the correct files with wrong format.

Thank you for your help.

Revisiobn 6555 is current?

wp

  • Hero Member
  • *****
  • Posts: 11858
Re: Autodetect file format
« Reply #7 on: July 10, 2018, 12:00:36 pm »

SeregaKR

  • New Member
  • *
  • Posts: 35
Re: Autodetect file format
« Reply #8 on: July 10, 2018, 12:37:59 pm »
Ok. Downloaded and installed current revision. Now I get the error as described.
But is there actually a way to process the incorrect file with correct reader?

For now I have a component sWorkBookSource1 on the form with AutoDetectFormat = True.
Here is the code I use for now (that doesn't use the file with incorrect extension):
Code: Pascal  [Select][+][-]
  1. var test:integer;
  2. ...
  3. try
  4.     sWorkbookSource1.FileName :=FileOpen1.Dialog.FileName;
  5.     test := trunc(sWorkbookSource1.Workbook.GetWorksheetByIndex(0).ReadAsNumber(1,1));
  6. except
  7.     showmessage('Not processed');
  8. end;    
  9.  

I guess I need to write something like this to process such files, but I'm not sure if it's the correct way.
Code: Pascal  [Select][+][-]
  1. var test:integer;
  2. ...
  3. try
  4.     sWorkbookSource1.FileName :=FileOpen1.Dialog.FileName;
  5.     test := trunc(sWorkbookSource1.Workbook.GetWorksheetByIndex(0).ReadAsNumber(1,1));
  6. except
  7.     try
  8.        sWorkbookSource1.AutoDetectFormat := False;
  9.        sWorkbookSource1.FileFormat := sfOOXML;
  10.        sWorkbookSource1.FileName :=FileOpen1.Dialog.FileName;
  11.        test := trunc(sWorkbookSource1.Workbook.GetWorksheetByIndex(0).ReadAsNumber(1,1));
  12.     except    
  13.        showmessage('Not processed');
  14.     end;
  15. end;    
  16.  
« Last Edit: July 10, 2018, 12:40:56 pm by SeregaKR »

wp

  • Hero Member
  • *****
  • Posts: 11858
Re: Autodetect file format
« Reply #9 on: July 10, 2018, 12:42:38 pm »
Does the exception also fire when you run the program outside the IDE? I am sure it does not. Did you read my note about the exception? If you don't want the exception to happen inside the IDE you must add its type (EFPSpreadReader) to the Language exceptions of "Tools" > "Options".

Of course you can do the checks proposed by RayoGlauco before you attempt to load the file into fpspreadsheet. But I would bet that customers who give you xls files disguised as xlsx will also be able to give you jpegs or whatever renamed to xlsx...

But this is not necessary because the "disguised" file is read correctly by fpspreadsheet now. Try the attached demo: it writes an xlsx file, renames it to xls and then opens the xls file (although it is an xlsx) without issues - just click on the buttons at the top in the order 1 -> 2 -> 3.
« Last Edit: July 10, 2018, 01:06:21 pm by wp »

SeregaKR

  • New Member
  • *
  • Posts: 35
Re: Autodetect file format
« Reply #10 on: July 10, 2018, 01:51:10 pm »
Yes I read your note about adding the exception.

Your example works. Now I need to understand why my proram don't.

wp

  • Hero Member
  • *****
  • Posts: 11858
Re: Autodetect file format
« Reply #11 on: July 10, 2018, 01:58:54 pm »
Maybe the file is not a spreadsheet file at all. Can you post the file here? (rename its extension to .zip to be accepted by the forum software). If you cannot or do not want to publish the file send me a PM and I give you me personal e-mail address.

SeregaKR

  • New Member
  • *
  • Posts: 35
Re: Autodetect file format
« Reply #12 on: July 10, 2018, 02:05:32 pm »
found the bug. It was a spelling mistake on my part. Sorry for misleading.

 

TinyPortal © 2005-2018