
Author Topic: How to remove duplicate entries from very large text files  (Read 24431 times)

Gizmo

  • Hero Member
  • *****
  • Posts: 831
Re: How to remove duplicate entries from very large text files
« Reply #30 on: September 16, 2016, 04:09:35 pm »
Fungus

Thanks. To clarify, I'm not referring to hashes as such here - just an integer list. I'm thinking that if we can work out a way of computing an integer value for a string that is reversible, then once it has been found in the input file once, it won't be added again. Let's pretend 'ABC' equalled 589654, for argument's sake. Then, on the next line read, if it also happens to be 'ABC', the program will compute 589654 for it, look it up in the list, and know it's already been found and thus not add it again.

I guess I could always just use MD5 or SHA-1 digests and store the binary digests in the list, but then there's the reversing issue - I'd still need the originating value, given that hashes are non-reversible (hence the use of HashLists in the first place). Collisions get a lot of press but aren't really an issue here: engineering a hash collision is one thing, but over real data it is not at all common, and for SHA-1 you're still looking at odds of about 1 in 1.46 trillion trillion trillion trillion (roughly 2^160) that one random line of data has the same SHA-1 hash as another.
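
For what it's worth, a minimal sketch of the digest-plus-offset idea (assuming FPC's md5 unit and a TList of records; TDigestEntry, SameLineAt and IsDuplicate are just names made up for illustration): store the 16-byte digest together with the file offset of the line it came from, and on a digest match re-read the original line through a second stream to confirm it really is a duplicate rather than a collision.

Code: Pascal
// Needs Classes, SysUtils and md5 in the uses clause.
type
  PDigestEntry = ^TDigestEntry;
  TDigestEntry = record
    Digest: TMDDigest;  // 16-byte MD5 of the line
    Offset: Int64;      // where that line starts in the original file
  end;

// Re-read the original line at Offset and compare it with Line.
// (Comparing just the first Length(Line) bytes is good enough here,
// because this is only called after the digests already matched.)
function SameLineAt(Reread: TFileStream; Offset: Int64; const Line: string): Boolean;
var
  Buf: string;
begin
  Result := (Line = '');
  if Result then Exit;
  Reread.Position := Offset;
  SetLength(Buf, Length(Line));
  Result := (Reread.Read(Buf[1], Length(Line)) = Length(Line)) and (Buf = Line);
end;

// Digest comparison first (cheap); re-read only to rule out a collision.
function IsDuplicate(Entries: TList; Reread: TFileStream; const Line: string): Boolean;
var
  D: TMDDigest;
  i: Integer;
begin
  Result := False;
  D := MD5String(Line);
  for i := 0 to Entries.Count - 1 do
    if MD5Match(PDigestEntry(Entries[i])^.Digest, D) and
       SameLineAt(Reread, PDigestEntry(Entries[i])^.Offset, Line) then
    begin
      Result := True;
      Exit;
    end;
end;
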
« Last Edit: September 16, 2016, 04:11:41 pm by Gizmo »

rvk

  • Hero Member
  • *****
  • Posts: 6171
Re: How to remove duplicate entries from very large text files
« Reply #31 on: September 16, 2016, 04:18:46 pm »
The remaining issue will still be that a 64-bit hash cannot be considered 100% unique, so you cannot be sure that all duplicates are removed - you might even end up removing words that are not duplicates at all.
Such a reversible number would essentially be the same thing as a hash.

I would still use a hash number with this method. But I wouldn't create a separate index file. I would store the position together with the hash number (multiple identical hash numbers can exist, so those would be stored more than once, but the position numbers are unique).

When you search that hash list you would need to iterate over all entries with the same hash number and use a second filestream (on the same file) to re-read the word in question, to see whether you're really dealing with a duplicate. That way you have the index in memory (as pointers into the original file) and don't need to build a separate file for it.

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 9913
  • Debugger - SynEdit - and more
    • wiki
Re: How to remove duplicate entries from very large text files
« Reply #32 on: September 16, 2016, 04:21:27 pm »
I'm thinking that if we can work out a way of computing an integer value for a string that is reversible, then once it has been found in the input file once, it won't be added again. Let's pretend 'ABC' equalled 589654, for argument's sake. Then, on the next line read, if it also happens to be 'ABC', the program will compute 589654 for it, look it up in the list, and know it's already been found and thus not add it again.

1) If 'ABC' computes to 589654, there may be other words that also compute to 589654. This cannot be avoided (unless all words come from a known list, and there can never, ever be any other word).
But you can store all words that compute to the same value under that number (separated by a #00 byte).

2) This is exactly what a hash table is, and it is what TFPHashList does.
You just need a version that doesn't force shortstrings.
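
As a rough illustration of point 1 (not TFPHashList's own code, just a hypothetical helper): a bucket for one hash value can hold every word that computed to that value as a single #0-separated string, so a real duplicate is only reported when the exact word is already in the bucket.

Code: Pascal
// Returns True if W was not yet in the bucket (and has now been added),
// False if the bucket already contained W, i.e. a genuine duplicate.
function AddIfNew(var Bucket: AnsiString; const W: AnsiString): Boolean;
var
  p, q: Integer;
begin
  p := 1;
  while p <= Length(Bucket) do
  begin
    q := p;
    while (q <= Length(Bucket)) and (Bucket[q] <> #0) do
      Inc(q);
    if Copy(Bucket, p, q - p) = W then
    begin
      Result := False;       // already stored under this hash value
      Exit;
    end;
    p := q + 1;              // skip the #0 separator
  end;
  Bucket := Bucket + W + #0; // first time this word hashes here: append it
  Result := True;
end;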


rvk

  • Hero Member
  • *****
  • Posts: 6171
Re: How to remove duplicate entries from very large text files
« Reply #33 on: September 16, 2016, 04:22:35 pm »
I'm thinking that if we can work out a way of computing an integer value for a string that is reversible, then once it has been found in the input file once, it won't be added again. Let's pretend 'ABC' equalled 589654, for argument's sake. Then, on the next line read, if it also happens to be 'ABC', the program will compute 589654 for it, look it up in the list, and know it's already been found and thus not add it again.
That would be brilliant if you could fit it into an Int64. But how large are your biggest possible words? There's no way you can fit a 20-letter word into an Int64.

Even for just 36 letters/digits - not the full alphabet, let alone Unicode - you need 6 bits per character (2^6 = 64 covers 36 symbols). That's about 10 characters per 64 bits, so your words couldn't exceed roughly 10 characters. Do you have bigger words?
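
A back-of-the-envelope sketch of such a reversible packing (purely illustrative; it assumes a 36-symbol alphabet of a-z and 0-9, at 6 bits per character):

Code: Pascal
// Packs a word over the alphabet a..z, 0..9 into an Int64, 6 bits per character.
// 64 div 6 = 10, so anything longer than 10 characters cannot be packed this way.
// Returns -1 if the word is too long or contains a character outside the alphabet.
function PackWord(const W: string): Int64;
var
  i, code: Integer;
begin
  Result := -1;
  if Length(W) > 10 then
    Exit;
  Result := 0;
  for i := 1 to Length(W) do
  begin
    case W[i] of
      'a'..'z': code := Ord(W[i]) - Ord('a') + 1;   // codes 1..26
      '0'..'9': code := Ord(W[i]) - Ord('0') + 27;  // codes 27..36
    else
      begin
        Result := -1;                               // not representable
        Exit;
      end;
    end;
    Result := (Result shl 6) or Int64(code);        // append 6 bits
  end;
end;

Unpacking is just the reverse: read 6 bits at a time until only zero bits remain (code 0 is never used), which is what makes it reversible - but only up to 10 characters and only for that small alphabet.
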
« Last Edit: September 16, 2016, 04:24:29 pm by rvk »

Gizmo

  • Hero Member
  • *****
  • Posts: 831
Re: How to remove duplicate entries from very large text files
« Reply #34 on: September 16, 2016, 04:58:35 pm »
Use NameOfIndex.

Edit:
Set the Capacity at the beginning.
Use FindIndexOf instead of Find.

As a side note, consider using something like BufStream to increase the speed of reading.

Engkin... nice and clear suggestions as always, thanks. But sadly, still no joy.

If I try using

Code: Pascal

if HashList.FindIndexOf(printableword) = -1 then
begin
  //Add the values to the list etc
end;

the memory usage is still the same... it jumps up by 150 MB per second. And if I set HashList.Capacity to MaxInt (2147483647), it almost immediately crashes, reporting that the maximum capacity has been reached. So I take that back out, and the program does start the deduplication for a few seconds, before absorbing 1.3 GB of RAM and then crashing.

Any other ideas?

rvk

  • Hero Member
  • *****
  • Posts: 6171
Re: How to remove duplicate entries from very large text files
« Reply #35 on: September 16, 2016, 05:13:15 pm »
Not tested, and just something I typed up quickly... (ugh...)
It could probably be made much more efficient with a generic record TList.
Also, I have not sorted the TList, so you'll need to iterate through the whole list; making it sorted would be much faster.

But hey... it's just a little concept.
(How does it do on really big files?)

Code: Pascal
// Needs Classes and SysUtils in the uses clause.
type
  PMyHash = ^TMyHash;
  TMyHash = record
    HashValue: int64;
    Position: int64;   // offset of the line in the original file
  end;

  TMyHashList = class(TList)
  private
    function Get(Index: integer): PMyHash;
  public
    destructor Destroy; override;
    function Add(Value: PMyHash): integer;
    property Items[Index: integer]: PMyHash read Get; default;
  end;

function TMyHashList.Add(Value: PMyHash): integer;
begin
  Result := inherited Add(Value);
end;

destructor TMyHashList.Destroy;
var
  i: integer;
begin
  for i := 0 to Count - 1 do
    FreeMem(Items[i]);
  inherited;
end;

function TMyHashList.Get(Index: integer): PMyHash;
begin
  Result := PMyHash(inherited Get(Index));
end;

function ReadLine(const Stream: TStream; var Line: string): boolean;
var
  RawLine: string;
  ch: AnsiChar;
begin
  Result := False;
  ch := #0;
  RawLine := '';
  while (Stream.Read(ch, 1) = 1) and (ch <> #13) and (ch <> #10) do
  begin
    Result := True;
    RawLine := RawLine + ch;
  end;
  Line := RawLine;
  if (ch = #13) then
  begin
    Result := True;
    if (Stream.Read(ch, 1) = 1) and (ch <> #10) then
      Stream.Seek(-1, soCurrent); // unread it if not LF character.
  end;
end;

procedure WriteLine(const Stream: TStream; Line: string);
begin
  Stream.Write(Line[1], Length(Line));
  Stream.Write(LineEnding, Length(LineEnding));
end;

// Same style of hash function as TFPHashList uses.
function FPHash(const s: shortstring): longword;
var
  p, pmax: PChar;
begin
{$push}
{$Q-}
  Result := 0;
  p := @s[1];
  pmax := @s[length(s) + 1];
  while (p < pmax) do
  begin
    Result := longword(longint(Result shl 5) - longint(Result)) xor longword(P^);
    Inc(p);
  end;
{$pop}
end;

procedure Deduplicate(BigTextFile: string);
var
  FOut, FIn, Idx: TFileStream;
  Hash: TMyHashList;
  RHash: TMyHash;
  MyHash: PMyHash;
  Line, Check: string;
  Hs: integer;
  AddValue: boolean;
begin
  FOut := TFileStream.Create(BigTextFile + '-DeDuplicated', fmCreate);
  FIn := TFileStream.Create(BigTextFile, fmOpenRead or fmShareDenyWrite);
  Idx := TFileStream.Create(BigTextFile, fmOpenRead or fmShareDenyWrite); // for reread
  Hash := TMyHashList.Create;
  try
    RHash.Position := FIn.Position;
    while ReadLine(FIn, Line) do
    begin
      RHash.HashValue := FPHash(Line);
      AddValue := True;
      // Linear scan over all stored hashes; on a hash match, re-read the
      // original line through Idx to rule out a mere hash collision.
      for Hs := 0 to Hash.Count - 1 do
      begin
        if Hash.Items[Hs]^.HashValue = RHash.HashValue then
        begin
          Idx.Position := Hash.Items[Hs]^.Position;
          if ReadLine(Idx, Check) then AddValue := (Line <> Check);
          if not AddValue then break;
        end;
      end;
      if AddValue then
      begin
        GetMem(MyHash, SizeOf(TMyHash));
        MyHash^.HashValue := RHash.HashValue;
        MyHash^.Position := RHash.Position;
        Hash.Add(MyHash);
        WriteLine(FOut, Line);
      end;
      RHash.Position := FIn.Position;
    end;
  finally
    FOut.Free;
    FIn.Free;
    Idx.Free;
    Hash.Free;
  end;
end;
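
For reference, the sketch would be driven by a single call; the de-duplicated output lands next to the input file (the file name here is only an example):

Code: Pascal
Deduplicate('bigtextfile.txt'); // writes 'bigtextfile.txt-DeDuplicated'
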
« Last Edit: September 16, 2016, 05:45:58 pm by rvk »

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 9913
  • Debugger - SynEdit - and more
    • wiki
Re: How to remove duplicate entries from very large text files
« Reply #36 on: September 16, 2016, 05:57:59 pm »
A bit more detail on the idea from my last post.

calchash(string): integer; // use the hash function from TFPHashList

In dummy code - I hope I didn't miss any errors.

Memory usage will be constant; all the extra data goes into files.

Code: Pascal
table_size = 100000000; // 100 million slots = 800 MB
table: array of int64;
setlength(table, table_size);

foreach word_in_input do begin
   h := calchash(word) mod table_size;
   known := table[h];

   // the table was initialized with 0, so plain output positions are stored
   // incremented by 1; undo that here, so an empty slot (0) becomes -1
   if (known and $8000000000000000) = 0 then known := known - 1;

   if known = -1 then begin
      // first word ever seen for this hash value
      p := output.pos; // current pos in output
      output.writeln(word);
      table[h] := (p + 1) and $7FFFFFFFFFFFFFFF; // top bit reserved for another purpose / 63 bits are enough for a file pos
      continue; // next word
   end;

   dup := false;
   known_orig := known;

   // a set top bit indicates that more than one word maps to this hash value:
   // "known" then points into the extra file, which holds (previous entry, word pos) pairs
   while (not dup) and ((known and $8000000000000000) <> 0) do begin
      extra_file.seek(known and $7FFFFFFFFFFFFFFF, from_start);
      known_prev := extra_file.read_int64();
      word_pos := extra_file.read_int64();
      output_2nd_handle.seek(word_pos, from_start);
      // read word from output
      dup := (word = word_from_output);
      known := known_prev;
   end;

   // end of the chain: "known" is now a plain position in the output file
   if not dup then begin
      output_2nd_handle.seek(known, from_start);
      // read word from output
      dup := (word = word_from_output);
   end;

   if dup then continue; // next word

   // new word for an already-used hash value:
   // write it to the output and append a chain entry to the extra file
   p := output.pos; // current pos in output
   output.writeln(word);
   extra_file.seek(0, from_end); // append
   x := extra_file.pos;
   extra_file.write_int64(known_orig);
   extra_file.write_int64(p);
   table[h] := x or $8000000000000000; // pos in extra file, top bit set

end

Fungus

  • Sr. Member
  • ****
  • Posts: 353
Re: How to remove duplicate entries from very large text files
« Reply #37 on: September 16, 2016, 07:49:46 pm »
Fair enough, suggestions are rolling in... So, here is mine:

Code: Pascal
Program DeDupAndSort;

{$mode objfpc}{$H+} // class helpers need objfpc (or delphi) mode

Uses Classes, SysUtils;

Const
  CRLF : String = #13#10;

Type
  PInt64 = ^Int64;

  //Helper class to read and write lines
  TStringIO = Class Helper For TFileStream
    Function ReadLine: RawByteString;
    Procedure WriteLine(Const S: String);
  End;

Function TStringIO.ReadLine: RawByteString;
Var C: Char;
Begin
  //Helper class ReadLine
  Result:= '';
  While Position < Size Do Begin
    Read(C, SizeOf(C));
    If C = #10 Then Exit
    Else If C <> #13 Then Result:= Result + C;
  End;
End;

Procedure TStringIO.WriteLine(Const S: String);
Var L: Integer;
Begin
  //Helper class WriteLine
  L:= Length(S);
  If L > 0 Then Write(S[1], L);
  Write(CRLF[1], Length(CRLF));
End;

//Global variables
Var
   InFile, OutFile: TFileStream;
   OutIndex: TList;
   FileName, CurWord: String;
   InsertAt: Integer;
   Tick, MemUsed: Int64;

Procedure MakeRandomFile;
Const cLetters : String = 'abcdefghijklmnopqrstuvxyzABCDEFGHIJKLMNOPQRSTUVXYZ';
      cWordCount = 1000000;
      cMinWordLen = 5;
      cMaxWordLen = 30;
Var W: String;
    I, WL, WI: Integer;
Begin
  //Create a file of random words where every 10th word is duplicated
  Randomize;
  With TFileStream.Create(FileName, fmCreate) Do Try
    For I:= 1 To cWordCount Do Begin
      If I Mod 10 <> 0 Then Begin
        W:= '';
        WL:= Random(cMaxWordLen - cMinWordLen) + cMinWordLen;
        For WI:= 1 To WL Do W:= W + cLetters[Random(Length(cLetters)) + 1];
      End;
      WriteLine(W);
    End;
  Finally
    Free;
  End;
End;

Function Lookup(Const S: String): Boolean;
Var First, Last, Idx, Cmp: Integer;
    LCase, CurStr: String;
Begin
  //Lookup "S" using the sorted index
  Result:= False;
  If OutIndex.Count = 0 Then Begin
    //Empty, adjust insert position
    InsertAt:= 0;
    Exit;
  End;

  //Lower case for case insensitivity
  LCase:= LowerCase(S);

  //Adjust first and last element
  First:= 0;
  Last:= OutIndex.Count - 1;

  //This is a basic binary search routine, removing half of the entries
  //for each lookup. This is extremely efficient for large data amounts
  //but it requires the values to be sorted
  While First <= Last Do Begin

    //Find index to test (middle of the search range)
    Idx:= ((Last - First) Div 2) + First;

    //Load the element as lowercase from OutFile
    OutFile.Position:= PInt64(OutIndex[Idx])^;
    CurStr:= LowerCase(OutFile.ReadLine);

    //Compare elements
    Cmp:= CompareStr(LCase, CurStr);

    //Handle comparison:
    If Cmp < 0 Then Last:= Idx - 1 //If less, we cut away the upper half
    Else If Cmp > 0 Then First:= Idx + 1 //If greater, we cut away the lower half
    Else Begin
      //We have a match / duplicate
      Result:= True;
      Exit;
    End;

  End;

  //Adjust the insert position of the tested string
  If Cmp > 0 Then Inc(Idx);
  InsertAt:= Idx;

End;

Procedure Insert(Const S: String);
Var I: PInt64;
Begin
  //Insert a string to OutFile and its index
  New(I);
  I^:= OutFile.Size;
  OutFile.Position:= I^;
  OutFile.WriteLine(S);
  OutIndex.Insert(InsertAt, I);
End;

Procedure ExportSorted;
Var I: Integer;
Begin
  //Export a sorted version of the de-duplicated file
  With TFileStream.Create(FileName + '_sort', fmCreate) Do Try
    For I:= 0 To OutIndex.Count - 1 Do Begin
      OutFile.Position:= PInt64(OutIndex[I])^;
      WriteLine(OutFile.ReadLine);
    End;
  Finally
    Free;
  End;
End;

Procedure FreeOutIndex;
Var I: Integer;
Begin
  //Release the index
  For I:= 0 To OutIndex.Count - 1 Do Dispose(PInt64(OutIndex[I]));
  OutIndex.Free;
End;

Begin

  //Create a file name
  FileName:= ExtractFilePath(ParamStr(0)) + 'bigtextfile.txt';

  //If the file does not exist, generate it
  If Not FileExists(FileName) Then MakeRandomFile;

  //Create input file, output file and index
  InFile:= TFileStream.Create(FileName, fmOpenRead);
  OutFile:= TFileStream.Create(FileName + '_ddup', fmCreate);
  OutIndex:= TList.Create;

  Try

    //Store memory used
    MemUsed:= GetHeapStatus.TotalAllocated;

    //Store tick
    Tick:= GetTickCount64;

    While InFile.Position < InFile.Size Do Begin

      //Read current word
      CurWord:= InFile.ReadLine;

      //Lookup word and insert if unique
      If Not Lookup(CurWord) Then Insert(CurWord)
      Else WriteLn('Removing duplicate: ' + CurWord);

    End;

    //Show memory used
    MemUsed:= GetHeapStatus.TotalAllocated - MemUsed;
    WriteLn(Format('Total memory used is %d bytes', [MemUsed]));

    //Show time used
    Tick:= GetTickCount64 - Tick;
    WriteLn(Format('Duplicates removed in %.3f seconds', [Tick/1000.0]));

    //Export a sorted list of unique words and show speed
    Tick:= GetTickCount64;
    ExportSorted;
    Tick:= GetTickCount64 - Tick;
    WriteLn(Format('Export of sorted file completed in %.3f seconds', [Tick/1000.0]));

  Finally

    //Release da grease!
    FreeOutIndex;
    OutFile.Free;
    InFile.Free;

  End;

  //Yep!!
  WriteLn('Done');

End.

It will generate a file just shy of 20 MB, remove the duplicates, and output both an unsorted and a sorted version of the result; on my system this takes 304.284 seconds. Since no hashes or CRCs are used, the output is 100% accurate.

Gizmo

  • Hero Member
  • *****
  • Posts: 831
Re: How to remove duplicate entries from very large text files
« Reply #38 on: September 16, 2016, 09:59:17 pm »
Just a very quick reply to say thanks so much to you three for the awesome suggestions, which I will try out on Sunday night (no time between now and then). It amazes me how you guys come up with something like this as if it's second nature, when it's taken me a week of battles. It's really very kind of you. I will report back on the outcome.

BeniBela

  • Hero Member
  • *****
  • Posts: 908
    • homepage
Re: How to remove duplicate entries from very large text files
« Reply #39 on: September 16, 2016, 11:18:59 pm »
You could speed that up with a Bloom filter.

It is a probabilistic hash structure: it can say with certainty that a line has not been seen before, and it only flags a small number of genuinely unique lines as possible duplicates. So it filters out almost all unique lines cheaply, and you only need to do an exact check on the remaining possible duplicates.
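
For what that could look like, here is a minimal sketch (not a tuned implementation; the sizes and the two hash functions are just placeholders): a plain bit array plus a few derived hash positions per line. BloomCheckAndAdd returns False when a line is definitely new and True when it might be a duplicate, and only the "might be" cases need the expensive exact check against the real data.

Code: Pascal
{$Q-}{$R-} // the hash arithmetic relies on LongWord wrap-around

const
  BLOOM_BITS: LongWord = 1 shl 27;  // 128 Mbit = 16 MB bit array (placeholder size)
  BLOOM_HASHES = 4;                 // derived hash positions per line

var
  BloomBits: array of Byte;         // plain byte array used as a bit set

procedure BloomInit;
begin
  SetLength(BloomBits, BLOOM_BITS div 8); // all bits start at 0
end;

// Two simple, roughly independent string hashes (FNV-1a and djb2).
function HashA(const S: string): LongWord;
var i: Integer;
begin
  Result := 2166136261;
  for i := 1 to Length(S) do
    Result := (Result xor LongWord(Ord(S[i]))) * 16777619;
end;

function HashB(const S: string): LongWord;
var i: Integer;
begin
  Result := 5381;
  for i := 1 to Length(S) do
    Result := Result * 33 + LongWord(Ord(S[i]));
end;

// False = definitely not seen before; True = possibly seen before (verify it).
// The line is added to the filter in both cases.
function BloomCheckAndAdd(const S: string): Boolean;
var
  a, b, bit: LongWord;
  k: Integer;
begin
  a := HashA(S);
  b := HashB(S);
  Result := True;
  for k := 0 to BLOOM_HASHES - 1 do
  begin
    bit := (a + LongWord(k) * b) mod BLOOM_BITS;  // double hashing: h_k = a + k*b
    if (BloomBits[bit div 8] and (1 shl (bit mod 8))) = 0 then
      Result := False;                            // an unset bit: cannot have been added before
    BloomBits[bit div 8] := BloomBits[bit div 8] or (1 shl (bit mod 8));
  end;
end;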

shobits1

  • Sr. Member
  • ****
  • Posts: 271
  • .
Re: How to remove duplicate entries from very large text files
« Reply #40 on: September 17, 2016, 02:17:18 am »
I have this idea but don't know if it is efficient.
How about creating a dictionary (or array) of all the words in the text file, then translating the text file into indexes from the dictionary? This will reduce the memory usage. Now sort the items in the new file, check for duplicates, and remove them from the original text file.

illustration:
original text (size: 118 bytes)
Lorem ipsum dolor sit amet
consectetur adipiscing elit.
Maecenas egestas purus mauris, vitae
consectetur adipiscing elit.

the dict will hold
1:Lorem 2:ipsum 3:dolor 4:sit 5:amet 6:consectetur 7:adipiscing 8:elit 9:. 10:Maecenas 11:egestas 12:purus 13:mauris 14:,  15:vitae

new file will be (key:value) or array (size: 35 bytes)
1:$0102030405
2:$06070809
3:$0A0B0C0D0E0F
4:$06070809

sort new file by value
1:$0102030405
2:$06070809
4:$06070809
3:$0A0B0C0D0E0F

Deleting all unique values (keeping only the extra copies of duplicates) leaves:
4:$06070809

Then read and write the original text, skipping the remaining items (here, line 4), which results in:
1:$0102030405
2:$06070809
3:$0A0B0C0D0E0F

One thing, though: this will need to read the text file several times, but it will save a lot of memory, especially when the text includes Unicode characters.
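
For concreteness, a rough sketch of just the dictionary-building and translation step described above (hypothetical helpers, using a sorted TStringList as the word-to-index map; punctuation handling and the later sort/compare passes are left out):

Code: Pascal
// Needs Classes, SysUtils and StrUtils in the uses clause.
var
  Dict: TStringList; // word -> index map (the index is stored in Objects[])

procedure DictInit;
begin
  Dict := TStringList.Create;
  Dict.Sorted := True;          // enables fast binary-search lookups via Find
  Dict.CaseSensitive := False;  // if case-sensitiveness doesn't matter
end;

// Returns the dictionary index of W, adding W if it is not yet known.
function WordIndex(const W: string): Integer;
var
  i: Integer;
begin
  if Dict.Find(W, i) then
    Result := Integer(PtrInt(Dict.Objects[i]))
  else
  begin
    Result := Dict.Count + 1;   // indexes start at 1, as in the illustration above
    Dict.AddObject(W, TObject(PtrInt(Result)));
  end;
end;

// Translates one line into a comma-separated list of word indexes,
// e.g. 'consectetur adipiscing elit' -> '6,7,8'.
function EncodeLine(const Line: string): string;
var
  i, n: Integer;
begin
  Result := '';
  n := WordCount(Line, [' ']);
  for i := 1 to n do
  begin
    if Result <> '' then
      Result := Result + ',';
    Result := Result + IntToStr(WordIndex(ExtractWord(i, Line, [' '])));
  end;
end;

The encoded lines are short and drawn from a small alphabet, so they can be sorted and compared far more cheaply than the original text, which is where the memory saving comes from.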

What do you think?  %)
« Last Edit: September 17, 2016, 02:19:02 am by shobits1 »

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 9913
  • Debugger - SynEdit - and more
    • wiki
Re: How to remove duplicate entries from very large text files
« Reply #41 on: September 17, 2016, 03:00:58 am »
I have this idea but don't know if it is efficient.
How about creating a dictionary (or array) of all the words in the text file, then translating the text file into indexes from the dictionary? This will reduce the memory usage. Now sort the items in the new file, check for duplicates, and remove them from the original text file.

1) Building the dictionary may be costly, as you must check, for each (sub-)word you add, whether it is already present (and for that you must either keep the words in memory, or read the file over and over again).

2) The dictionary only saves space if
  a) there are duplicate (sub-)words, and
  b) the words are longer than the resulting numbers (they will be, if your numbers are of variable byte length, with no padding).

3) You may still run out of memory.

4) Most important: if you are going to sort anyway, you could just sort the original list - what is the benefit of translating it first?

shobits1

  • Sr. Member
  • ****
  • Posts: 271
  • .
Re: How to remove duplicate entries from very large text files
« Reply #42 on: September 17, 2016, 08:05:55 am »
1 - Very true, but I'm hoping the word count won't be huge; tens of thousands at most.
2.a - In a file with gigabytes of text there are bound to be many, many duplicated words, and if case sensitivity doesn't matter there will be even more.

2.b - No padding needed... and in most cases there will be no loss of space, at least for words of 4 or more characters, even when the word count exceeds 2^16 (and if the file is Unicode there will be a huge memory reduction).

3 - Everything has a limit.

4 - Sorting was already proposed by molly, and it was the first thing that came to my mind, but the OP didn't take an interest in it; maybe the file must not end up sorted. Also, translating the file would result in a binary file small enough to make conventional sort methods possible.

 
