Author Topic: Strings and special characters removal  (Read 9054 times)

lucamar

  • Hero Member
  • *****
  • Posts: 4219
Re: Strings and special characters removal
« Reply #15 on: February 15, 2019, 11:28:10 pm »
Errrr.... maybe a stupid question:
What's wrong with defining an array containing all illegal characters, then just looping through that array, firing off a ReplaceStr or StringReplace against the target string?

Nothing wrong ... except speed. Testing for legal characters in a set is much quicker than looping through an array and replacing the illegal ones. Note that there are just 36 legal vs. 219 illegal values.
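
A minimal sketch of the set test described above. It assumes the 36 legal characters are 'A'..'Z' and '0'..'9', which this part of the thread does not actually spell out; KeepLegal and LegalChars are made-up names for illustration.

```pascal
program SetFilterSketch;
{$mode objfpc}{$H+}

const
  // Assumption: the 36 legal characters are the uppercase letters and digits.
  LegalChars = ['A'..'Z', '0'..'9'];

// Returns a copy of S that contains only the characters found in LegalChars.
function KeepLegal(const S: string): string;
var
  i, n: Integer;
begin
  SetLength(Result, Length(S));   // worst case: every character is kept
  n := 0;
  for i := 1 to Length(S) do
    if S[i] in LegalChars then
    begin
      Inc(n);
      Result[n] := S[i];
    end;
  SetLength(Result, n);           // trim to the number of characters kept
end;

begin
  WriteLn(KeepLegal('AB-12_cd!Z'));  // prints AB12Z
end.
```

The string is walked once and each character costs one set-membership test, whereas the array approach runs one full replace pass over the string for every illegal character.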

ETA: Huh ... yeah, what dbannon said  :-[
« Last Edit: February 15, 2019, 11:31:52 pm by lucamar »
Turbo Pascal 3 CP/M - Amstrad PCW 8256 (512 KB !!!) :P
Lazarus/FPC 2.0.8/3.0.4 & 2.0.12/3.2.0 - 32/64 bits on:
(K|L|X)Ubuntu 12..18, Windows XP, 7, 10 and various DOSes.

garlar27

  • Hero Member
  • *****
  • Posts: 652
Re: Strings and special characters removal
« Reply #16 on: February 15, 2019, 11:34:42 pm »
I've been looking at this http://wiki.freepascal.org/UTF8_strings_and_characters#The_beauty_of_UTF-8 and it allows you to know when you are dealing with a multi-byte char.

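Now you know that the first byte of each character tells you how long it is:
Code: Pascal
case Ord(ALeadByte) of
     0..127: OneByteChar;       // plain ASCII
   128..191: ContinuationByte;  // second or later byte, never starts a character
   192..223: TwoByteChar;
   224..239: ThreeByteChar;
   240..247: FourByteChar;      // 248..255 never occur in valid UTF-8
end;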
I would do something like this:
Code: Pascal
   ALength := Length(AStr);
   ind := 1;
   while ind <= ALength do begin
      AChar := AStr[ind];
      // The lead byte decides how many bytes the character occupies.
      case Ord(AChar) of
           0..127: IncCount := 1;
         192..223: IncCount := 2;
         224..239: IncCount := 3;
         240..247: IncCount := 4;
      else
         IncCount := 1;  // stray continuation or invalid byte: step over it
      end;
      TheRealChar := Copy(AStr, ind, IncCount);
      ProcessTheRealChar(TheRealChar);

      Inc(ind, IncCount);
   end;
But this will not tell you if this multi-byte char IS or IS NOT a "letter" because it includes punctuation and other symbols.
Does anyone know how to know if it is a letter or something else?

lucamar

Re: Strings and special characters removal
« Reply #17 on: February 15, 2019, 11:45:43 pm »
But this will not tell you if this multi-byte char IS or IS NOT a "letter" because it includes punctuation and other symbols.
Does anyone know how to know if it is a letter or something else?

Sure: Get the codepoint and test against all known "letter" codepoints  ;)
There must be a function somewhere which does it already, IsAlpha() or something like that, but I don't remember where it is. Let me check ...

(later) I haven't found (yet) anything as simple as an IsUnicodeAlpha() function, but there are some things in the unicodedata unit (in {fpc-source}/rtl/objpas/unicodedata.pas) that may be useful to implement one.
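
For the "get the codepoint" step, a hand-rolled sketch of decoding one UTF-8 sequence could look like the following. NextCodepoint is a hypothetical helper, not an RTL or LCL function, and it does not reject overlong encodings or surrogate values.

```pascal
program CodepointSketch;
{$mode objfpc}{$H+}

// Decodes the code point that starts at S[Index] and advances Index past it.
// Returns $FFFD (the replacement character) for malformed input.
function NextCodepoint(const S: RawByteString; var Index: Integer): Cardinal;
var
  B: Byte;
  Extra, k: Integer;
begin
  B := Ord(S[Index]);
  Inc(Index);
  if B < $80 then
    Exit(B)                                   // one byte: plain ASCII
  else if (B and $E0) = $C0 then
  begin Result := B and $1F; Extra := 1; end  // 110xxxxx: two bytes
  else if (B and $F0) = $E0 then
  begin Result := B and $0F; Extra := 2; end  // 1110xxxx: three bytes
  else if (B and $F8) = $F0 then
  begin Result := B and $07; Extra := 3; end  // 11110xxx: four bytes
  else
    Exit($FFFD);                              // stray continuation or invalid lead byte
  for k := 1 to Extra do
  begin
    if (Index > Length(S)) or ((Ord(S[Index]) and $C0) <> $80) then
      Exit($FFFD);                            // truncated or malformed sequence
    // Each continuation byte (10xxxxxx) contributes six more bits.
    Result := (Result shl 6) or (Ord(S[Index]) and $3F);
    Inc(Index);
  end;
end;

var
  S: RawByteString;
  i: Integer;
begin
  S := 'a'#$C5#$BC;   // 'a' followed by U+017C, written as its bytes C5 BC
  i := 1;
  while i <= Length(S) do
    WriteLn(NextCodepoint(S, i));   // prints 97, then 380
end.
```

The resulting number is what would then be tested against a table of "letter" code points.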
« Last Edit: February 16, 2019, 12:05:00 am by lucamar »

garlar27

Re: Strings and special characters removal
« Reply #18 on: February 15, 2019, 11:55:10 pm »
I saw an IsLetter in the Character unit, but it looks like it works with UTF16 ....
 :-\

dbannon

  • Hero Member
  • *****
  • Posts: 2786
    • tomboy-ng, a rewrite of the classic Tomboy
Re: Strings and special characters removal
« Reply #19 on: February 16, 2019, 04:44:58 am »
.....
But this will not tell you if this multi-byte char IS or IS NOT a "letter" because it includes punctuation and other symbols.
Does anyone know how to know if it is a letter or something else?
Wow, that could be really hard I suspect.
There is, afaik, no rule about how a code is mapped to UTF8 let alone the other unicodes. And there are quite a lot of them. I also think that some could be disputed as to whether or not they contribute to making up a pronounceable word. Would that be the test you want to apply ?

Davo
Lazarus 3, Linux (and reluctantly Win10/11, OSX Monterey)
My Project - https://github.com/tomboy-notes/tomboy-ng and my github - https://github.com/davidbannon

Thaddy

  • Hero Member
  • *****
  • Posts: 14205
  • Probably until I exterminate Putin.
Re: Strings and special characters removal
« Reply #20 on: February 16, 2019, 09:59:58 am »
I would determine the UTF8 codepoint, pass it to UTF8Decode, so it becomes a single UTF16 UnicodeChar and subsequently call the TCharacter functions on it. That's a bit cumbersome, but would work.
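
A rough sketch of that route, converting the whole string at once. UTF8Decode returns a UnicodeString, so a code point above U+FFFF arrives as two UTF-16 units (a surrogate pair) rather than a single UnicodeChar. KeepLetters is a made-up name, and the assumption that the TCharacter.IsLetter(string, index) overload copes with a surrogate pair at that index should be checked against your FPC version.

```pascal
program LetterFilterSketch;
{$mode objfpc}{$H+}

uses
  Character;  // provides TCharacter

// Hypothetical helper: keeps only the letters of a UTF-8 string.
function KeepLetters(const AUtf8: RawByteString): RawByteString;
var
  W, Kept: UnicodeString;
  i: Integer;
begin
  W := UTF8Decode(AUtf8);   // whole string, UTF-8 -> UTF-16
  Kept := '';
  i := 1;
  while i <= Length(W) do
  begin
    if (Ord(W[i]) >= $D800) and (Ord(W[i]) <= $DBFF) and (i < Length(W)) then
    begin
      // High surrogate: this code point occupies two UTF-16 units.
      if TCharacter.IsLetter(W, i) then
        Kept := Kept + W[i] + W[i + 1];
      Inc(i, 2);
    end
    else
    begin
      if TCharacter.IsLetter(W[i]) then
        Kept := Kept + W[i];
      Inc(i);
    end;
  end;
  Result := UTF8Encode(Kept);   // back to UTF-8
end;

begin
  // 'za', then U+017C written as its bytes C5 BC, then a dash and a digit.
  WriteLn(Length(KeepLetters('za'#$C5#$BC'-1')));  // byte length of what was kept
end.
```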
Specialize a type, not a var.

dbannon

Re: Strings and special characters removal
« Reply #21 on: February 16, 2019, 11:22:23 pm »
Are you suggesting that TCharacter has a lookup table that lists over a million UTF8 codes ? Or just those ones defined as being characters ?

Are CJK classed as characters for example ?

(I guess I need to look through the code)

garlar27

Re: Strings and special characters removal
« Reply #22 on: February 18, 2019, 05:13:20 pm »
I've been looking in units and functions related to UTF8 and there's nothing that can give info related to the char/codepoint, but the unicode unit does have that info, so it's as Thaddy said:
I would determine the UTF8 codepoint, pass it to UTF8Decode, so it becomes a single UTF16 UnicodeChar and subsequently call the TCharacter functions on it. That's a bit cumbersome, but would work.
Maybe you should test whether it is better to convert each UTF8 codepoint to unicode separately or to convert the whole string. I don't know which would be better.
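
For comparison, the per-codepoint variant might be sketched like this: each UTF-8 sequence is cut out and converted on its own. Utf8SeqLen and CountLetters are hypothetical names, and nothing here has been timed, so which variant is faster remains something to measure.

```pascal
program PerCodepointSketch;
{$mode objfpc}{$H+}

uses
  Character;  // provides TCharacter

// Byte length of the UTF-8 sequence that starts with lead byte B.
// Stray continuation bytes and invalid lead bytes count as 1 so the scan still advances.
function Utf8SeqLen(B: Byte): Integer;
begin
  case B of
    192..223: Result := 2;
    224..239: Result := 3;
    240..247: Result := 4;
  else
    Result := 1;
  end;
end;

// Counts the code points in AUtf8 that TCharacter classifies as letters.
function CountLetters(const AUtf8: RawByteString): Integer;
var
  i, n: Integer;
  W: UnicodeString;
begin
  Result := 0;
  i := 1;
  while i <= Length(AUtf8) do
  begin
    n := Utf8SeqLen(Ord(AUtf8[i]));
    W := UTF8Decode(Copy(AUtf8, i, n));  // one code point -> 1 or 2 UTF-16 units
    if (W <> '') and TCharacter.IsLetter(W, 1) then
      Inc(Result);
    Inc(i, n);
  end;
end;

begin
  // 'za', then U+017C written as its bytes C5 BC, then a dash and a digit.
  WriteLn(CountLetters('za'#$C5#$BC'-1'));
end.
```

This version allocates a small temporary string per code point, while converting the whole string allocates once but needs room for the full UTF-16 copy.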

furious programming

  • Hero Member
  • *****
  • Posts: 853
Re: Strings and special characters removal
« Reply #23 on: February 18, 2019, 06:30:56 pm »
I've been looking in units and functions related to UTF8 and there's nothing that can give info related to the char/codepoint […]

What about the following functions from LazUTF8 unit:

- UTF8CodepointSize,
- UTF8CodepointSizeFull,
- UTF8CodepointSizeFast,
- UTF8CharacterLength.

Usage is simple:

Code: Pascal
uses
  LazUTF8;

  procedure ShowCodepointSizes(const AString: String);
  var
    Codepoint: PChar = #0;
    Size: Integer;
  begin
    if AString <> '' then
      Codepoint := @AString[1];

    while Codepoint^ <> #0 do
    begin
      Size := UTF8CodepointSize(Codepoint);
      Codepoint += Size;

      Write(Size:2);
    end;
  end;

begin
  ShowCodepointSizes('zażółć gęślą jaźń');
end.

Output:

Code: Pascal
 1 1 2 2 2 2 1 1 2 2 1 2 1 1 1 2 2
« Last Edit: February 18, 2019, 06:32:59 pm by furious programming »
Lazarus 3.2 with FPC 3.2.2, Windows 10 — all 64-bit

Working solo on an arcade, action/adventure game in retro style (pixelart), programming the engine and shell from scratch, using Free Pascal and SDL. Release planned in 2026.

JLWest

  • Hero Member
  • *****
  • Posts: 1293
Re: Strings and special characters removal
« Reply #24 on: February 18, 2019, 08:32:56 pm »
I've been looking in units and functions related to UTF8 and there's nothing that can give info related to the char/codepoint […]

What about the following functions from LazUTF8 unit:

[... etc...]

I don't really understand this.
FPC 3.2.0, Lazarus IDE v2.0.4
 Windows 10 Pro 32-GB
 Intel i7 770K CPU 4.2GHz 32702MB Ram
GeForce GTX 1080 Graphics - 8 Gig
4.1 TB

lucamar

Re: Strings and special characters removal
« Reply #25 on: February 18, 2019, 08:58:57 pm »
I've been looking in units and functions related to UTF8 and there's nothing that can give info related to the char/codepoint […]

What about the following functions from LazUTF8 unit:
[... etc...]

I don't really understand this.

Never mind, JLWest, it has nothing to do with your question. As so often happens, the discussion has drifted slightly O.T.  :D

furious programming

Re: Strings and special characters removal
« Reply #26 on: February 18, 2019, 09:03:08 pm »
Yes, after all, I was quoting another user's words. This is just an example of obtaining the size of each codepoint in a given string and of iterating over them (there are ready-made functions for this in the LCL library).

I did not start O.T. :D


BTW: the function UTF8CodepointSizeFull is local to the LazUTF8 unit, so it cannot be called from outside it.
« Last Edit: February 18, 2019, 09:07:00 pm by furious programming »

 
