Author Topic: Parallel Code  (Read 14712 times)

Nitorami

  • Sr. Member
  • ****
  • Posts: 496
Re: Parallel Code
« Reply #15 on: February 16, 2018, 05:47:59 pm »
No, it has nothing to do with threading, it must be the div performance in the inner loop.
Odd enough, on my home PC, an AMD A4-5300, there is no difference between 32bit and 64bit.

rvk

  • Hero Member
  • *****
  • Posts: 6163
Re: Parallel Code
« Reply #16 on: February 16, 2018, 06:05:48 pm »
Delphi 10.2 Tokyo also doesn't show any difference in 32 bit and 64 bit.
So if it's that, FPC does something really inefficient in 64 bit.

But that wouldn't explain why your AMD A4-5300 doesn't show any difference.

dieselfan

  • New Member
  • *
  • Posts: 16
Re: Parallel Code
« Reply #17 on: February 19, 2018, 08:30:46 am »
So I installed 64 and 32 on Windows 10 Pro 64bit

Lazarus 1.8 32bit - all integers replaced with nativeuint
SingleThread took 31,454 seconds, highest prime was 3001134
MultiThread took 4,375 seconds, highest prime was 3001134

Lazarus 1.8 64bit - as per downloaded example
SingleThread took 32,907 seconds, highest prime was 3001134
MultiThread took 4,313 seconds, highest prime was 3001134
Both with FPC 3.0.4

What did I do wrong?
« Last Edit: February 19, 2018, 08:33:54 am by dieselfan »
AMD 1800X
Manjaro KDE
Rarely Windows 10 Pro

rvk

  • Hero Member
  • *****
  • Posts: 6163
Re: Parallel Code
« Reply #18 on: February 19, 2018, 09:00:35 am »
Quote
What did I do wrong?

Nothing. But I noticed you are also on AMD.

Could someone with a 4-core Intel processor test this on 32 and 64 bit?

I'll also try with 4 threads on 32 and 64 bit in a bit and see if it makes a difference.

(B.T.W. I think you should actually use 7 threads because the main thread is also a thread. Although that's put on a while loop with sleep.)

tetrastes

  • Sr. Member
  • ****
  • Posts: 481
Re: Parallel Code
« Reply #19 on: February 19, 2018, 09:54:09 am »
i7-3770 @ 3.40GHz, Windows 10 64-bit, Lazarus 1.8.0, FPC 3.0.4.

32 bit:
SingleThread took 12,718 seconds, highest prime was 3001134
MultiThread took 3,437 seconds, highest prime was 3001134

64 bit:
SingleThread took 44,860 seconds, highest prime was 3001134
MultiThread took 9,922 seconds, highest prime was 3001134

rvk

  • Hero Member
  • *****
  • Posts: 6163
Re: Parallel Code
« Reply #20 on: February 19, 2018, 10:13:45 am »
Quote
I'll also try with 4 threads on 32 and 64 bit in a bit and see if it makes a difference.
(B.T.W. I think you should actually use 7 threads because the main thread is also a thread. Although that's put on a while loop with sleep.)

Rats... We already established threads have nothing to do with it. The slowdown is also in SingleThread.

Quote
i7-3770 @ 3.40GHz, Windows 10 64-bit, Lazarus 1.8.0, FPC 3.0.4.
32 bit: SingleThread took 12,718 seconds, highest prime was 3001134
64 bit: SingleThread took 44,860 seconds, highest prime was 3001134

Thanks for testing. Now I know I'm not crazy :D

So apparently with an AMD processor there is no slowdown with DIV, MOD and/or Sqrt() between 32 and 64 bit. With an Intel processor there is.

Any (compiler) expert here who can explain this?

Leledumbo

  • Hero Member
  • *****
  • Posts: 8757
  • Programming + Glam Metal + Tae Kwon Do = Me
Re: Parallel Code
« Reply #21 on: February 19, 2018, 10:25:50 am »
i7-5500U CPU @ 2.40GHz, Linux 64-bit, FPC trunk last night

$ fpc -Pi386 -CX -XXs -O4 -OpCOREAVX2 -CpCOREAVX2 -CfAVX2 parallel_thread_example.pas
$ ./parallel_thread_example
Lazarus 1.8
SingleThread took 17.310 seconds, highest prime was 3001134
MultiThread took 10.412 seconds, highest prime was 3001134

$ fpc -CX -XXs -O4 -OpCOREAVX2 -CpCOREAVX2 -CfAVX2 parallel_thread_example.pas
$ ./parallel_thread_example
Lazarus 1.8
SingleThread took 54.481 seconds, highest prime was 3001134
MultiThread took 23.820 seconds, highest prime was 3001134

Pretty interesting to see that 32-bit beats 64-bit by over 3 times in single thread and is over twice as fast in multi thread. I additionally ran them under gprof to profile, but the result seems "broken". In 32-bit, most of the time (57%) is spent in TMyWorkerThread.Execute, while in 64-bit 99% of the time is spent in IsPrime. Where is IsPrime in 32-bit?

rvk

  • Hero Member
  • *****
  • Posts: 6163
Re: Parallel Code
« Reply #22 on: February 19, 2018, 10:28:22 am »
I made a little test with Sqrt(), MOD and DIV. We can rule out thread-problems because it also happens in singlethread.
Results seem to indicate Sqrt() is much, much, much slower on 64 bit. Maybe it's not optimized for 64 bit.

Quote
32 bit FPC trunk
Sqrt() took 9,593 seconds
DIV took 3,594 seconds
MOD took 3,438 seconds

64 bit FPC trunk
Sqrt() took 14,828 seconds
DIV took 3,719 seconds
MOD took 3,640 seconds

Code: Pascal
program project1;
uses SysUtils;
const
  cMax = MaxInt;
var
  lT: QWord;      // elapsed time in milliseconds
  R: ValReal;
  N: NativeInt;
  I: NativeInt;
begin
  Writeln('Running');

  // Sqrt() on an integer argument (includes an implicit int-to-float conversion)
  lT := GetTickCount64;
  for I := 1 to cMax do
  begin
    R := Sqrt(I);
    R := R + 1.0;
  end;
  lT := GetTickCount64 - lT;
  WriteLn(Format('Sqrt() took %.3f seconds', [lT / 1000]));

  // DIV; note that 30 DIV 5 is a constant expression, so the compiler may fold it at compile time
  lT := GetTickCount64;
  for I := 1 to cMax do
  begin
    N := 30 DIV 5;
    N := N + 1;
  end;
  lT := GetTickCount64 - lT;
  WriteLn(Format('DIV took %.3f seconds', [lT / 1000]));

  // MOD; the same caveat about the constant expression applies
  lT := GetTickCount64;
  for I := 1 to cMax do
  begin
    N := 30 MOD 5;
    N := N + 1;
  end;
  lT := GetTickCount64 - lT;
  WriteLn(Format('MOD took %.3f seconds', [lT / 1000]));

  Writeln('Press enter key');
  readln;
end.
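
Since 30 DIV 5 and 30 MOD 5 are constant expressions, the DIV and MOD loops above may not perform any division at run time. Below is a minimal sketch of a variant with run-time operands. The ParamCount trick for the divisor, the xor accumulator and the reduced loop count are assumptions made for illustration and are not part of the test that produced the numbers above.

```pascal
program divmodsketch;
{$mode objfpc}
uses SysUtils;
const
  cMax = 200000000; // reduced loop count (assumption); raise it for longer runs
var
  lT: QWord;
  I, D, Acc: NativeInt;
begin
  // ParamCount is only known at run time, so D cannot be folded into a constant.
  D := 5 + ParamCount;

  Acc := 0;
  lT := GetTickCount64;
  for I := 1 to cMax do
    Acc := Acc xor (I div D); // xor keeps the result in use without overflowing
  lT := GetTickCount64 - lT;
  WriteLn(Format('DIV took %.3f seconds (check value %d)', [lT / 1000, Acc]));

  Acc := 0;
  lT := GetTickCount64;
  for I := 1 to cMax do
    Acc := Acc xor (I mod D);
  lT := GetTickCount64 - lT;
  WriteLn(Format('MOD took %.3f seconds (check value %d)', [lT / 1000, Acc]));
end.
```

Printing the accumulated check value is only there so the loop results are actually used.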

Thaddy

  • Hero Member
  • *****
  • Posts: 14373
  • Sensorship about opinions does not belong here.
Re: Parallel Code
« Reply #23 on: February 19, 2018, 10:46:09 am »
Humm.. How is MaxInt defined on 64 bit systems? Does it use integer32 or nativeint64? That would explain a lot. Can you try High(Longint)?
« Last Edit: February 19, 2018, 10:48:33 am by Thaddy »
Object Pascal programmers should get rid of their "component fetish" especially with the non-visuals.

rvk

  • Hero Member
  • *****
  • Posts: 6163
Re: Parallel Code
« Reply #24 on: February 19, 2018, 10:53:39 am »
Quote
Humm.. How is maxint defined on 64 bit systems? Does it use integer32 or nativeint64? That would explain a lot?

MaxInt is defined as maxLongint on both 32 and 64 bit (maxLongint = $7fffffff).

(using cMax = 2147483647; gives the same results)

(I have a feeling it might be that the compiler uses FSQRT for both 32 and 64 bit while there are more efficient versions for 64 bit, but I'm a bit out of my depth there.)

Nitorami

  • Sr. Member
  • ****
  • Posts: 496
Re: Parallel Code
« Reply #25 on: February 19, 2018, 11:21:37 am »
Be careful, maxint depends on the mode and is 32767 in {$mode fpc}. Better set cmax to an absolute value.

As a side note, I seem to remember we should not rely on GetTickCount for absolute timing. Use "Now" instead. But this is not the issue in the matter at hand.
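
A minimal sketch of what timing with Now and an absolute loop bound might look like. The use of MilliSecondsBetween from DateUtils and the program name are assumptions for illustration; the loop body is the float-only Sqrt variant discussed in this post.

```pascal
program timingsketch;
{$mode objfpc}
uses SysUtils, DateUtils;
const
  cMax = 2147483647; // absolute value, so it does not depend on the compiler mode
var
  Start: TDateTime;
  R, S: ValReal;
  I: LongInt;
begin
  S := 0;
  R := 0;
  Start := Now;
  for I := 1 to cMax do
  begin
    R := Sqrt(S);   // float argument, no int-to-float conversion in the loop
    S := S + 1.0;
  end;
  // MilliSecondsBetween returns whole milliseconds between the two timestamps.
  WriteLn(Format('Sqrt() took %.3f seconds (last R = %.1f)',
    [MilliSecondsBetween(Now, Start) / 1000, R]));
end.
```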

The conclusion that sqrt is slower on 64 bit is a bit premature; when we remove the implicit integer to float conversion in R := sqrt(I) and replace it by something like

var S: ValReal = 0;
for I := 1 to cMax do
begin
  R := Sqrt(S);
  S := S + 1.0;
end;         

the 64-bit-performance is actually 15% better than for 32bit on my PC.

For the original prime algorithm, the sqrt does not matter, because it is only called once in IsPrime. Measurements with the original code on an i5-3320 2.60GHz (4 cores), Win 7, FPC 3.0.4:

64bit: 45 sec / 15.8 sec
32bit: 20 sec / 8.8 sec
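
To illustrate why a single Sqrt call per number is negligible next to the division loop, here is a hypothetical trial-division IsPrime. The original example's IsPrime is not shown in this thread, so the function body, the NativeUInt type and the test values below are assumptions, not the code that was measured.

```pascal
program isprimesketch;
{$mode objfpc}

// Hypothetical trial-division test (assumption: not the thread's actual IsPrime).
function IsPrime(N: NativeUInt): Boolean;
var
  Limit, D: NativeUInt;
  Root: Double;
begin
  if N < 2 then Exit(False);
  if N < 4 then Exit(True);            // 2 and 3 are prime
  if (N mod 2) = 0 then Exit(False);
  Root := N;
  Limit := Trunc(Sqrt(Root));          // the only Sqrt call per tested number
  D := 3;
  while D <= Limit do
  begin
    if (N mod D) = 0 then Exit(False); // one integer division per iteration
    Inc(D, 2);
  end;
  Result := True;
end;

begin
  WriteLn('97 prime? ', IsPrime(97));  // 97 is prime
  WriteLn('91 prime? ', IsPrime(91));  // 91 = 7 * 13 is not
end.
```

In a loop of this shape, the cost per tested number is dominated by the mod in the while loop, so any 32 vs 64 bit difference in integer division would show up far more strongly than a difference in Sqrt.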

rvk

  • Hero Member
  • *****
  • Posts: 6163
Re: Parallel Code
« Reply #26 on: February 19, 2018, 11:38:46 am »
Quote
The conclusion that sqrt is slower on 64 bit is a bit premature; when we remove the implicit integer to float conversion in R := sqrt(I) and replace it by something like
var S: ValReal = 0;

Good point.

Quote
Running to 2147483647
32 bit: Sqrt() took 9,766 seconds
64 bit: Sqrt() took 11,156 seconds

For me 64 bit is still slower, but I see that the int to float conversion does impact the results.

ASerge

  • Hero Member
  • *****
  • Posts: 2242
Re: Parallel Code
« Reply #27 on: February 19, 2018, 11:52:11 am »
Quote
32 bit FPC trunk
Sqrt() took 9,593 seconds
DIV took 3,594 seconds
MOD took 3,438 seconds

64 bit FPC trunk
Sqrt() took 14,828 seconds
DIV took 3,719 seconds
MOD took 3,640 seconds

Intel i5 (4 core). Windows x64.
Quote
32 bit FPC 3.0.4
Sqrt() took 17.534 seconds
DIV took 1.451 seconds
MOD took 1.451 seconds

64 bit FPC 3.0.4
Sqrt() took 21.044 seconds
DIV took 0.968 seconds
MOD took 0.967 seconds

64 bit FPC trunk 3.1.1
Sqrt() took 17.457 seconds
DIV took 0.967 seconds
MOD took 0.733 seconds

FPC 3.0.4 asm
Code: ASM
# [17] R := Sqrt(I);
cvtsi2sd  %rdi,%xmm0
sqrtsd  %xmm0,%xmm0
movapd  %xmm0,%xmm6

FPC 3.1.1 asm
Code: ASM
# [17] R := Sqrt(I);
cvtsi2sdq  %rdi,%xmm0  # with ...q
sqrtsd  %xmm0,%xmm0
movapd  %xmm0,%xmm6

By the way, fpc trunk 3.1.1 inserts "call FPC_OVERFLOW", ignoring {$OVERFLOWCHECKS OFF}, but it is still faster anyway.

rvk

  • Hero Member
  • *****
  • Posts: 6163
Re: Parallel Code
« Reply #28 on: February 19, 2018, 12:00:40 pm »
Quote
Intel i5 (4 core). Windows x64.
Quote
32 bit FPC 3.0.4
Sqrt() took 17.534 seconds
DIV took 1.451 seconds
MOD took 1.451 seconds

64 bit FPC 3.0.4
Sqrt() took 21.044 seconds
DIV took 0.968 seconds
MOD took 0.967 seconds

64 bit FPC trunk 3.1.1
Sqrt() took 17.457 seconds
DIV took 0.967 seconds
MOD took 0.733 seconds
...
By the way, fpc trunk 3.1.1 inserts "call FPC_OVERFLOW", ignoring {$OVERFLOWCHECKS OFF}, but it is still faster anyway.

So, do you also happen to have the test for 32 bit FPC trunk 3.1.1? I did both tests on trunk and 32 bit seems to be consistently faster than 64 bit.

(But seeing that your MOD and DIVs are also faster in 64 bit I'm starting to doubt my own 64 bit compile. I'll try this on 1.8 32/64b. I only originally compared 64 bit Laz1.8 with my 32 bit trunk compile)


rvk

  • Hero Member
  • *****
  • Posts: 6163
Re: Parallel Code
« Reply #29 on: February 19, 2018, 12:30:45 pm »
Quote
(But seeing that your MOD and DIVs are also faster in 64 bit I'm starting to doubt my own 64 bit compile. I'll try this on 1.8 32/64b. I only originally compared 64 bit Laz1.8 with my 32 bit trunk compile)

Ok, there seems to be something terribly wrong with my own compile process.
(Although Leledumbo had the same problem with trunk, and tetrastes with Laz1.8, in combination with the original code)

These were my latest results:
Quote
Running to 2147483647

64 bit Laz1.8 (download): Sqrt() took 7,806 milliseconds
32 bit Laz1.8 (download): Sqrt() took 9,750 milliseconds

64 bit Laz1.9 trunk (compile): Sqrt() took 11,170 milliseconds (yikes)
32 bit Laz1.9 trunk (compile): Sqrt() took 9,760 milliseconds

(Need to figure out why my own compile of 64 bit is not optimized)

But the thread-demo still shows 64 bit slowdown on the original code in singlethread (even on standard downloaded laz1.8 64bit). So it's not the sqrt() but something else.
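
One way to test the earlier suspicion that the division in the inner loop is responsible would be to time the same mod loop with 32-bit and 64-bit operands inside one build. The sketch below does that. The program name, loop count, starting values and the ParamCount trick are assumptions for illustration, and whether the compiler really emits a 32-bit division for the LongWord case is itself an assumption that should be checked in the generated assembly (for example with fpc -al).

```pascal
program modwidthsketch;
{$mode objfpc}
uses SysUtils;
const
  cCount = 200000000; // loop count chosen arbitrarily for this sketch
var
  lT: QWord;
  I: LongInt;
  N32, D32, A32: LongWord;
  N64, D64, A64: QWord;
begin
  // Divisor derived from a run-time value so it cannot be folded into a constant.
  D32 := LongWord(3 + ParamCount);
  D64 := D32;
  N32 := 3000000;
  N64 := 3000000;
  A32 := 0;
  A64 := 0;

  // Same loop shape twice; only the operand width differs.
  lT := GetTickCount64;
  for I := 1 to cCount do
  begin
    A32 := A32 xor (N32 mod D32);
    Inc(N32);
  end;
  lT := GetTickCount64 - lT;
  WriteLn('32-bit operands: ', (lT / 1000):0:3, ' seconds (check value ', A32, ')');

  lT := GetTickCount64;
  for I := 1 to cCount do
  begin
    A64 := A64 xor (N64 mod D64);
    Inc(N64);
  end;
  lT := GetTickCount64 - lT;
  WriteLn('64-bit operands: ', (lT / 1000):0:3, ' seconds (check value ', A64, ')');
end.
```

Both loops should print the same check value; if the 64-bit loop is clearly slower on the Intel machines but not on the AMD ones, that would be consistent with the pattern reported in this thread.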

 
