Coprocessor slows down the code?

Started by BBalazs, November 22, 2006, 03:32:27 PM


BBalazs

Hi all,

At a friend's request, I have made an assembly insert that multiplies two arrays of real numbers (qword type), adds up the products and stores the result.

It was just an experiment to see whether he could speed up his original VC++ program with my assembly insert.

The code below needs as much as 32 secs under XP on my 3GHz computer (at 50% processor load).
(Please ignore the comments; they were originally written in Hungarian.)

He said his VC++ code needs approx. 135 ms!

So I am confused about the real working speed of the coprocessor. I thought it should be close to the core speed.
Just ONE 'FLD mem' instruction slows the code down to 1/100th of the original speed.

Thanks in advance
Balazs


.386
.model flat,stdcall
option casemap:none
include \masm32\include\windows.inc
include \masm32\include\user32.inc
include \masm32\include\kernel32.inc
includelib \masm32\lib\user32.lib
includelib \masm32\lib\kernel32.lib

; floating-point test

.data

elsodim dq  -1.34     ;these are the arrays; it is VERY important to write a decimal point
dq 45.1                  ;even after whole numbers! I struggled with this for an hour!
dq 4.67                  ;just sample numbers; the final result is 9.408, which I multiply by a thousand
dq 0.01
dq -3.5

masodim dq 2.0
dq 0.2
dq -5.1
dq 200.0
dq -7.11

osszeg dq 0

ezer dd 1000
egeszosszeg dd 0
ciklusvaltozo dd 0

keszvan db 'A progi vegzett',0

.code
start:
FINIT

mov ciklusvaltozo,0
nagyciklus:

mov ecx,5 ;number of elements
mov edx,0 ;starting element
ciklusfej:
mov eax,edx ;copy the index to eax so that edx is preserved; the calculation is done here
shl eax,3 ;x8, because the elements are qwords
mov esi,eax ;move it into the pointer register

FLD [elsodim+esi] ;push the first value onto the stack
FLD [masodim+esi] ;push the second value; the previous one moves down
FMUL ;multiply the two values on top of the stack; the result ends up on top
FLD osszeg ;push the previous sum; the product moves down
FADD ;add the two values on top of the stack; the result ends up on top
FSTP osszeg ;pop the top of the stack and store it in osszeg as a real number (a qword here, not a tenbyte)
inc edx ;next element
loop ciklusfej ;if ecx is not 0, go back to ciklusfej

FLD osszeg ;push the sum onto the stack
FIMUL ezer ;multiply by a thousand so that the decimals show up too
FIST egeszosszeg ;store it as an integer in a dword

inc ciklusvaltozo
cmp ciklusvaltozo,10000000
jne nagyciklus

invoke MessageBox, NULL, addr keszvan, addr keszvan, MB_OK ;in a small window
invoke ExitProcess, eax ;exit the program

end start


raymond

I tried your code as posted and my timing was 21 secs. I then changed one single instruction and the time dropped to 3 secs. Your "FIST egeszosszeg" should have been FISTP. If you had looked at the answer from your original code, you would have noticed that it was garbage. FIST stores the value without popping it, so one value is left behind on the FPU stack on every pass of the outer loop. The FPU registers got fully loaded after the first 7 cycles out of the 10,000,000. Trying to load data into a non-free register triggers an internal exception handling procedure which then got repeated over 100,000,000 times.
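The stack bookkeeping behind that explanation can be checked with a small model. This is only a sketch in Python, not part of the original program: it tracks how many of the eight x87 registers are in use after each instruction of the posted code, and ignores values and timing entirely. The names `first_overflow`, `INNER_BODY` and `tail` are made up for this illustration.

```python
# Hypothetical model: tracks only the x87 register-stack depth, not values or timing.
STACK_SIZE = 8  # the x87 FPU has eight stack registers, ST(0)..ST(7)

# Net stack effect of each instruction in the inner loop of the posted code.
# MASM assembles FMUL and FADD without operands as the popping forms.
INNER_BODY = [
    ("FLD", +1), ("FLD", +1), ("FMUL", -1),
    ("FLD", +1), ("FADD", -1), ("FSTP", -1),
]


def tail(store):
    """Instructions after the inner loop; FIST does not pop, FISTP does."""
    return [("FLD", +1), ("FIMUL", 0), (store, -1 if store == "FISTP" else 0)]


def first_overflow(store, outer_iterations=20, inner_iterations=5):
    """Return the 1-based outer pass in which a load first hits a full stack, else None."""
    depth = 0
    for outer in range(1, outer_iterations + 1):
        for _name, effect in INNER_BODY * inner_iterations + tail(store):
            if effect > 0 and depth == STACK_SIZE:
                return outer
            depth += effect
    return None


if __name__ == "__main__":
    print("FIST :", first_overflow("FIST"))
    print("FISTP:", first_overflow("FISTP"))
```

With FIST the model reports the first failing load in pass 8, that is, once seven passes have each left one value behind. With FISTP the stack is empty again at the end of every pass and no load ever fails.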

A slight modification of your code then got the time down to about 0.5 sec on my P3-550. It is thus possible that the time could be reduced to 135 ms on a very fast computer, even in VC++. Try the following.

.code
start:
      FINIT

;      mov ciklusvaltozo,0    ;ALREADY INITIALIZED TO ZERO
nagyciklus:

      mov ecx,5     ;number of elements
      mov edx,0     ;starting element
      FLD osszeg     ;MOVE IT HERE TO AVOID LOADING AND STORING IT EVERYTIME
ciklusfej:
;      mov eax,edx      ;copy the index to eax so that edx is preserved; the calculation is done here
;      shl eax,3        ;x8, because the elements are qwords
;      mov esi,eax      ;move it into the pointer register

      FLD [elsodim+edx*8]    ;ALLOWED SYNTAX IN MASM
;      FLD [masodim+esi]     ;push the second value; the previous one moves down
      FMUL [masodim+edx*8]
      FADD          ;add the two values on top of the stack; the result ends up on top
;      FSTP osszeg      ;pop the top of the stack and store it in osszeg as a real number
      inc edx           ;next element
      DEC ECX
      JNZ ciklusfej
;      loop ciklusfej      ;if ecx is not 0, go back to ciklusfej
;THE LOOP INSTRUCTION IS VERY SLOW

;      FLD osszeg      ;push the sum onto the stack
      FIMUL ezer      ;multiply by a thousand so that the decimals show up too
      FISTP egeszosszeg     ;store it as an integer in a dword

      inc ciklusvaltozo
      cmp ciklusvaltozo,10000000
      jne nagyciklus

      invoke MessageBox, NULL, addr keszvan, addr keszvan, MB_OK     ;in a small window
      invoke ExitProcess, eax     ;exit the program

end start
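To check what the revised loop should leave in egeszosszeg, the same arithmetic can be redone outside the FPU. The following is only a reference sketch in Python, with the array values copied from the .data section of the original post; the function name `expected_egeszosszeg` is made up for this illustration. It assumes the revised code, where osszeg is never written back and so every outer pass starts from zero. Python's `round` rounds halves to even, which matches the default x87 rounding mode, and the small difference between double and extended precision should not change the rounded result here.

```python
# Hypothetical reference check; values copied from the .data section in the thread.
elsodim = [-1.34, 45.1, 4.67, 0.01, -3.5]
masodim = [2.0, 0.2, -5.1, 200.0, -7.11]
ezer = 1000


def expected_egeszosszeg():
    """Sum of the element-wise products, scaled by ezer and rounded to an integer."""
    osszeg = 0.0  # the revised loop loads osszeg (still zero) once per outer pass
    for a, b in zip(elsodim, masodim):
        osszeg += a * b  # FLD a / FMUL b / FADD
    return round(osszeg * ezer)  # FIMUL ezer / FISTP egeszosszeg


if __name__ == "__main__":
    print(expected_egeszosszeg())  # prints 9408
```

The value 9408 agrees with the 9.408 mentioned in the comments of the original data section.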


Raymond
When you assume something, you risk being wrong half the time
http://www.ray.masmcode.com

BBalazs

Hi, Raymond

Thank you for your answer, comments and correction.
I have to apologize for my minimal knowledge of the coprocessor.
I had no idea about this exception handling. It explains the slowdown.



raymond

Hi BBalazs

In case you are not already aware of it, you can certainly gain a bit more knowledge about the FPU from the following:

http://www.ray.masmcode.com/fpu.html

Raymond

BBalazs

Raymond,

I highly appreciate your help. I have downloaded your tutorial and I will also pass the link on to my friend; perhaps he will be infected with this 'assembly-amok' too  :bg
Thank you again. :U

Balazs

raymond

I certainly appreciate your enthusiasm to expand your knowledge and also sharing it with others. Don't be shy to ask for clarifications if anything is not sufficiently described.

Raymond

BBalazs

It is also impressive to get an answer within such a short time. You, Hutch and Donkey (and many others, of course) are unfailing sources of knowledge. I think your forum is a warm and nice company and a store of pure wisdom without any self-importance.
Thank you for being here.  :U

Mark_Larson

Quote from: BBalazs on November 22, 2006, 03:32:27 PM

loop ciklusfej ;if ecx is not 0, go back to ciklusfej


  Using 'loop' is very slow.  Change it to "sub ecx,1", "jnz ciklusfej"


   sub     ecx,1
   jnz      ciklusfej

BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
http://www.website.masmforum.com/mark/index.htm

raymond

Quote: It is also impressive to get an answer within such a short time.

However, do not expect to always get an answer within the same time frame. For example, if you had made your original post a few hours later, I may not have seen it until some 24 hours later. Or, if you had made it last week, I was on the road for several days without access to my computer.

And the same applies to other responders. Nobody is around 24 hours a day. For less specialized requests, you may expect to get numerous answers from several responders within the next 24 hours.

Never despair. Someone will be around to help.

Raymond



gabor

Hi Balázs!

Maybe they won't kick me off the forum for greeting you in Hungarian! :)
- That was just a greeting, originally in Hungarian. I am really glad to see a fellow Hungarian here  :bg
Well, I am not really sure, but isn't the FPU slower than the SSE or SSE2 unit of the CPU? Mark, Raymond, what do you know about this?

Greets, Gabor

Draakie

Just my 2 cents... regarding gabor's question, which he knows the answer to full well but asks
rhetorically anyway  :P...

YES - SSE and SSE2 are faster - BUT they won't help our novice coder yet.....

Hungarian hmmm ..... makes me hungry too... :lol

Welcome !

Ta
Draakie




Does this code make me look bloated ? (wink)

BBalazs

Hi, Gabor

I have sent a private message to you...