m32lib GetPercent need a correction

Started by ToutEnMasm, November 07, 2010, 06:54:01 AM

hutch--

I wrote this years ago to plug a simple requirement, things like integer sizing for screen display, and it has never been a performance issue where it was normally used. I have no doubt that it can be optimised, but in the context of its use it's a "who cares" issue.

Now I understand what Yves has said but the procedure has always handled high call counts with no problems at all. Here is a test piece that calls it in a loop 1 million times.


IF 0  ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
                      Build this template with "CONSOLE ASSEMBLE AND LINK"
ENDIF ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    include \masm32\include\masm32rt.inc

    .code

start:
   
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    call main
    inkey
    exit

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

main proc

    push esi

    mov esi, 1000000

  lbl0:
    print str$(rv(GetPercent,esi,50)),13,10
    sub esi, 1
    jnz lbl0

    pop esi

    ret

main endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

end start
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

hutch--

 :bg

JJ,


Intel(R) Core(TM)2 Quad CPU    Q9650  @ 3.00GHz (SSE4)
14      cycles for GetPercent
6       cycles for GetPercentJJ1
5       cycles for GetPercentJJ2

16      cycles for GetPercent
6       cycles for GetPercentJJ1
5       cycles for GetPercentJJ2

16      cycles for GetPercent
6       cycles for GetPercentJJ1
5       cycles for GetPercentJJ2

Code sizes:
32      for GetPercentJJ1
32      for GetPercentJJ2

--- ok ---


Removing the FDIV certainly improves the time.  :P

jj2007

Quote from: hutch-- on November 08, 2010, 12:14:55 AM
Removing the FDIV certainly improves the time.  :P

Certainly. SSE speeds it up once more, but I ran into a problem with the mulsd instruction: It is fast if the xmm register is already in float format, but if not, it costs over 200 cycles! The first version below works fine but needs lots of conversions.

GetPercentSSE_s:
GPJ005 REAL8 0.01
GetPercentSSE proc source:DWORD, percent:DWORD
cvtsi2sd xmm0, dword ptr [esp+4] ; source
cvtsi2sd xmm1, dword ptr [esp+8] ; percent
mulsd xmm0, xmm1
mulsd xmm0, GPJ005 ; multiply with 0.01, i.e. divide source by 100
cvtsd2si eax, xmm0
ret 8 ; 11 cycles
GetPercentSSE endp
GetPercentSSE_e:

GetPercentSSEs_s:
GPJ005s REAL8 0.01
GetPercentSSEs proc source:DWORD, percent:DWORD
; xorps xmm0, xmm0
; movaps xmm1, xmm0 ; no effect
movd xmm0, dword ptr [esp+4] ; source
movd xmm1, dword ptr [esp+8] ; percent
; int 3 ; OPT_Olly 2
pmuludq xmm0, xmm1 ; source * percent, OK (pm xmm0, qword ptr [esp+8] possible but slow)
MakeSlow = 1
if MakeSlow ; first branch: result is correct but >200 cycles
mulsd xmm0, GPJ005s ; multiply with 0.01, i.e. divide source by 100 - SLOOOOOOW ###
movd eax, xmm0
else ; second branch: result correct, 27 cycles
cvtdq2pd xmm0, xmm0 ; convert integer to float (Convert  Packed Doubleword Integers to Packed Double-Precision Floating-Point Values)
mulsd xmm0, GPJ005s ; multiply with 0.01, i.e. divide source by 100 - fast
cvtsd2si eax, xmm0
endif
ret 8 ; 259 or 348 cycles
GetPercentSSEs endp
GetPercentSSEs_e:
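
A possible explanation for the slow branch, offered as an assumption since the thread does not spell it out: after pmuludq the register holds a plain integer, and a small integer bit pattern read as a REAL8 is a denormal. Multiplying a denormal by 0.01 rounds to the nearest denormal, whose bit pattern is again an integer, namely the product divided by 100 and rounded. That would explain why the result is still correct, while many CPUs handle denormal operands through a slow assist path. A minimal console sketch of the effect follows; the input 678975 (12345 * 55) and the name OneHundredth are made up for illustration, and it needs an assembler that knows SSE2:

```masm
    include \masm32\include\masm32rt.inc
    .686
    .xmm

    .data
    OneHundredth REAL8 0.01   ; hypothetical name, same value as GPJ005s

    .code

start:

    mov eax, 678975           ; stands in for source*percent (12345 * 55)
    movd xmm0, eax            ; raw integer bits; read as a REAL8 this is a denormal
    mulsd xmm0, OneHundredth  ; denormal * 0.01, rounded to the nearest denormal
    movd eax, xmm0            ; low dword of that denormal = product/100, rounded
    print str$(eax),13,10
    inkey
    exit

end start
```

With the default MXCSR settings (round to nearest, no flush-to-zero) this should print 6790, i.e. 6789.75 rounded.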


New testbed attached, now with more realistic cycle counts (there is a REPEAT 64 ... ENDM followed by shr eax, 6).
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
30      cycles for GetPercent
14      cycles for GetPercentJJ2
11      cycles for GetPercentSSE
263     cycles for GetPercentSSEs

Code sizes:
36      bytes for GetPercent, result=6790123
32      bytes for GetPercentJJ2, result=6790123
39      bytes for GetPercentSSE, result=6790123
39      bytes for GetPercentSSEs, result=6790123

Antariy


Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
44      cycles for GetPercent
21      cycles for GetPercentJJ2
22      cycles for GetPercentSSE
1104    cycles for GetPercentSSEs

Code sizes:
36      bytes for GetPercent, result=6790123
32      bytes for GetPercentJJ2, result=6790123
39      bytes for GetPercentSSE, result=6790123
39      bytes for GetPercentSSEs, result=6790123


jj2007

Thanks, Alex. So the ordinary FPU version is one cycle faster than the fast SSE2 version... and a whopping 1100 cycles for the bad SSE stuff ::)

hutch--

JJ,

Inspired by your reciprocal multiply, this one has about 40% legs on the old version. I changed the calculation order as well.


; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

GetPercent2 proc source:DWORD, percent:DWORD

    fild DWORD PTR [esp+8]  ; load percent
    fld10 0.01              ; load reciprocal of 100
    fmul                    ; mul by reciprocal = div by 100
    fild DWORD PTR [esp+4]  ; load the source
    fmul                    ; multiply by previous result
    fistp DWORD PTR [esp+8] ; pop FP stack and store result in stack variable
    mov eax, [esp+8]        ; write result to EAX for return value
    ret 8

GetPercent2 endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

jj2007

Great, so now we are waiting for Lingo :bg
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
11      cycles for GetPercentSSE
36      cycles for GetPercent
13      cycles for GetPercent2c
14      cycles for GetPercent2nc
14      cycles for GetPercentJJ2

11      cycles for GetPercentSSE
36      cycles for GetPercent
14      cycles for GetPercent2c
14      cycles for GetPercent2nc
14      cycles for GetPercentJJ2

11      cycles for GetPercentSSE
36      cycles for GetPercent
13      cycles for GetPercent2c
14      cycles for GetPercent2nc
14      cycles for GetPercentJJ2

Code sizes:
39      bytes for GetPercentSSE, result=6790123
36      bytes for GetPercent, result=6790123
41      bytes for GetPercent2c, result=6790123
45      bytes for GetPercent2nc, result=6790123
32      bytes for GetPercentJJ2, result=6790123


P.S.: GetPercent2c and GetPercent2nc are two variants of Hutch's new algo. I like the JJ2 variant: it fits into 2 paras and is reasonably fast.

Edit: Prescott P4:
20      cycles for GetPercentSSE
47      cycles for GetPercent
24      cycles for GetPercent2c
26      cycles for GetPercent2nc
20      cycles for GetPercentJJ2

hutch--


Intel(R) Core(TM)2 Quad CPU    Q9650  @ 3.00GHz (SSE4)
14      cycles for GetPercentSSE
21      cycles for GetPercent
8       cycles for GetPercent2c
9       cycles for GetPercent2nc
10      cycles for GetPercentJJ2

13      cycles for GetPercentSSE
21      cycles for GetPercent
8       cycles for GetPercent2c
9       cycles for GetPercent2nc
10      cycles for GetPercentJJ2

13      cycles for GetPercentSSE
21      cycles for GetPercent
8       cycles for GetPercent2c
9       cycles for GetPercent2nc
10      cycles for GetPercentJJ2

Code sizes:
39      bytes for GetPercentSSE, result=6790123
36      bytes for GetPercent, result=6790123
41      bytes for GetPercent2c, result=6790123
45      bytes for GetPercent2nc, result=6790123
32      bytes for GetPercentJJ2, result=6790123

--- ok ---


ToutEnMasm

Quote
Intel(R) Celeron(R) CPU 2.80GHz (SSE3)
21      cycles for GetPercentSSE
47      cycles for GetPercent
24      cycles for GetPercent2c
26      cycles for GetPercent2nc
21      cycles for GetPercentJJ2

21      cycles for GetPercentSSE
47      cycles for GetPercent
24      cycles for GetPercent2c
26      cycles for GetPercent2nc
20      cycles for GetPercentJJ2

21      cycles for GetPercentSSE
48      cycles for GetPercent
24      cycles for GetPercent2c
26      cycles for GetPercent2nc
21      cycles for GetPercentJJ2

Code sizes:
39      bytes for GetPercentSSE, result=6790123
36      bytes for GetPercent, result=6790123
41      bytes for GetPercent2c, result=6790123
45      bytes for GetPercent2nc, result=6790123
32      bytes for GetPercentJJ2, result=6790123

--- ok ---

Antariy

Here is my integer version of GetPercent. In my tests it is fast and small.


OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
AxGetPercentInt proc source:DWORD, percent:DWORD

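mov eax,[esp+4]             ; eax = source
imul edx,[esp+8],28F5C29h   ; edx = percent * 28F5C29h (2^32/100, rounded up)
jl @F                       ; per the note below, percent >= 100 exits here with eax = source

mul edx                     ; edx:eax = source * edx; edx = high dword, about source*percent/100
mov eax,edx                 ; eax = result before correction
shr edx,32-2                ; edx = top two bits of that result (0..3)
sub eax,edx                 ; subtract them: the correction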
@@:
ret 8

AxGetPercentInt endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef


The code is intended for calculating percentages in the range 0-100. If percent is greater than or equal to 100, the source value is returned.
The speed of the code does not depend on the value of the source number or of the percent.
An unusual correction is used.
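
The constant 28F5C29h is 42949673, which is 2^32/100 = 42949672.96 rounded up, so the high dword of source * (percent * 28F5C29h) approximates source*percent/100 without a division. A minimal sketch of just that multiply, leaving out the jl test and the final correction; the inputs 12345678 and 55 are chosen here for illustration only:

```masm
    include \masm32\include\masm32rt.inc

    .code

start:

    mov eax, 12345678         ; hypothetical source
    mov ecx, 55               ; hypothetical percent
    imul ecx, ecx, 28F5C29h   ; ecx = percent * ceil(2^32/100); fits in 32 bits for percent < 100
    mul ecx                   ; edx:eax = source * ecx, so edx = floor(product / 2^32)
    print str$(edx),13,10     ; high dword = source*percent/100, truncated
    inkey
    exit

end start
```

This should print 6790122, since 12345678 * 55 / 100 = 6790122.9 and the high dword truncates. The FPU and SSE versions round to nearest instead, which would account for a 6790123 versus 6790122 difference on inputs like these.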

If the algo will always be used with percents less than 100, the first 4 lines of the code can be replaced with:


imul eax,[esp+8],28F5C29h
jl @F

mul dword ptr [esp+4]


The timings are then 2 clocks faster.


For testing I used Jochen's latest testbed, but I have fixed a flaw in the ChkPrecision code, which now calls the right FPU calculation code.

Here are my timings:


Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
24      cycles for GetPercentSSE
47      cycles for GetPercent
23      cycles for GetPercent2c
26      cycles for GetPercent2nc
24      cycles for GetPercentJJ1
20      cycles for GetPercentJJ2
16      cycles for AxGetPercentInt

24      cycles for GetPercentSSE
46      cycles for GetPercent
23      cycles for GetPercent2c
26      cycles for GetPercent2nc
24      cycles for GetPercentJJ1
20      cycles for GetPercentJJ2
15      cycles for AxGetPercentInt

Code sizes:
39      bytes for GetPercentSSE, result=6790123
36      bytes for GetPercent, result=-2147483648
41      bytes for GetPercent2c, result=-2147483648
45      bytes for GetPercent2nc, result=6790123
37      bytes for GetPercentJJ1, result=6790123
32      bytes for GetPercentJJ2, result=6790123
26      bytes for AxGetPercentInt, result=6790122


I am asking for testing. Thanks!



Alex

ToutEnMasm

Quote
Intel(R) Celeron(R) CPU 2.80GHz (SSE3)
24      cycles for GetPercentSSE
47      cycles for GetPercent
24      cycles for GetPercent2c
26      cycles for GetPercent2nc
24      cycles for GetPercentJJ1
21      cycles for GetPercentJJ2
16      cycles for AxGetPercentInt

24      cycles for GetPercentSSE
47      cycles for GetPercent
24      cycles for GetPercent2c
26      cycles for GetPercent2nc
24      cycles for GetPercentJJ1
21      cycles for GetPercentJJ2
16      cycles for AxGetPercentInt

Code sizes:
39      bytes for GetPercentSSE, result=6790123
36      bytes for GetPercent, result=-2147483648
41      bytes for GetPercent2c, result=-2147483648
45      bytes for GetPercent2nc, result=6790123
37      bytes for GetPercentJJ1, result=6790123
32      bytes for GetPercentJJ2, result=6790123
26      bytes for AxGetPercentInt, result=6790122

--- ok ---

hutch--

Alex,

I get a GP fault out of your last zip on xp sp2.


Intel(R) Core(TM)2 Quad CPU    Q9650  @ 3.00GHz (SSE4)
-2125943931 for 7FFF0000/1, case 26750588

Antariy

Quote from: hutch-- on December 01, 2010, 10:34:56 PM
Alex,

I get a GP fault out of your last zip on xp sp2.

That's apparently something in the ChkPrecision/FPU code. AxGetPercentInt has no instructions that can cause a GPF.

I have checked this and found the reason: Jochen intentionally makes the FPU stack overflow in the macro:

Algo MACRO arg
finit
REPEAT 8
fldpi
ENDM


The ChkPrecision code uses simple FPU code as a reference, and this simple code does not handle the case where the FPU stack is full. That is why you get the exception.

Now I have commented out the FLDPI line, and it should work properly.
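
For readers who want to see the effect in isolation, here is a minimal sketch of what a full FPU stack does to a later load, assuming the masked-exception control word that finit sets up. The names src and result are made up for illustration, and this is not the testbed code itself:

```masm
    include \masm32\include\masm32rt.inc

    .data
    src    dd 200             ; hypothetical input value
    result dd 0

    .code

start:

    finit                     ; empty FPU stack, all exceptions masked
    REPEAT 8
      fldpi                   ; occupy all eight FPU registers
    ENDM
    fild src                  ; ninth load: stack overflow, ST(0) becomes a NaN
    fistp result              ; storing the NaN writes the integer indefinite, 80000000h
    finit                     ; leave the FPU in a clean state
    print str$(result),13,10
    inkey
    exit

end start
```

With all eight registers occupied, the fild cannot push a real value, and the fistp then stores 80000000h, so this should print -2147483648. That would be consistent with the result=-2147483648 lines for GetPercent and GetPercent2c in the timings earlier in the thread.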

As I said, that's not a bug in my code; Jochen just did not write documentation for his testing variant :bg

Hutch, and all, test this one, please.



Alex

Antariy

Quote from: ToutEnMasm on December 01, 2010, 03:26:00 PM

Thanks, Luce!

dedndave

prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
24      cycles for GetPercentSSE
47      cycles for GetPercent
24      cycles for GetPercent2c
29      cycles for GetPercent2nc
26      cycles for GetPercentJJ1
21      cycles for GetPercentJJ2
16      cycles for AxGetPercentInt

25      cycles for GetPercentSSE
47      cycles for GetPercent
30      cycles for GetPercent2c
30      cycles for GetPercent2nc
26      cycles for GetPercentJJ1
21      cycles for GetPercentJJ2
16      cycles for AxGetPercentInt

:U