News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Can this be converted to sse

Started by zemtex, December 16, 2011, 05:18:28 PM

Previous topic - Next topic

zemtex

Can this be optimized using SSE? Can anyone do it for me, it calculates multipliers of 3 and 5, if the modulus is 0, the number should be added to the result.
I have been puzzling with lego bricks all my life. I know how to do this. When Peter, at age 6 is competing with me, I find it extremely neccessary to show him that I can puzzle bricks better than him, because he is so damn talented that all that is called rational has gone haywire.

dedndave

not sure about SSE
but you could certainly speed it up a lot by using multiply-to-divide
a while back, qWord made an excellent tool for getting the "magic" numbers
http://www.masm32.com/board/index.php?topic=9937.msg72815#msg72815

another way to go would be to do something like a Sieve of Eratosthenes
http://en.wikipedia.org/wiki/Sieve_of_Eratosthenes

qWord

Quote from: zemtex on December 16, 2011, 05:18:28 PM
Can this be optimized using SSE?
maybe, when using doubles for division. This would require to examine the fraction bits after division.
FPU in a trice: SmplMath
It's that simple!

zemtex

I converted it to FPU instructions using various mixes of these instructions

            ;FTST ; Compares st0 with zero and sets FPU flags
;FCMOVNE ST(0), ST(i) ; move if not equal
;FINIT ; clear all
;FLDZ ; st(0) = 0 (everything else is pushed further down)
;FLD1 ; st(0) = 1 (everything else is pushed further down)
;FINCSTP ; rotates stackregisters upwards,   st(0) becomes st(7) and st(1) becomes st(0) etc.
;FDECSTP ; rotates stackregisters downwards, st(7) becomes st(0) and st(7) becomes st(0) etc.
;FPREM1 ; st(0) / st(1) og kalkulerer remainder og lagrer i st(0)
;FCOMI st(0), st(1) ; cmp st0 with st1, eflags will be set. carry flag is set if st0 less than st1

These are not actual code, it is just a reference that I use when coding, the order of these instructions matters not. After I picked a selection of a few of these, the code ended up being slower than the code you see in the first post.
I have been puzzling with lego bricks all my life. I know how to do this. When Peter, at age 6 is competing with me, I find it extremely neccessary to show him that I can puzzle bricks better than him, because he is so damn talented that all that is called rational has gone haywire.

dedndave

even though you have to go through 2 loops, the addition method is probably fastest
make a loop for multiples of 3 and another for multiples of 5

zemtex

I have been puzzling with lego bricks all my life. I know how to do this. When Peter, at age 6 is competing with me, I find it extremely neccessary to show him that I can puzzle bricks better than him, because he is so damn talented that all that is called rational has gone haywire.

dedndave

well - you just add the "divisor" to an accumulated total in each loop
the result is always a multiple of the divisor, so you always add it
the loop is reasonably fast, because all passes are valid and because you are adding instead of dividing

i will review your code - and see what i can come up with for some test code

dedndave

ok - give this a try...

dedndave

you could probably speed that up some more by using another register
in the 5's loop....
use one register to hold 5 and another to hold 10
swap the 2 registers on each pass

jj2007

Quote from: dedndave on December 16, 2011, 06:48:25 PM
you could probably speed that up

Hey, Dave, make a serious effort please. Your code is only a factor 70 faster - add a proc that's called Fast, not just Faster :green

dedndave

 :P

proof that the code design phase is more important than instruction optimization

qWord

is it intended, that the result overflows?
FPU in a trice: SmplMath
It's that simple!

dedndave

ummm - no - lol

let me look at it

i figured he was happy with the answer his code generated - and mine generated the same answer

dedndave

wow - it overflows by a lot, too   :P

i get a total of 2333333416666668 - 16 decimal digits

is this for Project Euler ?
it looks like problem #1, which only goes up to 1000

zemtex

dedndave, it doesn't really matter, it works for testing the speed.  :U

It's amazing what kind of terror a div instruction can do. Can it be further optimized?
I have been puzzling with lego bricks all my life. I know how to do this. When Peter, at age 6 is competing with me, I find it extremely neccessary to show him that I can puzzle bricks better than him, because he is so damn talented that all that is called rational has gone haywire.