The MASM Forum Archive 2004 to 2012

General Forums => The Workshop => Topic started by: zemtex on December 16, 2011, 05:18:28 PM

Title: Can this be converted to sse
Post by: zemtex on December 16, 2011, 05:18:28 PM
Can this be optimized using SSE? Can anyone do it for me, it calculates multipliers of 3 and 5, if the modulus is 0, the number should be added to the result.
Title: Re: Can this be converted to sse
Post by: dedndave on December 16, 2011, 05:55:02 PM
not sure about SSE
but you could certainly speed it up a lot by using multiply-to-divide
a while back, qWord made an excellent tool for getting the "magic" numbers
http://www.masm32.com/board/index.php?topic=9937.msg72815#msg72815

another way to go would be to do something like a Sieve of Eratosthenes
http://en.wikipedia.org/wiki/Sieve_of_Eratosthenes
Title: Re: Can this be converted to sse
Post by: qWord on December 16, 2011, 06:02:41 PM
Quote from: zemtex on December 16, 2011, 05:18:28 PM
Can this be optimized using SSE?
maybe, when using doubles for division. This would require to examine the fraction bits after division.
Title: Re: Can this be converted to sse
Post by: zemtex on December 16, 2011, 06:08:35 PM
I converted it to FPU instructions using various mixes of these instructions

            ;FTST ; Compares st0 with zero and sets FPU flags
;FCMOVNE ST(0), ST(i) ; move if not equal
;FINIT ; clear all
;FLDZ ; st(0) = 0 (everything else is pushed further down)
;FLD1 ; st(0) = 1 (everything else is pushed further down)
;FINCSTP ; rotates stackregisters upwards,   st(0) becomes st(7) and st(1) becomes st(0) etc.
;FDECSTP ; rotates stackregisters downwards, st(7) becomes st(0) and st(7) becomes st(0) etc.
;FPREM1 ; st(0) / st(1) og kalkulerer remainder og lagrer i st(0)
;FCOMI st(0), st(1) ; cmp st0 with st1, eflags will be set. carry flag is set if st0 less than st1

These are not actual code, it is just a reference that I use when coding, the order of these instructions matters not. After I picked a selection of a few of these, the code ended up being slower than the code you see in the first post.
Title: Re: Can this be converted to sse
Post by: dedndave on December 16, 2011, 06:09:03 PM
even though you have to go through 2 loops, the addition method is probably fastest
make a loop for multiples of 3 and another for multiples of 5
Title: Re: Can this be converted to sse
Post by: zemtex on December 16, 2011, 06:10:32 PM
Can you show me a sample of that?
Title: Re: Can this be converted to sse
Post by: dedndave on December 16, 2011, 06:15:43 PM
well - you just add the "divisor" to an accumulated total in each loop
the result is always a multiple of the divisor, so you always add it
the loop is reasonably fast, because all passes are valid and because you are adding instead of dividing

i will review your code - and see what i can come up with for some test code
Title: Re: Can this be converted to sse
Post by: dedndave on December 16, 2011, 06:43:04 PM
ok - give this a try...
Title: Re: Can this be converted to sse
Post by: dedndave on December 16, 2011, 06:48:25 PM
you could probably speed that up some more by using another register
in the 5's loop....
use one register to hold 5 and another to hold 10
swap the 2 registers on each pass
Title: Re: Can this be converted to sse
Post by: jj2007 on December 16, 2011, 06:52:26 PM
Quote from: dedndave on December 16, 2011, 06:48:25 PM
you could probably speed that up

Hey, Dave, make a serious effort please. Your code is only a factor 70 faster - add a proc that's called Fast, not just Faster :green
Title: Re: Can this be converted to sse
Post by: dedndave on December 16, 2011, 06:55:15 PM
 :P

proof that the code design phase is more important than instruction optimization
Title: Re: Can this be converted to sse
Post by: qWord on December 16, 2011, 07:13:37 PM
is it intended, that the result overflows?
Title: Re: Can this be converted to sse
Post by: dedndave on December 16, 2011, 07:16:11 PM
ummm - no - lol

let me look at it

i figured he was happy with the answer his code generated - and mine generated the same answer
Title: Re: Can this be converted to sse
Post by: dedndave on December 16, 2011, 07:20:19 PM
wow - it overflows by a lot, too   :P

i get a total of 2333333416666668 - 16 decimal digits

is this for Project Euler ?
it looks like problem #1, which only goes up to 1000
Title: Re: Can this be converted to sse
Post by: zemtex on December 16, 2011, 07:29:47 PM
dedndave, it doesn't really matter, it works for testing the speed.  :U

It's amazing what kind of terror a div instruction can do. Can it be further optimized?
Title: Re: Can this be converted to sse
Post by: dedndave on December 16, 2011, 07:45:18 PM
it seems like you should be able to speed it up with SSE using the same algorithm  :P

let me see if the register swap method has merit
Title: Re: Can this be converted to sse
Post by: dedndave on December 16, 2011, 08:04:11 PM
doesn't look good for swapping a 5 with a 10   :P
Title: Re: Can this be converted to sse
Post by: zemtex on December 18, 2011, 01:51:59 PM
dedndave, I have found a very fast algorithm for the same kind of function I posted here. If you want to try to convert it from c++ to asm, I highly appreciate it. The guy who wrote this had 15 ms using msvc. It is pretty darn fast for c++. Can you do it faster?  :P


static int sumlol(int m) {
int sum = 0, left = m % 15;
for (int i = 0; i < m - left; i += 15) {
sum += 7 * i + 3 + 5 + 6 + 9 + 10 + 12;
}
for (int i = m - left; i < m; i++) {
if(i % 3 == 0 || i % 5 == 0) {
sum += i;
}
}
return sum;
}


15 milliseconds for this in c++, it is twice as fast as the optimized version you gave me, it gave me 31 ms. But if you take this algorithm and convert it, perhaps we can get down to 12 perhaps?
Title: Re: Can this be converted to sse
Post by: dedndave on December 18, 2011, 02:22:15 PM
i think i understand it - lol
not very good with C operators
Title: Re: Can this be converted to sse
Post by: zemtex on December 18, 2011, 03:11:32 PM
I have converted it to pseudo language if you need it, I am not entirely sure if the pseudo is correct though:

result = 0

left = max / 15

i = 0

while i < (Max-left)
   i := i + 15
   resultat  = resultat + 7 * i + 3 + 5 + 6 + 9 + 10 + 12

while i < (Max)
   i := i + 1
   if (i / 3 == 0 || i / 5 == 0
      result = resultat + i
return result
Title: Re: Can this be converted to sse
Post by: zemtex on December 18, 2011, 03:12:28 PM
His compiler outputs something like this, so you see the compiler is very "smart", it uses the lea for calculation and so forth.

xor ecx, ecx
xor eax, eax
loop1: lea ecx, [ecx+eax+0x2d]
add eax, 0x69
cmp eax, 0x29b926ba
jl loop1

if you can make something out of this...
Title: Re: Can this be converted to sse
Post by: dedndave on December 18, 2011, 03:26:51 PM
i have a loop that works - kinda
i am not getting the correct answer - let me see if i can figure out why
it takes about 60% as much time as my original loop - considerably faster
Title: Re: Can this be converted to sse
Post by: zemtex on December 18, 2011, 03:43:04 PM
The guy who wrote the algo have written a new one, this new one runs in 0 ms for 100 million iterations, or so he claims. I don't know what to say, I think i need a cup of pepsi max.  :bdg


static unsigned long sumlol2(unsigned long m) {
unsigned long left = m % 15; m -= left;
unsigned long sum = (m - 1) / 15 * ((m - 1) / 15 + 1) / 2 * 15 * 7 + 45 * (m / 15);
for (; left != 0; m++, left--) {
if(m % 3 == 0 || m % 5 == 0) {
sum += m;
}
}
return sum;
Title: Re: Can this be converted to sse
Post by: dedndave on December 18, 2011, 03:54:02 PM
yes - lol
i was just working that out - it can be calculated directly without loops   :P
Title: Re: Can this be converted to sse
Post by: zemtex on December 18, 2011, 04:04:24 PM
Compilers seems to be extremely efficient breaking down long arithmetic statements and finding the right instructions to do it efficiently.
Title: Re: Can this be converted to sse
Post by: dedndave on December 18, 2011, 04:05:33 PM
they are - that does not mean it cannot be improved upon   :U

back to working on my problem...
http://projecteuler.net/problem=361
not as bad as it looks, i think
Title: Re: Can this be converted to sse
Post by: zemtex on December 18, 2011, 04:10:43 PM
True that  :U

We need a "compiler" type tool for asm too. Perhaps write a small tool that can quickly parse an assermbly source file to find register stalls. That would be a good start. And then later expand the program to find more optimizations. Alternatively, write an assembly code editor, and with an optimization module implemented it will highlight lines in the source file if it is stalling.

The only way to get complete control of the vastness of optimizations, is if you store all of them in a database and let a program tool let you know about them. If assembly programmers always have to discover slow code by accident, it will be very time consuming. An editor with a line highlighter will force you to make good habits, over time you will learn to avvoid the same mistakes.

One could expand this functionality to automatically show the size of any given code section on the far left side for example. So that size optimizations can be easily controlled. This would require to store all instructions in a database of course.

Perhaps also implement a virtual execution unit that will simulate execution of the instructions and show output on the far right. The editor must be able to switch between different arcitectures, each architecture has its own instruction database, so that the editor is portable to other instruction sets.

A hotkey to pop up instruction listings by category. you could scroll through instructions using arrow keys, kind of like a dos listing system. And press enter to go into the description of each instruction.

Maybe later the editor could be expanded to make COM calls easier, even easier than macro calls. Find a system or a way to make this automatic.

If the editor fits, you could make an automatic coffee machine too, it will output coffee and an automatic boredom detection unit.
Title: Re: Can this be converted to sse
Post by: dedndave on December 18, 2011, 05:01:04 PM
just write the code in C and generate an assembler listing   :U
Title: Re: Can this be converted to sse
Post by: zemtex on December 18, 2011, 05:11:09 PM
Yes you could of course. Me, personally have no c++ experience. My high level language of choice have always been Turbo Pascal and Borland Delphi. In Delphi you can make a big program in minutes, but the size of the executable will grow to 700 KB almost instantly.  :bdg

A few years ago, I spent a month trying to learn c++, I managed to create a few smaller programs, but I quit after a month.
Title: Re: Can this be converted to sse
Post by: dedndave on December 18, 2011, 06:35:45 PM
i think i did the same thing about 25 years ago   :bg
i didn't care for C - but i had to have some familiarity for a job
Title: Re: Can this be converted to sse
Post by: MichaelW on December 18, 2011, 06:36:02 PM
The attachment contains a test of the sumlol and sumlol2 functions. I started out trying to do the test in C, but as noted the Microsoft C compiler chokes on the m -= left statement in the middle of the variable declarations. Instead of rearranging the statements I decided to go with C++, since that was apparently the target. Here are the results running on my P3 system:


0:      0       1145324609
1:      0       1145324609
2:      0       1145324609
3:      0       1145324609
4:      3       1145324612
5:      3       1145324612
6:      8       1145324617
7:      14      1145324623
8:      14      1145324623
9:      14      1145324623
10:     23      1145324632
11:     33      1145324642
12:     33      1145324642
13:     45      1145324654
14:     45      1145324654
15:     45      45
16:     60      60
17:     60      60
18:     60      60
19:     78      78
20:     78      78
21:     98      98
22:     119     119
23:     119     119
24:     119     119
25:     143     143
26:     168     168
27:     168     168
28:     195     195
29:     195     195
30:     195     195
31:     225     225
32:     225     225
33:     225     225
34:     258     258
35:     258     258
36:     293     293
37:     329     329
38:     329     329
39:     329     329
40:     368     368
41:     408     408
42:     408     408
43:     450     450
44:     450     450
45:     450     450
46:     495     495
47:     495     495
48:     495     495
49:     543     543
50:     543     543
51:     593     593
52:     644     644
53:     644     644
54:     644     644
55:     698     698
56:     753     753
57:     753     753
58:     810     810
59:     810     810
60:     810     810
61:     870     870
62:     870     870
63:     870     870
64:     933     933
65:     933     933
66:     998     998
67:     1064    1064
68:     1064    1064
69:     1064    1064
70:     1133    1133
71:     1203    1203
72:     1203    1203
73:     1275    1275
74:     1275    1275
75:     1275    1275
76:     1350    1350
77:     1350    1350
78:     1350    1350
79:     1428    1428
80:     1428    1428
81:     1508    1508
82:     1589    1589
83:     1589    1589
84:     1589    1589
85:     1673    1673
86:     1758    1758
87:     1758    1758
88:     1845    1845
89:     1845    1845
90:     1845    1845
91:     1935    1935
92:     1935    1935
93:     1935    1935
94:     2028    2028
95:     2028    2028
96:     2123    2123
97:     2219    2219
98:     2219    2219
99:     2219    2219

0       50 cycles
99      701 cycles

0       66 cycles
99      698 cycles


BTW, I used /O2 /G6 without verifying that these options are a good choice, or running tests to determine which options produce the fastest code.

http://msdn.microsoft.com/en-US/library/19z1t1wy(v=VS.80).aspx

Edit: The /G6 option seems to have no effect when combined with /O2. Combining /O2 and /arch:SSE2 produced faster code for the first function, that still ran on my P3 and produced the same results, but I didn't have time to examine the ASM output to determine what the differences are.

0       50 cycles
99      688 cycles

0       74 cycles
99      699 cycles


See second attachment.