Can this be optimized using SSE? Can anyone do it for me, it calculates multipliers of 3 and 5, if the modulus is 0, the number should be added to the result.
not sure about SSE
but you could certainly speed it up a lot by using multiply-to-divide
a while back, qWord made an excellent tool for getting the "magic" numbers
http://www.masm32.com/board/index.php?topic=9937.msg72815#msg72815
another way to go would be to do something like a Sieve of Eratosthenes
http://en.wikipedia.org/wiki/Sieve_of_Eratosthenes
Quote from: zemtex on December 16, 2011, 05:18:28 PM
Can this be optimized using SSE?
maybe, when using doubles for division. This would require to examine the fraction bits after division.
I converted it to FPU instructions using various mixes of these instructions
;FTST ; Compares st0 with zero and sets FPU flags
;FCMOVNE ST(0), ST(i) ; move if not equal
;FINIT ; clear all
;FLDZ ; st(0) = 0 (everything else is pushed further down)
;FLD1 ; st(0) = 1 (everything else is pushed further down)
;FINCSTP ; rotates stackregisters upwards, st(0) becomes st(7) and st(1) becomes st(0) etc.
;FDECSTP ; rotates stackregisters downwards, st(7) becomes st(0) and st(7) becomes st(0) etc.
;FPREM1 ; st(0) / st(1) og kalkulerer remainder og lagrer i st(0)
;FCOMI st(0), st(1) ; cmp st0 with st1, eflags will be set. carry flag is set if st0 less than st1
These are not actual code, it is just a reference that I use when coding, the order of these instructions matters not. After I picked a selection of a few of these, the code ended up being slower than the code you see in the first post.
even though you have to go through 2 loops, the addition method is probably fastest
make a loop for multiples of 3 and another for multiples of 5
Can you show me a sample of that?
well - you just add the "divisor" to an accumulated total in each loop
the result is always a multiple of the divisor, so you always add it
the loop is reasonably fast, because all passes are valid and because you are adding instead of dividing
i will review your code - and see what i can come up with for some test code
ok - give this a try...
you could probably speed that up some more by using another register
in the 5's loop....
use one register to hold 5 and another to hold 10
swap the 2 registers on each pass
Quote from: dedndave on December 16, 2011, 06:48:25 PM
you could probably speed that up
Hey, Dave, make a serious effort please. Your code is only a factor 70 faster - add a proc that's called
Fast, not just
Faster :green
:P
proof that the code design phase is more important than instruction optimization
is it intended, that the result overflows?
ummm - no - lol
let me look at it
i figured he was happy with the answer his code generated - and mine generated the same answer
wow - it overflows by a lot, too :P
i get a total of 2333333416666668 - 16 decimal digits
is this for Project Euler ?
it looks like problem #1, which only goes up to 1000
dedndave, it doesn't really matter, it works for testing the speed. :U
It's amazing what kind of terror a div instruction can do. Can it be further optimized?
it seems like you should be able to speed it up with SSE using the same algorithm :P
let me see if the register swap method has merit
doesn't look good for swapping a 5 with a 10 :P
dedndave, I have found a very fast algorithm for the same kind of function I posted here. If you want to try to convert it from c++ to asm, I highly appreciate it. The guy who wrote this had 15 ms using msvc. It is pretty darn fast for c++. Can you do it faster? :P
static int sumlol(int m) {
int sum = 0, left = m % 15;
for (int i = 0; i < m - left; i += 15) {
sum += 7 * i + 3 + 5 + 6 + 9 + 10 + 12;
}
for (int i = m - left; i < m; i++) {
if(i % 3 == 0 || i % 5 == 0) {
sum += i;
}
}
return sum;
}
15 milliseconds for this in c++, it is twice as fast as the optimized version you gave me, it gave me 31 ms. But if you take this algorithm and convert it, perhaps we can get down to 12 perhaps?
i think i understand it - lol
not very good with C operators
I have converted it to pseudo language if you need it, I am not entirely sure if the pseudo is correct though:
result = 0
left = max / 15
i = 0
while i < (Max-left)
i := i + 15
resultat = resultat + 7 * i + 3 + 5 + 6 + 9 + 10 + 12
while i < (Max)
i := i + 1
if (i / 3 == 0 || i / 5 == 0
result = resultat + i
return result
His compiler outputs something like this, so you see the compiler is very "smart", it uses the lea for calculation and so forth.
xor ecx, ecx
xor eax, eax
loop1: lea ecx, [ecx+eax+0x2d]
add eax, 0x69
cmp eax, 0x29b926ba
jl loop1
if you can make something out of this...
i have a loop that works - kinda
i am not getting the correct answer - let me see if i can figure out why
it takes about 60% as much time as my original loop - considerably faster
The guy who wrote the algo have written a new one, this new one runs in 0 ms for 100 million iterations, or so he claims. I don't know what to say, I think i need a cup of pepsi max. :bdg
static unsigned long sumlol2(unsigned long m) {
unsigned long left = m % 15; m -= left;
unsigned long sum = (m - 1) / 15 * ((m - 1) / 15 + 1) / 2 * 15 * 7 + 45 * (m / 15);
for (; left != 0; m++, left--) {
if(m % 3 == 0 || m % 5 == 0) {
sum += m;
}
}
return sum;
yes - lol
i was just working that out - it can be calculated directly without loops :P
Compilers seems to be extremely efficient breaking down long arithmetic statements and finding the right instructions to do it efficiently.
they are - that does not mean it cannot be improved upon :U
back to working on my problem...
http://projecteuler.net/problem=361
not as bad as it looks, i think
True that :U
We need a "compiler" type tool for asm too. Perhaps write a small tool that can quickly parse an assermbly source file to find register stalls. That would be a good start. And then later expand the program to find more optimizations. Alternatively, write an assembly code editor, and with an optimization module implemented it will highlight lines in the source file if it is stalling.
The only way to get complete control of the vastness of optimizations, is if you store all of them in a database and let a program tool let you know about them. If assembly programmers always have to discover slow code by accident, it will be very time consuming. An editor with a line highlighter will force you to make good habits, over time you will learn to avvoid the same mistakes.
One could expand this functionality to automatically show the size of any given code section on the far left side for example. So that size optimizations can be easily controlled. This would require to store all instructions in a database of course.
Perhaps also implement a virtual execution unit that will simulate execution of the instructions and show output on the far right. The editor must be able to switch between different arcitectures, each architecture has its own instruction database, so that the editor is portable to other instruction sets.
A hotkey to pop up instruction listings by category. you could scroll through instructions using arrow keys, kind of like a dos listing system. And press enter to go into the description of each instruction.
Maybe later the editor could be expanded to make COM calls easier, even easier than macro calls. Find a system or a way to make this automatic.
If the editor fits, you could make an automatic coffee machine too, it will output coffee and an automatic boredom detection unit.
just write the code in C and generate an assembler listing :U
Yes you could of course. Me, personally have no c++ experience. My high level language of choice have always been Turbo Pascal and Borland Delphi. In Delphi you can make a big program in minutes, but the size of the executable will grow to 700 KB almost instantly. :bdg
A few years ago, I spent a month trying to learn c++, I managed to create a few smaller programs, but I quit after a month.
i think i did the same thing about 25 years ago :bg
i didn't care for C - but i had to have some familiarity for a job
The attachment contains a test of the sumlol and sumlol2 functions. I started out trying to do the test in C, but as noted the Microsoft C compiler chokes on the m -= left statement in the middle of the variable declarations. Instead of rearranging the statements I decided to go with C++, since that was apparently the target. Here are the results running on my P3 system:
0: 0 1145324609
1: 0 1145324609
2: 0 1145324609
3: 0 1145324609
4: 3 1145324612
5: 3 1145324612
6: 8 1145324617
7: 14 1145324623
8: 14 1145324623
9: 14 1145324623
10: 23 1145324632
11: 33 1145324642
12: 33 1145324642
13: 45 1145324654
14: 45 1145324654
15: 45 45
16: 60 60
17: 60 60
18: 60 60
19: 78 78
20: 78 78
21: 98 98
22: 119 119
23: 119 119
24: 119 119
25: 143 143
26: 168 168
27: 168 168
28: 195 195
29: 195 195
30: 195 195
31: 225 225
32: 225 225
33: 225 225
34: 258 258
35: 258 258
36: 293 293
37: 329 329
38: 329 329
39: 329 329
40: 368 368
41: 408 408
42: 408 408
43: 450 450
44: 450 450
45: 450 450
46: 495 495
47: 495 495
48: 495 495
49: 543 543
50: 543 543
51: 593 593
52: 644 644
53: 644 644
54: 644 644
55: 698 698
56: 753 753
57: 753 753
58: 810 810
59: 810 810
60: 810 810
61: 870 870
62: 870 870
63: 870 870
64: 933 933
65: 933 933
66: 998 998
67: 1064 1064
68: 1064 1064
69: 1064 1064
70: 1133 1133
71: 1203 1203
72: 1203 1203
73: 1275 1275
74: 1275 1275
75: 1275 1275
76: 1350 1350
77: 1350 1350
78: 1350 1350
79: 1428 1428
80: 1428 1428
81: 1508 1508
82: 1589 1589
83: 1589 1589
84: 1589 1589
85: 1673 1673
86: 1758 1758
87: 1758 1758
88: 1845 1845
89: 1845 1845
90: 1845 1845
91: 1935 1935
92: 1935 1935
93: 1935 1935
94: 2028 2028
95: 2028 2028
96: 2123 2123
97: 2219 2219
98: 2219 2219
99: 2219 2219
0 50 cycles
99 701 cycles
0 66 cycles
99 698 cycles
BTW, I used /O2 /G6 without verifying that these options are a good choice, or running tests to determine which options produce the fastest code.
http://msdn.microsoft.com/en-US/library/19z1t1wy(v=VS.80).aspx
Edit: The /G6 option seems to have no effect when combined with /O2. Combining /O2 and /arch:SSE2 produced faster code for the first function, that still ran on my P3 and produced the same results, but I didn't have time to examine the ASM output to determine what the differences are.
0 50 cycles
99 688 cycles
0 74 cycles
99 699 cycles
See second attachment.