I've been reading Visual C++ Optimization with Assembly Code by Yury Magda (I like it; do you have an opinion?). In Chapter 2 he shows an optimized version of an integer add (actually summing the elements of an integer array) written with VC++ inline assembly (__asm{}), and it uses fiadd, whereas the disassembly of the C++ version uses add. Is it often considered better to do integer arithmetic with an FP instruction and the FP stack rather than with an integer instruction? I'm surprised. The book doesn't address that issue, other than to say that the asm version performs better.
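The asm code looks like this:
__asm{
mov ECX, sf ; sf is the number of elements in the array
dec ECX ; the first element is loaded by fild, so loop sf-1 times
mov ESI, DWORD PTR piarray ; ESI = address of the first element
finit ; reset the FPU
fild DWORD PTR [ESI] ; ST(0) = first element
next:
add ESI, 4 ; advance to the next element
fiadd DWORD PTR [ESI] ; convert the element to FP and add it to ST(0)
loop next ; ECX -= 1, repeat while ECX != 0
fistp DWORD PTR isum ; store ST(0) as a 32-bit int and pop
fwait ; wait for the FPU to finish the store
}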
while the C++ version is (I know, pointer arithmetic!):
for (int cnt=0;cnt<sf;cnt++) {
isum += *piarray;
piarray++;
}
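To make the comparison concrete, here is a minimal C++ sketch of the two approaches, using only standard C++. sum_int does the integer adds the compiler emits; sum_fp is a rough analogue of the fild/fiadd/fistp version, converting each element and accumulating in long double. Both function names are hypothetical, and the compiler chooses the actual instructions, so sum_fp is not guaranteed to produce fiadd (with MSVC, long double is the same as double, and x64 builds use SSE2 for floating point).

```cpp
#include <cassert>
#include <cstddef>

// Integer path: the same work as the compiler's `add` loop.
int sum_int(const int* arr, std::size_t n) {
    assert(arr != nullptr || n == 0);
    int sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        sum += arr[i];  // one integer add per element
    }
    return sum;
}

// Floating-point path: a rough analogue of fild/fiadd/fistp.
// Each element is converted to floating point and accumulated there;
// the total is converted back to int only once, at the end.
int sum_fp(const int* arr, std::size_t n) {
    assert(arr != nullptr || n == 0);
    long double sum = 0.0L;
    for (std::size_t i = 0; i < n; ++i) {
        sum += static_cast<long double>(arr[i]);  // int -> FP conversion, then FP add
    }
    // static_cast truncates (fistp rounds), but the sum is an exact integer here.
    return static_cast<int>(sum);
}
```

Both return the same value for arrays whose sum fits in an int; the difference is only in how much work each element costs.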
(Why does it seem like I'm talking to myself here?)
Floating-point instructions are generally slower than integer instructions, often considerably so. fiadd also has to convert each integer operand to floating point before adding it, and every fiadd depends on the result of the one before it.
So unless you specifically need floating point, don't use it. And if you do, everyone will tell you to use SSE anyway.
Out of context, that example code may be trying to demonstrate some other point, but a plain integer add would be much better here (and MMX would be better still).
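MMX helps because it performs several independent integer adds at once (paddd adds two pairs of 32-bit ints per instruction). The portable C++ sketch below is not MMX; it only illustrates the same idea with two independent accumulators, so consecutive adds do not have to wait on each other. sum_two_lanes is a hypothetical name, and an optimizing compiler may vectorize a plain summing loop on its own anyway.

```cpp
#include <cassert>
#include <cstddef>

// Two independent partial sums, loosely mirroring the two 32-bit lanes
// of an MMX register. Assumes the total fits in an int.
int sum_two_lanes(const int* arr, std::size_t n) {
    assert(arr != nullptr || n == 0);
    int lane0 = 0;  // even-indexed elements
    int lane1 = 0;  // odd-indexed elements
    std::size_t i = 0;
    for (; i + 1 < n; i += 2) {
        lane0 += arr[i];      // these two adds do not depend on each other,
        lane1 += arr[i + 1];  // so the CPU can overlap them
    }
    if (i < n) {
        lane0 += arr[i];  // leftover element when n is odd
    }
    return lane0 + lane1;  // combine the lanes once, at the end
}
```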
I doubt that the code is a demonstration of speed optimization. What point could it be trying to demonstrate?
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
include \masm32\include\masm32rt.inc
.586
include timers.asm
_fiadd PROTO :DWORD,:DWORD
_add PROTO :DWORD,:DWORD
__add PROTO :DWORD,:DWORD
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.data
array dd 1000 dup(1)
.code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
invoke _fiadd, ADDR array, 1000
print ustr$(eax), 32
invoke _add, ADDR array, 1000
print ustr$(eax), 32
invoke __add, ADDR array, 1000
print ustr$(eax), 13,10
LOOP_COUNT EQU 200000
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
invoke _fiadd, ADDR array, 1000
counter_end
print ustr$(eax)," cycles", 13, 10
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
invoke _add, ADDR array, 1000
counter_end
print ustr$(eax)," cycles", 13, 10
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
invoke __add, ADDR array, 1000
counter_end
print ustr$(eax)," cycles", 13, 10
mov eax, input(13,10,"Press enter to exit...")
exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
align 4
_fiadd proc uses esi arr:DWORD,cnt:DWORD
LOCAL iSum:DWORD
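mov ecx, cnt
dec ecx ; the first element is loaded by fild
mov esi, arr
finit
fild DWORD PTR[esi] ; ST(0) = arr[0]
@@:
add esi, 4
fiadd DWORD PTR[esi] ; ST(0) += next element
loop @B
fistp iSum ; store the sum as a DWORD and pop
fwait
mov eax, iSum ; return the sum in eax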
ret
_fiadd endp
align 4
_add proc uses esi arr:DWORD,cnt:DWORD
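mov ecx, cnt
dec ecx ; the first element goes straight into eax
mov esi, arr
mov eax, [esi]
@@:
add esi, 4 ; advance to the next element
add eax,[esi]
loop @B
ret ; sum returned in eax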
_add endp
; The optimization technique used here was borrowed
; from the MASM32 arradd procedure.
align 4
__add proc arr:DWORD,cnt:DWORD
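mov ecx, cnt
mov edx, arr
add ecx, ecx ; ecx = cnt*2
xor eax, eax ; eax = running sum
add ecx, ecx ; ecx = cnt*4, the array length in bytes
add edx, ecx ; edx -> one past the last element
neg ecx ; negative byte index, counts up toward zero
jmp @F ; skip the alignment padding
align 16
@@:
add eax,[edx+ecx]
add ecx, 4 ; sets ZF when the index reaches zero,
jnz @B ; so no separate cmp is needed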
ret
__add endp
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start
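Results on my P3 (first line: the sums returned by _fiadd, _add and __add; then the cycle counts for each, in the same order):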
1000 1000 1000
9180 cycles
6317 cycles
2031 cycles
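Per element, that is about 9.2 cycles for _fiadd, 6.3 for _add and 2.0 for __add (2031 cycles / 1000 elements). The __add loop avoids the loop instruction and runs a negative index up to zero, so the index update also sets the flags for the branch. Below is a C++ sketch of the same loop shape; sum_negative_index is a hypothetical name, it counts in elements rather than bytes, and unlike __add (which jumps straight into the loop body and so assumes cnt is at least 1) it tests before the first add, so n == 0 is safe.

```cpp
#include <cassert>
#include <cstddef>

// Same loop shape as __add: point one past the end and run a negative
// index up to zero, so the loop test is just "index != 0".
int sum_negative_index(const int* arr, std::size_t n) {
    assert(arr != nullptr || n == 0);
    const int* end = arr + n;                            // like `add edx, ecx`
    std::ptrdiff_t i = -static_cast<std::ptrdiff_t>(n);  // like `neg ecx`
    int sum = 0;                                         // like `xor eax, eax`
    for (; i != 0; ++i) {                                // like `add ecx, 4` / `jnz @B`
        sum += end[i];                                   // like `add eax, [edx+ecx]`
    }
    return sum;
}
```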
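Not relevant to the above demonstration, but one advantage of the FPU here is 64-bit integers: fild and fistp accept QWORD operands, and with the extended precision that finit selects, the FPU holds every 64-bit integer exactly. fiadd itself only takes 16- or 32-bit memory operands, but it accumulates into an 80-bit register, so a sum of 32-bit ints that overflows 32 bits can still be stored exactly with fistp QWORD PTR.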
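On the integer side, the usual way to get a sum that cannot overflow at 32 bits is a 64-bit accumulator; a 32-bit compiler typically implements the 64-bit add as an add/adc pair. A minimal C++ sketch, with the hypothetical name sum_to_int64:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Accumulate 32-bit elements into a 64-bit total. With fewer than 2^32
// elements the sum cannot overflow, much like loading into the FPU and
// storing the result with fistp QWORD PTR.
std::int64_t sum_to_int64(const std::int32_t* arr, std::size_t n) {
    assert(arr != nullptr || n == 0);
    std::int64_t sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        sum += arr[i];  // each element is widened to 64 bits before the add
    }
    return sum;
}
```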