I've to test if an integer number is odd or even.
I've thought about a couple of solutions:
1. shift right by 1 position the number MOVed into a register and test the carry flag [I think at least]
2. Testing the 0 bit of the register into which I moved the number to test, and acting accordingly.
I'd like to ask:
1. What is the fastest method, and if there is a better one than the 2 methods I'm thinking about?
2. What is the syntax of both these methods if I want to print "The number is ODD" or "The number is EVEN"?
Thanks for your help
and eax, 1
cmp eax, 1
jne NotEven
Quote from: bomz on August 27, 2010, 09:53:20 AM
and eax, 1
cmp eax, 1
jne NotEven
Thanks bomz.
What about this solution:
;-------------------------------------------------------------------------------------------------------------------------
; Test if a number is ODD or EVEN
; Date: 27 august 2010
;-------------------------------------------------------------------------------------------------------------------------
INCLUDE \masm32\include\masm32rt.inc
;------------------------------------------------------------------------------------------------------------------------------
.DATA
number db 12
;------------------------------------------------------------------------------------------------------------------------------
.CODE
start:
xor eax,eax
mov al, number
shr eax, 1
jc isodd
print "The number is EVEN",13,10
inkey
jmp end_test
isodd:
print "The number is ODD",13,10
inkey
end_test:
ret
end start
is it slower? or what else?
and eax, 1
jz ODD
Quote from: bomz on August 27, 2010, 10:00:28 AM
and eax, 1
jz ODD
This is probably faster than the previous one. :U
I'm looking at the BT mnemonic to see how it can be used.
Maybe something like:
BT number, 0
jc ODD
thats all I know about SHR - I think GOOGLE knows how many each operation processor tact(?) needs
(http://s42.radikal.ru/i096/1008/c1/6c6bdc07c292.gif)
Thanks bomz. These examples should be enough for now :U
Nice gif :P
Just a thought,
and eax,1
This zero's all bits except bit 0, so if you want to preserve your original number, and is not the answer.
You might like to look at this thread :-
http://www.masm32.com/board/index.php?topic=10109.msg74010#msg74010
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
4552 cycles for 1000*test eax, 1, result=500
10634 cycles for 1000*bt eax, 0, result=500
4595 cycles for 1000*test eax, 1, result=500
10640 cycles for 1000*bt eax, 0, result=500
Testbed attached.
;value in EAX:
test al,1
jnz value_is_odd
Quote from: dedndave on August 27, 2010, 11:33:59 AM
;value in EAX:
test al,1
jnz value_is_odd
You won - three bytes shorter than my version :bg
However, it's a lot slower:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
2885 cycles for 1000*test eax, 1, result=500
4023 cycles for 1000*test al, 1, result=500
3001 cycles for 1000*bt eax, 0, result=500
hiyas Jochen - good morning (here) - good afternoon (there)
i can't imagine that much difference in speed :P
must be something wrong with the test ::) :lol
"The thieves are always liars"[/U]
As you saw his test program is proven crap, hence his results too... :lol
Morning/Evening/Night Masters. Oh Mr Lingo is here. How are you Sir?
So from the tests Jochen has posted the faster instruction is:
test eax,1
Probably because on 32/64 bit machines dealing with 32/64 bits at a time
is the faster solution for the CPU.
I run the test on my pc and post here afterwards.
Edit: Well on my CPU BT is faster than TEST:
Quote
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz (SSE4)
2043 cycles for 1000*test eax, 1, result=500
2037 cycles for 1000*bt eax, 0, result=500
2019 cycles for 1000*test eax, 1, result=500
2035 cycles for 1000*bt eax, 0, result=500
--- ok ---
They are quite the same.
actually lingo - i was making the remark and rolling my eyes because of all your nonsense - lol
Frank - if the contents of the register are "disposable", you can use SHR
shr eax,1
jc value_was_odd
the original value may be recovered, if desired with RCL EAX,1
as long as the carry flag has not been altered
Quote from: dedndave on August 27, 2010, 12:46:55 PM
Frank - if the contents of the register are "disposable", you can use SHR
shr eax,1
jc value_was_odd
the original value may be recovered, if desired with RCL EAX,1
as long as the carry flag has not been altered
Thanks Dave. Good to know, it's one of the solutions I tested.
What about avoiding the eax MOV and testing the variable directly?
;-------------------------------------------------------------------------------------------------------------------------
; Test if a number is ODD or EVEN
; Date: 27 august 2010
;-------------------------------------------------------------------------------------------------------------------------
INCLUDE \masm32\include\masm32rt.inc
;------------------------------------------------------------------------------------------------------------------------------
.DATA
number dd 12
;------------------------------------------------------------------------------------------------------------------------------
.CODE
start:
bt number, 0
jc isodd
print "The number is EVEN",13,10
inkey
jmp end_test
isodd:
print "The number is ODD",13,10
inkey
end_test:
ret
end start
Well testing the number directly has some small advantages:
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz (SSE4)
2072 cycles for 1000*test eax, 1, result=500
2028 cycles for 1000*bt eax, 0, result=500
2016 cycles for 1000*bt number, 0, result=1000
2041 cycles for 1000*test eax, 1, result=500
2036 cycles for 1000*bt eax, 0, result=500
2025 cycles for 1000*bt number, 0, result=1000
--- ok ---
I tested with number = 13.
Hi,Frank
"What about avoiding the eax MOV and testing the variable directly?"
Will be faster coz you will skip mov eax, number.
dedndave,
actually dedndave - have you some news about your wife coz it is not healthy for a man to be alone? :snooty:
Quote from: lingo on August 27, 2010, 01:02:23 PM
Hi,Frank
"What about avoiding the eax MOV and testing the variable directly?"
Will be faster coz you will skip mov eax, number.
Well, there is infact a quite small advantage as I realized with my test.
Tell me Lingo, do you really think these tests are useless, or are you
just joking?
Will be better for you to test it. Just make two algos equal and test them. The times must be the same but I think they will be different... :U
Quote from: lingo on August 27, 2010, 01:13:19 PM
Will be better for you to test it. Just make two algos equal and test them. The times must be the same but I think they will be different... :U
On modern OSes, with dozens of tasks around all the time, is almost impossible
to have two equal results. I think we can only have a broad idea of the speed, and
this should be the target of the tests. Thanks anyway for your comments. :U
Quote from: dedndave on August 27, 2010, 12:46:55 PM
the original value may be recovered, if desired with RCL EAX,1
as long as the carry flag has not been altered
Cute, Dave. Didn't know that, probably since OpCodes.chm tells a different story:
QuoteRotates the bits in the destination to the left "count" times with all data pushed out the left side re-entering on the right. The Carry Flag holds the last bit rotated out.
Nothing on carry flag rotating in etc...
But I checked it with Olly, and you are right, as usual :U
i am a little surprised, Jochen, that you don't have a solid handle on SH/RO/RC instructions
that's some of the first asm stuff i learned :P
in the beginning, i didn't have many tools - had to figure out how things worked manually
when i first got a PC, i thought DEBUG was the bees knees - lol
on an 8008, we considered CF to be the "9th bit" :lol
Quote from: dedndave on August 27, 2010, 02:51:19 PM
i am a little surprised, Jochen, that you don't have a solid handle on SH/RO/RC instructions
that's some of the first asm stuff i learned :P
My solid handles are \masm32\help and Google, and I thought \masm32\help\opcodes.chm was enough. Usually it is correct.
Hello Sr frktons;
I'm getting float results here; while at every firsts test "test" wins, but after some 10 tests the result invert and "bt" wins.
If I put some huge program to be load, while doing the test program, this results preserve here. At night I can do more tests to you.
regards.
at my age, my solid handles seem to be around my waist - lol
all my life, i have worn slacks and jeans with 34 to 36 waist
had to get some new stuff - 38" now :lol
Quote from: dedndave on August 27, 2010, 12:10:43 PM
hiyas Jochen - good morning (here) - good afternoon (there)
i can't imagine that much difference in speed :P
must be something wrong with the test ::) :lol
Yes, sure! On a P4,
test al is three millicycles slower:
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
4034 1000*test eax, 1, result=500
4037 1000*test al, 1, result=500
10676 1000*bt eax, 0, result=500
By the way: Lingo is so quiet. I think he is cooking up an SSE5 version that runs in -50 cycles - on his CPU only :green2
P.S.:
> had to get some new stuff - 38" now :lol
My condolencies - I know the problem :thumbu
P.P.S.: For those who draw their skills from OpCode.chm, here a "second opinion" (MasmReference):
QuoteRCL/RCR/ROL/ROR Rotate
Rotates the bits in the destination operand the number of times specified in the
source operand. RCL and ROL rotate the bits left; RCR and ROR rotate right.
ROL and ROR rotate the number of bits in the operand. For each rotation, the
leftmost or rightmost bit is copied to the carry flag as well as rotated. RCL and
RCR rotate through the carry flag. The carry flag becomes an extension of the
operand so that a 9-bit rotation is done for 8-bit operands, or a 17-bit rotation
for 16-bit operands.
On the 8088 and 8086, the source operand can be either CL or 1. On the
80186–80486, the source operand can be CL or an 8-bit constant. On the
80186–80486, rotate counts larger than 31 are masked off, but on the 8088 and
8086, larger rotate counts are performed despite the inefficiency involved. The
overflow flag is modified only by single-bit variations of the instruction; for
multiple-bit variations, the overflow flag is undefined.
masm32 applications limits by windows procedure, and serious code optimization needs only for huge computation.
(http://s47.radikal.ru/i116/1008/62/5ab791728ac3.png)
whether there Patch to remove this header ?
when looking at the shift instructions, a picture is worth a thousand words (or dwords)
http://www.arl.wustl.edu/~lockwood/class/cs306/books/artofasm/Chapter_6/CH06-3.html#HEADING3-42
Quote from: mineiro on August 27, 2010, 02:57:05 PM
Hello Sr frktons;
I'm getting float results here; while at every firsts test "test" wins, but after some 10 tests the result invert and "bt" wins.
If I put some huge program to be load, while doing the test program, this results preserve here. At night I can do more tests to you.
regards.
Thanks miniero. I suppose
test and
bt performs quite the same.
If you run some more tests, let me know.
Quote from: dedndave on August 27, 2010, 04:27:58 PM
when looking at the shift instructions, a picture is worth a thousand words (or dwords)
http://www.arl.wustl.edu/~lockwood/class/cs306/books/artofasm/Chapter_6/CH06-3.html#HEADING3-42
Yes Master, I use them essentially when I've to divide/multiply by 2-4-8-16 in a fast way,
taking care of the
carry-boy :P
Quote from: bomz on August 27, 2010, 03:41:54 PM
masm32 applications limits by windows procedure, and serious code optimization needs only for huge computation.
Wow, that sounds like old Chinese wisdom. How old are you? And what do you mean concretely, in plain and correct English?
I recall BT performing fairly badly on old CPU's (386/486/P1 era) and at that time it was really only performance-useful for its ability to treat memory itself (rather than registers) as a huge array of bits
In the case of:
bt [esi], ecx
ecx can be any integer value (not just in the range 0..31) so effectively the BT instruction calculated the proper word displacement in memory..
esi + (ecx >> 5)
...and bit mask...
1 << (ecx & 31)
..for you...
In the case of in-register (bt eax, ecx) usage it was easily beaten at the time in any number of ways.
These days it is very efficient, but I suspect that it will compete with nearby non-dependent lea instructions (which has similar shift-then-add capabilities) for the same silicon
Intel(R) Pentium(R) Dual CPU E2160 @ 1.80GHz (SSE4)
2136 cycles for 1000*test eax, 1, result=500
2063 cycles for 1000*bt eax, 0, result=500
2344 cycles for 1000*test eax, 1, result=500
2028 cycles for 1000*bt eax, 0, result=500
--- ok ---
I really don't understand these results, this is a first try of the long nigh and "kabum", chaos teory again.
Without look at assembly level (opcode table,...) I tried to figure how both logics are done.
The "test" is a logical "and", and "and" is sequential, so, need run in all the path to reach the end.
In "bt", I imagine some "and"(sequential) with "or"(organizational/option)(multiplexer,demultiplexer), this can have more velocity, but, the grandfather 8086 cannot deal with one bit, his characteristics are from byte. You cannot deal with only one bit, if you like to set only one bit, you need work at least with one byte, then change the bit and send the byte again.
I humild cannot vote to a better based in my tests, but if I need vote in one, I will vote in one that get better results in all other cpu's.
regards. (sorry about my language).
Quote from: jj2007 on August 27, 2010, 06:15:21 PM
Quote from: bomz on August 27, 2010, 03:41:54 PM
masm32 applications limits by windows procedure, and serious code optimization needs only for huge computation.
Wow, that sounds like old Chinese wisdom. How old are you? And what do you mean concretely, in plain and correct English?
windows is not assembler code, this is code which need last modern processors or two, big hdd, modern video cards ...... MASM32 widely , accept own macros, use API functions. system distribute time according it own consideration. deep optimization of concrete code have no big sence, as optimization own style of programming
http://www.kolibrios.org/ - system need one floppy disk. poor IBM
Quote from: bomz on August 28, 2010, 12:01:25 PM
windows is not assembler code, this is code which need last modern processors or two, big hdd, modern video cards ...... MASM32 widely , accept own macros, use API functions. system distribute time according it own consideration. deep optimization of concrete code have no big sence, as optimization own style of programming
К сожалению, мой русский немного ржавый. Можете ли вы объяснить это на английском языке, пожалуйста? И можете ли вы привести конкретный пример? Спасибо.
ну какой тут можно привести пример. ты приведи пример как ты умудрился деоптимизировать код на ассемблере, что это сказалось на скорости выполнения программы
даже если пустой цикл вставишь. в любом случае есть разумное соотношение оптимизации программы к затраченному на это времени, при аксиме что любой код может быть оптимизирован.
Google believes you said:
QuoteWell what is there to give you an example. you give an example how you managed to deoptimizirovat code in assembler, it affected the speed of execution of the program even if the empty cycle to insert. In any case, there is a reasonable ratio to optimize the program spent for this time, Auxemite that any code can be optimized
bomz, if you believe you can do better, download the testbed, add your algo and show us that you are the champion. We respect good coders. Otherwise I would suggest that you spend some time learning English, because that is the language of this forum.
http://translate.google.com/
This problem with TICKS still alive and I remember about it. Absolutely accidentally I find this FASM code. It's complicate and I already not very good understand how it works, but I do seems working for MASM32. see below
;============================================================================
format pe gui
include '%fasminc%\win32a.inc'
;============================================================================
; выровнено на 4096 ; В А Ш И Д А Н Н Ы Е
;++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
;++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
iter = 10 ; кол-во проходов testing:
;============================================================================
section '.test' code readable writeable executable
;++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
tics dq 0 ;
overhead dd 0 ;
counter dd 0 ;
resultlist rq iter ;
templist rb iter*10+2 ; "технические данные"
message rb iter*26+1 ;
caption rb 64 ;
lpfmtm db '%.8X%.8X%.8X',13,10,0 ;
lpfmtc db '%0.8X / %u bytes / %u passes',0
;============================================================================
align 1024 ;
;============================================================================
entry $
invoke GetCurrentProcess
invoke SetPriorityClass,eax,REALTIME_PRIORITY_CLASS
invoke GetCurrentThread
invoke SetThreadPriority,eax,THREAD_PRIORITY_TIME_CRITICAL
;============================================================================
mov ebp,5 ;
align 16 ;
@@: mov eax,0 ;
cpuid ;
rdtsc ;
mov dword [tics],eax ; подсчет overhead
mov dword [tics+4],edx ; (используется для вычета
xor eax,eax ; тактов, ушедших на
cpuid ; "технические" детали)
xor eax,eax ;
cpuid ;
rdtsc ;
sub eax,dword [tics] ;
mov [overhead],eax ;
dec ebp ;
jnz @B ;
;============================================================================
; используйте esi edi ebp И Н И Ц И А Л И З А Ц И Я
;++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
;++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
; цикл проходов (итераций)
;============================================================================
align 16
testloop: times 8 : nop ; для выравнивания testing:
mov eax,0
cpuid
rdtsc
mov dword [tics],eax
mov dword [tics+4],edx
xor eax,eax
cpuid ; eax ecx edx ebx не сохраняется
;============================================================================
testing: ; выровнено на 16 ; Т Е С Т И Р У Е М Ы Е И Н С Т Р У К Ц И И
;++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
;++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
testsize = $-testing ; К О Н Е Ц И Н С Т Р У К Ц И Й
;============================================================================
xor eax,eax ; eax ecx edx ebx не сохраняется
cpuid
rdtsc
mov ebx,[counter]
mov dword [resultlist+ebx],eax
mov dword [resultlist+ebx+4],edx
mov eax,dword [tics]
mov edx,dword [tics+4]
add eax,[overhead]
adc edx,0
sub dword [resultlist+ebx],eax
sbb dword [resultlist+ebx+4],edx
add ebx,8
mov [counter],ebx
cmp ebx,iter * 8
jb testloop
;============================================================================
; вывод результатов
;============================================================================
invoke GetCurrentThread
invoke SetThreadPriority,eax,THREAD_PRIORITY_NORMAL
invoke GetCurrentProcess
invoke SetPriorityClass,eax,NORMAL_PRIORITY_CLASS
;============================================================================
finit
mov esi,resultlist
mov ebp,templist
mov edi,message
align 4
@@: fild qword [esi]
fabs
fbstp [ebp]
invoke wsprintf,edi,lpfmtm,[ebp+8],[ebp+4],[ebp]
add esp,20
add esi,8
add ebp,10
add edi,eax
sub [counter],8
jnz @B
invoke wsprintf,caption,lpfmtc,testing,testsize,iter
add esp,20
invoke MessageBox,0,message,caption,MB_ICONINFORMATION
finit
invoke ExitProcess,0
;============================================================================
data import
library kernel32,'kernel32.dll',user32,'user32.dll'
include '%fasminc%\apia\kernel32.inc'
include '%fasminc%\apia\user32.inc'
end data
;============================================================================
.586
.model flat, stdcall
option casemap :none
include \MASM32\INCLUDE\windows.inc
include \MASM32\INCLUDE\user32.inc
include \MASM32\INCLUDE\kernel32.inc
include \masm32\include\masm32.inc
includelib \MASM32\LIB\user32.lib
includelib \MASM32\LIB\kernel32.lib
includelib \masm32\lib\masm32.lib
.data?
Ticks LONG64 ?
Ticks1 LONG64 ?
CReg dt ?
FString db 21 dup(?)
.code
start:
invoke GetCurrentProcess
invoke SetPriorityClass,eax,REALTIME_PRIORITY_CLASS
invoke GetCurrentThread
invoke SetThreadPriority,eax,THREAD_PRIORITY_TIME_CRITICAL
mov ecx, 2
@@:
rdtsc
mov dword ptr [Ticks], eax
mov dword ptr [Ticks+4], edx
push ecx
;===============================================================
;================== Here insert testing code ===================
;===============================================================
pop ecx
rdtsc
mov dword ptr [Ticks1],eax
mov dword ptr [Ticks1+4],edx
loop @B
invoke GetCurrentThread
invoke SetThreadPriority,eax,THREAD_PRIORITY_NORMAL
invoke GetCurrentProcess
invoke SetPriorityClass,eax,NORMAL_PRIORITY_CLASS
fild Ticks
fild Ticks1
fsubr
fbstp CReg
lea edi,FString+18
lea esi,CReg
mov ecx, 10
@@:
xor eax, eax
lodsb
ror ax,4
shr ah, 4
add ax, 3030h
std
stosw
cld
loop @B
invoke MessageBox,0, addr FString,0,0
invoke ExitProcess,0
end start
At first run it empty and see how ticks use it, at my computer 88 ticks.
I find very interesting article about asm code optimization (sad Russian) there author talking about processors history from very beginning. SHR is not the slow command
hi hi in russian. Ray Duncan ??? Optimization asm program. can't find it in English
the code looks quite similar to MichaelW's code that we use...
http://www.masm32.com/board/index.php?topic=770.msg5281#msg5281
i think there is room for improvement - which is one of the projects i am currently working on
the problem lies with the use of CPUID for serializing instructions
on some CPU's, CPUID is, let's call it "erratic", as it does not always take the same number of ticks to execute
on most newer CPU's, it is more stable
still, it takes something like 80 clock cycles, which is kinda long :P
i think i have found a better way, but i have to wade through some other things to get there - lol
any code may be simplified
this 1.5 time quickly
mov ebx, eax
shr eax, 2
add eax, ebx
shr eax, 1
than
mov ebx, 10
mul ebx
strange. IntelGenuine (http://smiles.kolobok.us/big_standart/mocking.gif)
xor or and - very quickly.
MUL is pretty fast on newer processors
and - this code has a lot of dependancies
mov ebx,eax
shl eax,2
add eax,ebx
shl eax,1
not to mention code bytes
in a loop, MUL might be the better choice
also interesting....
on a P4, MUL wants the register multiplier, EAX, and EDX "settled"
it's a good idea to perform some unrelated instruction just before the MUL
IMUL is not so sensitive - it even seems to like it :P
also probably not true on newer cores
mul more universal. it multiply not only eax but put result to edx. also it mul not only 10. but in many cases we don't need mul edx and it possible simplify code
my code is not very correct. command CLD and STD get 48 ticks. PUSHF POPF too more than 50. may be thats true . of course CLD don't need so many ticks, but call some process.
yah - those instructions are slow
i think the OS has to check for priviledge level changes or something for POPFD :P
for the direction flag, it must keep track of the current state because API's expect it to be cleared
PUSHFD isn't too bad, because it does not alter flags
but CLD, STD, and POPFD are slow - about 100 cycles on my machine
CLC and STC are fast, at least
;std
;stosw
;cld
;mov word ptr[edi], ax
;sub edi, 2
stosw
sub edi, 4
:bg
mov [edi],ax
i suspect which is fastest depends on which generation of CPU it's running on
need mov [edi], ax
I think STOS use special line
word ptr - seems no differences Pentium 4
STOS slowly
(http://smiles.kolobok.us/light_skin/wacko.gif)
strange that all "combine" directives slowly. because than cpu execute code it never now the next command....may be my cpu very old
QuoteNEXT:
loop NEXT
slowly than 25%
QuoteNEXT:
sub ecx, 1
cmp ecx, 0
jnz NEXT
Quoteinc eax
slowly than
Quoteadd eax, 1
in my code was interesting situation lea quickly than sub
scasb lodsb movsb stosb cmpsb - slow
NEXT:
sub ecx, 1
cmp ecx, 0
jnz NEXT
no need to CMP
SUB will set or clear the zero flag :P
NEXT:
sub ecx, 1
jnz NEXT
DEC works, too - and is a single-byte instruction for dword general registers
NEXT:
dec ecx
jnz NEXT
with cmp do the same loop