Testing an odd number

dedndave · August 27, 2010, 04:27:58 PM

when looking at the shift instructions, a picture is worth a thousand words (or dwords)

http://www.arl.wustl.edu/~lockwood/class/cs306/books/artofasm/Chapter_6/CH06-3.html#HEADING3-42

frktons · August 27, 2010, 05:21:24 PM

Quote from: mineiro on August 27, 2010, 02:57:05 PM
Hello Sr frktons;
I'm getting fluctuating results here: in the first few tests "test" wins, but after some 10 tests the result inverts and "bt" wins.
If I load some huge program while running the test program, these results persist here. At night I can do more tests for you.
regards.

Thanks mineiro. I suppose test and bt perform about the same.
If you run some more tests, let me know.

Quote from: dedndave on August 27, 2010, 04:27:58 PM
when looking at the shift instructions, a picture is worth a thousand words (or dwords)

http://www.arl.wustl.edu/~lockwood/class/cs306/books/artofasm/Chapter_6/CH06-3.html#HEADING3-42

Yes Master, I use them mainly when I have to divide/multiply by 2, 4, 8 or 16 quickly,
taking care of the carry-boy :P

jj2007 · August 27, 2010, 06:15:21 PM

Quote from: bomz on August 27, 2010, 03:41:54 PM
masm32 applications limits by windows procedure, and serious code optimization needs only for huge computation.

Wow, that sounds like old Chinese wisdom. How old are you? And what do you mean concretely, in plain and correct English?

Rockoon · August 27, 2010, 07:29:22 PM

I recall BT performing fairly badly on old CPUs (386/486/P1 era); at that time it was really only useful performance-wise for its ability to treat memory itself (rather than registers) as a huge array of bits.

In the case of:

bt [esi], ecx

ecx can be any integer value (not just in the range 0..31), so effectively the BT instruction calculates the proper dword displacement in memory...

esi + 4 * (ecx >> 5)

...and bit mask...

1 << (ecx & 31)

..for you...

In the case of in-register (bt eax, ecx) usage it was easily beaten at the time in any number of ways.

These days it is very efficient, but I suspect that it will compete with nearby non-dependent lea instructions (which have similar shift-then-add capabilities) for the same silicon.
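
The displacement and bit-mask arithmetic described above can be sketched in C. This is only an illustration: `bt_mem` is a hypothetical helper name, not anything from the thread, and it assumes a non-negative bit index (the real instruction treats the register offset as signed, which the sketch does not model).

```c
#include <stdint.h>

/* Hypothetical helper: returns the bit that `bt [base], index` would
   copy into the carry flag. Assumes index is non-negative. */
static int bt_mem(const uint32_t *base, uint32_t index)
{
    /* Pointer arithmetic on uint32_t* already scales by 4 bytes,
       so this is base + 4 * (index >> 5) in byte terms. */
    const uint32_t *dword = base + (index >> 5);

    /* Position of the bit inside the selected dword. */
    uint32_t mask = (uint32_t)1 << (index & 31);

    return (*dword & mask) != 0;
}
```

For example, with a four-dword array, index 68 selects dword 2 (68 >> 5) and bit 4 (68 & 31).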

mineiro · August 27, 2010, 09:57:15 PM

Code:

Intel(R) Pentium(R) Dual  CPU  E2160  @ 1.80GHz (SSE4)
2136    cycles for 1000*test eax, 1, result=500
2063    cycles for 1000*bt eax, 0, result=500

2344    cycles for 1000*test eax, 1, result=500
2028    cycles for 1000*bt eax, 0, result=500
--- ok ---

I really don't understand these results; this is the first try of the long night and "kaboom", chaos theory again.
Without looking at the assembly level (opcode tables, ...) I tried to figure out how both operations work.
The "test" is a logical "and", and "and" is sequential, so it needs to run the whole path to reach the end.
In "bt", I imagine some "and" (sequential) combined with "or" (organizational/option) (multiplexer, demultiplexer); this can be faster, but the grandfather 8086 cannot deal with one bit, it works in bytes. You cannot deal with only one bit: if you want to set only one bit, you need to work with at least one byte, then change the bit and write the byte back.
I humbly cannot vote for a better one based on my tests, but if I had to vote for one, I would vote for the one that gets better results on all other CPUs.
regards. (sorry about my language).
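
For readers comparing the two instructions, here is a minimal C sketch of what each one decides when used as an oddness test. The function names are made up for illustration, and the idea that "result=500" comes from counting odd values among 1000 consecutive integers is an assumption; the thread does not show the timed loop itself.

```c
#include <stdint.h>

/* Mirrors `test eax, 1`: AND with a mask; nonzero result means odd. */
static int is_odd_test(uint32_t x)
{
    return (x & 1u) != 0;
}

/* Mirrors `bt eax, bit` on a register: the bit offset is taken
   modulo 32 and the selected bit is what lands in the carry flag. */
static int bit_test(uint32_t x, unsigned bit)
{
    return (int)((x >> (bit & 31u)) & 1u);
}

/* Mirrors `bt eax, 0`: bit 0 set means odd. */
static int is_odd_bt(uint32_t x)
{
    return bit_test(x, 0);
}

/* Counts the odd values in [0, n) using the given predicate. */
static unsigned count_odd(unsigned n, int (*pred)(uint32_t))
{
    unsigned count = 0;
    for (uint32_t i = 0; i < n; ++i)
        count += (unsigned)pred(i);
    return count;
}
```

Both predicates give the same answer for every input, so any difference between the two is purely a matter of timing.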

bomz · August 28, 2010, 12:01:25 PM

Quote from: jj2007 on August 27, 2010, 06:15:21 PM
Quote from: bomz on August 27, 2010, 03:41:54 PM
masm32 applications limits by windows procedure, and serious code optimization needs only for huge computation.

Wow, that sounds like old Chinese wisdom. How old are you? And what do you mean concretely, in plain and correct English?

windows is not assembler code, this is code which need last modern processors or two, big hdd, modern video cards ...... MASM32 widely , accept own macros, use API functions. system distribute time according it own consideration. deep optimization of concrete code have no big sence, as optimization own style of programming

http://www.kolibrios.org/ - system need one floppy disk. poor IBM

jj2007 · August 28, 2010, 03:27:23 PM

Quote from: bomz on August 28, 2010, 12:01:25 PM
windows is not assembler code, this is code which need last modern processors or two, big hdd, modern video cards ...... MASM32 widely , accept own macros, use API functions. system distribute time according it own consideration. deep optimization of concrete code have no big sence, as optimization own style of programming

(in Russian) Unfortunately, my Russian is a bit rusty. Could you explain this in English, please? And could you give a concrete example? Thanks.

bomz · August 28, 2010, 04:49:17 PM

(in Russian) Well, what example could I give here? You give an example of how you managed to de-optimize assembler code so badly that it affected the program's execution speed

even if you insert an empty loop. In any case there is a reasonable ratio between optimizing a program and the time spent on it, given the axiom that any code can be optimized.

jj2007 · August 28, 2010, 05:11:47 PM

Google believes you said:

Quote: Well what is there to give you an example. you give an example how you managed to deoptimizirovat code in assembler, it affected the speed of execution of the program even if the empty cycle to insert. In any case, there is a reasonable ratio to optimize the program spent for this time, Auxemite that any code can be optimized

bomz, if you believe you can do better, download the testbed, add your algo and show us that you are the champion. We respect good coders. Otherwise I would suggest that you spend some time learning English, because that is the language of this forum.

bomz · August 28, 2010, 05:16:28 PM

http://translate.google.com/

bomz · June 05, 2011, 12:09:13 PM

This problem with TICKS is still alive and I remember it. Quite by accident I found this FASM code. It's complicated and I no longer understand very well how it works, but I made what seems to be a working version for MASM32. See below.

Code:

;============================================================================
format	    pe gui
include     '%fasminc%\win32a.inc'
;============================================================================
	  ; aligned to 4096                           ; Y O U R   D A T A
;++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++



;++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
iter	    =	    10				  ;  number of passes over testing:
;============================================================================
section     '.test' code readable writeable executable
;++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
tics	    dq	    0				  ;
overhead    dd	    0				  ;
counter     dd	    0				  ;
resultlist  rq	    iter			  ;
templist    rb	    iter*10+2			  ;       "housekeeping data"
message     rb	    iter*26+1			  ;
caption     rb	    64				  ;
lpfmtm	    db	    '%.8X%.8X%.8X',13,10,0	  ;
lpfmtc	    db	    '%0.8X / %u bytes / %u passes',0
;============================================================================
align	    1024				  ;
;============================================================================
entry	    $
	    invoke  GetCurrentProcess
	    invoke  SetPriorityClass,eax,REALTIME_PRIORITY_CLASS
	    invoke  GetCurrentThread
	    invoke  SetThreadPriority,eax,THREAD_PRIORITY_TIME_CRITICAL
;============================================================================
	    mov     ebp,5			  ;
align	    16					  ;
@@:	    mov     eax,0			  ;
	    cpuid				  ;
	    rdtsc				  ;
	    mov     dword [tics],eax		  ;      overhead measurement
	    mov     dword [tics+4],edx		  ;     (used to subtract the
	    xor     eax,eax			  ;       cycles spent on the
	    cpuid				  ;    "housekeeping" details)
	    xor     eax,eax			  ;
	    cpuid				  ;
	    rdtsc				  ;
	    sub     eax,dword [tics]		  ;
	    mov     [overhead],eax		  ;
	    dec     ebp 			  ;
	    jnz     @B				  ;
;============================================================================
	  ; use esi edi ebp                         I N I T I A L I Z A T I O N
;++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++



;++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
;                                                       pass (iteration) loop
;============================================================================
align	    16
testloop:   times   8 : nop			  ; to align testing:
	    mov     eax,0
	    cpuid
	    rdtsc
	    mov     dword [tics],eax
	    mov     dword [tics+4],edx
	    xor     eax,eax
	    cpuid			    ; eax ecx edx ebx are not preserved
;============================================================================
testing:  ; aligned to 16       ; I N S T R U C T I O N S   U N D E R   T E S T
;++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++



;++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
testsize    =	    $-testing		    ; E N D   O F   I N S T R U C T I O N S
;============================================================================
	    xor     eax,eax		    ; eax ecx edx ebx are not preserved
	    cpuid
	    rdtsc
	    mov     ebx,[counter]
	    mov     dword [resultlist+ebx],eax
	    mov     dword [resultlist+ebx+4],edx
	    mov     eax,dword [tics]
	    mov     edx,dword [tics+4]
	    add     eax,[overhead]
	    adc     edx,0
	    sub     dword [resultlist+ebx],eax
	    sbb     dword [resultlist+ebx+4],edx
	    add     ebx,8
	    mov     [counter],ebx
	    cmp     ebx,iter * 8
	    jb	    testloop
;============================================================================
;                                                           output of results
;============================================================================
	    invoke  GetCurrentThread
	    invoke  SetThreadPriority,eax,THREAD_PRIORITY_NORMAL
	    invoke  GetCurrentProcess
	    invoke  SetPriorityClass,eax,NORMAL_PRIORITY_CLASS
;============================================================================
	    finit
	    mov     esi,resultlist
	    mov     ebp,templist
	    mov     edi,message
align	    4
@@:	    fild    qword [esi]
	    fabs
	    fbstp   [ebp]
	    invoke  wsprintf,edi,lpfmtm,[ebp+8],[ebp+4],[ebp]
	    add     esp,20
	    add     esi,8
	    add     ebp,10
	    add     edi,eax
	    sub     [counter],8
	    jnz     @B
	    invoke  wsprintf,caption,lpfmtc,testing,testsize,iter
	    add     esp,20
	    invoke  MessageBox,0,message,caption,MB_ICONINFORMATION
	    finit
	    invoke  ExitProcess,0
;============================================================================
data	    import
library     kernel32,'kernel32.dll',user32,'user32.dll'
include     '%fasminc%\apia\kernel32.inc'
include     '%fasminc%\apia\user32.inc'
end	    data
;============================================================================

Code:

.586

.model flat, stdcall
option casemap :none

	include \MASM32\INCLUDE\windows.inc
	include \MASM32\INCLUDE\user32.inc
	include \MASM32\INCLUDE\kernel32.inc
	include \masm32\include\masm32.inc
	includelib \MASM32\LIB\user32.lib
	includelib \MASM32\LIB\kernel32.lib
	includelib \masm32\lib\masm32.lib

.data?
	Ticks		LONG64 ?
	Ticks1		LONG64 ?
	CReg		dt ?
	FString		db 21 dup(?)
.code
start:
	invoke  GetCurrentProcess
	invoke  SetPriorityClass,eax,REALTIME_PRIORITY_CLASS
	invoke  GetCurrentThread
	invoke  SetThreadPriority,eax,THREAD_PRIORITY_TIME_CRITICAL
	mov ecx, 2
@@:
	rdtsc
	mov dword ptr [Ticks], eax
	mov dword ptr [Ticks+4], edx
	push ecx
;===============================================================
;================== Here insert testing code ===================
;===============================================================
	pop ecx
	rdtsc
	mov dword ptr [Ticks1],eax
	mov dword ptr [Ticks1+4],edx
	loop @B

	invoke  GetCurrentThread
	invoke  SetThreadPriority,eax,THREAD_PRIORITY_NORMAL
	invoke  GetCurrentProcess
	invoke  SetPriorityClass,eax,NORMAL_PRIORITY_CLASS

	fild Ticks
	fild Ticks1
	fsubr
	fbstp CReg

	lea edi,FString+18
	lea esi,CReg
	mov ecx, 10
@@:
	xor eax, eax
	lodsb
	ror ax,4
	shr ah, 4
	add ax, 3030h
	std
	stosw
	cld
	loop @B

	invoke MessageBox,0, addr FString,0,0

	invoke ExitProcess,0

end start

First run it empty to see how many ticks the harness itself uses; on my computer it is 88 ticks.

I found a very interesting article about asm code optimization (sadly in Russian) in which the author covers processor history from the very beginning. SHR is not a slow instruction.

Heh, in Russian. Ray Duncan ??? Optimizing asm programs. I can't find it in English.
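
The digit loop at the end of the MASM32 listing above (lodsb / ror / shr / add 3030h / stosw) unpacks the 10-byte packed-BCD value written by fbstp into text. A minimal C sketch of the same unpacking, for anyone who finds the register tricks hard to follow; `bcd_to_ascii` is a hypothetical name used only here.

```c
#include <stdint.h>

/* Hypothetical helper: expands a 10-byte packed-BCD value (the layout
   fbstp stores: least significant byte first, two decimal digits per
   byte) into 20 ASCII characters plus a terminating NUL, most
   significant digit first. Like the MASM32 loop, it also expands
   byte 9 (the sign byte), so a non-negative value starts with "00". */
static void bcd_to_ascii(const uint8_t bcd[10], char out[21])
{
    for (int i = 0; i < 10; ++i) {
        uint8_t b = bcd[i];
        out[19 - 2 * i] = (char)('0' + (b & 0x0F)); /* low nibble: lower digit   */
        out[18 - 2 * i] = (char)('0' + (b >> 4));   /* high nibble: higher digit */
    }
    out[20] = '\0';
}
```

With the 88 ticks mentioned above, byte 0 of the BCD buffer would be 88h and the string would end in "88", padded on the left with zeros.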

dedndave · June 05, 2011, 01:02:38 PM

the code looks quite similar to MichaelW's code that we use...

http://www.masm32.com/board/index.php?topic=770.msg5281#msg5281

i think there is room for improvement - which is one of the projects i am currently working on
the problem lies with the use of CPUID for serializing instructions
on some CPUs, CPUID is, let's call it "erratic", as it does not always take the same number of ticks to execute
on most newer CPUs, it is more stable
still, it takes something like 80 clock cycles, which is kinda long :P

i think i have found a better way, but i have to wade through some other things to get there - lol

bomz · June 05, 2011, 01:43:14 PM

any code may be simplified

bomz · June 05, 2011, 07:57:43 PM

this is 1.5 times faster

Code:

	mov ebx, eax
	shr eax, 2
	add eax, ebx
	shr eax, 1

than

Code:

mov ebx, 10
mul ebx

strange. GenuineIntel
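
One thing worth checking before comparing the timings: as posted, the shift sequence uses shr, so it does not produce the same value as mul by 10. The same sequence with shl does. Whether shl was what was meant is only a guess; the C sketch below (function names are illustrative) just makes the arithmetic of each variant explicit.

```c
#include <stdint.h>

/* The sequence exactly as posted:
   mov ebx,eax / shr eax,2 / add eax,ebx / shr eax,1 */
static uint32_t posted_shr_sequence(uint32_t x)
{
    uint32_t ebx = x;
    x >>= 2;      /* x / 4           */
    x += ebx;     /* x + x / 4       */
    x >>= 1;      /* (x + x / 4) / 2 */
    return x;
}

/* Same shape with shl instead of shr (an assumption about the intent). */
static uint32_t shl_times_ten(uint32_t x)
{
    uint32_t ebx = x;
    x <<= 2;      /* 4x  */
    x += ebx;     /* 5x  */
    x <<= 1;      /* 10x */
    return x;
}

/* Low 32 bits of `mov ebx,10 / mul ebx` (the edx half is ignored here). */
static uint32_t mul_times_ten(uint32_t x)
{
    return x * 10u;
}
```

For an input of 8, the posted sequence yields 5, while the shl variant and the multiply both yield 80.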

bomz · June 05, 2011, 07:59:51 PM

xor, or, and: very fast.
