Testing an odd number

Started by frktons, August 27, 2010, 09:40:15 AM

dedndave


frktons

Quote from: mineiro on August 27, 2010, 02:57:05 PM
Hello Sr frktons;
I'm getting fluctuating results here: in the first few tests "test" wins, but after some 10 tests the result inverts and "bt" wins.
If I load some huge program while the test program is running, these results still hold here. Tonight I can do more tests for you.
regards.


Thanks mineiro. I suppose test and bt perform much the same.
If you run some more tests, let me know.

Quote from: dedndave on August 27, 2010, 04:27:58 PM
when looking at the shift instructions, a picture is worth a thousand words (or dwords)

http://www.arl.wustl.edu/~lockwood/class/cs306/books/artofasm/Chapter_6/CH06-3.html#HEADING3-42

Yes Master, I use them mainly when I have to divide/multiply by 2, 4, 8 or 16 quickly,
taking care of the carry-boy  :P

Mind is like a parachute. You know what to do in order to use it :-)

jj2007

Quote from: bomz on August 27, 2010, 03:41:54 PM
masm32 applications limits by windows procedure, and serious code optimization needs only for huge computation.

Wow, that sounds like old Chinese wisdom. How old are you? And what do you mean concretely, in plain and correct English?

Rockoon

I recall BT performing fairly badly on old CPUs (386/486/P1 era), and at that time it was really only performance-useful for its ability to treat memory itself (rather than registers) as a huge array of bits.

In the case of:

bt [esi], ecx

ecx can be any integer value (not just in the range 0..31), so effectively the BT instruction calculates the proper dword address in memory...

esi + 4 * (ecx >> 5)

...and bit mask...

1 << (ecx & 31)

..for you...
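
To make the addressing concrete, here is a small C sketch of what the memory form of BT works out. The function name `bit_test` is made up for illustration, and it assumes a non-negative bit index (the real instruction treats the register offset as signed) and a 32-bit operand size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: mimics `bt [base], index` for a 32-bit operand.
   Returns the selected bit, i.e. the value BT would leave in CF.
   Assumption: index is non-negative. */
int bit_test(const uint32_t *base, uint32_t index)
{
    assert(base != NULL);
    /* Which dword holds the bit; as a byte address this is base + 4 * (index >> 5). */
    const uint32_t *dword = base + (index >> 5);
    /* Which bit inside that dword. */
    uint32_t mask = (uint32_t)1 << (index & 31);
    return (*dword & mask) != 0;
}
```

With an array such as `{0x1, 0x80000000, 0x10}`, index 63 selects bit 31 of the second dword and index 68 selects bit 4 of the third.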

In the case of in-register (bt eax, ecx) usage, it was easily beaten at the time in any number of ways.

These days it is very efficient, but I suspect that it will compete with nearby non-dependent lea instructions (which have similar shift-then-add capabilities) for the same silicon.
When C++ compilers can be coerced to emit rcl and rcr, I *might* consider using one.

mineiro


Intel(R) Pentium(R) Dual  CPU  E2160  @ 1.80GHz (SSE4)
2136    cycles for 1000*test eax, 1, result=500
2063    cycles for 1000*bt eax, 0, result=500

2344    cycles for 1000*test eax, 1, result=500
2028    cycles for 1000*bt eax, 0, result=500
--- ok ---

I really don't understand these results. This is the first try of the long night and "kaboom", chaos theory again.
Without looking at the assembly level (opcode table, ...), I tried to figure out how both operations are done.
"test" is a logical "and", and "and" is sequential, so it needs to run the whole path to reach the end.
For "bt", I imagine some "and" (sequential) combined with "or" (selection: multiplexer, demultiplexer), which can be faster. But the grandfather 8086 cannot deal with a single bit; it works in bytes. If you want to set only one bit, you need to work with at least one byte: read it, change the bit, and write the byte back.
I humbly cannot vote for a winner based on my tests, but if I had to vote, I would vote for the one that gets better results on all the other CPUs.
regards. (sorry about my language).
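
For reference, both instructions look at the same bit. A rough C sketch of the two checks follows; the function names are invented here, and the 0..999 counting loop is only an assumption about what the testbed does, chosen because it is consistent with the `result=500` in the listings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors `test eax, 1`: AND with a mask of 1; a nonzero result means odd. */
int is_odd_test(uint32_t n)
{
    return (n & 1u) != 0;
}

/* Mirrors `bt eax, 0`: pick out bit 0 (BT would copy it into CF). */
int is_odd_bt(uint32_t n)
{
    return (int)((n >> 0) & 1u);
}

/* Counts the odd values in 0..limit-1 using the given predicate. */
unsigned count_odd(uint32_t limit, int (*is_odd)(uint32_t))
{
    assert(is_odd != NULL);
    unsigned count = 0;
    for (uint32_t n = 0; n < limit; n++) {
        if (is_odd(n)) {
            count++;
        }
    }
    return count;
}
```

Either predicate gives 500 for a limit of 1000, so any timing difference comes from the instruction itself rather than from the result.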

bomz

Quote from: jj2007 on August 27, 2010, 06:15:21 PM
Quote from: bomz on August 27, 2010, 03:41:54 PM
masm32 applications limits by windows procedure, and serious code optimization needs only for huge computation.

Wow, that sounds like old Chinese wisdom. How old are you? And what do you mean concretely, in plain and correct English?

windows is not assembler code, this is code which need last modern processors or two, big hdd, modern video cards... MASM32 widely accept own macros, use API functions. system distribute time according it own consideration. deep optimization of concrete code have no big sense, as optimization own style of programming

http://www.kolibrios.org/ - a system that needs only one floppy disk. Poor IBM.

jj2007

Quote from: bomz on August 28, 2010, 12:01:25 PM
windows is not assembler code, this is code which need last modern processors or two, big hdd, modern video cards... MASM32 widely accept own macros, use API functions. system distribute time according it own consideration. deep optimization of concrete code have no big sense, as optimization own style of programming

(translated from Russian) Unfortunately, my Russian is a little rusty. Could you explain this in English, please? And could you give a concrete example? Thanks.

bomz

(translated from Russian) Well, what kind of example could one give here? You give an example of how you managed to de-optimize assembler code so badly that it affected the program's execution speed,

even if you insert an empty loop. In any case, there is a reasonable ratio between optimizing a program and the time spent on it, given the axiom that any code can be optimized.

jj2007

Google believes you said:
Quote
Well what is there to give you an example. you give an example how you managed to deoptimizirovat code in assembler, it affected the speed of execution of the program even if the empty cycle to insert. In any case, there is a reasonable ratio to optimize the program spent for this time, Auxemite that any code can be optimized

bomz, if you believe you can do better, download the testbed, add your algo and show us that you are the champion. We respect good coders. Otherwise I would suggest that you spend some time learning English, because that is the language of this forum.

bomz

This problem with TICKS is still alive and I haven't forgotten it. Quite by accident I found this FASM code. It's complicated and I no longer understand very well how it works, but I made a version that seems to work for MASM32. See below.

;============================================================================
format     pe gui
include     '%fasminc%\win32a.inc'
;============================================================================
  ; aligned to 4096                           ; Y O U R   D A T A
;++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++



;++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
iter     =     10   ;  number of passes over testing:
;============================================================================
section     '.test' code readable writeable executable
;++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
tics     dq     0   ;
overhead    dd     0   ;
counter     dd     0   ;
resultlist  rq     iter   ;
templist    rb     iter*10+2   ;      "housekeeping data"
message     rb     iter*26+1   ;
caption     rb     64   ;
lpfmtm     db     '%.8X%.8X%.8X',13,10,0   ;
lpfmtc     db     '%0.8X / %u bytes / %u passes',0
;============================================================================
align     1024   ;
;============================================================================
entry     $
    invoke  GetCurrentProcess
    invoke  SetPriorityClass,eax,REALTIME_PRIORITY_CLASS
    invoke  GetCurrentThread
    invoke  SetThreadPriority,eax,THREAD_PRIORITY_TIME_CRITICAL
;============================================================================
    mov     ebp,5   ;
align     16   ;
@@:     mov     eax,0   ;
    cpuid   ;
    rdtsc   ;
    mov     dword [tics],eax   ;          overhead measurement
    mov     dword [tics+4],edx   ;  (used to subtract the
    xor     eax,eax   ;        cycles spent on
    cpuid   ;     "housekeeping" details)
    xor     eax,eax   ;
    cpuid   ;
    rdtsc   ;
    sub     eax,dword [tics]   ;
    mov     [overhead],eax   ;
    dec     ebp   ;
    jnz     @B   ;
;============================================================================
  ; use esi edi ebp                         I N I T I A L I Z A T I O N
;++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++



;++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
;                                               loop over passes (iterations)
;============================================================================
align     16
testloop:   times   8 : nop   ; to align testing:
    mov     eax,0
    cpuid
    rdtsc
    mov     dword [tics],eax
    mov     dword [tics+4],edx
    xor     eax,eax
    cpuid     ; eax ecx edx ebx are not preserved
;============================================================================
testing:  ; aligned to 16     ; I N S T R U C T I O N S   U N D E R   T E S T
;++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++



;++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
testsize    =     $-testing     ; E N D   O F   I N S T R U C T I O N S
;============================================================================
    xor     eax,eax     ; eax ecx edx ebx are not preserved
    cpuid
    rdtsc
    mov     ebx,[counter]
    mov     dword [resultlist+ebx],eax
    mov     dword [resultlist+ebx+4],edx
    mov     eax,dword [tics]
    mov     edx,dword [tics+4]
    add     eax,[overhead]
    adc     edx,0
    sub     dword [resultlist+ebx],eax
    sbb     dword [resultlist+ebx+4],edx
    add     ebx,8
    mov     [counter],ebx
    cmp     ebx,iter * 8
    jb     testloop
;============================================================================
;                                                           output of results
;============================================================================
    invoke  GetCurrentThread
    invoke  SetThreadPriority,eax,THREAD_PRIORITY_NORMAL
    invoke  GetCurrentProcess
    invoke  SetPriorityClass,eax,NORMAL_PRIORITY_CLASS
;============================================================================
    finit
    mov     esi,resultlist
    mov     ebp,templist
    mov     edi,message
align     4
@@:     fild    qword [esi]
    fabs
    fbstp   [ebp]
    invoke  wsprintf,edi,lpfmtm,[ebp+8],[ebp+4],[ebp]
    add     esp,20
    add     esi,8
    add     ebp,10
    add     edi,eax
    sub     [counter],8
    jnz     @B
    invoke  wsprintf,caption,lpfmtc,testing,testsize,iter
    add     esp,20
    invoke  MessageBox,0,message,caption,MB_ICONINFORMATION
    finit
    invoke  ExitProcess,0
;============================================================================
data     import
library     kernel32,'kernel32.dll',user32,'user32.dll'
include     '%fasminc%\apia\kernel32.inc'
include     '%fasminc%\apia\user32.inc'
end     data
;============================================================================


.586

.model flat, stdcall
option casemap :none

include \MASM32\INCLUDE\windows.inc
include \MASM32\INCLUDE\user32.inc
include \MASM32\INCLUDE\kernel32.inc
include \masm32\include\masm32.inc
includelib \MASM32\LIB\user32.lib
includelib \MASM32\LIB\kernel32.lib
includelib \masm32\lib\masm32.lib

.data?
Ticks LONG64 ?
Ticks1 LONG64 ?
CReg dt ?
FString db 21 dup(?)
.code
start:
invoke  GetCurrentProcess
invoke  SetPriorityClass,eax,REALTIME_PRIORITY_CLASS
invoke  GetCurrentThread
invoke  SetThreadPriority,eax,THREAD_PRIORITY_TIME_CRITICAL
mov ecx, 2
@@:
rdtsc
mov dword ptr [Ticks], eax
mov dword ptr [Ticks+4], edx
push ecx
;===============================================================
;================== Here insert testing code ===================
;===============================================================
pop ecx
rdtsc
mov dword ptr [Ticks1],eax
mov dword ptr [Ticks1+4],edx
loop @B

invoke  GetCurrentThread
invoke  SetThreadPriority,eax,THREAD_PRIORITY_NORMAL
invoke  GetCurrentProcess
invoke  SetPriorityClass,eax,NORMAL_PRIORITY_CLASS

fild Ticks
fild Ticks1
fsubr
fbstp CReg

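lea edi,FString+18      ; last digit pair of the 20-character output
lea esi,CReg            ; packed BCD written by fbstp, least significant byte first
mov ecx, 10             ; 10 bytes = 18 digits plus the sign byte
@@:
xor eax, eax
lodsb                   ; al = two packed BCD digits
ror ax,4                ; al = high digit, ah = low digit shifted left by 4
shr ah, 4               ; ah = low digit
add ax, 3030h           ; convert both digits to ASCII
std
stosw                   ; store the pair and step edi back by 2
cld
loop @B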

invoke MessageBox,0, addr FString,0,0

invoke ExitProcess,0

end start


First run it empty to see how many ticks the measurement itself takes; on my computer it is 88 ticks.
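
The digit loop at the end of the MASM32 listing unpacks the packed BCD that fbstp writes into CReg. A C sketch of the same idea is below; `bcd_to_ascii` is a name made up for illustration, and the layout (10 bytes, least significant byte first, sign in the last byte) is what fbstp produces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper mirroring the lodsb/ror/stosw loop.
   bcd: 10 bytes as stored by fbstp (least significant byte first,
        byte 9 is the sign byte).
   out: receives 20 ASCII characters, most significant first, plus a NUL. */
void bcd_to_ascii(const uint8_t bcd[10], char out[21])
{
    assert(bcd != NULL && out != NULL);
    for (int i = 0; i < 10; i++) {
        uint8_t b = bcd[i];
        int pos = 18 - 2 * i;                      /* fill from the right, like std/stosw */
        out[pos]     = (char)('0' + (b >> 4));     /* high nibble = more significant digit */
        out[pos + 1] = (char)('0' + (b & 0x0F));   /* low nibble = less significant digit */
    }
    out[20] = '\0';
}
```

Like the assembly loop, this treats the sign byte as two more digits, so a negative value would show up with a leading "80".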

I found a very interesting article about asm code optimization (sadly in Russian) in which the author covers processor history from the very beginning. SHR is not a slow instruction.



Hee hee, it's in Russian. Ray Duncan??? Optimization of asm programs. I can't find it in English.

dedndave

the code looks quite similar to MichaelW's code that we use...

http://www.masm32.com/board/index.php?topic=770.msg5281#msg5281

i think there is room for improvement - which is one of the projects i am currently working on
the problem lies with the use of CPUID for serializing instructions
on some CPUs, CPUID is, let's call it "erratic", as it does not always take the same number of ticks to execute
on most newer CPUs, it is more stable
still, it takes something like 80 clock cycles, which is kinda long   :P

i think i have found a better way, but i have to wade through some other things to get there - lol

bomz

any code may be simplified

bomz

this is 1.5 times faster
mov ebx, eax
shr eax, 2
add eax, ebx
shr eax, 1

than
mov ebx, 10
mul ebx


Strange. GenuineIntel
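
A note on the shift-and-add sequence above, as a C sketch. It assumes the comparison with `mul ebx` means a multiply by 10 was intended; that needs left shifts (shl), while the sequence as posted uses shr and works out to roughly x*5/8. Both function names are invented for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply by 10 with shifts and one add: ((x << 2) + x) << 1 = 10 * x.
   The result wraps modulo 2^32, like the low dword of `mul ebx`. */
uint32_t mul10_shift_add(uint32_t x)
{
    uint32_t t = x;   /* mov ebx, eax        */
    x <<= 2;          /* shl eax, 2   -> 4x  */
    x += t;           /* add eax, ebx -> 5x  */
    x <<= 1;          /* shl eax, 1   -> 10x */
    return x;
}

/* The sequence exactly as posted, with shr: ((x >> 2) + x) >> 1. */
uint32_t shr_sequence_as_posted(uint32_t x)
{
    return ((x >> 2) + x) >> 1;
}

/* Quick self-check of both variants. */
void self_check(void)
{
    assert(mul10_shift_add(80) == 800);
    assert(shr_sequence_as_posted(80) == 50);
}
```

So the timing comparison only says something about multiplication if the shl form is the one being measured.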

bomz

xor, or, and: very fast.