How to optimize a tradeoff, or is there a better way?
je/jmp: excellent branch prediction, but a duplicate subroutine
.versus.
conditional call/ret: horrible branch prediction? Uses the stack and has call overhead, but a single subroutine
The issue is code that conditionally jumps to a subroutine,
but when done should return to the next instruction, like a call.
The problem is that multiple places in the code would use this subroutine,
so with je/jmp and a jmp to return, each place needs its own duplicate
of the subroutine to ensure a correct return.
x86 has no conditional call, so using call means branching around it,
which usually means a branch penalty.
mov m0, 1234567
mov m1, 1234999
mov esi, 0
mov edi, 0
L0:
cmp esi, m0
je Display0 ; Excellent branch prediction, Alternative using some sort of conditional call
L1:
inc esi
cmp esi, m1
je L2 ; is this better than jne L0 __ jmp Display1
jmp L0
L2:
jmp Display1 ; Alternative use a call
L3:
inc edi
cmp edi, 2
je Past
jmp L0
Display0:
mov _esi, esi
mov _edi, edi
invoke wsprintf, ADDR szBufHex, ADDR szRegHex, esi, edi
invoke MessageBox, 0, ADDR szBufHex, Addr szCap, MB_OK
mov esi, _esi
mov edi, _edi
jmp L1
Display1:
mov _esi, esi
mov _edi, edi
invoke wsprintf, ADDR szBufHex, ADDR szRegHex, esi, edi
invoke MessageBox, 0, ADDR szBufHex, Addr szCap, MB_OK
mov esi, _esi
mov edi, _edi
jmp L3
Past:
.versus.
L0:
cmp esi, m0
jne L1 ; Branch taken almost always. Will this suffer a high branch prediction penalty ?
call Display ; Very rarely falls through to this. Call overhead, stack use
L1:
inc esi
cmp esi, m1
je L2 ; Branch rarely taken. Minimal branch prediction penalty ? Is jne L0 better ?
jmp L0
L2:
call Display ; Very rarely taken. Still call overhead, stack use
inc edi
cmp edi, 2
je Past
jmp L0
Display: ; single subroutine, return overhead
mov _esi, esi
mov _edi, edi
invoke wsprintf, ADDR szBufHex, ADDR szRegHex, esi, edi
invoke MessageBox, 0, ADDR szBufHex, Addr szCap, MB_OK
mov esi, _esi
mov edi, _edi
ret
Past:
After testing different versions,
with proper alignment the je/jmp with jmp back is faster
taking 11 seconds versus 22 seconds for the call versions.
Can you not combine the best of both worlds?
Make sure you have a spare register. Put a return label everywhere you want to jump back to after the "subroutine". Before executing the jump test do an LEA reg, thislabel so that on entry to the routine you have a record of where you need to jump back to. Now just JMP reg to go back at the end of the "subroutine" to effectively do a RET. If you set OPTION NOSCOPED you can use your routine from multiple procedures without being restricted only to jump points within them.
Shouldn't this work, or am I missing something?
Ian_B
Ian, Great technique !
Thank You
Just tested it, it works !
mov m0, 1234567
mov m1, 1234999
mov esi, 0
mov edi, 0
L0:
cmp esi, m0
lea ebx, L1
je Display ; je Display0
L1:
inc esi
cmp esi, m1
lea ebx, L3
je Display ; je Display1
jne L0
L3:
inc edi
cmp edi, 2
je Past
jmp L0
Display:
mov _ebx, ebx
mov _edi, edi
mov _esi, esi
invoke wsprintf, ADDR szBufHex, ADDR szRegHex, esi, edi, ebx
invoke MessageBox, 0, ADDR szBufHex, Addr szCap, MB_OK
mov esi, _esi
mov edi, _edi
mov ebx, _ebx
jmp ebx
It runs two passes, each showing the MessageBox twice,
with a different trigger value, loop count and return jump location each time,
then it shows the exit MessageBox and finishes.
Each use of the single Display subroutine returns to the correct location
using the calculated jump/faked return.
As far as I know, jmp and call get at most very limited branch prediction (the CPU just remembers where the last jmp/call went), and a miss means the code has to be re-fetched from a new cache line.
I think you should use a jne/je combination, as conditional jumps have very advanced branch prediction.
It's also smaller code.
You could always try it and replace them with jmps if MASM reports an error because the jump is too far.
Also, if it's a small sub, it might be worth making a macro version of it and placing it directly after the conditional jump, if you want max speed at the cost of bigger size.
I was actually thinking of taking this code style to the max inside a worker thread: with no invoke/call/ret pushing and popping, I would have all 8 general registers to use.
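A rough sketch of the macro idea, reusing the buffers and the _esi/_edi variables from the code earlier in the thread. ShowRegs is a made-up macro name, and this is untested, so treat it as an illustration only:

```masm
; Sketch only: ShowRegs is a hypothetical name, not from the original code.
; szBufHex, szRegHex, szCap, _esi, _edi, m0 and m1 are assumed to be
; declared as in the earlier listings.
ShowRegs MACRO
    mov _esi, esi
    mov _edi, edi
    invoke wsprintf, ADDR szBufHex, ADDR szRegHex, esi, edi
    invoke MessageBox, 0, ADDR szBufHex, ADDR szCap, MB_OK
    mov esi, _esi
    mov edi, _edi
ENDM

L0:
    cmp esi, m0
    jne @F              ; common case: jump over the inlined body
    ShowRegs            ; rare case: body expanded in place, no call/ret
@@:
    inc esi
    cmp esi, m1
    jne L0              ; common case: keep looping
    ShowRegs            ; second expansion, a second copy of the body
```

Every use of the macro expands to a full copy of the body, so the speed comes at the cost of code size at each use site.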
The original issue was max speed, since the Display routine is called rarely (only when the value is zero),
having the Display code outside of the main loop would shorten it.
Also having the Display code separated out allows changing its size without affecting
alignment in the main loop.
A secondary issue was code size/reuse, allowing the same code to be called from multiple places
and being able to return to the correct spot was needed.
The speed increase by not pushing and popping
and messing with the stack in general in this testcode is significant.
In a worker thread wouldn't you still need the two stack registers
or at the very least one of them ?
Would you use some sort of global pseudo registers like
.data
_esp dd 0
_ebp dd 0
to hold the esp and ebp on entry to the worker thread and restore them
before any API calls or the termination of the worker thread ?
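To make the question concrete, here is a minimal sketch of what I mean. It assumes there is only ONE worker thread, since the globals would be shared by every thread in the process. WorkerThread is a made-up name, and only ebp is actually reused here; esp is saved but left alone:

```masm
; Sketch only, untested. Assumes a single worker thread, because
; _esp and _ebp are globals. WorkerThread is a hypothetical name.
.data?
_esp dd ?
_ebp dd ?

.code
WorkerThread PROC lpParam:DWORD     ; MASM emits push ebp / mov ebp, esp
    mov _esp, esp                   ; remember the stack registers on entry
    mov _ebp, ebp

    ; Long-running work goes here. ebp is free as a scratch register
    ; as long as lpParam is not referenced (it is addressed through ebp).
    xor ebp, ebp

    mov esp, _esp                   ; restore before any API call or return
    mov ebp, _ebp
    xor eax, eax                    ; thread exit code 0
    ret                             ; MASM emits leave / ret 4
WorkerThread ENDP
```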
Quote from: dsouza123 on March 10, 2007, 12:07:11 PM
In a worker thread wouldn't you still need the two stack registers
or at the very least one of them ?
Would you use some sort of global pseudo registers like
.data
_esp dd 0
_ebp dd 0
to hold the esp and ebp on entry to the worker thread and restore them
before any API calls or the termination of the worker thread ?
By design, all API calls will be in WinMain or the WndProc/timer handler.
When the API is called to create a worker thread, the OS gives that thread its own set of registers, separate from the WinMain thread.
So your worker threads would be short-lived,
with no need to wait to start, pause, or stop,
or any need for information transfer with the main thread ?
Would a thread do some short piece of work or calculation,
write the result to global memory and set some other signal variable,
or return a value or pointer in a register when the thread is done ?
Like a regular proc, but one which takes too long and would make the system
unresponsive.