call ret versus je jmp

Started by dsouza123, March 04, 2007, 12:09:43 AM

dsouza123

How should I optimize this tradeoff? Or is there a better way?
je/jmp: excellent branch prediction, but a duplicate subroutine for every place that uses it
.versus.
conditional call/ret: horrible branch prediction (?), stack use and overhead, but a single subroutine

The issue is code that conditionally jumps to a subroutine
but, when done, should return to the next instruction, as a call would.

Multiple places in the code would use this subroutine, so with je/jmp
and a jmp to return, each place needs its own copy of the subroutine
to ensure a correct return.

call has no conditional form, so it has to be guarded by a conditional
jump around it, and that usually means a branch penalty.


  mov m0, 1234567
  mov m1, 1234999
  mov esi, 0
  mov edi, 0

L0:
  cmp esi, m0
  je  Display0    ; Excellent branch prediction, Alternative using some sort of conditional call
L1:
  inc esi
  cmp esi, m1
  je  L2           ; is this better than   jne L0 __ jmp Display1
  jmp L0
L2:
  jmp  Display1    ; Alternative use a call
L3:
  inc edi
  cmp edi, 2
  je  Past
  jmp L0

Display0:
  mov _esi, esi
  mov _edi, edi
  invoke wsprintf, ADDR szBufHex, ADDR szRegHex, esi, edi
  invoke MessageBox, 0, ADDR szBufHex, Addr szCap, MB_OK
  mov esi, _esi
  mov edi, _edi
  jmp L1

Display1:
  mov _esi, esi
  mov _edi, edi
  invoke wsprintf, ADDR szBufHex, ADDR szRegHex, esi, edi
  invoke MessageBox, 0, ADDR szBufHex, Addr szCap, MB_OK
  mov esi, _esi
  mov edi, _edi
  jmp L3

Past:


.versus.


L0:
  cmp esi, m0
  jne L1         ; Branch taken almost always. Will this suffer a high branch prediction penalty ?
  call Display   ; Very rarely falls through to this.  Call overhead, stack use
L1:
  inc esi
  cmp esi, m1
  je  L2         ; Branch rarely taken. Minimal branch prediction penalty ?  Is jne L0 better ?
  jmp L0
L2:
  call Display   ; Very rarely taken.  Still call overhead, stack use
  inc edi
  cmp edi, 2
  je  Past
  jmp L0

Display:         ; single subroutine, return overhead
  mov _esi, esi
  mov _edi, edi
  invoke wsprintf, ADDR szBufHex, ADDR szRegHex, esi, edi
  invoke MessageBox, 0, ADDR szBufHex, Addr szCap, MB_OK
  mov esi, _esi
  mov edi, _edi
  ret

Past:

dsouza123

After testing different versions with proper alignment,
the je/jmp version with a jmp back is faster:
11 seconds versus 22 seconds for the call versions.

Ian_B

Can you not combine the best of both worlds?

Make sure you have a spare register. Put a return label everywhere you want to jump back to after the "subroutine". Before executing the jump test do an LEA reg, thislabel so that on entry to the routine you have a record of where you need to jump back to. Now just JMP reg to go back at the end of the "subroutine" to effectively do a RET. If you set OPTION NOSCOPED you can use your routine from multiple procedures without being restricted only to jump points within them.

Shouldn't this work, or am I missing something?

Ian_B

dsouza123

Ian, Great technique !
Thank You

Just tested it, it works !


  mov m0, 1234567
  mov m1, 1234999
  mov esi, 0
  mov edi, 0

L0:
  cmp esi, m0
  lea ebx, L1
  je  Display     ;  je  Display0
L1:
  inc esi
  cmp esi, m1
  lea ebx, L3
  je  Display     ;  je  Display1
  jne L0
L3:
  inc edi
  cmp edi, 2
  je  Past
  jmp L0


Display:
  mov _ebx, ebx
  mov _edi, edi
  mov _esi, esi
  invoke wsprintf, ADDR szBufHex, ADDR szRegHex, esi, edi, ebx
  invoke MessageBox, 0, ADDR szBufHex, Addr szCap, MB_OK
  mov esi, _esi
  mov edi, _edi
  mov ebx, _ebx
  jmp ebx


It runs two cycles, each showing the MessageBox twice,
each time with a different trigger value, loop count and return address,
then it shows the exit MessageBox and finishes.

Each use of the single Display subroutine returns to the correct location
for its call site, using the calculated jump as a faked return.

daydreamer

jmps and calls have at most very limited branch prediction, as in remembering where the last jmp/call went, and a miss flushes the code cache to re-read a new cache line.
I think you should use a jne/je combination, as conditional jumps have very advanced branch prediction.
It's also smaller code.
You could always try it and replace them with jmps if MASM reports an error because the jump is too far.

Also, if it's a small sub, I think it might be worth making a macro version of it and placing it directly after the conditional jump, if you want max speed at the cost of bigger size.
I was actually thinking of taking this code style to the max inside a worker thread, because with no invoke/call/ret pushing and popping I have all 8 general registers to use.
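
A minimal sketch of the macro idea above, written as a complete MASM32 program. It assumes the standard MASM32 include paths. The macro name ShowRegs is hypothetical, and the format string and caption are made up for illustration, since the thread never shows them. The display code is expanded inline at both places it is needed, so there is no call/ret and no jump back, at the cost of two copies of the code.

```asm
.386
.model flat, stdcall
option casemap:none

include \masm32\include\windows.inc
include \masm32\include\user32.inc
include \masm32\include\kernel32.inc
includelib \masm32\lib\user32.lib
includelib \masm32\lib\kernel32.lib

; Hypothetical macro: the body is pasted inline wherever ShowRegs appears.
; esi and edi are callee-saved in the Win32 API, so they survive the calls.
ShowRegs MACRO
    invoke wsprintf, ADDR szBufHex, ADDR szRegHex, esi, edi
    invoke MessageBox, 0, ADDR szBufHex, ADDR szCap, MB_OK
ENDM

.data
    szRegHex db "esi = %08X  edi = %08X", 0   ; assumed format string
    szCap    db "Trigger", 0                  ; assumed caption
    szBufHex db 64 dup (0)
    m0       dd 1234567
    m1       dd 1234999

.code
start:
    xor esi, esi
    xor edi, edi
L0:
    cmp esi, m0
    jne L1              ; common case: skip the inlined display code
    ShowRegs            ; rare case: falls through into the expansion
L1:
    inc esi
    cmp esi, m1
    jne L0              ; common case: keep looping
    ShowRegs            ; second inline copy
    inc edi
    cmp edi, 2
    jne L0
    invoke ExitProcess, 0
end start
```

Whether this beats the je/jmp version would have to be timed. The rare code now sits inside the loop, which is exactly what the out-of-line layout avoids.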

dsouza123

The original issue was max speed, since the Display routine is called rarely (only when the value is zero),
having the Display code outside of the main loop would shorten it.
Also having the Display code separated out allows changing its size without affecting
alignment in the main loop.
A secondary issue was code size/reuse, allowing the same code to be called from multiple places
and being able to return to the correct spot was needed.

The speed increase by not pushing and popping
and messing with the stack in general in this testcode is significant.

In a worker thread wouldn't you still need the two stack registers
or at the very least one of them ?

Would you use some sort of global pseudo registers like

.data
  _esp dd 0
  _ebp dd 0

to hold the esp and ebp on entry to the worker thread and restore them
before any API calls or the termination of the worker thread ?
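
A minimal sketch of that save/restore pattern, for ebp only, assuming a single worker thread (one global slot cannot serve two threads at once) and the standard MASM32 include paths. The names Worker, result and hThread are hypothetical. esp is left alone in this sketch. The same pattern would apply to it, but nothing could push or pop while it held other data.

```asm
.386
.model flat, stdcall
option casemap:none

include \masm32\include\windows.inc
include \masm32\include\kernel32.inc
includelib \masm32\lib\kernel32.lib

.data
    _ebp    dd 0            ; saved frame pointer (one worker only)
    result  dd 0            ; worker writes its answer here
    hThread dd 0

.code
Worker PROC lpParam:DWORD
    mov _ebp, ebp           ; save after MASM's prologue has set up ebp
    xor ebp, ebp            ; ebp is now a free scratch register
    mov ecx, 1000
@@:
    add ebp, ecx            ; sum 1000 + 999 + ... + 1 in ebp
    dec ecx
    jnz @B
    mov result, ebp         ; publish the result in global memory
    mov ebp, _ebp           ; restore before the epilogue (leave / ret 4)
    xor eax, eax            ; thread exit code
    ret
Worker ENDP

start:
    invoke CreateThread, NULL, 0, ADDR Worker, NULL, 0, NULL
    mov hThread, eax
    invoke WaitForSingleObject, hThread, INFINITE
    invoke CloseHandle, hThread
    invoke ExitProcess, result
end start
```

While ebp holds other data, lpParam and any locals are unreachable, so anything needed from them has to be copied out first.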

daydreamer

Quote from: dsouza123 on March 10, 2007, 12:07:11 PM
In a worker thread wouldn't you still need the two stack registers
or at the very least one of them ?

Would you use some sort of global pseudo registers like

.data
  _esp dd 0
  _ebp dd 0

to hold the esp and ebp on entry to the worker thread and restore them
before any API calls or the termination of the worker thread ?

By design, all API calls will be in WinMain or WndProc/TIMER.
When the API call creates a worker thread, the OS gives it its own set of registers, separate from the WinMain thread's.



dsouza123

So your worker threads would be short-lived,
with no need to wait to start, pause, or stop,
and no need to exchange information with the main thread ?

Would a thread do some short work or calculation,
write the result to global memory and set some other signal variable,
or return a value or pointer in a register when the thread is done ?

Like a regular proc, but one that takes too long and would make the system
unresponsive.