Hello, I wonder if this function could be optimized. I need to XOR 4 bits from one byte and RCL the result to 16 byte buffer.
I started with single AND instruction and used the parity flag, equal to XNOR result:
// bits to xor: 7,6,5,0
mov al, somebyte
and al, 11100001 // PF = al.7 xnor al.6 xnor al.5 xnor al.0
// rotate PF flag to CF flag
pushf
shr byte[esp],2
popf
// rotate the buffer with carry
rcl dword [esi],1 ...
It was slow, because of the additional stack operations. Summary 432ms for 18 mega loops, on centrino M 1.86GHz
First optimization: replace push/pop with stc + jp + clc. Much better: 184ms.
Second optimization: current version with 4*xor; 157ms
I see that every instruction inside __lfsr_128 must wait for previous instruction to complete. Any ideas?
main:
// required subsystem: console
// press any key to quit
sub esp, 20 ; 128 bit buffer for LFSR + 4 bytes for counter
; seed the buffer
mov [esp ],esp
mov [esp+ 4],esp
mov [esp+ 8],esp
mov [esp+12],esp
; enter main loop
invoke GetCurrentThread
invoke SetThreadPriority, eax, THREAD_PRIORITY_TIME_CRITICAL
align 4
mainloop:
invoke _kbhit
and al,al
jnz quit
invoke GetTickCount
mov [esp+16],eax
; number of loops equal to 10 mega bytes
mov ecx,1024*1024*8
childloop:
call __lfsr_128, esp
loop childloop
invoke GetTickCount
sub eax,[esp+16]
cinvoke printf, "time: %d ms\n", eax
invoke Sleep, 1
jmp mainloop
quit:
invoke GetCurrentThread
invoke SetThreadPriority, eax, THREAD_PRIORITY_NORMAL
add esp, 20
ret
;------------------------
align 4
__lfsr_128:
; 128 bit linear feedback shift register
; Feedback bits: 127, 126, 125, 120
mov edx,[esp+4] ; m_bufer ; AL bits 7 6 5 0
mov al,[edx+15] ; m_bufer[3]; bits 127, 126, 125, ..., 120
mov ah,al ; ah=765....0
add ah,ah ; ah=65....0.
xor al,ah ; al = al.7 xor al.6
add ah,ah ; ah=5....0..
xor al,ah ; al = al.7 xor al.6 xor al.5
shl ah,5 ; ah=0.......
xor al,ah ; al = al.7 xor al.6 xor al.5 xor al.0
add al,al ; put al.7 into CY
; now shift the 128 bits and append the bit from XOR result
; ___
; m_bufer[3].31 -> |XOR| ___
; m_bufer[3].30 -> |___| -> |XOR| ->--------------->-----+
; m_bufer[3].29 -> |XOR| -> |___| |
; m_bufer[3].24 -> |___| |
; |
; m_bufer[3] <- m_bufer[2] <- m_bufer[1] <- m_bufer[0] <-+
rcl dword [edx ],1
rcl dword [edx+ 4],1
rcl dword [edx+ 8],1
rcl dword [edx+12],1
ret 4
I'm not entirely sure what you're doing here - but your idea of using the parity flag is a good one. The method you use to move the flag is the problem...
Secondly I'm not entirely sure why you want to do this long string of rcls. The order is deterministic, the first bit you put in will end up in bit 127, and the last will end up in bit 0... Why not put them there to start with?
mov ecx, 080000000h
mov edx, 3
@@:
xor eax, eax
test somebyte, 0E1h
cmovp eax, ecx
or result[edx * 4], eax
ror ecx, 1
sbb edx, 0
jl @B
Also I would look at the timing code others have posted on the board, GetTickCount isn't very good as it has too high a margin of error.
Mirno