News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Optimizing 128 bit linear feedback shift register

Started by akane, October 02, 2008, 02:52:50 PM

Previous topic - Next topic

akane

Hello, I wonder if this function could be optimized. I need to XOR 4 bits from one byte and RCL the result to 16 byte buffer.

I started with single AND instruction and used the parity flag, equal to XNOR result:

// bits to xor: 7,6,5,0
mov al, somebyte
and al, 11100001 // PF = al.7 xnor al.6 xnor al.5 xnor al.0

// rotate PF flag to CF flag
pushf
shr byte[esp],2
popf

// rotate the buffer with carry
rcl dword [esi],1 ...


It was slow, because of the additional stack operations. Summary 432ms for 18 mega loops, on centrino M 1.86GHz
First optimization: replace push/pop with stc + jp + clc. Much better: 184ms.
Second optimization: current version with 4*xor; 157ms

I see that every instruction inside __lfsr_128 must wait for previous instruction to complete. Any ideas?

main:
// required subsystem: console
// press any key to quit

sub esp, 20 ; 128 bit buffer for LFSR + 4 bytes for counter

; seed the buffer
mov [esp   ],esp
mov [esp+ 4],esp
mov [esp+ 8],esp
mov [esp+12],esp

; enter main loop
invoke GetCurrentThread
invoke SetThreadPriority, eax, THREAD_PRIORITY_TIME_CRITICAL

align 4

mainloop:
invoke _kbhit
and    al,al
jnz    quit

invoke GetTickCount
mov    [esp+16],eax

; number of loops equal to 10 mega bytes
mov    ecx,1024*1024*8

childloop:
call   __lfsr_128, esp
loop   childloop

invoke  GetTickCount
sub     eax,[esp+16]

cinvoke printf, "time: %d ms\n", eax
invoke  Sleep, 1
jmp     mainloop

quit:
invoke GetCurrentThread
invoke SetThreadPriority, eax, THREAD_PRIORITY_NORMAL
add    esp, 20
ret


;------------------------
align 4

__lfsr_128:
; 128 bit linear feedback shift register
; Feedback bits: 127, 126, 125, 120

mov edx,[esp+4] ; m_bufer ; AL bits   7    6    5         0
mov al,[edx+15] ; m_bufer[3];  bits 127, 126, 125, ..., 120

mov ah,al ; ah=765....0
add ah,ah ; ah=65....0.
xor al,ah ;             al = al.7 xor al.6
add ah,ah ; ah=5....0..
xor al,ah ;             al = al.7 xor al.6 xor al.5
shl ah,5  ; ah=0.......
xor al,ah ;             al = al.7 xor al.6 xor al.5 xor al.0
add al,al ; put al.7 into CY

; now shift the 128 bits and append the bit from XOR result
;                   ___
; m_bufer[3].31 -> |XOR|     ___
; m_bufer[3].30 -> |___| -> |XOR| ->--------------->-----+
; m_bufer[3].29 -> |XOR| -> |___|                        |
; m_bufer[3].24 -> |___|                                 |
;                                                        |
; m_bufer[3] <- m_bufer[2] <- m_bufer[1] <- m_bufer[0] <-+

rcl dword [edx   ],1
rcl dword [edx+ 4],1
rcl dword [edx+ 8],1
rcl dword [edx+12],1

ret 4

Mirno

I'm not entirely sure what you're doing here - but your idea of using the parity flag is a good one. The method you use to move the flag is the problem...
Secondly I'm not entirely sure why you want to do this long string of rcls. The order is deterministic, the first bit you put in will end up in bit 127, and the last will end up in bit 0... Why not put them there to start with?


   mov ecx, 080000000h
   mov edx, 3

@@:
   xor eax, eax
   test somebyte, 0E1h
   cmovp eax, ecx
   or result[edx * 4], eax

   ror ecx, 1
   sbb edx, 0
   jl @B


Also I would look at the timing code others have posted on the board, GetTickCount isn't very good as it has too high a margin of error.

Mirno