MULPS/MULSS/MULSD all zero-out SSE register

Started by bozo, August 20, 2007, 10:17:43 AM


bozo

so what i have done will not work on some occasions??
i haven't had a chance to test it for incorrect collisions yet.

drizz

Quote from: Kernel_Gaddafi on August 27, 2007, 01:08:05 AM
         nr ^= (((nr & 63) + add) * tmp)+ (nr << 8);
         nr2 += (nr2 << 8) ^ nr;
         add += tmp;


As you can see, (nr & 3Fh) + add is multiplied by tmp.
The problem is that the add variable is sequentially updated with tmp (the current character from the password string).
This add variable eventually exceeds 16 bits, which makes PMULLW unsuitable for further multiplication because of
the carry into the upper word.

To solve this, i convert the 32-bit integers to floating point, multiply them, and convert back again.
as far as i can see, the multiplication will not overflow 0xFFFF if the string is less than 257 characters (taking 0xFF as the max possible character value), therefore pmullw+pmulhuw will work.

256 characters is not enough???
The truth cannot be learned ... it can only be recognized.

bozo

well, i tried pmullw on its own.
haven't tried pmulhuw, i'll try this later
thanks.

drizz

pmullw only gets the lower words of the product.

.686
.XMM
.MODEL FLAT,STDCALL
OPTION CASEMAP:NONE,PROLOGUE:NONE,EPILOGUE:NONE

public hashpassdouble,resl

.DATA
doublepasslen equ 1
align 8
pass dq doublepasslen dup(00000041000000FFh)
nr dq 5030573550305735h
nr2 dq 01234567112345671h
add_ dq 0000000700000007h
and63 dq 0000003F0000003Fh
andsgn dq 7FFFFFFF7FFFFFFFh

.DATA?
align 8
resl dq 2 dup(?)

.CODE
;; space and tab not processed
hashpassdouble:
movq mm0,pass
movq mm1,nr
movq mm2,nr2
movq mm7,add_
movq mm6,and63
xor ecx,ecx
.repeat
movq mm0,pass[ecx*8]
movq mm3,mm1
pand mm3,mm6
paddd mm3,mm7
paddd mm7,mm0
movq mm4,mm0
movq mm5,mm3
pmulhuw mm4,mm5
pmullw mm0,mm3
pslld mm4,16
por mm0,mm4
movq mm3,mm1
pslld mm3,8
paddd mm0,mm3
pxor mm1,mm0
movq mm4,mm2
pslld mm4,8
pxor mm4,mm1
paddd mm2,mm4
inc ecx
.until ecx == doublepasslen
movq mm3,mm1
movq mm4,mm2
punpckldq mm3,mm2
punpckhdq mm1,mm4
pand mm3,andsgn
pand mm1,andsgn
movq resl[0*8],mm3
movq resl[1*8],mm1
retn
end
The truth cannot be learned ... it can only be recognized.

bozo

movq mm4,mm0
movq mm5,mm3
pmulhuw mm4,mm5
pmullw mm0,mm3
pslld mm4,16
por mm0,mm4
movq mm3,mm1


i've not had a chance to test this yet, but i am guessing that if using SSE, there would be twice as many instructions?
or would it be exactly the same, just changing mm* to xmm* ?

bozo

ok, i tested the code, and it worked fine :) good work drizz  :clap:
thanks to daydreamer for bringing it up.

here is code i tested

.686
.XMM
.MODEL FLAT,STDCALL

include windows.inc
include stdio.inc
include msvcrt.inc

OPTION CASEMAP:NONE,PROLOGUE:NONE,EPILOGUE:NONE

public hashpassdouble

.DATA

align 16
pass dd 40*4 dup ("z")
pass_len equ $-pass

nr     dd 4 dup (50305735h)
nr2    dd 4 dup (12345671h)
add_   dd 4 dup (7)
and63  dd 4 dup (63)
andsgn dd 4 dup (7FFFFFFFh)

.DATA?
align 16
res_nr dd 4 dup(?)
res_nr2 dd 4 dup (?)

.CODE
align 16
;; space and tab not processed
hashpassdouble:
movdqa xmm0,[pass]
movdqa xmm1,[nr]

movdqa xmm2,[nr2]
movdqa xmm7,[add_]

movdqa xmm6,[and63]
xor ecx,ecx

.repeat
movdqa xmm0,[pass+ecx]
movdqa xmm3,xmm1
pand xmm3,xmm6        ; nr & 63
paddd xmm3,xmm7       ; nr + add
paddd xmm7,xmm0       ; nr + tmp

movdqa xmm4,xmm0        ; ((nr & 63) + add) * tmp
movdqa xmm5,xmm3
pmulhuw xmm4,xmm5
pmullw xmm0,xmm3
pslld xmm4,16
por xmm0,xmm4

                movdqa xmm3,xmm1
pslld xmm3,8
paddd xmm0,xmm3
pxor xmm1,xmm0
movdqa xmm4,xmm2
pslld xmm4,8
pxor xmm4,xmm1
paddd xmm2,xmm4
add ecx,16
.until ecx == pass_len

movdqa xmm3,xmm1
movdqa xmm4,xmm2

punpckldq xmm3,xmm2
punpckhdq xmm1,xmm4

pand xmm3,[andsgn]
pand xmm1,[andsgn]

movdqa [res_nr],xmm3
movdqa [res_nr2],xmm1

invoke printf,CStr(<10,"Hash 1:%08X%08X Hash 2:%08X%08X",10,"Hash 3:%08X%08X Hash 4:%08X%08X">),
               [res_nr+0],[res_nr+4],
               [res_nr+8],[res_nr+12],
               [res_nr2+0],[res_nr2+4],
               [res_nr2+8],[res_nr2+12]

invoke exit,0

end hashpassdouble



results:

Quote
Hash 1:26FEC475192CCD89 Hash 2:26FEC475192CCD89
Hash 3:26FEC475192CCD89 Hash 4:26FEC475192CCD89

the variables might be a bit mixed up, but it works nonetheless.
but which is faster? :)

drizz

Quote from: Kernel_Gaddafi on August 30, 2007, 05:44:08 PM
i've not had a chance to test this yet, but i am guessing that if using SSE, there would be twice as many instructions?
or would it be exactly the same, just changing mm* to xmm* ?
sorry Kernel_Gaddafi, i don't have SSE2 so i couldn't make the quad version
Quote from: Kernel_Gaddafi on August 30, 2007, 06:15:38 PM
the variables might be a bit mixed up, but it works nonetheless.
but which is faster? :)
faster compared to...? i guess that by working on 4 chars at once with SSE2, it's at least 2x faster than my SSE1 code.
you should do some 'clocking'.

The truth cannot be learned ... it can only be recognized.

daydreamer

Quote from: Kernel_Gaddafi on August 30, 2007, 06:15:38 PM
ok, i tested the code, and it worked fine :) good work drizz  :clap:
thanks to daydreamer for bringing it up.

here is code i tested

some notes on your code:

.CODE
align 16 ;you should try the forum for a macro that handles align 128
;; space and tab not processed
hashpassdouble:
movdqa xmm0,[pass] ;these constants should instead be kept as local variables
movdqa xmm1,[nr]   ;so you can issue 3+ pmullw in a row + other slow instructions
...
pmulhuw xmm4,xmm5 ;please unroll these to pmulhuws and pmullws
pmullw xmm0,xmm3  ;with no dependency on the previous 3 instructions above
...


daydreamer

Quote from: drizz on August 31, 2007, 06:49:08 AM
sorry Kernel_Gaddafi, i don't have SSE2 so i couldn't make the quad version
...
faster compared to...? i guess that by working on 4 chars at once with SSE2, it's at least 2x faster than my SSE1 code.
you should do some 'clocking'.
you can make only minor changes to the SSE2 code to turn it into the half-width version, and if you don't have SSE2 caps, your cpu will ignore the 066h prefix ahead of the instructions and execute them as the MMX versions of the instructions instead