
MULPS/MULSS/MULSD all zero-out SSE register

Started by bozo, August 20, 2007, 10:17:43 AM


bozo

hi all

it's really just this code i'm stumped on, and tired of looking at.

    pslld xmm2,8               ;    shl ebx,8
    paddd xmm1,xmm3            ;    add eax,[add_]

    paddd xmm3,xmm7            ;    add [add_],tmp
    mulps xmm1,xmm7            ;    imul eax,tmp


after mulps, xmm1 is zero.
even when i load four 32-bit values of 2 into xmm7 and xmm1 (so each lane should give 2*2 = 4), xmm1 still goes to zero after mulps...
surely this isn't normal?

anyone know what i'm doing wrong?

NightWare

very strange code... i don't know what you are trying to do, but:

xxxxps use IEEE format (like fpu...) in xmm registers

pxxxx use integer in mmx registers

the mmx registers alias the st(x) fpu registers, but that's not the case for the xmm registers, they are independent....


bozo

i realise the problem now after you mentioned IEEE, so the following works ok

.data

align 16
array1 dd 4 dup (2)
array2 dd 4 dup (2)

array_result dd 4 dup (?)

.code
start:
    ;int 3
align 16

    movq mm0,qword ptr[array1]
    movq mm1,qword ptr[array2]

    cvtpi2ps xmm1,mm0
    cvtpi2ps xmm2,mm1

    mulps xmm1,xmm2
    cvtps2pi mm0,xmm1

    movq qword ptr[array_result],mm0


i haven't figured out how to multiply the 4 integers yet, but i'm only just out of bed  :8)

thanks nightware

bozo

i might go back to bed :P - all i want to do is multiply 4 integers by each other, like (2*2) (2*2) (2*2) (2*2) all at once.
i thought that mulps would do it, but it returned zero, so at least now i know the reason for that: i needed to convert to IEEE format first.
but now that i have done that, there seems to be no way to do it conveniently, and i'd be better off just using the 32-bit MUL instruction.

anyone done this kind of thing before?

bozo

just did a quick search of the forum, and found a post by Tedd describing cvtdq2ps, which i didn't even know existed.

so the code i have now is

    movdqa xmm1,oword ptr[array1]
    movdqa xmm2,oword ptr[array2]

    cvtdq2ps xmm3,xmm1
    cvtdq2ps xmm4,xmm2

    mulps xmm3,xmm4
    cvtps2dq xmm1,xmm3

    movdqa oword ptr[array_result],xmm1


and that worked fine! :)

bozo

i've another problem now, with xorps

.data

align 16
array1 dd 4 dup (2)
array2 dd 4 dup (3)
xmm_xor dd 4 dup (80h)

align 16
array_result dd 4 dup (?)

.code

start:
   align 16
    mov eax,[array1]
    cvtdq2ps xmm1,[array1]

    mov ebx,[array2]
    cvtdq2ps xmm2,[array2]

    mov ecx,[xmm_xor]

    imul eax,ebx
    mulps xmm1,xmm2

    add eax,[array2]
    cvtdq2ps xmm5,[array2]
    addps xmm1,xmm5

    xor eax,ecx
    cvtdq2ps xmm3,[xmm_xor]
    xorps xmm1,xmm3

    cvtps2dq xmm6,xmm1
    movdqa [array_result],xmm6


after xorps xmm1,xmm3, xmm1 equals zero, when each dword should equal 89h ((2*3 + 3) xor 80h)

bozo

alright, i see now that there is an instruction to do this called PMULLD, but it's only available with SSE4 on CORE 2 processors.

The solution was to use PMULLW, because the algorithm i use works on bytes; however, there is an overflow in certain circumstances where 16 bits isn't enough to accommodate the result of the multiplication.

i'm surprised tbh that PMULLD was only added with CORE 2, when SSE has been around for.. 8 years?

the xorps doesn't work, because it can't handle unsigned values - at least that is the only explanation i can give for the incorrect value coming out of its operation.. please correct me if i'm wrong.

PXOR works, but there is no matching multiplication instruction; i would have to convert to floating point before the MULPS, and convert the result back to integers afterwards.. unless someone knows a better way?

i see no good solution except to upgrade to core 2, or to test for carry using PMULLW when working with bytes, and then add 1 to each 32-bit integer accordingly where a carry occurs... and even then, it's not very efficient.

well, i give up.

NightWare

pmulld isn't necessary on P2+ coz there is no speed up compared to the classic mul instruction...
they probably added it just to complete the instruction set (i don't see another reason...)

concerning xorps, it acts exactly like xor, but remember here you have ieee format...

if you really want to use mmx, pmullw is the instruction you have to use (but don't expect a speed up...)

in general,
mmx is powerful for byte/word manipulation, but for nothing else i know of.
sse+ is extremely powerful for matrix calculation and for floating point manipulation.
in your case, you are trying to use mmx and sse like fast classic instructions... but that's not the purpose of those instruction sets... mmx is made for colour (3 or 4 bytes) manipulation, and sse+ is made for 3D stuff... (believe me, i know what i'm talking about, coz i've been coding a full asm 3D engine for several months...)

OldTimer

Hi Kernel Gaddafi and NightWare,
    I must confess that your conversation is almost incomprehensible to this Senior Citizen. My compiler had similar thoughts, giving me 7 errors from the 7 lines of your code.

    I would like to suggest that this forum is used by both oldies and newbies to increase their knowledge of assembler programming and that fully commented source code is like 'manna from heaven', even if the comments are blindingly obvious to the authors.  To paraphrase a famous Australian, HG Nelson, "Too many comments are never enough".

changing
pxxxx use integer in mmx registers

to
pXXXX use integer in mmx registers
made a little more sense to me, but not much.

    When you've successfully completed your alKuwarizmi to pick the Lotto numbers, don't keep it to yourselves.

    Seeing that you two obviously know what you're talking about, I'll put on my straitjacket and return to the safety of my padded cell.

BTW My first search on Google returned me to this forum, so your audience is larger than you think.
:bg

bozo

hey, just wanted to post the code i've finished up with to highlight what i was doing.

for those of you who don't know, i study computer security subjects, and i'm writing an article at the moment on the MySQL password algorithm and some of its design flaws - i just don't want anyone thinking i'm trying to attack a system.

and i don't see what i'm doing as irresponsible either, just highlighting insecurities in software.

the paper isn't finished yet, but here is the main algorithm in C, and then code in Assembler.

// begin listing C code
//
void hash_password(ulong *result, const char *password, uint password_len)
{
    register ulong nr = 1345345333L, add = 7, nr2 = 0x12345671L;
    ulong tmp;
    const char *password_end = password + password_len;

    for (; password < password_end; password++)
    {
        if (*password == ' ' || *password == '\t')
            continue;                            /* skip space in password */

        tmp = (ulong) (uchar) *password;
        nr ^= (((nr & 63) + add) * tmp) + (nr << 8);
        nr2 += (nr2 << 8) ^ nr;
        add += tmp;
    }
    result[0] = nr  & (((ulong) 1L << 31) - 1L); /* Don't use sign bit (str2int) */
    result[1] = nr2 & (((ulong) 1L << 31) - 1L);
}
//
// end listing


and here is the SSE + x86 code in asm

.data
             align 16
xmm_nr       dd 4 dup (1345345333)
xmm_nr2      dd 4 dup (012345671h)
xmm_add      dd 4 dup (7)
xmm_sign     dd 4 dup (7FFFFFFFh)

xmm_mask     dd 4 dup (63)

             align 16
xmm_result_nr      dd 4 dup (2)
xmm_result_nr2     dd 4 dup (2)

x86_password  dd 1 dup ('A')

              align 16
xmm_password  dd 4 dup ('A')

password_size equ $-xmm_password

.code
   
   align 16

xmm_sql_hash_sse4:
    ; note: xm_sign/xm_nr/xm_nr2/xm_add/xm_mask/xm_tmp are register equates
    ; (e.g. xm_nr equ xmm4), and nr/nr2/add_/tmp are x86 locals; both sets of
    ; definitions are omitted from this listing

    movdqa xm_sign,[xmm_sign]
    movdqa xm_nr,[xmm_nr]
    movdqa xm_nr2,[xmm_nr2]
    movdqa xm_add,[xmm_add]
    movdqa xm_mask,[xmm_mask]

    xor ebp,ebp

xmm_hash_loop:                        ;    tmp = (ulong) (uchar) *password;
    movdqa xm_tmp,[xmm_password+ebp]
    mov tmp,[x86_password+ebp]
   
    movdqa xmm1,xm_nr          ;    nr ^= (((nr & 63) + add) * tmp) + (nr << 8);
    mov eax,nr

    movdqa xmm2,xm_nr          ;   
    mov ebx,nr

    pand xmm1,xm_mask          ;   
    and eax,63

    pslld xmm2,8               ;   
    shl ebx,8

    paddd xmm1,xm_add          ;   
    add eax,add_

    paddd xm_add,xm_tmp        ;   
    add add_,tmp

    mul tmp                    ;    eax = ((nr & 63) + add_) * tmp

IFDEF USE_SSE4                 ;    intel core 2 cpus or better
    pmulld xmm1,xm_tmp         ;   
ELSE
    ; =========================     PMULLD is only available since SSE4

    cvtdq2ps xmm1,xmm1            ; convert packed dword integers to packed singles
    cvtdq2ps xm_tmp,xm_tmp

    mulps xmm1,xm_tmp             ; multiply packed single-precision floats

    cvtps2dq xmm1,xmm1            ; convert back to packed dword integers
    cvtps2dq xm_tmp,xm_tmp

    ; =========================
ENDIF                                 

    paddd xmm1,xmm2            ;   
    add eax,ebx

    pxor xm_nr,xmm1            ;   
    xor nr,eax

    ; --------------------------

    movdqa xmm2,xm_nr2         ;    nr2 += (nr2 << 8) ^ nr;  
    mov ebx,nr2

    pslld xmm2,8               ;   
    shl ebx,8

    pxor xmm2,xm_nr            ;   
    xor ebx,nr

    paddd xm_nr2,xmm2          ;   
    add nr2,ebx

    ; --------------------------

    add ebp,16                 ;    increase index
    cmp ebp,password_size      ;    password_size is an equate for the array size
    jne xmm_hash_loop

    pand xm_nr,xm_sign
    pand xm_nr2,xm_sign

    movdqa [xmm_result_nr],xm_nr
    movdqa [xmm_result_nr2],xm_nr2
   
    ret


you might be wondering why there is only 1 letter for each password?
the reason for this is based on a flaw in the way the hash_password() function works..
it updates the hash for every byte in the password string, so if you change only the last byte, you can check a new hash without recomputing the same steps over and over again.

normally, the C code computes roughly 10 million keys/s on my amd64 3200+.
i still haven't tested the sse + x86 routine properly yet, but the optimised C version does about 60 million keys/s now.
some rough tests gave about 130 million keys/s for sse + x86 - a big improvement.

the whole point of the article i'm writing is why NOT to use the hash_password() algorithm to protect passwords from attack.

btw, if anyone can see a way to optimize the code better, ideas are appreciated.

daydreamer

mulps only carries 23 bits of mantissa precision - is that enough, or do you get data loss, so it works buggy going from 32-bit integers to 23 bits and back to 32-bit again?
I think the way to go is to combine the two flavours, pmullw and pmulhw, where pmulhw gives you the upper 16 bits of the result

bozo

MULPS works fine - providing the integer values are first converted to single-precision floats.
however, XORPS gives incorrect results, which is possibly to do with signed/unsigned input.

PXOR works fine with either signed or unsigned values, BUT there is no packed multiplication instruction for 32-bit dwords before SSE4.
PMULLW will work... but, check this:

         nr ^= (((nr & 63) + add) * tmp)+ (nr << 8);
         nr2 += (nr2 << 8) ^ nr;
         add += tmp;


As you can see, (nr & 3Fh) + add is multiplied by tmp.
The problem is that the add variable is cumulatively updated with tmp (the current character from the password string).
This add variable eventually exceeds 16 bits, which makes PMULLW unsuitable for the multiplication, because the upper bits of the product are lost.

To solve this, i convert the 32-bit integers to single-precision, multiply them, and convert back again.

PMULLD would solve this minor inconvenience, but it's only available since the CORE 2 cpus with SSE4

NightWare

Quote from: NightWare on August 21, 2007, 11:14:54 PM
concerning xorps, it acts exactly like xor, but remember here you have ieee format...
it means (exponent xor exponent = no more exponent....)

and daydreamer is right, cvtxxx are quite slow instructions...

bozo

Quote: it means (exponent xor exponent = no more exponent....)

and daydreamer is right, cvtxxx are quite slow instructions...

hmm, i hadn't noticed that tbh.
if any of you can get xorps to work with this algorithm, i'd love to know, thanks.
i just want it to work - xorps wouldn't do what i wanted.

unless of course i was doing something wrong, but nobody has corrected the code to make it actually work.

Rockoon

Remember that exponents are biased in IEEE...

An exponent field of all 0 bits does not mean the exponent = 0

For IEEE single-precision (32-bit) floats, the bias is 127; that is to say, when the exponent field = x taken as an 8-bit number, the exponent's real value is taken to be (x - 127)

When C++ compilers can be coerced to emit rcl and rcr, I *might* consider using one.