News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Does anyone have a genuinely fast SHA1 algo ?

Started by hutch--, January 01, 2011, 02:37:05 AM

Previous topic - Next topic

Antariy

Glenn, is Delphi's inline ASM support MMX instructions? Have a look into my previous post then.

hutch--

With the algo I am playing with the following code is the main bottleneck. I modified the 5 memory copies but no timing difference.


  calcsha_lbl3:
    lea esi, [ecx*4]
    add esi, pWrd
    mov eax, [esi+52]
    xor eax, [esi+32]
    xor eax, [esi+8]
    xor eax, [esi]
    mov edx, eax
    shl eax, 1
    shr edx, 31
    or  eax, edx
    mov [esi+64], eax
    add ecx, 1
    cmp ecx, 64
    jne calcsha_lbl3


unrolling it made no difference, it is the sum total number of memory accesses in this loop that limit the speed.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

Antariy

Quote from: hutch-- on January 05, 2011, 12:41:12 AM
With the algo I am playing with the following code is the main bottleneck. I modified the 5 memory copies but no timing difference.

Yes, memory references is probably slowest thing to do. Hutch have a look to code suggested on previous page - to MMX packet DWORDs bytes swap. Since packets and non temporal writing back - this should get not small speed up in this point.


add esi, pWrd


Is pWrd a local variable? What if make FPO, and free EBP reg?

hutch--

Alex,

The problem is the algo has 12 DWORD local variables and no way to reduce them. to make the memory copy a bit faster with the 5 memory copies I used 5 registers to separate the read after write problems but it did not change the timing, the part that slows it all down is the loop code I posted above. i had a look at your MMX replacement for BSWAP but I have reason not to use MMX registers, XMM would be worth a try as they don't conflist with FP operations but I doubt thats the problem.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

Antariy

Quote from: hutch-- on January 05, 2011, 01:18:19 AM
The problem is the algo has 12 DWORD local variables and no way to reduce them.

Is possible to refer to all locals by ESP, and free EBP for something else?

hutch--

Alex,

Its a point of diminishing returns, for the additional complexity of managing that many locals in ESP, all you get is one register and that will not solve the problem. The loop code that is the bottle neck is already full register code apart from the memory access that it is directly modifying, the algo itself does not have the room to change this. It may be possible in x64 with the extra registers but I still have my doubts.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

Antariy

Quote from: hutch-- on January 05, 2011, 01:58:55 AM
the algo itself does not have the room to change this. It may be possible in x64 with the extra registers but I still have my doubts.

what if use MMX/XMM regs to temporary holding of the other regs data?
I.e. for eaxample, write EBX not to local var, but to XMM:

movd xmm0,ebx
... do something else with EBX ...
movd ebx,xmm0
...


It looks like storage of 4 DWORD per an XMM reg. Just shifting them and store-retrieve values. Such stuff is slow on a PIV, but not sure about Core+

Glenn9999

Quote from: Antariy on January 04, 2011, 11:42:52 PM
Glenn, is Delphi's inline ASM support MMX instructions? Have a look into my previous post then.


I'll have a look at it when I get done with what I'm working on now.  That said, I tested a inline procedure implementation (all Delphi code) without the memory moves I referenced, and got a speed of 61ms slower than what I posted in this thread with the 25MB file, so the change definitely looks promising assuming there's a way to inline it in ASM as well as remove the Delphi ROL emulation (the big known time killer in the Delphi implementation).

KeepingRealBusy

Quote from: hutch-- on January 05, 2011, 12:41:12 AM
With the algo I am playing with the following code is the main bottleneck. I modified the 5 memory copies but no timing difference.


  calcsha_lbl3:
    lea esi, [ecx*4]
    add esi, pWrd
    mov eax, [esi+52]
    xor eax, [esi+32]
    xor eax, [esi+8]
    xor eax, [esi]
    mov edx, eax
    shl eax, 1
    shr edx, 31
    or  eax, edx
    mov [esi+64], eax
    add ecx, 1
    cmp ecx, 64
    jne calcsha_lbl3


unrolling it made no difference, it is the sum total number of memory accesses in this loop that limit the speed.

Hutch,

What about this:


  calcsha_lbl3:
    lea esi, [ecx*4]
    add esi, pWrd

    mov edx, [esi]        avoid stalls.
    mov eax, [esi+8]
    xor edx, [esi+32]
    xor eax, [esi+52]
    xor edx, eax
    mov eax, edx

    shl eax, 1        I think this does the same thing.
    rcr edx, 31

    add ecx, 1          Avoid stalls.
    mov [esi+64], eax
    cmp ecx, 64
    jne calcsha_lbl3


Dave.

KeepingRealBusy

Hutch,
Forget that last two line change, that won't work at all.
Dave.

Glenn9999

Finally got the code I was trying written out, but MASM turned out something I didn't understand.  I got 100 lines or so of this:


Assembling: sha1procs.asm
sha1procs.asm(128) : error A2070: invalid instruction operands
Sha1Loop1(17): Macro Called From
  sha1procs.asm(128): Main Line Code
sha1procs.asm(128) : error A2070: invalid instruction operands
Sha1Loop1(18): Macro Called From
  sha1procs.asm(128): Main Line Code
sha1procs.asm(128) : error A2023: instruction operand must have size
Sha1Loop1(20): Macro Called From
  sha1procs.asm(128): Main Line Code
sha1procs.asm(129) : error A2070: invalid instruction operands
Sha1Loop1(17): Macro Called From
  sha1procs.asm(129): Main Line Code
sha1procs.asm(129) : error A2070: invalid instruction operands
Sha1Loop1(18): Macro Called From
  sha1procs.asm(129): Main Line Code
sha1procs.asm(129) : error A2023: instruction operand must have size
Sha1Loop1(20): Macro Called From
  sha1procs.asm(129): Main Line Code


I assume the errors are referencing the Sha1Loop1 macro, but I'm not seeing how to reference the code to see what lines it is talking about.  Any help?

dedndave

the more errors you have - the simpler the problem usually is - lol

are you missing...
        .686p
        .MODEL  Flat,StdCall
        OPTION  CaseMap:None
        .MMX
        .XMM

Glenn9999

Quote from: dedndave on January 05, 2011, 05:11:46 AM
the more errors you have - the simpler the problem usually is - lol

are you missing...
        .686p
        .MODEL  Flat,StdCall
        OPTION  CaseMap:None
        .MMX
        .XMM


Nope.  The thing of it is, I can presume that the numbers in the first lines are source lines, since line #128 has the first macro call.  The problem I'm having is that 17, 18 and 20 make no sense in the context of the source.  Of course I could be missing something.

Edit: Here's the macro.  Maybe someone can see something?


Sha1Loop1    macro a, b, cf, d, e, w

MOV  EBP, d
XOR  EBP, cf
AND  EBP, b
XOR  EBP, d
ADD  e, EBP
;----------------------------------------
MOV  EBP, a
ROL  EBP, 5
ADD  e, EBP
;----------------------------------------
ADD  e, w
ADD  e, 05A827999h
;----------------------------------------
ROL  b, 30
;----------------------------------------
endm

dedndave

i tried
Sha1Loop1 eax,eax,eax,eax,eax,eax
it assembled without error
show us the line with the very first error in the list
you may have to look in C:\masm32\bin\asmbl.txt

hutch--

Glen,

If you want to try and optimise it once you get it going, inline all of the macros back into the source code.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php