Does anyone have a genuinely fast SHA1 algo ?

Antariy · January 04, 2011, 11:42:52 PM

Glenn, is Delphi's inline ASM support MMX instructions? Have a look into my previous post then.

hutch-- · January 05, 2011, 12:41:12 AM

With the algo I am playing with the following code is the main bottleneck. I modified the 5 memory copies but no timing difference.

Code Select


  calcsha_lbl3:
    lea esi, [ecx*4]
    add esi, pWrd
    mov eax, [esi+52]
    xor eax, [esi+32]
    xor eax, [esi+8]
    xor eax, [esi]
    mov edx, eax
    shl eax, 1
    shr edx, 31
    or  eax, edx
    mov [esi+64], eax
    add ecx, 1
    cmp ecx, 64
    jne calcsha_lbl3

unrolling it made no difference, it is the sum total number of memory accesses in this loop that limit the speed.

Antariy · January 05, 2011, 12:49:53 AM

Quote from: hutch-- on January 05, 2011, 12:41:12 AM
With the algo I am playing with the following code is the main bottleneck. I modified the 5 memory copies but no timing difference.

Yes, memory references is probably slowest thing to do. Hutch have a look to code suggested on previous page - to MMX packet DWORDs bytes swap. Since packets and non temporal writing back - this should get not small speed up in this point.

Code Select


add esi, pWrd

Is pWrd a local variable? What if make FPO, and free EBP reg?

hutch-- · January 05, 2011, 01:18:19 AM

Alex,

The problem is the algo has 12 DWORD local variables and no way to reduce them. to make the memory copy a bit faster with the 5 memory copies I used 5 registers to separate the read after write problems but it did not change the timing, the part that slows it all down is the loop code I posted above. i had a look at your MMX replacement for BSWAP but I have reason not to use MMX registers, XMM would be worth a try as they don't conflist with FP operations but I doubt thats the problem.

Antariy · January 05, 2011, 01:28:26 AM

Quote from: hutch-- on January 05, 2011, 01:18:19 AM
The problem is the algo has 12 DWORD local variables and no way to reduce them.

Is possible to refer to all locals by ESP, and free EBP for something else?

hutch-- · January 05, 2011, 01:58:55 AM

Alex,

Its a point of diminishing returns, for the additional complexity of managing that many locals in ESP, all you get is one register and that will not solve the problem. The loop code that is the bottle neck is already full register code apart from the memory access that it is directly modifying, the algo itself does not have the room to change this. It may be possible in x64 with the extra registers but I still have my doubts.

Antariy · January 05, 2011, 02:08:29 AM

Quote from: hutch-- on January 05, 2011, 01:58:55 AM
the algo itself does not have the room to change this. It may be possible in x64 with the extra registers but I still have my doubts.

what if use MMX/XMM regs to temporary holding of the other regs data?
I.e. for eaxample, write EBX not to local var, but to XMM:

Code Select


movd xmm0,ebx
... do something else with EBX ...
movd ebx,xmm0
...

It looks like storage of 4 DWORD per an XMM reg. Just shifting them and store-retrieve values. Such stuff is slow on a PIV, but not sure about Core+

Glenn9999 · January 05, 2011, 02:28:06 AM

Quote from: Antariy on January 04, 2011, 11:42:52 PM
Glenn, is Delphi's inline ASM support MMX instructions? Have a look into my previous post then.

I'll have a look at it when I get done with what I'm working on now. That said, I tested a inline procedure implementation (all Delphi code) without the memory moves I referenced, and got a speed of 61ms slower than what I posted in this thread with the 25MB file, so the change definitely looks promising assuming there's a way to inline it in ASM as well as remove the Delphi ROL emulation (the big known time killer in the Delphi implementation).

KeepingRealBusy · January 05, 2011, 02:54:26 AM

Quote from: hutch-- on January 05, 2011, 12:41:12 AM
With the algo I am playing with the following code is the main bottleneck. I modified the 5 memory copies but no timing difference.

Code Select Expand
calcsha_lbl3: lea esi, [ecx*4] add esi, pWrd mov eax, [esi+52] xor eax, [esi+32] xor eax, [esi+8] xor eax, [esi] mov edx, eax shl eax, 1 shr edx, 31 or eax, edx mov [esi+64], eax add ecx, 1 cmp ecx, 64 jne calcsha_lbl3

unrolling it made no difference, it is the sum total number of memory accesses in this loop that limit the speed.

Hutch,

What about this:

Code Select


  calcsha_lbl3:
    lea esi, [ecx*4]
    add esi, pWrd

    mov edx, [esi]        avoid stalls.
    mov eax, [esi+8]
    xor edx, [esi+32]
    xor eax, [esi+52]
    xor edx, eax
    mov eax, edx

    shl eax, 1        I think this does the same thing.
    rcr edx, 31

    add ecx, 1          Avoid stalls.
    mov [esi+64], eax
    cmp ecx, 64
    jne calcsha_lbl3

Dave.

KeepingRealBusy · January 05, 2011, 04:03:40 AM

Hutch,
Forget that last two line change, that won't work at all.
Dave.

Glenn9999 · January 05, 2011, 05:06:59 AM

Finally got the code I was trying written out, but MASM turned out something I didn't understand. I got 100 lines or so of this:

Code Select


 Assembling: sha1procs.asm
sha1procs.asm(128) : error A2070: invalid instruction operands
 Sha1Loop1(17): Macro Called From
  sha1procs.asm(128): Main Line Code
sha1procs.asm(128) : error A2070: invalid instruction operands
 Sha1Loop1(18): Macro Called From
  sha1procs.asm(128): Main Line Code
sha1procs.asm(128) : error A2023: instruction operand must have size
 Sha1Loop1(20): Macro Called From
  sha1procs.asm(128): Main Line Code
sha1procs.asm(129) : error A2070: invalid instruction operands
 Sha1Loop1(17): Macro Called From
  sha1procs.asm(129): Main Line Code
sha1procs.asm(129) : error A2070: invalid instruction operands
 Sha1Loop1(18): Macro Called From
  sha1procs.asm(129): Main Line Code
sha1procs.asm(129) : error A2023: instruction operand must have size
 Sha1Loop1(20): Macro Called From
  sha1procs.asm(129): Main Line Code

I assume the errors are referencing the Sha1Loop1 macro, but I'm not seeing how to reference the code to see what lines it is talking about. Any help?

dedndave · January 05, 2011, 05:11:46 AM

the more errors you have - the simpler the problem usually is - lol

are you missing...

Code Select

        .686p
        .MODEL  Flat,StdCall
        OPTION  CaseMap:None
        .MMX
        .XMM

Glenn9999 · January 05, 2011, 05:22:40 AM

Quote from: dedndave on January 05, 2011, 05:11:46 AM
the more errors you have - the simpler the problem usually is - lol

are you missing...
Code Select Expand
.686p .MODEL Flat,StdCall OPTION CaseMap:None .MMX .XMM

Nope. The thing of it is, I can presume that the numbers in the first lines are source lines, since line #128 has the first macro call. The problem I'm having is that 17, 18 and 20 make no sense in the context of the source. Of course I could be missing something.

Edit: Here's the macro. Maybe someone can see something?

Code Select


Sha1Loop1    macro a, b, cf, d, e, w

MOV  EBP, d
XOR  EBP, cf
AND  EBP, b
XOR  EBP, d
ADD  e, EBP
;----------------------------------------
MOV  EBP, a
ROL  EBP, 5
ADD  e, EBP
;----------------------------------------
ADD  e, w
ADD  e, 05A827999h
;----------------------------------------
ROL  b, 30
;----------------------------------------
endm

dedndave · January 05, 2011, 05:39:03 AM

i tried

Code Select

Sha1Loop1 eax,eax,eax,eax,eax,eax
it assembled without error
show us the line with the very first error in the list
you may have to look in C:\masm32\bin\asmbl.txt

hutch-- · January 05, 2011, 05:43:04 AM

Glen,

If you want to try and optimise it once you get it going, inline all of the macros back into the source code.

News:

Does anyone have a genuinely fast SHA1 algo ?