Does anyone have a genuinely fast SHA1 algo ?

hutch-- · January 01, 2011, 02:37:05 AM

This type of code is something I am not all that familiar with but I have some folks that need to do work like this so I wondered if anyone has a fast SHA1 algo up and going that I can play with ?

Glenn9999 · January 02, 2011, 04:26:34 AM

Quote from: hutch-- on January 01, 2011, 02:37:05 AM
This type of code is something I am not all that familiar with but I have some folks that need to do work like this so I wondered if anyone has a fast SHA1 algo up and going that I can play with ?

In trying to get efficient hashing algorithms (as I've been posting about on and off here), SHA1 is the only one that I really haven't been able to completely optimize to the point of having no reservations about having anything more that could be done. At least to my general knowledge. But it's faster than what I've found so far.

However, it's in Delphi, but with some Delphi assembler modules. The stuff that does not relate to API calls (was going with a Delphi target, so going with what I know) are mainly what is in the Delphi assembler. I can post it and/or maybe work on some MASM substitutes for the Delphi assembler parts if it would help.

Glenn9999 · January 03, 2011, 03:35:22 AM

Okay. I tried writing some MASM code and realized that I didn't have any resources immediately handy to answer a couple of things I realized I didn't know (addressing a pointer to a STRUCT - can I do it like I did it with the TASM/Delphi ASM? - and a compiler error on a macro call line I have). So here's the Delphi with assembler as I have it right now, all the ideas I had tried out and the speed improvements left in. Hopefully it should be a good start since I couldn't get it converted in a timely manner, and the ASM can be done from it that I haven't already done. I'd definitely be interested to see if it could be made faster (at least within the realm of X86 code).

Stats:
1) It runs about 17000ms or so on my testbed (660MB of files, Athlon XP 2000). The best code I originally had was around 30000ms, so consider it a positive improvement, at least in that regard. It's very comparable to the MD5 code and CRC32 code I have right now, speed wise.
2) I have a profiler I've been using which understands Delphi code. According to that, the majority of the time is spent in Sha1Compress and Sha1LoadW, so the biggest gains could probably be made there, though I'm sure there might be other things.
3) The "big reservation" I mentioned is in Sha1Compress. I'm thinking that the values need not be rotated in memory and can be called in sequence via macro like the MD5 thread, but I wasn't too successful in getting a correct result. I had another thing to tend to, so I really haven't spent too much time and brain power on trying to eliminate it.
4) I had the loop sections of Sha1Compress written completely in ASM and didn't get a speed increase (namely because of all the memory moves, can't do much besides MOV EAX, A MOV B, EAX with that). Of course that doesn't mean there wasn't anything that I missed.
5) Sha1File is kind of irrelevant (probably) to what you wanted to know, but I left it in there as an illustration of how the code is meant to be used.

Code Select


{$R- $Q- $O+}
unit sha1;

interface
 type
   string2 = string[2];
   DWord = Longint;
   longword = longint;
   TSHA1Digest= array[0..19] of byte;
   TSHA1Context= record
     Hash: array[0..4] of DWord;
     Hi, Lo: integer;
     Buffer: array[0..63] of byte;
     Index: integer;
   end;

procedure SHA1Init(var Context: TSHA1Context);
procedure SHA1Update(var Context: TSHA1Context; Buffer: pointer; Len: integer);
procedure SHA1Final(var Context: TSHA1Context; var Digest: TSHA1Digest);
function SHA1File(infile_name: string): string;

implementation
  uses sysutils, windows;

function SHA1File(infile_name: string): string;
  var
    infile: file;
    inbuffer: array[0..65535] of byte;
    amount_read: integer;
    Context: TSHA1Context;
    Digest: TSHA1Digest;
  begin
    AssignFile(infile, infile_name);
    Reset(infile, 1);
    Sha1Init(Context);
    blockread(infile, inbuffer, sizeof(inbuffer), amount_read);
    while amount_read > 0 do
      begin
        Sha1Update(Context, @inbuffer, amount_read);
        blockread(infile, inbuffer, sizeof(inbuffer), amount_read);
      end;
    CloseFile(infile);
    SHA1Final(Context,Digest);
    Result := write_hash(@Digest, Sizeof(Digest), true);  // this is a ByteToHex function
  end;

function RB(A: DWord): DWord; assembler;
{ 32-bit little-endian to big-endian, not a big time consumer but rewritten
  anyway to take advantage of assembler instructions not directly supported
  in Delphi }
asm
  BSWAP   EAX         // swap values in EAX
end;                  // EAX is result in "assembler" functions.

procedure Sha1LoadW(W: pointer); assembler;
 // W is in EAX
  asm
    MOV      EDX, 0                   // set EDX counter to 0
// start first loop, make first 16 elements big-endian
    @loop1:
    MOV      ECX, [EAX+EDX*4]         // W[i]
    BSWAP    ECX                      // make ECX big-endian
    MOV      [EAX+EDX*4], ECX         // return result to W[i]
    INC      EDX                      // increment counter by 1
    CMP      EDX, 15                  // compare counter with 15
    JLE      @loop1                   // if <= 15 continue
// start second loop, do SHA spread
    @loop2:                           // EDX should be 16 here
    MOV      ECX, [EAX+EDX*4-12]      // W[i-3]
    XOR      ECX, [EAX+EDX*4-32]      // W[i-8]
    XOR      ECX, [EAX+EDX*4-56]      // W[i-14]
    XOR      ECX, [EAX+EDX*4-64]      // W[i-16]
    ROL      ECX, 1                   // ROtate Left by 1
    MOV      [EAX+EDX*4], ECX         // return result to W[i]
    INC      EDX                      // increment counter by 1
    CMP      EDX, 79                  // compare counter with 79
    JLE      @loop2                   // if <=79 continue
  end;

procedure SHA1Compress(Buffer: pointer; var Data: TSHA1Context);
// performs SHA-1 on a 64 byte section.
var
  A, B, C, D, E, T: DWord;
  W: array[0..79] of DWord;
  i: integer;
begin
  Move(Buffer^,W,64);    // change here to support separate buffer
  Sha1LoadW(@W);
  A:= Data.Hash[0]; B:= Data.Hash[1]; C:= Data.Hash[2]; D:= Data.Hash[3]; E:= Data.Hash[4];
  for i:= 0 to 19 do
  begin
    asm
      MOV ECX, D
      XOR ECX, c
      AND ECX, b
      XOR ECX, D
      MOV EDX, A          // move A to EDX register
      ROL EDX, 5          // ROtate Left EDX by 5
      ADD ECX, EDX
      ADD ECX, E
      ADD ECX, $5A827999
      MOV EAX, i
      ADD ECX, [Pointer(W)+ EAX*4]
      MOV T, ECX
    end;
    E:= D; D:= C;
    asm
      MOV EDX, B          // move b to EDX register
      ROL EDX, 30         // ROtate Left EDX by 30
      MOV C, EDX          // Return Result to C
    end;
    B:= A; A:= T;
  end;
  for i:= 20 to 39 do
  begin
    asm
      MOV ECX, d
      XOR ECX, c
      XOR ECX, b
      MOV EDX, A          // move A to EDX register
      ROL EDX, 5          // ROtate Left EDX by 5
      ADD ECX, EDX
      ADD ECX, e
      ADD ECX, $6ED9EBA1
      MOV EAX, i
      ADD ECX, [Pointer(W)+ EAX*4]
      MOV T, ECX
    end;
    E:= D; D:= C;
    asm
      MOV EDX, B          // move b to EDX register
      ROL EDX, 30         // ROtate Left EDX by 30
      MOV C, EDX          // Return Result to C
    end;
    B:= A; A:= T;
  end;
  for i:= 40 to 59 do
  begin
    asm
      MOV EAX, c
       OR EAX, b
      AND EAX, d
      MOV ECX, b
      AND ECX, c
       OR ECX, EAX
      MOV EAX, A          // move A to EDX register
      ROL EAX, 5          // ROtate Left EDX by 5
      ADD ECX, EAX
      ADD ECX, E
      ADD ECX, $8F1BBCDC
      MOV EAX, i
      ADD ECX, [Pointer(W)+EAX*4]
      MOV T, ECX
    end;
    E:= D; D:= C;
    asm
      MOV EDX, B          // move b to EDX register
      ROL EDX, 30         // ROtate Left EDX by 30
      MOV C, EDX          // Return Result to C
    end;
    B:= A; A:= T;
  end;
  for i:= 60 to 79 do
  begin
    asm
      MOV ECX, d
      XOR ECX, c
      XOR ECX, b
      MOV EDX, A          // move A to EDX register
      ROL EDX, 5          // ROtate Left EDX by 5
      ADD ECX, EDX
      ADD ECX, e
      ADD ECX, $CA62C1D6
      MOV EAX, i
      ADD ECX, [Pointer(W)+ EAX*4]
      MOV T, ECX
    end;
    E:= D; D:= C;
    asm
      MOV EDX, B          // move b to EDX register
      ROL EDX, 30         // ROtate Left EDX by 30
      MOV C, EDX          // Return Result to C
    end;
    B:= A; A:= T;
  end;
  Data.Hash[0]:= Data.Hash[0] + A;
  Data.Hash[1]:= Data.Hash[1] + B;
  Data.Hash[2]:= Data.Hash[2] + C;
  Data.Hash[3]:= Data.Hash[3] + D;
  Data.Hash[4]:= Data.Hash[4] + E;
end;
}

//******************************************************************************
procedure SHA1Init(var Context: TSHA1Context);
begin
  Context.Hi:= 0; Context.Lo:= 0;
  Context.Index:= 0;
  FillChar(Context.Buffer,Sizeof(Context.Buffer),0);
  Context.Hash[0]:= $67452301;
  Context.Hash[1]:= $EFCDAB89;
  Context.Hash[2]:= $98BADCFE;
  Context.Hash[3]:= $10325476;
  Context.Hash[4]:= $C3D2E1F0;
end;

//******************************************************************************
procedure SHA1Update(var Context: TSHA1Context; Buffer: pointer; Len: integer);
begin
// Length needs to be in bits, maintaining 2 parts of a 64 bit integer.
  asm
    MOV   EAX, Context                    // address of context to EAX
    MOV   ECX, Len                        // length to ECX
    SHL   ECX, 3                          // multiply ECX by 8
    ADD   TSha1Context([EAX]).Lo, ECX     // add that to the low value.
    ADC   TSha1Context([EAX]).Hi, 0       // add carry flag if exists.
  end;
// process chunk from previous run
  if Context.Index > 0 then
    begin
      Move(Buffer^, Context.Buffer[Context.Index], 64-Context.Index);
      Sha1Compress(@Context.Buffer[0], Context);
      Inc(Longint(Buffer), 64-Context.Index);
      dec(Len, 64-Context.Index);
      Context.Index := 0;
    end;
// pull off 64-byte chunks and process
  while Len >= 64 do
    begin
      Sha1Compress(Buffer, Context);
      Inc(Longint(Buffer), 64);
      dec(Len, 64);
    end;
// if there's some left, pull off and leave for later
  if Len > 0 then
    begin
      FillChar(Context.Buffer[0], sizeof(Context.Buffer), 0);
      Move(Buffer^, Context.Buffer[0], Len);
      Context.Index := Len;
    end;
end;

//******************************************************************************
procedure SHA1Final(var Context: TSHA1Context; var Digest: TSHA1Digest);
type
  PDWord= ^DWord;
begin
  Context.Buffer[Context.Index]:= $80;  // end of content needs 1 bit at end
  if Context.Index>= 56 then
    begin
      SHA1Compress(@Context.Buffer[0], Context);
      FillChar(Context.Buffer[0], sizeof(Context.Buffer), 0);
    end;
  PDWord(@Context.Buffer[56])^:= RB(Context.Hi);
  PDWord(@Context.Buffer[60])^:= RB(Context.Lo);
  SHA1Compress(@Context.Buffer[0], Context);
  Context.Hash[0]:= RB(Context.Hash[0]);
  Context.Hash[1]:= RB(Context.Hash[1]);
  Context.Hash[2]:= RB(Context.Hash[2]);
  Context.Hash[3]:= RB(Context.Hash[3]);
  Context.Hash[4]:= RB(Context.Hash[4]);
  Move(Context.Hash,Digest,Sizeof(Digest));
  FillChar(Context,Sizeof(Context),0);
end;

Hope that gives you a good start, and if you are unsure about what some of the remaining Delphi does, feel free to ask and I or someone else can explain it.

hutch-- · January 03, 2011, 03:47:16 AM

Thanks Glen, I have been manually optimising the code I had to work on and I am still getting reliable results but its a messy algo with too many local variables which really slow it down. The original I started with ran a 25 meg file in about 328 ms, I have it down to 234 ms but there is little that can be done to make it much faster.

I have tried all sorts of trick, XMM buffer copy to no avail, changed the memory allocation for a very slight gain but the sheer bulk of calculations limit the gains that can be obtained. The source is PowerBASIC assembler code that originally had some high level code mixed in with it. I have replaced it all but with only small gains.

Looking at your assembler code, it looks like it was taken from the same base algo as it is very similar.

Glenn9999 · January 03, 2011, 04:29:57 AM

Quote from: hutch-- on January 03, 2011, 03:47:16 AM
Looking at your assembler code, it looks like it was taken from the same base algo as it is very similar.

Yeah, I think most all have been starting out from RFC 3174 with regards to their code. That's where I began. Of course, there are always optimizations (I had six total in my notes so far). The biggest ones I did mainly involved assembler substitutions to reduce number of instructions at certain points, speed increases when I wrote logic to ASM (like Sha1LoadW), and being careful about how I formed the 64 byte sections (Sha1Update).

Besides trying to learn more assembler and get experience, I'd like to eventually rewrite Sha1Update to ASM just to see if it provides a speed increase (since the profiler indicates the loop variables as the statements taking up the most time) - maybe try macros in sequence instead of the loops. Hopefully I can get that done in the next couple of days.

Glenn9999 · January 03, 2011, 05:33:04 AM

Quote from: hutch-- on January 03, 2011, 03:47:16 AM
The original I started with ran a 25 meg file in about 328 ms, I have it down to 234 ms but there is little that can be done to make it much faster.

I got curious and tested what I posted against a 25MB file (515ms on the machine posted before and 265ms on a friend's Core2 Duo). What kind of hardware did you try this against? I'm wondering how comparable what you have is with what I posted, and if there's anything I did miss.

hutch-- · January 03, 2011, 07:02:07 AM

Glen,

The source i was given to optimise was already written in assembler with a little high level code so it was reasonably fast anyway. I have literally bashed the guts out of it to get it down and that meant replacing the small amount of high level code, found one extra register that did help but finally there are too many local variables and this means memory to register copy back and forth which really drops the speed.

I currently develop on a Core2 Quad at 3 gig (9650) and its a reasonably fast machine although it is different to optimise code for than the earlier PIVs.

MichaelW · January 03, 2011, 08:30:39 AM

There is an implementation here that is supposed to provide a performance increase of ~1.2 to ~1.5 over the best scalar implementations.

Glenn9999 · January 04, 2011, 04:52:38 AM

Quote from: hutch-- on January 03, 2011, 07:02:07 AM
I currently develop on a Core2 Quad at 3 gig (9650) and its a reasonably fast machine although it is different to optimise code for than the earlier PIVs.

Slightly faster than the Core2 Duo that I ran my code against. So I would say what I posted is comparable to near identical in performance to what you posted.

As for looking at it, it seems like there isn't much that can be done save a change to make the algorithm more efficient than what was listed in RFC.

FWIW, I played with the Sha1Compress loops and got it expressed as this without incorrect results:

Code Select


T := Sha1Func1(a, b, c, d, e, w[i]);   // all code, formula that is saved to T
E := D;
D := C;
C := B;
B := A;
A := T;
C := LRot32(C, 30);            // ROL   C, 30

I'm thinking given all the variable moves in the middle that somehow the loops can be expressed without all that, but I didn't come up with the answer in the time I've experimented with it so far. Barring that, if there's an instruction (MMX maybe?) that will do a 12-byte overlapping move, it might not hurt trying that.

Otherwise I might try unrolling the loops and see if I can get a speed increase. Though it's looking like the wall has been hit for SHA-1 here without using any special tricks, and I'll have to be satisfied with what I have.

hutch-- · January 04, 2011, 05:14:06 AM

Glen,

Pretty much the same observation, the 5 memory to memory operations are the real speed killer, reducing memory operations is probably the best way to get an algo faster but you need more registers than the 6 or 7 you can manage under 32 bit x86. The original version I was given used all of the registers except EBX and I got some gain using it to reduce the memory to memory transfers but there are just too many local variables in the algo to beat it.

The version that Michael posted is a 64 bit version from the Intel site using some SSE but the speed gains were nothing exciting, the basic nature of the algorithm does not well suit SSE instructions.

Ghandi · January 04, 2011, 10:16:50 AM

I have not timed it, but have you looked at the SHA1 algorithm in Drizz's Crypto Hash library? How about taking an implementation in C and compiling it optimized for speed and using that as the base model of a newer algorithm? Of course, if either of these suggestions are suitable then the original author would require acknowledgement, which goes without saying.

HR,
Ghandi

dedndave · January 04, 2011, 11:52:53 AM

if Drizz wrote it in assembler, there would be little need of compiling it in C :P

hutch-- · January 04, 2011, 10:02:49 PM

All of the versions I have seen so far are based of an RFC and there is little variation in the basic code.

The last trick I tried was to replace the 64 byte block copy with 4 inline SSE register copies,

Code Select


    movdqu xmm0, [esi]
    movdqu xmm1, [esi+16]
    movdqu xmm2, [esi+32]
    movdqu xmm3, [esi+48]

    movdqu [edi], xmm0
    movdqu [edi+16], xmm1
    movdqu [edi+32], xmm2
    movdqu [edi+48], xmm3

    add esi, 64

There is no way the data can be 16 byte aligned, 8 is about the best with this usage so I used the unaligned instruction and it did not make the slightest difference in timing. Apart from the number of memory to memory copies inherant in the algorithm it has a block of 4 path alternate branching that cannot be predicted as it is data dependent and this is also a major bottleneck in the speed of the algo.

Antariy · January 04, 2011, 11:08:52 PM

Quote from: Glenn9999 on January 03, 2011, 03:35:22 AM
2) I have a profiler I've been using which understands Delphi code. According to that, the majority of the time is spent in Sha1Compress and Sha1LoadW, so the biggest gains could probably be made there, though I'm sure there might be other things.

What if replace this piece (conversion to big endian):

Code Select


    @loop1:
    MOV      ECX, [EAX+EDX*4]         // W[i]
    BSWAP    ECX                      // make ECX big-endian
    MOV      [EAX+EDX*4], ECX         // return result to W[i]
    INC      EDX                      // increment counter by 1
    CMP      EDX, 15                  // compare counter with 15
    JLE      @loop1                   // if <= 15 continue

to this MMX piece:

Code Select


@loop1:
PSHUFW MM0,QWORD PTR [EAX+EDX*4],1Bh
PSHUFW MM1,QWORD PTR [EAX+EDX*4],1Bh
ADD EDX,4
PSRLW MM0,8
PSLLW MM1,8
CMP EDX,15
PADDD MM0,MM1
MOVNTQ QWORD PTR [EAX+EDX*4],MM0
JLE @loop1

EDITED: Or needed to use B1h constant instead of 1Bh - for swapping of bytes of DWORDs, not QWORDs.

Glenn9999 · January 04, 2011, 11:37:35 PM

Quote from: hutch-- on January 04, 2011, 05:14:06 AM
Pretty much the same observation, the 5 memory to memory operations are the real speed killer

I have a Delphi version now without the memory copies which appears to provide correct results. Now the problem to solve with what I have sitting here is the same as the one in this MD5 thread. Hopefully there are enough processing registers to be able to make macros like what is in that thread.

Attempting an ASM version now.

News:

Does anyone have a genuinely fast SHA1 algo ?