This type of code is something I am not all that familiar with but I have some folks that need to do work like this so I wondered if anyone has a fast SHA1 algo up and going that I can play with ?
Quote from: hutch-- on January 01, 2011, 02:37:05 AM
This type of code is something I am not all that familiar with but I have some folks that need to do work like this so I wondered if anyone has a fast SHA1 algo up and going that I can play with ?
In trying to get efficient hashing algorithms (as I've been posting about on and off here), SHA1 is the only one that I really haven't been able to completely optimize to the point of having no reservations about having anything more that could be done. At least to my general knowledge. But it's faster than what I've found so far.
However, it's in Delphi, but with some Delphi assembler modules. The stuff that does not relate to API calls (was going with a Delphi target, so going with what I know) are mainly what is in the Delphi assembler. I can post it and/or maybe work on some MASM substitutes for the Delphi assembler parts if it would help.
Okay. I tried writing some MASM code and realized that I didn't have any resources immediately handy to answer a couple of things I realized I didn't know (addressing a pointer to a STRUCT - can I do it like I did it with the TASM/Delphi ASM? - and a compiler error on a macro call line I have). So here's the Delphi with assembler as I have it right now, all the ideas I had tried out and the speed improvements left in. Hopefully it should be a good start since I couldn't get it converted in a timely manner, and the ASM can be done from it that I haven't already done. I'd definitely be interested to see if it could be made faster (at least within the realm of X86 code).
Stats:
1) It runs about 17000ms or so on my testbed (660MB of files, Athlon XP 2000). The best code I originally had was around 30000ms, so consider it a positive improvement, at least in that regard. It's very comparable to the MD5 code and CRC32 code I have right now, speed wise.
2) I have a profiler I've been using which understands Delphi code. According to that, the majority of the time is spent in Sha1Compress and Sha1LoadW, so the biggest gains could probably be made there, though I'm sure there might be other things.
3) The "big reservation" I mentioned is in Sha1Compress. I'm thinking that the values need not be rotated in memory and can be called in sequence via macro like the MD5 thread (http://www.masm32.com/board/index.php?topic=15495.15), but I wasn't too successful in getting a correct result. I had another thing to tend to, so I really haven't spent too much time and brain power on trying to eliminate it.
4) I had the loop sections of Sha1Compress written completely in ASM and didn't get a speed increase (namely because of all the memory moves, can't do much besides MOV EAX, A MOV B, EAX with that). Of course that doesn't mean there wasn't anything that I missed.
5) Sha1File is kind of irrelevant (probably) to what you wanted to know, but I left it in there as an illustration of how the code is meant to be used.
{$R- $Q- $O+}
unit sha1;
interface
type
string2 = string[2];
DWord = Longint;
longword = longint;
TSHA1Digest= array[0..19] of byte;
TSHA1Context= record
Hash: array[0..4] of DWord;
Hi, Lo: integer;
Buffer: array[0..63] of byte;
Index: integer;
end;
procedure SHA1Init(var Context: TSHA1Context);
procedure SHA1Update(var Context: TSHA1Context; Buffer: pointer; Len: integer);
procedure SHA1Final(var Context: TSHA1Context; var Digest: TSHA1Digest);
function SHA1File(infile_name: string): string;
implementation
uses sysutils, windows;
function SHA1File(infile_name: string): string;
var
infile: file;
inbuffer: array[0..65535] of byte;
amount_read: integer;
Context: TSHA1Context;
Digest: TSHA1Digest;
begin
AssignFile(infile, infile_name);
Reset(infile, 1);
Sha1Init(Context);
blockread(infile, inbuffer, sizeof(inbuffer), amount_read);
while amount_read > 0 do
begin
Sha1Update(Context, @inbuffer, amount_read);
blockread(infile, inbuffer, sizeof(inbuffer), amount_read);
end;
CloseFile(infile);
SHA1Final(Context,Digest);
Result := write_hash(@Digest, Sizeof(Digest), true); // this is a ByteToHex function
end;
function RB(A: DWord): DWord; assembler;
{ 32-bit little-endian to big-endian, not a big time consumer but rewritten
anyway to take advantage of assembler instructions not directly supported
in Delphi }
asm
BSWAP EAX // swap values in EAX
end; // EAX is result in "assembler" functions.
procedure Sha1LoadW(W: pointer); assembler;
// W is in EAX
asm
MOV EDX, 0 // set EDX counter to 0
// start first loop, make first 16 elements big-endian
@loop1:
MOV ECX, [EAX+EDX*4] // W[i]
BSWAP ECX // make ECX big-endian
MOV [EAX+EDX*4], ECX // return result to W[i]
INC EDX // increment counter by 1
CMP EDX, 15 // compare counter with 15
JLE @loop1 // if <= 15 continue
// start second loop, do SHA spread
@loop2: // EDX should be 16 here
MOV ECX, [EAX+EDX*4-12] // W[i-3]
XOR ECX, [EAX+EDX*4-32] // W[i-8]
XOR ECX, [EAX+EDX*4-56] // W[i-14]
XOR ECX, [EAX+EDX*4-64] // W[i-16]
ROL ECX, 1 // ROtate Left by 1
MOV [EAX+EDX*4], ECX // return result to W[i]
INC EDX // increment counter by 1
CMP EDX, 79 // compare counter with 79
JLE @loop2 // if <=79 continue
end;
procedure SHA1Compress(Buffer: pointer; var Data: TSHA1Context);
// performs SHA-1 on a 64 byte section.
var
A, B, C, D, E, T: DWord;
W: array[0..79] of DWord;
i: integer;
begin
Move(Buffer^,W,64); // change here to support separate buffer
Sha1LoadW(@W);
A:= Data.Hash[0]; B:= Data.Hash[1]; C:= Data.Hash[2]; D:= Data.Hash[3]; E:= Data.Hash[4];
for i:= 0 to 19 do
begin
asm
MOV ECX, D
XOR ECX, c
AND ECX, b
XOR ECX, D
MOV EDX, A // move A to EDX register
ROL EDX, 5 // ROtate Left EDX by 5
ADD ECX, EDX
ADD ECX, E
ADD ECX, $5A827999
MOV EAX, i
ADD ECX, [Pointer(W)+ EAX*4]
MOV T, ECX
end;
E:= D; D:= C;
asm
MOV EDX, B // move b to EDX register
ROL EDX, 30 // ROtate Left EDX by 30
MOV C, EDX // Return Result to C
end;
B:= A; A:= T;
end;
for i:= 20 to 39 do
begin
asm
MOV ECX, d
XOR ECX, c
XOR ECX, b
MOV EDX, A // move A to EDX register
ROL EDX, 5 // ROtate Left EDX by 5
ADD ECX, EDX
ADD ECX, e
ADD ECX, $6ED9EBA1
MOV EAX, i
ADD ECX, [Pointer(W)+ EAX*4]
MOV T, ECX
end;
E:= D; D:= C;
asm
MOV EDX, B // move b to EDX register
ROL EDX, 30 // ROtate Left EDX by 30
MOV C, EDX // Return Result to C
end;
B:= A; A:= T;
end;
for i:= 40 to 59 do
begin
asm
MOV EAX, c
OR EAX, b
AND EAX, d
MOV ECX, b
AND ECX, c
OR ECX, EAX
MOV EAX, A // move A to EDX register
ROL EAX, 5 // ROtate Left EDX by 5
ADD ECX, EAX
ADD ECX, E
ADD ECX, $8F1BBCDC
MOV EAX, i
ADD ECX, [Pointer(W)+EAX*4]
MOV T, ECX
end;
E:= D; D:= C;
asm
MOV EDX, B // move b to EDX register
ROL EDX, 30 // ROtate Left EDX by 30
MOV C, EDX // Return Result to C
end;
B:= A; A:= T;
end;
for i:= 60 to 79 do
begin
asm
MOV ECX, d
XOR ECX, c
XOR ECX, b
MOV EDX, A // move A to EDX register
ROL EDX, 5 // ROtate Left EDX by 5
ADD ECX, EDX
ADD ECX, e
ADD ECX, $CA62C1D6
MOV EAX, i
ADD ECX, [Pointer(W)+ EAX*4]
MOV T, ECX
end;
E:= D; D:= C;
asm
MOV EDX, B // move b to EDX register
ROL EDX, 30 // ROtate Left EDX by 30
MOV C, EDX // Return Result to C
end;
B:= A; A:= T;
end;
Data.Hash[0]:= Data.Hash[0] + A;
Data.Hash[1]:= Data.Hash[1] + B;
Data.Hash[2]:= Data.Hash[2] + C;
Data.Hash[3]:= Data.Hash[3] + D;
Data.Hash[4]:= Data.Hash[4] + E;
end;
}
//******************************************************************************
procedure SHA1Init(var Context: TSHA1Context);
begin
Context.Hi:= 0; Context.Lo:= 0;
Context.Index:= 0;
FillChar(Context.Buffer,Sizeof(Context.Buffer),0);
Context.Hash[0]:= $67452301;
Context.Hash[1]:= $EFCDAB89;
Context.Hash[2]:= $98BADCFE;
Context.Hash[3]:= $10325476;
Context.Hash[4]:= $C3D2E1F0;
end;
//******************************************************************************
procedure SHA1Update(var Context: TSHA1Context; Buffer: pointer; Len: integer);
begin
// Length needs to be in bits, maintaining 2 parts of a 64 bit integer.
asm
MOV EAX, Context // address of context to EAX
MOV ECX, Len // length to ECX
SHL ECX, 3 // multiply ECX by 8
ADD TSha1Context([EAX]).Lo, ECX // add that to the low value.
ADC TSha1Context([EAX]).Hi, 0 // add carry flag if exists.
end;
// process chunk from previous run
if Context.Index > 0 then
begin
Move(Buffer^, Context.Buffer[Context.Index], 64-Context.Index);
Sha1Compress(@Context.Buffer[0], Context);
Inc(Longint(Buffer), 64-Context.Index);
dec(Len, 64-Context.Index);
Context.Index := 0;
end;
// pull off 64-byte chunks and process
while Len >= 64 do
begin
Sha1Compress(Buffer, Context);
Inc(Longint(Buffer), 64);
dec(Len, 64);
end;
// if there's some left, pull off and leave for later
if Len > 0 then
begin
FillChar(Context.Buffer[0], sizeof(Context.Buffer), 0);
Move(Buffer^, Context.Buffer[0], Len);
Context.Index := Len;
end;
end;
//******************************************************************************
procedure SHA1Final(var Context: TSHA1Context; var Digest: TSHA1Digest);
type
PDWord= ^DWord;
begin
Context.Buffer[Context.Index]:= $80; // end of content needs 1 bit at end
if Context.Index>= 56 then
begin
SHA1Compress(@Context.Buffer[0], Context);
FillChar(Context.Buffer[0], sizeof(Context.Buffer), 0);
end;
PDWord(@Context.Buffer[56])^:= RB(Context.Hi);
PDWord(@Context.Buffer[60])^:= RB(Context.Lo);
SHA1Compress(@Context.Buffer[0], Context);
Context.Hash[0]:= RB(Context.Hash[0]);
Context.Hash[1]:= RB(Context.Hash[1]);
Context.Hash[2]:= RB(Context.Hash[2]);
Context.Hash[3]:= RB(Context.Hash[3]);
Context.Hash[4]:= RB(Context.Hash[4]);
Move(Context.Hash,Digest,Sizeof(Digest));
FillChar(Context,Sizeof(Context),0);
end;
Hope that gives you a good start, and if you are unsure about what some of the remaining Delphi does, feel free to ask and I or someone else can explain it.
Thanks Glen, I have been manually optimising the code I had to work on and I am still getting reliable results but its a messy algo with too many local variables which really slow it down. The original I started with ran a 25 meg file in about 328 ms, I have it down to 234 ms but there is little that can be done to make it much faster.
I have tried all sorts of trick, XMM buffer copy to no avail, changed the memory allocation for a very slight gain but the sheer bulk of calculations limit the gains that can be obtained. The source is PowerBASIC assembler code that originally had some high level code mixed in with it. I have replaced it all but with only small gains.
Looking at your assembler code, it looks like it was taken from the same base algo as it is very similar.
Quote from: hutch-- on January 03, 2011, 03:47:16 AM
Looking at your assembler code, it looks like it was taken from the same base algo as it is very similar.
Yeah, I think most all have been starting out from RFC 3174 (http://www.faqs.org/rfcs/rfc3174.html) with regards to their code. That's where I began. Of course, there are always optimizations (I had six total in my notes so far). The biggest ones I did mainly involved assembler substitutions to reduce number of instructions at certain points, speed increases when I wrote logic to ASM (like Sha1LoadW), and being careful about how I formed the 64 byte sections (Sha1Update).
Besides trying to learn more assembler and get experience, I'd like to eventually rewrite Sha1Update to ASM just to see if it provides a speed increase (since the profiler indicates the loop variables as the statements taking up the most time) - maybe try macros in sequence instead of the loops. Hopefully I can get that done in the next couple of days.
Quote from: hutch-- on January 03, 2011, 03:47:16 AM
The original I started with ran a 25 meg file in about 328 ms, I have it down to 234 ms but there is little that can be done to make it much faster.
I got curious and tested what I posted against a 25MB file (515ms on the machine posted before and 265ms on a friend's Core2 Duo). What kind of hardware did you try this against? I'm wondering how comparable what you have is with what I posted, and if there's anything I did miss.
Glen,
The source i was given to optimise was already written in assembler with a little high level code so it was reasonably fast anyway. I have literally bashed the guts out of it to get it down and that meant replacing the small amount of high level code, found one extra register that did help but finally there are too many local variables and this means memory to register copy back and forth which really drops the speed.
I currently develop on a Core2 Quad at 3 gig (9650) and its a reasonably fast machine although it is different to optimise code for than the earlier PIVs.
There is an implementation here (http://software.intel.com/en-us/articles/improving-the-performance-of-the-secure-hash-algorithm-1/) that is supposed to provide a performance increase of ~1.2 to ~1.5 over the best scalar implementations.
Quote from: hutch-- on January 03, 2011, 07:02:07 AM
I currently develop on a Core2 Quad at 3 gig (9650) and its a reasonably fast machine although it is different to optimise code for than the earlier PIVs.
Slightly faster than the Core2 Duo that I ran my code against. So I would say what I posted is comparable to near identical in performance to what you posted.
As for looking at it, it seems like there isn't much that can be done save a change to make the algorithm more efficient than what was listed in RFC.
FWIW, I played with the Sha1Compress loops and got it expressed as this without incorrect results:
T := Sha1Func1(a, b, c, d, e, w[i]); // all code, formula that is saved to T
E := D;
D := C;
C := B;
B := A;
A := T;
C := LRot32(C, 30); // ROL C, 30
I'm thinking given all the variable moves in the middle that somehow the loops can be expressed without all that, but I didn't come up with the answer in the time I've experimented with it so far. Barring that, if there's an instruction (MMX maybe?) that will do a 12-byte overlapping move, it might not hurt trying that.
Otherwise I might try unrolling the loops and see if I can get a speed increase. Though it's looking like the wall has been hit for SHA-1 here without using any special tricks, and I'll have to be satisfied with what I have.
Glen,
Pretty much the same observation, the 5 memory to memory operations are the real speed killer, reducing memory operations is probably the best way to get an algo faster but you need more registers than the 6 or 7 you can manage under 32 bit x86. The original version I was given used all of the registers except EBX and I got some gain using it to reduce the memory to memory transfers but there are just too many local variables in the algo to beat it.
The version that Michael posted is a 64 bit version from the Intel site using some SSE but the speed gains were nothing exciting, the basic nature of the algorithm does not well suit SSE instructions.
I have not timed it, but have you looked at the SHA1 algorithm in Drizz's Crypto Hash library? How about taking an implementation in C and compiling it optimized for speed and using that as the base model of a newer algorithm? Of course, if either of these suggestions are suitable then the original author would require acknowledgement, which goes without saying.
HR,
Ghandi
if Drizz wrote it in assembler, there would be little need of compiling it in C :P
All of the versions I have seen so far are based of an RFC and there is little variation in the basic code.
The last trick I tried was to replace the 64 byte block copy with 4 inline SSE register copies,
movdqu xmm0, [esi]
movdqu xmm1, [esi+16]
movdqu xmm2, [esi+32]
movdqu xmm3, [esi+48]
movdqu [edi], xmm0
movdqu [edi+16], xmm1
movdqu [edi+32], xmm2
movdqu [edi+48], xmm3
add esi, 64
There is no way the data can be 16 byte aligned, 8 is about the best with this usage so I used the unaligned instruction and it did not make the slightest difference in timing. Apart from the number of memory to memory copies inherant in the algorithm it has a block of 4 path alternate branching that cannot be predicted as it is data dependent and this is also a major bottleneck in the speed of the algo.
Quote from: Glenn9999 on January 03, 2011, 03:35:22 AM
2) I have a profiler I've been using which understands Delphi code. According to that, the majority of the time is spent in Sha1Compress and Sha1LoadW, so the biggest gains could probably be made there, though I'm sure there might be other things.
What if replace this piece (conversion to big endian):
@loop1:
MOV ECX, [EAX+EDX*4] // W[i]
BSWAP ECX // make ECX big-endian
MOV [EAX+EDX*4], ECX // return result to W[i]
INC EDX // increment counter by 1
CMP EDX, 15 // compare counter with 15
JLE @loop1 // if <= 15 continue
to this MMX piece:
@loop1:
PSHUFW MM0,QWORD PTR [EAX+EDX*4],1Bh
PSHUFW MM1,QWORD PTR [EAX+EDX*4],1Bh
ADD EDX,4
PSRLW MM0,8
PSLLW MM1,8
CMP EDX,15
PADDD MM0,MM1
MOVNTQ QWORD PTR [EAX+EDX*4],MM0
JLE @loop1
EDITED: Or needed to use B1h constant instead of 1Bh - for swapping of bytes of DWORDs, not QWORDs.
Quote from: hutch-- on January 04, 2011, 05:14:06 AM
Pretty much the same observation, the 5 memory to memory operations are the real speed killer
I have a Delphi version now without the memory copies which appears to provide correct results. Now the problem to solve with what I have sitting here is the same as the one in this MD5 thread (http://www.masm32.com/board/index.php?topic=15495.15). Hopefully there are enough processing registers to be able to make macros like what is in that thread.
Attempting an ASM version now.
Glenn, is Delphi's inline ASM support MMX instructions? Have a look into my previous post then.
With the algo I am playing with the following code is the main bottleneck. I modified the 5 memory copies but no timing difference.
calcsha_lbl3:
lea esi, [ecx*4]
add esi, pWrd
mov eax, [esi+52]
xor eax, [esi+32]
xor eax, [esi+8]
xor eax, [esi]
mov edx, eax
shl eax, 1
shr edx, 31
or eax, edx
mov [esi+64], eax
add ecx, 1
cmp ecx, 64
jne calcsha_lbl3
unrolling it made no difference, it is the sum total number of memory accesses in this loop that limit the speed.
Quote from: hutch-- on January 05, 2011, 12:41:12 AM
With the algo I am playing with the following code is the main bottleneck. I modified the 5 memory copies but no timing difference.
Yes, memory references is probably slowest thing to do. Hutch have a look to code suggested on previous page - to MMX packet DWORDs bytes swap. Since packets and non temporal writing back - this should get not small speed up in this point.
add esi, pWrd
Is pWrd a local variable? What if make FPO, and free EBP reg?
Alex,
The problem is the algo has 12 DWORD local variables and no way to reduce them. to make the memory copy a bit faster with the 5 memory copies I used 5 registers to separate the read after write problems but it did not change the timing, the part that slows it all down is the loop code I posted above. i had a look at your MMX replacement for BSWAP but I have reason not to use MMX registers, XMM would be worth a try as they don't conflist with FP operations but I doubt thats the problem.
Quote from: hutch-- on January 05, 2011, 01:18:19 AM
The problem is the algo has 12 DWORD local variables and no way to reduce them.
Is possible to refer to all locals by ESP, and free EBP for something else?
Alex,
Its a point of diminishing returns, for the additional complexity of managing that many locals in ESP, all you get is one register and that will not solve the problem. The loop code that is the bottle neck is already full register code apart from the memory access that it is directly modifying, the algo itself does not have the room to change this. It may be possible in x64 with the extra registers but I still have my doubts.
Quote from: hutch-- on January 05, 2011, 01:58:55 AM
the algo itself does not have the room to change this. It may be possible in x64 with the extra registers but I still have my doubts.
what if use MMX/XMM regs to temporary holding of the other regs data?
I.e. for eaxample, write EBX not to local var, but to XMM:
movd xmm0,ebx
... do something else with EBX ...
movd ebx,xmm0
...
It looks like storage of 4 DWORD per an XMM reg. Just shifting them and store-retrieve values. Such stuff is slow on a PIV, but not sure about Core+
Quote from: Antariy on January 04, 2011, 11:42:52 PM
Glenn, is Delphi's inline ASM support MMX instructions? Have a look into my previous post then.
I'll have a look at it when I get done with what I'm working on now. That said, I tested a inline procedure implementation (all Delphi code) without the memory moves I referenced, and got a speed of 61ms slower than what I posted in this thread with the 25MB file, so the change definitely looks promising assuming there's a way to inline it in ASM as well as remove the Delphi ROL emulation (the big known time killer in the Delphi implementation).
Quote from: hutch-- on January 05, 2011, 12:41:12 AM
With the algo I am playing with the following code is the main bottleneck. I modified the 5 memory copies but no timing difference.
calcsha_lbl3:
lea esi, [ecx*4]
add esi, pWrd
mov eax, [esi+52]
xor eax, [esi+32]
xor eax, [esi+8]
xor eax, [esi]
mov edx, eax
shl eax, 1
shr edx, 31
or eax, edx
mov [esi+64], eax
add ecx, 1
cmp ecx, 64
jne calcsha_lbl3
unrolling it made no difference, it is the sum total number of memory accesses in this loop that limit the speed.
Hutch,
What about this:
calcsha_lbl3:
lea esi, [ecx*4]
add esi, pWrd
mov edx, [esi] avoid stalls.
mov eax, [esi+8]
xor edx, [esi+32]
xor eax, [esi+52]
xor edx, eax
mov eax, edx
shl eax, 1 I think this does the same thing.
rcr edx, 31
add ecx, 1 Avoid stalls.
mov [esi+64], eax
cmp ecx, 64
jne calcsha_lbl3
Dave.
Hutch,
Forget that last two line change, that won't work at all.
Dave.
Finally got the code I was trying written out, but MASM turned out something I didn't understand. I got 100 lines or so of this:
Assembling: sha1procs.asm
sha1procs.asm(128) : error A2070: invalid instruction operands
Sha1Loop1(17): Macro Called From
sha1procs.asm(128): Main Line Code
sha1procs.asm(128) : error A2070: invalid instruction operands
Sha1Loop1(18): Macro Called From
sha1procs.asm(128): Main Line Code
sha1procs.asm(128) : error A2023: instruction operand must have size
Sha1Loop1(20): Macro Called From
sha1procs.asm(128): Main Line Code
sha1procs.asm(129) : error A2070: invalid instruction operands
Sha1Loop1(17): Macro Called From
sha1procs.asm(129): Main Line Code
sha1procs.asm(129) : error A2070: invalid instruction operands
Sha1Loop1(18): Macro Called From
sha1procs.asm(129): Main Line Code
sha1procs.asm(129) : error A2023: instruction operand must have size
Sha1Loop1(20): Macro Called From
sha1procs.asm(129): Main Line Code
I assume the errors are referencing the Sha1Loop1 macro, but I'm not seeing how to reference the code to see what lines it is talking about. Any help?
the more errors you have - the simpler the problem usually is - lol
are you missing...
.686p
.MODEL Flat,StdCall
OPTION CaseMap:None
.MMX
.XMM
Quote from: dedndave on January 05, 2011, 05:11:46 AM
the more errors you have - the simpler the problem usually is - lol
are you missing...
.686p
.MODEL Flat,StdCall
OPTION CaseMap:None
.MMX
.XMM
Nope. The thing of it is, I can presume that the numbers in the first lines are source lines, since line #128 has the first macro call. The problem I'm having is that 17, 18 and 20 make no sense in the context of the source. Of course I could be missing something.
Edit: Here's the macro. Maybe someone can see something?
Sha1Loop1 macro a, b, cf, d, e, w
MOV EBP, d
XOR EBP, cf
AND EBP, b
XOR EBP, d
ADD e, EBP
;----------------------------------------
MOV EBP, a
ROL EBP, 5
ADD e, EBP
;----------------------------------------
ADD e, w
ADD e, 05A827999h
;----------------------------------------
ROL b, 30
;----------------------------------------
endm
i tried
Sha1Loop1 eax,eax,eax,eax,eax,eax
it assembled without error
show us the line with the very first error in the list
you may have to look in C:\masm32\bin\asmbl.txt
Glen,
If you want to try and optimise it once you get it going, inline all of the macros back into the source code.
yah - a lot of dependencies in that macro
perhaps you are trying to use memory operands in var # 5 and 6 ? (e, w)
Quote from: dedndave on January 05, 2011, 05:39:03 AM
show us the line with the very first error in the list
you may have to look in C:\masm32\bin\asmbl.txt
That's just it and why I'm asking. I don't know what to look for (what I posted above doesn't make sense). And I'm doing the same thing as in the MD5 thread (compared it), so I really don't know. Here's the first macro call if that helps any. But what I do read from that output is that it's referring to errors in the macro.
Sha1Loop1 [eax], [ebx], [ecx], [edx], [esi], [edi+ 0*4]
ADD e, w
you can't do that :bg
ADD [esi],[edi+ 0*4]
not allowed to add 2 memory operands
Quote from: dedndave on January 05, 2011, 05:51:21 AM
ADD [esi],[edi+ 0*4]
So I see now. Macros are a little different than what I'm used to, but seems to be what is needed for the algo to work as intended.
A couple of months ago I started reading this in the hopes of implementing it but never got around to it, maybe someone else would like to give it a shot:
http://software.intel.com/en-us/articles/improving-the-performance-of-the-secure-hash-algorithm-1/
Ofcourse its Intel so its all 64 bit code, but if you want real speed...
Quote from: Glenn9999 on January 05, 2011, 05:52:51 AM
So I see now. Macros are a little different than what I'm used to, but seems to be what is needed for the algo to work as intended.
I hit a roadblock. I needed an extra register and ESP turned out to not really be a good choice. Then what I had with those lines commented wasn't any faster than what I posted in here, which really shocks me since the Delphi version was faster. At least it was worth a try.
Anyhow, if anyone wants the version of Sha1Compress that doesn't do the memory shifts I can post it in here.
Quote from: Glenn9999 on January 05, 2011, 07:12:09 AM
I hit a roadblock. I needed an extra register and ESP turned out to not really be a good choice. Then what I had with those lines commented wasn't any faster than what I posted in here, which really shocks me since the Delphi version was faster. At least it was worth a try.
And roadblocks are meant to be torn down. Figured it out, and got a newer (to me) MASM version, seems to work. Get this:
Athlon XP 2000 Old (what I posted in here): Processed 25260492 bytes in 468 ms.
Athlon XP 2000 New: Processed 25260492 bytes in 250 ms.
There's a 4.2 second difference in code on my testbed of 660MB of files.
Core2Duo 2900 Old: Processed 25260492 bytes in 265 ms.
Core2Duo 2900 New: Processed 25260492 bytes in 156 ms.
Quote from: Antariy on January 04, 2011, 11:42:52 PM
Glenn, is Delphi's inline ASM support MMX instructions? Have a look into my previous post then.
The version I usually work with doesn't, but the newer version I have evidently does. However it blows up with "floating point instruction error". Regardless, the newest Sha1LoadW code I have is now a macro in an MASM module, so do I just copy it as is and add ".XMM/.MMX" to the header?
.686p
.MMX
.XMM
.MODEL Flat,StdCall
OPTION CaseMap:None
you need the xmm for sse
Quote from: dedndave on January 06, 2011, 07:09:55 AM
.686p
.MMX
.XMM
.MODEL Flat,StdCall
OPTION CaseMap:None
you need the xmm for sse
Okay, but kinda meant whether the code that was posted was good or not. I get the same error trying it in my MASM module.
oops - sorry, Glenn :P
not sure which code you are talking about
but....
am i the only one that uses google ? - lol
http://code.google.com/p/pyrit/issues/attachmentText?id=207&aid=4885098494754558868&name=xmm.s.200806291157.2.s&token=f0f6faa218f86cc0a17e32b05ac6b464
kinda bloatish - but i bet it hauls ass
it might make a starting point for a good algo - lol
oh - and - that was the first hit for "sse2 sha1"
c'mon, guys :lol
# More, I used "movaps", "xorps" and other opcode because they are one byte shorter than the
# equivalent "movups" and so on.
Doesn't inspire confidence, though: movaps and movups are same size but have different functions...
well - he says it works
also - English is not his primary language - not sure what he meant there
anyways - i wasn't suggesting you copy/paste his code - lol
but, you might look at it and get some ideas
as tricky as optimizing pipeline usage is, it probably isn't optimal, anyways
point was - has anyone tried searching the web - you know - that internet thingy
that was the first hit - i didn't even break a sweat
Quote from: dedndave on January 06, 2011, 07:36:15 AM
oops - sorry, Glenn :P
not sure which code you are talking about
The code that Antariy asked me to try.
Quote
but....
am i the only one that uses google ? - lol
Lol no. But sometimes Google doesn't give answers that are too great, though it was great for researching the algorithm in the first place. :D
oh - found it - page 1 of the thread
did you notice that he edited that post ?
Quote from: dedndave on January 06, 2011, 07:49:48 AM
oh - found it - page 1 of the thread
did you notice that he edited that post ?
Yes, I copied it when I made the other post in order to test it. Tried 1B and B1 as well in the "Edited" Description. 1B compiled, B1 didn't. Maybe he can clarify.
Glen,
The trailing "b" is for binary notation, the other way around b1 is being treated as a memory operand.
i must be looking at the wrong stuff
i see "1Bh"
i don't think it matters, but he uses JLE - looks like JBE would make more sense
Got done playing with this source. Hopefully someone can see something else going on in it that can shave off some more time. I cleaned it up and got it about 10ms faster than it was when I originally posted times for it.
It's about 300 lines of ASM, so I'm going to attach it to this post as a ZIP file.
Please let me know if it helps, and if anyone has any ideas to get it going faster.
For usage, see the Delphi code I originally posted in here - the ASM module call is intended as a direct Sha1Compress replacement.
Quote from: Glenn9999 on January 06, 2011, 07:51:38 AM
Quote from: dedndave on January 06, 2011, 07:49:48 AM
oh - found it - page 1 of the thread
did you notice that he edited that post ?
Yes, I copied it when I made the other post in order to test it. Tried 1B and B1 as well in the "Edited" Description. 1B compiled, B1 didn't. Maybe he can clarify.
B1 - is in hex. When you want to write a hex in MASM, but the number is starting from a letter, not digit, then write a "0" before a number. I.e. use
0B1h instead of B1 - me just said in simper manner.
OK, right code should be something like:
@loop1:
PSHUFW MM0,QWORD PTR [EAX+EDX*8],0B1h
PSHUFW MM1,QWORD PTR [EAX+EDX*8],0B1h
PSRLW MM0,8
ADD EDX,1
PSLLW MM1,8
CMP EDX,8
PADDD MM0,MM1
MOVQ QWORD PTR [EAX+EDX*4],MM0
JL @loop1
Not sure about gain - it may be not so much since memory bus is slow anyway.
Quote from: dedndave on January 06, 2011, 07:56:43 AM
i must be looking at the wrong stuff
i see "1Bh"
i don't think it matters, but he uses JLE - looks like JBE would make more sense
I'm writing that code piece right over original code - just inserting other shuffling stuff, Dave :P
I had a chance to play with this some more. Mainly the question needed to be answered if what has been posted qualifies as a "genuinely fast SHA1 algo".
I looked for other common implementations and the Microsoft ones seemed to be a good start. I tested:
1) Microsoft's FCIV command-line utility. If you invoke it with -R it will put out approximate times in seconds. Not desirable but good to see how close things are - whether they are similar or very far apart.
2) Microsoft's cryto API. I wrote a program which mirrors functionality of the testbed program I've been using to write/test these algorithms invoking this API.
The SHA-1 implementation in FCIV seems to be very comparable to the one I posted here. Probably IA-32 code and times out very similar in most all instances to the point that it would have to be considered relatively even given varations and such.
However, the crypto API was a different story. The SHA-1 algorithm there ran slightly slower than what I posted here when it came to the IA-32 platform. However, when removed to a platform that supports MMX & SSE, the speeds were radically different.
Athlon 2000XP CAPI SHA1 - Processed 681574400 bytes in 9813 ms.
Athlon 2000XP mine SHA1 - Processed 681574400 bytes in 9531 ms.
Core2Duo 2900 CAPI SHA1 - Processed 681574400 bytes in 2297 ms.
Core2Duo 2900 mine SHA1 - Processed 681574400 bytes in 3500 ms.
That said, since the times were similar in the IA-32 platform, what I posted probably is pretty close to the best you're going to get without MMX/SSE, but you're going to get a big return if you implement with those instructions.
Hope that helps, and hope the code posted was useful.
EDIT: Added all timings.