Life happened so I had to step away from learning ASM, but now I'm back. Part of the reason is because I found an interesting project (to me) to get into the ASM with. I've been trying to improve upon the speed of some CRC-32, MD5, and SHA1 algorithms I've been working with. Most of what I've been exposed to seems to be straight forward, some still confuses me, but I am managing to turn out some functional ASM and even get the speed of the code down by some of the ASM (though most of it was simplifying the algorithm involving how memory sections came out of the buffer).
Now I'm using a profiler to find where the code spends the most time and then blocking that code out into ASM, and in some cases rewriting functions. This is a Delphi environment (using the built-in assembler). I'm posting here because I'm looking for questions answered about some code, but also trying to improve upon what I'm doing. So I'm posting some code. If this is the wrong area, I apologize.
Edited: I got an inspiration from reading some other posts here that ended up answering some of my questions I posted here originally. Code also changed to reflect these questions answered.
First off, questions & statements:
1) I'm trying to keep as much of the logic & storage in the processor as possible, thinking it'll help on speed. Good idea, bad idea, or doesn't matter?
2) Delphi assembler functions like what is posted below requires that ESI and EBX be preserved upon exit. Also, "var" variables are presented as pointers.
3) And finally, given the code below (which works fine, as far as I can tell in my testing), is there anything I'm missing that I can do to simplify this or make it faster? Any help or instruction is welcome, especially since I'm still new at doing this.
procedure Crc32Update(var crc32_out: DWord; inbuffer: pointer; amount_read: DWord); assembler;
{ calculates CRC on a section of bytes pointed to by inbuffer of size amount_read }
// crc32_out in EAX, inbuffer in EDX, amount_read in ECX
asm
PUSH ESI // put ESI to stack to preserve
PUSH EBX // put EBX to stack to preserve
LEA ESI, Crc32Tab // load address of CRC32tab to ESI
MOV EBX, ECX // amount_read to EBX as counter
PUSH EAX // preserve EAX/output address
MOV EAX, [EAX] // move contents of crc32_out to EAX
@loop1:
CMP EBX, 0 // amount_read, 0
JLE @exit // if amount_read <= 0 then exit
MOVZX ECX, Byte [EDX] // move inbuffer byte to ECX, zero extend.
PUSH EAX // store CRC to stack
XOR EAX, ECX // crc xor DWord(value)
MOVZX ECX, AL // move AL to ECX, zero extend.
MOV EAX, [ESI+ECX*4] // move contents of CRC32Tab[ECX] to EAX
POP ECX // put CRC back to EDX from stack
SHR ECX, 8 // (crc shr 8)
AND ECX, $00FFFFFF // (and $00ffffff)
XOR EAX, ECX // eax := eax xor edx;
INC EDX // increment buffer address
DEC EBX // decrement amount_read
JMP @loop1 // go to @loop1
@exit:
POP ECX // return output address to ECX
MOV [ECX], EAX // return answer to contents of ECX
POP EBX // restore EBX for exit
POP ESI // restore ESI for exit
end;
procedure Crc32Update(var crc32_out: DWord; inbuffer: pointer; amount_read: DWord);
// this is the original Delphi code I'm translating to ASM, posted for reference if needed
var
i: integer;
begin
for i := 1 to amount_read do
begin
crc32_out := CRC32tab[Byte(crc32_out xor DWord(PChar(inbuffer)^))] xor
((crc32_out shr 8) and $00ffffff);
inc(Longint(inbuffer));
end;
end;
The same but shorter and faster: :wink
push ebx
push eax
mov eax, [eax]
@@:
movzx ebx, byte ptr [edx]
add edx, 1
xor ebx, eax
shr eax, 8
movzx ebx, bl
and eax, 0FFFFFFh
xor eax, [Crc32Tab+ebx*4]
add ecx, -1
jnle @b
pop ecx
pop ebx
mov [ecx], eax
Quote from: lingo on November 17, 2010, 02:49:21 PM
The same but shorter and faster: :wink
...
shr eax, 8 <---
movzx ebx, bl
and eax, 0FFFFFFh <--- ???
...
:eek
Thanks for the suggestions. What has been posted has presented a couple of good lessons.
Quote from: Antariy on November 17, 2010, 04:15:20 PM
...
shr eax, 8 <---
movzx ebx, bl
and eax, 0FFFFFFh <--- ???
...
It looked kind of strange to me, too, but didn't think enough of it to try removing it and testing it on a few files. I tried removing the AND statement and it still worked. I guess it's what I get for pulling something off the shelf and not checking it out. I implemented the other stuff myself, so I know those algorithms.
Anyhow, here's what I came to now after the ideas here.
procedure Crc32Update(var crc32_out: DWord; inbuffer: pointer; amount_read: DWord); assembler;
// calculates CRC on a section of bytes pointed to by inbuffer of size amount_read
// crc32_out in EAX, inbuffer in EDX, amount_read in ECX
asm
PUSH EBX // put EBX to stack to preserve
PUSH EAX // preserve EAX/output address
MOV EAX, [EAX] // move contents of crc32_out to EAX
@loop1:
CMP ECX, 0 // amount_read, 0
JLE @exit // if amount_read <= 0 then exit
MOVZX EBX, Byte [EDX] // move inbuffer byte to EBX, zero extend.
XOR EBX, EAX // crc xor DWord(value)
MOVZX EBX, BL // zero extend BL to EBX
SHR EAX, 8 // (crc shr 8)
XOR EAX, [Pointer(Crc32Tab)+EBX*4] // eax := eax xor edx;
INC EDX // increment buffer address
DEC ECX // decrement amount_read
JMP @loop1 // go to @loop1
@exit:
POP ECX // return output address to ECX
MOV [ECX], EAX // return answer to contents of ECX
POP EBX // restore EBX for exit
end;
The answer to my question #1 seems to be "as long as I don't touch memory too much, the shortest instructions will always help"? As for this code, the only other thing that bothers me is maintaining two counters for the loop ("INC EDX" now profiles to take the most time in this code). I tried adding ECX to EDX initially, then addressing the pointer as [EDX-ECX], and then not incrementing EDX, but the compiler complained at that. Any thoughts there?
Thanks again!
Your algorithm ignores the potential optimization of actually doing it 32-bits at a time.
push ebx
push eax
mov eax, [eax]
test ecx, -4
jz CrcDo8
push ecx
shr ecx, 2
CrcLoop32:
xor eax, dword ptr [edx]
add edx, 4
movzx ebx, al
shr eax, 8
xor eax, [Crc32Tab+ebx*4]
movzx ebx, al
shr eax, 8
xor eax, [Crc32Tab+ebx*4]
movzx ebx, al
shr eax, 8
xor eax, [Crc32Tab+ebx*4]
movzx ebx, al
shr eax, 8
xor eax, [Crc32Tab+ebx*4]
add ecx, -1
jne CrcLoop32
pop ecx
CrcDo8:
and ecx, 3
jz CrcDone
CrcLoop8:
movzx ebx, byte ptr [edx]
add edx, 1
xor ebx, eax
shr eax, 8
movzx ebx, bl
xor eax, [Crc32Tab+ebx*4]
add ecx, -1
jne CrcLoop8
CrcDone:
pop ecx
pop ebx
mov [ecx], eax
Actually I think Ian's method from last week is faster, but requires 4 x 1KB tables.
Thanks again! The CRC-32 code in my project isn't the slowest thing, but figured it would be a good start to really try and learn ASM.
On that note, does anyone know any good tricks for SHA-1, or do you just have to do it pretty much like the RFC (http://www.faqs.org/rfcs/rfc3174.html) states?
You could probably optimize it some, but not much. (Although, if you were doing a large-scale computing of cryptography every little optimization helps. :wink)
Quote from: Baluga Boo on November 20, 2010, 02:37:44 PM
You could probably optimize it some, but not much. (Although, if you were doing a large-scale computing of cryptography every little optimization helps. :wink)
Indeed - sometimes tremendously. Between the MD5, SHA1, and CRC32 code I have here, I got my test run (about 700MB of files) down 21 seconds from the original codes. Any little bit helps, especially since I'm trying to compute those things on file downloads (though I have uses for scanning data as well). I figured out with the MD5 algorithm that I could make a number of ASM procedures and feed numbers to them in cycle instead of swapping data like most expressions of the algorithm does. But with the SHA1 algorithm, it seems it's dependent on the data being swapped around in memory, so I'm not finding much opportunity for ASM optimization beyond direct use of CPU instructions (ROL and BSWAP specifically) instead of emulation of them in HLL. I'm still new to ASM, so I might not be seeing something.
I'm sure there might be possibility for use of MMX and SSE, but other than that the biggest gains I had for my MD5 and SHA1 code was careful optimization of how the 64-byte frames were formed.