News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Computing CRC: a learning project

Started by Glenn9999, November 17, 2010, 07:58:28 AM

Previous topic - Next topic

Glenn9999

Life happened so I had to step away from learning ASM, but now I'm back.  Part of the reason is because I found an interesting project (to me) to get into the ASM with.  I've been trying to improve upon the speed of some CRC-32, MD5, and SHA1 algorithms I've been working with.  Most of what I've been exposed to seems to be straight forward, some still confuses me, but I am managing to turn out some functional ASM and even get the speed of the code down by some of the ASM (though most of it was simplifying the algorithm involving how memory sections came out of the buffer).

Now I'm using a profiler to find where the code spends the most time and then blocking that code out into ASM, and in some cases rewriting functions.  This is a Delphi environment (using the built-in assembler).  I'm posting here because I'm looking for questions answered about some code, but also trying to improve upon what I'm doing.  So I'm posting some code.  If this is the wrong area, I apologize.

Edited: I got an inspiration from reading some other posts here that ended up answering some of my questions I posted here originally.  Code also changed to reflect these questions answered.
First off, questions & statements:
1) I'm trying to keep as much of the logic & storage in the processor as possible, thinking it'll help on speed.  Good idea, bad idea, or doesn't matter?
2) Delphi assembler functions like what is posted below requires that ESI and EBX be preserved upon exit.  Also, "var" variables are presented as pointers.
3) And finally, given the code below (which works fine, as far as I can tell in my testing), is there anything I'm missing that I can do to simplify this or make it faster?  Any help or instruction is welcome, especially since I'm still new at doing this.


procedure Crc32Update(var crc32_out: DWord; inbuffer: pointer; amount_read: DWord); assembler;
{ calculates CRC on a section of bytes pointed to by inbuffer of size amount_read }
  // crc32_out in EAX, inbuffer in EDX, amount_read in ECX
  asm
    PUSH   ESI                       // put ESI to stack to preserve
    PUSH   EBX                       // put EBX to stack to preserve
    LEA    ESI, Crc32Tab             // load address of CRC32tab to ESI
    MOV    EBX, ECX                  // amount_read to EBX as counter
    PUSH   EAX                       // preserve EAX/output address
    MOV    EAX, [EAX]                // move contents of crc32_out to EAX
    @loop1:
    CMP    EBX, 0                    // amount_read, 0
    JLE    @exit                     // if amount_read <= 0 then exit
    MOVZX  ECX, Byte [EDX]           // move inbuffer byte to ECX, zero extend.
    PUSH   EAX                       // store CRC to stack
    XOR    EAX, ECX                  // crc xor DWord(value)
    MOVZX  ECX, AL                   // move AL to ECX, zero extend.
    MOV    EAX, [ESI+ECX*4]          // move contents of CRC32Tab[ECX] to EAX
    POP    ECX                       // put CRC back to EDX from stack
    SHR    ECX, 8                    // (crc shr 8)
    AND    ECX, $00FFFFFF            // (and $00ffffff)
    XOR    EAX, ECX                  // eax := eax xor edx;
    INC    EDX                       // increment buffer address
    DEC    EBX                       // decrement amount_read
    JMP    @loop1                    // go to @loop1
    @exit:
    POP    ECX                       // return output address to ECX
    MOV    [ECX], EAX                // return answer to contents of ECX
    POP    EBX                       // restore EBX for exit
    POP    ESI                       // restore ESI for exit
  end;


procedure Crc32Update(var crc32_out: DWord; inbuffer: pointer; amount_read: DWord);
// this is the original Delphi code I'm translating to ASM, posted for reference if needed
  var
    i: integer;
  begin
    for i := 1 to amount_read do
      begin
        crc32_out := CRC32tab[Byte(crc32_out xor DWord(PChar(inbuffer)^))] xor
                    ((crc32_out shr 8) and $00ffffff);
        inc(Longint(inbuffer));
      end;
  end;


lingo

The same but shorter and faster:  :wink
    push ebx     
    push eax
    mov  eax, [eax]
@@:      
    movzx ebx, byte ptr [edx]
    add   edx, 1   
    xor   ebx, eax
    shr   eax, 8
    movzx ebx, bl
    and   eax, 0FFFFFFh
    xor   eax, [Crc32Tab+ebx*4]
    add   ecx, -1
    jnle  @b
    pop   ecx
    pop   ebx
    mov   [ecx], eax


Antariy

Quote from: lingo on November 17, 2010, 02:49:21 PM
The same but shorter and faster:  :wink
...
    shr   eax, 8  <---
    movzx ebx, bl
    and   eax, 0FFFFFFh <--- ???
...



:eek

Glenn9999

Thanks for the suggestions.  What has been posted has presented a couple of good lessons.

Quote from: Antariy on November 17, 2010, 04:15:20 PM

...
    shr   eax, 8  <---
    movzx ebx, bl
    and   eax, 0FFFFFFh <--- ???
...


It looked kind of strange to me, too, but didn't think enough of it to try removing it and testing it on a few files.   I tried removing the AND statement and it still worked.  I guess it's what I get for pulling something off the shelf and not checking it out.  I implemented the other stuff myself, so I know those algorithms.

Anyhow, here's what I came to now after the ideas here.

procedure Crc32Update(var crc32_out: DWord; inbuffer: pointer; amount_read: DWord); assembler;
// calculates CRC on a section of bytes pointed to by inbuffer of size amount_read
// crc32_out in EAX, inbuffer in EDX, amount_read in ECX
  asm
    PUSH   EBX                       // put EBX to stack to preserve
    PUSH   EAX                       // preserve EAX/output address
    MOV    EAX, [EAX]                // move contents of crc32_out to EAX
    @loop1:
    CMP    ECX, 0                    // amount_read, 0
    JLE    @exit                     // if amount_read <= 0 then exit
    MOVZX  EBX, Byte [EDX]           // move inbuffer byte to EBX, zero extend.
    XOR    EBX, EAX                  // crc xor DWord(value)
    MOVZX  EBX, BL                   // zero extend BL to EBX
    SHR    EAX, 8                    // (crc shr 8)
    XOR    EAX, [Pointer(Crc32Tab)+EBX*4]  // eax := eax xor edx;
    INC    EDX                       // increment buffer address
    DEC    ECX                       // decrement amount_read
    JMP    @loop1                    // go to @loop1
    @exit:
    POP    ECX                       // return output address to ECX
    MOV    [ECX], EAX                // return answer to contents of ECX
    POP    EBX                       // restore EBX for exit
  end;


The answer to my question #1 seems to be "as long as I don't touch memory too much, the shortest instructions will always help"?  As for this code, the only other thing that bothers me is maintaining two counters for the loop ("INC EDX" now profiles to take the most time in this code).  I tried adding ECX to EDX initially, then addressing the pointer as [EDX-ECX], and then not incrementing EDX, but the compiler complained at that.  Any thoughts there?

Thanks again!

clive

Your algorithm ignores the potential optimization of actually doing it 32-bits at a time.

    push ebx
    push eax
    mov  eax, [eax]

    test  ecx, -4
    jz    CrcDo8

    push  ecx
    shr   ecx, 2

CrcLoop32:
    xor   eax, dword ptr [edx]
    add   edx, 4

    movzx ebx, al
    shr   eax, 8
    xor   eax, [Crc32Tab+ebx*4]

    movzx ebx, al
    shr   eax, 8
    xor   eax, [Crc32Tab+ebx*4]

    movzx ebx, al
    shr   eax, 8
    xor   eax, [Crc32Tab+ebx*4]

    movzx ebx, al
    shr   eax, 8
    xor   eax, [Crc32Tab+ebx*4]

    add   ecx, -1
    jne   CrcLoop32

    pop   ecx
CrcDo8:
    and   ecx, 3
    jz    CrcDone

CrcLoop8:
    movzx ebx, byte ptr [edx]
    add   edx, 1
    xor   ebx, eax
    shr   eax, 8
    movzx ebx, bl
    xor   eax, [Crc32Tab+ebx*4]
    add   ecx, -1
    jne   CrcLoop8

CrcDone:

    pop   ecx
    pop   ebx
    mov   [ecx], eax


Actually I think Ian's method from last week is faster, but requires 4 x 1KB tables.
It could be a random act of randomness. Those happen a lot as well.

Glenn9999

Thanks again!  The CRC-32 code in my project isn't the slowest thing, but figured it would be a good start to really try and learn ASM.

On that note, does anyone know any good tricks for SHA-1, or do you just have to do it pretty much like the RFC states?

Twister

You could probably optimize it some, but not much. (Although, if you were doing a large-scale computing of cryptography every little optimization helps. :wink)

Glenn9999

Quote from: Baluga Boo on November 20, 2010, 02:37:44 PM
You could probably optimize it some, but not much. (Although, if you were doing a large-scale computing of cryptography every little optimization helps. :wink)

Indeed - sometimes tremendously.  Between the MD5, SHA1, and CRC32 code I have here, I got my test run (about 700MB of files) down 21 seconds from the original codes.  Any little bit helps, especially since I'm trying to compute those things on file downloads (though I have uses for scanning data as well).  I figured out with the MD5 algorithm that I could make a number of ASM procedures and feed numbers to them in cycle instead of swapping data like most expressions of the algorithm does.  But with the SHA1 algorithm, it seems it's dependent on the data being swapped around in memory, so I'm not finding much opportunity for ASM optimization beyond direct use of CPU instructions (ROL and BSWAP specifically) instead of emulation of them in HLL.  I'm still new to ASM, so I might not be seeing something.

I'm sure there might be possibility for use of MMX and SSE, but other than that the biggest gains I had for my MD5 and SHA1 code was careful optimization of how the 64-byte frames were formed.