News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

I am unable of optimize it more

Started by untio, November 05, 2010, 03:12:43 PM

Previous topic - Next topic

untio

Hi and thanks for reading this stuff.
I am coding a dll that will calculate hashes. I have started with the known crc32. It is calculated in this procedure:

getcrc32 proc uses ebx ecx edx esi, pbuffer:dword, longitud:dword, crc:dword
mov ecx, longitud
mov eax, crc
lea ebx, TablaDeCrc32
mov esi, pbuffer
xor edx, edx
label1:
mov dl, al
xor dl, [esi]
inc esi
shr eax, 8
xor eax, [ebx + edx *4]
dec ecx
jnz label1
ret
getcrc32 endp


My question is if you can tell me some way to optimize this code, mainly inside the loop. For me, some pop or push does not matter at the beginning or the end of the procedure. I am speaking about the loop.

Thanks in advance.

ToutEnMasm


You can:
* Don't use a stackframe,put the local in data and don't use a proc , a call can be enough
* analyze your code,what he do and see if it is possible to use mmx or sse instructions.

dedndave

i would try to get rid of this complex addressing instruction   :P
        xor     eax,[ebx+edx*4]
you might try it like this
label1:
        movzx   edx,al
        xor     dl,[esi]
        shr     eax,8
        shl     edx,2
        add     edx,ebx
        inc     esi
        xor     eax,[edx]
        dec     ecx
        jnz     label1

or
label1:
        movzx   edx,al
        xor     dl,[esi]
        shr     eax,8
        shl     edx,2
        inc     esi
        xor     eax,[ebx+edx]
        dec     ecx
        jnz     label1

and you can eliminate the xor edx,edx instruction
i am sure lingo has a faster way - lol

jj2007

Have you timed it already? What happens e.g. if you replace mov dl, al with mov edx, eax, or you use Dave's suggestion?
Stack frame yes or no is irrelevant if ecx is high enough, but the MMX/SSE2 option is worth considering. However, for testing that one a complete example would be needed.

untio

Hi and thanks for your time.
I have tried your posted code:
label1:
        movzx   edx,al
        xor     dl,[esi]
        shr     eax,8
        shl     edx,2
        add     edx,ebx
        inc     esi
        xor     eax,[edx]
        dec     ecx
        jnz     label1

Clock ticks:
First execution:
3565
3435
3465
Second one:
3485
3355
3475

And I have tried also this code:
label1:
        movzx   edx,al
        xor     dl,[esi]
        shr     eax,8
        shl     edx,2
        inc     esi
        xor     eax,[ebx+edx]
        dec     ecx
        jnz     label1

Ticks of clock:
1:
3325
3315
3294
2:
3235
3215
3304
And, obviously, I have timed my original code:
label1:
mov dl, al
xor dl, [esi]
inc esi
shr eax, 8
xor eax, [ebx + edx *4]
dec ecx
jnz label1

Ticks:
1:
3255
3204
3155
2:
3185
3305
3304
But, after a lot of probes I have tried this code, based on the dedndave's code:
label1:
mov dl, al
xor dl, [esi]
inc esi
shr eax, 8
movzx edi, dl
shl edi, 2
xor eax, [ebx + edi]
dec ecx
jnz label1

And the ticks are:
1:
2874
2784
2834
2:
2754
2814
2845

And my question for the intel engineers is why this difference using edx or esi.

Thanks a lot to all for your comments.

theunknownguy

- Remove the stack frame
- Remove partial regist operations
- Avoid using Displacement or Scalar under addressing
- Avoid using INC/DEC, Sub/Add is faster
- Use a fair order, example all movs togheter, all logical operations togheter

mov edx, eax
And edx, 0FFh
Xor Edx, [Esi]
And Edx, 0FFh
Shr Eax, 8
Shl Edx, 2
Add Esi, 1
Add Edx, Ebx
Xor Eax, [Edx]
Sub Ecx, 1
Jnz Label1


Or:

Movzx Edx, Al
Xor Edx, [Esi]
And Edx, 0FFh
Shr Eax, 8
Shl Edx, 2
Add Edx, Ebx
Add Esi, 1
Xor Eax, [Edx]
Sub Ecx, 1
Jnz Label1



eax - Accumulator Register
ebx - Base Register
ecx - Counter Register
edx - Data Register
esi - Source (for memory operations) register
edi - Destination (for memory operations) register


PS: I see you speak spanish, on the local names of your source ^^

dedndave

well - you also changed the order of instructions
that is more likely the reason for the difference
some guys in here can look at code and see how to re-arrange it for the fastest result
i just do it by trial and error - lol

untio

Hi again,
I have tried the code of theunknownguy and is slower than my final one. As I can see, optimize code is, more or less, try combinations.
And yes, I speak Spanish. I am from Spain in the south of Europe. Exactly from a village named Castelldefels next to a Mediterranean beach (but now is a little cold here).

Thanks again.

clive

Been doing CRC's for decades, it's basically the remainder from a long division. Quicker to do 32-bits, than 8-bits when you can. The effectiveness depends on the direction of shift (crc and/or data), and the CPU architecture (x86, ARM, etc).

Here's a quick mash-up in my preferred framework.

//****************************************************************************

DWORD getcrc32_A(BYTE *buffer, DWORD longitud, DWORD crc)
{
  __asm // Original
  {
    mov ecx, longitud
    mov eax, crc
    lea ebx, TablaDeCrc32
    mov esi, buffer
    xor edx, edx
label1:
    mov dl, al
    xor dl, [esi]
    inc esi
    shr eax, 8
    xor eax, [ebx + edx *4]
    dec ecx
    jnz label1
  }
}

//****************************************************************************

DWORD getcrc32_B(BYTE *buffer, DWORD longitud, DWORD crc)
{
  __asm // Simplified
  {
    mov ecx, longitud
    mov eax, crc
    lea ebx, TablaDeCrc32
    mov esi, buffer
label1:
    xor al, [esi]
    inc esi

    movzx edx,al
    shr eax, 8
    xor eax, [ebx + edx *4]

    dec ecx
    jnz label1
  }
}

//****************************************************************************

DWORD getcrc32_C(BYTE *buffer, DWORD longitud, DWORD crc)
{
  __asm // Parallel efficiency
  {
    mov ecx, longitud
    mov eax, crc
    lea ebx, TablaDeCrc32
    mov esi, buffer

    push  ecx
    shr   ecx,2 ; compute dwords
    jz    crc8 ; no dwords

label32:
    xor eax, [esi]

    movzx edx,al
    shr eax, 8
    xor eax, [ebx + edx *4]

    movzx edx,al
    shr eax, 8
    xor eax, [ebx + edx *4]

    add esi,4

    movzx edx,al
    shr eax, 8
    xor eax, [ebx + edx *4]

    movzx edx,al
    shr eax, 8
    xor eax, [ebx + edx *4]

    sub ecx,1
    jnz label32

crc8:

    pop ecx
    and ecx,3 ; compute residual bytes
    jz  crcdone ; no bytes

label16:
    xor al, [esi]
    inc esi

    movzx edx,al
    shr eax, 8
    xor eax, [ebx + edx *4]

    sub ecx,1
    jnz label16

crcdone:
  }
}

//****************************************************************************


Probably some efficiencies to be tested using MOV EDX,EAX; AND EDX,0FFh instead of MOVZX EDX,AL, and aligning loop/branch targets. I'm more interested in algorithmic optimizations.
It could be a random act of randomness. Those happen a lot as well.

theunknownguy

Quote from: clive on November 05, 2010, 06:02:15 PM

//****************************************************************************

DWORD getcrc32_C(BYTE *buffer, DWORD longitud, DWORD crc)
{
  __asm // Parallel efficiency
  {
    mov ecx, longitud
    mov eax, crc
    lea ebx, TablaDeCrc32
    mov esi, buffer

    push  ecx
    shr   ecx,2 ; compute dwords
    jz    crc8 ; no dwords

label32:
    xor eax, [esi]

    movzx edx,al
    shr eax, 8
    xor eax, [ebx + edx *4]

    movzx edx,al
    shr eax, 8
    xor eax, [ebx + edx *4]

    add esi,4

    movzx edx,al
    shr eax, 8
    xor eax, [ebx + edx *4]

    movzx edx,al
    shr eax, 8
    xor eax, [ebx + edx *4]

    sub ecx,1
    jnz label32

crc8:

    pop ecx
    and ecx,3 ; compute residual bytes
    jz  crcdone ; no bytes

label16:
    xor al, [esi]
    inc esi
    movzx edx,al
    shr eax, 8
    xor eax, [ebx + edx *4]
    sub ecx,1
    jnz label16

crcdone:
  }
}

//****************************************************************************


I think by plain sight that one kickass on speed (ofc if are dwords)

PS: I speak spanish too untio  :bg

clive

Quote from: theunknownguy
I think by plain sight that one kickass on speed (ofc if are dwords)

Algorithmically it's going to spank any byte solution, but does rely on the shift direction chosen by the OP and endian-ness of the x86. Personally I could implement a hardware version that clock out 32-bits at a time in a single cycle at quite staggering speeds. The equations would resolve to a small collection of XOR gates, with a fairly short gate depth.

If the foot print of the 256 entry CRC table (1024 bytes) is too large, it can be done with a 16 entry table 4-bits at a time, which works well on embedded targets.



Initial CRC FFFFFFFF
Polynomial EDB88320
Test Pattern 'BYTE test[9] =  { 0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, 0x88, 0x99 };'

D1D580EF Test 1 - Classic
D1D580EF Test 2 - Orignial
D1D580EF Test 3 - Simplified
D1D580EF Test 4 - 32-bit folding
It could be a random act of randomness. Those happen a lot as well.

theunknownguy

Quote from: clive on November 05, 2010, 06:19:46 PM
Quote from: theunknownguy
I think by plain sight that one kickass on speed (ofc if are dwords)

Personally I could implement a hardware version that clock out 32-bits at a time in a single cycle at quite staggering speeds. The equations would resolve to a small collection of XOR gates, with a fairly short gate depth.

If the foot print of the 256 entry CRC table (1024 bytes) is too large, it can be done with a 16 entry table 4-bits at a time, which works well on embedded targets.

There should be fast hardware implementation on intel for CRC32... Or there is already?

PS: I wish i could make hardware  :(

clive

Quote from: theunknownguy on November 05, 2010, 06:23:17 PM
There should be fast hardware implementation on intel for CRC32... Or there is already?
PS: I wish i could make hardware  :(

The Core i7 has one that can take bytes, words or dwords as I recall. Might not be the polynomial you want. (Castagnoli)

http://www.strchr.com/crc32_popcnt

http://download.intel.com/design/intarch/papers/323102.pdf
It could be a random act of randomness. Those happen a lot as well.

MichaelW

This is a four-table version that was posted on the old forum by IanB:

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

      .486                      ; create 32 bit code
      .model flat, stdcall      ; 32 bit memory model
      option casemap :none      ; case sensitive

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

  comment *Data Tables for CCITT CRC formatted for MASM*

.tables segment

align 64

  CRC32Table:
; ------------------------------

DD 00000000H, 77073096H, 0ee0e612cH, 990951baH, 076dc419H, 706af48fH, 0e963a535H, 9e6495a3H
DD 0edb8832H, 79dcb8a4H, 0e0d5e91eH, 97d2d988H, 09b64c2bH, 7eb17cbdH, 0e7b82d07H, 90bf1d91H
DD 1db71064H, 6ab020f2H, 0f3b97148H, 84be41deH, 1adad47dH, 6ddde4ebH, 0f4d4b551H, 83d385c7H
DD 136c9856H, 646ba8c0H, 0fd62f97aH, 8a65c9ecH, 14015c4fH, 63066cd9H, 0fa0f3d63H, 8d080df5H
DD 3b6e20c8H, 4c69105eH, 0d56041e4H, 0a2677172H, 3c03e4d1H, 4b04d447H, 0d20d85fdH, 0a50ab56bH
DD 35b5a8faH, 42b2986cH, 0dbbbc9d6H, 0acbcf940H, 32d86ce3H, 45df5c75H, 0dcd60dcfH, 0abd13d59H
DD 26d930acH, 51de003aH, 0c8d75180H, 0bfd06116H, 21b4f4b5H, 56b3c423H, 0cfba9599H, 0b8bda50fH
DD 2802b89eH, 5f058808H, 0c60cd9b2H, 0b10be924H, 2f6f7c87H, 58684c11H, 0c1611dabH, 0b6662d3dH
DD 76dc4190H, 01db7106H, 98d220bcH, 0efd5102aH, 71b18589H, 06b6b51fH, 9fbfe4a5H, 0e8b8d433H
DD 7807c9a2H, 0f00f934H, 9609a88eH, 0e10e9818H, 7f6a0dbbH, 086d3d2dH, 91646c97H, 0e6635c01H
DD 6b6b51f4H, 1c6c6162H, 856530d8H, 0f262004eH, 6c0695edH, 1b01a57bH, 8208f4c1H, 0f50fc457H
DD 65b0d9c6H, 12b7e950H, 8bbeb8eaH, 0fcb9887cH, 62dd1ddfH, 15da2d49H, 8cd37cf3H, 0fbd44c65H
DD 4db26158H, 3ab551ceH, 0a3bc0074H, 0d4bb30e2H, 4adfa541H, 3dd895d7H, 0a4d1c46dH, 0d3d6f4fbH
DD 4369e96aH, 346ed9fcH, 0ad678846H, 0da60b8d0H, 44042d73H, 33031de5H, 0aa0a4c5fH, 0dd0d7cc9H
DD 5005713cH, 270241aaH, 0be0b1010H, 0c90c2086H, 5768b525H, 206f85b3H, 0b966d409H, 0ce61e49fH
DD 5edef90eH, 29d9c998H, 0b0d09822H, 0c7d7a8b4H, 59b33d17H, 2eb40d81H, 0b7bd5c3bH, 0c0ba6cadH
DD 0edb88320H, 9abfb3b6H, 03b6e20cH, 74b1d29aH, 0ead54739H, 9dd277afH, 04db2615H, 73dc1683H
DD 0e3630b12H, 94643b84H, 0d6d6a3eH, 7a6a5aa8H, 0e40ecf0bH, 9309ff9dH, 0a00ae27H, 7d079eb1H
DD 0f00f9344H, 8708a3d2H, 1e01f268H, 6906c2feH, 0f762575dH, 806567cbH, 196c3671H, 6e6b06e7H
DD 0fed41b76H, 89d32be0H, 10da7a5aH, 67dd4accH, 0f9b9df6fH, 8ebeeff9H, 17b7be43H, 60b08ed5H
DD 0d6d6a3e8H, 0a1d1937eH, 38d8c2c4H, 4fdff252H, 0d1bb67f1H, 0a6bc5767H, 3fb506ddH, 48b2364bH
DD 0d80d2bdaH, 0af0a1b4cH, 36034af6H, 41047a60H, 0df60efc3H, 0a867df55H, 316e8eefH, 4669be79H
DD 0cb61b38cH, 0bc66831aH, 256fd2a0H, 5268e236H, 0cc0c7795H, 0bb0b4703H, 220216b9H, 5505262fH
DD 0c5ba3bbeH, 0b2bd0b28H, 2bb45a92H, 5cb36a04H, 0c2d7ffa7H, 0b5d0cf31H, 2cd99e8bH, 5bdeae1dH
DD 9b64c2b0H, 0ec63f226H, 756aa39cH, 026d930aH, 9c0906a9H, 0eb0e363fH, 72076785H, 05005713H
DD 95bf4a82H, 0e2b87a14H, 7bb12baeH, 0cb61b38H, 92d28e9bH, 0e5d5be0dH, 7cdcefb7H, 0bdbdf21H
DD 86d3d2d4H, 0f1d4e242H, 68ddb3f8H, 1fda836eH, 81be16cdH, 0f6b9265bH, 6fb077e1H, 18b74777H
DD 88085ae6H, 0ff0f6a70H, 66063bcaH, 11010b5cH, 8f659effH, 0f862ae69H, 616bffd3H, 166ccf45H
DD 0a00ae278H, 0d70dd2eeH, 4e048354H, 3903b3c2H, 0a7672661H, 0d06016f7H, 4969474dH, 3e6e77dbH
DD 0aed16a4aH, 0d9d65adcH, 40df0b66H, 37d83bf0H, 0a9bcae53H, 0debb9ec5H, 47b2cf7fH, 30b5ffe9H
DD 0bdbdf21cH, 0cabac28aH, 53b39330H, 24b4a3a6H, 0bad03605H, 0cdd70693H, 54de5729H, 23d967bfH
DD 0b3667a2eH, 0c4614ab8H, 5d681b02H, 2a6f2b94H, 0b40bbe37H, 0c30c8ea1H, 5a05df1bH, 2d02ef8dH

  CRC32Table2:
; ------------------------------

DD 00000000H, 191b3141H, 32366282H, 2b2d53c3H, 646cc504H, 7d77f445H, 565aa786H, 4f4196c7H
DD 0c8d98a08H, 0d1c2bb49H, 0faefe88aH, 0e3f4d9cbH, 0acb54f0cH, 0b5ae7e4dH, 9e832d8eH, 87981ccfH
DD 4ac21251H, 53d92310H, 78f470d3H, 61ef4192H, 2eaed755H, 37b5e614H, 1c98b5d7H, 05838496H
DD 821b9859H, 9b00a918H, 0b02dfadbH, 0a936cb9aH, 0e6775d5dH, 0ff6c6c1cH, 0d4413fdfH, 0cd5a0e9eH
DD 958424a2H, 8c9f15e3H, 0a7b24620H, 0bea97761H, 0f1e8e1a6H, 0e8f3d0e7H, 0c3de8324H, 0dac5b265H
DD 5d5daeaaH, 44469febH, 6f6bcc28H, 7670fd69H, 39316baeH, 202a5aefH, 0b07092cH, 121c386dH
DD 0df4636f3H, 0c65d07b2H, 0ed705471H, 0f46b6530H, 0bb2af3f7H, 0a231c2b6H, 891c9175H, 9007a034H
DD 179fbcfbH, 0e848dbaH, 25a9de79H, 3cb2ef38H, 73f379ffH, 6ae848beH, 41c51b7dH, 58de2a3cH
DD 0f0794f05H, 0e9627e44H, 0c24f2d87H, 0db541cc6H, 94158a01H, 8d0ebb40H, 0a623e883H, 0bf38d9c2H
DD 38a0c50dH, 21bbf44cH, 0a96a78fH, 138d96ceH, 5ccc0009H, 45d73148H, 6efa628bH, 77e153caH
DD 0babb5d54H, 0a3a06c15H, 888d3fd6H, 91960e97H, 0ded79850H, 0c7cca911H, 0ece1fad2H, 0f5facb93H
DD 7262d75cH, 6b79e61dH, 4054b5deH, 594f849fH, 160e1258H, 0f152319H, 243870daH, 3d23419bH
DD 65fd6ba7H, 7ce65ae6H, 57cb0925H, 4ed03864H, 0191aea3H, 188a9fe2H, 33a7cc21H, 2abcfd60H
DD 0ad24e1afH, 0b43fd0eeH, 9f12832dH, 8609b26cH, 0c94824abH, 0d05315eaH, 0fb7e4629H, 0e2657768H
DD 2f3f79f6H, 362448b7H, 1d091b74H, 04122a35H, 4b53bcf2H, 52488db3H, 7965de70H, 607eef31H
DD 0e7e6f3feH, 0fefdc2bfH, 0d5d0917cH, 0cccba03dH, 838a36faH, 9a9107bbH, 0b1bc5478H, 0a8a76539H
DD 3b83984bH, 2298a90aH, 09b5fac9H, 10aecb88H, 5fef5d4fH, 46f46c0eH, 6dd93fcdH, 74c20e8cH
DD 0f35a1243H, 0ea412302H, 0c16c70c1H, 0d8774180H, 9736d747H, 8e2de606H, 0a500b5c5H, 0bc1b8484H
DD 71418a1aH, 685abb5bH, 4377e898H, 5a6cd9d9H, 152d4f1eH, 0c367e5fH, 271b2d9cH, 3e001cddH
DD 0b9980012H, 0a0833153H, 8bae6290H, 92b553d1H, 0ddf4c516H, 0c4eff457H, 0efc2a794H, 0f6d996d5H
DD 0ae07bce9H, 0b71c8da8H, 9c31de6bH, 852aef2aH, 0ca6b79edH, 0d37048acH, 0f85d1b6fH, 0e1462a2eH
DD 66de36e1H, 7fc507a0H, 54e85463H, 4df36522H, 02b2f3e5H, 1ba9c2a4H, 30849167H, 299fa026H
DD 0e4c5aeb8H, 0fdde9ff9H, 0d6f3cc3aH, 0cfe8fd7bH, 80a96bbcH, 99b25afdH, 0b29f093eH, 0ab84387fH
DD 2c1c24b0H, 350715f1H, 1e2a4632H, 07317773H, 4870e1b4H, 516bd0f5H, 7a468336H, 635db277H
DD 0cbfad74eH, 0d2e1e60fH, 0f9ccb5ccH, 0e0d7848dH, 0af96124aH, 0b68d230bH, 9da070c8H, 84bb4189H
DD 03235d46H, 1a386c07H, 31153fc4H, 280e0e85H, 674f9842H, 7e54a903H, 5579fac0H, 4c62cb81H
DD 8138c51fH, 9823f45eH, 0b30ea79dH, 0aa1596dcH, 0e554001bH, 0fc4f315aH, 0d7626299H, 0ce7953d8H
DD 49e14f17H, 50fa7e56H, 7bd72d95H, 62cc1cd4H, 2d8d8a13H, 3496bb52H, 1fbbe891H, 06a0d9d0H
DD 5e7ef3ecH, 4765c2adH, 6c48916eH, 7553a02fH, 3a1236e8H, 230907a9H, 0824546aH, 113f652bH
DD 96a779e4H, 8fbc48a5H, 0a4911b66H, 0bd8a2a27H, 0f2cbbce0H, 0ebd08da1H, 0c0fdde62H, 0d9e6ef23H
DD 14bce1bdH, 0da7d0fcH, 268a833fH, 3f91b27eH, 70d024b9H, 69cb15f8H, 42e6463bH, 5bfd777aH
DD 0dc656bb5H, 0c57e5af4H, 0ee530937H, 0f7483876H, 0b809aeb1H, 0a1129ff0H, 8a3fcc33H, 9324fd72H

  CRC32Table3:
; ------------------------------

DD 00000000H, 01c26a37H, 0384d46eH, 0246be59H, 0709a8dcH, 06cbc2ebH, 048d7cb2H, 054f1685H
DD 0e1351b8H, 0fd13b8fH, 0d9785d6H, 0c55efe1H, 091af964H, 08d89353H, 0a9e2d0aH, 0b5c473dH
DD 1c26a370H, 1de4c947H, 1fa2771eH, 1e601d29H, 1b2f0bacH, 1aed619bH, 18abdfc2H, 1969b5f5H
DD 1235f2c8H, 13f798ffH, 11b126a6H, 10734c91H, 153c5a14H, 14fe3023H, 16b88e7aH, 177ae44dH
DD 384d46e0H, 398f2cd7H, 3bc9928eH, 3a0bf8b9H, 3f44ee3cH, 3e86840bH, 3cc03a52H, 3d025065H
DD 365e1758H, 379c7d6fH, 35dac336H, 3418a901H, 3157bf84H, 3095d5b3H, 32d36beaH, 331101ddH
DD 246be590H, 25a98fa7H, 27ef31feH, 262d5bc9H, 23624d4cH, 22a0277bH, 20e69922H, 2124f315H
DD 2a78b428H, 2bbade1fH, 29fc6046H, 283e0a71H, 2d711cf4H, 2cb376c3H, 2ef5c89aH, 2f37a2adH
DD 709a8dc0H, 7158e7f7H, 731e59aeH, 72dc3399H, 7793251cH, 76514f2bH, 7417f172H, 75d59b45H
DD 7e89dc78H, 7f4bb64fH, 7d0d0816H, 7ccf6221H, 798074a4H, 78421e93H, 7a04a0caH, 7bc6cafdH
DD 6cbc2eb0H, 6d7e4487H, 6f38fadeH, 6efa90e9H, 6bb5866cH, 6a77ec5bH, 68315202H, 69f33835H
DD 62af7f08H, 636d153fH, 612bab66H, 60e9c151H, 65a6d7d4H, 6464bde3H, 662203baH, 67e0698dH
DD 48d7cb20H, 4915a117H, 4b531f4eH, 4a917579H, 4fde63fcH, 4e1c09cbH, 4c5ab792H, 4d98dda5H
DD 46c49a98H, 4706f0afH, 45404ef6H, 448224c1H, 41cd3244H, 400f5873H, 4249e62aH, 438b8c1dH
DD 54f16850H, 55330267H, 5775bc3eH, 56b7d609H, 53f8c08cH, 523aaabbH, 507c14e2H, 51be7ed5H
DD 5ae239e8H, 5b2053dfH, 5966ed86H, 58a487b1H, 5deb9134H, 5c29fb03H, 5e6f455aH, 5fad2f6dH
DD 0e1351b80H, 0e0f771b7H, 0e2b1cfeeH, 0e373a5d9H, 0e63cb35cH, 0e7fed96bH, 0e5b86732H, 0e47a0d05H
DD 0ef264a38H, 0eee4200fH, 0eca29e56H, 0ed60f461H, 0e82fe2e4H, 0e9ed88d3H, 0ebab368aH, 0ea695cbdH
DD 0fd13b8f0H, 0fcd1d2c7H, 0fe976c9eH, 0ff5506a9H, 0fa1a102cH, 0fbd87a1bH, 0f99ec442H, 0f85cae75H
DD 0f300e948H, 0f2c2837fH, 0f0843d26H, 0f1465711H, 0f4094194H, 0f5cb2ba3H, 0f78d95faH, 0f64fffcdH
DD 0d9785d60H, 0d8ba3757H, 0dafc890eH, 0db3ee339H, 0de71f5bcH, 0dfb39f8bH, 0ddf521d2H, 0dc374be5H
DD 0d76b0cd8H, 0d6a966efH, 0d4efd8b6H, 0d52db281H, 0d062a404H, 0d1a0ce33H, 0d3e6706aH, 0d2241a5dH
DD 0c55efe10H, 0c49c9427H, 0c6da2a7eH, 0c7184049H, 0c25756ccH, 0c3953cfbH, 0c1d382a2H, 0c011e895H
DD 0cb4dafa8H, 0ca8fc59fH, 0c8c97bc6H, 0c90b11f1H, 0cc440774H, 0cd866d43H, 0cfc0d31aH, 0ce02b92dH
DD 91af9640H, 906dfc77H, 922b422eH, 93e92819H, 96a63e9cH, 976454abH, 9522eaf2H, 94e080c5H
DD 9fbcc7f8H, 9e7eadcfH, 9c381396H, 9dfa79a1H, 98b56f24H, 99770513H, 9b31bb4aH, 9af3d17dH
DD 8d893530H, 8c4b5f07H, 8e0de15eH, 8fcf8b69H, 8a809decH, 8b42f7dbH, 89044982H, 88c623b5H
DD 839a6488H, 82580ebfH, 801eb0e6H, 81dcdad1H, 8493cc54H, 8551a663H, 8717183aH, 86d5720dH
DD 0a9e2d0a0H, 0a820ba97H, 0aa6604ceH, 0aba46ef9H, 0aeeb787cH, 0af29124bH, 0ad6fac12H, 0acadc625H
DD 0a7f18118H, 0a633eb2fH, 0a4755576H, 0a5b73f41H, 0a0f829c4H, 0a13a43f3H, 0a37cfdaaH, 0a2be979dH
DD 0b5c473d0H, 0b40619e7H, 0b640a7beH, 0b782cd89H, 0b2cddb0cH, 0b30fb13bH, 0b1490f62H, 0b08b6555H
DD 0bbd72268H, 0ba15485fH, 0b853f606H, 0b9919c31H, 0bcde8ab4H, 0bd1ce083H, 0bf5a5edaH, 0be9834edH

  CRC32Table4:
; ------------------------------

DD 00000000H, 0b8bc6765H, 0aa09c88bH, 12b5afeeH, 8f629757H, 37def032H, 256b5fdcH, 9dd738b9H
DD 0c5b428efH, 7d084f8aH, 6fbde064H, 0d7018701H, 4ad6bfb8H, 0f26ad8ddH, 0e0df7733H, 58631056H
DD 5019579fH, 0e8a530faH, 0fa109f14H, 42acf871H, 0df7bc0c8H, 67c7a7adH, 75720843H, 0cdce6f26H
DD 95ad7f70H, 2d111815H, 3fa4b7fbH, 8718d09eH, 1acfe827H, 0a2738f42H, 0b0c620acH, 087a47c9H
DD 0a032af3eH, 188ec85bH, 0a3b67b5H, 0b28700d0H, 2f503869H, 97ec5f0cH, 8559f0e2H, 3de59787H
DD 658687d1H, 0dd3ae0b4H, 0cf8f4f5aH, 7733283fH, 0eae41086H, 525877e3H, 40edd80dH, 0f851bf68H
DD 0f02bf8a1H, 48979fc4H, 5a22302aH, 0e29e574fH, 7f496ff6H, 0c7f50893H, 0d540a77dH, 6dfcc018H
DD 359fd04eH, 8d23b72bH, 9f9618c5H, 272a7fa0H, 0bafd4719H, 0241207cH, 10f48f92H, 0a848e8f7H
DD 9b14583dH, 23a83f58H, 311d90b6H, 89a1f7d3H, 1476cf6aH, 0accaa80fH, 0be7f07e1H, 06c36084H
DD 5ea070d2H, 0e61c17b7H, 0f4a9b859H, 4c15df3cH, 0d1c2e785H, 697e80e0H, 7bcb2f0eH, 0c377486bH
DD 0cb0d0fa2H, 73b168c7H, 6104c729H, 0d9b8a04cH, 446f98f5H, 0fcd3ff90H, 0ee66507eH, 56da371bH
DD 0eb9274dH, 0b6054028H, 0a4b0efc6H, 1c0c88a3H, 81dbb01aH, 3967d77fH, 2bd27891H, 936e1ff4H
DD 3b26f703H, 839a9066H, 912f3f88H, 299358edH, 0b4446054H, 0cf80731H, 1e4da8dfH, 0a6f1cfbaH
DD 0fe92dfecH, 462eb889H, 549b1767H, 0ec277002H, 71f048bbH, 0c94c2fdeH, 0dbf98030H, 6345e755H
DD 6b3fa09cH, 0d383c7f9H, 0c1366817H, 798a0f72H, 0e45d37cbH, 5ce150aeH, 4e54ff40H, 0f6e89825H
DD 0ae8b8873H, 1637ef16H, 048240f8H, 0bc3e279dH, 21e91f24H, 99557841H, 8be0d7afH, 335cb0caH
DD 0ed59b63bH, 55e5d15eH, 47507eb0H, 0ffec19d5H, 623b216cH, 0da874609H, 0c832e9e7H, 708e8e82H
DD 28ed9ed4H, 9051f9b1H, 82e4565fH, 3a58313aH, 0a78f0983H, 1f336ee6H, 0d86c108H, 0b53aa66dH
DD 0bd40e1a4H, 05fc86c1H, 1749292fH, 0aff54e4aH, 322276f3H, 8a9e1196H, 982bbe78H, 2097d91dH
DD 78f4c94bH, 0c048ae2eH, 0d2fd01c0H, 6a4166a5H, 0f7965e1cH, 4f2a3979H, 5d9f9697H, 0e523f1f2H
DD 4d6b1905H, 0f5d77e60H, 0e762d18eH, 5fdeb6ebH, 0c2098e52H, 7ab5e937H, 680046d9H, 0d0bc21bcH
DD 88df31eaH, 3063568fH, 22d6f961H, 9a6a9e04H, 07bda6bdH, 0bf01c1d8H, 0adb46e36H, 15080953H
DD 1d724e9aH, 0a5ce29ffH, 0b77b8611H, 0fc7e174H, 9210d9cdH, 2aacbea8H, 38191146H, 80a57623H
DD 0d8c66675H, 607a0110H, 72cfaefeH, 0ca73c99bH, 57a4f122H, 0ef189647H, 0fdad39a9H, 45115eccH
DD 764dee06H, 0cef18963H, 0dc44268dH, 64f841e8H, 0f92f7951H, 41931e34H, 5326b1daH, 0eb9ad6bfH
DD 0b3f9c6e9H, 0b45a18cH, 19f00e62H, 0a14c6907H, 3c9b51beH, 842736dbH, 96929935H, 2e2efe50H
DD 2654b999H, 9ee8defcH, 8c5d7112H, 34e11677H, 0a9362eceH, 118a49abH, 033fe645H, 0bb838120H
DD 0e3e09176H, 5b5cf613H, 49e959fdH, 0f1553e98H, 6c820621H, 0d43e6144H, 0c68bceaaH, 7e37a9cfH
DD 0d67f4138H, 6ec3265dH, 7c7689b3H, 0c4caeed6H, 591dd66fH, 0e1a1b10aH, 0f3141ee4H, 4ba87981H
DD 13cb69d7H, 0ab770eb2H, 0b9c2a15cH, 017ec639H, 9ca9fe80H, 241599e5H, 36a0360bH, 8e1c516eH
DD 866616a7H, 3eda71c2H, 2c6fde2cH, 94d3b949H, 090481f0H, 0b1b8e695H, 0a30d497bH, 1bb12e1eH
DD 43d23e48H, 0fb6e592dH, 0e9dbf6c3H, 516791a6H, 0ccb0a91fH, 740cce7aH, 66b96194H, 0de0506f1H

.tables ends

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

align 4

DoLoopCRC32    proc    syscall private

comment    * ------------------------------------------------------------------------------
            CRC core loop using 4 lookup tables and a single source DWORD XOR. It does not
            check entry parameters.

            Any potential stalls between accessing byte sub-registers and full registers are
            hopefully minimised by using two paired registers to store lookup addresses.
            There is however going to be a speed hit in the paging required to access 4K of
            table data rather than just 1K as in a standard CRC algorithm, as there will be
            a definite overhead incurred in displacing/replacing precious Level1/2 cache
            with all this extra data for smaller byte sources.
            ------------------------------------------------------------------------------ *

        ; On entry:
        ;       EAX = CRC32 start value
        ;       EDX = address of array including offset
        ;       ECX = number of bytes in array
        ; On exit:
        ;       EAX = new CRC32 after processing subarray
        ;       ECX = 0
        ; EBP/EBX/EDI/ESI preserved, ECX/EDX are not

        push    esi                     ; save registers
        push    edi
        push    ebx

        mov     esi, edx                ; ESI now holds array pointer
        sub     ebx, ebx                ; zero EBX
        sub     edx, edx                ; zero EDX
        sub     ecx, 4                  ; test for initial zero made in outer call
        jb      CRCBytesLeft

CRCDwordLoop:
        xor     eax, DWORD PTR [esi]    ; get next DWORD of process data
        add     esi, 4

        mov     bl, al                  ; get the table offset from the result
        mov     dl, ah                  ; get the table offset from the result
        mov     edi, DWORD PTR CRC32Table4[ebx*4]
        bswap   eax
        xor     edi, DWORD PTR CRC32Table3[edx*4]

        mov     bl, ah                  ; get the table offset from the result
        mov     dl, al                  ; get the table offset from the result
        xor     edi, DWORD PTR CRC32Table2[ebx*4]
        mov     eax, DWORD PTR CRC32Table[edx*4]
        xor     eax, edi

        sub     ecx, 4
        jae     CRCDwordLoop

CRCBytesLeft:
        add     ecx, 4
        jz      EndCRCLoop

        mov     edx, DWORD PTR [esi]    ; get any remainder bytes
@@:
        xor     dl, al
        mov     bl, dl                  ; get the table offset from the result
        shr     eax, 8
        shr     edx, 8
        xor     eax, DWORD PTR CRC32Table[ebx*4]
        sub     ecx, 1
        jnz     @B

EndCRCLoop:
        pop     ebx                     ; restore registers
        pop     edi
        pop     esi
        ret

DoLoopCRC32    endp

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««


In my tests it was more than twice as fast as a single-table version, and roughly 20 times faster than a straightforward bit-by-bit algorithm.
eschew obfuscation

clive

Quote from: MichaelW on November 05, 2010, 06:40:21 PM
This is a four-table version that was posted on the old forum by IanB:

Wow. I think I might have processed 16 or 32-byte lines with 4K worth of tables in my footprint.

A lot of efficiencies will likely depend on the size and type of data being handled. For example checking blocks of 512, 4 KB or 32 KB

And the size vs speed trade offs.

A quick order of magnitude test on a 3 GHz Prescott P4, doing the 9 byte test and 1KB CRC32 across the CRC32Table from Ian's code. This shows Ian's code is very fast.

D1D580EF Test 1 - Classic

Clocks        157 min,        175 avg
D1D580EF Test 2 - Orignial

Clocks        150 min,        173 avg
D1D580EF Test 3 - Simplified

Clocks        127 min,        141 avg
D1D580EF Test 4 - 32-bit folding

Clocks        120 min,        134 avg
D1D580EF Test 5 - Ian

Clocks        120 min,        144 avg
D1D580EF Test 6 - Ian MOVZX

Clocks        127 min,        139 avg
D1D580EF Test 7 - Ian MOVZX / Paragraph


903061EC Test 1 - Classic

Clocks       8572 min,      10893 avg
903061EC Test 2 - Orignial

Clocks       8745 min,      15959 avg
903061EC Test 3 - Simplified

Clocks       6547 min,      11889 avg
903061EC Test 4 - 32-bit folding

Clocks       3667 min,       7413 avg
903061EC Test 5 - Ian

Clocks       3675 min,       9272 avg
903061EC Test 6 - Ian MOVZX

Clocks       3383 min,       6821 avg
903061EC Test 7 - Ian MOVZX / Paragraph
It could be a random act of randomness. Those happen a lot as well.