SHA-256

Petroizki · March 20, 2006, 07:16:20 PM

This is a SHA-256 hash algo. It has no unrolling to the infinite, or other complex macros, just the basic assembly. The clock count is about 1400 per block (64 bytes) on my machine.

info: http://en.wikipedia.org/wiki/SHA-256
dl: http://personal.inet.fi/atk/partsu/SHA256.zip

Program output:

QuoteMESSAGE: abc
HASH: BA7816BF8F01CFEA414140DE5DAE2223B00361A396177A9CB410FF61F20015AD
CYCLES: 1453

MESSAGE: abcdbcdecdefdefgefghfghighijhijkijkljklmklmnlmnomnopnopq
HASH: 248D6A61D20638B8E5C026930C3E6039A33CE45964FF2167F6ECEDD419DB06C1
CYCLES: 2882

MESSAGE: 1000000 times the letter 'a'
HASH: CDC76E5C9914FB9281A1C7E284D73E67F1809A48A497200E046D39CCC7112CD0
CYCLES: 23819820

Mark_Larson · March 27, 2006, 07:06:24 PM

nice job Petroizki! I walked through the code with my eyeballs and spotted some possible speed ups if you are still wanting to optimize further?

1) change loops that use INC/ADD to DEC/SUB to remove a CMP instruction from the loop.
here's some of the code where you do an INC/ADD to increment a register. The trick is to fixup ECX by doing a NEG first.

Code Select


	; copy source 64 bytes to the workmem
	mov esi, [$lpSource]
	xor ecx, ecx
@@:		mov eax, dword ptr [esi + ecx*4]
		mov ebx, dword ptr [esi + ecx*4 + 4]
		mov edx, dword ptr [esi + ecx*4 + 8]
		mov edi, dword ptr [esi + ecx*4 + 12]
		bswap eax
		bswap ebx
		bswap edx
		bswap edi
		mov dword ptr [$W + ecx*4], eax
		mov dword ptr [$W + ecx*4 + 4], ebx
		mov dword ptr [$W + ecx*4 + 8], edx
		mov dword ptr [$W + ecx*4 + 12], edi
		add ecx, 4
		cmp ecx, 16
	jne @B

Hutch has some sample code doing this from one of his routines in m32lib called addarr ( in addarr.asm). Just need to apply it to your code wherever you do this. The trick is in the "neg". He also adds ECX to itself 4 times, which ends up being an ECX*4. You also use ECX*4 in your code.

Code Select


arr_add proc arr:DWORD,cnt:DWORD,plus:DWORD

    mov edx, plus
    mov eax, arr
    mov ecx, cnt
    add ecx, ecx
    add ecx, ecx
    add eax, ecx
    neg ecx
    jmp @F

  align 16
  @@:
    add [eax+ecx], edx
    add ecx, 4
    jnz @B

2) this is a small really loop. It only gets executed 4 times. You might want to try unrolling it.

Code Select


	; copy source 64 bytes to the workmem
	mov esi, [$lpSource]
	xor ecx, ecx
@@:		mov eax, dword ptr [esi + ecx*4]
		mov ebx, dword ptr [esi + ecx*4 + 4]
		mov edx, dword ptr [esi + ecx*4 + 8]
		mov edi, dword ptr [esi + ecx*4 + 12]
		bswap eax
		bswap ebx
		bswap edx
		bswap edi
		mov dword ptr [$W + ecx*4], eax
		mov dword ptr [$W + ecx*4 + 4], ebx
		mov dword ptr [$W + ecx*4 + 8], edx
		mov dword ptr [$W + ecx*4 + 12], edi
		add ecx, 4
		cmp ecx, 16
	jne @B

3) code alignment. Try aligning your loops on 16 byte boundaries. The way I usually do it, align one loop, time the code, and see if it's faster. If it isn't, remove the align. And then try the next loop.

Petroizki · March 28, 2006, 07:46:32 AM

Thanks Mark,

I have made changes, as you suggested. I also added my proctimers to the example. New upload on link at top post.

Speed improvements after changes: :8)

Quotev.0.01
1st MESSAGE: 1519 cycles
2nd MESSAGE: 3007 cycles
3rd MESSAGE: 24920978 cycles

v.0.02
1st MESSAGE: 1453 cycles
2nd MESSAGE: 2882 cycles
3rd MESSAGE: 23700356 cycles

Mark Jones · March 28, 2006, 07:57:41 AM

Hello Petroizki, great work!

Quote from: AMD XP 2500+ / XP SP2
MESSAGE: abc
HASH: BA7816BF8F01CFEA414140DE5DAE2223B00361A396177A9CB410FF61F20015AD
CYCLES: 1812

MESSAGE: abcdbcdecdefdefgefghfghighijhijkijkljklmklmnlmnomnopnopq
HASH: 248D6A61D20638B8E5C026930C3E6039A33CE45964FF2167F6ECEDD419DB06C1
CYCLES: 3594

MESSAGE: 1000000 times the letter 'a'
HASH: CDC76E5C9914FB9281A1C7E284D73E67F1809A48A497200E046D39CCC7112CD0
CYCLES: 27817112

Ghirai · March 28, 2006, 08:54:44 AM

Very nice :)

Results fom my notebook (Mobile AMD Sempron 3000+):

QuoteMESSAGE: abc
HASH: BA7816BF8F01CFEA414140DE5DAE2223B00361A396177A9CB410FF61F20015AD
CYCLES: 1453

MESSAGE: abcdbcdecdefdefgefghfghighijhijkijkljklmklmnlmnomnopnopq
HASH: 248D6A61D20638B8E5C026930C3E6039A33CE45964FF2167F6ECEDD419DB06C1
CYCLES: 2895

MESSAGE: 1000000 times the letter 'a'
HASH: CDC76E5C9914FB9281A1C7E284D73E67F1809A48A497200E046D39CCC7112CD0
CYCLES: 21830745

PBrennick · March 28, 2006, 10:08:37 PM

Very nice;)
You guys rock!

Anyway AMD 1000 w/XPHE+SP2

Quote
MESSAGE: abc
HASH: BA7816BF8F01CFEA414140DE5DAE2223B00361A396177A9CB410FF61F20015AD
CYCLES: 1812

MESSAGE: abcdbcdecdefdefgefghfghighijhijkijkljklmklmnlmnomnopnopq
HASH: 248D6A61D20638B8E5C026930C3E6039A33CE45964FF2167F6ECEDD419DB06C1
CYCLES: 3594

MESSAGE: 1000000 times the letter 'a'
HASH: CDC76E5C9914FB9281A1C7E284D73E67F1809A48A497200E046D39CCC7112CD0
CYCLES: 27179203

Don't laugh at my old machine!
Paul

Mark_Larson · March 29, 2006, 04:48:12 PM

I ran the code on my work computer. It is a 3.2 GHz Nocona. I recently converted to AMD for my home system ( got an X2).

Code Select


MESSAGE: abc
HASH: BA7816BF8F01CFEA414140DE5DAE2223B00361A396177A9CB410FF61F20015AD
CYCLES: 2390

MESSAGE: abcdbcdecdefdefgefghfghighijhijkijkljklmklmnlmnomnopnopq
HASH: 248D6A61D20638B8E5C026930C3E6039A33CE45964FF2167F6ECEDD419DB06C1
CYCLES: 4456

MESSAGE: 1000000 times the letter 'a'
HASH: CDC76E5C9914FB9281A1C7E284D73E67F1809A48A497200E046D39CCC7112CD0
CYCLES: 35971865

Mark Jones · March 29, 2006, 06:54:58 PM

Wow, it seems like the bigger AMD's are getting slower!

Mark_Larson · March 29, 2006, 08:28:15 PM

Quote from: Mark Jones on March 29, 2006, 06:54:58 PM
Wow, it seems like the bigger AMD's are getting slower!

The timings are from my work computer which is an Intel.

News:

SHA-256