News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

bignumber

Started by roticv, February 20, 2005, 05:45:06 PM

Previous topic - Next topic

roticv

Hello all,

It has been long since I last post here. Anyway I have been working on a bignumber library. I was just wondering if you guys would like to beat it to death and prehaps tell me if I have made any mistake with it. Any takes?

First is my bignumber *= 32bitnumber function

bnMul32 proc bigint1:dword, num
;bigint *= 32bitint
push ebx
mov ecx, [esp+4][4]
i = 0
REPT N-1
mov eax, [ecx+i]
mul dword ptr[esp+8][4]
IF i ne 0
add eax, ebx
ENDIF
mov [ecx+i], eax
mov ebx, edx
i = i + 4
ENDM
mov eax, [ecx + i]
imul dword ptr[esp+8][4]
mov [ecx+i], eax
pop ebx
retn 8
bnMul32 endp


N is the how the size of the bignumber we are dealing with. Currently I set N to be 4, so we are dealing with 32*4 = 128 bit numbers.

Jibz

I think you need an adc after you add in the high dword from the previous mul:

00000001 FFFFFFFF * FFFFFFFF

First mul gives FFFFFFFE 00000001, so store the low dword and ebx is set to FFFFFFFE.

Second mul gives 00000000 FFFFFFFF, and when you add ebx to the low dword it overflows.

roticv


Mark_Larson

 
  Looks nice.  From a performance standpoint you want to be careful from using too large of an N and unrolling too much code.  You'll have to do some testing to find the spot to stop unrolling at and to start using a loop.
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
htttp://www.website.masmforum.com/mark/index.htm

Mark_Larson

#4
  I have been optimizing the code for a P4.  I am not done yet.  But I wanted to post the results I have so far in case anyone else has other optimziation ideas.  I found one other thing that looked like a bug in the code.  If the last signed multiply ( IMUL) generats a 64-bit value you don't save EDX to memory.  I made bignum 5 dwords, and passed in 4 dwords with a value of N=4.  The extra dword is to handle the overflow into the upper 32-bits from the last multiply.


.data
align 4
bignum dd 0ffffffffh,0ffffffffh,0ffffffffh,0ffffffffh
dd 000000000h
number dd 0ffffffffh
...
.code
invoke bnMul32, ADDR bignum, number

...


bnMul32 proc bigint1:dword, num:dword
push ebx
mov ecx, [bigint1]
i = 0
REPT N-1
mov eax, [ecx+i]
mul dword ptr[num]
IF i ne 0
add eax, ebx
adc edx, 0
ENDIF
mov [ecx+i], eax
mov ebx, edx
i = i + 4
ENDM
mov eax, [ecx + i]
imul dword ptr[num]
mov [ecx+i], eax
mov [ecx+i+4],edx  ; modified this to handle a 64-bit result from the last multiply
pop ebx
ret
bnMul32 endp


  The above code runs in 51 cycles on my P4.  An obvious thing to try is to move [num] to a local variable.  So I moved [num] to ESI.  That shaved 4 cyclces off the code.  The code is now running in 47 cycles.  The next thing I would try to do is to do two MULs in parallel to break up dependencies.  The problem is is we are already using almost all the registers, and don't have enough registers.  In addition ADC is really slow on a P4 and using MMX to do a 64-bit add and avoid ADC is faster.  So I switched to MMX to get more registers and get rid of that annoying ADC on P4.


The new code runs in 19 cycles.  That is 2.69 times faster than the original ALU code that ran in 51 cycles.


bnMul32 proc bigint1:dword, num:dword
;MM7 is EBX from the original code.  It is the previous multiplie's upper 32-bits
mov ecx, [bigint1]
movd mm1,[num]
i = 0
REPT N-1
movd mm0, [ecx+i]
pmuludq mm0,mm1
IF i ne 0
paddq mm0,mm7
ENDIF
movd [ecx+i], mm0
pshufw mm7,mm0,01001110b ;copy high dword to low dword
i = i + 4
ENDM
movd mm0, [ecx + i]
pmuludq mm0,mm1
        pshufw mm0,mm0,01001110b                        ;flip the dwords ( endianess) before writing them back to memory.
movq [ecx+i], mm0
ret
bnMul32 endp


  The next step would be to do two MMX's in parallel to help break up dependencies.
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
htttp://www.website.masmforum.com/mark/index.htm

roticv

Hey mark,

I did not thought of using mmx to do multiplication. Thanks for your help. Maybe I will have to look into using mmx/sse/sse2 for my other codes.

The only problem I face right now is verifying my own codes. I will look into your codes and go and code a bignumber * bignumber.

After that maybe I will start on the harder stuffs like... division!

:U

Mark_Larson

#6
Quote from: roticv on February 27, 2005, 06:06:40 AM
Hey mark,

I did not thought of using mmx to do multiplication. Thanks for your help. Maybe I will have to look into using mmx/sse/sse2 for my other codes.

The only problem I face right now is verifying my own codes. I will look into your codes and go and code a bignumber * bignumber.

After that maybe I will start on the harder stuffs like... division!

:U

Sure not a problem.  That was one of the things I talked about on my website was how slow ADC was on the P4 and how to get around it.  I did find a second bug in your code.  In the last multiply you forget to add EBX to EAX ( EBX is your running carry of the last multiplies upper 32-bits).  When I re-timed the code with the above bug fixed in the MMX code, it now runs in 20 cycles.  I made some changes to it again to get it back down to 19 cycles.


push ebx
mov ecx, [bigint1]
i = 0
REPT N-1
mov eax, [ecx+i]
mul dword ptr[num]
IF i ne 0
add eax, ebx
adc edx, 0                  ;bug jibz pointed out
ENDIF
mov [ecx+i], eax
mov ebx, edx
i = i + 4
ENDM
mov eax, [ecx + i]
imul dword ptr[num]
add eax, ebx                ;second bug
adc edx, 0                   ;second bug
mov [ecx+i], eax
        mov [ecx+i+4],edx      ;first bug
pop ebx
ret


     This is the new code with the above BUG fixed in the MMX code.  It runs in 19 cycles.  I made one slight change.  The code to adjust the remainder for the next round of multiplies does a shift left 32 of a 64-bit register instead of a PSHUFW.  That also frees up MM7, because I just shift it on MM0, which has the reuslts of the 32:32 multiply.



movd mm0, [ecx]
pmuludq mm0,mm1
movd [ecx], mm0
psrlq mm0,32

movd mm2, [ecx+4]
pmuludq mm2,mm1
paddq mm2,mm0
movd [ecx+4], mm2
psrlq mm2,32

movd mm0, [ecx+8]
pmuludq mm0,mm1
paddq mm0,mm2
movd [ecx+8], mm0
psrlq mm0,32

movd mm2, [ecx + 12]
pmuludq mm2,mm1
paddq mm2,mm0
pshufw mm2,mm2,01001110b
movq [ecx+12], mm2

BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
htttp://www.website.masmforum.com/mark/index.htm

roticv

Hi,

Wow your response was quick. I was wondering why you put "BUG????!!!" only to find it missing when posting this reply.

By the way, I think "movd [ecx+12], mm2" should be used instead of "movq [ecx+12], mm2" This is because the data is only defined as 128bit and if you use a movq, you would overwrite into other data.

Sure I do know that it is best to avoid adc/sbb. I was just doing some testing code. I think I need to change my addition and subtraction routines so that mmx is used instead of adc/sbb.

Anyway I marcoised it to

movd mm0, [ecx]
pmuludq mm0,mm1
movd [ecx], mm0
psrlq mm0,32
i = 4
REPT N-2
movd mm2, [ecx+i]
pmuludq mm2,mm1
paddq mm2,mm0
movd [ecx+i], mm2
psrlq mm0,32
i = i + 4
ENDM
movd mm2, [ecx+i]
pmuludq mm2, mm1
paddq mm2, mm0
pshufw mm2,mm2,01001110b
movq [ecx+i], mm2


PS: I realised that ollydbg 1.10 cannot disassemble pmuludq properly
PSS: You made a tiny mistake in your code. It should be "psrlq mm0,32" instead of "psrlq mm2,32"

Mark_Larson

Quote from: roticv on February 27, 2005, 07:16:20 AM
Hi,

Wow your response was quick. I was wondering why you put "BUG????!!!" only to find it missing when posting this reply.

I had already pointed out the bug in the ALU code, so got rid of the comment, so as not to add to the confusion.




Quote from: roticv on February 27, 2005, 07:16:20 AM
By the way, I think "movd [ecx+12], mm2" should be used instead of "movq [ecx+12], mm2" This is because the data is only defined as 128bit and if you use a movq, you would overwrite into other data.

The problem is, unless I am wrong.  When you multiply a bignum that has 4 dword F's in it by a 32-bit int that is all F's, you will get larger than 128-bits for your result.  So I added an extra dword on the end of the bignum.


.data
align 4
bignum dd 0ffffffffh,0ffffffffh,0ffffffffh,0ffffffffh
dd 000000000h
number dd 0ffffffffh



Quote from: roticv on February 27, 2005, 07:16:20 AM
Sure I do know that it is best to avoid adc/sbb. I was just doing some testing code. I think I need to change my addition and subtraction routines so that mmx is used instead of adc/sbb.

Anyway I marcoised it to

movd mm0, [ecx]
pmuludq mm0,mm1
movd [ecx], mm0
psrlq mm0,32
i = 4
REPT N-2
movd mm2, [ecx+i]
pmuludq mm2,mm1
paddq mm2,mm0
movd [ecx+i], mm2
psrlq mm0,32
i = i + 4
ENDM
movd mm2, [ecx+i]
pmuludq mm2, mm1
paddq mm2, mm0
pshufw mm2,mm2,01001110b
movq [ecx+i], mm2


  That has a bug in it, since I alternate using MM2 and MM0 as the register to read from memory.  So try unrolling your code, you will see it is no longer the same as my original code.  You can change it into blocks of 2 multiplies for macroing.



Quote from: roticv on February 27, 2005, 07:16:20 AM
PS: I realised that ollydbg 1.10 cannot disassemble pmuludq properly
PSS: You made a tiny mistake in your code. It should be "psrlq mm0,32" instead of "psrlq mm2,32"

  Actually that is right.  Since I alternated using MM0 and MM2 as the register to read from memory, I also alternated the register I used for the remainder.
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
htttp://www.website.masmforum.com/mark/index.htm

roticv

#9
I see I see.

So it has to be something like

movd mm0, [ecx]
pmuludq mm0,mm1
movd [ecx], mm0
psrlq mm0,32
i = 4
REPT N/2 - 1
movd mm2, [ecx+i]
pmuludq mm2,mm1
paddq mm2,mm0
movd [ecx+i], mm2
psrlq mm2,32
i = i + 4
movd mm0, [ecx+8]
pmuludq mm0,mm1
paddq mm0,mm2
movd [ecx+8], mm0
psrlq mm0,32
i = i + 4
ENDM
movd mm2, [ecx+i]
pmuludq mm2, mm1
paddq mm2, mm0
pshufw mm2,mm2,01001110b
movq [ecx+i], mm2

?

bitRAKE

#10
; loop range from 2 thru 63
mpn_mul_1__k7_04_UNROLL = 32 ;32: 3.0700 cycles per limb (DWORD)

mpn_mul_1__k7_04 PROC USES ebx esi edi, bInt:PTR BIGINT, limb:DWORD, carry:DWORD

mov DWORD PTR [buff1], 1

lea esi, [buff1 + ((SIZEOF buff1 / 4) MOD mpn_mul_1__k7_04_UNROLL)*4]

mov ecx, mpn_mul_1__k7_04_UNROLL
mov eax, SIZEOF buff1 / 4
cdq
div ecx

mov edi, eax
imul edx, edx, -14
lea edx, [edx][_2]

xor ecx, ecx
mov ebx, 0 ; carry
mov ebp, 12345678h ; multiplier

jmp edx

ALIGN 16

_0:
i=0
WHILE i LT mpn_mul_1__k7_04_UNROLL
mul ebp
add ebx, eax
j = 4*(i - 31)
IF j EQ 0
BYTE 8Bh, 46h, 00h
mov [esi][j][-4], ebx
ELSEIF j EQ 4
mov eax, [esi][j]
BYTE 89h, 5Eh, 00h
ELSEIF j LT 80h
mov eax, [esi][j]
mov [esi][j][-4], ebx
ELSE ; only byte offsets
.err mpn_mul_1__k7_04_UNROLL too big!
ENDIF
mov ebx, ecx
adc ebx, edx
i=i+1
ENDM

_2: dec edi
lea esi, [esi + 4*mpn_mul_1__k7_04_UNROLL]
jns _0

mov eax, ebx ; return carry
ret
mpn_mul_1__k7_04 ENDP
3 cycles per DWORD is pretty good. :toothy
(tested on megabit size big integers)

pro3carp3

I hadn't heard about the problems with adc/sbc on the P4. Where can i get more info on this?

Thanks.
LGC

Mark_Larson

Quote from: pro3carp3 on March 11, 2005, 02:11:28 PM
I hadn't heard about the problems with adc/sbc on the P4. Where can i get more info on this?

Thanks.

lots of places.

http://www.visionx.com/markl/optimization_tips.htm
http://www.mark.masmcode.com/                          - both lead to my assembler optimization webpage.

http://www.agner.org/assem/ - download the first PDF, how to optimize for the Pentium microprocessors

Also the Optimization manual for the P4 on Intel's website:

ftp://download.intel.com/design/Pentium4/manuals/24896611.pdf - the link will change when they update the manuals, so grab it as soon as you see it.

This is a link for all 5 of the manuals you should have for the P4.  It's the middle 5 of the 7 links that you want.
http://developer.intel.com/design/pentium4/documentation.htm#manuals

The optimization manual is a must have for anyone interested in optimization.
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
htttp://www.website.masmforum.com/mark/index.htm