Division

roticv · February 28, 2005, 04:31:47 PM

Hello,

I just want to check whether there are any opcodes that can divide data held in MMX registers.

tekhead009 · March 01, 2005, 02:04:23 PM

DIVPD, DIVPS, DIVSD, and DIVSS may be what you're looking for.

Mark_Larson · March 01, 2005, 04:24:03 PM

Quote from: tekhead009 on March 01, 2005, 02:04:23 PM
DIVPD, DIVPS, DIVSD, and DIVSS may be what you're looking for.

That's incorrect. They only work on XMM registers.

roticv,

There are no MMX division instructions. The best you can do is convert to floating point (SSE) and do it there, or maybe find another way to do it (multiply by the reciprocal?).
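
To illustrate the reciprocal idea, here is a minimal sketch in C rather than assembly. It assumes a fixed divisor of 10 and 16-bit unsigned inputs; the name `div10_u16` is made up for this example. The constant 52429 is 2^19 / 10 rounded up, and the 32-bit product cannot overflow for 16-bit inputs. In packed code the same pattern would likely map to a high-half multiply such as PMULHUW followed by a shift, but that mapping is an assumption and is not shown here.

```c
#include <stdint.h>

/* Divide a 16-bit unsigned value by 10 without a divide instruction.
   52429 is ceil(2^19 / 10); multiplying by it and shifting right by 19
   gives the exact quotient for every 16-bit input. */
static uint16_t div10_u16(uint16_t x)
{
    return (uint16_t)(((uint32_t)x * 52429u) >> 19);
}
```

The trick only works when the divisor is known in advance, so it helps with things like radix conversion but not with general bignum division.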

roticv · March 02, 2005, 01:34:27 PM

Hello mark,

That would not help if the data is integer and bigger than an XMM register. I think I might have to use div or something.

bozo · March 04, 2005, 11:05:13 AM

Floating point instructions should be sufficient for 64-bit values.
Consult the Intel/AMD documentation (the software manuals).

Forgot to mention: you could do division by powers of two using PSRLQ and get the remainder with PAND on MMX registers,
although it's not possible to use immediate values with PAND.

For example, dividing the value in mm0 by 2:

PSRLQ mm0, 1
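
The shift-and-mask approach above, sketched in scalar C rather than MMX assembly. The function names `div_pow2` and `mod_pow2` are hypothetical; the sketch assumes unsigned 64-bit values and a divisor of 2^k with k below 64.

```c
#include <assert.h>
#include <stdint.h>

/* Quotient of x / 2^k: the scalar counterpart of PSRLQ mm0, k. */
static uint64_t div_pow2(uint64_t x, unsigned k)
{
    assert(k < 64);
    return x >> k;
}

/* Remainder of x / 2^k: keep only the low k bits, as PAND would
   with a mask held in a register or in memory. */
static uint64_t mod_pow2(uint64_t x, unsigned k)
{
    assert(k < 64);
    return x & ((UINT64_C(1) << k) - 1);
}
```

For example, `div_pow2(13, 1)` gives 6 and `mod_pow2(13, 1)` gives 1.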

Still, floating point instructions are not that difficult to learn.
And if you can't be bothered to go in depth, there are libraries and macros here and there.

roticv · March 04, 2005, 11:08:09 AM

It is not 64-bit. It is 128-bit. I'm coding a bignumber library.

bozo · March 04, 2005, 11:11:20 AM

Well, then SSE was designed for floating point arithmetic on 128-bit numbers, so have you checked them out yet?
You may be able to get some ideas from OpenSSL libraries.

Mark_Larson · March 04, 2005, 12:38:05 PM

Quote from: Kernel_Gaddafi on March 04, 2005, 11:11:20 AM
Well, then SSE was designed for floating point arithmetic on 128-bit numbers, so have you checked them out yet?
You may be able to get some ideas from OpenSSL libraries.

The registers are 128-bit, but you can't operate on more than 64 bits at a time (a real8 number). So if you do any division or multiplication, you'd have to do it on two 64-bit numbers in parallel. The one exception is logical operations like OR and AND, which operate on the whole register.

roticv, the integer DIV instruction goes through the floating point unit on P3 and P4. So using floating point might not be a bad idea. It just depends on how well it fits what you want to do.

roticv · March 04, 2005, 03:54:45 PM

I was thinking about using shifts and subtractions. I know how to use pencil and paper to solve it, but it is difficult for me to translate those to code.
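
The pencil-and-paper method is binary long division: bring down one dividend bit at a time, and subtract the divisor whenever the running remainder is large enough. Below is a rough sketch in C rather than assembly. Everything named here (`u128`, `u128_divmod` and the helpers) is invented for illustration, and the layout of four 32-bit limbs with `w[0]` least significant is an assumption. It is written for clarity, not speed.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t w[4]; } u128;   /* w[0] is the least significant limb */

static int u128_is_zero(const u128 *a)
{
    return (a->w[0] | a->w[1] | a->w[2] | a->w[3]) == 0;
}

/* Shift left by one bit, feeding carry_in into bit 0.
   Returns the bit shifted out of the top. */
static uint32_t u128_shl1(u128 *a, uint32_t carry_in)
{
    for (int i = 0; i < 4; i++) {
        uint32_t out = a->w[i] >> 31;
        a->w[i] = (a->w[i] << 1) | carry_in;
        carry_in = out;
    }
    return carry_in;
}

/* Nonzero if a >= b, comparing from the most significant limb down. */
static int u128_ge(const u128 *a, const u128 *b)
{
    for (int i = 3; i >= 0; i--) {
        if (a->w[i] != b->w[i])
            return a->w[i] > b->w[i];
    }
    return 1;
}

/* a -= b, passing the borrow from limb to limb like an SBB chain. */
static void u128_sub(u128 *a, const u128 *b)
{
    uint32_t borrow = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t d = (uint64_t)a->w[i] - b->w[i] - borrow;
        a->w[i] = (uint32_t)d;
        borrow = (uint32_t)(d >> 63);     /* 1 if this limb wrapped */
    }
}

/* Shift-and-subtract division: q = n / d, r = n % d. */
static void u128_divmod(const u128 *n, const u128 *d, u128 *q, u128 *r)
{
    u128 num = *n;
    u128 quo = {{0, 0, 0, 0}};
    u128 rem = {{0, 0, 0, 0}};

    assert(!u128_is_zero(d));
    for (int bit = 0; bit < 128; bit++) {
        uint32_t top = u128_shl1(&num, 0);  /* next dividend bit, MSB first */
        u128_shl1(&rem, top);               /* bring it down into the remainder */
        u128_shl1(&quo, 0);
        if (u128_ge(&rem, d)) {
            u128_sub(&rem, d);
            quo.w[0] |= 1;                  /* this quotient bit is 1 */
        }
    }
    *q = quo;
    *r = rem;
}
```

The remainder never overflows 128 bits here, because after k steps it holds at most the top k bits of the dividend.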

AeroASM · March 07, 2005, 07:16:55 PM

Use movdqa xmm128,mem128 to move your 128bit number from memory (aligned on a 16 byte boundary) to an xmm register.
Use pslldq xmm128,immed8 to shift the whole register left by immed8 bytes, and psrldq to shift right.

Subtraction is harder, because there are no instructions that work on whole 128-bit numbers. Which subtractions will you need? It may be possible to use something like the XMM equivalent of sbb.
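
To show why the subtraction needs extra work, here is a small C sketch (not assembly) that models a 128-bit value as two 64-bit lanes. The type `lanes128` and both function names are made up for this example. A lane-wise subtract of the PSUBQ kind wraps each lane independently, so the borrow out of the low lane has to be detected and applied to the high lane by hand.

```c
#include <stdint.h>

/* Two 64-bit lanes, mirroring how an XMM register is split. */
typedef struct { uint64_t lo, hi; } lanes128;

/* Lane-wise subtract: each lane wraps on its own, no borrow crosses over. */
static lanes128 sub_lanes(lanes128 a, lanes128 b)
{
    lanes128 r = { a.lo - b.lo, a.hi - b.hi };
    return r;
}

/* Full 128-bit subtract: apply the borrow out of the low lane to the
   high lane, which is the job SBB does in scalar code. */
static lanes128 sub_128(lanes128 a, lanes128 b)
{
    lanes128 r = sub_lanes(a, b);
    r.hi -= (a.lo < b.lo);    /* borrow is 1 when the low lane wrapped */
    return r;
}
```

Subtracting 1 from 2^64 shows the difference: `sub_lanes` leaves the high lane at 1, while `sub_128` correctly brings it down to 0.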

kenngough · September 14, 2005, 06:00:55 PM

Attached is an example program with 128-bit math. These routines are for addition, subtraction, multiplication, and division. The multiplication is a shift and add, and the division is a shift and subtract. This code could be written smaller but was written for the fastest possible execution. The original code was written for a compiled "basic" run-time and was 80-bit (5 words); I rewrote it for 32-bit so that I could share it with the forum. Email me if you have any questions.

[attachment deleted by admin]
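
Since the attachment is gone, here is a rough C sketch of the shift-and-add multiplication described above. It is not the attached code: it assumes 64-bit operands and a 128-bit product held in two halves, and the names `prod128` and `mul_shift_add` are invented for this example.

```c
#include <stdint.h>

typedef struct { uint64_t lo, hi; } prod128;

/* Multiply two 64-bit values into a 128-bit product by shift-and-add:
   for every set bit of b, add the correspondingly shifted copy of a. */
static prod128 mul_shift_add(uint64_t a, uint64_t b)
{
    prod128 p = { 0, 0 };
    uint64_t a_lo = a, a_hi = 0;              /* a widened to 128 bits */

    while (b != 0) {
        if (b & 1) {
            uint64_t old = p.lo;
            p.lo += a_lo;
            p.hi += a_hi + (p.lo < old);      /* carry out of the low half */
        }
        a_hi = (a_hi << 1) | (a_lo >> 63);    /* shift the 128-bit copy of a left */
        a_lo <<= 1;
        b >>= 1;
    }
    return p;
}
```

The loop runs once per bit of `b` up to its highest set bit, so it is easy to follow but much slower than using the hardware multiply on 32-bit limbs.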

hutch-- · September 14, 2005, 11:49:33 PM

Hi Kenn,

Welcome on board. Thanks for posting bits like this, many find them useful. :U

roticv · September 15, 2005, 04:22:12 PM

Thanks a lot. :U

Eddy · September 15, 2005, 10:27:48 PM

Kenn,

Thanks for posting the code! Great study material! :8)

A few questions/remarks:

* For the subtraction routine, you negate the second operand, then do an addition.
Wouldn't it be faster to just do it like this (since you particularly said you wanted fast code):

        clc 

        mov     eax, DGT_TWO+12
        sbb     DGT_ONE+12, eax
;
        mov     eax, DGT_TWO+8 
        sbb     DGT_ONE+8, eax        
;
        mov     eax, DGT_TWO+4       
        sbb     DGT_ONE+4, eax         
;
        mov     eax, DGT_TWO     
        sbb     DGT_ONE, eax    
;
        ret                                     ; exit routine

* I found that the following lines...

        mov     eax, DGT_TWO+4   
        adc     DGT_ONE+4, eax

...are better replaced with the alternative:

        mov     eax, DGT_TWO+4 
        adc     eax, DGT_ONE+4
        mov     DGT_ONE+4, eax

It looks less efficient, and on paper it takes the same number of clock cycles as the two-line version, but my tests showed that the two-line version takes more than twice as long to execute as the three-line alternative!
It probably has something to do with the Pentium's internal optimisation: pairing instructions in the U and V pipelines or something like that...
Anyway, I would expect the results to vary between different processor types and brands. But my tests were clear.
So, if you can spare the time, run this test ... :toothy

Kind regards
Eddy
