Division

Started by roticv, February 28, 2005, 04:31:47 PM

roticv

Hello,

I just want to check whether there are any opcodes that can divide data held in MMX registers.

tekhead009

DIVPD, DIVPS, DIVSD, and DIVSS may be what you're looking for.

Mark_Larson

Quote from: tekhead009 on March 01, 2005, 02:04:23 PM
DIVPD, DIVPS, DIVSD, and DIVSS may be what you're looking for.

  That's incorrect.  They only work on XMM registers.



  roticv,

  There are no MMX division instructions.  The best you can do is convert to floating point (SSE) and divide there, or find another way to do it (multiply by the reciprocal?).
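
A rough sketch of the multiply-by-reciprocal idea, written in C rather than MASM so it can be checked on any machine. It assumes unsigned 16-bit values and a constant divisor of 10. The function name `div10_u16` and the constant 52429 (which is ceil(2^19 / 10)) are illustrative choices and do not come from this thread.

```c
#include <stdint.h>

/* Divide an unsigned 16-bit value by 10 without a divide instruction.
   52429 = ceil(2^19 / 10), so (x * 52429) >> 19 equals x / 10
   for every 16-bit x.  The divisor 10 is only an example. */
static uint16_t div10_u16(uint16_t x)
{
    /* 65535 * 52429 still fits in 32 bits, so no overflow here. */
    uint32_t product = (uint32_t)x * 52429u;
    return (uint16_t)(product >> 19);
}
```

The same pattern (multiply, keep the high part, shift) is what a packed multiply-high would do on several 16-bit lanes at once. Other divisors need their own constant and shift count.
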
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
htttp://www.website.masmforum.com/mark/index.htm

roticv

Hello mark,

That would not help if the data is integer and wider than an XMM register. I think I might have to use DIV or something.

bozo

Floating point instructions should be sufficient for 64-bit values.
Consult the Intel/AMD software manuals.

Forgot to mention: you could do division by powers of two using PSRLQ and get the remainder with PAND on MMX registers,
although it's not possible to use immediate values with PAND.

For example, dividing the value in mm0 by 2:


PSRLQ mm0, 1
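
A small C sketch of the same shift-and-mask trick, for anyone who wants to check the arithmetic outside of assembly. The shift stands in for PSRLQ and the mask for PAND. The type and function names are made up for this example, and it only covers unsigned values and divisors that are powers of two.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: quotient and remainder of x / 2^k. */
typedef struct { uint64_t quot; uint64_t rem; } divpow2_result;

static divpow2_result div_pow2_u64(uint64_t x, unsigned k)
{
    divpow2_result r;
    uint64_t mask;

    assert(k < 64);                       /* shift count must stay in range */

    /* Low k bits set.  With PAND this mask has to come from a register
       or memory, since PAND takes no immediate operand. */
    mask = (k == 0) ? 0 : (~(uint64_t)0 >> (64 - k));

    r.quot = x >> k;                      /* the PSRLQ step */
    r.rem  = x & mask;                    /* the PAND step  */
    return r;
}
```

For example, `div_pow2_u64(45, 3)` gives quotient 5 and remainder 5, since 45 = 5 * 8 + 5.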


Then again, floating point instructions are not that difficult to learn.
And if you don't want to go in depth, there are libraries and macros here and there.

roticv

It is not 64-bit. It is 128-bit. I'm coding a big number library.

bozo

Well, then SSE was designed for floating point arithmetic on 128-bit numbers, so have you checked them out yet?
You may be able to get some ideas from OpenSSL libraries.

Mark_Larson

Quote from: Kernel_Gaddafi on March 04, 2005, 11:11:20 AM
Well, then SSE was designed for floating point arithmetic on 128-bit numbers, so have you checked them out yet?
You may be able to get some ideas from OpenSSL libraries.

  The registers are 128-bit, but you can't operate on more than 64 bits at a time (a real8 number).  So if you do any division or multiplication, you'd have to do it on two 64-bit numbers in parallel.  The one exception is logical operations like OR and AND, which operate on the whole register.
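
To make the "two 64-bit numbers in parallel" point concrete, here is a C model of it using integers. This is only an illustration: the struct `u128_lanes` is a made-up stand-in for one XMM register, and the lane-wise function mimics how a packed 64-bit add such as PADDQ treats each half independently.

```c
#include <stdint.h>

/* Hypothetical model of one 128-bit register as two 64-bit lanes. */
typedef struct { uint64_t lo, hi; } u128_lanes;

/* Lane-wise add: each half wraps on its own, nothing crosses over. */
static u128_lanes add_per_lane(u128_lanes a, u128_lanes b)
{
    u128_lanes r = { a.lo + b.lo, a.hi + b.hi };
    return r;
}

/* True 128-bit add: the carry out of the low half is added to the
   high half by hand. */
static u128_lanes add_full(u128_lanes a, u128_lanes b)
{
    u128_lanes r;
    r.lo = a.lo + b.lo;
    /* r.lo < a.lo is true exactly when the low add wrapped around. */
    r.hi = a.hi + b.hi + (r.lo < a.lo);
    return r;
}
```

Adding 1 to a value whose low half is all ones shows the difference: the lane-wise version loses the carry, while the full version moves it into the high half.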


roticv, the integer DIV instruction goes through the floating point unit on P3 and P4.  So using floating point might not be a bad idea.  It just depends on how well it fits what you want to do.
roticv

I was thinking about using shifts and subtractions. I know how to use pencil and paper to solve it, but it is difficult for me to translate those to code.
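
One way to turn the pencil-and-paper method into code is binary long division: shift the dividend left one bit at a time into a running remainder, and subtract the divisor whenever it fits. The sketch below is in C, not MASM, and assumes a 128-bit number stored as four 32-bit limbs with the least significant limb first. That layout and all the names are assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LIMBS 4   /* 4 x 32 bits = 128 bits, least significant limb first */

/* Shift a left by one bit, feeding bit_in into bit 0.
   Returns the bit that fell out of the top. */
static uint32_t shl1(uint32_t a[LIMBS], uint32_t bit_in)
{
    for (int i = 0; i < LIMBS; i++) {
        uint32_t out = a[i] >> 31;
        a[i] = (a[i] << 1) | bit_in;
        bit_in = out;
    }
    return bit_in;
}

/* Return 1 if a >= b, comparing from the most significant limb down. */
static int ge(const uint32_t a[LIMBS], const uint32_t b[LIMBS])
{
    for (int i = LIMBS - 1; i >= 0; i--)
        if (a[i] != b[i]) return a[i] > b[i];
    return 1;
}

/* a -= b, with the borrow passed from limb to limb (the job SBB does). */
static void sub(uint32_t a[LIMBS], const uint32_t b[LIMBS])
{
    uint32_t borrow = 0;
    for (int i = 0; i < LIMBS; i++) {
        uint64_t d = (uint64_t)a[i] - b[i] - borrow;
        a[i] = (uint32_t)d;
        borrow = (uint32_t)(d >> 63);   /* 1 when the subtraction wrapped */
    }
}

/* On return n holds the quotient and rem the remainder of n / d. */
static void divmod128(uint32_t n[LIMBS], const uint32_t d[LIMBS],
                      uint32_t rem[LIMBS])
{
    uint32_t any = 0;
    for (int i = 0; i < LIMBS; i++) any |= d[i];
    assert(any != 0);                   /* division by zero is not handled */

    memset(rem, 0, LIMBS * sizeof rem[0]);
    for (int bit = 0; bit < 32 * LIMBS; bit++) {
        uint32_t top = shl1(n, 0);      /* next dividend bit, MSB first   */
        uint32_t over = shl1(rem, top); /* bring it down into remainder   */
        if (over || ge(rem, d)) {       /* divisor fits: subtract it      */
            sub(rem, d);
            n[0] |= 1;                  /* and record a 1 quotient bit    */
        }
    }
}
```

The dividend array doubles as the quotient: each bit shifted out of the top makes room for one quotient bit at the bottom. This is the simplest form, one bit per iteration, so it takes 128 passes regardless of the operands.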

AeroASM

Use movdqa xmm128,mem128 to move your 128bit number from memory (aligned on a 16 byte boundary) to an xmm register.
Use pslldq xmm128,immed8 to shift the whole register left by immed8 bytes, and psrldq to shift right.
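
Note that these two shifts move whole bytes, not bits. A C model of the left shift, treating the 128-bit value as 16 bytes in little-endian order, looks roughly like this. The function name is made up for the example.

```c
#include <stdint.h>
#include <string.h>

/* Shift a 16-byte little-endian value left by n whole bytes,
   filling the vacated low bytes with zeros. */
static void shl_bytes128(uint8_t v[16], unsigned n)
{
    if (n > 15) {               /* a count above 15 clears the whole value */
        memset(v, 0, 16);
        return;
    }
    memmove(v + n, v, 16 - n);  /* byte i moves up to position i + n */
    memset(v, 0, n);            /* low n bytes become zero           */
}
```

For a bit-at-a-time shift-and-subtract loop, a byte shift is too coarse on its own, so the single-bit shift would still have to be built from per-64-bit shifts with the crossing bit moved between halves by hand.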

Subtraction is harder, because there are no instructions that work on whole 128-bit numbers. What subtractions will you need? It may be possible to use something like the XMM equivalent of sbb.

kenngough

Attached is an example program with 128-bit math.  These routines are for addition, subtraction, multiplication, and division.  The multiplication is an add and shift and the division is a shift and subtract.  This code could be written smaller but was written for the fastest possible execution.  The original code was written for a compiled "basic" run-time and was 80 bits (5 words), and I rewrote it for 32-bit so that I could share it with the forum.  Email me if you have any questions.


[attachment deleted by admin]
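
Since the attachment is gone, here is a rough C sketch of what an add-and-shift multiplication can look like. This is not Kenn's code. It assumes four 32-bit limbs with the least significant limb first, keeps only the low 128 bits of the product, and all names are invented for the example.

```c
#include <stdint.h>
#include <string.h>

#define LIMBS 4   /* 4 x 32 bits = 128 bits, least significant limb first */

/* acc += addend, with the carry passed limb to limb (the job ADC does). */
static void add128(uint32_t acc[LIMBS], const uint32_t addend[LIMBS])
{
    uint32_t carry = 0;
    for (int i = 0; i < LIMBS; i++) {
        uint64_t s = (uint64_t)acc[i] + addend[i] + carry;
        acc[i] = (uint32_t)s;
        carry = (uint32_t)(s >> 32);
    }
}

/* a <<= 1 across all limbs; the top bit is discarded. */
static void shl1_128(uint32_t a[LIMBS])
{
    uint32_t bit_in = 0;
    for (int i = 0; i < LIMBS; i++) {
        uint32_t out = a[i] >> 31;
        a[i] = (a[i] << 1) | bit_in;
        bit_in = out;
    }
}

/* result = (a * b) mod 2^128.  result must not alias a or b. */
static void mul128(const uint32_t a[LIMBS], const uint32_t b[LIMBS],
                   uint32_t result[LIMBS])
{
    uint32_t shifted[LIMBS];
    memcpy(shifted, a, sizeof shifted);
    memset(result, 0, LIMBS * sizeof result[0]);

    for (int bit = 0; bit < 32 * LIMBS; bit++) {
        /* For every 1 bit in b, add the matching shifted copy of a. */
        if ((b[bit / 32] >> (bit % 32)) & 1)
            add128(result, shifted);
        shl1_128(shifted);
    }
}
```

This walks the multiplier from its lowest bit upward, which is the binary version of schoolbook multiplication. A version tuned for speed would more likely multiply limb by limb with MUL, so treat this only as a picture of the method.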

hutch--

Hi Kenn,

Welcome on board. Thanks for posting bits like this, many find them useful.  :U
Download site for MASM32: https://masm32.com
New MASM Forum: https://masm32.com/board/index.php

Eddy

Kenn,

Thanks for posting the code! Great study material!   :8)

A few questions/remarks:

* For the subtraction routine, you negate the second operand, then do an addition.
Wouldn't it be faster to just do it like this (since you particularly said you wanted fast code):

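        clc                                     ; clear carry so the first SBB has no borrow in

        mov     eax, DGT_TWO+12                 ; dword at +12 goes first, so it must be the least significant
        sbb     DGT_ONE+12, eax
;
        mov     eax, DGT_TWO+8                  ; MOV leaves the flags alone, so the borrow survives
        sbb     DGT_ONE+8, eax
;
        mov     eax, DGT_TWO+4
        sbb     DGT_ONE+4, eax
;
        mov     eax, DGT_TWO
        sbb     DGT_ONE, eax                    ; most significant dword last
;
        ret                                     ; exit routine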


* I found that the following lines....

        mov     eax, DGT_TWO+4   
        adc     DGT_ONE+4, eax

...are better replaced with the alternative:

        mov     eax, DGT_TWO+4
        adc     eax, DGT_ONE+4
        mov     DGT_ONE+4, eax

It looks less efficient, and on paper it has the same number of clock cycles as the two-line version, but my tests showed that the two-line version takes more than twice as long to execute as the three-line alternative!
It probably has something to do with the Pentium's internal optimisation: pairing instructions in the U and V pipelines or something like that...
Anyway, I imagine the results will vary between processor types and brands, but my tests were clear.
So, if you can spare the time, run this test ... :toothy

Kind regards
Eddy


Eddy
www.devotechs.com -- HIME : Huge Integer Math and Encryption library--