News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

best way of multiplying

Started by juliomacedo, January 18, 2006, 01:55:15 PM

Previous topic - Next topic

juliomacedo

My question was based on the following:

"Intel recommends that any fixed integer with fewer than 8 '1' bits set should be decomposed to shifts and adds"

It seems to mean that I shouldn't simply use MUL or IMUL when performing integer multiplications since MUL and IMUL are very slow instructions...

The article suggests "do the obvious 'horner's rule' method of just shifting and adding" and what I wanted to know when I created this post was if HORNER'S RULE (exemplified by the code that I wrote with first code) is the best (in terms of time) solution...

As I see, there's other suggestions as LEA solution, although I don't know if it is a good solution for all integer constants or only for a few cases such as 19.

We need more performance tests using more-reliable RDSTC and other constants besides 19...


gabor

Hello!

Quote from: hutch-- on January 20, 2006, 12:21:33 AM
I have yet to see much evidence that smaller as in byte count code goes any faster in most instances.

I am sure that longer code is faster in some cases! Example: you have an instruction that is NOT 4 or 8 bytes long. AFAIK it is recommended to insert nops or to rearrange the instruction flow so that it becomes dword aligned. Am I right, is this so? Some time ago I tried to optimize a code of mine and experienced about 7x acceleration when alignement was adjusted. I hope I'm not messing up my memories and I remember right?

There turn out to be so many condition to be taken in count that it is high time to have a tester tool that gives the opportunity to test and compare codes. I beliece Intel created such tool, but I don't remember its name. And it was not freeware... Do you know any other tools for this purpose?

btw, I always like to read about such topics: they seem a bit nonsense (mul by 19 is rather special case), and yet I have learnt from post like this really a lot! :bg

Greets, Gábor

juliomacedo

Gabor,

The purpose is not finding a way of multiplying for 19, it was just a arbitrary constant that I chose to demonstrate the method of decomposing in shifts and adds that I wanted to dicuss...

What I want to discuss is the faster way of making any integer multiplication...

Thanks

Julio

Tedd

Quote from: juliomacedo on January 20, 2006, 01:55:31 PM
What I want to discuss is the faster way of making any integer multiplication...

Then you should probably use the shift & add method.
You could include within this special cases for particular numbers, using lea and other tricks, where it would be faster than shifting and adding. However, it depends how you want to use it, if it's to generate code for performing the muliplications, then you can go to some expense at finding the best combination, but if you want to perform the multiplications then the code required to catch the special cases will probably remove the benefit of using the special cases anyway.
No snowflake in an avalanche feels responsible.

Mark_Larson

Quote from: hutch-- on January 20, 2006, 12:21:33 AM
...I tend to avoid the known pitfalls with repeated slow instructions and interleave the odd slow one with the preferred shorter instructions like add, sub etc ...

That's a recommended optimization technique on the P3 and earlier processors from the P3 and earlier optimization manuals.  They describe it as interleaving a multi-uop instruction followed by one to three 1 uop instructions.  In the P4 optimization manual they say don't do this anymore.  However I have found it still works good on P4 ( at least the one I have).

Mark
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
htttp://www.website.masmforum.com/mark/index.htm

juliomacedo

Quote from: Tedd on January 20, 2006, 02:05:37 PM

if it's to generate code for performing the muliplications, then you can go to some expense at finding the best combination, but if you want to perform the multiplications then the code required to catch the special cases will probably remove the benefit of using the special cases anyway.


My purpose is generate code to perform the multiplications so the code to catch the special cases will run at compile-time and won't affect run-time performance...

daydreamer

Quote from: !Czealot on January 18, 2006, 05:56:38 PM
I was gonna ask this one
mov eax,050009h
mov ebx, 0190019h
mul ebx
should eax contain 9x19h and edx 5x19h ?


ok my idea was to use it for byte multiplications, for example blend operations, using full range of one 32bit mul and first result in eax and second in edx
I have a vague memory of Mark somewhere wrote MMX/SSE has some execution units and ALU some, so execute this together with some MMX/SSE2 muls? should that be working in parallel?


EduardoS

A generic mul for PIV,
You want multiply eax by a constant,
for each bit in constant from right to left do
add ebx, eax
add eax, eax
if the bit is seted and
add eax, eax
if the bit is claered
both code should run in 1 clock cycle
with 19 (for example)

mov ebx, eax ; ebx should start with 0, so mov instead of add, the first bit is set
add eax, eax
add ebx, eax ; the second bit is seted
add eax, eax
; the third bit is cleared
add eax, eax
; the forth bit is cleared
add eax, eax
add ebx, eax ; the fifth bit is seted
;dont need final add eax, eax, i already have the result

this code should run within 5 clocks, if the tables on pentium otimisation manuals are right its not possible to be faster, and this is a general form.

juliomacedo

Quote from: EduardoS on January 20, 2006, 09:15:29 PM
if the tables on pentium otimisation manuals are right

Where are these tables?

juliomacedo

Quote from: EduardoS on January 20, 2006, 09:15:29 PM
A generic mul for PIV,
You want multiply eax by a constant,
for each bit in constant from right to left do
add ebx, eax
add eax, eax
if the bit is seted and
add eax, eax
if the bit is claered
both code should run in 1 clock cycle

Another interesting generic solution... Now we have Horner and Eduardo... Maybe LEA solution that I don't know if it is generic or specific...

What's is the faster solution (in average, for all integer constants)??? Mathematicians are welcome to answer this question  :toothy





EduardoS

try www.agner.org

lea isn't so generic, it works well for 19, but not for 23 for example, and also lea is slow on PIV (still fast on AMD processors)...

Snouphruh

Quote from: Mark_Larson on January 19, 2006, 07:48:52 PM

EDIT:  I have a Linux system at work, or I'd test all this myself.

Mark

What kind of Linux? Fedora Core, Suse, Mandriva, Red Hat, Alt Linux, Linux XP, ASP Linux, Debian?