math with blocks of memory

Started by loki_dre, April 17, 2008, 06:16:57 AM


loki_dre

is there a fast way to do math with 2 blocks of memory without writing a for loop?
When I say fast, I mean fast in terms of processing time.

eg.   a[0 to 10] = a[0 to 10] + b[0 to 10]
      a[0 to 10] = a[0 to 10] - b[0 to 10]
      a[0 to 10] = a[0 to 10] * b[0 to 10]
      a[0 to 10] = a[0 to 10] / b[0 to 10]
      a[0 to 10] = a[0 to 10] > b[0 to 10]
      a[0 to 10] = a[0 to 10] >= b[0 to 10]
      a[0 to 10] = a[0 to 10] < b[0 to 10]
      a[0 to 10] = a[0 to 10] <= b[0 to 10]

hutch--

loki,

It is not the loop code that determines the speed of calculations like this; it's the speed of the memory access that imposes the limit. Just code an efficient loop for the operations and it will be as fast as it can be. The ADD and SUB operations are reasonably fast but MUL and DIV are slower.
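For reference, the straightforward loop is trivial to write; a minimal sketch in C (function name and 32-bit integer element type are illustrative choices, not from the thread):

```c
#include <stddef.h>

/* Plain element-wise add: a[i] += b[i]. The loop overhead is tiny;
   throughput is limited by how fast memory can feed the ALU. */
void add_blocks(int *a, const int *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] += b[i];
}
```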
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

zooba

Quote from: hutch-- on April 17, 2008, 07:14:48 AM
The ADD and SUB operations are reasonably fast but MUL and DIV are slower.

Just to blow this out of the water somewhat: I recently did some basic benchmarking on a range of Core 2 processors and found that multiplication takes roughly as long as addition or subtraction (integer, floating point and SSE alike). Division/modulus typically takes about four times as long as the other operations. On earlier processors multiplication is certainly slower, but they've finally reached parity. (I believe AMD got there before Intel, but I can't confirm that.)

As to the original question, I suggest you read http://www.mark.masmcode.com/. It has enough ideas to help you out here. The short answer is yes, it can be done without a loop. The longer answer is that you have to write a lot of code for large blocks and the gain is minimal. A simple unrolled loop using SSE instructions will give you the best performance, usually well beyond what a C/C++ compiler can produce.
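For example, a twice-unrolled SSE loop over packed singles might look like this written with C intrinsics (a sketch only; it assumes the length is a multiple of 8 and both buffers are 16-byte aligned):

```c
#include <xmmintrin.h>  /* SSE intrinsics */
#include <stddef.h>

/* Element-wise add of packed singles, four floats per load/ADDPS pair,
   unrolled twice (8 floats per iteration). Assumes n is a multiple of 8
   and both pointers are 16-byte aligned -- illustration only. */
void add_ps(float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i += 8) {
        __m128 a0 = _mm_load_ps(a + i);
        __m128 a1 = _mm_load_ps(a + i + 4);
        a0 = _mm_add_ps(a0, _mm_load_ps(b + i));
        a1 = _mm_add_ps(a1, _mm_load_ps(b + i + 4));
        _mm_store_ps(a + i,     a0);
        _mm_store_ps(a + i + 4, a1);
    }
}
```

Real code would need a scalar tail loop for lengths that aren't a multiple of 8.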

Cheers,

Zooba :U

loki_dre

do you know why the processing takes longer in C++?

u

Quote from: zooba on April 17, 2008, 10:34:53 AM
found that multiplication takes roughly as long as addition or subtraction (both integer, floating point and SSE).
Then that's some slow-ass add/sub performance :P. (No, really.)
On AMD CPUs multiplication is 3 times slower than add/sub/xor/and/or/... You can chain these simple ops in such a way that a 1.8GHz CPU will perform like an 11GHz P4.
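One small illustration of trading a multiply for chained simple ops (my own C sketch, not benchmarked here): a multiply by a known constant can be decomposed into shifts and adds, each a cheap single-cycle operation.

```c
#include <stdint.h>

/* x * 10 decomposed as (x << 3) + (x << 1): two shifts and an add
   instead of one multi-cycle multiply. On x86 a compiler may emit
   this as LEA + ADD. */
static inline uint32_t mul10(uint32_t x)
{
    return (x << 3) + (x << 1);
}
```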

Quote from: loki_dre
do you know why the processing takes longer in C++?
Depends on which compiler and what optimization settings you've set. Compilers know only a subset of optimization patterns and can't compose new ones, unlike a good asm coder.
Please use a smaller graphic in your signature.

loki_dre

hmmmmmm.....
So, does anyone know what the fastest processor is for doing a lot of math?

u

Cell BE, judging from Folding@Home.

zooba

Quote from: Ultrano on April 17, 2008, 06:10:00 PM
Then that's some slow-ass add/sub performance :P. (no, really).

Probably. The test was set up to avoid pairing, so that may be influencing the speed of the add/sub operations.

I have yet to see a C++ compiler produce a loop using SSE packed singles. Having said that, I haven't looked since last year and there are new versions of stuff out now.

Quote from: Ultrano on April 17, 2008, 09:44:16 PM
Cell BE, judging from Folding@Home.

I agree. An Intel-based processor is not going to give the best processing performance. I'm pretty sure that within the Intel/AMD world it's largely irrelevant anyway, except for large steps in processor speed (500MHz-700MHz intervals).

Cheers,

Zooba :U

hutch--

It would be interesting to see if the Core 2 Duo range is actually faster with DIV and MUL relative to earlier processors. When Intel introduced the PIV it did some stuff better, like pairing suitable instructions, but it was noticeably slower for a given clock count on a number of commonly used integer instructions like SHL and SHR, and they killed LEA as a fast method as well.

With very late SSE it would be very useful if they have built enough capacity into the critical maths instructions rather than just dumping so much of it off to microcode.