What am I doing wrong?

Started by Merrick, April 29, 2007, 12:48:47 AM


Merrick

NEVERMIND!!!!
I got about 80% of the way through writing this message when I realized what my problem actually was, but since there's a small possibility it might catch someone else the way it did me, I'll go ahead and finish writing it up and post it. I hope it might be useful to someone.

********************************************

I tend not to write full assembler apps. I generally do visual/RAD development in some visual language, putting the CPU-intensive code in external DLLs written in assembler. In this case, I want to work on optimizing the code in assembler, so I've written a simple console application. But it's not doing what I want it to do, and I can't figure out what I'm doing wrong. I'm sure it's something very simple...

What I've done is hijacked the FFT routine from Numerical Recipes in FORTRAN and ported it over (with VERY LITTLE optimization) to MASM (thus, my desire to do some optimizing). First, to give you just a little background:

The routine is called FourierHL. It takes two arrays and a dword argument containing the length of the arrays.
The first array is, nominally, an array of time data and the second is, nominally, an array of frequency data (which would typically be the result of the FT of the time data).
The frequency array is just a placeholder at this point. The FT currently does the transform in place, so the frequency data winds up in the original time array. The frequency array is there because I have been considering modifying the routine to do the calculation out of place.

When I call this routine in a DLL from VB, the results are correct.
When I copy that routine into the attached program, the results are not.

Called from VB, the code is:
FourierHL timeArray(1), frequencyArray(1), CLng(n)
    [where VB is passing the address of the first element in each array and n is the length of the array]

Some more pertinent information:
This FT works on complex numbers, so the array I send, which is 8 real numbers, is stored as an array of 16 numbers in pairs of real and imaginary components. So, if the first 4 numbers in the real time data are 1, 2, 3, 4 then the first 8 numbers in timeArray will be 1, 0, 2, 0, 3, 0, 4, 0 (and so on).

My test data is currently: (1.3, 2.4, 1.7, 4.1, 2.9, 1.7, 5.1, 2.7)
Which, as complex number pairs are stored as:

1.3, 0, 2.4, 0, 1.7, 0, 4.1, 0, 2.9, 0, 1.7, 0, 5.1, 0, 2.7, 0
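In Python terms, that interleaved layout looks like this (just a sketch; `pack_complex` is a made-up helper name, not part of the MASM code):

```python
def pack_complex(reals):
    """Interleave real values with zero imaginary parts,
    matching the real/imaginary pair layout the FFT routine expects."""
    packed = []
    for r in reals:
        packed.extend([r, 0.0])
    return packed

data = [1.3, 2.4, 1.7, 4.1, 2.9, 1.7, 5.1, 2.7]
print(pack_complex(data))
# [1.3, 0.0, 2.4, 0.0, 1.7, 0.0, 4.1, 0.0, 2.9, 0.0, 1.7, 0.0, 5.1, 0.0, 2.7, 0.0]
```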

The correct results are:
21.9 + 0i
-2.094975 - 1.915076i
-2.6 - 2.7i
-1.105026 + 4.884924i
0.1 + 0i
-1.105026 - 4.884924i
-2.6 + 2.7i
-2.094975 + 1.915076i
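These values can be checked against a brute-force DFT. Note the sign convention: the Numerical Recipes four1 routine called with isign = 1 uses exp(+2*pi*i*j*k/N), which is the conjugate of the "engineering" forward transform. A pure-Python check (not part of the original program):

```python
import cmath

def dft(x):
    """Brute-force O(N^2) DFT with the exp(+2*pi*i*j*k/N) convention,
    i.e. what Numerical Recipes' four1 computes for isign = 1."""
    n = len(x)
    return [sum(x[j] * cmath.exp(2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

data = [1.3, 2.4, 1.7, 4.1, 2.9, 1.7, 5.1, 2.7]
expected = [21.9 + 0j, -2.094975 - 1.915076j, -2.6 - 2.7j,
            -1.105026 + 4.884924j, 0.1 + 0j, -1.105026 - 4.884924j,
            -2.6 + 2.7j, -2.094975 + 1.915076j]
for got, want in zip(dft(data), expected):
    assert abs(got - want) < 1e-5
print("all 8 bins match")
```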


The VB code (calling the assembly routine) gets the correct results.

The attached assembler program prints out the first value in the input array, runs the FFT repetitively in a timing loop to determine a good estimate of the time needed to perform the transform, then prints out the first value of the output array (which is written over the original array in memory) to test (albeit not very stringently) the results of the calculation. But the results are wrong. What gives?

********************

OK. The problem is that, because the calculation is done in place, the final result after multiple loops is NOT the same as the result after one loop. When I change the loop count to 1, the results are correct (though the timing estimate is of course lousy!!)
Unfortunately, I'm a little concerned about how useful this will be. This particular form of the FFT (the one most useful to physicists) produces output that is scaled, with respect to the original data, by the square root of the number of elements in the input array. So if you choose a moderately high loop count (more than a few dozen iterations) for the timing test, you end up with floating-point overflows. It's not clear to me whether performing the FFT on NaNs of one form or another takes approximately the same amount of time as performing the FFT on actual numbers.
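The sqrt(N) growth per pass follows from Parseval's theorem for the unnormalized transform (||X||^2 = N ||x||^2), so transforming in place over and over must blow up. A quick pure-Python illustration, with a brute-force DFT standing in for the MASM routine:

```python
import cmath
import math

def dft(x):
    # Unnormalized O(N^2) DFT, exp(+i) convention as in Numerical Recipes four1.
    n = len(x)
    return [sum(x[j] * cmath.exp(2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def norm(x):
    return math.sqrt(sum(abs(v) ** 2 for v in x))

x = [1.3, 2.4, 1.7, 4.1, 2.9, 1.7, 5.1, 2.7]
n0 = norm(x)
for i in range(4):
    x = dft(x)  # transform "in place": the output replaces the input
    print(f"pass {i + 1}: norm grew by {norm(x) / n0:.3f}x")
# each pass multiplies the norm by sqrt(8), about 2.828
```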

As it is, however, I am getting single pass timings of about 2950 cycles.
Theoretically, the minimum number of floating-point operations which must be performed in a complex FFT approaches 5 N log2(N), where log2(N) is the base-2 logarithm of N.
So, a complex array of length 8 should require 120 floating point operations.
If we pessimistically choose fsin as our prototypical floating-point operation, then each operation should take between 16 and 126 cycles.
120 flops * 16 cycles per flop = 1920 cycles.
I've chosen a pessimistic archetype FPU instruction to determine my timing, but I've also chosen the most optimistic timing of fsin for my calculation. Also, the theoretical limit counts ONLY the floating-point operations required. The actual FFT routine has a non-trivial data reshuffling step up front that is not accounted for.
So, in the best case scenario, I do not think I can beat my current results by more than a factor of about 3 (optimistically) and probably less than a factor of 2. And I've written that routine in remarkably high level assembler (i.e., pretty much just converting the FORTRAN loops into MASM high level loop directives) so that's really not that bad.
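That back-of-the-envelope arithmetic, spelled out (the 16 and 126 cycle figures are the fsin range quoted above, and 2950 cycles is the measured single-pass time):

```python
import math

N = 8
flops = 5 * N * math.log2(N)   # theoretical complex-FFT flop count
print(flops)                   # 120.0

best_case = flops * 16         # optimistic: 16 cycles per fsin-class op
worst_case = flops * 126       # pessimistic end of the quoted range
print(best_case, worst_case)   # 1920.0 15120.0

measured = 2950
print(measured / best_case)    # about 1.54: headroom is under a factor of 2
```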

Anyways... if you've gotten here, thanks for taking the time to read this through...

zooba

You can probably improve your timing accuracy by modifying the function to use separate input and output arrays, i.e., the function doesn't change the input array. That way you can run it multiple times using the same input data. The output array can be allocated statically or as a LOCAL, so allocation overhead is irrelevant. Alternatively, you can re-load the input array as part of your loop, but then you have to account for the extra time that adds.
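The second alternative (re-loading the input each iteration, then subtracting the copy cost) might look like this in outline. This is a Python stand-in for the MASM timing loop, and `fft_inplace` is a hypothetical placeholder for the routine under test:

```python
import time

def fft_inplace(buf):
    pass  # placeholder for the in-place FFT routine being timed

master = [1.3, 0.0, 2.4, 0.0, 1.7, 0.0, 4.1, 0.0,
          2.9, 0.0, 1.7, 0.0, 5.1, 0.0, 2.7, 0.0]
LOOPS = 100_000

# 1) time the copy plus the transform together
t0 = time.perf_counter()
for _ in range(LOOPS):
    work = master[:]        # restore the input the previous pass destroyed
    fft_inplace(work)
t_total = time.perf_counter() - t0

# 2) time the copy alone, then subtract it out
t0 = time.perf_counter()
for _ in range(LOOPS):
    work = master[:]
t_copy = time.perf_counter() - t0

print(f"transform alone: {(t_total - t_copy) / LOOPS * 1e9:.0f} ns/pass")
```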

Cheers,

Zooba :U

Merrick

That's true zooba, but would require a substantial modification of the routine. (Though I do have some interest in doing this for other reasons - but I'd only want to do it once after optimizing the code!)
I should have mentioned that I think I can get improvement by changing the high level loop structures to actual jmps, experimenting with which variables are kept in registers and which are stored in memory, etc.
There might be some use to be had of the SSE registers - but it's not clear that this would help.

Thanks.

raymond

Two comments.
When I wrote my own program to render the Mandelbrot fractal (which uses complex numbers), keeping as many computed values on the FPU as possible was a definite improvement. Also, when I was using memory variables to hold intermediate results, I saw significant differences in timing depending on where those memory variables were located within the data section.

When I wrote my own program to decompress JPEG image files, one of the required steps was a Discrete Cosine Transform on 8x8 blocks, which is the most time-consuming part of the decompression. I initially wrote the algo using the FPU and later rewrote it using MMX instructions. The latter was by far the faster. If you don't need extreme precision, using SSE instructions will probably reduce your FT time considerably.

Raymond
When you assume something, you risk being wrong half the time
http://www.ray.masmcode.com