News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Converting words to floats using sse

Started by akane, August 01, 2009, 11:23:51 PM

Previous topic - Next topic

akane

I agree,realW = 2.0*cos(2.0*pi*k/N);
imagW = sin(2.0*pi*k/N);

d1 = 0.0;
d2 = 0.0;
for (n=0; n<N; ++n)

realW and imagW (or the other variables - theta, CO, CO2, S0 from my attachment) will be precomputed for all tones.
The wroted FFT program I want to optimize with Goerzel can be downloaded from http://www.ionicwind.com/forums/index.php?topic=3472.0 - this is my topic and idea. It requires FFTW binaries.

dedndave

#16
multiplying to divide is nothing new
you are simply multiplying by the reciprical of the constant
the MUL instructions are somewhat faster than the DIV instructions, so that technique gets a lot of mileage

unsigned multiplication by a constant is even better

this is a simplified example
if we want to multiply by 10, that is 1010 in binary
so, it's shift and add, shift and add
the shift/add pattern matches the 1010 constant pattern

        movzx   eax,word ptr InputVal16u
        shl     eax,1                      ;2X in eax
        mov     edx,eax                    ;2X in edx
        shl     eax,2                      ;8X in eax
        add     eax,edx                    ;10X in eax

Hutch showed me a trick to speed that up

        movzx   eax,word ptr InputVal16u
        lea     eax,[eax+4*eax]            ;5X in eax
        shl     eax,1                      ;10X in eax

by selecting the right shift/add pattern, you can multiply by any value, so long as you are aware of overflow
combining the multiply-to-divide and hard-coded multiplication by constants methods can eliminate all MULs and DIVs
you are not really too concerned over a slight loss in resolution, which makes it a great application for the methods
shifts and adds are VERY fast, especially register to register
the entire DFT would be blazingly fast

Rockoon

Quote from: akane on August 02, 2009, 10:41:25 PM
Yes, it is processing wave files or sound captured from the mixer or so. It will probably replace FFT, because I have already a GUI FFT program for "sound decompilation" - you select a part of the sound, and it shows which piano keys are active. But in order to keep good resolution for the lowest frequencies, the FFT must be 32768 samples big

Aha.. well now! This is a much more attackable problem.

Did you know that if you downsample that 32768 sample buffer to say, 4096 samples, that you still get to measure those low frequencies?

They will just appear like they are 3 octaves higher (their wavelength will be 1/8th as long as originally) The downsampling will destroy the highest frequency information, cause some aliasing in the new highest frequencies, but the low frequency information will remain intact.

Downsampling is a form of low-pass filter.

http://en.wikipedia.org/wiki/Downsampling
When C++ compilers can be coerced to emit rcl and rcr, I *might* consider using one.

dedndave

there ya go - that's your octave switch - only look at half the samples to switch down an octave
set up your 12 notes for one octave and you have it made - lol
in my routine, the

        mov     ecx,8Fh

sets the biased exponent - 8Fh is for a sample count of N=32768 (it gets decremented to 8Eh for -32768)
for N=16384, change the 8Fh to 90h
for N=8192, use 91h
for N=4096, use 92h, and so on
put it in a variable or calculate it based on the number of samples

        mov     ecx,SampleCountExponent

akane

Quotesets the biased exponent - 8Fh is for a sample count of N=32768
You mean sample amplitude, right? Because 1/32768 in your code is used to decrease amplitude of all samples, in order to keep constant output level from dsp routine:
words[] = from_wave()
floats[] = words[]/32768
complex[] = dsp(floats, count, samplerate)


Amplitude scaling to +-1 or +-2 (PEP) is sometimes important for custom sin/cos routines. You never know which algorithm is used for sine, while FPU is accepting any value. For example the raw tailor polynomial for sine is accepting up to +-PI

dedndave

my bad - i thought that was the number of samples

akane

The version with float lookup table is much faster than the one with "manual float creation". All the floats are initially divided by 32768, .
movzx eax,word[esi+0]
movzx edx,word[esi+2]
mov   eax,[g_floatlookup+eax*4]
mov   edx,[g_floatlookup+edx*4]
mov   [edi+0],eax
mov   [edi+4],edx


The SSE version will be faster if SSE2 is available - instead cvtpi2ps+movlhps+cvtpi2ps we need only CVTDQ2PS xmm,mem; it will convert 4 integers to floats.
Funny, Olly v1 does not recognize CVTDQ2PS as instruction.

raymond

A table lookup will always be the fastest if it fits in L1.

For curiosity, I timed some of the algos with QueryPerformanceCounter on my old P4, for 1000 repeats of the algo. The reported counts are corrected for the counts of running only the reduction of the 1000 down to 0, which was 10 on my system. (Simply invoking QueryPerformanceCounter twice in succession results in a difference of 6 counts.)

Keeping the 32768 as the divisor on the FPU, the sequence
fild word + fdiv st,st(1) + fstp mem requires 87 counts.

Keeping the reciprocal of 32768 as the multiplier on the FPU, the sequence
fild word + fmul st,st(1) + fstp mem requires 15 counts.

My suggested algo for converting word integers to a float in the range of -1 to +1 requires 47 counts.

None of the above are sensitive to the value of the integer.

dedndave's suggested algo to convert a word integer to a float is very sensitive to the size of the integer; the smaller it is, the more time it requires.
For word=4321, 38 counts
For word=432, 62 counts
For word=43, 95 counts
For word=4, 125 counts.

The identical result to dedndave's algo can be obtained with the following FPU sequence
fild word + fstp mem which only requires 15 counts and is not sensitive to the word value.

Surprisingly, that time was the same as when the word was also multiplied by the reciprocal before storing it.

When you assume something, you risk being wrong half the time
http://www.ray.masmcode.com

dedndave

i knew it wasn't blazing - lol
did you see how fast it is for zero !!!  ::)

i think it is as fast as it gets with no FPU or SSE
the loop kills it
to be fair, a "typical" value for a sine wave should be more than 8192 away from zero
the RMS of a full-range sine wave is ~ +/-11584
that means that the loop will typically only make 3 passes
if we assume that the levels of the complex waveform cut that in half, it is 4 passes
that means it's a pretty good guess-timation that the 38-62 counts will be nominal for my routine
you're biased, Ray - lol (pun)

it's a moot point, really, as neither method can beat the SSE2 instructions
if someone wanted to spend the time on the code, no floating point math would be required at all
it would be as fast as a bullet up to the complex, which would require some bit-twiddling
i don't think it would require any loops, though

dedndave

after looking at some of the SSE instructions, i can see it would be difficult to beat their times
they are FAST ! - lol
no use in messing with the fixed point methods

dedndave

i learned 2 new instructions, BSWAP and BSR/BSF - well, new for me at least - lol
because i am a bit-twiddler, i really like them
i used them to eliminate the loop and speed up the final positioning of bits
i gave preference to values nearest to 0, of course

        movzx   eax,word ptr InVal16s
        or      ax,ax
        jz      Scale2

        mov     cx,16
        jns     Scale0

        neg     ax
        mov     ch,1

Scale0: bsr     dx,ax
        cmp     dl,7
        ja      Scale1

        xchg    al,ah
        mov     cl,8

Scale1: sub     cl,dl
        add     dl,7Fh
        mov     dh,ch
        shl     eax,cl
        shr     edx,1
        rcr     ax,1
        xchg    al,dl
        bswap   eax
        mov     ah,dl

Scale2: mov     OutValReal4,eax

some fair test values might be:
8191
4095
1023
255
-255
-1023
-4095
-8191

dedndave

i made a comparison between Ray's code and my new code
for Ray's method, i made 2 measurements; one with the multiplier resident and one non-resident
the program with source code is attached

resident method:

        finit
        fld real10 ptr Recip

;start timer loop

        fild word ptr InVal16s
        fmul    st,st(1)
        fstp real4 ptr OutValReal4

;stop timer loop

non-resident method:

        finit

;start timer loop

        fld real10 ptr Recip
        fild word ptr InVal16s
        fmul    st,st(1)
        fstp real4 ptr OutValReal4
        ffree   st

;stop timer loop

it looks like the aging newbie/FPU guy wins again   :U

Total System Processor Cores: 2
CPU 0: Intel(R) Pentium(R) 4 CPU 3.00GHz MMX SSE3 Cores: 2

Aging n00b/bit-twiddler guy new method:
8191:   22      21      22      clock cycles
4095:   21      22      22      clock cycles
1023:   22      22      21      clock cycles
255:    22      22      23      clock cycles
-255:   25      26      26      clock cycles
-1023:  25      25      25      clock cycles
-4095:  24      25      24      clock cycles
-8191:  24      24      24      clock cycles

Aging newbie/FPU guy method (resident multiplier):
8191:   7       7       7       clock cycles
4095:   7       7       7       clock cycles
1023:   7       7       7       clock cycles
255:    7       7       7       clock cycles
-255:   7       7       7       clock cycles
-1023:  7       7       7       clock cycles
-4095:  7       7       7       clock cycles
-8191:  7       7       7       clock cycles

Aging newbie/FPU guy method (non-resident multiplier):
8191:   15      15      15      clock cycles
4095:   15      15      15      clock cycles
1023:   15      15      15      clock cycles
255:    15      15      15      clock cycles
-255:   15      15      15      clock cycles
-1023:  15      15      15      clock cycles
-4095:  15      15      15      clock cycles
-8191:  15      15      15      clock cycles



[attachment deleted by admin]

Ghandi

Strange, i get this when i run it on my E5400:


Total System Processor Cores: 2
CPU 0: Intel(R) Pentium(R)  CPU       E5400  @ 2.70GHz MMX SSSE3 Cores: 2

Aging n00b/bit-twiddler guy new method:
8191:   16      16      16      clock cycles
4095:   16      16      16      clock cycles
1023:   16      16      16      clock cycles
255:    17      17      17      clock cycles
-255:   17      17      17      clock cycles
-1023:  16      16      16      clock cycles
-4095:  16      16      16      clock cycles
-8191:  16      16      16      clock cycles

Aging newbie/FPU guy method (resident multiplier):
8191:   0       0       0       clock cycles
4095:   0       0       0       clock cycles
1023:   0       0       0       clock cycles
255:    0       0       0       clock cycles
-255:   0       0       0       clock cycles
-1023:  0       0       0       clock cycles
-4095:  0       0       0       clock cycles
-8191:  0       0       0       clock cycles

Aging newbie/FPU guy method (non-resident multiplier):
8191:   0       0       0       clock cycles
4095:   0       0       0       clock cycles
1023:   0       0       0       clock cycles
255:    0       0       0       clock cycles
-255:   0       0       0       clock cycles
-1023:  0       0       0       clock cycles
-4095:  0       0       0       clock cycles
-8191:  0       0       0       clock cycles


HR,
Ghandi

jj2007

Quote from: Ghandi on March 06, 2010, 06:31:33 AM
Strange, i get this when i run it on my E5400:

Zero cycles is cheating, man!

Total System Processor Cores: 1
CPU 0: Intel(R) Celeron(R) M CPU        420  @ 1.60GHz MMX SSE3 Cores: 1

Aging n00b/bit-twiddler guy new method:
8191:   17      17      17      clock cycles
4095:   16      17      17      clock cycles
1023:   17      17      17      clock cycles
255:    18      18      18      clock cycles
-255:   18      18      18      clock cycles
-1023:  17      17      17      clock cycles
-4095:  17      17      17      clock cycles
-8191:  17      17      17      clock cycles

Aging newbie/FPU guy method (resident multiplier):
8191:   1       1       1       clock cycles
4095:   1       1       1       clock cycles
1023:   1       1       1       clock cycles
255:    1       1       1       clock cycles
-255:   1       1       1       clock cycles
-1023:  1       1       1       clock cycles
-4095:  1       1       1       clock cycles
-8191:  1       1       1       clock cycles

Aging newbie/FPU guy method (non-resident multiplier):
8191:   1       1       1       clock cycles
4095:   1       1       1       clock cycles
1023:   1       1       1       clock cycles
255:    1       1       1       clock cycles
-255:   1       1       1       clock cycles
-1023:  1       1       1       clock cycles
-4095:  1       1       1       clock cycles
-8191:  1       1       1       clock cycles

Ghandi

Yes, its definitely strange. I havent modified the application at all, it is straight out of the box giving me those values. o0

HR,
Ghandi