News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Crunch an array of word integers

Started by jj2007, February 23, 2011, 09:21:54 PM

Previous topic - Next topic

jj2007

Long time no fun in the lab, so here is an exercise: You got an array of WORDs and must add it up. Here is the basic idea:

.data
MyWords dw 10000, 20000, 40000, 20000, 10000, 10000, etc...

.code
mov esi, offset MyWords
fldz
m2m ecx, 20
.Repeat
dec ecx
fiadd word ptr [esi+2*ecx] ; fine for signed words but no good for unsigned words
.Until Zero?


Results:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
111     cycles for AddWordsA aka 5 cycles per WORD
113     cycles for AddWordsB
117     cycles for AddWordsC
123     cycles for AddWordsD
364     cycles for AddWordsE

32       bytes for AddWordsA
33       bytes for AddWordsB
34       bytes for AddWordsC
21       bytes for AddWordsD
32       bytes for AddWordsE


Algo E uses a pushw - that should definitely be avoided!
push 0 ; create a slot
fldz
m2m ecx, 20
align 16
.Repeat
pop ax
dec ecx
push word ptr [esi+2*ecx] ; this is horribly slow...!
fiadd dword ptr [esp]
.Until Zero?

dedndave

prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
158     cycles for AddWordsA
281     cycles for AddWordsB
279     cycles for AddWordsC
219     cycles for AddWordsD
1375    cycles for AddWordsE

160     cycles for AddWordsA
283     cycles for AddWordsB
276     cycles for AddWordsC
219     cycles for AddWordsD
1373    cycles for AddWordsE

dedndave

sometimes, the old ways are the best ways (Brion James - Fifth Element)

prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
158     cycles for AddWordsA
281     cycles for AddWordsB
281     cycles for AddWordsC
219     cycles for AddWordsD
1372    cycles for AddWordsE

41      cycles for AddWordsF

158     cycles for AddWordsA
275     cycles for AddWordsB
278     cycles for AddWordsC
219     cycles for AddWordsD
1372    cycles for AddWordsE

48      cycles for AddWordsF

32       bytes for AddWordsA
33       bytes for AddWordsB
34       bytes for AddWordsC
21       bytes for AddWordsD
32       bytes for AddWordsE

29       bytes for AddWordsF

i didn't try to see if there are faster ways - that was just off-the-cuff   :P

dedndave

there must be something amiss
60-80 cycles, i would expect
40-50 seems a little too fast for 20 words - lol

Antariy

Quote from: dedndave on February 24, 2011, 03:18:40 AM
there must be something amiss
60-80 cycles, i would expect
40-50 seems a little too fast for 20 words - lol


:bg

Pretty funny test :bg

I thought that testing with huge table is much more close to the reality.

Used 3000 repeat of the test table - total table size 60000 entries over 117 KB.

Results:

383187  cycles for AddWordsA
726846  cycles for AddWordsB
726506  cycles for AddWordsC
544366  cycles for AddWordsD
4146797 cycles for AddWordsE
161863  cycles for AddWordsF
107808  cycles for AxAddWordInt

381553  cycles for AddWordsA
725862  cycles for AddWordsB
725483  cycles for AddWordsC
546341  cycles for AddWordsD
4145276 cycles for AddWordsE
161054  cycles for AddWordsF
107105  cycles for AxAddWordInt

35       bytes for AddWordsA
36       bytes for AddWordsB
37       bytes for AddWordsC
24       bytes for AddWordsD
48       bytes for AddWordsE
32       bytes for AddWordsF
40       bytes for AxAddWordsInt


jj2007

Quote from: dedndave on February 24, 2011, 02:50:58 AM
sometimes, the old ways are the best ways (Brion James - Fifth Element)

Integer addition is faster, but in my real life app, I need the result on the FPU after each addition ;-)

dedndave

then, FILD the result   :bg

oh - after EACH step - gotcha

Antariy

Quote from: jj2007 link=topic=16158.msg133595#msg133595
Integer addition is faster, but in my real life app, I need the result on the FPU after each addition ;-)


FPU pieces added.

For first code: results are 64 bit width EDX:EAX - i.e. unlimited number of the WORDs to process is possible.

For second code: this code is much more interesting in a therms of testing across different hardware with testing for correctness of the results.
Accumulator FPU reg is used in a way of SIMD-like - going addition of two DWORDs in one QWORD part of TBYTE. Each DWORD is a casted WORD from a table.
This way have limitation about 131074 WORDs to process is possible.

Both the code is unsigned.

Results:

424295  cycles for AddWordsA
734488  cycles for AddWordsB
730655  cycles for AddWordsC
541213  cycles for AddWordsD
160005  cycles for AddWordsF
106321  cycles for AxAddWordInt
361570  cycles for AxAddWordsFPU1
2092869 cycles for AxAddWordsFPU2
1200000000      result for FPU1 code - should be 1200000000
1200000000      result for FPU2 code - should be 1200000000

385801  cycles for AddWordsA
721699  cycles for AddWordsB
722357  cycles for AddWordsC
549136  cycles for AddWordsD
160099  cycles for AddWordsF
106310  cycles for AxAddWordInt
361766  cycles for AxAddWordsFPU1
2100187 cycles for AxAddWordsFPU2
1200000000      result for FPU1 code - should be 1200000000
1200000000      result for FPU2 code - should be 1200000000

35       bytes for AddWordsA
36       bytes for AddWordsB
37       bytes for AddWordsC
24       bytes for AddWordsD
47       bytes for AxAddWordsFPU1
55       bytes for AxAddWordsFPU2
32       bytes for AddWordsF
40       bytes for AxAddWordsInt


Note the second code - not timings but result is interesting.

oex


AMD Sempron(tm) Processor 3100+ (SSE3)
271414  cycles for AddWordsA
286499  cycles for AddWordsB
287004  cycles for AddWordsC
243164  cycles for AddWordsD
122597  cycles for AddWordsF
118502  cycles for AxAddWordInt
246886  cycles for AxAddWordsFPU1
789107  cycles for AxAddWordsFPU2
1200000000      result for FPU1 code - should be 1200000000
1200000000      result for FPU2 code - should be 1200000000

244289  cycles for AddWordsA
286729  cycles for AddWordsB
287219  cycles for AddWordsC
241594  cycles for AddWordsD
123074  cycles for AddWordsF
118730  cycles for AxAddWordInt
246799  cycles for AxAddWordsFPU1
790455  cycles for AxAddWordsFPU2
1200000000      result for FPU1 code - should be 1200000000
1200000000      result for FPU2 code - should be 1200000000

35       bytes for AddWordsA
36       bytes for AddWordsB
37       bytes for AddWordsC
24       bytes for AddWordsD
47       bytes for AxAddWordsFPU1
55       bytes for AxAddWordsFPU2
32       bytes for AddWordsF
40       bytes for AxAddWordsInt

--- ok ---
We are all of us insane, just to varying degrees and intelligently balanced through networking

http://www.hereford.tv

Antariy


Antariy

Quote from: Antariy on February 24, 2011, 04:33:15 PM
For second code: this code is much more interesting in a therms of testing across different hardware with testing for correctness of the results.

So, nobody else want to participate in test of unusual approach?

jj2007

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
325567  cycles for AddWordsA
322822  cycles for AddWordsB
341571  cycles for AddWordsC
310839  cycles for AddWordsD
120927  cycles for AddWordsF
90783   cycles for AxAddWordInt
332415  cycles for AxAddWordsFPU1
605996  cycles for AxAddWordsFPU2
1200000000      result for FPU1 code - should be 1200000000
1200000000      result for FPU2 code - should be 1200000000


AxAddWordInt is pretty fast indeed :U

Antariy

Quote from: jj2007 on February 25, 2011, 05:32:58 AM
AxAddWordInt is pretty fast indeed :U

Most interesting for me is the results (correctness) of the AxAddWordFPU2 proc. Have a look into it - it is interesting code maybe because it uses one FPU reg as holder of two DWORDs accumulators.
I.e. - MMX like stuff but still with FP reg :bg

FORTRANS

G:\WORK\TEMP>addwords
pre-P4 (SSE1)
557852  cycles for AddWordsA
490449  cycles for AddWordsB
495974  cycles for AddWordsC
303239  cycles for AddWordsD
130264  cycles for AddWordsF
121405  cycles for AxAddWordInt
326911  cycles for AxAddWordsFPU1
605932  cycles for AxAddWordsFPU2
1200000000      result for FPU1 code - should be 1200000000
1200000000      result for FPU2 code - should be 1200000000

375607  cycles for AddWordsA
495827  cycles for AddWordsB
523588  cycles for AddWordsC
303198  cycles for AddWordsD
130102  cycles for AddWordsF
121261  cycles for AxAddWordInt
328075  cycles for AxAddWordsFPU1
606087  cycles for AxAddWordsFPU2
1200000000      result for FPU1 code - should be 1200000000
1200000000      result for FPU2 code - should be 1200000000


Hi,

   Is there a reason for the different timing with AddWordsA?

Regards,

Steve N.