News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

multibyte parallel mods, with zero byte detection

Started by dsouza123, November 12, 2006, 12:36:00 AM

Previous topic - Next topic

dsouza123

The algorithm increments by x each byte in an array
then mods each byte by an associated max byte in a second array
and counts the resulting zeroes.

Some of it can be done in parallel
the bytes 1, 3, 2, 3 are incremented by 2
by adding 2, 2, 2, 2 in a dword
but the problem is figuring a way
to do the mods, ie subtractions, when needed,
and count the zeroes, if any, in parallel.

Maybe there is someway to use either AND or XOR
to do the zero count in parallel.

For example the bytes 1, 3 with max bytes 5, 7.
3 = 1 + 2, 5 = 3 + 2  both < 5, 7 so no 0's
7 = 3 + 4, 9 = 5 + 4  both > 5, 7 so sub 5, 7
2 = 7 - 5, 2 = 9 - 7
4 = 2 + 2, 4 = 2 + 2  both < 5, 7 so no 0's
8 = 4 + 4, 8 = 4 + 4  both > 5, 7 so sub 5, 7
3 = 8 - 5, 1 = 8 - 7
5 = 3 + 2, 3 = 1 + 2  first = 5, second < 7 so sub 5
0 = 5 - 5   anycnt+1, allcnt + 1
4 = 0 + 4, 7 = 3 + 4  first < 5, second = 7 so sub 7
0 = 7 - 7   anycnt+1, allcnt + 1
6 = 4 + 2, 2 = 0 + 2  first > 5, second < 7 so sub 5
1 = 6 - 1
With all 16 sometimes more than 1 byte becomes 0 in an iteration.

anycnt is incremented by 1 if there are >= 1 zeroes in an iteration (of the array)
allcnt is increased by the count of zeroes in the iteration

When all the iterations are done anycnt, allcnt
along with the seconds in kernel and user time are displayed
12345678 means 1.2+ seconds

anycnt and allcnt are shown to validate the algorithm (compared with known values)
kernel and user time are for benchmarking

for example 10 000 000 iterations
6052595 = anycnt
8805234 = allcnt

Here is most of the code, the full code is in the attachment.

.data
align 16
  actarr db   1,  3,  2,  3,  16,  6,  9, 48,  14,  8, 42, 27,  19, 15,  7, 40
  maxarr db   5,  7, 11, 13,  17, 19, 23, 53,  29, 31, 59, 37,  41, 43, 47, 61

  zeroes dd 0         ;  esi used instead
  plus24 db 2,2,2,2   ;  edi used instead
  anycnt dd 0
  allcnt dd 0

  times  dd 10000000  ; iterations

.code
start:
   mov edi, dword ptr plus24     ; use edi for plus24
   mov ebx, times                ; use ebx for times

alliters:
   mov eax, dword ptr actarr+0
   xor esi, esi                  ; use esi for zeroes
   mov edx, dword ptr maxarr+0   ; get max for dword+0
   xor ecx, ecx                  ; use ecx to count dwords in the array

manybites:
   add eax, edi  ; add eax, plus24
spot:
   cmp  al, dl   ;  a < m    x < 5
   jc  @F        ; al < dl   skip sub 5
   sub  al, dl   ; al >= dl do sub, set zero flag if result is 0
   jnz @F        ; if not a zero  skip inc zeroes
   inc esi       ; inc zeroes
@@:
   cmp  ah, dh   ;  a < m    6 > 5
   jc  @F        ; ah > dh so do next instruction
   sub  ah, dh   ; do subtract ah = 1
   jnz @F
   inc esi
@@:
   ror eax, 16   ; swap the higher two bytes with the lower two
   ror edx, 16   ; bswap could replace ror, for both cases

   cmp  al, dl   ;  a < m   ex 4 < 5
   jc  @F        ; al < dl  skip
   sub  al, dl
   jnz @F
   inc esi
@@:
   cmp  ah, dh   ;  a < m  ex  5 == 5
   jc  @F        ; ah == dh
   sub  ah, dh   ; do subtract ah = 0
   jnz @F
   inc esi       ; do inc zeroes
@@:
   ror eax, 16   ; put it back in correct order,  ; bswap  eax, if used before

   mov dword ptr [actarr+ecx*4], eax    ; save it/update spot in the array
   inc ecx
   cmp ecx, 4    ; 4 == 4 dwords/16 bytes  0..3 when ecx == 4 done
   jz  @F
   mov eax, dword ptr [actarr+ecx*4]    ; get next actarr dword
   mov edx, dword ptr [maxarr+ecx*4]    ; get next maxarr dword
   jmp manybites
@@:
   cmp esi, 0      ; cmp zeroes, 0      ; theory max Number of bytes
   jz  @F
   inc anycnt      ; inc totals
   add allcnt, esi ; add allcnt, zeroes
@@:
   xor edi, 06060606h   ; xor dword ptr plus24, 06060606h  switch 2 to 4 or 4 to 2
   dec ebx         ; dec times
   jnz alliters


For each iteration (10M for testing purposes):
Each byte is incremented by the current iteration amount, 2 or 4
If an individual byte in actarr is >= it's match in maxarr
  then it is reduced by the match in maxarr.
If after the subtraction (reduction) it is 0
  then the count of zeroes (in actarr) is incremented.

At the end of each iteration (processing of all 16 bytes)
  If the count of zeroes >= 1 then anycnt+1, ie if any==0 then anycnt + 1
  The count of zeroes is added to allcnt, ie allcnt + zeroes
  The 2 or 4 is switched.
  count of zeroes = 0

Any alternatives/optimizations for the doing
the compare with max, sub, check if 0 in parallel
instead of doing one byte at a time
or any other optimizations.

Any ideas on using the variables zeroes, plus24, times
freeing up the registers esi, edi, ebx
but hiding the memory access times.

Maybe someone can think of alternative algorithms.

Remember it doesn't matter the order things are done
it doesn't have to be on bytes 0 to 15 in order.

The steps don't have to be done on one byte at a time
the bytes are separate there are no dependencies,
ex byte 11 and 12 are independant.

The add 2 step could be done to all 16 bytes before doing
any compares with the associated max bytes
the subtractions of max if needed
the adding to count of zeroes.

This is the fastest of the different versions I've coded,
only CPU (ALU) instructions are used so it should work on any PC.
Delphi source code, done one byte at a time, is at the end of the MASM code.

A SSE2 version is nearly working, haven't looked into MMX or 3DNow or SSE1.

[attachment deleted by admin]

dsouza123

A puzzle, the attached SSE2 version, gets the correct anycnt and allcnt
but only if movd isn't used.

It is being tested with the Bochs PC emulator running 98SE with an emulated Athlon 64 CPU
because my PC doesn't have SSE2 instructions.

I can't tell if there is a problem with the emulation/support of
the movd instruction or a bug in the program or assembler 6.15.

The work around is writing the SSE2 register to memory and
then putting the low word into esi.

   ; movd      esi, xmm7     ; movd zeroes, xmm7   THIS is a problem
;------ in place of movd ----------
   movdqa   xeroes, xmm7  ; xeroes is a oword, really 16 bytes
   mov esi, dword ptr xeroes  ; movzx esi, word ptr xeroes    could replace both mov + and
   and esi, 0FFFFh            ; only the low word is valid data
;------ in place of movd -----------

It, 98SE, doesn't support the GetProcessTimes function
so the timing is unknown, even if it worked it wouldn't be right
because time doesn't flow at the same rate in an emulated system.

Attached:
SSE2eqD.asm  version with work around instructions
SSE2eqDX.asm  version with movd esi, xmm7   crashes on emulator
along with the exes.

Could someone test both of these versions along with the CPU version
from the previous post for timing and functionality (does it work and get correct answers).
10000000 iterations
6052595 = anycnt
8805234 = allcnt

System:
Athlon 1171 Mhz XP SP2

Timings:
CPU 11716848   ; 1.17+ seconds
SSE2 0     ; emulator + 98SE doesn't support GetProcessTimes
XSSE2 crash

Thank You
Dan

[attachment deleted by admin]

EduardoS

On A64@2.0GHz anycnt and allcnt matches on both,
For SSE2eqD.exe:
first run - kernel time = 0, user time = 1406250;
second run - kernel time = 156250, user time = 1562500;
SSE2edDX.exe
first run - kernel time = 156250, user time = 1562500;
second run - kernel time = 156250, user time = 1718750;

dsouza123

Thank you EduardoS

The results are interesting in a number of ways,

That the X version ran indicates an issue
with the emulator for the movd instruction or 98SE.

On a slightly modified CPU/ALU version ModZeroE.asm attached
Athlon@1171 Mhz
(the dword variable times is used directly instead of put in ebx)
the best times from half dozen runs
kernel time = 0, user time = 11316272
kernel time = 0, user time = 11516560
The previous 0, 11716848 was the from this version.

The best times from the previous version ModZeroD.asm
kernel time = 0, user time = 12618144
kernel time = 0, user time = 12818432
so the slight change of using less registers/more memory access (counter intuitive)
with the side effect of accompanying re-alignments of instructions (maybe more important)
has an 11% reduction.

Going from 1.171Ghz to 2.0Ghz (and to a later generation architecture)
along with CPU/ALU code to SSE2 code
has an 88% reduction, it only takes 12% of the time.

If possible could you run a CPU/ALU version also
to compare ALU and SSE2 on the same system ?

[attachment deleted by admin]

EduardoS


dsouza123

Thank you again EduardoS

I ran the two SSE2 versions on a P3 that has SSE1 only
both ran suprisingly but the values for anycnt and allcnt were lower
anycnt == 5184951, allcnt == 6844904
it must work by only doing the lower 64 bits/8 bytes
instead of the full 128/16 bytes in SSE2.

SSE2 has a 66 prefix on some (all ?) instructions
they must match the equivalent SSE1 without the 66,
maybe the 66 is ignored.

The OS is one of the 9x/ME so it doesn't support GetProcessTimes.

EduardoS

In P3 or Athlon XP integer SSE2 instrunctions are interpreted as MMX.

EduardoS

Hi again dsouza,
I changed a little the SSE code... A bit faster :toothy

[attachment deleted by admin]

dsouza123

EduardoS, That was ingenious ! 

Faster, smaller and fewer SSE2/MMX registers used.

What was originally intended to be actarr >= maxarr but was coded  actarr > maxarr - 1
was turned into maxarr > actarr  followed by  NOT + AND then SUB

What was intended to be actarr = 0 but was actarr(before SUB) = maxarr
was turned into actarr = 0 then INC (FF -> 0, 0 -> 1) then ABS(v - 1) ( 1 - 1 ->
  • , 0 - 1 ->[-1])   ABS(Dest - Src)

    Also pextrw, pextrw+4, CPU add instead of shuffle, add (SSE2/MMX), movd

    The cmp 0, skip == 0, inc, add became cmp 1, sbb -1, add  which got rid of a conditional jump.

    Well done.

dsouza123

Increasing the parallelism by full CPU use and finding more zeros.

Since 16 bytes was a success, using more bytes should find even more zeros,
and it does, nearly a million more !

The results for 10M iterations using 52 bytes:

6 979 124 iterations with at least one zero
11 503 055 total zeros

the 16 byte version results were
6 052 595 iterations with at least one zero
8 805 234 total zeros

an increase of 926 529 iterations with at least one zero

Here are the expanded arrays:
.data
align 16
  actarr db   1,  3,  2,  3,  16,  6,  9, 48,  14,  8, 42, 27,  19, 15,  7, 40
         db  34, 30, 28, 22,  18, 12,  4,  0, 101,101,101,101, 101,101,101,101
         db 101,101,101,101, 101,101,101,101, 101,101,101,101, 101,101,101,101
         db 101,101,101,101,   1,  1,  1,  1,   1,  1,  1,  1,   1,  1,  1,  1

  maxarr db   5,  7, 11, 13,  17, 19, 23, 53,  29, 31, 59, 37,  41, 43, 47, 61
         db  67, 71, 73, 79,  83, 89, 97,101, 103,107,109,113, 127,131,137,139
         db 149,151,157,163, 167,173,179,181, 191,193,197,199, 211,223,227,229
         db 233,239,241,251,   6,  6,  6,  6,   6,  6,  6,  6,   6,  6,  6,  6

It has only one change to the routine, using 13 instead of 4 for the dwords to evaluate.

An extra summation verfication check is added after the calculations
and displayed with the other results.


Now that various instruction set versions exist,
an ideal method is using all three register sets together.

The first two 16 byte rows can be done by SSE2,
the third row broken into two 8 byte sections can be done by MMX,
and the last 4 bytes by the ALU.

The instructions from SSE2, MMX, ALU can be interleaved
allowing the three processing units to work concurrently !

The value 6 is a special filler, it will never produce a 0 when started with 1, 
the sequence started with +2 goes 1, 3, 7, 1, 3, 7 etc or if started with +4  1, 5, 7, 1, 5, 7 etc.

By having a complete fourth 16 byte row a pure SSE2 version (or MMX version)
can be tested without distorting the results.

The code attached is a modified ALU version that does 52 bytes and has an added verification check
the sum of all 52 actarr bytes is displayed.

A use for this is as a multi stride composite number identifier,
if a zero byte is found it means the number is divisible by the matching maxarr byte.
The multi stride is because it alternates advancing by 2 or 4.

[attachment deleted by admin]

EduardoS

Sorry if this disapoint you but...
Increasing the number of bytes may not increase the speed, MMX and SSE instructions share the same execution units (in fact, internally the Athlon 64 splits SSE instructions into two MMX/3DNow! instructions), just using diferent registers.
In the Athlon 64 the decoders have a limit of 3 MacroOps per cicle, FADD and FMUL one MacroOp each (each packed SSE instruction = 2 MacroOps, so 1 SSE instruction per clock as maximum throughput), my last code was getting 2.3 macroOps per cicle, almost 2 of them was SSE and the 0.3 left was the few "general" instructions, so the SSE units are full and i can't run more SSE instructions in paralell, there are 0.7 MOps/cicle free on decoders and integers units are almost IDLE, so you may try to put some integers instructions between them to count few more zeros per cicle, however this extra instructions may "polute" the pipeline and slow down the SSEs...
Achieving 3 MOps/cicle is a very hard and fun game  :wink

dsouza123

The concurrency wouldn't be all five at once SSE2, SSE2, MMX, MMX, ALU but a maximum of three SSE2, MMX, ALU.
Two versions are attached an ALU only with registers consistent with the new SSE2, MMX, ALU version.
It is only coarse interleaving of blocks of SSE2, MMX, SSE2, MMX, ALU instead fine interleaving of instructions.
If both versions are working correctly, the 1st, 2nd and 5th numbers on the results should be the identical.

Two other versions will be done a 4 SSE2 version and a 7 MMX version to match the 13 ALU version,
all will do the minimum number dwords/four bytes for their register size, allowing for more detailed comparisons
with the mixed unit versions.

Both the pure SSE2 and MMX versions would only be able to hold two copies in registers,
so reads and writes to memory are necessary like the pure ALU version.

On the Intel Core 2 processors :  Core 2 Duo, Core 2 Quad, Xeon 51xx, Xeon 53xx work on whole 128 bit registers.
They are able to work on 3 SSE2 at once.
http://www.realworldtech.com/page.cfm?ArticleID=RWT030906143144&p=1
http://www.realworldtech.com/includes/images/articles/merom-2.gif

The upcoming AMD Athlons (Barcelona editions) will also work on whole 128 bit registers.
http://www.channelinsider.com/pages/print_article.aspx?articleid=191008&pagetype=print_article

[attachment deleted by admin]

dsouza123

The ALU version got a bug introduced when it the registers were made consistent,
it should have been  xor ebx, 06060606h  but had ecx in error.

The repaired version is attached.

The pure MMX version is inconsistent it works correctly upto 24 bytes
but past that the results are off.  There maybe some subtle issue/side effect
of the MMX instructions used.

It is also attached.

[attachment deleted by admin]

dsouza123

Work around for MMX (should also apply to SSE2 but untried).

The issue pcmpgtb (packed compare for greater than with signed bytes)
is a signed byte comparison, so if maxarr is >= 128 and actarr < 128, actarr would seem bigger
solution xor both actarr and temp maxarr before the comparison
then xor actarr again after for a correct psubb.

[attachment deleted by admin]

dsouza123

Pure SSE2 version, like the pure MMX version it is based on, uses all SIMD 8 registers.
The only memory accesses in the 10M iterations
are writing old actarr and reading new actarr and maxarr and decrementing times.

It works on 1 actarr and maxarr row at a time,
instead of loading 2 rows (with the trade off of more memory access for lack of registers)
in the SSE2, MMX, ALU version.

Untested but should work correctly.

Both the pure SSE2 and the more optimized (8 register) pure MMX it is based on are attached.
Also a work around for the signed comparison SSE2, MMX, ALU version untested but should work correctly.

All versions should do 10M iterations and end with
6979124 iterations with one or more zeros
11503055 total zeros
3180 for the total of the 52 bytes in actarr when done  (actarr has 64 but 12 are filler).

[attachment deleted by admin]