MMX and XMM Graphics Code Listing and Question

Started by OceanJeff32, February 07, 2005, 06:07:04 AM


OceanJeff32

 ::)

Here is the code that I've written recently.  It is a modification of the original MMX code: first I optimized (??) the MMX code, then I tried the same thing with XMM. I thought it would be faster, but it was not faster at all, at least on my computer.

; -------------------------------------------------------------------------
Blur_MMX2 PROC                 ; 24bit color version
   mov edi,bitmap2            ; (Developed under an old SiS6326 graphic card
   mov esi,bitmap1            ;  which prefers 24bit for faster operation)
   mov bitmap1,edi            ;  Note: SiS315 is excellent, good rendering quality
   mov bitmap2,esi

; Set Up Row Displacements
   mov eax, maxx      ; EAX = MAXX
   lea eax, [eax+eax*2]      ; EAX = MAXX * 3
   mov ebx, eax         ; EBX = MAXX * 3
      MOV EDX, EBX
      NEG EDX
      imul maxy         ; EAX = MAXX * MAXY * 3 (BITMAP SIZE with 24-bits per pixel)
   push eax         ; this PUSHes EAX onto stack for Reference in loop
      

; Set up Shift and Fade Mask
; SHIFT MASK = 81h ; 8 = 1000 (4 bits) 1 = 0001 (4 bits) [64-bit mask is 8 bytes]
;   mov eax, 0FF3FFF3FH      ; SUBtracting this will change MMX WORD coMMands to BYTE;
;   mov [ebp-4], eax      ; and will fade the
;   mov [ebp-8], eax      ; and you need two of them to make an MMX register
;   mov [ebp-12], eax      ; and will fade the
;   mov [ebp-16], eax      ; and you need two of them to make an MMX register
   movdqu XMM7, [fadelvl]      ; Then load the SHIFT FADE MASK into XMM7

   xor eax,eax         ; set EAX to ZERO, because it's used in loop EAX
@@:
   MOVDQU XMM0, [esi+32]      ; load XMM registers with bitmap data.
   MOVDQU XMM1, [esi]
   MOVDQU XMM2, [esi-32]
   MOVDQU XMM3, [esi+EDX+16]
   MOVDQU XMM4, [esi+ebx+16]
   MOVDQU XMM5, [esi+EDX-16]
   MOVDQU XMM6, [esi+ebx-16]

   PSRLW   XMM0, 2         ; shift right logical ea WORD every register (1 bit)
   PSRLW   XMM1, 2         ; in essence dividing every WORD by TWO, we want to
   PSRLW   XMM2, 2         ; divide every BYTE by TWO, but it won't let us...so
   PSRLW   XMM3, 2         ; that's where the next set of coMMands helps.
   PSRLW   XMM4, 2
   PSRLW   XMM5, 2
   PSRLW   XMM6, 2

      PAND      XMM0, XMM7
      PAND      XMM1, XMM7
      PAND      XMM2, XMM7
      PAND      XMM3, XMM7
      PAND      XMM4, XMM7
      PAND      XMM5, XMM7
      PAND      XMM6, XMM7
     
    PADDB XMM0, XMM1
    PADDB XMM0, XMM4
    PADDB XMM0, XMM3

    PADDB XMM2, XMM5
    PADDB XMM2, XMM1
    PADDB XMM2, XMM6
   
   MOVDQU   [edi+eax+16], XMM0   ; XMM0 is result for [edi+eax (offset) +16]
   MOVDQU   [edi+eax-16], XMM2   ; XMM2 is corresponding result for opp side of EDI+EAX

   ; all XMM registers are free now except XMM7 which still contains SHIFT FADE MASK

   MOVDQU XMM1, [esi+16]      ; 2nd round, this time we're doing the middle 128 bits
   MOVDQU XMM2, [esi-16]      ; if you've been following the offset math
   MOVDQU XMM3, [esi+ebx]
   MOVDQU XMM4, [esi+EDX]

   PSRLW   XMM1, 2
   PSRLW   XMM2, 2
   PSRLW   XMM3, 2
   PSRLW   XMM4, 2

    PAND XMM1, XMM7
    PAND XMM2, XMM7
    PAND XMM3, XMM7
    PAND XMM4, XMM7

   PADDB   XMM1, XMM2
   PADDB   XMM1, XMM3
   PADDB   XMM1, XMM4

   MOVQ   [edi+eax], XMM1

   lea esi,[esi+48]
   lea eax,[eax+48]
   cmp eax,[esp]         ; stack points to maximum size of BITMAP, because
               ; EAX was PUSHed before the loop w/ max size of BITMAP
   jbe @B
   pop eax            ; this clears stack pointer from pointing at loop control
   eMMs            ; empty MMX registers (and FPU x87 at same time)
   ret            ; return from this procedure call
Blur_MMX2 ENDP
; -------------------------------------------------------------------------
; -------------------------------------------------------------------------
Blur_MMX2 PROC                 ; 24bit color version
   mov edi,bitmap2            ; These are the two bitmap PAGES
   mov esi,bitmap1           
   mov bitmap1,edi           
   mov bitmap2,esi

; Set Up Row Displacements
   mov eax, maxx      ; EAX = MAXX
   lea eax, [eax+eax*2]      ; EAX = MAXX * 3
   mov ebx, eax         ; EBX = MAXX * 3
      MOV EDX, EBX
      NEG EDX
      imul maxy         ; EAX = MAXX * MAXY * 3 (BITMAP SIZE with 24-bits per pixel)
   push eax         ; this PUSHes EAX onto stack for Reference in loop
      

; Set up Shift and Fade Mask
; SHIFT MASK = 81h ; 8 = 1000 (4 bits) 1 = 0001 (4 bits) [64-bit mask is 8 bytes]
   mov eax, 0FF3FFF3FH      ; SUBtracting this will change MMX WORD coMMands to BYTE
   mov [ebp-4], eax      ; and will fade the
   mov [ebp-8], eax      ; and you need two of them to make an MMX register
   movq MM7, [ebp-8]      ; Then load the SHIFT FADE MASK into MM7
;   lea esi,[esi-8]      ; Point at Beginning of BITMAP. might need to be -8?
   xor eax,eax         ; set EAX to ZERO, because it's used in loop EAX
@@:
   MOVQ   MM0, [esi+16]      ; load MMX registers with bitmap data.
   MOVQ   MM1, [esi]
   MOVQ   MM2, [esi-16]
   MOVQ   MM3, [esi+EDX+8]
   MOVQ   MM4, [esi+ebx+8]
   MOVQ   MM5, [esi+EDX-8]
   MOVQ   MM6, [esi+ebx-8]

   PSRLW   MM0, 2         ; shift right logical ea WORD every register (1 bit)
   PSRLW   MM1, 2         ; in essence dividing every WORD by TWO, we want to
   PSRLW   MM2, 2         ; divide every BYTE by TWO, but it won't let us...so
   PSRLW   MM3, 2         ; that's where the next set of coMMands helps.
   PSRLW   MM4, 2
   PSRLW   MM5, 2
   PSRLW   MM6, 2

      PAND      MM0, MM7
      PAND      MM1, MM7
      PAND      MM2, MM7
      PAND      MM3, MM7
      PAND      MM4, MM7
      PAND      MM5, MM7
      PAND      MM6, MM7
     
    PADDB MM0, MM1
    PADDB MM0, MM4
    PADDB MM0, MM3

    PADDB MM2, MM5
    PADDB MM2, MM1
    PADDB MM2, MM6
   
   MOVQ   [edi+eax+8], MM0   ; MM0 is result for [edi+eax (offset) +8]
   MOVQ   [edi+eax-8], MM2   ; MM2 is corresponding result for opp side of EDI+EAX

   ; all MMx registers are free now except MM7 which still contains SHIFT FADE MASK

       MOVQ   MM1, [esi+8]      ; 2nd round, this time we're doing the middle 64-bits
   MOVQ   MM2, [esi-8]      ; if you've been following the offset math
   MOVQ   MM3, [esi+ebx]
   MOVQ   MM4, [esi+EDX]

   PSRLW   MM1, 2
   PSRLW   MM2, 2
   PSRLW   MM3, 2
   PSRLW   MM4, 2

    PAND MM1, MM7
    PAND MM2, MM7
    PAND MM3, MM7
    PAND MM4, MM7

   PADDB   MM1, MM2
   PADDB   MM1, MM3
   PADDB   MM1, MM4

   MOVQ   [edi+eax], MM1

   lea esi,[esi+24]
   lea eax,[eax+24]
   cmp eax,[esp]         ; stack points to maximum size of BITMAP, because
               ; EAX was PUSHed before the loop w/ max size of BITMAP
   jbe @B
   pop eax            ; this clears stack pointer from pointing at loop control
   eMMs            ; empty MMX registers (and FPU x87 at same time)
   ret            ; return from this procedure call
Blur_MMX2 ENDP
; -------------------------------------------------------------------------

My question is: can you make this code even better?  The MMX version looks crisp and clear, but the XMM version fills the screen with white noise all over the place. Did I make a mistake somewhere? (Such an innocent question for an assembly programmer?)

Anyways, let me know, or if not, let me hear something...

I'm continuing to work on it, and I'm just doing this for fun, my next project that I also plan to post when done is a complete Particle System for Windows GDI (with this MMX blur and hopefully XMM blur too).

Later guys...gals...??

Jeff C
:toothy
Any good programmer knows, every large and/or small job, is equally large, to the programmer!

UncannyDude

Sorry, but I cannot help you with what's going wrong.

From the performance viewpoint, the memory accesses (when reading) are scattered and sometimes backwards. This severely hurts efficiency.

Convert things like:
   MOVDQU XMM0, [esi+32]      ; load XMM registers with bitmap data.
   MOVDQU XMM1, [esi]
   MOVDQU XMM2, [esi-32]

to:
   MOVDQU XMM2, [esi-32]
   MOVDQU XMM1, [esi]
   MOVDQU XMM0, [esi+32]      ; load XMM registers with bitmap data.


It would be better if you could stick to linear, forward memory accesses.

Cheers,

U.

Mark_Larson


I haven't had a chance to look at it closely, since I have to get ready for work.  However, I did notice one glaring mistake that will affect your performance.  XMM is very sensitive to using aligned data; not using aligned data causes a big performance hit.  Get rid of the MOVDQU (which is used for reading unaligned data from memory), align your data appropriately (all data read must be aligned on a 16-byte boundary), and use MOVDQA.  Ignoring the stalls from reading unaligned data from memory, MOVDQU takes 10 cycles.  MOVDQA takes 6.  See the difference?  MOVDQU is almost twice as slow.  Get that working and repost this on The Laboratory and I'll show you some more tricks to speed up your code.
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
htttp://www.website.masmforum.com/mark/index.htm

OceanJeff32

Uncanny dude:  From a bitmap perspective I am loading the surrounding four pixels from a bitmap, computing the average and storing that in the corresponding bitmap (on the other page to be displayed).  I compute the pixels for ESI+16 (64-bits) and ESI-16 (64-bits), and then the next code loads the four pixels around ESI itself.  Thus, at the end of the loop I move 16*3 = 48 bytes into the bitmap.  (You'll notice the MMX version does half that.)

Mark Larson:  Um...I guess I need to learn how to align my data, I'll look into that, and get back to you.

I think I remember some directive about that... let me guess .align   :dance:

P.S. Also I forgot to mention that MMX code works (it just looks like really fast fading of the fireworks effect), the XMM code fills the screen with white noise and otherwise a complete mess...I was wondering if there's something else to the conversion from MMX to XMM that I'm missing.

P.P.S. The full code for the program can be found in THE CAMPUS: "a little fire using gdi+" by robert collins,
or at the following web site, which also has other neat MMX graphics code, including MMX BLUR N SWIRL, which is a neat little project.

http://www.ronybc.8k.com
Thanks guys,

Jeff C  :green
:8)

Mark_Larson

 I found one bug in your XMM version.


PADDB   XMM1, XMM2
   PADDB   XMM1, XMM3
   PADDB   XMM1, XMM4

   MOVQ   [edi+eax], XMM1    ;needs to be a MOVDQU

   lea esi,[esi+48]
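
Why the MOVQ matters: with an XMM source, MOVQ writes only the low 8 bytes of the register, so the upper 8 bytes of that 16-byte slot keep whatever was there before. A minimal sketch in plain C that models the two stores with memcpy (the `xmm_model` type and the function names are made up for illustration; this is a model, not SSE code):

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a 128-bit XMM register as 16 raw bytes. */
typedef struct { uint8_t b[16]; } xmm_model;

/* Models MOVQ [dst], xmm: only the low 8 bytes are written. */
void store_movq(uint8_t *dst, const xmm_model *r) {
    memcpy(dst, r->b, 8);
}

/* Models MOVDQU [dst], xmm: all 16 bytes are written. */
void store_movdqu(uint8_t *dst, const xmm_model *r) {
    memcpy(dst, r->b, 16);
}

/* Counts how many of the 16 destination bytes still hold stale data
   after one store. A nonzero 'full' selects the MOVDQU model. */
int stale_bytes_after_store(int full) {
    uint8_t dst[16];
    xmm_model r;
    int i, stale = 0;

    memset(dst, 0xAA, sizeof dst);   /* stale data from an earlier frame */
    memset(r.b, 0x11, sizeof r.b);   /* freshly blurred pixel bytes */

    if (full) {
        store_movdqu(dst, &r);
    } else {
        store_movq(dst, &r);
    }
    for (i = 0; i < 16; i++) {
        if (dst[i] == 0xAA) {
            stale++;
        }
    }
    return stale;
}
```

In this model the MOVQ store leaves 8 of every 16 middle bytes untouched, which would fit the noise described above, since the loop advances 48 bytes per pass.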


  You also could use a facelift optimization-wise on your fireworks code.  There are lots of little places where you can write more optimized code.  Here is an example.  In FShell_recycle you load X twice and Y twice and perform 2 operations apiece.  Instead, try loading them once each and doing both operations at the same time.  Here is the original unmodified code.

FShell_recycle PROC hb:DWORD, x:DWORD, y:DWORD
    mov edi,hb
    mov eax,x
    mov [edi+EXX],eax
    mov eax,y
    mov [edi+EXY],eax
    mov eax,x
    mov lightx,eax             ; Light last one
    mov eax,y
    mov lighty,eax


This changes to


FShell_recycle PROC hb:DWORD, x:DWORD, y:DWORD
    mov edi,hb
    mov eax,x
    mov [edi+EXX],eax
    mov lightx,eax             ; Light last one
    mov eax,y
    mov [edi+EXY],eax
    mov lighty,eax


That still is not optimal.  On modern processors you can execute instructions in parallel.  With register renaming, using EAX twice will probably get fixed up, but just in case, do the following.  It does two things.  One, it allows the ALU instructions to execute in parallel.  Two, it helps break up dependencies.  Doing a "mov eax,x" followed by a "mov [edi+EXX],eax" produces a stall: it has to wait for EAX to load with the correct value before doing the "mov [edi+EXX],eax".  The following code helps break up dependencies more.  Just glancing at your code I saw quite a few dependencies.  Try not to use a register on the instruction right after setting it up.

FShell_recycle PROC hb:DWORD, x:DWORD, y:DWORD
    mov edi,hb
    mov eax,x
    mov ebx,y
    mov [edi+EXX],eax
    mov [edi+EXY],ebx
    mov lightx,eax             ; Light last one
    mov lighty,ebx



Mark_Larson

  ROFL.  I just realized that link wasn't your webpage.  I thought it was.  If you talk to that person I found a memory leak bug in their code.  They allocate memory like so.  Notice what the author of the code saves in bitmap1 and bitmap2.  They add 4096 to the pointer first.  Bad.



    invoke GetProcessHeap
    mov hHeap,eax
    invoke HeapAlloc,hHeap,HEAP_ZERO_MEMORY,4194304
    add eax,4096               ; blur: -1'th line problem
    mov bitmap1,eax
    invoke HeapAlloc,hHeap,HEAP_ZERO_MEMORY,4194304
    add eax,4096               ; blur: -1'th line problem
    mov bitmap2,eax


Here is the author's code to free the memory.

invoke HeapFree,hHeap,0,bitmap1
invoke HeapFree,hHeap,0,bitmap2


bitmap1 and bitmap2 won't be pointing to the allocated memory block because they were modified. 
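
One way to avoid this is to keep the pointer the allocator returned separate from the offset pointer the blur code uses, and free only the former. A minimal sketch in C, using calloc/free as stand-ins for HeapAlloc/HeapFree; the `guarded_bitmap` struct and function names are hypothetical and not from the fireworks source:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical wrapper: keeps the allocator's pointer apart from the
   offset pointer that the blur loop works with. */
typedef struct {
    uint8_t *base;    /* what the allocator returned; pass this to free() */
    uint8_t *pixels;  /* base + guard; what the blur loop reads and writes */
} guarded_bitmap;

/* Same 4096-byte lead-in the original adds for the "-1'th line". */
static const size_t GUARD_BYTES = 4096;

/* Allocates a zeroed buffer and records both pointers.
   Returns 1 on success, 0 on failure. */
int guarded_bitmap_alloc(guarded_bitmap *bm, size_t total_bytes) {
    bm->base = NULL;
    bm->pixels = NULL;
    if (total_bytes <= GUARD_BYTES) {
        return 0;
    }
    bm->base = calloc(total_bytes, 1);
    if (bm->base == NULL) {
        return 0;
    }
    bm->pixels = bm->base + GUARD_BYTES;
    return 1;
}

/* Frees through the original pointer, never the offset one. */
void guarded_bitmap_free(guarded_bitmap *bm) {
    free(bm->base);
    bm->base = NULL;
    bm->pixels = NULL;
}
```

In the MASM code the equivalent would presumably be either saving the raw HeapAlloc result in its own variable or subtracting 4096 from bitmap1 and bitmap2 before calling HeapFree.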

hutch--

Jeff,

Data alignment is no big deal to do. Most memory allocation API functions specify the alignment so you can just look it up, but if you are writing to various parts of a buffer, you need to test the alignment or just set it to the next highest aligned address. Have a look at a macro in the masm32 macro file called memalign. It's very simple and you can use those instructions to align a read or a write to a location in an existing buffer.
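
"Set it to the next highest aligned address" comes down to an add and a mask. A minimal sketch in C of that arithmetic (the function names are illustrative; this is not the memalign macro itself, only the general technique):

```c
#include <assert.h>
#include <stdint.h>

/* Rounds addr up to the next multiple of 'alignment', which must be a
   power of two. An address that is already aligned comes back unchanged. */
uintptr_t align_up(uintptr_t addr, uintptr_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (addr + alignment - 1) & ~(alignment - 1);
}

/* Returns nonzero when addr is a multiple of 16, as MOVDQA requires. */
int is_aligned16(uintptr_t addr) {
    return (addr & 15u) == 0;
}
```

For the blur loop, an aligned base address is only part of the story: the row offsets (MAXX * 3) would also have to be multiples of 16 for every load in the loop to stay aligned.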
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

OceanJeff32

Sorry, I've been gone for a few days...I only go online at work...

:cheekygreen:
You guys are awesome!

Ok, thanks for spotting the bug with the MOVQ!!! I can't believe I forgot that...I'm too spoiled by Visual C++ compiler...I was thinking MASM assembler would show me my own stupid mistakes!!!

Mark L. what do you do for a living?  If you haven't already you should write a book, I just glanced at your web site the other day, and I'm sure I'll be using it as a reference in the future.

Yeah, that fixed the Fireworks noise problem.  But both the MMX and XMM get the same Frames Per Second.  Hmm...next step Align the reads and writes with XMM and hope that speeds it up.

This is so cool.

Later guys,

Jeff
:U

P.S. Yeah, the bitmap data is allocated because the reads and writes don't check for bitmap boundaries before doing their job, so he wanted a buffer on either end it looks like.

P.P.S. Yes again, I've just written the above MMX and XMM code segments; the rest of Rony's Fireworks is still intact.  I'm going to use his basic framework and start from scratch with a particle engine (that's my first goal), then a 3D graphics engine (major goal for this year).

Ah, windows...the joy...the pain.
:green :lol

OceanJeff32

Hey Uncanny,

Do you mean that arranging the reads to start at a low address and then move to increasingly higher memory is better? Now I get it!

Cool, I will try to make those modifications.

I was reading the bitmap as follows:

----------------------------------------------
|        |   1    |   7    |   6    |        |
----------------------------------------------
|   2    |   8    |  base  |   9    |   5    |
----------------------------------------------
|        |   3    |   10   |   4    |        |
----------------------------------------------

I load BASE, 1, 2 and 3, divide them by two in pairs, add them together in pairs, divide by two again, then add them together again in pairs. This gives the average for pixel 8, which I store in the other bitmap.

BASE, 4, 5 and 6 are loaded to find the average pixel surrounding 9.

7,8,9,10 are averaged to find BASE.

So I do all those loads in 64-bit chunks for MMX version, and 128-bit chunks for XMM version, and store 3 chunks at a time in the 2nd bitmap.
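
The posted listing reaches the same four-way average with a single step per neighbour: PSRLW by 2 plus the PAND mask divides every byte by four, and the four quartered values are then added with PADDB. A minimal sketch in plain C that models one 16-bit lane and one pixel byte (function names are made up for illustration; 0FF3Fh is the per-word pattern of the 0FF3FFF3FH mask in the listing):

```c
#include <stdint.h>

/* Models PSRLW by 2 followed by PAND with 0FF3Fh on one 16-bit lane
   holding two colour bytes. The mask clears the two bits that the word
   shift drags from the high byte into the low byte, so each byte ends
   up divided by four on its own. */
uint16_t quarter_two_bytes(uint16_t lane) {
    return (uint16_t)((lane >> 2) & 0xFF3Fu);
}

/* Average of four neighbouring bytes the way the listing computes it:
   quarter each value first, then add. The sum is at most 4 * 63 = 252,
   so a byte add like PADDB cannot wrap around. */
uint8_t blur4(uint8_t a, uint8_t b, uint8_t c, uint8_t d) {
    return (uint8_t)((a >> 2) + (b >> 2) + (c >> 2) + (d >> 2));
}
```

Because each term is rounded down before the add, the result can come out up to 3 below the exact average, which is presumably what produces the gradual fade.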

I just think this code was cool to discover, and being able to completely rewrite it myself gives me a good feeling. This is the first project I've ever done in assembly language besides printing Hello World in DOS.

hope to hear back from you soon,

Jeff C
  :lol

UncannyDude

Hi!

Yeah, that's what I intended. As there is no English compiler, I sometimes get trapped in static semantic errors. :P

As for your problem, the algorithm is not very memory-access efficient. If you are determined to get the last nanosecond, try reading from contiguous memory (see it as a flat vector of bytes, not as a bitmap), and arrange the stores to suit the next read. This is called locality of reference. Writing causes no slowdown when you are not also reading (memset is a good example), but the processor will put that data in the cache, and may evict data you still need to read.

Also, when you shift mmx data, for a divide by 2, the shift amount should be 1, eh?

Mark_Larson

Since we have been discussing assembler optimization, I've also written some assembler optimization tutorials in case anyone is interested.


Using SSE2 to do Quaternions ( used in game programming):
http://www.old.masmforum.com/viewtopic.php?t=3469&highlight=quaternions

Mersenne Twister Random Number Generator optimization tutorial. The author of the mersenne twister's C code runs in 258 cycles. Agner Fog's P4 SSE2 code for the mersenne twister runs in 44 cycles. My ALU code runs in 25 cycles ( 10 times faster than the author's code, and almost twice as fast as Agner's SSE2 code). Yes, you read that right, my ALU code is running faster than Agner's SSE2 code. It's because I optimized it specifically for the P4, and you can execute up to 4 ALU instructions in parallel if you do it right. I then wrote an SSE2 version that runs in 14 cycles ( 18.5 times faster than the author's code, and 3.1 times faster than Agner's SSE2 code).
http://www.old.masmforum.com/viewtopic.php?t=3565&highlight=mersenne+twister


How to optimize C code into fast assembler code. This was the first one I did. It is 6 pages due to all the replies I was getting from people. Jibz was kind enough to offer some better optimized C code to compare against. I took the original code from a book on optimizing C. I wanted to show how to speed up already highly optimized C code using assembler.
http://www.old.masmforum.com/viewtopic.php?t=3329&highlight=optimization+tutorial


My account got messed up on masmforum. I had to get a new account. Some of my old posts now say hutch-- and some say marklarson. I participated in the MD5CRK project ( http://www.md5crk.com). You can see where Jean-luc gave me credit here: http://www.md5crk.com/?sec=aboutmd5client ( search for "larson"). My code runs 10 times faster than the standard C code. I also posted the code on masmforum but it says hutch-- (because of the problem I mentioned previously).
http://www.old.masmforum.com/viewtopic.php?t=2921&highlight=md5
               

OceanJeff32

I recently discovered PAVGB. It is a very helpful instruction for this fireworks demo that I was trying to "optimize", and it has low latency, according to the Intel manuals.
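
For reference, PAVGB averages unsigned bytes with rounding up: each lane becomes (a + b + 1) >> 1. A minimal sketch in C that models one byte lane and chains it into a four-way average (function names are illustrative only):

```c
#include <stdint.h>

/* Models one byte lane of PAVGB: unsigned average, rounded up.
   The sum is formed in a wider type so the +1 cannot overflow. */
uint8_t pavgb_lane(uint8_t a, uint8_t b) {
    return (uint8_t)(((unsigned)a + (unsigned)b + 1u) >> 1);
}

/* Four-way average built from three PAVGB-style steps. */
uint8_t avg4_pavgb(uint8_t a, uint8_t b, uint8_t c, uint8_t d) {
    return pavgb_lane(pavgb_lane(a, b), pavgb_lane(c, d));
}
```

One thing to watch for: because PAVGB rounds up while the shift-and-mask approach rounds down, a blur built this way may not fade toward black in quite the same way as the original.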

Later,

Jeff C
:U

Farabi

I still cannot use MMX or SSE instructions in my project. Can anyone solve the problem?
Also, which CPU types do not support the FPU?
Those who had universe knowledges can control the world by a micro processor.
http://www.wix.com/farabio/firstpage

"Etos siperi elegi"

AeroASM

All Intel x86 CPUs above the 386 come with FPU. (i think)

To use MMX, you need the .mmx directive and to use SSE, you need the .xmm directive.

To use SSE2 you need the ml 6.15
To use SSE3 you need the SSE3 macros.

Farabi

Quote from: AeroASM on March 01, 2005, 02:29:13 PM
All Intel x86 CPUs above the 386 come with FPU. (i think)

To use MMX, you need the .mmx directive and to use SSE, you need the .xmm directive.

To use SSE2 you need the ml 6.15
To use SSE3 you need the SSE3 macros.

I did, but I still cannot use it unless I make a new project. It says, 'Not support on current CPU mode'.