Beginners Graphic Questions

Started by Jimg, December 01, 2008, 08:10:19 PM

Jimg

#15
Okay, putting that last fiasco behind me, I have a better question-

I set up a single bitmap compatible with my app to contain all my images, and am blitting from that bitmap to my hidden buffer as needed, rather than having a separate bitmap for each image.

I set up this large bitmap organized as 64 images wide by 1 image high.  It ran much slower than expected.  By playing around, I found that the jump in execution time occurred between making the bitmap 47 and 48 images wide.  What would cause this behavior?

I set up a larger bitmap, organized as 16 images wide by 10 images high, and it still ran in the same time as the small bitmap.

Test 2 is the 64 x 1 bitmap, Test 3 is the 16 x 10 bitmap-
        Clocks   Description
Test 2  27557    BitBlt to hidden buffer from compatible buffer containing images 64 x 1
Test 3  2017     BitBlt to hidden buffer from compatible buffer containing images 16 x 10

That's over 13 times faster, and with more storage available.  Is this just another screwup on my part, or is there some logical explanation for this?

Here's a look at the code-

Setup:
Test2Part1:
    inv LoadImage,0,addr file1,IMAGE_BITMAP,0,0,LR_LOADFROMFILE
    mov hBmp1,eax
    inv GetObject,hBmp1,sizeof btmap,addr btmap ; get card info

    inv GetClientRect,hWin,addr crect    ; get the size needed for the bitmap
    inv GetDC,hWin  ; get dc's
    mov hdc,eax
    inv CreateCompatibleDC,hdc      ; create dc for hidden buffer
    mov mdc,eax
    inv CreateCompatibleDC,hdc      ; create dc for scratch card images
    mov sdc,eax
    inv CreateCompatibleDC,hdc      ; create dc for single card image
    mov tdc,eax

    inv CreateCompatibleBitmap,hdc, mwidth, mheight ; make a hidden buffer of the appropriate size
    mov mbmp,eax
    inv SelectObject, mdc, mbmp
    inv DeleteObject,eax        ; delete old bitmap
    inv CreateCompatibleBitmap,hdc,btmap.bmWidth,btmap.bmHeight ; make one of the appropriate size for card images
    mov tbmp,eax
    ret

Test2Part2:
    ; copy 4 card images into the same dc to avoid SelectObject,tdc,hBmp1
    push esi
    xor esi,esi
;       (already loaded above to get size)
;    inv LoadImage,0,addr file1,IMAGE_BITMAP,0,0,LR_LOADFROMFILE
;    mov hBmp1,eax
;    inv GetObject,hBmp1,sizeof btmap,addr btmap ; get card info
    inv SelectObject,tdc,hBmp1  ; get a card
    inv DeleteObject,eax        ; delete old bitmap
    inv BitBlt,sdc,esi,0,btmap.bmWidth,btmap.bmHeight,tdc,0,0,SRCCOPY
    inv DeleteObject,hBmp1

    inv LoadImage,0,addr file2,IMAGE_BITMAP,0,0,LR_LOADFROMFILE
    mov hBmp1,eax
    inv SelectObject,tdc,hBmp1  ; get a card
    inv DeleteObject,eax        ; delete old bitmap
    add esi,btmap.bmWidth       ; move over one bitmap width
    inv BitBlt,sdc,esi,0,btmap.bmWidth,btmap.bmHeight,tdc,0,0,SRCCOPY
    inv DeleteObject,hBmp1
   
    inv LoadImage,0,addr file3,IMAGE_BITMAP,0,0,LR_LOADFROMFILE
    mov hBmp1,eax
    inv SelectObject,tdc,hBmp1  ; get a card
    inv DeleteObject,eax        ; delete old bitmap
    add esi,btmap.bmWidth
    inv BitBlt,sdc,esi,0,btmap.bmWidth,btmap.bmHeight,tdc,0,0,SRCCOPY
    inv DeleteObject,hBmp1
           
    inv LoadImage,0,addr file4,IMAGE_BITMAP,0,0,LR_LOADFROMFILE
    mov hBmp1,eax
    inv SelectObject,tdc,hBmp1  ; get a card
    inv DeleteObject,eax        ; delete old bitmap
    add esi,btmap.bmWidth
    inv BitBlt,sdc,esi,0,btmap.bmWidth,btmap.bmHeight,tdc,0,0,SRCCOPY
    inv DeleteObject,hBmp1
   
    pop esi
       
    mov strtx,10
    mov strty,20
   
    BeginCounter
        xor eax,eax         ; choose 4th card for test
        add eax,btmap.bmWidth
        add eax,btmap.bmWidth
        add eax,btmap.bmWidth
        inv BitBlt,mdc,strtx,strty,btmap.bmWidth,btmap.bmHeight,sdc,eax,0,SRCCOPY
    EndCounter
    inv BitBlt,hdc,strtx,strty,btmap.bmWidth,btmap.bmHeight,mdc,0,0,SRCCOPY
    inv ReleaseDC,hWin,hdc
    inv DeleteObject,mbmp
    inv DeleteObject,sbmp
    inv DeleteDC,mdc
    inv DeleteDC,sdc
    inv DeleteDC,tdc
    inv DeleteObject,hBmp1
    ;Showit x
    ret

The tests:

Test2:
    Desc 'BitBlt to hidden buffer from compatible buffer containing images 64 x 1'

    call Test2Part1
  ; cards are 71 x 96
    xdim=48
    ydim=1
    mov eax,btmap.bmWidth   ; usually 71
    imul eax,xdim
    mov ecx,eax
    mov eax,btmap.bmHeight  ; usually 96
    imul eax,ydim
    inv CreateCompatibleBitmap,hdc,ecx,eax ; make one of the appropriate size for card images
    mov sbmp,eax
    inv SelectObject, sdc, sbmp
    inv DeleteObject,eax        ; delete old bitmap

    call Test2Part2
    ret

Test3:
    Desc 'BitBlt to hidden buffer from compatible buffer containing images 16 x 10'

    call Test2Part1
   
  ; cards are 71 x 96
    xdim=16
    ydim=10
    mov eax,btmap.bmWidth   ; usually 71
    imul eax,xdim
    mov ecx,eax
    mov eax,btmap.bmHeight  ; usually 96
    imul eax,ydim
    inv CreateCompatibleBitmap,hdc,ecx,eax ; make one of the appropriate size for card images
    mov sbmp,eax
    inv SelectObject, sdc, sbmp
    inv DeleteObject,eax        ; delete old bitmap

    call Test2Part2
    ret
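
For reference, the only thing that changes between the 64 x 1 and 16 x 10 layouts is how an image number maps to a source rectangle in the big bitmap. A minimal C sketch of that lookup follows; the names CellOrigin and atlas_cell_origin are hypothetical and do not appear in the test program, and the 71 x 96 card size is taken from the comments above.

```c
#include <assert.h>

/* Top-left pixel of one image inside a sheet that holds `cols` images per row.
   Illustrative helper only; not part of the test program in this thread. */
typedef struct { int x; int y; } CellOrigin;

static CellOrigin atlas_cell_origin(int index, int cols, int cell_w, int cell_h)
{
    CellOrigin o;
    assert(index >= 0 && cols > 0);
    o.x = (index % cols) * cell_w;  /* column within the row of images */
    o.y = (index / cols) * cell_h;  /* row within the sheet */
    return o;
}
```

With 71 x 96 cards, image 3 sits at (213, 0) in the 64 x 1 sheet, while image 19 sits at (213, 96) in the 16 x 10 sheet; those values would be the source x and y passed to BitBlt.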



Rockoon

Quote from: japheth on December 17, 2008, 07:54:37 PM
My guess is that the copy to video memory is done via PCI Busmaster DMA, that is, the cpu is not involved in the copy process.

I suspect that yes, that's done for the DIBs.
DDBs are usually already stored in video memory (one of the reasons that you can't nick a pointer to the pixel data), so there's no need to busy the bus either.

As far as the OP's 10,000 GDI objects.. whoa.. slow down, guy.. that's too much..
When C++ compilers can be coerced to emit rcl and rcr, I *might* consider using one.

Jimg

#17
Hi, it's me again.  I have another problem I just can't seem to understand.

The original question has to do with making the corners transparent, or rather, not wiping out the background when the card with rounded corners is drawn.

After experimenting, I find the fastest way is to draw the card in sections without blitting to the corners.

It takes 5 sections to blit the card image.  Each blit seems to take about 2000 clocks.

I know that just drawing a line with FillRect only takes a couple hundred clocks, so I thought I'd blit the center of the card and draw the border in sections with four FillRects.

The timing tests come out like this:
        Clocks   Description
Test 1  8891     BitBlt to hidden buffer from compatible buffer in sections
Test 2  1990     card without border
Test 3  556      border only
Test 4  8882     card and border only


I expected drawing the center of the card and the border to take the sum of the two, about 2600 clocks, but apparently it takes much longer, about 8800 clocks.

What goes on here?

The code for Test2, Test3, and Test4 is identical, with only the appropriate BitBlt or FillRect calls commented out, like this-
Test2:
    Desc 'card without border'
    call TestPart1
       
    pusha
    BeginCounter
        call setup1
        inv BitBlt,mdc,edi,ebx,69,94,sdc,esi,1,SRCCOPY
        call setup2
;       inv FillRect,mdc,addr lrr,hBrush
        call setup3
;        inv FillRect,mdc,addr lrr,hBrush
        call setup4
;        inv FillRect,mdc,addr lrr,hBrush
        call setup5
;        inv FillRect,mdc,addr lrr,hBrush
    EndCounter
    popa
    call cleanup
    ret

Test3:
    Desc 'border only'
    call TestPart1
       
    pusha
    BeginCounter
        call setup1
;        inv BitBlt,mdc,edi,ebx,69,94,sdc,esi,1,SRCCOPY
        call setup2
        inv FillRect,mdc,addr lrr,hBrush
        call setup3
        inv FillRect,mdc,addr lrr,hBrush
        call setup4
        inv FillRect,mdc,addr lrr,hBrush
        call setup5
        inv FillRect,mdc,addr lrr,hBrush
    EndCounter
    popa
    call cleanup
    ret

Test4:
    Desc 'card and border only'
    call TestPart1
       
    pusha
    BeginCounter
        call setup1
        inv BitBlt,mdc,edi,ebx,69,94,sdc,esi,1,SRCCOPY
        call setup2
        inv FillRect,mdc,addr lrr,hBrush
        call setup3
        inv FillRect,mdc,addr lrr,hBrush
        call setup4
        inv FillRect,mdc,addr lrr,hBrush
        call setup5
        inv FillRect,mdc,addr lrr,hBrush
    EndCounter
    popa
    call cleanup
    ret


Clearly, there is something going on here I don't understand.  Is the drawing done asynchronously?  Are all my timing tests of graphics worthless?

(test prog in later post)

NightWare

haven't tested the app, or really analysed the code, but i can say :

Quote from: Jimg on December 20, 2008, 05:50:00 PM
I set up this large bitmap organized as 64 images wide by 1 image high. It ran much slower than expected.
that's probably the problem: the l1/l2 cache loads memory in sections, so to fetch one image you have to load SEVERAL sections (due to the width). organizing the bitmap as 1 image wide by 64 images high should solve the problem...
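
to put rough numbers on that argument, here is a small C sketch. it ASSUMES 32 bits per pixel and DIB-style rows padded to a multiple of 4 bytes; a DDB's real layout is up to the driver, so treat this as an illustration only. dib_stride and card_span_bytes are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes per scanline of a DIB: each row is padded to a multiple of 4 bytes. */
static size_t dib_stride(int width_px, int bits_per_pixel)
{
    assert(width_px > 0 && bits_per_pixel > 0);
    return (((size_t)width_px * (size_t)bits_per_pixel + 31) / 32) * 4;
}

/* Bytes from the first byte of a card's first row to the last byte of its
   last row, for a card stored inside a sheet that is sheet_w_px wide. */
static size_t card_span_bytes(int sheet_w_px, int card_w_px, int card_h_px, int bpp)
{
    size_t stride = dib_stride(sheet_w_px, bpp);
    size_t row_bytes = ((size_t)card_w_px * (size_t)bpp + 7) / 8;
    assert(card_h_px > 0 && card_w_px <= sheet_w_px);
    return (size_t)(card_h_px - 1) * stride + row_bytes;
}
```

under those assumptions, one 71 x 96 card in a 64 x 1 sheet (4544 pixels wide) is spread over about 1.7 MB of address space, while in a 1 x 64 sheet (71 pixels wide) the same card occupies one contiguous run of about 27 KB.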

Quote from: Jimg on December 22, 2008, 01:36:41 AM
I expected drawing the center of the card and the border to take the sum of the two, about 2600 clocks, but apparently it takes much longer, about 8800 clocks.
here again, it's a problem with the cache: it's better to process 1 area than several, to get the full benefit of the cache, even if some useless work is done... choose a fill with an alpha option...

Jimg

Well, I went on a search for alpha gdi, and it's all a bit beyond me at the moment, but it led me to http://www.winprog.org/tutorial/transparency.html

That only takes two blits, so it runs in half the time.  Thanks.  From what I read of alpha, it looks too complicated to run fast; do you have a simple example you would share?
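
As I understand the two-blit method from that tutorial, the first blit ANDs a mask onto the destination (SRCAND) and the second ORs the image on top (SRCPAINT). Here is a minimal C simulation on plain pixel arrays, assuming the mask is white where the image is transparent and black where it is opaque, and that the image is black in its transparent areas. masked_blit is a hypothetical name and this does not call GDI at all.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simulates the two-pass mask method on n pixels of 0x00RRGGBB data. */
static void masked_blit(uint32_t *dst, const uint32_t *sprite,
                        const uint32_t *mask, size_t n)
{
    size_t i;
    assert(dst != NULL && sprite != NULL && mask != NULL);
    /* Pass 1 (like SRCAND): white mask pixels keep the background,
       black mask pixels punch a black hole where the image will go. */
    for (i = 0; i < n; i++)
        dst[i] &= mask[i];
    /* Pass 2 (like SRCPAINT): OR the image in; its black (zero) pixels
       leave the background untouched. */
    for (i = 0; i < n; i++)
        dst[i] |= sprite[i];
}
```

The background survives wherever the mask is white, and the image replaces it wherever the mask is black.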

Jimg

#20
From further research, I conclude that the fastest way is TransparentBlt (in msimg32 lib), but it only works on NT or later systems.  For non-NT, use the two blits mask method.
The only real drawback is that it uses up one of the 16 colors available as the transparent color.  Perhaps I could search all the colors used in all the incoming bitmaps for an unused color?
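
The search for an unused color could look something like this C sketch. It ASSUMES all the bitmaps share one 16-entry palette and that the pixels have already been unpacked to one palette index per byte; find_unused_index16 is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns the lowest palette index (0..15) that never occurs in `indices`,
   or -1 if all 16 are in use. */
static int find_unused_index16(const uint8_t *indices, size_t n)
{
    uint16_t used = 0;  /* bit k is set once index k has been seen */
    size_t i;
    int k;
    assert(indices != NULL || n == 0);
    for (i = 0; i < n; i++) {
        assert(indices[i] < 16);
        used |= (uint16_t)(1u << indices[i]);
    }
    for (k = 0; k < 16; k++)
        if (!(used & (1u << k)))
            return k;
    return -1;
}
```

Running it over the pixels of every incoming bitmap in turn (or over their concatenation) gives a candidate index for the transparent color; a result of -1 means no color is free.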

Here are my results-
        Clocks   Description
Test 1  9479     BitBlt to hidden buffer from compatible buffer in sections
Test 2  1986     card without border
Test 3  560      border only
Test 4  8835     card and border separate
Test 5  3890     two BitBlts using mask
Test 6  2011     TransparentBlt (NT only)



Rockoon

Note that TransparentBlt had a resource leak on unpatched Win98 and earlier (and possibly even fully patched editions, but I don't know).

As far as what pixel color to use.. use the pixel color already present in your images in that to-be transparent area.

For unpaletted images, usually the pixel in the upper left corner is taken to be the transparent color in color-keying software that uses image formats that do not store a color-key. For paletted images, usually the first or last color index is taken to be the color-key.

NightWare

Quote from: Jimg on December 22, 2008, 03:27:24 PM
do you have a simple example you would share?
personally i use my own algos for gfx stuff: i create a memory area (a screen/work area, where each dword corresponds to a pixel, bytes A,R,G,B (32 bits)). then i copy/paste the needed gfx to this area (so here it's easy to apply alpha/transparency). and when everything is done, i send the area to SetDIBitsToDevice. it's IMHO the best method.

now, if you want to limit the memory usage, or use a small range of colors, etc... it can become problematic... but honestly all those 256, 5/6/5, 8/8/8, etc... formats are obsolete now. here are some examples i've posted some time ago :

http://www.masm32.com/board/index.php?topic=7809.0
http://www.masm32.com/board/index.php?topic=7461.0

note : ARGB is the format INSIDE registers, otherwise in memory it looks like BGRA...
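
a rough C sketch of both points, the byte order and a simple per-pixel alpha blend. this is NOT the code from the linked topics; every function name here is hypothetical, and the blend formula (source weighted by its alpha, destination by 255 minus alpha, rounded) is just one common choice.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack a pixel as it looks in a 32-bit register: 0xAARRGGBB. */
static uint32_t pack_argb(uint8_t a, uint8_t r, uint8_t g, uint8_t b)
{
    return ((uint32_t)a << 24) | ((uint32_t)r << 16) | ((uint32_t)g << 8) | b;
}

/* Store a dword least significant byte first, as x86 does in memory. */
static void store_le32(uint32_t v, uint8_t out[4])
{
    assert(out != NULL);
    out[0] = (uint8_t)(v);        /* B */
    out[1] = (uint8_t)(v >> 8);   /* G */
    out[2] = (uint8_t)(v >> 16);  /* R */
    out[3] = (uint8_t)(v >> 24);  /* A */
}

/* Blend one 8-bit channel: alpha 255 gives src, alpha 0 gives dst. */
static uint8_t blend_channel(uint8_t src, uint8_t dst, uint8_t alpha)
{
    return (uint8_t)((src * alpha + dst * (255 - alpha) + 127) / 255);
}

/* Blend a source ARGB pixel over a destination pixel using the source alpha.
   The result is returned fully opaque. */
static uint32_t blend_argb(uint32_t src, uint32_t dst)
{
    uint8_t a = (uint8_t)(src >> 24);
    uint8_t r = blend_channel((uint8_t)(src >> 16), (uint8_t)(dst >> 16), a);
    uint8_t g = blend_channel((uint8_t)(src >> 8), (uint8_t)(dst >> 8), a);
    uint8_t b = blend_channel((uint8_t)src, (uint8_t)dst, a);
    return pack_argb(0xFF, r, g, b);
}
```

running blend_argb over every pixel of a card as it is copied into the work area is the general idea; the finished area would then go to SetDIBitsToDevice as described above.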

Jimg

I just used an alternate timing method and got completely different results, and the results varied greatly depending on how many test iterations one does.  It appears that this GDI stuff really runs asynchronously and it's going to take a much more sophisticated test, so please ignore my previous results and I'll get back later with some much better figures.

Rockoon

It's true. Most common GDI functions will be hardware accelerated on most systems these days. Certainly anyone with an nvidia or ati GPU and their respective drivers will have their standard BitBlt ROPs and font renderings accelerated. The level of acceleration can be tweaked from the user's control panel.

One thing to note is that I'm pretty sure that StretchBlt() ISN'T accelerated anywhere (might be something to do with Microsoft seeking a consistent quality across platforms).. it definitely isn't on my xp/64 system with an nvidia 8800GT. Also, to get the maximum amount of acceleration, neither your source nor target can be a DIB (you need to stick to DDBs).