News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Memory access speed issue

Started by srelu, January 18, 2012, 11:23:50 PM

Previous topic - Next topic

BogdanOntanu

Also tell us about the algorithm you want to implement.

What is it? Is it an Water ripple effect, 3D bevel/bump map, refraction, fire, smoke?
We need to know in order to understand the nature of your problem...

We also need to see all 3 loops in order to get the correct idea.

If you write pixels too far away in memory (and thus out of cache) then you should break your image into smaller chunks (rectangles) and try to iteratively apply the effect on each chunk in order to keep memory references together in cache.

Memory writes are fast unless you do write them always out of the cache or misaligned. As for heap you could try using VirtualAlloc() it has slightly better alignment but you could also align your heap buffer with the same result.

I really doubt that the "mov [eax],ecx" is really your problem. It just looks like it is because you make another mistake.

I would use mov [edi],eax but that is just me and I do not think it would help you in anyway.
I like to keep ECX for counters, EDI for destination pointers and ESI for source pointers :P

If you can not reduce the amount of reads per pixel, you can not avoid MUL at EVERY PIXEL and can not avoid  a pixel write that is out of  the CPU cache each and every time... Then I guess that "This Is IT" ;)

Beside using SEE that might help do/calculate/write more pixels at once there is no magic ASM bullet that will undo doing all conceptual errors at once when writing a single pixel / dword to memory. You do need to change the "point of view" for this problem.

Maybe move into 64 bits where you have more registers or try to also use the XMM registers if the target CPU permits this.

Also: How much memory do you read/write? Is it a lot? More than your target CPU cache?

Seriously now, we do need more info...

Ambition is a lame excuse for the ones not brave enough to be lazy.
http://www.oby.ro

MichaelW

I didn't have time to go any further with this, and my results are undoubtedly colored by the choices I made for the missing data.

;==============================================================================
include \masm32\include\masm32rt.inc
.686
include \masm32\macros\timers.asm
;==============================================================================

printf MACRO format:REQ, args:VARARG
    IFNB <args>
        invoke crt_printf, cfm$(format), args
    ELSE
        invoke crt_printf, cfm$(format)
    ENDIF
    EXITM <>
ENDM

;==============================================================================
.data
      dwWidthEx dd 16
      dwColor   dd 0
      buffer1   dd 20000 dup(4)
      buffer2   dd 20000 dup(OFFSET buffer2)
.code
;==============================================================================
start:

    invoke Sleep, 4000

    timer_begin 10000, HIGH_PRIORITY_CLASS
    timer_end
    printf("%dms\n", eax)

    timer_begin 10000, HIGH_PRIORITY_CLASS
        mov esi, OFFSET buffer1
        mov edi, OFFSET buffer2
        mov ecx, 1000
      cycle1:
        push ecx
        ; Next, load the y coordinate of the shape and multiply it by the width of the image
        mov eax, [esi]
        mul dwWidthEx
        add esi, 4                       ; Advance to the x coordinate
        add eax, [esi]                  ; Add the x coordinate (it's premultiplied by 4)
        add esi, 4                       ; Advance to the next y coordinate (for the next iteration)
        ; Next, add an absolute address from the picture buffer
        ; that will set the location where the shape will be drawn
        add eax, [edi]
        mov ecx, dwColor            ; for now, just load a color (later it will be taken from a table)
        mov [eax], ecx                ; write the pixel into the memory - SLOW!!!
        pop ecx
        loop cycle1
    timer_end
    printf("%dms\n", eax)

    timer_begin 10000, HIGH_PRIORITY_CLASS
        mov esi, OFFSET buffer1
        mov edi, OFFSET buffer2
        mov ecx, 100
      cycle2:
        push ecx
        ; Next, load the y coordinate of the shape and multiply it by the width of the image
        mov eax, [esi]
        mul dwWidthEx
        add esi, 4                       ; Advance to the x coordinate
        add eax, [esi]                  ; Add the x coordinate (it's premultiplied by 4)
        add esi, 4                       ; Advance to the next y coordinate (for the next iteration)
        ; Next, add an absolute address from the picture buffer
        ; that will set the location where the shape will be drawn
        add eax, [edi]
        mov ecx, dwColor            ; for now, just load a color (later it will be taken from a table)
        ;;mov [eax], ecx                ; write the pixel into the memory - SLOW!!!
        pop ecx
        loop cycle2
    timer_end
    printf("%dms\n", eax)

    inkey
    exit
end start


Typical results running on my P3:

0ms
336ms
24ms


I think you need a more cache-friendly algorithm.

eschew obfuscation

cmpxchg

didn't test this

    ; PUSHEs,  as needed and as much as you want
    ;push     ...
    ;push     ...
    ;push     ...   

    mov     ebp, [edi]
    mov     edi, pLMpy
    cld
cycle:
    lodsd                         ; load y coordinate
    mov     ebx, [eax*4+edi]      ; some weird table ????
    lodsd                         ; load x coordinate
    mov     edx, dwColor
    add     ebx, eax              ; += x
    mov     [ebx+ebp], edx        ; mov [coordinate + bitmap offset], color
    dec     ecx
    jnz     cycle

    ;pop      ...
    ;pop      ...
    ;pop      ... 




    mov     ebp, dwWidthEx       ; width of one bitmap line in bytes
    mov     edi, [edi]           ; bitmap address

    shr     ecx, 1               ; assuming that ecx >= 1
    jnc     _2pxLoop

    mov     eax, [esi]
    mov     edx, dwColor
    imul    eax, ebp
    add     eax, [esi+4]
    add     esi, 8
    mov     [eax+edi], edx
    test    ecx, ecx
    jz done

_2pxLoop:
    mov     eax, [esi]          ; y
    mov     edx, [esi+8]
    mov     ebx, dwColor

    ; prior to multiplication must ensure that result will not cross 2GB value
    imul    eax, ebp            ; y * width
    imul    edx, ebp

    add     eax, [esi+4]        ; +=  x*4
    add     edx, [esi+12]
    add     esi, 16

    mov     [eax+edi], ebx   ; mov [coordinate + bitmap_offset], color
    mov     [edx+edi], ebx

    dec     ecx
    jnz     _2pxLoop
done:                       




    mov     ebp, dwWidthEx       ; width of one bitmap line in bytes
    mov     edi, [edi]           ; bitmap address

    shr     ecx, 1               ; assuming that ecx >= 1
    jnc     _2pxLoop

    mov     eax, [esi]
    mov     edx, dwColor
    imul    eax, ebp
    add     eax, [esi+4]
    add     esi, 8
    mov     [eax+edi], edx
    test    ecx, ecx
    jz done

_2pxLoop:
    mov     eax, [esi]          ; y
    mov     edx, [esi+8]
    movd    mm0, dwColor
    movd    mm1, dwColor2

    ; prior to multiplication must ensure that result will not cross 2GB value
    imul    eax, ebp            ; y * width
    imul    edx, ebp

    add     eax, [esi+4]        ; +=  x*4
    add     edx, [esi+12]
    add     esi, 16

    movd    [eax+edi], mm0     ; mov [coordinate + bitmap_offset], color
    movd    [edx+edi], mm1

    dec     ecx
    jnz     _2pxLoop
done:                       

vanjast

Quote from: srelu on January 20, 2012, 12:34:17 AM
Frankly, I don't know anything about what you call "cache". Can you point me in the right direction? It might be of interrest.  Maybe some knowledge about the cache would enlighten me and make me able to see the reason I can't see now.
From my distant memory, diluted with spatterings of matured red grape juice

Cache is very fast memory, usually much faster than your DRAM (dynamic ram) chips that you see plugged into your motherboard.
Cache memory cells are larger and consume more power than Dram memory therefore not much is used.

There are generally 2 (or 3) levels of cache, each increasing level slower than the previous level.

Level-1: Usually situated on the CPU itself, and can be accessed in 1 cycle (hence super fast)
Level-2: Usually external to the CPU (but situated very close to it) (slower than level 1)
Level-3: Same as level-2, but usually slower.

On the motherboard/cpu, there is a cache controller, which keeps track of DRAM memory accesses in what is called page files, which are anywhere from 1024-4096 bytes in size (Sizes may vary - this is just an example)

When you access dram memory, the cache controller loads that page into cache memory, and any further accesses to this memory area (cache hit) are diverted to cache accesses - hence it is much faster. As soon as you access dram memory outside this page (cache miss), the cache controller has to flush(save) the current page and reload the new page. This can degrade performance if every cycle counts - so as advised, it is better to keep your data as close together as possible and inner loop sections as tight as you can.

There can be many pages in cache, all from different areas of Dram memory.

:8)

clive

x86 pages are usually 4 KB, there are extensions to support super pages of 2MB/4MB/1GB depending on the architecture. These are typically used by the OS to describe large linear regions of memory used by the OS itself.

DRAMs often have "pages" defined by the row/column/banking internally, but these are typically totally disconnected from the cache structures. Sequential accesses within these pages requires less setup, and thus less latency. How effectively these are used depends on the memory controller, and the access patterns.

Cache lines are universally smaller than 4KB, they are typically in the order of 64 bytes. Want to check, use something like CPU-Z to decode the descriptors. This typically acts a a minimum read width of memory, prefetches or speculative reads.

The TLB (translation look-aside buffer) is a caching mechanism for the page table traversals that convert virtual addresses to physical addresses. If you are churning accesses across large regions of memory you will expose the latency of traversing the tables.

Caching is a fairly standard "computer architecture" topic, there should be plenty of texts on the topic. The "MindShare" series, usually by Tom Shanley, had some good analysis of the older Intel architectures. I've not dug into the more recent ones.

http://www.amazon.com/Pentium-Pro-System-Architecture-2nd/dp/0201309734/ref=sr_1_9?s=books&ie=UTF8&qid=1328032491&sr=1-9

Video memory is often very much slower than other memory, either by being uncacheable, or not allowing write-combining operations. Write speeds will be exposed if you consume all the available "Write Buffers" within the system.

As I mentioned earlier, the use of XCHG exposes the worst case timing of the underlying memory, where all the smoke and mirrors are removed.
It could be a random act of randomness. Those happen a lot as well.

Farabi

Quote from: srelu on January 20, 2012, 12:43:16 AM
@Hutch
Got the loop and sub issue. There are no memorry operands in the inner loop. Still those problems are of secondary importance. The BIG problem is the outrageous time consumed by the "mov [eax], ecx" instruction. Are factors that can cause difficulties for the CPU to write into the memory?

Remember that on some computer system, reading memory is the same like reading a harddisk, if Im not mistaken it is called  virtual memory. So, youre not accessing the RAM directly but reading it from a harddisk. It making the system memory bigger of course, even you had only 128 MB RAM, but you can allocate 4 GB each thread, but the problem is, when 2 Thread access the same harddsik at the same time. Your code need to wait another operation to be done.
Those who had universe knowledges can control the world by a micro processor.
http://www.wix.com/farabio/firstpage

"Etos siperi elegi"