News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Memory access speed issue

Started by srelu, January 18, 2012, 11:23:50 PM

Previous topic - Next topic

srelu

Although I was once a good Zilog Z80 programmer, I did very little programming with Intel processors. I try to write some code for graphic processing. Considering the great amount of the required operations, I inserted Assembler code into a Visual C++ project. The code works well, I only have a speed issue. Using the GetThickCount function, the execution takes about 2700 thicks, that's about 5 seconds.

I have an instruction, in the core of three loops, that's's the only one that writes something into the memory.
mov [eax], ecx

If I remove that single instruction, the duration drops from 2700 thicks to 500 (about a second). That's a five times speed increase. Practically, I can't keep that instruction removed because that's the one that generates the graphics, it writes an RGB pixel value to a given address.

My question is why does that instruction slow down so severily the execution, and what could I do to fix that. Is memory access so slow on Intel processors, or maybe I' doing something wrong?

I used for testing a Pentium 4 processor at 2.66 Ghz because I want to make sure that the code will run properly on any computer. On a Centrino at 2 Ghz the execution takes about 1 second instead of 5, but it's still too slow. I need the graphics to be updated in real time while the user operates the controls that changes the perameters of the drawing (like dragging a slider).

I did everything possible to speed up the algorithm, I even replaced all the complex calculations with tables of precalculated values, the only arithmetic operations left are add and shl.

At this time I'm out of ideas. Do you have any? (Other than waiting for faster processors.)

hutch--

Just post a bit more of your assembler code so that the context makes sense, it may be something simple that can be fixed.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

dedndave

yes - we definitely need to see more of the surrounding code
intel processors may perform instructions "out of order" to increase performance
sometimes, simply changing the order things are done can make a big change
also - overall loop design can be a problem, as well

donkey

Hi,

Well you have not provided any code so we can't provide any specific answers. There are a number of things that can affect the execution speed, first off memory external to the processor is always going to be slower and 5x slower though pushing the high end of the curve is possible. Consider that you have a 2.66 Ghz processor and an 800 Mhz bus, accessing external memory will be at least 3.325x slower. However, depending on the complexity of your algorithm, I doubt that the actual memory access makes up a huge part of the routine as a whole so the difference should be negligible. The next and somewhat more critical problem would be stalls, if the processor has to wait for an instruction pipeline to flush completely before beginning the next instruction, it will cause a stall and eat up clock ticks (not thicks). After that you may have branch prediction problems, by creating branches that are at odds with the branch prediction hardware you can cause the instruction cache to be invalidated quite often, this can in some circumstances slow down a program dramatically. That's the big three, without the full routine it is not possible to guess which is most likely or if its something else completely.

A last thing to aim for would be if you can fit the loops in a cache line, I have written routines that once I reduced them to a cache line in size ran up to 3x faster, adding a single instruction outside the cache line caused another fetch and severely impacted speed.

When doing intensive graphics processing you should be looking to the SIMD instruction sets, they are designed for that purpose and can cut execution time exponentially in some cases. Which to choose depends on the exact nature of the calculations, MMX will almost always give you some speed increase however. If your application requires even more horsepower you may be able use the GPU to execute some portions of your code, this is extremely complex (I have never successfully got it working) but rumour has it that the speed increase is phenominal.

Agner Fog is a must read if you want to optimize code:

http://www.agner.org/optimize/optimizing_assembly.pdf

Edgar
"Ahhh, what an awful dream. Ones and zeroes everywhere...[shudder] and I thought I saw a two." -- Bender
"It was just a dream, Bender. There's no such thing as two". -- Fry
-- Futurama

Donkey's Stable

srelu

Thank you for your kind and fast answers.

Here's the innermost loop of the three. It's purpose is to draw a shape into a device independent bitmap buffer. The shape is taken from another buffer addressed by esi. That buffer contains the coordinates of the points of the shape. The coordinates are relative to the center of the buffer that holds them, some positive, some negative.  First is the y coordinate, next is the x coordinate (reverted order for the confort of the calculations). The x coordinate is premultiplied by 4 since the bitmap is 32 bit color deepth.

The code contains a failed attempt to speed up calculations by replacing a multiplication operation with values taken from a multiplication table. No significant speed gain was achieved.

No need to verify the logic and math of the calculations. I did it manually running several iterations step by step with the debugger and verifying each result manually. Everything goes as intended.

The other two outer loops only prepare parameters and manage loop variables. The entire fully functional code would be near to impossible to be posted here since it relays on huge tables. Not to mention that programmers are not very pleased to make public all their code.


cycle: push ecx
; Next, load the y coordinate of the shape and multiply it by the width of the image
mov eax, [esi]         
shl eax, 2                       ; (New) - Replacing multiplication with value taken from a multiplication table
add eax, pLMpy               ; (New) -add the pointer to the middle of the multiplication table (before the midle there are negative values)
mov eax, dword ptr[eax]  ; (New) -load the premultiplied value from the table
; mul dwWidthEx               ; This commented out multiplication is replaced by the three previous lines of code (no visible speed gain)
add esi, 4                       ; Advance to the x coordinate
add eax, [esi]                  ; Add the x coordinate (it's premultiplied by 4)
add esi, 4                       ; Advance to the next y coordinate (for the next iteration)
; Next, add an absolute address from the picture buffer
; that will set the location where the shape will be drawn
add eax, [edi]
mov ecx, dwColor            ; for now, just load a color (later it will be taken from a table)
mov [eax], ecx                ; write the pixel into the memory - SLOW!!!
pop ecx
loop cycle

hutch--

First thing, get rid of the LOOP instruction, it is very slow.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

BogdanOntanu

You are doing it all wrong conceptually.

1) Do NOT push and pop ECX for each pixel :P use another register (EBX).
2) avoid LOOP instruction

3) Most important: Do NOT MUL with Y for each pixel.

Instead arrange your data in such a way as to have all pixel from the same Y together in a row  like this:
[Y Line1] x1 x2 x3 x4... xn
[Y line 2] x1 x2 ... xn
...

4) You do to many reads, try to reduce them
5) Y can also bepremultiplied by 4
6) Forget about look-up tables in this case.
It will slow you down because you are bound by MEMORY read/write speed (bandwidth) and one extra read/write will kill speed.
The MUL instruction is fast enough IF you do it only once per Y line (as needed)

Forget about "protecting" it ...
It is not a new idea algorithm :P I did it 10 years ago (correctly) in HE RTS sprites / shapes I still use it in my OS, just do it right ;)

And I am kind of sure somebody else has had the same idea before me ;)
Ambition is a lame excuse for the ones not brave enough to be lazy.
http://www.oby.ro

dedndave

also - if you have 3 loops, it is likely that it can be simplified

clive

Quote from: sreluUsing the GetThickCount function, the execution takes about 2700 thicks, that's about 5 seconds. .. to 500 (about a second) ..

GetTickCount returns ticks in milliseconds, you'd need 5000 of those to be 5 seconds. So, 500 would be half a second. For precision one might use RDTSC to read a 64-bit value which would tick at 2.66 Billion ticks per second on the processor mentioned.

Want to determine the raw memory access speed, do an "XCHG [eax],ecx", this would expose the speed to read and then write to the same location. Normally this would be hidden by caches, write buffers, etc.
It could be a random act of randomness. Those happen a lot as well.

jj2007

#9
push ebx
  mov ebx, pLMpy
cycle:
...
lea eax, [4*eax+ebx]
; shl eax, 2                       ; (New) - Replacing multiplication with value taken from a multiplication table
; add eax, pLMpy               ; (New) -add the pointer to the middle of the multiplication table (before the midle there are negative values)

Quotemov ecx, dwColor

- You trash edx anyway with the mul, so ecx could be replaced by edx, saving you the push ecx/pop ecx sequence
- As others said, loop is very slow.
- imul is much faster than mul, check if you can replace that.

- probably faster:
add eax, [esi+4]                  ; Add the x coordinate (it's premultiplied by 4)
add esi, 8                       ; Advance to the next y coordinate

Otherwise good luck. QueryPerformanceCounter and the timer macros are your friends. And SSE2, of course :bg

Attached a little testbed - v0 is your version. First results look promising :bg
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE
181     cycles for v0
186     cycles for v1
180     cycles for v2
68      cycles for v3

srelu

1. I replaced loop with "dec ecx" and "jnz". No significant gain. I cannot make an accurate estimation since the exact number of ticks depends on haw many threads are active at a certain moment. Anyway it's still too slow, the source of my happiness is not there.

2. I replaced "mov eax, [ecx]" with "xchg eax, [ecx]". Result: a catastrofic speed loos, tick cout over 9000.

3. I don't have extra registers to use, don't forget this is just the innermost loop, there are two others too. They also need registers. I only have the edx but I would like to use it as a poiter into an RGB table not implemented yet.

4. I cannot avoid multiplication for each pixel, that's the only way to convert a 2-dim space (x,y) into 1-dim address list. I cannot use premultiplied calculations since the shape must be drawn to different locations. Also, the the dimensions of the target bitmap may change between two executions of the routine.

5. The three loops are necessary, that's the algorithm. It's not just about drawing something on the screen, it's about achieving a special graphic effect.

6. I see no reason in getting the exact cpu cycles required by this routine. I can tell you without any specialized software that it's too slow.

7. Indeed the time is not 5 seconds, it's only 2.5. When something it's slower than you expect, you'll find it slower than really is.

8. I tried to add a second memory write to the loop. It only increased the execution by 400 ticks.

9. Additional info: I write in the heap, the buffer was created with the "new" operator.

10. The root of all evil is the "mov [eax], ecx" instruction. That single instruction uses 5 times more time than all the other instructions togheter. Please focus your attention on that. Something's wrong there.

jj2007

0. You didn't even bother to download my testbed, which would have shown you how to get almost a factor three improvement, at least if the destination memory is not too large for the cache.

Your points 1, 2, 3, 6, 7, 8, 9 and 10 might trigger some reactions in this thread - not from myself, though. Good luck with your project.

hutch--

A number of things, get rid of LOOP, find a way to avoid the push pop on ECX on an iteration basis then keep in mind that your inner loop is more time critical than the outer loops so keep your registers for where the action is in the inner loops. You can use memory operands for the less critical outer loops. Use "sub ecx, 1 ||| jnz label", Intel recommend ADD or SUB over INC and DEC.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

srelu

jj2007, thank you for your answer. I didn't use the software you reccomended since I see no reason to do it. Frankly, I don't know anything about what you call "cache". Can you point me in the right direction? It might be of interrest.  Maybe some knowledge about the cache would enlighten me and make me able to see the reason I can't see now.

srelu

@Hutch
Got the loop and sub issue. There are no memorry operands in the inner loop. Still those problems are of secondary importance. The BIG problem is the outrageous time consumed by the "mov [eax], ecx" instruction. Are factors that can cause difficulties for the CPU to write into the memory?