Memory faster than register - subtle optimization issue?

Started by andreuri2000, November 17, 2007, 04:23:10 PM

Previous topic - Next topic

andreuri2000

While testing a register allocator for a hobby compiler, I ran into a very counterintuitive situation where storing a value in memory runs much faster than storing it in an available register, and I was wondering if some expert might have an explanation for it. 

The example follows.  Apologies in advance for the unusual syntax (I'm using the Sassy assembler - the meaning should hopefully be obvious).  This code, which calculates 1! a large number of times, is non-optimal in several ways that are still due to an immature compiler, but that is not what concerns me.

What is very surprising to me is that if I use an available register instead of the memory location temp below, the code slows down by a factor of 2 on my Pentium 4.  More specifically, the following program, which uses a memory location for the temporary:

          (data   (align 32)
                  (label temp (dwords 0)))
          (text
           (label fact
                  (mov ebx ecx)
                  (cmp ecx 0)
                  (je return)
                  (sub ecx 1)
                  (push ebx)
                  (call fact)
                  (pop ebx)
                  (imul eax ebx)
                  (ret))
           (label return 
                  (mov eax 1)
                  (ret))
           (label loop
                  (mov (& temp) ebx)   <--
                  (cmp ebx 0)
                  (je return2)
                  (mov ecx 1)
                  (call fact)
                  (mov ebx (& temp))   <--
                  (sub ebx 1)
                  (jmp loop))
           (label return2
                  (ret))
           (label _WinMain
                  (mov ebx 100000000)
                  (jmp loop)))

is faster by an approximate factor of 2 than the following equivalent program using the register edx instead of temp (the instructions that are different are indicated by arrows):

          (text
           (label fact
                  (mov ebx ecx)
                  (cmp ecx 0)
                  (je return)
                  (sub ecx 1)
                  (push ebx)
                  (call fact)
                  (pop ebx)
                  (imul eax ebx)
                  (ret))
           (label return 
                  (mov eax 1)
                  (ret))
           (label loop
                  (mov edx ebx)   <--
                  (cmp ebx 0)
                  (je return2)
                  (mov ecx 1)
                  (call fact)
                  (mov ebx edx)   <--
                  (sub ebx 1)
                  (jmp loop))
           (label return2
                  (ret))
           (label _WinMain
                  (mov ebx 100000000)
                  (jmp loop)))

This is quite the opposite of what I would have expected, and contradicts the basic assumption on which my register allocator has been based (namely that putting something in a free register should not be worse than putting  it in memory). 

For some reason, the slowdown effect disappears if I replace the imul with an add.   It also seems to disappear when I inline the call to fact in the outer loop.  Neither makes sense to my puny mind.

Andre

bozo

i use similar method of saving registers instead of using push/pop, when i need code to be fast.
the registers are saved in data section, rather than on stack..and it always seemed to be faster.

one of the first things i do when optimizing code is remove as many push/pop instructions as possible,
replacing with mov.

never liked the p4, but Mr Larson or someone with more knowledge of it could no doubt tell you why
your code is running faster with certain instructions.

Frank

Hi Andre,

what you describe seems next to impossible. EDX does not participate in any dependency chain (unless it has a false dependency on the two-operand IMUL on a Pentium 4, but that would certainly be documented somewhere). Alas I have no Pentium 4 here; I'll try it out at home and report back later today.

Frank

Rockoon

My guess is that this is a code alignment issue, something to do with winmain not exiting properly, or a mistake calculating the time.
When C++ compilers can be coerced to emit rcl and rcr, I *might* consider using one.

andreuri2000

Quote from: Rockoon on November 18, 2007, 01:52:39 PM
My guess is that this is a code alignment issue, something to do with winmain not exiting properly, or a mistake calculating the time.

I do not believe it is a problem with exiting or time calculation.  The timings increase linearly with the loop range, and the 2x effect is preserved (e.g. 1s/2s, 5s/10s, 10s/20s, ...), which seems to exclude these two possibilities. 

As for code alignment, I am somewhat clueless as to where to start investigating this.

Andre

andreuri2000

Hmm, the slowdown disappears if I put a nop after the imul.  Specifically, the following code:

           (label fact
                  (mov ebx ecx)
                  (cmp ecx 0)
                  (je return)
                  (sub ecx 1)
                  (push ebx)
                  (call fact)
                  (pop ebx)
                  (imul eax ebx)
                  (nop)               <--
                  (ret))
           (label return 
                  (mov eax 1)
                  (ret))
           (label loop
                  (mov edx ebx)    *
                  (cmp ebx 0)
                  (je return2)
                  (mov ecx 1)
                  (call fact)
                  (mov ebx edx)    *
                  (sub ebx 1)
                  (jmp loop))
           (label return2
                  (ret))
           (label _WinMain
                  (mov ebx 100000000)
                  (jmp loop))

slows down by a factor of about 2 on my Pentium 4 when I omit the nop.  Omitting the nop makes no difference to the version where the lines indicated by asterisks moved the temporary to and from a memory location instead of to and from edx. 

I am not sure how to explain this (alignment issue?, pipeline issue?, ...).  Given that the slowdown is so dramatic, it is obviously something I need to address, and I would be grateful for any further comments or literature references.

Andre


Rockoon

The issue arrises at the targets of branches.

In the "ideal" world, all branch targets fall right at the start of a cache line OR within the cache line of the branch instruction.

Cache lines are typically 32 bytes on modern hardware, which is frustrating because WIN32 only aligns programs at 16 byte boundaries.

The worst case scenario is that the target instruction of a branch straddles two cache lines, and in most situations this is the only case you should really try to avoid.

MASM has an 'align' directive that can force the alignment of an instruction or label at a 2 byte, 4 byte, 8 byte, or 16 byte boundary. 1-byte NOP's are inserted before the instruction or label.
When C++ compilers can be coerced to emit rcl and rcr, I *might* consider using one.

andreuri2000

Quote from: Rockoon on November 18, 2007, 05:26:40 PM
The worst case scenario is that the target instruction of a branch straddles two cache lines, and in most situations this is the only case you should really try to avoid.

Indeed, I checked the offset of the various labels emitted by the assembler, and the 2-byte "mov" instruction located at the main "loop" label was at 0x1F in the slow version.  Aligning it to 0x20 indeed gives the 2x increase in speed. 

Thank you very much for the precise identification of the problem and for the rule of thumb.  I can imagine that determining the necessity of and positions at which to insert such "align" directives can be laborious if done manually.  I am wondering if x86 assembler programmers commonly handle this issue in practice, and if so, how they go about it.   I also wonder if I can expect the effect to be as significant beyond the P4.   

Andre 

Mark Jones

The AMD CodeAnalyst package is amazing at pinpointing these types of bottlenecks. I analyzed a routine I had made to factorize big numbers, and found that a memory operation was slowing down the loop, so I changed the code to use registers and amazingly, it ran even slower. Just goes to show that timing your code is very important. :wink

But I would highly recommend the CodeAnalyst software... get from:

http://developer.amd.com/downloads.jsp
"To deny our impulses... foolish; to revel in them, chaos." MCJ 2003.08

Frank

Quote from: andreuri2000 on November 18, 2007, 07:39:11 PM
the main "loop" label was at 0x1F in the slow version.  Aligning it to 0x20 indeed gives the 2x increase in speed.

With that additional information it was finally possible to reproduce the effect. Nice!

Quote from: Rockoon on November 18, 2007, 05:26:40 PM
The worst case scenario is that the target instruction of a branch straddles two cache lines, and in most situations this is the only case you should really try to avoid.

Loop alignment and/or cache line straddling do not really explain the speed difference between the program versions. The "loop" label was at the same location (0x1F) for both the slow and the fast version, and in both cases the target instruction straddled cache lines (mov edx, ebx: 2 bytes; mov [temp], ebx: 6 bytes).

Rockoon

The instruction after a call is ALSO a branch target.
When C++ compilers can be coerced to emit rcl and rcr, I *might* consider using one.

andreuri2000

Quote from: Rockoon on November 19, 2007, 01:59:02 AM
The instruction after a call is ALSO a branch target.

I do not think I understand.  Assuming a non-aligned loop entry point of 0x1F, then in one case
(saving in a register) the instruction after the call is at offset 0x34 and in the other case
(saving in a memory location) at offset 0x38.  Neither jumps to my attention as problematic,
yet the second case is 2x faster.

Andre

Rockoon

When C++ compilers can be coerced to emit rcl and rcr, I *might* consider using one.

Frank

Quote from: Rockoon on November 19, 2007, 01:59:02 AM
The instruction after a call is ALSO a branch target.

Yes, but that branch target does not straddle cache lines, whether the loop starts at 0x1F or whether the loop entry is moved to 0x20 -- see the disassembly:


; store to memory
0040101F 891D00304000           mov     [403000h],ebx   ; [temp]
00401025 83FB00                 cmp     ebx,0
00401028 7415                   jz      loc_0040103F    ; return2
0040102A B901000000             mov     ecx,1
0040102F E8CCFFFFFF             call    fn_00401000     ; fact
00401034 8B1D00304000           mov     ebx,[403000h]   ; [temp]
0040103A 83EB01                 sub     ebx,1
0040103D EBE0                   jmp     loc_0040101F    ; loop

; store to EDX
0040101F 89DA                   mov     edx,ebx
00401021 83FB00                 cmp     ebx,0
00401024 7411                   jz      loc_00401037    ; return2
00401026 B901000000             mov     ecx,1
0040102B E8D0FFFFFF             call    fn_00401000     ; fact
00401030 89D3                   mov     ebx,edx
00401032 83EB01                 sub     ebx,1
00401035 EBE8                   jmp     loc_0040101F    ; loop


My impression is that a full explanation will have to account for the fact that we are talking about a Pentium 4 here. The P4 has no traditional code-cache, but has a trace-cache instead, with rather different rules.