Global variables are slow/bad?

Started by RedXVII, August 19, 2006, 12:56:33 AM


RedXVII

1)  I seem to remember reading somewhere that global variables are slower than local variables. Is this true?

2)  Also, I saw someone doing something like this:

WndProc proc hWnd:HWND, uMsg:UINT, wParam:WPARAM, lParam:LPARAM
  ...
  Lots of code here
  ...
  mov hMain, eax
  ret

;my variables
hMain  dd   0

WndProc endp


This is to get around the problem that you can't store a local variable from a WndProc for the next loop (it also means there aren't so many global variables, making things look neater). Is putting the variable in the .code section slower (than a local or global variable), or will it cause lots of unforeseen problems?

Thanks  :U
RedXVII

hutch--

Red,

Memory is memory and once it is in close cache, there is no difference. The argument for using LOCAL variables is that the stack frame and/or the arguments have just been passed to the procedure, so they are already in cache.

The real difference between LOCAL and GLOBAL variables is the scope you need for a variable. If it will only ever be used within a single procedure, a LOCAL makes sense: it does not clutter up your .DATA section with so many names, and it makes re-entrant procedures easy to write. But if you need a variable to be visible across different procedures, a GLOBAL is the way to go.
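Hutch's scope distinction maps directly onto C, for anyone more familiar with that notation. A minimal sketch (the names `g_total`, `add_item` and `read_total` are invented for illustration):

```c
/* A global is visible to every procedure, like a name in .DATA;
   a local exists only for the duration of one call, like a LOCAL. */
int g_total = 0;            /* global: shared across procedures */

void add_item(int n)
{
    int doubled = n * 2;    /* local: lives on the stack, gone after return */
    g_total += doubled;
}

int read_total(void)
{
    return g_total;         /* a different procedure can see the global */
}
```

The local `doubled` cannot be named outside `add_item`, while `g_total` can be read or written from anywhere, which is exactly the scope trade-off described above.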
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

dsouza123

Globals and allocated (heap) storage are also used for programs with more than one thread.

In a single-threaded program that doesn't use the XMM (SSE2) registers,
and _IF_ the CPU has SSE2 support, the eight 128-bit SSE2 registers
can be used to hold data.
The eight registers can be viewed as 32 dword variables.
Using pshufd, the dwords inside an SSE2 register can be swapped, then
transferred to one of the regular registers (ECX, EDX, etc.) or written
to locals on the stack.
An SSE2 register could be used to hold values for a proc across calls,
like globals.

Like Hutch said memory is memory once in the L1 cache,
registers are even faster because they are internal to the CPU.
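In intrinsic form, the pshufd/movd trick above might look like the following C sketch (not from the thread; the helper `lane` is invented for illustration). It rotates the wanted dword into lane 0 with pshufd, then moves it to a general-purpose register with movd:

```c
#include <emmintrin.h>   /* SSE2 intrinsics: pshufd = _mm_shuffle_epi32,
                            movd = _mm_cvtsi128_si32 */

/* Treat one 128-bit register as four dword "variables" and read
   dword `i` back out (i must be 0..3). */
static int lane(__m128i reg, int i)
{
    switch (i) {
    case 1: reg = _mm_shuffle_epi32(reg, _MM_SHUFFLE(0, 3, 2, 1)); break;
    case 2: reg = _mm_shuffle_epi32(reg, _MM_SHUFFLE(1, 0, 3, 2)); break;
    case 3: reg = _mm_shuffle_epi32(reg, _MM_SHUFFLE(2, 1, 0, 3)); break;
    }
    return _mm_cvtsi128_si32(reg);   /* like: movd eax, xmm0 */
}
```

Eight such registers give 32 dword slots that survive across calls without touching memory at all, at the cost of the shuffle to get at the upper lanes.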

RedXVII

Thanks hutch and dsouza123.

What parts of the program are put into the cache then? Surely not the entire program; I don't think there's enough memory in the cache(s), by my last count, to fit everything.

As to question 2, the variable is in the .text (EDIT: a.k.a .code) section, right after my WndProc ret instruction. Is this potentially a bad thing to do?

Red

dsouza123

I've never used a .text section; the following code
is a very stripped-down part of an assembly program
with multiple sections.


.686
.model flat, stdcall
option casemap:none

include \masm32\include\windows.inc
include \masm32\include\kernel32.inc
include \masm32\include\user32.inc
include \masm32\include\shlwapi.inc

includelib \masm32\lib\kernel32.lib
includelib \masm32\lib\user32.lib
includelib \masm32\lib\shlwapi.lib

WinMain proto :DWORD, :DWORD, :DWORD, :DWORD

.const
  szMsg db "A fixed text string",0

.data
  initvar dd 0
  szBuf db 1024 dup (0)

.data?
  hInstance HINSTANCE ?
  uninitvar dd ?

.code
start:
  invoke GetModuleHandle, NULL
  mov hInstance, eax
  invoke WinMain, hInstance, NULL, eax, SW_SHOWDEFAULT
  invoke ExitProcess, eax

WinMain proc hInst:HINSTANCE, hPrevInst:HINSTANCE, lpszCmdLine:LPSTR, iCmdShow:DWORD
  LOCAL sBuf[256] : BYTE
  LOCAL dwPort : DWORD

  mov sBuf, 49      ; '1'
  mov sBuf+1, 50    ; '2'
  mov sBuf+2, 0     ; terminator, sBuf now holds "12"
  invoke  StrToInt, addr sBuf
  mov  dwPort, eax  ; dwPort = 12
  xor eax, eax
  ret
WinMain endp

end start

RedXVII

Looks like a normal program to me. I'm sorry, but I don't see the relevance, or how this helps answer my question.  :eek

dsouza123

I haven't seen a .text section so the example code was to show the types of sections that I've found in programs.

Is the .code section read-only, or can it be made so?

Would a system with a 64-bit CPU with DEP, running Windows XP SP2, have a problem
with part of the code section being modified/overwritten?

u

Global data doesn't slow down your app.
After you have some experience, you'll know when it's more convenient to have some data global or local.


I wouldn't recommend making your .text (a.k.a. .code) section RW. When I need some per-procedure globals (what C calls "static" variables), I do:


ExampleProc proc param1,param2
.data
ExampleProc_static1 dd 0
ExampleProc_static2 dd 0
.code
... (some code)
ret
ExampleProc endp
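For comparison, here is a minimal C sketch of the same idea (names invented): the .data-inside-a-proc trick corresponds to a C `static` local, which is initialized once at load time and keeps its value across calls, but is only nameable inside the procedure:

```c
/* Equivalent of declaring ExampleProc_static1 dd 0 inside the proc's
   own .data block: initialized once, persistent, procedure-private. */
int example_proc(void)
{
    static int static1 = 0;   /* lives in .data, not on the stack */
    static1++;                /* survives from call to call */
    return static1;
}
```

Each call sees the value left by the previous one, which is exactly what a LOCAL cannot do and what the inline `.data`/`.code` switch provides in MASM.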


A real example (copy/paste/trim: )

FS_EnumerateFoldersToMenu proc UseMost hMenu,lpszRootDir
local strSize,hFind,curItem
local DaString[260]:BYTE
local fd:WIN32_FIND_DATA

.data
FS_enumfoltomenu_bmp1 dd 0
.code
.if !FS_enumfoltomenu_bmp1
invoke LoadImage,hinst,107,IMAGE_BITMAP,0,0,LR_LOADTRANSPARENT
mov FS_enumfoltomenu_bmp1,eax
.endif


cmp lpszRootDir,0
je _ret


mov curItem,32768

lea eax,DaString
... (more code)


Btw, some time ago I started coding for a cursed OS where code couldn't have globals. I worked long and hard to make global data possible, while other coders were severely limited by having only locals. Thus, my company didn't have hard competition for quite a while :)  (globals ARE important)
Please use a smaller graphic in your signature.

RedXVII

Nice one Ultrano, thanks. I suppose I'll use a few globals like that. I don't like my code getting messy, and I hate scrolling up to add another global variable to my huge list, so I try to keep the number of them low. I was under the impression that the .data section wasn't in the processor cache, and was therefore slower to access. (Actually, can anyone clear that up for me?)

Cheers  :U

Mark Jones

Hi Red. Don't forget you can stick all your globals and includes into an .inc file and include that instead of having all of it in one file. :)

"Stack" memory is still system memory, just like .data, only defined elsewhere. .data might point to 00403000 while the stack points to 00120000 or so. The only advantages to using the stack are that PUSH and POP are generally very fast compared to a MOV with a memory operand, and that the data on the stack tends to stay in cache, if only because it is constantly being accessed... so unless you're coding tight loops which demand the utmost speed optimization, save yourself a lot of frustration and simply use globals for "global data" and locals for temporary (or "local") data. :U

Btw, run-time reserved memory (GlobalAlloc, VirtualAlloc, etc.) behaves like global data.
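A small C sketch of that point (using calloc as a stand-in for GlobalAlloc; all names invented): heap memory outlives the call that allocated it, so any procedure holding the pointer can treat it like a global:

```c
#include <stdlib.h>

/* One proc allocates; other procs read and write through the pointer,
   just as they would with a variable in .data. */
static int *counter;    /* hypothetical shared pointer to heap storage */

void counter_init(void) { counter = calloc(1, sizeof *counter); }
void counter_bump(void) { (*counter)++; }   /* a different proc mutates it */
int  counter_get(void)  { return *counter; }
```

The lifetime is the program's (until freed), not the call's, which is what makes heap allocations behave like globals rather than locals.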

Also, for caching performance analysis, check out AMD's CodeAnalyst. If you have an Athlon CPU this is an awesome tool which can tell you specific details about the cache state during code execution.
"To deny our impulses... foolish; to revel in them, chaos." MCJ 2003.08

Mark_Larson

Quote from: Ultrano on August 19, 2006, 09:53:48 PM
global data doesn't slow-down your app.

  Actually it can.  The problem occurs when you have a lot of data in your program that doesn't all fit into the cache.  If you have a global variable, it's not guaranteed to be in the cache.  I use locals whenever possible in large programs because of this.  Locals live on the stack, which is touched again every time you enter the procedure, so they are very likely to be in the cache.

BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
http://www.website.masmforum.com/mark/index.htm

u

Yes, and the same goes for dynamically-allocated data (HeapAlloc).

So, here RenderSound1 is faster... usually.


RenderSound1 proc
    local Sound[512]:real4
    ; some code here to fill-in the Sound buffer
    ;...
    ret
RenderSound1 endp


.data?
  RenderSound2_Sound real4 512 dup (?)
.code
RenderSound2 proc
    ; some code here to fill-in the RenderSound2_Sound buffer
    ;...
    ret
RenderSound2 endp


Yet, if you bear in mind the cache while you're coding your apps, you can always get the best results, regardless of data position.

dsouza123

AMD CPUs since the Athlon have had a split L1 cache, 128 KB total (64 KB code, 64 KB data);
there is also an L2 cache of varying size, 128 KB to 1 MB.

The current-generation Intel Core 2 microarchitecture has a 32 KB code, 32 KB data L1 cache;
the L2 cache is 1 MB or 2 MB.

All values are per core.

If a proc needs some or all of its variables to retain their values between calls,
then globals, heap allocation, or some register scheme needs to be used,
unless there is some way to keep locals on the stack from being cleared out
and then reloaded.  Maybe subroutines using an alternative proc convention that
doesn't require pushing values and reallocating stack space for locals on each call.

If the globals in total, or a highly used subset, are small enough (1, 2 or 4 KB),
would the CPU keep them in the L1 cache (or L2) even with the OS
and other programs getting their time slices?

How can the cache retention/hit rate for some data be determined ?
Are there APIs for it ?
Does the CPU keep track and if so can the info be accessed ?

u

[correct me if I'm wrong]
When the OS switches to another process's thread, the cache is invalidated (just marked as not valid). Since the CPU updates the SDRAM via the write-through method, it needn't flush the whole cache to RAM when task-switching.
So, when the OS switches back to your thread, the whole L1 and L2 are empty, and you start filling them in again.

"Write-through" basically means:
when the CPU executes
mov dword ptr [00401234h], 77777777h
it writes the 777... value into L1, L2 and RAM
(but first it fetches 16 bytes RAM->L2->L1 from address 00401230h, combines them with the new 777... dword, and writes the result to RAM).
On the newest CPUs, I'm not sure, but I guess the RAM bus is not 128-bit but 256-bit?

I'm not sure, but maybe only when you write a whole 16-byte XMM register to 16-byte-aligned RAM can you skip preloading the surrounding data into L1 and L2 (with a specific non-temporal store opcode from SSE or 3DNow!).
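The opcode in question is presumably movntdq (SSE2) or movntps (SSE). A hedged C sketch using the movntdq intrinsic (`stream_fill` is an invented helper): a streaming store writes a 16-byte-aligned block toward RAM without first pulling the cache line in:

```c
#include <emmintrin.h>   /* _mm_stream_si128 = movntdq, _mm_sfence = sfence */

/* Fill `count16` consecutive 16-byte blocks at a 16-byte-aligned
   destination with `value`, bypassing the caches on the way out. */
void stream_fill(int *dst_16aligned, int value, int count16)
{
    __m128i v = _mm_set1_epi32(value);       /* value in all four dwords */
    for (int i = 0; i < count16; i++)
        _mm_stream_si128((__m128i *)dst_16aligned + i, v);
    _mm_sfence();   /* order the streamed writes before later accesses */
}
```

This is what makes large fills and copies avoid evicting useful data from L1/L2, at the cost of the destination not being cached for immediate re-reading.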

NPNW

Ultrano,

I don't believe the CPU invalidates the cache when you switch to another thread or process. You can configure the CPU to do this, and it would depend on the operating system.

I would think the cache is better optimized around the code and its access patterns: the CPU has an algorithm that invalidates cache lines as needed. The best source to confirm this would be the Intel manuals.