Fast Memory Clear

Dinosaur · September 14, 2005, 07:31:12 AM

Hi all

The code below clears a memory block from 4 MB to 6 MB in unreal (flat) mode.
However, it costs me a lot of time (25 msec on a 300 MHz CPU).
Can someone point out how to improve this?

Code:

        .MODEL MEDIUM
        .486
        .CODE
        PUBLIC  ClrMem
        ;------------------22 msec on 300mhz cpu
ClrMem  PROC    FAR
        PUSH    BP
        MOV     BP,SP
        ;------------
        PUSH    EAX
        PUSH    ECX
        PUSH    EDX
        PUSH    DS
        ;------------
        cli                             ;absolutely required.
        xor     eax,eax
        xor     edx,edx                 ;EDX = 0, the fill value
        mov     ds,ax                   ;DS = AX = 0  ..Use UNREAL LINEAR ADDRESSING !
        mov     eax,400000h             ;start at the 4 MB boundary
        mov     ecx,200000h / 4         ;2 MB as a dword count, so the loop ends at 6 MB
Clear:
        mov     dword ptr DS:[EAX],EDX  ;Clear memory !!!
        add     eax,4
        loopd   Clear
        sti
        ;------------
        POP     DS
        POP     EDX
        POP     ECX
        POP     EAX
        ;---------
        POP     BP
        RET
ClrMem  ENDP
;----------------------
        END

Regards

MichaelW · September 14, 2005, 09:06:08 AM

Instead of

loopd Clear

You could try

dec ecx
jnz Clear

And you could try eliminating the DS override. DS is the default, and if the override is actually being encoded it could slow the loop down.
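
Put together, the two suggestions might look like the sketch below. It assumes the same setup as the original routine (unreal mode already enabled, 4 MB to 6 MB range); the name ClrMem2 is hypothetical and only keeps it apart from ClrMem. Whether it is actually faster depends on the CPU and on whether the loop is limited by memory writes.

```asm
        .MODEL  MEDIUM
        .486
        .CODE
        PUBLIC  ClrMem2                 ; hypothetical name
ClrMem2 PROC    FAR
        push    eax
        push    ecx
        push    edx
        push    ds
        cli
        xor     edx,edx                 ; EDX = 0, the fill value
        mov     ds,dx                   ; DS = 0, flat addressing via unreal mode
        mov     eax,400000h             ; start at the 4 MB boundary
        mov     ecx,200000h / 4         ; 2 MB as a dword count
Clear:
        mov     dword ptr [eax],edx     ; DS is the default segment, no override
        add     eax,4
        dec     ecx                     ; DEC/JNZ pair replaces LOOPD
        jnz     Clear
        sti
        pop     ds
        pop     edx
        pop     ecx
        pop     eax
        ret                             ; assembled as a far return in a FAR proc
ClrMem2 ENDP
        END
```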

Dinosaur · September 14, 2005, 09:21:49 AM

Michael, it is actually 1 msec slower.

I was hoping for some radical memory-clearing method that I was not aware of.
Otherwise I will have to clear only the area that I have used each time, which may be 1 MB less,
or clear each address just before I write to it.

Regards

MichaelW · September 14, 2005, 09:34:11 AM

You could try clearing more than one dword per loop: [EAX], [EAX+4], etc (with appropriate adjustments to the address increment and loop count).

Gustav · September 14, 2005, 09:51:53 AM

Or clear 16 bytes per iteration instead of 4:

Code:

        .MODEL MEDIUM
        .486
        .CODE
        PUBLIC  ClrMem
        ;------------------22 msec on 300mhz cpu
ClrMem  PROC    FAR
        PUSH    BP
        MOV     BP,SP
        ;------------
        PUSH    EAX
        PUSH    ECX
        PUSH    EDX
        PUSH    DS
        ;------------
        cli                             ;absolutely required.
        xor     eax,eax
        xor     edx,edx                 ;EDX = 0, the fill value
        mov     ds,ax                   ;DS = AX = 0  ..Use UNREAL LINEAR ADDRESSING !
        mov     eax,400000h             ;start at the 4 MB boundary
        lea     ecx,[eax+200000h]       ;ECX = end address (6 MB boundary)
Clear:
        mov     dword ptr [EAX],EDX     ;four dword stores = 16 bytes per pass
        mov     dword ptr [EAX+4],EDX
        mov     dword ptr [EAX+8],EDX
        mov     dword ptr [EAX+12],EDX
        add     eax,16
        cmp     eax,ecx                 ;stop when the end address is reached
        jnz     Clear
        sti
        ;------------
        POP     DS
        POP     EDX
        POP     ECX
        POP     EAX
        ;---------
        POP     BP
        RET
ClrMem  ENDP
;----------------------
        END

Dinosaur · September 14, 2005, 10:01:11 AM

Thank you Gustav, but
I must be limited by memory write speed or something, because
I can't get below 22 msec.

The timing I am doing is a simple loop, using the timer tick.

Code:

    print dostime&(0)
    FOR XQ = 1 TO 100
        Clrmem
    NEXT XQ
    print dostime&(0)

Regards

Tedd · September 14, 2005, 10:26:31 AM

It's just a thought, but try using 16-bit registers instead of 32-bit ones.
In 16-bit mode, instructions that use 32-bit registers need a size prefix, so they carry a small penalty.
I'm not sure how much this contributes, but it's worth trying.

Gustav · September 14, 2005, 11:01:07 AM

Most likely the problem is that dostime uses the IRQ 0 timer, which has very limited resolution (about 55 ms per tick at the default rate).

More accurate time measurement can be achieved by:

1. Using the rdtsc opcode
2. Using the timer at IRQ 8 (BIOS int 15h, ah=86h)
3. Reprogramming the IRQ 0 timer
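
For option 1, a minimal sketch of a wrapper that counts clock cycles around one call might look like this. The name TimeClr and the variables StartLo/StartHi are hypothetical, and it assumes the CPU supports RDTSC, that DS addresses the data segment on entry, and that ClrMem is linked in from the first post.

```asm
        .MODEL  MEDIUM
        .586                            ; lets the assembler accept RDTSC
        EXTRN   ClrMem:FAR              ; routine under test
        .DATA
StartLo DD      ?                       ; hypothetical scratch variables
StartHi DD      ?
        .CODE
        PUBLIC  TimeClr                 ; hypothetical wrapper name
TimeClr PROC    FAR
        rdtsc                           ; EDX:EAX = time-stamp counter before
        mov     StartLo,eax
        mov     StartHi,edx
        call    ClrMem
        rdtsc                           ; EDX:EAX = time-stamp counter after
        sub     eax,StartLo
        sbb     edx,StartHi             ; EDX:EAX = elapsed clock cycles
        ret
TimeClr ENDP
        END
```

At 300 MHz, a 22 msec call would show up as roughly 6.6 million cycles, so the low dword in EAX is enough for a single call.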

Dinosaur · September 14, 2005, 11:47:04 AM

I doubt that the timer is to blame.
The suggested changes should have made milliseconds of difference, and they didn't.
The error in the timing would have to be quite large, and I should see different values for different tests.
But doing the same test 20 times gives me the same reading.

The test takes 42 ticks per 100 loops.

Regards

MichaelW · September 14, 2005, 12:18:34 PM

Running this code on a P3:

Code:

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
    .586
    include timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
      buffer    dd 1024 dup (0)
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    counter_begin 1000000, HIGH_PRIORITY_CLASS
      xor   edx, edx
      mov   eax, OFFSET buffer
      mov   ecx, 1024/4
    Clear:
      mov   dword ptr DS:[EAX],EDX
      add   eax,4
      loopd Clear
    counter_end
    print ustr$(eax)
    print chr$(" cycles",13,10)
    mov   eax, input(13,10,"Press enter to exit...")
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start

I determined that the loop consumes about 1580 clock cycles when clearing 1024 bytes. So for 2MB the loop should consume about 3235840 clock cycles, or about 11ms on a 300MHz processor with instruction timings similar to a P3. The timings for real mode might be somewhat different, but I doubt that the difference would be large. So I think your suspicion that the loop is limited by memory write speeds is correct.
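
Spelled out, the estimate above works as follows. The only assumption is that the cycle count scales linearly from the 1024-byte test to 2 MB. Since a 1 KB buffer most likely stays in cache during the test, the result is best read as the cost of the instructions alone, without main-memory delays.

```latex
\begin{align*}
\text{cycles per byte} &= \frac{1580}{1024} \approx 1.54 \\
\text{cycles for 2 MB} &= 1580 \times \frac{2 \times 1024 \times 1024}{1024}
                        = 1580 \times 2048 = 3\,235\,840 \\
\text{time at 300 MHz} &= \frac{3\,235\,840}{300 \times 10^{6}\ \text{Hz}} \approx 10.8\ \text{ms}
\end{align*}
```

The measured 22 msec is about twice this figure, which is consistent with the loop waiting on memory writes.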

You might be able to verify this by temporarily switching to faster memory/cache timings.

Dinosaur · September 14, 2005, 12:33:12 PM

Actually I should have put brain into gear before mouth.

The improvements would have to cut at least 25 msec off the total before the tick count would change.
So, Gustav, you are correct: I would not see any improvement until it was more than half a tick.
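
As a rough check on the resolution, assuming dostime counts the standard BIOS timer tick (the 1.193182 MHz timer clock divided by 65536) and that the timer has not been reprogrammed:

```latex
\begin{align*}
T_{\text{tick}} &= \frac{65536}{1\,193\,182\ \text{Hz}} \approx 54.9\ \text{ms} \\
\text{half a tick} &= \frac{T_{\text{tick}}}{2} \approx 27.5\ \text{ms over the whole test} \\
\text{per call, 100 loops} &= \frac{T_{\text{tick}}}{100} \approx 0.55\ \text{ms for each full tick of change}
\end{align*}
```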

Regards

ninjarider · September 14, 2005, 12:47:49 PM

I can't remember where I found it, but I remember there is a V pipe and a U pipe. Maybe you could try pairing the instructions together; you might have to add a little more to your loop to get them to pair, but it could make your code faster.

The only other thing I can think of is that it's time to upgrade your RAM to DDR-400.

Dinosaur · September 14, 2005, 12:48:57 PM

Michael, that was verified.

Changing the SDRAM timing made a difference of 11 ticks over the 100 loops.
That's now 17 msec per 2 MB.
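
The 17 msec figure follows from the tick counts, assuming the standard BIOS tick of about 54.9 ms:

```latex
\begin{align*}
t_{\text{before}} &= \frac{42 \times 54.9\ \text{ms}}{100} \approx 23.1\ \text{ms} \\
t_{\text{after}} &= \frac{(42 - 11) \times 54.9\ \text{ms}}{100}
                  = \frac{31 \times 54.9\ \text{ms}}{100} \approx 17.0\ \text{ms}
\end{align*}
```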

Knowing that now, I guess improvements will have to come from other areas.

Thanks to all.

Regards
