Fast Memory Clear

Started by Dinosaur, September 14, 2005, 07:31:12 AM


Dinosaur

Hi all

The code below is used to clear a memory block from 4 MB to 6 MB in unreal-mode flat memory.
However, it costs me a lot of time (25 msec on a 300 MHz CPU).
Can someone point out how to improve this?

.MODEL MEDIUM
.486
.CODE
PUBLIC    ClrMem
;------------------22 msec on 300mhz cpu
ClrMem PROC    FAR
PUSH    BP
MOV     BP,SP
;------------
PUSH    EAX
PUSH    ECX
PUSH    EDX
PUSH    DS
;------------
cli                                                 ;absolutely required.
xor       eax,eax
xor       edx,edx
mov      ds,ax                                 ;DS = AX = 0  ..Use UNREAL LINEAR ADDRESSING !
mov      eax,400000h                      ;Start is beginning of 4th MegaByte
mov      ecx,200000h / 4                ;do it up to start of 6th Megabyte
Clear:
mov      dword ptr DS:[EAX],EDX     ;Clear memory !!!
add       eax,4
loopd    Clear
sti
    ;------------
POP     DS
POP     EDX
POP     ECX
POP     EAX
;---------
POP     BP
RET     
ClrMem ENDP
;----------------------
END



Regards

MichaelW

Instead of

  loopd Clear

You could try

  dec  ecx
  jnz  Clear

And you could try eliminating the DS override. DS is the default, and if the override is actually being encoded it could slow the loop down.


eschew obfuscation

Dinosaur

Michael, it is actually 1 msec slower.

I was hoping for some radical memory clearing method that I was not aware of.
Otherwise I will have to clear only the area that I have used each time, which may be 1 MB less,
or clear each address just before I write to it.

Regards

MichaelW

You could try clearing more than one dword per loop: [EAX], [EAX+4], etc (with appropriate adjustments to the address increment and loop count).




Gustav


or clear 16 bytes instead of 4:


.MODEL MEDIUM
.486
.CODE
PUBLIC    ClrMem
;------------------22 msec on 300mhz cpu
ClrMem PROC    FAR
PUSH    BP
MOV     BP,SP
;------------
PUSH    EAX
PUSH    ECX
PUSH    EDX
PUSH    DS
;------------
cli                                                 ;absolutely required.
xor       eax,eax
xor       edx,edx
mov      ds,ax                                 ;DS = AX = 0  ..Use UNREAL LINEAR ADDRESSING !
mov      eax,400000h                      ;Start is beginning of 4th MegaByte
lea        ecx, [eax+200000h]
Clear:
mov      dword ptr [EAX],EDX
mov      dword ptr [EAX+4],EDX
mov      dword ptr [EAX+8],EDX
mov      dword ptr [EAX+12],EDX
add       eax,16
cmp      eax, ecx
jnz        Clear
sti
    ;------------
POP     DS
POP     EDX
POP     ECX
POP     EAX
;---------
POP     BP
RET     
ClrMem ENDP
;----------------------
END


Dinosaur

Thank you Gustav, but
I must be limited by memory write speed or something, because
I can't get below 22 msec.

The timing I am doing is a simple loop, using Timer Tick.


    print dostime&(0)
    FOR XQ = 1 TO 100
        Clrmem
    NEXT XQ
    print dostime&(0)


Regards

Tedd

It's just a thought, but try using 16-bit registers instead of 32-bit ones.
In 16-bit mode, instructions that use 32-bit registers need a prefix byte, so they carry a small penalty.
I'm not sure how much this contributes, but it's worth trying.
No snowflake in an avalanche feels responsible.

Gustav


Most likely the problem is that dostime uses the IRQ 0 timer, which has very limited resolution.

More accurate time measurement can be achieved by one of the following:

1. Use the rdtsc opcode.
2. Use the timer at IRQ 8 (BIOS int 15h, ah=86h).
3. Reprogram the IRQ 0 timer.
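
As a rough sketch of option 1, the routine below reads the time-stamp counter before and after one call to ClrMem and stores the difference. TimeClr, TscStart and TscDelta are hypothetical names made up for this example. It assumes a CPU that supports RDTSC (Pentium class or later) and that DS points at this module's data segment on entry.

```asm
.MODEL MEDIUM
.586                                    ; RDTSC needs a Pentium-class CPU
EXTRN   ClrMem:FAR                      ; the routine from the first post
.DATA
PUBLIC  TscDelta
TscStart DD 0, 0                        ; hypothetical: counter before the call (low, high)
TscDelta DD 0, 0                        ; hypothetical: elapsed cycles (low, high)
.CODE
PUBLIC  TimeClr
;------------------ times one call to ClrMem in CPU clock cycles
TimeClr PROC    FAR
PUSH    EAX
PUSH    EDX
;------------
rdtsc                                   ;EDX:EAX = time-stamp counter
mov     dword ptr TscStart,eax
mov     dword ptr TscStart+4,edx
call    ClrMem                          ;ClrMem preserves EAX, ECX, EDX and DS
rdtsc
sub     eax,dword ptr TscStart          ;64-bit subtract: low dword first,
sbb     edx,dword ptr TscStart+4        ;then high dword with borrow
mov     dword ptr TscDelta,eax
mov     dword ptr TscDelta+4,edx
;------------
POP     EDX
POP     EAX
RET
TimeClr ENDP
;----------------------
END
```

On a 300 MHz CPU, dividing the low dword of TscDelta by 300000 gives the time in msec, so a 22 msec call should show up as roughly 6.6 million cycles. The figure includes the overhead of the far call and of the second rdtsc, which is negligible at this scale.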

Dinosaur

I doubt that the timer is to blame.
The changes suggested should have made several msec of difference, and they didn't.
The error in the timing would have to be quite large, and I should see different values for different tests.
But doing the same test 20 times gives me the same reading.

The test takes 42 ticks for each run of 100 loops (about 23 msec per call, assuming the standard 18.2 ticks per second).

Regards



MichaelW

Running this code on a P3:

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
    .586
    include timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
      buffer    dd 1024 dup (0)
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    counter_begin 1000000, HIGH_PRIORITY_CLASS
      xor   edx, edx
      mov   eax, OFFSET buffer
      mov   ecx, 1024/4
    Clear:
      mov   dword ptr DS:[EAX],EDX
      add   eax,4
      loopd Clear
    counter_end
    print ustr$(eax)
    print chr$(" cycles",13,10)
    mov   eax, input(13,10,"Press enter to exit...")
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start

I determined that the loop consumes about 1580 clock cycles when clearing 1024 bytes. So for 2 MB (2048 blocks of 1024 bytes) the loop should consume about 3,235,840 clock cycles, or about 11 ms on a 300 MHz processor with instruction timings similar to a P3. The timings for real mode might be somewhat different, but I doubt that the difference would be large. So I think your suspicion that the loop is limited by memory write speed is correct.

You might be able to verify this by temporarily switching to faster memory/cache timings.


Dinosaur

Actually, I should have put brain into gear before mouth.

The improvements would have to cut at least 25 msec off the total before the tick count would change.
So, Gustav, you are correct: I would not see any improvement until it amounted to more than half a tick.

Regards

ninjarider

I can't remember where I found it, but I remember there's a U pipe and a V pipe. Maybe you could try pairing the instructions together. You might have to add a little more to your loop to get them to pair, but it could make your code faster.

The only other thing I can think of is that it's time to upgrade your RAM to DDR400.

Dinosaur

Michael, that was verified.

Changing the SDRAM timing made a difference of 11 ticks over the 100 loops.
That's now 17 msec per 2 MB.

Knowing that, I guess improvements will have to come from other areas.

Thanks to all.

Regards