Hi all
The code below clears a memory block from 4 MB to 6 MB in unreal (flat) mode.
However, it takes a lot of time (25 msec on a 300 MHz CPU).
Can someone point out how to improve this?
.MODEL MEDIUM
.486
.CODE
PUBLIC ClrMem
;------------------22 msec on 300mhz cpu
ClrMem PROC FAR
PUSH BP
MOV BP,SP
;------------
PUSH EAX
PUSH ECX
PUSH EDX
PUSH DS
;------------
cli ;absolutely required.
xor eax,eax
xor edx,edx
mov ds,ax ;DS = AX = 0 ..Use UNREAL LINEAR ADDRESSING !
mov eax,400000h ;Start is beginning of 4th MegaByte
mov ecx,200000h / 4 ;do it up to start of 6th Megabyte
Clear:
mov dword ptr DS:[EAX],EDX ;Clear memory !!!
add eax,4
loopd Clear
sti
;------------
POP DS
POP EDX
POP ECX
POP EAX
;---------
POP BP
RET
ClrMem ENDP
;----------------------
END
Regards
Instead of
loopd Clear
You could try
dec ecx
jnz Clear
And you could try eliminating the DS override. DS is the default, and if the override is actually being encoded it could slow the loop down.
Michael, it is actually 1 msec slower.
I was hoping for some radical memory-clearing method that I was not aware of.
Otherwise I will have to clear only the area that I have used each time, which may be 1 MB less,
or clear each address before I write to it.
Regards
You could try clearing more than one dword per loop: [EAX], [EAX+4], etc. (with appropriate adjustments to the address increment and loop count).
Or clear 16 bytes per iteration instead of 4:
.MODEL MEDIUM
.486
.CODE
PUBLIC ClrMem
;------------------22 msec on 300mhz cpu
ClrMem PROC FAR
PUSH BP
MOV BP,SP
;------------
PUSH EAX
PUSH ECX
PUSH EDX
PUSH DS
;------------
cli ;absolutely required.
xor eax,eax
xor edx,edx
mov ds,ax ;DS = AX = 0 ..Use UNREAL LINEAR ADDRESSING !
mov eax,400000h ;Start is beginning of 4th MegaByte
lea ecx, [eax+200000h]
Clear:
mov dword ptr [EAX],EDX
mov dword ptr [EAX+4],EDX
mov dword ptr [EAX+8],EDX
mov dword ptr [EAX+12],EDX
add eax,16
cmp eax, ecx
jnz Clear
sti
;------------
POP DS
POP EDX
POP ECX
POP EAX
;---------
POP BP
RET
ClrMem ENDP
;----------------------
END
Thank you Gustav, BUT
I must be limited by memory write speed or something, because
I can't get below 22 msec.
The timing I am doing is a simple loop, using the timer tick.
print dostime&(0)
FOR XQ = 1 TO 100
Clrmem
NEXT XQ
print dostime&(0)
Regards
It's just a thought, but try using 16-bit registers instead of 32-bit ones.
In 16-bit mode, 32-bit instructions need a prefix, so they carry a small penalty.
I'm not sure how much this contributes, but it's worth trying.
Most likely the problem is that dostime uses the IRQ 0 timer, which has very limited resolution.
A more accurate time measurement can be obtained by one of these:
1. Use the rdtsc opcode
2. Use the timer at IRQ 8 (BIOS int 15h, ah=86h)
3. Reprogram the IRQ 0 timer
I doubt that the timer is to blame.
The changes suggested should have made milliseconds of difference, and they didn't.
The error in the timing would have to be quite large, and I should see different values for different tests.
But doing the same test 20 times gives me the same reading.
The test lasts 42 ticks per 100 loops.
Regards
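For reference, the tick count above can be converted to milliseconds per call. This is only a sketch: it assumes dostime&(0) returns the BIOS tick count and that the timer still runs at its default rate (PIT clock of 1,193,182 Hz divided by 65,536). The function names are made up for illustration.

```c
#include <assert.h>

/* Assumed PC timer defaults: PIT input clock and the BIOS divisor. */
#define PIT_HZ      1193182.0
#define PIT_DIVISOR 65536.0

/* Length of one BIOS timer tick in milliseconds (about 54.9 ms). */
double tick_ms(void)
{
    return PIT_DIVISOR / PIT_HZ * 1000.0;
}

/* Average duration of one call, given the ticks counted over `loops` calls. */
double ms_per_call(unsigned ticks, unsigned loops)
{
    assert(loops > 0);
    return ticks * tick_ms() / loops;
}
```

Under those assumptions, ms_per_call(42, 100) comes out at roughly 23 msec, in line with the 22 to 25 msec quoted in this thread. One tick spread over 100 calls is about 0.55 msec per call.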
Running this code on a P3:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
include \masm32\include\masm32rt.inc
.586
include timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.data
buffer dd 1024 dup (0)
.code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
counter_begin 1000000, HIGH_PRIORITY_CLASS
xor edx, edx
mov eax, OFFSET buffer
mov ecx, 1024/4
Clear:
mov dword ptr DS:[EAX],EDX
add eax,4
loopd Clear
counter_end
print ustr$(eax)
print chr$(" cycles",13,10)
mov eax, input(13,10,"Press enter to exit...")
exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start
I determined that the loop consumes about 1580 clock cycles when clearing 1024 bytes. So for 2MB the loop should consume about 3235840 clock cycles, or about 11ms on a 300MHz processor with instruction timings similar to a P3. The timings for real mode might be somewhat different, but I doubt that the difference would be large. So I think your suspicion that the loop is limited by memory write speeds is correct.
You might be able to verify this by temporarily switching to faster memory/cache timings in the BIOS setup.
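The estimate above is a linear scaling, which can be written out as a small sketch. The function names are hypothetical, and the scaling assumes the cost per byte stays constant from a 1024-byte buffer up to 2 MB, which ignores cache effects.

```c
#include <assert.h>

/* Scale a measured cycle count linearly from measured_bytes to total_bytes. */
unsigned long long scaled_cycles(unsigned long long measured_cycles,
                                 unsigned long long measured_bytes,
                                 unsigned long long total_bytes)
{
    assert(measured_bytes > 0);
    return measured_cycles * total_bytes / measured_bytes;
}

/* Convert a cycle count to milliseconds at the given clock rate in Hz. */
double cycles_to_ms(unsigned long long cycles, double cpu_hz)
{
    assert(cpu_hz > 0.0);
    return (double)cycles / cpu_hz * 1000.0;
}
```

With 1580 cycles per 1024 bytes, scaled_cycles(1580, 1024, 2097152) gives 3235840 cycles, and cycles_to_ms of that at 300e6 Hz is about 10.8 msec. That is roughly half of the measured time, which is what points at memory rather than the instruction stream.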
Actually, I should have put brain into gear before mouth.
The improvements would have to cut at least 25 msec off the total before the tick count would change.
So, Gustav, you are correct: I would not see any improvement until it was more than 1/2 of a tick.
Regards
I can't remember where I found it, but I remember there's a V pipe and a U pipe. Maybe you could try pairing the instructions together. You might have to add a little more to your loop to get it to fit, but it would make your code faster.
The only other thing I can think of is that it's time to upgrade your RAM to DDR 400.
Michael, that was verified.
Changing the SDRAM timing made 11 ticks of difference on the 100 loops.
That's now 17 msec per 2 MB.
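As a rough cross-check, the tick counts can be turned into an effective write rate. This sketch assumes the default BIOS tick length (65,536 / 1,193,182 s) and that the new reading is 42 - 11 = 31 ticks per 100 loops, which is inferred from the figures above rather than stated. The function name is made up.

```c
#include <assert.h>

/* Assumed default BIOS tick length in milliseconds (about 54.9 ms). */
#define TICK_MS (65536.0 / 1193182.0 * 1000.0)

/* Effective write rate in MiB/s for `bytes` cleared per call,
   timed as `ticks` timer ticks over `loops` calls. */
double mib_per_second(unsigned long bytes, unsigned ticks, unsigned loops)
{
    assert(ticks > 0 && loops > 0);
    double seconds_per_call = ticks * TICK_MS / 1000.0 / loops;
    return (bytes / (1024.0 * 1024.0)) / seconds_per_call;
}
```

On those assumptions, mib_per_second(2097152, 42, 100) is about 87 MiB/s before the SDRAM change and mib_per_second(2097152, 31, 100) is about 117 MiB/s after it.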
Knowing that now, I guess improvements will have to come from other areas.
Thanks to all.
Regards