Why are there mnemonics for strings like: LODS, LODSB, REP, STOS, STOSB?
Couldn't a few simple MOVs do the same job?
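mov al, byte ptr [esi + ecx] ; index must be a 32-bit register (ch cannot be used in an address)
... ; edit the character, do work..
mov byte ptr [edi + ecx], al
inc ecx ; advance the index only after the store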
the string operations can be very fast
especially when you want to copy a large section of data or clear out a large area of memory
the ESI register points to the source and EDI points to the destination - they are incremented or decremented automatically for you
the ECX register holds the count if a REP prefix is used (REP repeats until ECX reaches 0; REPZ/REPE also stops as soon as the zero flag is clear; REPNZ/REPNE also stops as soon as the zero flag is set)
the direction flag controls up (CLD) or down (STD)
you can move (MOVS), scan (SCAS), compare (CMPS), load (LODS), or store (STOS) bytes, words, or dwords
there are also string I/O instructions (INS/OUTS) - somewhat useless
for some of the instructions, the AL/AX/EAX register is used for data
from Randall Hyde's Art of Assembly:
http://www.arl.wustl.edu/~lockwood/class/cs306/books/artofasm/Chapter_6/CH06-4.html#HEADING4-162
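to show how the pieces above fit together (direction flag, EDI, AL, ECX, and a REP-style prefix), here is a minimal sketch that measures a zero-terminated string with REPNE SCASB. it assumes a 32-bit flat model, and "some_string" is a hypothetical label, not something from this thread

```asm
; sketch only: length of a zero-terminated string using REPNE SCASB
; "some_string" is a hypothetical label for a zero-terminated byte string
    cld                          ; direction flag clear: EDI counts upward
    mov edi, offset some_string  ; SCAS always reads from [EDI]
    xor eax, eax                 ; AL = 0, the byte we are searching for
    mov ecx, -1                  ; count = 0FFFFFFFFh, effectively no limit
    repne scasb                  ; compare AL with [EDI], EDI + 1, ECX - 1, stop on a match
    not ecx                      ; ECX = bytes scanned, including the terminator
    dec ecx                      ; ECX = length without the terminator
```

for a string of n characters the scan runs n+1 times, so ECX ends at -(n+2); NOT gives n+1 and DEC gives n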
Oh! That is neat! One instruction can replace three different ones. (REPNZ)
you should play with them a little bit - lol
here is an example - i want to make a copy of a string...
cld
mov esi,offset source_string
mov edi,offset destination_string
mov ecx,number_of_bytes
rep movsb
copying by words or dwords instead of bytes is a little faster, i think - it was faster on an 8088 to copy words, at least
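a sketch of what the dword version of that copy might look like, reusing the same names as the example above (source_string, destination_string, number_of_bytes). the split into a dword part and a byte tail is one common way to do it, not the only one

```asm
; sketch only: copy number_of_bytes bytes, 4 at a time, then the leftovers
    cld
    mov esi, offset source_string
    mov edi, offset destination_string
    mov ecx, number_of_bytes
    mov edx, ecx                 ; keep a copy of the byte count
    shr ecx, 2                   ; ECX = number of whole dwords
    rep movsd                    ; copy 4 bytes per iteration
    mov ecx, edx
    and ecx, 3                   ; ECX = 0 to 3 bytes left over
    rep movsb                    ; copy the tail (does nothing when ECX = 0)
```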
here is another example - i want to clear out 32 KB of memory...
cld
mov edi,offset memory_to_clear
xor eax,eax
mov ecx,8192 ;8192 dwords = 32 KB
rep stosd
With just simple mnemonics, I can create this:
UpThree proc uses esi edi edx ecx lpszSrc:DWORD, lpszDest:DWORD, dwCount:DWORD
mov esi, lpszSrc
mov edi, lpszDest
mov edx, dwCount
xor ecx, ecx
mov al, 3
@@:
mov ah, byte ptr [esi + ecx]
cmp ah, 0
jz @F
add ah, al
mov byte ptr [edi + ecx], ah
inc ecx
cmp ecx, edx
je @F
jmp @B
@@:
ret
UpThree endp
I just can't understand how to optimize it with those higher mnemonics (rep, lodsb)
i am not sure the string instructions can be applied here - at least, not in a way that makes things go faster
you could use lodsb and stosb for single bytes, but without the REP prefix, they are kinda slow
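for reference, this is roughly what a LODSB/STOSB version of the routine would look like. it is only a sketch to show the shape of the loop: the name UpThreeLods and its labels are made up here, and, as said above, it is not expected to be faster than plain MOVs

```asm
; sketch only: UpThree rewritten with LODSB/STOSB (no REP prefix possible,
; because each byte has to be tested and modified between load and store)
UpThreeLods proc uses esi edi lpszSrc:DWORD, lpszDest:DWORD, dwCount:DWORD
    mov esi, lpszSrc
    mov edi, lpszDest
    mov ecx, dwCount
    test ecx, ecx
    jz done                      ; nothing to do for a count of 0
    cld                          ; make sure ESI/EDI count upward
next_char:
    lodsb                        ; AL = [ESI], ESI = ESI + 1
    test al, al
    jz done                      ; stop at the null terminator
    add al, 3
    stosb                        ; [EDI] = AL, EDI = EDI + 1
    dec ecx
    jnz next_char                ; stop after dwCount characters
done:
    ret
UpThreeLods endp
```

like the original, it does not write a terminator to the destination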
one thing i see is the way you maintain the loop count and branch at the end of the loop
the ECX register is traditionally used as a count register, so....
mov ecx,dwCount
.
.
loop_start:
.
.
dec ecx
jnz loop_start
that eliminates the need to compare ECX with EDX
the processor is happier moving data in and out of AL than AH (on some processors, using the high-byte registers costs a little extra)
also - the base+index addressing slows you down a little....
mov esi, lpszSrc
mov edi, lpszDest
mov ecx, dwCount
mov ah, 3
@@:
mov al,[esi]
or al,al
jz @F
add al,ah
inc esi
mov [edi],al
inc edi
dec ecx
jnz @B
@@:
ret
you could make the thing run faster by accessing all data in 4-aligned dwords
it would take a lot more code, though - you have to sort out the first few bytes until you are 4-aligned
then, load dwords and, in register, sort out if any of the bytes are 0
then, add 3 to 4 bytes at a time and store them as (again, 4-aligned) dwords
you can see where the code gets messy - but it could make the routine run quite a bit faster
it would take 3 loops
one to handle a few bytes at the beginning
one to handle the bulk of the string in dwords
and another to handle any leftover bytes at the end
the fact that you look for a null terminator OR a terminal count really throws a wrench in the works - lol
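the "sort out if any of the bytes are 0" step is usually done with a well-known bit trick: (x - 01010101h) AND (NOT x) AND 80808080h is non-zero exactly when at least one byte of x is zero. a minimal sketch as a stand-alone proc (the name HasZeroByte is made up for illustration, and it assumes the default prologue/epilogue)

```asm
; sketch only: returns EAX != 0 if any of the 4 bytes in dwValue is zero
HasZeroByte proc dwValue:DWORD
    mov eax, dwValue
    mov edx, eax
    sub eax, 01010101h           ; a zero byte wraps to 0FFh, setting its bit 7
    not edx                      ; bit 7 set only where the original byte was below 80h
    and eax, edx                 ; drop bytes that already had bit 7 set
    and eax, 80808080h           ; keep only the per-byte marker bits
    ret
HasZeroByte endp
```

in the real loop you would do this on a dword loaded from a 4-aligned address and branch to the byte loop as soon as the result is non-zero. adding 3 to four bytes at once with "add eax, 03030303h" has its own catch: any byte of 0FDh or more carries into its neighbor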
Oh! I see that "dec" can set a zero-flag.
Quote: "the base+index addressing slows you down a little."
Yeah, it does.
Are you packing data (4-aligned) to speed things up? Because that is what it looks like.
no - i just simplified your code - i am not packing anything - lol
the loop i posted is about as good as it will get while accessing the data as bytes
you have two different strings - one source - one destination
if one is aligned and the other is not, your underwear will get all bunchy - lol
Beware of the old string instructions: unless used in a very limited way, they can be very slow. There is special-case circuitry for REP MOVSB/MOVSW/MOVSD and a few others, but used individually they are way off the pace.
In most instances incremented pointer code is faster and even the special case circuitry of REP MOVSD can be beaten by MMX/XMM instructions.
dec ecx
jnz @B
nice trick, does it have to be ecx or can it be any register?
At a byte level here is the masm32 library procedure to do it.
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
comment * -----------------------------------------------
copied length minus terminator is returned in EAX
----------------------------------------------- *
align 4
szCopy proc src:DWORD,dst:DWORD
push ebp
push esi
mov edx, [esp+12]
mov ebp, [esp+16]
mov eax, -1
mov esi, 1
@@:
add eax, esi
movzx ecx, BYTE PTR [edx+eax]
mov [ebp+eax], cl
test ecx, ecx
jnz @B
pop esi
pop ebp
ret 8
szCopy endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
Quote from: E^cube on November 17, 2009, 06:21:52 AM
dec ecx
jnz @B
nice trick, does it have to be ecx or can it be any register?
It works with any register.
hutch,
Why use PUSH ESI, MOV EAX, -1, MOV ESI, 1 and POP ESI?
It could be:
; note: copied length minus terminator is returned in EAX
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
szCopy proc src:DWORD,dst:DWORD
push ebp
mov edx, [esp+8] ; src
mov ebp, [esp+12] ; dst
xor eax, eax
@@:
movzx ecx, BYTE PTR [edx+eax]
mov [ebp+eax], cl
add eax, 1
test ecx, ecx
jnz @B
sub eax, 1
pop ebp
ret 8
szCopy endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
Rui
Rui,
You have an error in the instructions that access the parameters:
mov edx, [esp+12]
mov ebp, [esp+16]
Because you have only one push at the top of the procedure, they should be:
mov edx, [esp+8]
mov ebp, [esp+12]
Hi MichaelW,
Yes i know, that should be
mov edx, [esp+8] ; src
mov ebp, [esp+12] ; dst
I used copy-paste and forgot to adjust the argument offsets.
Rui
Rui,
mov eax, -1
mov esi, 1
The "mov eax, -1" could be slightly shorter with "or eax, -1" but it hardly matters.
Presetting EAX with -1 and putting the ADD EAX before the byte copy means you don't have to correct the result on exit from the loop.
Using ESI to store 1 allows you to do the ADD on a register to register which is faster on some hardware than using an immediate. "add eax, esi"