LEA

bomz · July 04, 2011, 10:00:07 PM

Does anybody use it? Can someone say a few words about LEA?

String2Dword proc uses ecx edi ebx edx esi String:DWORD

	mov esi, String
	xor eax, eax
	xor ecx, ecx
@@:
	mov cl, byte ptr[esi]
	cmp cl, 0
	jz @F

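	mov ebx, eax	; keep a copy of the running total
	shl eax, 2	; eax * 4
	add eax, ebx	; eax * 5

	shl eax, 1	; eax * 10

	sub cl, 48	; ASCII digit -> value 0..9
	add eax, ecx	; add the new digit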
	add esi, 1
	jmp @B
@@:
	ret

String2Dword endp

Code Select

.386

.model flat, stdcall 
option casemap :none 

include \MASM32\INCLUDE\windows.inc
include \MASM32\INCLUDE\user32.inc
include \MASM32\INCLUDE\kernel32.inc
includelib \MASM32\LIB\user32.lib
includelib \MASM32\LIB\kernel32.lib

.data
form db "Number: %u", 0
String db "4294967295",0

.data?
buffer db 512 dup(?)

.code
start:
	lea esi, String
	xor eax, eax
	xor ecx, ecx
@@:
	mov edx, eax
	mov cl, byte ptr[esi]
	cmp cl, 0
	jz @F
	lea eax, [4*EAX+EDX]	;1 tick!!!
	add esi, 1
	lea eax, [2*EAX-48+ECX]	;1 tick!!!

	jmp @B
@@:

	invoke wsprintf,ADDR buffer,ADDR form,eax
	invoke MessageBox,0,ADDR buffer,0,MB_ICONASTERISK
	invoke ExitProcess,0
end start

The LEA version is 1.5 times faster in ticks.
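A rough way to check that the two loops above compute the same number is to model them in C. This is only a sketch: the function names are made up for illustration, the register names are just local variables, and C says nothing about how many ticks either version takes.

```c
#include <stdint.h>

/* Hypothetical model of the shl/add loop in String2Dword. */
static uint32_t str2dword_shift(const char *s)
{
    uint32_t eax = 0;
    for (; *s != 0; ++s) {
        uint8_t cl = (uint8_t)*s;    /* mov cl, byte ptr [esi] */
        uint32_t ebx = eax;          /* mov ebx, eax           */
        eax <<= 2;                   /* shl eax, 2   -> x*4    */
        eax += ebx;                  /* add eax, ebx -> x*5    */
        eax <<= 1;                   /* shl eax, 1   -> x*10   */
        cl = (uint8_t)(cl - 48);     /* sub cl, 48             */
        eax += cl;                   /* add eax, ecx           */
    }
    return eax;
}

/* Hypothetical model of the LEA loop from the test program. */
static uint32_t str2dword_lea(const char *s)
{
    uint32_t eax = 0;
    for (; *s != 0; ++s) {
        uint32_t edx = eax;            /* mov edx, eax            */
        uint32_t ecx = (uint8_t)*s;    /* mov cl, byte ptr [esi]  */
        eax = 4u * eax + edx;          /* lea eax, [4*eax+edx]    */
        eax = 2u * eax - 48u + ecx;    /* lea eax, [2*eax-48+ecx] */
    }
    return eax;
}
```

Both functions return 4294967295 for the "4294967295" string used in the test program. The unsigned arithmetic wraps modulo 2^32 in the same way the 32-bit registers do.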

dedndave · July 04, 2011, 11:12:22 PM

Code Select

mov esi, String
lea esi, String

not the same thing

Code Select

mov esi, String
loads ESI with the dword value at String

Code Select

lea esi, String
loads ESI with the address of String

Code Select

mov esi, offset String
loads ESI with the address of String
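The same distinction can be sketched in C, assuming String is a global byte array as in the test program. The helper names load_value and load_address are hypothetical and exist only for this illustration.

```c
#include <stdint.h>
#include <string.h>

static const char String[] = "4294967295";

/* Like "mov esi, String": reads the four bytes stored at String. */
static uint32_t load_value(void)
{
    uint32_t v;
    memcpy(&v, String, sizeof v);   /* copy the first dword of the data */
    return v;
}

/* Like "lea esi, String" or "mov esi, offset String": the address itself. */
static const char *load_address(void)
{
    return String;
}
```

load_value returns the bytes '4', '2', '9', '4' packed into one dword, which is useless as a pointer, while load_address returns something the conversion loop can actually walk through.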

bomz · July 04, 2011, 11:15:23 PM

This I know. I am talking about the fact that LEA can do one left shift, two additions or subtractions, and move the result to a register, all in 1 tick. I never knew about this and I am not sure how to use it. I tried it and it works, but is it correct?

Maybe not in one tick, but quickly.

Code Select

lea eax, [2*EAX-48+ECX]

redskull · July 04, 2011, 11:32:25 PM

LEA is just a way to make the CPU perform its fancy effective address calculation, i.e. "mov eax, [displacement+base+index*scale]", without actually moving anything in or out of memory. It's most effective when the address requires the fancy calculations; without it, you would have to manually perform the additions and multiplications using the ADD and MUL instructions. If you are just talking about regular, non-stack, non-array, data section values, LEA is equivalent to a MOV OFFSET.

The "trick" is that if you have to perform two additions and a multiply (by one of the allowed scale values: 1, 2, 4, or 8), you don't have to be "calculating" an address at all; you can use it to calculate any result you care to know, and use it as a "super adder" instruction.
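As a rough illustration of the "super adder" idea, the address calculation can be modelled as an ordinary C function. The helper lea32 is hypothetical, and the sketch assumes 32-bit registers and a scale of 1, 2, 4, or 8 (it does not check the scale).

```c
#include <stdint.h>

/* Hypothetical model of what LEA computes for [base + index*scale + disp].
   Assumes scale is 1, 2, 4 or 8; the result wraps modulo 2^32 like a
   32-bit register. No memory is read at the resulting "address". */
static uint32_t lea32(uint32_t base, uint32_t index, uint32_t scale, int32_t disp)
{
    return base + index * scale + (uint32_t)disp;
}
```

With this model, lea eax, [4*EAX+EDX] with edx equal to eax is lea32(x, x, 4, 0), which is x*5, and lea eax, [2*EAX-48+ECX] is lea32(ecx, eax, 2, -48).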

-r

bomz · July 04, 2011, 11:35:53 PM

the situation becomes clear

AGI PPlain PMMX -- ????

redskull · July 05, 2011, 12:10:06 AM

Quote from: bomz on July 04, 2011, 11:35:53 PM
AGI PPlain PMMX -- ????

Are you really optimizing for P5? Either way, an "Address Generation Interlock" happens when the CPU needs the value of one of the registers to calculate an address, but that value isn't ready yet. Because P5s use "pairing" and not uops, you obviously can't execute an instruction that calculates the value of a register at the same time as an instruction (e.g. LEA) that needs that value. Newer CPUs don't really have it (well, they just have it in other forms).

-r

bomz · July 05, 2011, 12:17:54 AM

Quote

AGI PPlain PMMX

Are you really optimizing for P5?

I see that it works and is more efficient, but I don't know what this P5, AGI, PPlain, PMMX is.

Does this not work on a Pentium 4? I have a Pentium 4.

bomz · July 05, 2011, 12:20:40 AM

Code Select

lea eax, [2*EAX-48+ECX]
Does this make sense, or should I leave it as

Code Select

mov ebx, eax
shl eax, 2
add eax, ebx
shl eax, 1
sub cl, 48
add eax, ecx

?????????????????????????

redskull · July 05, 2011, 12:40:56 AM

Quote from: bomz on July 05, 2011, 12:17:54 AM
I see that it works and is more efficient, but I don't know what this P5, AGI, PPlain, PMMX is.
Does this not work on a Pentium 4? I have a Pentium 4.

P5 - CPU microarchitecture including the original Pentium chip (PPlain) and the Pentium with MMX extensions (PMMX). AGI is the Address Generation Interlock previously described.

Pentium 4 (PIV) is built on the "Netburst" microarchitecture, which also includes the Pentium D. I personally have very little experience with Netburst, so another member would be better equipped to answer the question. I would assume it doesn't suffer from AGI stalls, as I believe it uses the same basic uop setup as the P6-based chips.

-r

raymond · July 05, 2011, 03:22:20 AM

A few suggestions.

Instead of:
cmp cl, 0
jz @F
Do:
sub cl,48
jc @F

That would exit the conversion on any character with an ASCII value lower than "0". If you ever intend other, unknown people to use your app, you should also add the following error check so that the conversion stops at any non-numeric character:
cmp cl,9
ja @F

Also, instead of:
mov edx, eax
lea eax, [4*EAX+EDX]

You can do:
lea eax,[eax*4+eax]

and, since cl would already be converted to binary, you would only need:
lea eax,[eax*2+ecx]
resulting in a reduction of code size by 4 bytes. :clap:
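A minimal C sketch of the loop with both checks in place, assuming the same 32-bit wraparound as the registers. The function name str2dword_checked is made up for illustration.

```c
#include <stdint.h>

/* Hypothetical model of the loop with both exit checks applied. */
static uint32_t str2dword_checked(const char *s)
{
    uint32_t eax = 0;
    for (;; ++s) {
        uint8_t cl = (uint8_t)*s;
        if (cl < 48)              /* sub cl,48 / jc @F : below "0", incl. the 0 terminator */
            break;
        cl = (uint8_t)(cl - 48);
        if (cl > 9)               /* cmp cl,9 / ja @F : above "9" */
            break;
        eax = eax * 4u + eax;     /* lea eax,[eax*4+eax] */
        eax = eax * 2u + cl;      /* lea eax,[eax*2+ecx] */
    }
    return eax;
}
```

Because the terminating zero is itself below "0", the borrow from the subtraction doubles as the end-of-string test, so no separate cmp cl, 0 is needed. For example, "12a4" converts to 12 and stops at the "a".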

bomz · July 05, 2011, 08:21:36 AM

Bytes... I have a 500 GB HDD and 2.5 GB of memory. What matters is how many ticks it needs. It's better to add 10 MB and do the same thing 10 times faster.

Code Select

lea eax, [4*EAX+EAX]

This works. I didn't know whether it is possible to use the same register twice, or whether that violates the rule that using a register written in the previous ticks makes the processor stall. If it's bytes you care about, use mul 10.

My processor executes lea eax, [4*EAX+EAX] slightly faster than lea eax, [4*EAX+EDX], and the scaling inside LEA does not change the value of eax that gets added.

bomz · July 05, 2011, 09:50:34 AM

Code Select

	lea edx, String
	xor eax, eax
	xor ecx, ecx
@@:
	mov cl, byte ptr[edx]
	sub cl, 48
	jc @F
	lea eax, [4*EAX+EAX]
	add edx, 1
	lea eax, [2*EAX+ECX]
	jmp @B
@@:

This code needs only registers that the Windows API usually destroys anyway (eax, ecx, edx), so in many cases it doesn't need push/pop.

bomz · July 05, 2011, 06:48:38 PM

Strange, but there is very little information about this use of LEA. As I understand it, the processor has an arithmetic unit, but the lea instruction does similar work when it computes an effective address, so we can use it for a restricted set of arithmetic operations, and it does them quickly because, for example, it doesn't change the flags. I can't find any mention of using the same register twice in one operand, but it really works. Whether reusing a register that was written in the previous ticks makes the processor stall is unknown to me. Maybe this works only on my old P4.

jj2007 · July 05, 2011, 07:02:55 PM

Quote from: raymond on July 05, 2011, 03:22:20 AM
A few suggestions.
...
You can do:
lea eax,[eax*4+eax]

and, since cl would already be converted to binary, you would only need:
lea eax,[eax*2+ecx]
resulting in a reduction of code size by 4 bytes. :clap:

Ray,
That looks damn close to my favourite (read: fastest) ascii to float algo. Here is its first innermost loop:

Quote
IsDot1:   inc esi
   mov ecx, edx   ; first zero, then dotpos, if any
align 8         ; this loop is align 8 by default
   .Repeat
      movzx ebx, byte ptr [esi]   ; much faster than mov bl on P4 and Celeron M
      cmp ebx, "."
      je IsDot1
      cmp ebx, "9"   ; faster than cmp bl
      ja Done
      sub ebx, "0"   ; could move up, saves one byte with test ebx, ebx below but is ca. 1% slower
      js Done
      lea eax, [eax+4*eax]   ; *5 - imul much slower
      inc edx    ; dot pos count
      lea eax, [2*eax+ebx]   ; *2 (so *10 in total), plus new byte (...+ebx-48 plus cmp instead of sub: slower on CM)
      inc esi
   .Until edx>=8   ; zero flag set
   dec esi
Done:   ...; follows FPU part

bomz · July 05, 2011, 07:31:04 PM

My P4 runs the lea version 1.5 times faster than the version without it.

Quote
Field   Value
CPU Properties
CPU Type   Intel Pentium 4, 2266 MHz (17 x 133)
CPU Alias   Northwood
CPU Stepping   C1
Instruction Set   x86, MMX, SSE, SSE2
Original Clock   2266 MHz
Min / Max CPU Multiplier   17x / 17x
Engineering Sample   No
L1 Trace Cache   12K Instructions
L1 Data Cache   8 KB
L2 Cache   512 KB (On-Die, ECC, ATC, Full-Speed)

Code Select

lea edx, String
xor eax, eax
;xor ecx, ecx
@@:
movzx ecx, byte ptr[edx]
sub cl, 48
jc @F
lea eax, [4*EAX+EAX]
add edx, 1
lea eax, [2*EAX+ECX]
jmp @B
@@:

Any code can be optimized.
