LEA

Started by bomz, July 04, 2011, 10:00:07 PM

bomz


Does anybody use it? Say a few words about LEA.
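String2Dword proc uses ecx edi ebx edx esi String:DWORD

mov esi, String        ; esi -> zero-terminated decimal string
xor eax, eax           ; eax = running result
xor ecx, ecx           ; upper bits of ecx stay zero, so ecx = cl
@@:
mov cl, byte ptr[esi]  ; next character
cmp cl, 0
jz @F                  ; stop at the terminating zero

mov ebx, eax
shl eax, 2
add eax, ebx           ; eax = eax*5

shl eax, 1             ; eax = eax*10

sub cl, 48             ; ASCII digit -> value 0..9
add eax, ecx
add esi, 1
jmp @B
@@:
ret

String2Dword endp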



.386

.model flat, stdcall
option casemap :none

include \MASM32\INCLUDE\windows.inc
include \MASM32\INCLUDE\user32.inc
include \MASM32\INCLUDE\kernel32.inc
includelib \MASM32\LIB\user32.lib
includelib \MASM32\LIB\kernel32.lib

.data
form db "Number: %u", 0
String db "4294967295",0

.data?
buffer db 512 dup(?)

.code
start:
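lea esi, String            ; esi = address of String
xor eax, eax               ; eax = running result
xor ecx, ecx               ; upper bits of ecx stay zero
@@:
mov edx, eax               ; copy of eax for the *5 step
mov cl, byte ptr[esi]      ; next character
cmp cl, 0
jz @F                      ; stop at the terminating zero
lea eax, [4*EAX+EDX] ; eax = eax*5, 1 tick!!!
add esi, 1
lea eax, [2*EAX-48+ECX] ; eax = eax*2 + digit, 1 tick!!!

jmp @B
@@: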

invoke wsprintf,ADDR buffer,ADDR form,eax
invoke MessageBox,0,ADDR buffer,0,MB_ICONASTERISK
invoke ExitProcess,0
end start


The LEA version is about 1.5 times faster, measured in ticks.
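
For readers who want to check that the two loops really compute the same value, here is a rough model of one iteration of each, written in Python only because the arithmetic is easy to verify there. The function names (step_shift_add, step_lea, convert) are made up for this sketch, and it assumes the input contains only the digits "0" to "9".

```python
MASK32 = 0xFFFFFFFF  # registers are 32 bits wide, so results wrap around

def step_shift_add(eax, ch):
    """One iteration of the shl/add version (hypothetical model)."""
    ebx = eax
    eax = (eax << 2) & MASK32      # shl eax, 2
    eax = (eax + ebx) & MASK32     # add eax, ebx   -> eax*5
    eax = (eax << 1) & MASK32      # shl eax, 1     -> eax*10
    cl = (ch - 48) & 0xFF          # sub cl, 48
    return (eax + cl) & MASK32     # add eax, ecx

def step_lea(eax, ch):
    """One iteration of the LEA version (hypothetical model)."""
    edx = eax
    eax = (4 * eax + edx) & MASK32         # lea eax, [4*EAX+EDX]
    return (2 * eax - 48 + ch) & MASK32    # lea eax, [2*EAX-48+ECX]

def convert(step, text):
    """Run one of the step functions over every character of text."""
    eax = 0
    for ch in text.encode("ascii"):
        eax = step(eax, ch)
    return eax

print(convert(step_shift_add, "4294967295"), convert(step_lea, "4294967295"))
```

The model says nothing about timing; it only shows that the shorter instruction sequence gives the same result for digit input.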

dedndave

mov esi, String
lea esi, String

not the same thing

mov esi, String
loads ESI with the dword value at String

lea esi, String
loads ESI with the address of String

mov esi, offset String
loads ESI with the address of String

bomz

I know that. What I am talking about is that LEA can do, in 1 tick, one shift and two additions or subtractions, and move the result to a register. I never knew about this and I am not sure how to use it. I tried it and it works, but is it correct?

Maybe not in one tick, but quickly:

lea eax, [2*EAX-48+ECX]

redskull

LEA is just a way to make the CPU perform its fancy effective address calculation, i.e. "mov eax, [displacement+base+index*scale]", without actually moving anything in or out of memory.  It's most effective when the address requires the fancy calculations; without it, you would have to manually perform the additions and multiplications using the ADD and MUL instructions.  If you are just talking about regular, non-stack, non-array, data section values, LEA is equivalent to a MOV OFFSET.

The "trick" is that if you have to perform two additions and a multiply (by the allowed values), you don't have to be "calculating" an address at all; you can use it to calculate any result you care to know, and use it as a "super adder" instruction.
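
As a rough illustration of the "super adder" idea, the effective-address arithmetic can be modeled like this. It is a sketch in Python, not MASM; lea_model and its parameters are hypothetical names for this example, and it assumes 32-bit registers and a scale of 1, 2, 4 or 8.

```python
MASK32 = 0xFFFFFFFF  # registers are 32 bits wide, so results wrap around

def lea_model(base, index, scale, disp=0):
    """Hypothetical model of LEA: base + index*scale + disp, modulo 2**32."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale must be 1, 2, 4 or 8")
    return (base + index * scale + disp) & MASK32

eax = 12          # running value
ecx = ord("7")    # ASCII digit "7", i.e. 55

eax = lea_model(eax, eax, 4)         # lea eax, [4*eax+eax]    -> eax*5
eax = lea_model(ecx, eax, 2, -48)    # lea eax, [2*eax-48+ecx] -> eax*2 + digit
print(eax)                           # 12*10 + 7
```

Nothing here touches memory, which is the point: the address arithmetic is used purely as a calculation.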

-r
Strange women, lying in ponds, distributing swords, is no basis for a system of government

bomz

the situation becomes clear


AGI  PPlain  PMMX -- ????

redskull

Quote from: bomz on July 04, 2011, 11:35:53 PM
AGI  PPlain  PMMX -- ????

Are you really optimizing for P5?  Either way, an "Address Generation Interlock" happens when the CPU needs the value of one of the registers to calculate the address, but the result isn't ready yet.  Because P5s use "pairing" and not uops, you can't execute an instruction that calculates the value of a register at the same time as an instruction (e.g. LEA) that needs that value.  Newer CPUs don't really have it (well, they just have it in other forms).


bomz

Quote

AGI  PPlain  PMMX

Are you really optimizing for P5? 



I see that it works and is more effective, but I don't know what P5, AGI, PPlain and PMMX are.

Does this not work on a Pentium 4? I have a Pentium 4.

bomz

lea eax, [2*EAX-48+ECX]
Does this make sense, or should I keep the following?
mov ebx, eax
shl eax, 2
add eax, ebx
shl eax, 1
sub cl, 48
add eax, ecx


?????????????????????????

redskull

Quote from: bomz on July 05, 2011, 12:17:54 AM
I see that it works and is more effective, but I don't know what P5, AGI, PPlain and PMMX are.
Does this not work on a Pentium 4? I have a Pentium 4.

P5 - CPU microarchitecture including the original Pentium chip (PPlain) and the Pentium with MMX extensions (PMMX).  AGI is the Address Generation Interlock previously described.

Pentium 4 (PIV) is built on the "Netburst" microarchitecture, which also includes the Pentium D.  I personally have very little experience with Netburst, so another member would be better equipped to answer the question.  I would assume it doesn't have the AGI problem, as I believe it uses the same basic uop setup as the P6-based chips.


raymond

A few suggestions.

Instead of:
   cmp cl, 0
   jz @F
Do:
   sub cl,48
   jc  @F

That would exit the conversion on any character with an ASCII value lower than "0", including the terminating zero. If you ever intend other people to use your app, you should also add the following error check so that conversion stops at any non-numeric character:
   cmp cl,9
   ja  @F

Also, instead of:
   mov edx, eax
   lea eax, [4*EAX+EDX]

You can do:
   lea eax,[eax*4+eax]

and, since cl would already be converted to binary, you would only need:
   lea eax,[eax*2+ecx]
resulting in a reduction of code size by 4 bytes.
When you assume something, you risk being wrong half the time
http://www.ray.masmcode.com
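
Putting these suggestions together, the loop might behave like the following sketch. It is a Python model of the register arithmetic, not MASM; string_to_dword is a hypothetical name, and the model assumes a zero-terminated ASCII string and 32-bit wraparound in eax.

```python
MASK32 = 0xFFFFFFFF  # eax is 32 bits wide, so results wrap around

def string_to_dword(text):
    """Model of the suggested loop: stop at the first non-digit character."""
    eax = 0
    for ch in text.encode("ascii") + b"\0":   # zero-terminated, like the MASM string
        cl = (ch - 48) & 0xFF                 # sub cl, 48 (8-bit wraparound)
        if ch < 48:                           # jc @F: character below "0", or the zero
            break
        if cl > 9:                            # cmp cl, 9 / ja @F: character above "9"
            break
        eax = (eax * 4 + eax) & MASK32        # lea eax, [eax*4+eax]
        eax = (eax * 2 + cl) & MASK32         # lea eax, [eax*2+ecx]
    return eax

print(string_to_dword("4294967295"))
print(string_to_dword("12a3"))
```

Note that the model, like the assembly, does not detect overflow: a value above 4294967295 simply wraps around.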

bomz

Bytes... I have a 500 GB HDD and 2.5 GB of memory. What matters is how many ticks it needs. It's better to add 10 MB and do the same thing 10 times faster.


lea eax, [4*EAX+EAX] - This works. I didn't know whether it is possible to use the same register twice, or whether that violates the rule that using a register changed in the previous ticks makes the processor stall. If you need to save bytes, use mul 10.


lea eax, [4*EAX+EAX] - my processor does this slightly faster than lea eax, [4*EAX+EDX], and the shift does not change the value of eax that gets added.

bomz

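lea edx, String            ; edx = address of the string
xor eax, eax               ; eax = running result
xor ecx, ecx               ; keep the upper bits of ecx zero
@@:
mov cl, byte ptr[edx]
sub cl, 48                 ; ASCII digit -> value, carry for anything below "0"
jc @F                      ; also exits on the terminating zero
lea eax, [4*EAX+EAX]       ; eax = eax*5
add edx, 1
lea eax, [2*EAX+ECX]       ; eax = eax*2 + digit
jmp @B                     ; loop back, as in the first version
@@: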


This code only needs registers that Windows API calls usually destroy anyway (eax, ecx, edx), so in many cases it doesn't need push/pop.

bomz

It's strange, but there is very little information about this use of LEA. As I understand it, the processor has an arithmetic part, but the LEA instruction does similar work when it computes the effective address, so we can use it for a restricted set of arithmetic operations, and it is quick because, for example, it doesn't change the flags. I can't find any mention of using the same register twice in the operand, but it really works. What I don't know is how it affects speed: using a register that was changed in the previous ticks may stall the processor. Maybe this only works on my old P4.

jj2007

Quote from: raymond on July 05, 2011, 03:22:20 AM
A few suggestions.
...
You can do:
   lea eax,[eax*4+eax]

and, since cl would already be converted to binary, you would only need:
   lea eax,[eax*2+ecx]
resulting in a reduction of code size by 4 bytes.


Ray,
That looks damn close to my favourite (read: fastest) ascii to float algo. Here is its first innermost loop:

Quote
IsDot1:   inc esi
   mov ecx, edx   ; first zero, then dotpos, if any
align 8         ; this loop is align 8 by default
   .Repeat
      movzx ebx, byte ptr [esi]   ; much faster than mov bl on P4 and Celeron M
      cmp ebx, "."
      je IsDot1
      cmp ebx, "9"   ; faster than cmp bl
      ja Done
      sub ebx, "0"   ; could move up, saves one byte with test ebx, ebx below but is ca. 1% slower
      js Done
      lea eax, [eax+4*eax]   ; *5 - imul much slower
      inc edx    ; dot pos count
      lea eax, [2*eax+ebx]   ; *2 (so *10 in total), plus new byte (...+ebx-48 plus cmp instead of sub: slower on CM)
      inc esi
   .Until edx>=8   ; zero flag set
   dec esi
Done:   ...; follows FPU part

bomz

My P4 runs the version with LEA 1.5 times faster than the version without it.

Quote
Field                       Value
CPU Properties
CPU Type                    Intel Pentium 4, 2266 MHz (17 x 133)
CPU Alias                   Northwood
CPU Stepping                C1
Instruction Set             x86, MMX, SSE, SSE2
Original Clock              2266 MHz
Min / Max CPU Multiplier    17x / 17x
Engineering Sample          No
L1 Trace Cache              12K Instructions
L1 Data Cache               8 KB
L2 Cache                    512 KB  (On-Die, ECC, ATC, Full-Speed)


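lea edx, String
xor eax, eax
;xor ecx, ecx              ; not needed: movzx clears the upper bits of ecx
@@:
movzx ecx, byte ptr[edx]
sub cl, 48                 ; ASCII digit -> value, carry for anything below "0"
jc @F
lea eax, [4*EAX+EAX]       ; eax = eax*5
add edx, 1
lea eax, [2*EAX+ECX]       ; eax = eax*2 + digit
jmp @B                     ; loop back, as in the first version
@@: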


Any code can be optimized.