szLen optimize...

Started by denise_amiga, May 31, 2005, 07:42:44 PM


Ratch

#150
Mark Jones,
Quote
The easiest solution is as Michael had said, simply time the proc on as many hardware(s) as you can, and just accept the results. :)

     I agree with just about everything you said in your last post.  Unfortunately I don't have the hardware or inclination to test my algo on a suite of platforms.  But no matter.  I usually try to just keep the highly used portions of the programs such as the loops as short as possible, and avoid the "bad" instructions like DIV, LOOP, REP SCASB, XLAT, etc.  Ratch

hutch--

I worked on an algo recently with Jim and we both use different hardware, mine is a 2.8 gig PIV and Jim was testing on a late model AMD pre 64 bit processor. We were getting different times on different code techniques which displayed the hardware differences so together we produced a version that averaged the best across both processors.

This is basically the joys of writing mixed model code where if you want general purpose code, you must test across a number of different machines and correlate the results.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

Codewarp

#152
Quote from: Jimg on August 18, 2005, 03:50:17 PM

...I do think it's really important to have general purpose routines.  There are very few cpus left which don't have mmx capability, but the number that can do sse2 is still a small percentage of the total.  SSE2 is fun and very appealing, but I certainly wouldn't use it without checking if the cpu was capable first.  And checking if the cpu is capable takes longer than any savings in the code itself on these little gp routines.

Jimg,

As I have mentioned before in other postings here, none of these routines should be using cpuid directly--ever!  Rather, the library startup code containing the routines in question should use cpuid to set feature-DWORDs testable from anywhere using a single memory test instruction.  This method reduces the test overhead to a single cycle or less.  My overhead for some of this is down to zero--because my library simply refuses to load if mmx and conditional moves are not supported, so no tests for these features are necessary.  This affords me complete freedom to employ sse, sse2 and sse3 as I see fit anywhere in my supporting libraries.
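A minimal C sketch of that startup-flag scheme (the flag name, init function and fallback path are my own illustration, not Codewarp's actual library code):

```c
#include <stddef.h>

/* Feature flag filled in once by library startup code.  Real code would
   execute CPUID here; this portable sketch just leaves the flag at the
   "not supported" answer. */
static int g_has_sse2 = 0;

void init_cpu_features(void)
{
    /* e.g. CPUID leaf 1: g_has_sse2 = (edx >> 26) & 1; */
}

size_t my_strlen(const char *s)
{
    if (g_has_sse2) {           /* one memory test, no CPUID in the hot path */
        /* dispatch to an SSE2 scanner here */
    }
    const char *p = s;          /* portable byte-at-a-time fallback */
    while (*p)
        p++;
    return (size_t)(p - s);
}
```

Every call pays at most a single load-and-test for the dispatch, which is the point of moving CPUID out to startup.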

Quote
Also, I agree, I really think our timing assumptions are seriously flawed.  Unfortunately, it's all we have.  There needs to be some way to find the real world performance of a routine as it is normally used.  We seldom call any of these routines over and over a million times in a tight loop.  The real timing test is how long does your code take when it is called just once and is not in the cache.  Does it take longer to execute a routine the first time because it is long and complicated than the time it saves by being tricky?  How do we test such a thing?  I've been playing with calling each routine once in sequence, then repeating the sequence and averaging the results, but I can't get any consistency at all.  Anyone have any ideas?

Clock cycles are the best measure we have of pure cpu cost of execution implied by a particular sequence of code.  If you muddy the waters by injecting the differences in cache contents, page faults, motherboard design, chipsets, cpu clock multipliers and financial resources for obtaining the fastest memory sticks, clock cycles will tell you little more than a State of the Union address--next to nothing.

Your desire for a real-world test is legitimate and shared by many others, but I am afraid that clock cycle counts are not it.  If clock counts represent the best case of a run, why not set up the conditions for the worst case, call the routine exactly once, then report the cycles consumed?  Unfortunately, every machine would report something different--not all that useful.  Our testing is not flawed at all--you have requirements that go beyond this particular metric, that's all.

Furthermore, some aspects that greatly affect execution time--such as cache utilization and locality of reference--cannot be tested outside the context of an entire running application, nor tested with clock cycles.  Real-world performance testing--look elsewhere. 

One of my favorite techniques is to determine the cost of a particular routine in a complete application context by weighing it down with additional known cost, then measuring the drag on overall performance in real time.  From this data you can compute the percentage of overall execution time taken by this routine, and hence the real time spent in it.  You can then compare the difference between your "idealized" clock performance and your real-time performance.  However, don't be fooled--this difference will vary across machines and other factors.
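The arithmetic behind that weighting technique can be sketched in C (a hedged illustration; the function name is mine): if adding a known copy of the routine's own cost to every call stretches a full run from t_base to t_weighted, the extra time equals the time originally spent inside the routine.

```c
/* Share of total runtime spent in a routine, inferred by "weighting":
   doubling the routine's per-call cost stretches a complete run from
   t_base to t_weighted seconds, and the growth equals the routine's
   own original time, so its fraction of the run is: */
double routine_share(double t_base, double t_weighted)
{
    return (t_weighted - t_base) / t_base;
}
```

For example, a run that grows from 10.0 s to 12.5 s would put 25% of execution time inside the routine.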

Snouphruh

what do you think about this code? :

szBuffer            db 'sdkjahgkyugkuygfkljashdgvlkasgdfkluygqweoiugalsdkf', 0
...
                    push offset szBuffer
                    call myStrLength
...
myStrLength:
                    pop edi                     ; edi = return address
                    pop esi                     ; esi = pointer to the string
                    push edi                    ; put the return address back
                    mov edx, esi                ; remember the start
ll_Loop:
                    mov al, byte ptr [esi]      ; fetch a byte
                    inc esi
                    test al, al                 ; terminator?
                    jnz ll_Loop
                    dec esi                     ; esi -> the null byte
                    mov eax, esi
                    sub eax, edx                ; length = end - start
                    retn

lingo

Snouphruh, :bg

"what do you think about this code?"

You can use eax and ecx registers (without preserving)
rather than edi and esi

myStrLength:
      or ecx, -1
      mov eax, [esp+4]        ; eax -> szBuffer
      sub ecx, eax            ; ecx = -szBuffer-1
ll_Loop:
      cmp byte ptr [eax], 0
      lea eax, [eax+1]        ; lea does not disturb the flags
      jnz ll_Loop
      add eax, ecx            ; length = (szBuffer+len+1) + (-szBuffer-1)
      ret 4
   

Regards,
Lingo

Snouphruh

nice! very nice!
but I heard LEA is slow.
and MOV EAX, [ESP + 4] is 4 bytes long.

what if...:

myStrLength:
                   or ecx, -1
                   pop edx         ; edx = return address
                   pop eax         ; eax = pointer to the buffer
                   sub ecx, eax
ll_Loop:
                   cmp byte ptr [eax], 0
                   lea eax, [eax + 1]
                   jnz ll_Loop
                   add eax, ecx
                   jmp edx


but my previous example has this loop body:

                   mov al, byte ptr [esi]
                   inc esi
                   test al, al
                   jnz ll_Loop

which takes only 2 CPU clocks, because MOV and INC pair, as do TEST and JNZ.

jdoe


Sorry for the old topic revival  :P

I just would like to have some timing results for these functions. Please don't forget to mention your processor (AMD/Intel).
These ones are much slower than the others, but I need to choose and I don't know which one is better on Intel.

Thanks a lot


AMD Athlon XP 1800+

Quote
lstrlenA return value     : 191
AzmtStrLen1A return value : 191
AzmtStrLen2A return value : 191

lstrlenA     : 439 cycles
AzmtStrLen1A : 210 cycles
AzmtStrLen2A : 203 cycles

Press any key to exit...



[attachment deleted by admin]

jj2007

Celeron M:

lstrlenA return value     : 191
AzmtStrLen1A return value : 191
AzmtStrLen2A return value : 191

lstrlenA     : 431 cycles
AzmtStrLen1A : 283 cycles
AzmtStrLen2A : 224 cycles

sinsi

q6600

lstrlenA     : 255 cycles
AzmtStrLen1A : 191 cycles
AzmtStrLen2A : 192 cycles

Light travels faster than sound, that's why some people seem bright until you hear them.

jdoe



Wow sinsi, I'm surprised that lstrlen doesn't perform worse than that.


Biterider

PIII 500

lstrlenA return value     : 191
AzmtStrLen1A return value : 191
AzmtStrLen2A return value : 191

lstrlenA     : 457 cycles
AzmtStrLen1A : 241 cycles
AzmtStrLen2A : 229 cycles

Biterider

Biterider

Hi
After looking into your code, the only problem I see is that if the string pointer is not aligned to 4 and the string is at the end of an allocated memory page, the algo can produce a GPF.
I suggest checking the lower 2 bits of the string pointer and jumping according to them into the comparison chain. First you have to clear the lower 2 bits and load ecx with the content of the aligned address.

Biterider
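Biterider's alignment fix can be sketched in C (assumptions: little-endian x86 and the GCC/Clang `__builtin_ctz` intrinsic; the function name is illustrative): align the pointer down so every 4-byte load stays inside memory the string occupies, and mask the bytes before the real start so they cannot read as terminators.

```c
#include <stddef.h>
#include <stdint.h>

/* Alignment-safe dword scan (a sketch, not the thread's actual algo):
   round the pointer down to a 4-byte boundary, force the 0..3 bytes
   that precede the real start to non-zero, then scan aligned dwords
   using the classic zero-byte test. */
size_t strlen_dword(const char *s)
{
    uintptr_t k = (uintptr_t)s & 3;                  /* misalignment 0..3 */
    const uint32_t *p = (const uint32_t *)(s - k);   /* aligned start     */
    uint32_t w = *p | (uint32_t)((1u << (8 * k)) - 1); /* mask lead-in bytes */

    for (;;) {
        /* sets 0x80 in each byte lane that holds a zero byte */
        uint32_t zero = (w - 0x01010101u) & ~w & 0x80808080u;
        if (zero) {
            const char *z = (const char *)p + (__builtin_ctz(zero) >> 3);
            return (size_t)(z - s);
        }
        w = *++p;
    }
}
```

Because a 4-byte load from a 4-aligned address never crosses a page boundary, the scan cannot fault on a string that ends at the edge of an allocated page.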

jdoe


You're right, I completely forgot to align the string on a 4-byte boundary.
This algo is made for aligned strings, so I'm not planning to change it... but thanks anyway.

:U







[attachment deleted by admin]

MichaelW

I added a (not very well tested) procedure that is essentially one posted by Mark Larson, roughly 3 years ago. I think my very early version of Windows 2000, even though it has the latest SP installed, has a slow version of lstrlen. This is running on a P3:

lstrlenA return value     : 191
AzmtStrLen1A return value : 191
AzmtStrLen2A return value : 191
markl_szlen return value  : 191

lstrlenA      : 826 cycles
AzmtStrLen1A  : 238 cycles
AzmtStrLen2A  : 227 cycles
markl_szlen   : 177 cycles



[attachment deleted by admin]
eschew obfuscation

jdoe

Seeing Michael's results, there is no doubt that coding our own string length algo is worthwhile.

The problem with Mark Larson's algo is that the string must be terminated by four null chars, not just one.