My first useful Function, comments wanted

Started by rags, March 22, 2005, 05:49:29 PM

hutch--

Michael,

I have usually found that you get anomalies like this due to the alignment of the procedure entry point. It's probably a good idea to align all of the procs so they are less order-dependent, but the difference may in fact come from a larger alignment, depending on where the proc sits in the code section.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

Mark_Larson

Quote from: MichaelW on March 24, 2005, 07:44:00 AM
I modified _TrimLeadingSpaces replacing MOVSB, added an alignment directive to TrimLeadingSpaces, and added Hutch's and lingo's procedures. All four procedures seem to produce the same results, including the same returned buffer length. On the timing tests, I encountered an anomaly that I cannot explain. I had made a copy of Hutch's procedure because it was the fastest (on my P3), and before I started trying to further optimize it, I measured the cycle counts for all five procedures as:

_TrimLeadingSpaces: 563 cycles
TrimLeadingSpaces: 1235 cycles
TrimLe: 575 cycles
block_ltrim: 485 cycles
_block_ltrim: 664 cycles

The strange thing is that _block_ltrim is an exact copy of block_ltrim with only the name changed. If I swap the location of the procedure definitions, with _block_ltrim defined first, it returns the lower cycle count and block_ltrim the higher cycle count. If I call the same procedure from both timing loops, I get essentially the same cycle count for both loops, and the count is again determined by the relative position of the procedure definitions. I tried changing the alignment, and using a local copy of the MASM32 szCopy procedure, hopefully to get the procedures closer together address-wise, and neither produced any significant change. Does anyone have any ideas on what the problem might be?


  Go Hutch--!!!  I also noticed weird behavior, Michael. For some reason TrimLeadingSpaces is reporting in the 1600-cycle range on my home P4 but in the 1200 range on my work P4. And _TrimLeadingSpaces was running around 685, and now it's running around 1000. I tried bumping up the number of loops by 10 and that didn't help. So I was wondering if Norton AntiVirus has anything to do with it. It's running on my home laptop but not on my work system. If you also have NAV, maybe you should try keeping it from loading to see how it affects things.


Quote from: hutch-- on March 24, 2005, 08:38:01 AM
Michael,

I have usually found that you get anomalies like this due to the alignment of the procedure entry point. It's probably a good idea to align all of the procs so they are less order-dependent, but the difference may in fact come from a larger alignment, depending on where the proc sits in the code section.

  I missed Hutch's reply before I wrote up mine. For the P3 he's probably right, but the P4 doesn't have code alignment issues. So I am not sure why I am getting such different numbers between here and work (it's about 400 cycles longer at home, which means it is running 50% slower).
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
http://www.website.masmforum.com/mark/index.htm

lingo

Michael,  :bdg

I made a small correction to my proc TrimLe,
so please include it in your tests.

Regards,
Lingo

hutch--

Mark,

Something I have found timing different algos over time on two different PIVs is that alignment does affect the timings on a PIV. The problem is it has never been predictable: there are times when I have aligned a label and timed it and it dropped dead, and other times when it made the algo faster. It has also been affected by the size of the alignment; often I found that align 4 was faster than align 16.

There is probably a way to test this: copy the algo to an aligned, executable piece of memory and test it with big alignments like 4k, 16k etc. I have also found that preceding code at times affects the algo timings even though I have jumped over it to an aligned label. I get the impression that the innards of a PIV are more complex than the technical data suggests.

MichaelW

Mark,

Yes I do have NAV running. I'll run some tests without it later today.
eschew obfuscation

Mark Jones

Rags, you need to make a TrimTrailingSpaces now. :)
"To deny our impulses... foolish; to revel in them, chaos." MCJ 2003.08

rags

In these snippets from Mark's code I have a question:
Snippet 1:

  linestart:
    ;;lodsb
    ;lob           ; MASM32 macro, fast replacement for lodsb
    mov   al,[esi]
    add   esi,1

   
Snippet 2:

    .if   al == 10              ;lf?
      .if byte ptr[esi+1] == 13 ;cr?
;       movsb                   ;(this is a precaution
                                ;against a reversal of the
                                ;crlf bytes by some editors)
        mov   al,[esi]
        add   esi,1
        mov   [edi],al
        add   edi,1
      .endif
      jmp   linestart
    .endif


First off, in the first snippet, a character gets moved to the al and esi gets incremented by 1.

In the second snippet, al is checked to see if it holds an LF. If it does, the next byte in the buffer is checked
for a CR, by using .if byte ptr[esi+1] == 13, to see if the crlf byte order has been reversed.

My question is, since esi has already been incremented past the LF in the first snippet,
wouldn't checking esi+1 be checking 2 bytes past the LF, so that a reversed crlf would never be detected?


Rags

God made Man, but the monkey applied the glue -DEVO

MichaelW

Rags,

I think you found a potential problem in not just Raymond's code. When I reverse the order of CR and LF in my test buffer, lingo's procedure is the only one that doesn't break:

                                         I'm Larry,
                         this is my brother darryl,
and this is my other brother darryl.
  _TrimLeadingSpaces:
I'm Larry,
                         this is my brother darryl,
and this is my other brother darryl.
  buffer1 length:85       TrimLeadingSpaces:
I'm Larry,
                         this is my brother darryl,
and this is my other brother darryl.
  buffer1 length:85       TrimLe:
I'm Larry,
this is my brother darryl,
and this is my other brother darryl.
  buffer1 length:78       block_ltrim:
I'm Larry,
                         this is my brother darryl,
and this is my other brother darryl.
  buffer1 length:85       _block_ltrim:
I'm Larry,
                         this is my brother darryl,
and this is my other brother darryl.
  buffer1 length:85


And if I change the statement to:

.if byte ptr[esi] == 13 ;cr?

then Raymond's version also produces the correct result. I counted the bytes and the correct length is 78.

Mark,

The problem is not NAV. I also ran the test on my K5 system under Windows XP HE and got cycle counts of:
61
1375
-58
-9
18

I think the odd counts can be attributed to a very oddball processor, but the pattern seems to be similar.

raymond

I apologize for my error. :red  ESI is effectively pointing to the next character in my snippet and the instruction should have been:

.if byte ptr[esi] == 13

That, unfortunately, is what may occasionally happen with untested code. :boohoo:

Raymond
When you assume something, you risk being wrong half the time
http://www.ray.masmcode.com

Mark_Larson

Quote from: MichaelW on March 24, 2005, 08:27:31 PM
Rags,

I think you found a potential problem in not just Raymond's code. When I reverse the order of CR and LF in my test buffer, lingo's procedure is the only one that doesn't break:

Good spot, rags :)


Quote from: MichaelW on March 24, 2005, 08:27:31 PM
Mark,

The problem is not NAV. I also ran the test on my K5 system under Windows XP HE and got cycle counts of:
61
1375
-58
-9
18

I think the odd counts can be attributed to a very oddball processor, but the pattern seems to be similar.


Crud.  I also disabled NAV after posting this morning.  I got identical #'s.  I will try what Hutch-- suggested and try aligning the code (grumbles).  I've seen aligned code on the P4 be slower, but not faster.  Slower due to the extra code that gets added from the align.  If it does make it faster, I am going to go kick Intel.  They have a facility in Austin.

I am going to be so thankful when I switch to my Athlon (they also have a facility in town). I guess Donkey and I are going to have to write optimized Athlon-64 code together. Ah bliss...




MichaelW

Before this got lost completely, I went back and re-tested with Lingo's updated version:

_TrimLeadingSpaces: 568 cycles
TrimLeadingSpaces: 1238 cycles
TrimLe: 531 cycles
block_ltrim: 486 cycles
_block_ltrim: 651 cycles

Closer, but Hutch's version is still faster. I guess we'll never know why an identical procedure at the same alignment (I tried up to ALIGN 16) can take 34% more cycles to execute.

hutch--

Michael,

It can also be a factor of what has been run before it, so it may be worth changing the order in which the procedures are tested. I guess you could set up all of the procedures so they follow a separate identical algo, so they all have the same lead-in.

lingo

Michael,  :lol

"Before this got lost completely, I went back and re-tested with Lingo’s updated version:

_TrimLeadingSpaces: 568 cycles
TrimLeadingSpaces: 1238 cycles
TrimLe: 531 cycles
block_ltrim: 486 cycles
_block_ltrim: 651 cycles"


Thank you Michael,   :clap:
but I received different results:
CPU Pentium 4 3.6GHz Prescott
and Windows XP SP2


"I'm Larry,
this is my brother darryl,
and this is my other brother darryl.
  _TrimLeadingSpaces:
I'm Larry,
this is my brother darryl,
and this is my other brother darryl.
  buffer1 length:78       TrimLeadingSpaces:
I'm Larry,
this is my brother darryl,
and this is my other brother darryl.
  buffer1 length:78       TrimLe:
I'm Larry,
this is my brother darryl,
and this is my other brother darryl.
  buffer1 length:78       block_ltrim:
I'm Larry,
this is my brother darryl,
and this is my other brother darryl.
  buffer1 length:78       _block_ltrim:
I'm Larry,
this is my brother darryl,
and this is my other brother darryl.
  buffer1 length:78
Press enter to continue...

_TrimLeadingSpaces: 977 cycles
TrimLeadingSpaces: 1609 cycles
TrimLe: 699 cycles
block_ltrim: 962 cycles
_block_ltrim: 959 cycles

Press enter to exit...


Regards,
Lingo

MichaelW

Quote from: lingo on April 02, 2005, 02:29:11 PM
Thank you Michael,   :clap:
but I received different results:
CPU Pentium 4 3.6GHz Prescott
and Windows XP SP2

Sorry, it was late and I forgot to add that I was running the test on a P3. I should have expected that you would optimize for P4.


Mark_Larson

Quote from: lingo on April 02, 2005, 02:29:11 PM
but I received different results:
CPU Pentium 4 3.6GHz Prescott
and Windows XP SP2


That's because Michael has a P3. The P4 is completely different from the P3, so a number of things that are fast on the P3 are slow on the P4 and vice versa.