InstrCi - case-insensitive string search

Started by jj2007, June 15, 2008, 10:09:46 PM


jj2007

Quote from: MichaelW on May 12, 2009, 03:59:54 AM
I haven't been following this thread, but if the alignment of the return address is a problem

Thanks. It is unlikely that this is the problem because the return address is 004036E3 in both the slow and fast case. Somewhere about 1000 cycles per call get lost, and I would really like to understand why...

MichaelW

I can't see any reason for the difference, but your app is so complex that it would be easy to miss a problem. You could try isolating the problem components in a simple test app, where there are fewer opportunities for errors and side effects.
eschew obfuscation

jj2007

Quote from: MichaelW on May 12, 2009, 06:27:35 AM
I can't see any reason for the difference, but your app is so complex that it would be easy to miss a problem. You could try isolating the problem components in a simple test app, where there are fewer opportunities for errors and side effects.

Thanks, Michael, for looking at this. The app is 28k including the "Mainstring" it searches; it may be complex in a way, but algo start is para-aligned, data alignment is fixed, too, and it is so small that code and data should fit into the cache. So how can moving the algo 16 bytes further down cost 1,000 cycles per call?

Isolating the problem components is of course a good idea, but most probably I won't be able to reproduce the effect any more.
I am lost... ::)

EDIT: Just tested the two exes attached above on P4:

Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)

InstrCi11_IfNop_OFF.exe
4233     InstrJJ (jj2007)
4020     BMLinDB (Lingo)

InstrCi11_IfNop_ON.exe
4275     InstrJJ (jj2007)
4045     BMLinDB (Lingo)


Not the slightest difference. On the Celeron, there is a consistent difference of 1,000 cycles for InstrJJ but none for BMLinDB. So it is even CPU-specific... and the P4 is usually more sensitive to alignment than the more recent Celeron M.

jj2007

Quote from: Jimg on May 11, 2009, 08:14:15 PM
Why are you so surprised?  I see this all the time on amd.

Jim,
Did you ever find an explanation why this can happen? Some pattern that would point to possible causes? Does it happen in your big and small apps alike? What is the order of magnitude you observed: +40%=1000 cycles difference, or much less?

MichaelW

Perhaps the code with the extra 16 bytes will not fit in the cache, or some critical part of it will not fit. If this is so, adding more bytes should eventually cause the code to misbehave on other processors.
eschew obfuscation

jj2007

Quote from: MichaelW on May 12, 2009, 08:37:24 AM
Perhaps the code with the extra 16 bytes will not fit in the cache, or some critical part of it will not fit. If this is so, adding more bytes should eventually cause the code to misbehave on other processors.


The Celeron M 420 has 32kb of L1 cache, so the 28k exe should comfortably fit into the cache. But I am not a hardware specialist...

By the way, I cross-checked with GetTickCount, and the cycle counts are correct. Calling the algo is 1,000 cycles slower when it sits 16 bytes further down :boohoo:
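As a sanity check on that GetTickCount cross-check: at the Celeron M 420's nominal 1.6 GHz clock, a 1,000-cycle-per-call penalty should be clearly visible in wall time. A back-of-the-envelope calculation (the call count here is a hypothetical figure, not taken from the test above):

```python
CPU_HZ = 1.6e9         # Celeron M 420 nominal clock
EXTRA_CYCLES = 1_000   # per-call slowdown measured above
CALLS = 1_000_000      # hypothetical number of timed calls

# Extra wall time GetTickCount should report for the slower build:
extra_ms = CALLS * EXTRA_CYCLES / CPU_HZ * 1_000
print(f"{extra_ms:.0f} ms")  # 625 ms over a million calls
```

So the penalty is big enough that a millisecond-resolution timer can confirm it, which rules out a cycle-counter artifact.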

EDIT: Found a good article on Optimizing for instruction caches:

(page 2): "The effective size of a function in the cache can vary depending on its alignment. Suppose a function requires less than one line of cache, but that it overlaps a cache line boundary. In this case, the function will require allocation of two cache lines, substantially increasing its effective size".

Sounds plausible, but can an executable with only 28k overlap with a cache line boundary if the L1 cache is 32k? Is that the explanation?? If I understand the article correctly, two cache lines may be allocated, fine. But only once, or for every call, costing 1000 cycles??
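The article's point about effective size can be illustrated with a small calculation. Assuming a 64-byte cache line (typical for these CPUs), a routine's footprint in the instruction cache depends on where it starts, not just on how long it is:

```python
CACHE_LINE = 64  # bytes; assumed line size, typical for P4 / Celeron M

def cache_lines_spanned(start: int, size: int) -> int:
    """How many cache lines a block of code occupies."""
    first = start // CACHE_LINE
    last = (start + size - 1) // CACHE_LINE
    return last - first + 1

# A 60-byte loop fits in one line when it starts on a boundary...
print(cache_lines_spanned(0, 60))   # 1
# ...but needs two lines when shifted 16 bytes down:
print(cache_lines_spanned(16, 60))  # 2
```

So a 16-byte shift can double the cache footprint of a hot inner loop even inside a small executable; whether that alone accounts for 1,000 cycles per call is another question.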

Jimg

Quote from: jj2007 on May 12, 2009, 07:44:15 AM
Quote from: Jimg on May 11, 2009, 08:14:15 PM
Why are you so surprised?  I see this all the time on amd.

Jim,
Did you ever find an explanation why this can happen? Some pattern that would point to possible causes? Does it happen in your big and small apps alike? What is the order of magnitude you observed: +40%=1000 cycles difference, or much less?

I only know it happens.  In fact, a couple of years ago, I set up my time testing routines to actually copy the code under test to the same physical memory location before starting the test, so every call was from and to the same location.  About the time I got this all working, I realized I would never be able to play with the big boys because I had an AMD and they were all Pentium-4 centric.  Took all the fun out of it.

Here's a sample.  I was getting rock solid repeatability with only ten loops.

[attachment deleted by admin]
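Jim's approach of pinning the code under test at one location and keeping the loop count low amounts to a best-of-N harness. A rough sketch of the timing side in Python (the original used RDTSC in assembly; perf_counter stands in for it here):

```python
import time

def best_of(func, loops=10, inner=1_000):
    """Best-of-N timing: taking the minimum filters out scheduler and
    interrupt noise, so a handful of loops already repeats well."""
    best = float("inf")
    for _ in range(loops):
        t0 = time.perf_counter()
        for _ in range(inner):
            func()
        best = min(best, (time.perf_counter() - t0) / inner)
    return best  # seconds per call of func

print(f"{best_of(lambda: sum(range(100))):.2e} s/call")
```

Copying the routine to a fixed address, as Jim did, additionally removes the code-placement variable that this whole thread is about.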

dedndave

I only see slight differences with a Prescott....

InstrCi11_IfNop_ON.exe

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
Test correctness for InstrJJ: OK
Average cycle counts:
4295     InstrJJ (jj2007)
3995     BMLinDB (Lingo)
16355    InString (Masm32 Lib)
InstrJJ : InString = 26 %

InstrCi11_IfNop_OFF.exe

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
Test correctness for InstrJJ: OK
Average cycle counts:
4286     InstrJJ (jj2007)
4045     BMLinDB (Lingo)
16671    InString (Masm32 Lib)
InstrJJ : InString = 25 %

jj2007

Quote from: Jimg on May 12, 2009, 02:56:44 PM

I only know it happens.  In fact, a couple of years ago, I set up my time testing routines to actually copy the code under test to the same physical memory location before starting the test, so every call was from and to the same location.  About the time I got this all working, I realized I would never be able to play with the big boys because I had an AMD and they were all Pentium-4 centric.  Took all the fun out of it.

Here's a sample.  I was getting rock solid repeatability with only ten loops.

Thanks, very interesting. It seems to be linked to
instruction cache locality, although I still do not understand how this should affect a 300-byte algo in a 28k executable on a CPU with 32k L1 cache...

Quote from: dedndave on May 12, 2009, 02:59:40 PM
I only see slight differences with a Prescott....

My P4 is a Prescott, too - same result: no difference. It is CPU-specific.

UtillMasm

c:\china>\masm32\bin\ml.exe /c /coff instrnew.asm
Microsoft (R) Macro Assembler Version 6.14.8444
Copyright (C) Microsoft Corp 1981-1997.  All rights reserved.

Assembling: instrnew.asm

c:\china>\masm32\bin\link.exe /subsystem:console instrnew.obj
Microsoft (R) Incremental Linker Version 5.12.8078
Copyright (C) Microsoft Corp 1992-1998. All rights reserved.


c:\china>instrnew.exe
Type       Short   Medium     Long HighAnsi MisMatch
Source   Mainstr  Mainstr  Mainstr  Mainstr mismatch
Pattern  Substr1  Substr2  Substr3  Substr4   mmword
======= ======== ======== ======== ======== ========
OVH          264        0        0        0        0
InStrL        88      132      363      462     1155
finstr        88      286      429      550     1496
finstr2      154      209      363      363     1617
finstr3      143      209      352      352     1507
finstr4       88      308      451      594     1496
finstr5       88      319      451      605     1496
finstr6      154      209      363      363     1617

Press enter to exit...

c:\china>

dedndave

This subject brings to mind something that has troubled me a bit.
If a programmer tests/optimizes his/her code on one machine, as many of us would
during a normal session, the code is likely to become biased for that CPU.
As a group, it would be nice if we could standardize testing and document
most of our results in one easy-to-use reference. That way, if a programmer
(especially a n00b like me) wants to find an optimization that has been
tested on a variety of platforms, he/she doesn't have to search and read all
the forum threads concerning a single subject. I am not referring to major
algorithms, but simple little things that have been widely tested.
It seems impractical to have all the members test every algo we write, in
order to acquire a multi-platform sampling. On the other side of the coin,
we don't want finished code that is (sometimes unknowingly) optimized
for a specific processor. There must be a solution someplace.

As for Jochen's issue, it is machine specific, but all of us want to know what
the solution is so that we may attempt to avoid the issue in code that is
written and tested on a different CPU. It makes me wonder how many
programs out there suffer on one or more specific CPU(s).

JJ - This morning I was reading Agner's assembly optimization manual.
One of the things that made me think of your problem was his
statement "Make sure it isn't the test program". He went on to say
that many hours are spent trying to fix an algorithm, when it turns
out to be the testing method at fault. I am not saying your test-bed
has a flaw, I just wanted you to consider the possibilities. At our age,
we can't afford to pull out too much of the hair we have.

UtillMasm

It's possible we could make a custom CPU (or SoftCpu) for our programs in future! :toothy

fearless

Quote from: dedndave on May 12, 2009, 04:17:14 PM
..he/she doesn't have to search and read all
the forum threads concerning a single subject.

I agree. Could there also be a summary thread for all the tested algos, so we could refer to it for the very latest downloads and information? Another thing I would like to see is an indicator of the definitive best algo (not always possible, I know), or failing that a recommendation on the best one (or more than one, but try to keep it to one) to go with, and a note on the ones to avoid or use the least.

All this in one place would be a very handy resource. Some threads are massive and/or there are multiple threads for similar topics. For example szLen optimize (http://www.masm32.com/board/index.php?topic=1807.0) has 25 pages, and there are two threads on string copying: lstrcpy vs szCopy (http://www.masm32.com/board/index.php?topic=10830.0) has 4 pages, and SzCpy vs. lstrcpy (http://www.masm32.com/board/index.php?topic=1589.0) has 8 pages. Could the Lab have subforums for different categories: string algos & optimizing, memory algos & optimizing, etc.? I realise that could take a lot of work, and placing topics into the correct subforums may cause the moderators grief in moving and sorting. Maybe a post template could also be created that defines the information to provide when starting a new thread in the Lab to present a new or modified algo.

Just some ideas, hope no one takes offence.
ƒearless

jj2007

Quote from: Jimg on May 12, 2009, 02:56:44 PM

Here's a sample.  I was getting rock solid repeatability with only ten loops.

My 2 cts - Celeron M. It took me a while because it choked with ml 9.0 ;-)
Well organised table, by the way :U

Type       Short   Medium     Long HighAnsi MisMatch
Source   Mainstr  Mainstr  Mainstr  Mainstr mismatch
Pattern  Substr1  Substr2  Substr3  Substr4   mmword
======= ======== ======== ======== ======== ========
OVH          264        0        0        0        0
InStrL        96      132      348      468     1152
finstr        96      288      432      552     1500
finstr2      144      216      372      372     1608
finstr3      144      204      348      348     1512
finstr4       84      300      456      588     1500
finstr5       84      324      456      612     1488
finstr6      144      216      360      372     1608

jj2007

Quote from: fearless on May 12, 2009, 06:21:20 PM
Just some ideas, hope no one takes offence.

Excellent ideas, fearless. Some of these threads are really too detailed (and I plead guilty :bg), and unfortunately not always on topic. Good coders tend to have strong personalities :wink

Standardisation could help a lot - Michael's timer macros are a fantastic achievement, but we might add some rules, e.g. how to pass parameters (we all know that fastcall is a few cycles faster...). You may have seen the timings macro in my posts above, I have done that one to suit my own laziness, but I believe we could work out something along those lines. Not to mention LAMPs :wink