Few jumps vs many jumps in a parser loop

dedndave · December 11, 2009, 02:08:32 PM

i think you are missing what Hutch and I are trying to convey, Jochen
and, neither of us has posted the example code because we can't get a handle on the desired end-result
that is because you are trying to make the comparisons on a generic level
the methods we propose require knowledge of the specific application
we understand that it is a parser, of course - lol
it appears to be for evaluating ASCII decimal reals, but we can't be certain
i.e. we don't want to play the game, if we don't know what the rules are :P

jj2007 · December 11, 2009, 02:36:41 PM

Dave (and Hutch),
The rules are pretty simple:

- Take a source string and scan it.
- Do nothing for 90% of the characters found, in this example: 01234567890abcdfABCDF, i.e. the characters that have no specific meaning.
- Jump to a specific address if you hit an interesting character, e.g. a dot, an "h", an "e", a nullbyte, tab or space etc.

That is what this code simulates. I am sure there would be other applications, too, although it is clear that I want to use it to evaluate numbers of the WM_XXX EQU 1234h or WM_ZZZ EQU 101010b etc type.

HTH,
Jochen

dedndave · December 11, 2009, 04:01:02 PM

here you go, my friend
a generic example of what Hutch and i were thinking about

EDIT - there are ways to speed this up - it is merely intended to demonstrate the concept

hutch-- · December 11, 2009, 05:25:19 PM

JJ,

I just had a quick "struggle" through the technocolor code and noticed one thing, the use of the mnemonic "lodsb" which is known to be very slow on a PIV, especially without the REP prefix as you have no special case circuitry to make it faster. Shoot it and use a register which you increment and you should see a speed improvement.

jj2007 · December 11, 2009, 05:58:07 PM

@Dave: I hope this is correctly implemented. It seems to work but is not very fast. The reason might be that you have three instead of two memory accesses:

Code Select

Class4:
	movzx eax, byte ptr [esi]
	inc esi
	mov al, XlatTbl[eax]			; get the "class"
	jmp VectTbl[eax]

Code Select

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
1394    cycles for parseTable, 171 bytes,       #h=50, #dots=50
1336    cycles for parseTableS, 137 bytes,      #h=50, #dots=50
2542    cycles for parseManyAhAl, 111 bytes,    #h=50, #dots=50
2783    cycles for parseFewJumps, 99 bytes,     #h=50, #dots=50
2743    cycles for parseManyJumps, 99 bytes,    #h=50, #dots=50
2742    cycles for parseElseif, 111 bytes,      #h=50, #dots=50
3248    cycles for parseSwitch, 112 bytes,      #h=50, #dots=50
3175    cycles for parseDave, 109 bytes,        #h=50, #dots=400

@Hutch:
The lodsb is in the part that builds the table. It is run once and therefore has no influence on the timings.

dedndave · December 11, 2009, 06:45:12 PM

well - i don't think it is, Jochen
first, it appears as though you rebuild the one-time table in each call to parsedave
second, the program is a demonstration of a concept
it does not do what you do to parse
in order to compare times, you have to design a class list and build the table as required
so that it performs the same task as the other routines
i am sure there are ways to make it faster
the 3 accesses are not as bad as it may seem
the one thing that might be problematic is, as Hutch has mentioned in the past:
accessing a byte just before accessing a dword (in a loop situation) on a p4 is sluggish
that may well be part of the problem with some of the other algos, also

jj2007 · December 11, 2009, 07:13:57 PM

Quote from: dedndave on December 11, 2009, 06:45:12 PM
well - i don't think it is, Jochen
first, it appears as though you rebuild the one-time table in each call to parsedave

Code Select Expand
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3) 1805 cycles for parseTableS, 136 bytes, #h=50, #dots=50 3868 cycles for parseDave, 129 bytes, #h=50, #dots=400 ... 43 cycles for parseTableS, 136 bytes, #h=1, #dots=1 769 cycles for parseDave, 129 bytes, #h=1, #dots=8

These are timings for disabling these two lines:

Code Select

		; cmp dword ptr [edi], 0		; table already built?
		; jne ParseTheString

As you see, the Build is called only once, just as in the other routines.

Quotesecond, the program is a demonstration of a concept
it does not do what you do to parse
in order to compare times, you have to design a class list and build the table as required
so that it performs the same task as the other routines

It does not do exactly the same, but for the timings the difference is irrelevant since I scan the same Src

Quote
i am sure there are ways to make it faster
the 3 accesses are not as bad as it may seem
the one thing that might be problematic is, as Hutch has mentioned in the past:
accessing a byte just before accessing a dword (in a loop situation) on a p4 is sluggish
that may well be part of the problem with some of the other algos, also

Timings above are for Celeron, not P4. Anyway, if you see a way to speed up the innermost loops, please let me know.

EDIT: I managed to shorten parseTableS by 2 bytes, and speed it up by 25%. Don't ask me why ::)

Code Select

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
1349    cycles for parseTableS, 134 bytes       #h=50, #dots=50
1840    cycles for parseTable, 171 bytes        #h=50, #dots=50
3166    cycles for parseDave, 134 bytes         #h=50, #dots=400

dedndave · December 11, 2009, 10:44:19 PM

well - if you really want to speed the thing up, here's what you need to do - and this will make the P4's happy
the Src string may or may not be 4-aligned
build the parser to parse bytes until the address IS 4-aligned
then, switch to a different parser that reads 4-aligned dwords from Src
inside the dword parser, use one register to hold the unparsed bytes (EDX, maybe)
shift the bytes into DL (SHR EDX,8) and parse them from there
one of the biggest improvements i made to my ling long kai fang bignum conversion
routines (LLKF9) was to access 4-aligned dwords for both input and output

EDIT - i notice you have Src 16-aligned
that isn't a real-world condition
in fact, you should test it with different alignments (mod 4 should be good enough)
if you look at my ling long kai fang code (LLKF9), in the init section, i scan down non-signifigant bytes from the top of the input value
that code is somewhat similar to what you want, only mine has 3 loops - yours only needs the first 2 loops
i tested that code with "mod 4" alignment cases
no matter how many bytes or how they are aligned, that code is ~ 3 times faster than STD, REP SCASB on a P4 prescott

jj2007 · December 12, 2009, 03:24:07 AM

Quote from: dedndave on December 11, 2009, 10:44:19 PM
EDIT - i notice you have Src 16-aligned
that isn't a real-world condition

There is absolutely no difference in timings between aligned and misaligned strings. Which is not surprising for modern CPUs in general, and a byte scanner in particular...
As to the shift alternative, I doubt that it can speed things up. Can you show me how you would modify the inner loop?

Code Select

plSct0:	mov esi, Src		; let's go
		; ----------- innermost loop ----------------------------------------
plSct:	movzx eax, byte ptr [esi]			; load the char from the source string
		inc esi
		jmp dword ptr [edi+4*eax]		; jump to the address obtained from the table
		; ----------- end of innermost loop ------------------------------

hutch-- · December 12, 2009, 03:33:45 AM

JJ,

On a PIV,

Code Select


;  change this

 mov al, XlatTbl[eax]   ; get the "class"
jmp VectTbl[eax]

;    to this

movzx eax, BYTE PTR XlatTbl[eax]
; stick something in here if the algo allows it.
jmp VectTbl[eax]

This should prevent a read after write stall.

Also on a PIV ala Intel, use ADD REG/VAR, 1, not INC REG/VAR.

jj2007 · December 12, 2009, 03:54:17 AM

Hutch,
movzx eax instead of mov al rocks indeed, in Dave's algo:

Code Select

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
1346    cycles for parseTableS, 134 bytes       #h=50
2470    cycles for parseDave, 135 bytes         #h=50 movzx
3156    cycles for parseDave, 134 bytes         #h=50 mov al

But it is still almost a factor 2 slower than my parseTableS.
Can you give me timings for the P4?

I had tested inc esi vs. add esi, 1 in my own algo on my PIV, no significant difference at >12000 cycles.

dedndave · December 12, 2009, 05:13:13 AM

prescott...

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
9735 cycles for parseTableS, 134 bytes #h=50, #dots=50
10125 cycles for parseTable, 171 bytes #h=50, #dots=50
3049 cycles for parseFewJumps, 99 bytes #h=50, #dots=50
3806 cycles for parseManyJumps, 99 bytes #h=50, #dots=50
12009 cycles for parseDave, 135 bytes #h=50, #dots=400

2003 cycles for parseTableS, 134 bytes #h=10, #dots=10
2073 cycles for parseTable, 171 bytes #h=10, #dots=10
724 cycles for parseFewJumps, 99 bytes #h=10, #dots=10
887 cycles for parseManyJumps, 99 bytes #h=10, #dots=10
2543 cycles for parseDave, 135 bytes #h=10, #dots=80

254 cycles for parseTableS, 134 bytes #h=1, #dots=1
264 cycles for parseTable, 171 bytes #h=1, #dots=1
98 cycles for parseFewJumps, 99 bytes #h=1, #dots=1
73 cycles for parseManyJumps, 99 bytes #h=1, #dots=1
300 cycles for parseDave, 135 bytes #h=1, #dots=8

if i get some time tomorrow, i may play with an aligned loop JJ :8)

sinsi · December 12, 2009, 07:22:52 AM

Are you wanting to scan a string and get true/false? Use a byte scanner and return when one is found.
Are you counting certain characters? Use a table (inc chars[eax]) and bsf.
Have you tried call/ret instead of jmp/jmp?
Or are you processing the source to a destination on the fly?

Playing around with aligning threw other procs out of whack, but I did get the 'S' one 100 cycles faster.
Maybe you could have one exe per proc and discard the slow ones?

jj2007 · December 12, 2009, 08:10:57 AM

Quote from: sinsi on December 12, 2009, 07:22:52 AM
Are you wanting to scan a string and get true/false?

Sinsi,
The scanner is supposed to act, i.e. jump, for specific characters.
For example, look at your post: You have used lots of lowercase characters - ordinary text, no need to act, so get the next char. But do something specific if you find ?/[].' etc, or if you encounter an uppercase character.
My specific application is to get the value of e.g.

Code Select

1010101b
( 101010 y)
12345
12345h
$12345
0x12345
123.456
(123.456)
-123e99
123eh

I guess you are able to recognise, using your knowledge and logic and a set of rules, that these are all valid number formats; some of them are being used in Windows.inc, and ml.exe has no problem distinguishing them. That is what the scanner should do: Get the info for applying the rules, then, later, branch to the decimal or hex or bin$ algo.

hutch-- · December 12, 2009, 10:23:04 AM

For most of you number formats, just having a leading number makes the parser a lot easier, branch on anything between ascii "0" to "9" and the minus sign "-" and keep scanning until you find the end of the number, usually by a punctuation character. Depending on what is at the front or end you can determine the numeric format and convert it accordingly. The ones in brackets "(" and ")" are usually parsed in the preprocessor.

The leading "$" notation is very easy to parse because you only have to test the 1st character to determine the data type but if you are using a general purpose word parser its just as easy to rad the last character as MASM does.

News:

Few jumps vs many jumps in a parser loop