I was testing the hypothesis that jumping is slower than not jumping - see snippet below. Code size is identical, timings are surprisingly similar, but what struck my eye was that the longer string took almost 4 times as long as expected. Any explanation...?
Celeron M:
587 cycles few jumps, 66 bytes, len=40
616 cycles many jumps, 66 bytes, len=40
45 cycles few jumps, 66 bytes, len=10
43 cycles many jumps, 66 bytes, len=10
Quotealign 16
Src1 db "123456789.1234789e-123 and a few more..", 0
align 16
Src2 db "1234.567h", 0
.code
start:
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
S1:
mov esi, offset Src1
jmp @F
HasE:
jmp @F
HasDot:
jmp @F
HasH:
jmp @F
HasB:
jmp @F
HasY:
jmp @F
HasB1:
jmp @F
HasB2:
@@:
mov al, [esi]
inc esi
.if al<"0" || al>"9" ; few jumps
test al, al
je @F
or al, 20h
cmp al, "e"
je HasE
cmp al, "."
je HasDot
cmp al, "h"
je HasH
cmp al, "b"
je HasB
cmp al, "y"
je HasY
cmp al, "("
je HasB1
cmp al, "("
je HasB2
.endif
jmp @B
@@:
E1:
sub esi, offset Src1
counter_end
print str$(eax), 9, "cycles few jumps, "
mov eax, E1
sub eax, S1
print str$(eax), " bytes, len="
print str$(esi), 13, 10
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
S2:
mov esi, offset Src1
@@:
mov al, [esi]
inc esi
.if al<"0" || al>"9" ; many jumps
test al, al
je @F
or al, 20h
.if al=="e"
jmp @B
.endif
.if al=="."
jmp @B
.endif
.if al=="h"
jmp @B
.endif
.if al=="b"
jmp @B
.endif
.if al=="y"
jmp @B
.endif
.if al=="("
jmp @B
.endif
.if al=="("
jmp @B
.endif
.endif
jmp @B
@@:
E2:
sub esi, offset Src1
counter_end
print str$(eax), 9, "cycles many jumps, "
mov eax, E2
sub eax, S2
print str$(eax), " bytes, len="
print str$(esi), 13, 10
[/b]
could it be because it is trying to parse through...
Quote" and a few more.."
DOH !
lol
as for jumps - i find that short jumps are pretty quick
it is still good to reduce the number of jumps executed in a loop, if practical
Quote from: dedndave on December 08, 2009, 04:56:38 PM
could it be because it is trying to parse through...
Quote" and a few more.."
DOH !
lol
Yep, you are right. These are non-numbers that slow the algo down... how stupid of me.
504 cycles few jumps, 66 bytes, len=100
502 cycles many jumps, 66 bytes, len=100
45 cycles few jumps, 66 bytes, len=10
43 cycles many jumps, 66 bytes, len=10
>as for jumps - i find that short jumps are pretty quick
>it is still good to reduce the number of jumps executed in a loop, if practical
Probably yes, although this test reveals no difference between jumping and not jumping.
JJ,
In isolation the difference is reasonably hard to measure but thge general drift is if an algorithm is anything like well advanced in speed terms extra jumps start to become one of the barriers to making it go faster. With very high speed code you will see the difference but with code full of stalls, dependencies and the like it just does not matter much.
Hutch,
Maybe I have not been clear. What I wanted to see was the difference between two parser loops, assuming that the text rarely has a dot:
A)
cmp al, "."
je HasDot ; ... do stuff there, otherwise fall through
B)
.if al=="."
... do stuff
.endif
Now A) is obviously rarely taken; in 99% of cases, it falls through.
In B), the .if translates to:
cmp al, "."
jne @F
... do stuff
@@:
So here it's the opposite: That jump is taken 99% of the time.
According to Opcodes.chm, a jump taken is 3 cycles, not taken is one cycle. An unknown author writes there:
Quote- It's a good programming practice to organize code so the
expected case is executed without a jump since the actual
jump takes longer to execute than falling through the test.
:wink
JJ,
That is conventional wisdom and if it is the only factor its usually correct, a fall through is faster than a jump over but almost exclusively there are other factors that effect instruction scheduling. For what its worth a character table usually eats a sequential scanner aliver in terms of character evaluation.
what isn't readily obvious or easily quantifiable is "how much work is the jump doing for you"
sometimes, a jump is a sign of clumsy code
other times, it performs a great service in terms of easing the workload for the rest of the routine
sometimes, you can replace a jump inside a loop with one outside the loop
well - you get the idea
if the jump is "doing a lot of work" it is worth the effort
testing conditionals in the right order can speed up the simplest code
I was wondering if a branch hint makes any difference?
JJ,
Make a 256 member table of label offsets which you can divide into as many characters classes as you like and for the price of one table access you can seperate punctuation from either lower or upper class words, characters you don't want < 32 except for 9 10 13, numbers, arithmetic operators etc ..... any division you like.
Unwanted characters go back to the loop start, punctuation to another label, arithmetic operators to another label etc .... This type of algo will rip the titz off a sequential evaluation block and this is one place where you need the pace if the text you are parsing is large.
Quote from: sinsi on December 09, 2009, 04:10:49 AM
I was wondering if a branch hint makes any difference?
Sinsi, DednDave, Hutch, thanks. The branch hint is not helpful, newer processors do it automatically, the character table is feasible but kind of an overkill if you have only a few branches. I might give it a try, though.
The point is rather that the code with branches runs at 5 cycles per character, independently of whether most branches are taken or not taken. The Opcodes.chm rule that a taken branch needs 3 cycles, a not taken one needs one cycle is thus obsolete.
It might be different for older processors, though - test yourself, code attached.
There are 2 instructions to help with conditional jumps
2EH—Branch not taken (used only with Jcc instructions)
3EH—Branch taken (used only with Jcc instructions)
I think they are the instructions for segment overrides (CS: and DS:) so older processors ignore it.
So do newer processors, but I can't find the Intel or Agner Fog page that specifies why ::)
We had that instruction in this thread (http://www.masm32.com/board/index.php?topic=9370.msg84835#msg84835) already.
>We had that instruction in this thread already.
Can you recall the reply? 8 pages is too much...
Quote from: sinsi on December 09, 2009, 08:14:14 AM
>We had that instruction in this thread already.
Can you recall the reply? 8 pages is too much...
There was no reply, but I found the reference: Agner Fog (http://www.agner.org/optimize/microarchitecture.pdf). Search the pdf for branch hint to see several quotes of the "Branch hint prefixes have no useful effect" type.
i think the 3 cycles for "branch taken" is from the 486 days - which went by rather fast - lol
actually, i remember thinking how nice that processor was :bg
well - compared to the 286, it was a super-star
:bg
I did too but the 33 meg i486 chip cost me $1500 AU, the board cost me about the same and the 300 meg ESDI full height cost me $2200 AU and then I had to buy a controller card for it. I still own the lot but the controller card dies shortly after I retired the box and the next time I tried to get the disk contents it would not work.
The full height NEC disk would make a good doorstopper though. :bdg
lol
i have an old (XT old) seagate 40 mb full height drive
it is a good drive, really - lol
but the heads have a self-parking solenoid (one of the first)
when that thing kicks in or out, you want to open it to see if your clothes are dry - lol
Quote from: hutch-- on December 09, 2009, 06:26:29 AM
JJ,
Make a 256 member table of label offsets ...This type of algo will rip the titz off a sequential evaluation block and this is one place where you need the pace if the text you are parsing is large.
Hutch,
Yes it rips the titz off, twice as fast as the sequential ones, but only on my Celeron M... on a P4, it stinks. Unless you have a better design...? Code is attached.
3804 cycles few jumps, 66 bytes, len=500
10640 cycles char table, 135 bytes, len=500
3233 cycles many jumps, 66 bytes, len=500
2921 cycles many + xchg al, ah, 78 bytes, len=500
Quote .data?
PARSETABLE STRUCT
TheChar BYTE ?
TheLabel DWORD ?
PARSETABLE ENDS
...
ChTable dd 256 dup(?) ; CHARTABLE parser
.data
pt PARSETABLE <0, plCTE1>, ; if nullbyte found, jump out of the loop
<"e", ChkE>, <"E", ChkE>, ; exponent, lowercase or uppercase
<".", ChkDot>,
<"h", ChkH>, <"H", ChkH>, ; hex
<"b", ChkB>, <"B", ChkB>, ; binary
<"y", ChkY>, <"Y", ChkY>, ; binary
<"(", ChkB1>, <")", ChkB2> ; brackets
ptEnd db -128 ; flag end of table
.code
push edi
push esi
mov esi, offset Src1
mov edi, offset ChTable
cmp dword ptr [edi], 0
jne plCT0 ; chartable already created, so dive directly into the loop
mov ecx, 256 ; this part will be run only once, so it can be slow
push edi
mov eax, plCT ; set defaults
rep stosd
pop edi
push esi
mov esi, offset pt
.Repeat
lodsb
.Break .if al==-128
movzx edx, al
lodsd
mov [edi+4*edx], eax
.Until 0
pop esi
jmp plCT0
ChkE:
jmp plCT
ChkDot:
jmp plCT
ChkH:
jmp plCT
ChkB:
jmp plCT
ChkY:
jmp plCT
ChkB1:
jmp plCT
ChkB2:
plCT0:
xor eax, eax
plCT:
movzx eax, byte ptr [esi]
inc esi
jmp dword ptr [edi+4*eax] ; jump to the address obtained from the table
plCTE1:
JJ,
A table like this. 0 is one class, the numbers are class 1, upper case is class 2 and lower case is class 3.
db 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
db 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
db 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
db 1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0
db 0,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2
db 2,2,2,2,2,2,2,2,2,2,2,0,0,0,0,0
db 0,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3
db 3,3,3,3,3,3,3,3,3,3,3,0,0,0,0,0
db 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
db 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
db 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
db 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
db 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
db 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
db 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
db 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Then make another data table which has the label offsets as members.
lbltbl dd lbl0,lbl1,lbl2,lbl2
In the code,
lbl0:
; code
lbl1:
; code
lbl2:
; code
lbl3:
; code
2 table accesses to get to the right label after evaluating each character.
Hutch,
That's exactly what my code does. It's very fast on a Celeron but not on a P4, unfortunately. Here is the innermost loop.
movzx eax, byte ptr [esi]
inc esi
jmp dword ptr [edi+4*eax] ; jump to the address obtained from the table
Can I have some timings for other processors, please?
Thanx, jj
why not "pre-multiply" the byte values in the table by 4 ? (unless, of course, they are over 63 - lol)
alternatively...
movzx eax,byte ptr [esi]
shl eax,2
inc esi
jmp dword ptr [edi+eax]
the [edi+4*eax] kills you on the P4
Quote from: dedndave on December 10, 2009, 11:06:02 AM
why not "pre-multiply" the byte values in the table by 4 ? (unless, of course, they are over 63 - lol)
alternatively...
Can you post a working example? The shl eax, 2 solution is slower, unfortunately...
db 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
db 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
db 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
db 4,4,4,4,4,4,4,4,4,4,0,0,0,0,0,0
db 0,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8
db 8,8,8,8,8,8,8,8,8,8,8,0,0,0,0,0
db 0,12,12,12,12,12,12,12,12,12,12,12,12,12,12,12
db 12,12,12,12,12,12,12,12,12,12,12,0,0,0,0,0
db 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
db 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
db 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
db 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
db 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
db 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
db 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
db 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.
.
.
movzx eax,byte ptr [esi]
inc esi
jmp dword ptr [edi+eax]
another alternative is....
jmp lbltbl[eax]
no need to load the vector table address into edi
I was looking at the source code and noticed that there are two back-to-back cmps of the al register with ")" as the value everywhere, but the switch routine has a "(" and a ")".
I believe the all the other rountines should match the switch routine logic, because it doesn't make sense to have back-to-back compares with the same values, waste of space and never gets executed?
JJ,
I just downloaded your test piece but I don't know if I will live long enough to decypher it. A passable description of wqhat you are parsing would make it a lot easier to work on.
RE: Complex addressing mode on a PIV, I have never had a problem using it and while the suggestion Dave made can save you a base address register [table+index_reg*4] it tends to be slightly slower than using a register as the based address. A shift is far slower.
Here are the timings on my quad.
2503 cycles few jumps, 66 bytes, len=500
1637 cycles char table, 135 bytes, len=500
2421 cycles many jumps, 66 bytes, len=500
1796 cycles many + xchg al, ah, 78 bytes, len=500
2355 cycles elseif jumps, 78 bytes, len=500
2890 cycles Switch, 79 bytes, len=500
80 cycles few jumps, 66 bytes, len=10
42 cycles many jumps, 66 bytes, len=10
--- ok ---
Quote from: dedndave on December 10, 2009, 11:37:47 AM
movzx eax,byte ptr [esi]
inc esi
jmp dword ptr [edi+eax]
My belly says this will crash.
Quote from: dedndave on December 10, 2009, 11:45:56 AM
another alternative is....
jmp lbltbl[eax]
no need to load the vector table address into edi
Yes that works but it yields exactly the same cycles on Celeron (and is 4 bytes longer).
Quote from: hutch-- on December 10, 2009, 12:20:43 PM
JJ,
I just downloaded your test piece but I don't know if I will live long enough to decypher it. A passable description of wqhat you are parsing would make it a lot easier to work on.
Will work on it. Sorry that I challenged you - the string to be parsed is in the .data section, and tests possible combinations of text that could be interpreted as decimal, hex or binary values.
Quote
Here are the timings on my quad.
What happened to your trusty P4...?
Your wish is my command.
Timing on my 3.8 gig PIV.
3323 cycles few jumps, 66 bytes, len=500
10468 cycles char table, 135 bytes, len=500
3050 cycles many jumps, 66 bytes, len=500
2830 cycles many + xchg al, ah, 78 bytes, len=500
3156 cycles elseif jumps, 78 bytes, len=500
2801 cycles Switch, 79 bytes, len=500
159 cycles few jumps, 66 bytes, len=10
102 cycles many jumps, 66 bytes, len=10
--- ok ---
JJ,
Depending on your chosen notation, there are a number of ways to test what a number text type is, C uses 0xBADC0DE while basic uses &h or &B while MASM uses a leading number and a trailing "h" for hex and only a trailing "b" for binary and has to track the size by the string length. Is there a particular format you have in mind ?
QuoteMy belly says this will crash.
Jochen - look at the table values
i think your belly is tender - lol - don't make me come over there and rub it
i ain't rubbin' nuthin
Quote from: hutch-- on December 10, 2009, 02:35:44 PMIs there a particular format you have in mind ?
Njet. GfaBasic eats almost everything (and correctly), and so does MasmBasic. The current Val() is just a bit slow and a bit oversized, that's why I am testing the parser option.
@Dave:
movzx eax,byte ptr [esi]
inc esi
jmp dword ptr [edi+eax]
Addresses are 4 bytes long. If eax contains e.g. "1" aka 49, you are jumping to edi+49. What was your favourite exception? c0000005?
:bg
Jochen - look at this post, then the one of yours right after it.....
http://www.masm32.com/board/index.php?topic=12871.msg99585#msg99585
Hutch showed you an array of byte values in steps of 1
you seemed perfectly happy with that - lol - in fact, you said that is what your code does
then - look at mine again - they are byte values in steps of 4
that simply saves you from having to use [edi+4*eax]
you should try my stuff out before you dismiss it :bg
Quote from: z941998 on December 10, 2009, 12:18:59 PM
I was looking at the source code and noticed that there are two back-to-back cmps of the al register with ")" as the value everywhere, but the switch routine has a "(" and a ")".
Yes, but it's for testing the speed only. The final version would have different tests and actions. Thanks anyway for pointing this out.
Quote from: dedndave on December 10, 2009, 06:06:19 PM
then - look at mine again - they are byte values in steps of 4
that simply saves you from having to use [edi+4*eax]
you should try my stuff out before you dismiss it :bg
Apologies, my friend. There seems to be a misunderstanding, though:
.data
align 16
Src1 db "1234.567h1234.567h1234.567h1234.567h1234.567h1234.567h1234.567h1234.567h1234.567h1234.567h1234.5678h"
db "1234.567h1234.567h1234.567h1234.567h1234.567h1234.567h1234.567h1234.567h1234.567h1234.567h1234.5678h"
db "1234.567h1234.567h1234.567h1234.567h1234.567h1234.567h1234.567h1234.567h1234.567h1234.567h1234.5678h"
db "1234.567h1234.567h1234.567h1234.567h1234.567h1234.567h1234.567h1234.567h1234.567h1234.567h1234.5678h"
db "1234.567h1234.567h1234.567h1234.567h1234.567h1234.567h1234.567h1234.567h1234.567h1234.567h1234.567h", 0
That is our source string (not a very valid format, but for timings it's ok).
Now the innermost loop:
mov esi, offset Src1
plCT:
movzx eax, byte ptr [esi]
inc esi
jmp dword ptr [ChTable+4*eax] ; jump to the address obtained from the table
movzx eax, byte ptr [esi] loads a value between 0 and 255. That might be "y", ASCII 121, the terminator of a Bin$.
Now to make it clearer what happens here, I'll use a modified version of the innermost loop:
movzx eax, byte ptr [esi]
inc esi
mov edx, [ChTable+4*eax]
jmp edx
If eax was "y", then edx contains the address found in [ChTable+4*121], set before with the <"y", ChkY> structure.
What my table generator does is to preload the whole table with plCT, via rep stosd, then set a handful of jmp addresses individually. Both you and Hutch have used another table construct, not with addresses but with byte size (1,2,3) values. I have not yet grasped how the innermost loop would look like, but hope you can clarify...
New code attached, somewhat clearer than before.
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
1682 cycles for parseTable, 97 bytes, #h=55
2397 cycles for parseManyAhAl, 98 bytes, #h=55
2638 cycles for parseFewJumps, 86 bytes, #h=55
2494 cycles for parseManyJumps, 86 bytes, #h=55
3031 cycles for parseElseif, 98 bytes, #h=55
3577 cycles for parseSwitch, 99 bytes, #h=55
prescott...
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
10943 cycles for parseTable, 97 bytes, #h=55
3124 cycles for parseManyAhAl, 98 bytes, #h=55
3106 cycles for parseFewJumps, 86 bytes, #h=55
3405 cycles for parseManyJumps, 86 bytes, #h=55
3065 cycles for parseElseif, 98 bytes, #h=55
3162 cycles for parseSwitch, 99 bytes, #h=55
2231 cycles for parseTable, 97 bytes, #h=11
586 cycles for parseManyAhAl, 98 bytes, #h=11
796 cycles for parseFewJumps, 86 bytes, #h=11
618 cycles for parseManyJumps, 86 bytes, #h=11
670 cycles for parseElseif, 98 bytes, #h=11
692 cycles for parseSwitch, 99 bytes, #h=11
261 cycles for parseTable, 97 bytes, #h=1
71 cycles for parseManyAhAl, 98 bytes, #h=1
69 cycles for parseFewJumps, 86 bytes, #h=1
106 cycles for parseManyJumps, 86 bytes, #h=1
73 cycles for parseElseif, 98 bytes, #h=1
227 cycles for parseSwitch, 99 bytes, #h=1
JJ,
I asked you the question because the original topic was to do with parsing, not just number conversion and what I was after was the data IO format(s) that you were working on. Currently reading your source is not a lot of use because I don't know what it is supposed to be doing.
Your code has a bad bottleneck on Prescott core PIVs that shows up on both Dave's machine and the 3.8 gig PIV I have here and there are a massive number of PIVs that will still run for years so its worth tracking down why you have this problem.
Hutch,
The case here is pretty generic: Scan a string, and in rare cases take some action. I tried two variants, one that
takes jumps if no action is needed, the other that does
not take jumps if no action is needed. Then an experienced programmer suggested a character table:
Quote from: hutch-- on December 09, 2009, 01:01:02 AMFor what its worth a character table usually eats a sequential scanner aliver in terms of character evaluation.
Brilliant idea, so I implemented it and tested it, and voilà! on the Celeron M (a Core CPU) it eats the sequential scanner alive. Thanks a lot for pointing me to this solution - and condolencies to P4 owners :bg
Quote from: hutch-- on December 10, 2009, 11:49:11 PMthere are a massive number of PIVs that will still run for years so its worth tracking down why you have this problem.
I agree. It is indeed the innermost loop, and that one is very simple. I tested various options, including alignment, but no success.
Quote movzx eax, byte ptr [esi]
inc esi
; now jump to the address obtained from the table
if 0
jmp dword ptr [ChTable+4*eax] ; slowest
elseif 1
mov eax, [edi+4*eax] ; fastest
jmp eax
else
jmp dword ptr [edi+4*eax] ; fast
endif
(it may not look like a loop, but Olly says it is a loop :toothy)
I added a "short" variant using a 2-byte per jump table. It's even faster than the 5-byte version. Note that previous timings showed a wrong code size - the space used in the .data section was not reported.
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
1635 cycles for parseTable, 158 bytes, #h=55
1527 cycles for parseTableS, 128 bytes, #h=55 <--- NEW
2335 cycles for parseManyAhAl, 98 bytes, #h=55
2627 cycles for parseFewJumps, 86 bytes, #h=55
3111 cycles for parseManyJumps, 86 bytes, #h=55
2488 cycles for parseElseif, 98 bytes, #h=55
3388 cycles for parseSwitch, 99 bytes, #h=55
prescott...
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
11181 cycles for parseTable, 158 bytes, #h=55
10707 cycles for parseTableS, 128 bytes, #h=55
3002 cycles for parseManyAhAl, 98 bytes, #h=55
3228 cycles for parseFewJumps, 86 bytes, #h=55
3326 cycles for parseManyJumps, 86 bytes, #h=55
3341 cycles for parseElseif, 98 bytes, #h=55
3164 cycles for parseSwitch, 99 bytes, #h=55
Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz (SSE4)
1718 cycles for parseTable, 158 bytes, #h=55
1723 cycles for parseTableS, 128 bytes, #h=55
1806 cycles for parseManyAhAl, 98 bytes, #h=55
2416 cycles for parseFewJumps, 86 bytes, #h=55
2466 cycles for parseManyJumps, 86 bytes, #h=55
2456 cycles for parseElseif, 98 bytes, #h=55
2875 cycles for parseSwitch, 99 bytes, #h=55
Hi,
PIII
G:\WORK\TEMP>parserlo
☺☺☻♥ (SSE1)
5885 cycles for parseTable, 158 bytes, #h=55
5933 cycles for parseTableS, 128 bytes, #h=55
3778 cycles for parseManyAhAl, 98 bytes, #h=55
3740 cycles for parseFewJumps, 86 bytes, #h=55
4227 cycles for parseManyJumps, 86 bytes, #h=55
3841 cycles for parseElseif, 98 bytes, #h=55
5515 cycles for parseSwitch, 99 bytes, #h=55
1211 cycles for parseTable, 158 bytes, #h=11
1210 cycles for parseTableS, 128 bytes, #h=11
721 cycles for parseManyAhAl, 98 bytes, #h=11
778 cycles for parseFewJumps, 86 bytes, #h=11
877 cycles for parseManyJumps, 86 bytes, #h=11
801 cycles for parseElseif, 98 bytes, #h=11
1134 cycles for parseSwitch, 99 bytes, #h=11
140 cycles for parseTable, 158 bytes, #h=1
140 cycles for parseTableS, 128 bytes, #h=1
57 cycles for parseManyAhAl, 98 bytes, #h=1
77 cycles for parseFewJumps, 86 bytes, #h=1
84 cycles for parseManyJumps, 86 bytes, #h=1
70 cycles for parseElseif, 98 bytes, #h=1
119 cycles for parseSwitch, 99 bytes, #h=1
--- ok ---
Regards,
Steve
I give up, the problem is simply not described properly but the obvious is, a PIV has no problems with complex addressing mode so there is something wrong with the code.
Quote from: hutch-- on December 11, 2009, 01:14:20 PM
I give up, the problem is simply not described properly but the obvious is, a PIV has no problems with complex addressing mode so there is something wrong with the code.
It is simple enough, and the innermost loop is a three-liner. Besides, it works perfectly, and is a lot faster on more recent CPUs.
QuoteparseTableS proc uses esi edi Src:DWORD
.data?
ChTableS dd 512/4 dup(?) ; character table for parser
.data
ptS db 0, plOut-plBASE ; if nullbyte found, jump out of the loop
db "e", plChkE-plBASE, "E", plChkE-plBASE ; exponent, lowercase or uppercase
db ".", plChkDot-plBASE
db "h", plChkH-plBASE, "H", plChkH-plBASE ; hex
db "b", plChkB-plBASE, "B", plChkB-plBASE ; binary
db "y", plChkY-plBASE, "Y", plChkY-plBASE ; binary
db "(", plChkB1-plBASE, ")", plChkB2-plBASE ; brackets
ptEndS db -128 ; flag end of table
.code
and NumH, 0 ; first instructions for all algos:
and NumDots, 0 ; reset the test counters
mov edi, offset ChTableS
cmp dword ptr [edi], 0
jne plSct0
mov ecx, 256 ; this part creates the chartable and will be run only once, so it can be slow
push edi
mov eax, plSct ; set defaults to entry of innermost loop
rep stosd
pop edi ; edi=offset ChTableS
mov esi, offset ptS ; parse table specific chars
.While 1
lodsb
.Break .if al==-128 ; flag we are done
movzx edx, al ; the character requiring specific action, e.g. "h"
lodsb
movzx eax, al ; the byte offset against plBASE
add eax, plBASE ; translate into a dword address
mov [edi+4*edx], eax ; store into the table in the .data? section
.Endw
jmp plSct0 ; skip action zone and point esi to Src
plBASE: ; ----------------------- here is the action zone for some specific chars -----------------------
plChkE: ; e: maybe an exponent, maybe a hex char, who knows?
jmp plSct
plChkDot:
inc NumDots ; a dot: it's a float
jmp plSct
plChkH:
inc NumH ; h: end of a Hex$
jmp plSct
plChkB: ; b: maybe the end of a Bin$, maybe a hex char, who knows?
jmp plSct
plChkY: ; y: surely the end of a Bin$
jmp plSct
plChkB1: ; (: we have brackets, e.g. (1234h)
jmp plSct
plChkB2: ; ): end of a bracketed string
nop
plSct0: mov esi, Src ; let's go
[/b] ; ----------- innermost loop ----------------------------------------
plSct: movzx eax, byte ptr [esi] ; load the char from the source string
inc esi
jmp dword ptr [edi+4*eax] ; jump to the address obtained from the table
; ----------- end of innermost loop ------------------------------
plOut:
ret
parseTableS endp
i think you are missing what Hutch and I are trying to convey, Jochen
and, neither of us has posted the example code because we can't get a handle on the desired end-result
that is because you are trying to make the comparisons on a generic level
the methods we propose require knowledge of the specific application
we understand that it is a parser, of course - lol
it appears to be for evaluating ASCII decimal reals, but we can't be certain
i.e. we don't want to play the game, if we don't know what the rules are :P
Dave (and Hutch),
The rules are pretty simple:
- Take a source string and scan it.
- Do nothing for 90% of the characters found, in this example: 01234567890abcdfABCDF, i.e. the characters that have no specific meaning.
- Jump to a specific address if you hit an interesting character, e.g. a dot, an "h", an "e", a nullbyte, tab or space etc.
That is what this code simulates. I am sure there would be other applications, too, although it is clear that I want to use it to evaluate numbers of the WM_XXX EQU 1234h or WM_ZZZ EQU 101010b etc type.
HTH,
Jochen
here you go, my friend
a generic example of what Hutch and i were thinking about
EDIT - there are ways to speed this up - it is merely intended to demonstrate the concept
(http://kidshow.dcmemories.com/rrdobeeplate_226.gif)
JJ,
I just had a quick "struggle" through the technocolor code and noticed one thing, the use of the mnemonic "lodsb" which is known to be very slow on a PIV, especially without the REP prefix as you have no special case circuitry to make it faster. Shoot it and use a register which you increment and you should see a speed improvement.
@Dave: I hope this is correctly implemented. It seems to work but is not very fast. The reason might be that you have three instead of two memory accesses:
Class4:
movzx eax, byte ptr [esi]
inc esi
mov al, XlatTbl[eax] ; get the "class"
jmp VectTbl[eax]
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
1394 cycles for parseTable, 171 bytes, #h=50, #dots=50
1336 cycles for parseTableS, 137 bytes, #h=50, #dots=50
2542 cycles for parseManyAhAl, 111 bytes, #h=50, #dots=50
2783 cycles for parseFewJumps, 99 bytes, #h=50, #dots=50
2743 cycles for parseManyJumps, 99 bytes, #h=50, #dots=50
2742 cycles for parseElseif, 111 bytes, #h=50, #dots=50
3248 cycles for parseSwitch, 112 bytes, #h=50, #dots=50
3175 cycles for parseDave, 109 bytes, #h=50, #dots=400
@Hutch:
The lodsb is in the part that builds the table. It is run once and therefore has no influence on the timings.
well - i don't think it is, Jochen
first, it appears as though you rebuild the one-time table in each call to parsedave
second, the program is a demonstration of a concept
it does not do what you do to parse
in order to compare times, you have to design a class list and build the table as required
so that it performs the same task as the other routines
i am sure there are ways to make it faster
the 3 accesses are not as bad as it may seem
the one thing that might be problematic is, as Hutch has mentioned in the past:
accessing a byte just before accessing a dword (in a loop situation) on a p4 is sluggish
that may well be part of the problem with some of the other algos, also
Quote from: dedndave on December 11, 2009, 06:45:12 PM
well - i don't think it is, Jochen
first, it appears as though you rebuild the one-time table in each call to parsedave
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
1805 cycles for parseTableS, 136 bytes, #h=50, #dots=50
3868 cycles for parseDave, 129 bytes, #h=50, #dots=400
...
43 cycles for parseTableS, 136 bytes, #h=1, #dots=1
769 cycles for parseDave, 129 bytes, #h=1, #dots=8
These are timings for disabling these two lines:
; cmp dword ptr [edi], 0 ; table already built?
; jne ParseTheString
As you see, the Build is called only once, just as in the other routines.
Quotesecond, the program is a demonstration of a concept
it does not do what you do to parse
in order to compare times, you have to design a class list and build the table as required
so that it performs the same task as the other routines
It does not do exactly the same, but for the timings the difference is irrelevant since I scan the same Src
Quote
i am sure there are ways to make it faster
the 3 accesses are not as bad as it may seem
the one thing that might be problematic is, as Hutch has mentioned in the past:
accessing a byte just before accessing a dword (in a loop situation) on a p4 is sluggish
that may well be part of the problem with some of the other algos, also
Timings above are for Celeron, not P4. Anyway, if you see a way to speed up the innermost loops, please let me know.
EDIT: I managed to shorten parseTableS by 2 bytes, and speed it up by 25%. Don't ask me why ::)
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
1349 cycles for parseTableS, 134 bytes #h=50, #dots=50
1840 cycles for parseTable, 171 bytes #h=50, #dots=50
3166 cycles for parseDave, 134 bytes #h=50, #dots=400
well - if you really want to speed the thing up, here's what you need to do - and this will make the P4's happy
the Src string may or may not be 4-aligned
build the parser to parse bytes until the address IS 4-aligned
then, switch to a different parser that reads 4-aligned dwords from Src
inside the dword parser, use one register to hold the unparsed bytes (EDX, maybe)
shift the bytes into DL (SHR EDX,8) and parse them from there
one of the biggest improvements i made to my ling long kai fang bignum conversion
routines (LLKF9) was to access 4-aligned dwords for both input and output
EDIT - i notice you have Src 16-aligned
that isn't a real-world condition
in fact, you should test it with different alignments (mod 4 should be good enough)
if you look at my ling long kai fang code (LLKF9), in the init section, i scan down non-signifigant bytes from the top of the input value
that code is somewhat similar to what you want, only mine has 3 loops - yours only needs the first 2 loops
i tested that code with "mod 4" alignment cases
no matter how many bytes or how they are aligned, that code is ~ 3 times faster than STD, REP SCASB on a P4 prescott
Quote from: dedndave on December 11, 2009, 10:44:19 PM
EDIT - i notice you have Src 16-aligned
that isn't a real-world condition
There is absolutely no difference in timings between aligned and misaligned strings. Which is not surprising for modern CPUs in general, and a byte scanner in particular...
As to the shift alternative, I doubt that it can speed things up. Can you show me how you would modify the inner loop?
plSct0: mov esi, Src ; let's go
; ----------- innermost loop ----------------------------------------
plSct: movzx eax, byte ptr [esi] ; load the char from the source string
inc esi
jmp dword ptr [edi+4*eax] ; jump to the address obtained from the table
; ----------- end of innermost loop ------------------------------
JJ,
On a PIV,
; change this
mov al, XlatTbl[eax] ; get the "class"
jmp VectTbl[eax]
; to this
movzx eax, BYTE PTR XlatTbl[eax]
; stick something in here if the algo allows it.
jmp VectTbl[eax]
This should prevent a read after write stall.
Also on a PIV ala Intel, use ADD REG/VAR, 1, not INC REG/VAR.
Hutch,
movzx eax instead of mov al rocks indeed, in Dave's algo:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
1346 cycles for parseTableS, 134 bytes #h=50
2470 cycles for parseDave, 135 bytes #h=50 movzx
3156 cycles for parseDave, 134 bytes #h=50 mov al
But it is still almost a factor 2 slower than my parseTableS.
Can you give me timings for the P4?
I had tested inc esi vs. add esi, 1 in my own algo on my PIV, no significant difference at >12000 cycles.
prescott...
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
9735 cycles for parseTableS, 134 bytes #h=50, #dots=50
10125 cycles for parseTable, 171 bytes #h=50, #dots=50
3049 cycles for parseFewJumps, 99 bytes #h=50, #dots=50
3806 cycles for parseManyJumps, 99 bytes #h=50, #dots=50
12009 cycles for parseDave, 135 bytes #h=50, #dots=400
2003 cycles for parseTableS, 134 bytes #h=10, #dots=10
2073 cycles for parseTable, 171 bytes #h=10, #dots=10
724 cycles for parseFewJumps, 99 bytes #h=10, #dots=10
887 cycles for parseManyJumps, 99 bytes #h=10, #dots=10
2543 cycles for parseDave, 135 bytes #h=10, #dots=80
254 cycles for parseTableS, 134 bytes #h=1, #dots=1
264 cycles for parseTable, 171 bytes #h=1, #dots=1
98 cycles for parseFewJumps, 99 bytes #h=1, #dots=1
73 cycles for parseManyJumps, 99 bytes #h=1, #dots=1
300 cycles for parseDave, 135 bytes #h=1, #dots=8
if i get some time tomorrow, i may play with an aligned loop JJ :8)
Are you wanting to scan a string and get true/false? Use a byte scanner and return when one is found.
Are you counting certain characters? Use a table (inc chars[eax]) and bsf.
Have you tried call/ret instead of jmp/jmp?
Or are you processing the source to a destination on the fly?
Playing around with aligning threw other procs out of whack, but I did get the 'S' one 100 cycles faster.
Maybe you could have one exe per proc and discard the slow ones?
Quote from: sinsi on December 12, 2009, 07:22:52 AM
Are you wanting to scan a string and get true/false?
Sinsi,
The scanner is supposed to act, i.e. jump, for specific characters.
For example, look at your post: You have used lots of lowercase characters - ordinary text, no need to act, so get the next char. But do something specific if you find ?/[].' etc, or if you encounter an uppercase character.
My specific application is to get the value of e.g.
1010101b
( 101010 y)
12345
12345h
$12345
0x12345
123.456
(123.456)
-123e99
123eh
I guess you are able to recognise, using your knowledge and logic and a set of rules, that these are all valid number formats; some of them are being used in Windows.inc, and ml.exe has no problem distinguishing them. That is what the scanner should do: Get the info for applying the rules, then,
later, branch to the decimal or hex or bin$ algo.
For most of you number formats, just having a leading number makes the parser a lot easier, branch on anything between ascii "0" to "9" and the minus sign "-" and keep scanning until you find the end of the number, usually by a punctuation character. Depending on what is at the front or end you can determine the numeric format and convert it accordingly. The ones in brackets "(" and ")" are usually parsed in the preprocessor.
The leading "$" notation is very easy to parse because you only have to test the 1st character to determine the data type but if you are using a general purpose word parser its just as easy to rad the last character as MASM does.
Quote$12345
ok - i know the assembler doesn't recognize currency - lol
so - how is this evaluated ?
on the other hand - a nice feature
we could call this routine until the bank account is full :bg
Net_Worth_Loop:
call GetSome
add MyBankAccount,eax
jmp Net_Worth_Loop
GetSome:
mov eax,$999999999
ret
EDIT - added a little "bonus" code :P
guess i better optimize that a little bit...
(get rich quick scheme)
mov eax,$999999999
Net_Worth_Loop:
add MyBankAccount,eax
jmp Net_Worth_Loop
another question for you, Jochen...
is this parser required to evaluate reals, as well ?
if not - there is a practical limit to the length of strings
the worst case would be binary
does it need to parse 64-bit binary values ? - or is 32-bit sufficient ?
i am just curious as to what the max string length is
and one more question - as far as i know, spaces are not allowed in any of the formats - so they could be interpreted as EOL ?
mov eax,10h ;valid comment
mov eax,10h invalid comment
or do you need to evaluate the rest of the line to test validity ?
i see a little bug in some of the logic, Jochen
not timing related, but you might want to mod the code a little
this code has the potential for aliasing errors
test al, al
je @F
or al, 20h
cmp al, "e"
je HasE
cmp al, "."
je HasDot
cmp al, "h"
je HasH
cmp al, "b"
je HasB
cmp al, "y"
je HasY
cmp al, "("
je HasB1
cmp al, "("
je HasB2
changing the test order so non-alpha chars are tested prior to the "or al,20h" will eliminate the problem
test al, al
je @F
cmp al, "."
je HasDot
cmp al, "("
je HasB1
cmp al, ")" ;i think that's what you meant
je HasB2
or al, 20h
cmp al, "e"
je HasE
cmp al, "h"
je HasH
cmp al, "b"
je HasB
cmp al, "y"
je HasY
on a related note - and this may speed things up a little
because most of the characters in the strings are ascii numerics, test for that first
the null char only occurs once - even though "or al,al" may be faster than other compares, the occurance frequency is last on the list
don't test for it until you have to
test for the most likely-to-occur first
that way, fewer compares are made
i would start off with:
xor al,30h
cmp al,9
jbe Skip_It
no need to xor it back out - simply re-map the rest of the chars
also, a-f are valid hexadecimal "numbers"
you need a little bit of a state-machine to control some of the branching
also - i am still trying to find something on the "$" - anyone ?
Thanks, Dave. The logic still needs to be elaborated. Right now, I just wanted to test if a pre-scan, either sequentially or per character table, is worth the effort, speed and sizewize. The old version of Val() has almost a kByte, and that's no good...
i am trying to understand the byte-count portion of the results macro....
ifidn <algo>, <parseTable>
add eax, ptEnd-pt+1
elseifidn <algo>, <parseTableS>
add eax, ptEndS-ptS+1
elseifidn <algo>, <parseDave>
add eax, ptEndS-ptS+1 ; we treat it as if the standard jumps had been used
;add eax, Build_T-Build+1
endif
huh ? :dazzled:
you lost me, my friend - lol
i will just leave that part out
> we treat it as if the standard jumps had been used
Dave,
The macro adds the hand-counted bytes in the .data section. For better comparison, I assumed that the same number of "specific" jumps was implemented in your algo.
well, Jochen, here is my first attempt
very impressive :P
prescott:
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
14608 cycles for PAlgn4, 150 bytes #h=50,#dots=50
2827 cycles for PAlgn4, 150 bytes #h=10,#dots=10
131 cycles for PAlgn4, 150 bytes #h=1,#dots=1
Quote from: dedndave on December 12, 2009, 10:42:28 PM
well, Jochen, here is my first attempt
Thanks, Dave. Interesting approach, I hope you can tune it a little bit...
K2Z, J
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
7189 cycles for PAlgn4, 150 bytes #h=50,#dots=50
1394 cycles for PAlgn4, 150 bytes #h=10,#dots=10
84 cycles for PAlgn4, 150 bytes #h=1,#dots=1
lol - i don't think "tuning it" is going to be adequate
it needs a major overhaul :bg
but, i may see something
i made a couple small changes in ecx and got a few strokes out of it
but, i have to keep an open mind to a different concept
Quote from: dedndave on December 12, 2009, 03:15:02 PM
this code has the potential for aliasing errors
Dave, what is an aliasing error?
well - that is, perhaps, a mis-use of the terminology - lol
but, in this case, Jochen uses the OR AL,20h instruction to force upper case alpha characters to lower case
that way, he does not have to test for "A" and "a" - they both become "a"
but, half of the rest of the characters are affected, as well
for example, carriage return (0Dh) becomes hypen or minus sign (2Dh)
the OR instruction destoyed the unique identification of the 0Dh
you can no longer tell which was loaded
when you parse characters using OR 20h (or AND 0DFh to force upper case on lower case alpha characters),
it is best to test all the non-alpha characters first
notice that if you use XOR AL,30h (or SUB AL,30h) to test for numberic characters,
there are still 256 different values - they are just different - they do not lose their unique identity
the term "aliasing error" is normally used in information theory or cryptography when there is a loss of information