News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

StrLen timings needed

Started by jj2007, August 15, 2010, 09:32:10 PM

Previous topic - Next topic

Antariy

Note: in results appeared "ERROR" signal, but this is not right. Just, I set repeating string as 20 times, so 100bytes*20times = 2000bytes total length. This length showed in line with "ERROR" word - so, this is true, not error.
Also I set shorter string, but I lazy to change CodeSize macro each time. So, I will write testing string size (for comparsion with "error" :)

Unaligned StrLen, 2000bytes:

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
58       bytes for MbStrLen1
2000 - ERROR
84       bytes for MbStrLen2
2000 - ERROR
73       bytes for MbStrLen3
2000 - ERROR
80       bytes for MbStrLen4a
2000 - ERROR
71       bytes for MbStrLen4b
2000 - ERROR
78       bytes for MbStrLen5
2000 - ERROR
147      bytes for AxStrLen
2000 - ERROR

920     cycles for MbStrLen1
915     cycles for MbStrLen2
914     cycles for MbStrLen3
912     cycles for MbStrLen4a
916     cycles for MbStrLen4b
918     cycles for MbStrLen5

908     cycles for AxStrLen

909     cycles for MbStrLen1
912     cycles for MbStrLen2
913     cycles for MbStrLen3
912     cycles for MbStrLen4a
916     cycles for MbStrLen4b
918     cycles for MbStrLen5

908     cycles for AxStrLen

905     cycles for MbStrLen1
920     cycles for MbStrLen2
913     cycles for MbStrLen3
913     cycles for MbStrLen4a
915     cycles for MbStrLen4b
918     cycles for MbStrLen5

908     cycles for AxStrLen

905     cycles for MbStrLen1
912     cycles for MbStrLen2
913     cycles for MbStrLen3
913     cycles for MbStrLen4a
916     cycles for MbStrLen4b
926     cycles for MbStrLen5

907     cycles for AxStrLen

920     cycles for MbStrLen1
916     cycles for MbStrLen2
915     cycles for MbStrLen3
913     cycles for MbStrLen4a
916     cycles for MbStrLen4b
919     cycles for MbStrLen5

907     cycles for AxStrLen


--- ok ---





Aligned (16bytes) StrLen, 2000bytes:

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
58       bytes for MbStrLen1
2000 - ERROR
84       bytes for MbStrLen2
2000 - ERROR
73       bytes for MbStrLen3
2000 - ERROR
80       bytes for MbStrLen4a
2000 - ERROR
71       bytes for MbStrLen4b
2000 - ERROR
78       bytes for MbStrLen5
2000 - ERROR
147      bytes for AxStrLen
2000 - ERROR

903     cycles for MbStrLen1
909     cycles for MbStrLen2
909     cycles for MbStrLen3
906     cycles for MbStrLen4a
915     cycles for MbStrLen4b
915     cycles for MbStrLen5

903     cycles for AxStrLen

903     cycles for MbStrLen1
907     cycles for MbStrLen2
911     cycles for MbStrLen3
905     cycles for MbStrLen4a
914     cycles for MbStrLen4b
913     cycles for MbStrLen5

904     cycles for AxStrLen

902     cycles for MbStrLen1
907     cycles for MbStrLen2
910     cycles for MbStrLen3
905     cycles for MbStrLen4a
914     cycles for MbStrLen4b
913     cycles for MbStrLen5

905     cycles for AxStrLen

902     cycles for MbStrLen1
908     cycles for MbStrLen2
909     cycles for MbStrLen3
906     cycles for MbStrLen4a
913     cycles for MbStrLen4b
914     cycles for MbStrLen5

904     cycles for AxStrLen

903     cycles for MbStrLen1
907     cycles for MbStrLen2
909     cycles for MbStrLen3
905     cycles for MbStrLen4a
913     cycles for MbStrLen4b
914     cycles for MbStrLen5

903     cycles for AxStrLen


--- ok ---




Unaligned, 100bytes:

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
58       bytes for MbStrLen1
100 - ERROR
84       bytes for MbStrLen2
100 - ERROR
73       bytes for MbStrLen3
100 - ERROR
80       bytes for MbStrLen4a
100 - ERROR
71       bytes for MbStrLen4b
100 - ERROR
78       bytes for MbStrLen5
100 - ERROR
147      bytes for AxStrLen
100 - ERROR

87      cycles for MbStrLen1
97      cycles for MbStrLen2
90      cycles for MbStrLen3
90      cycles for MbStrLen4a
71      cycles for MbStrLen4b
95      cycles for MbStrLen5

67      cycles for AxStrLen

92      cycles for MbStrLen1
97      cycles for MbStrLen2
90      cycles for MbStrLen3
90      cycles for MbStrLen4a
71      cycles for MbStrLen4b
95      cycles for MbStrLen5

67      cycles for AxStrLen

92      cycles for MbStrLen1
105     cycles for MbStrLen2
81      cycles for MbStrLen3
88      cycles for MbStrLen4a
73      cycles for MbStrLen4b
93      cycles for MbStrLen5

67      cycles for AxStrLen

92      cycles for MbStrLen1
102     cycles for MbStrLen2
90      cycles for MbStrLen3
90      cycles for MbStrLen4a
71      cycles for MbStrLen4b
95      cycles for MbStrLen5

67      cycles for AxStrLen

92      cycles for MbStrLen1
97      cycles for MbStrLen2
89      cycles for MbStrLen3
98      cycles for MbStrLen4a
71      cycles for MbStrLen4b
101     cycles for MbStrLen5

68      cycles for AxStrLen


--- ok ---


Jochen, I really try get best times for all proc's. But each run, results NOT the same, but very mess: one run - one procs have big timings, next run - other proc have big timings, next - another, etc. I think, this messing is because in your procs have 2 loops (one internal), and when string is unaligned, they runs, and get biggest timings. When string aligned, they not runs, so, timings much better.

Aligned, 100bytes:

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
58       bytes for MbStrLen1
100 - ERROR
84       bytes for MbStrLen2
100 - ERROR
73       bytes for MbStrLen3
100 - ERROR
80       bytes for MbStrLen4a
100 - ERROR
71       bytes for MbStrLen4b
100 - ERROR
78       bytes for MbStrLen5
100 - ERROR
147      bytes for AxStrLen
100 - ERROR

73      cycles for MbStrLen1
66      cycles for MbStrLen2
86      cycles for MbStrLen3
65      cycles for MbStrLen4a
65      cycles for MbStrLen4b
96      cycles for MbStrLen5

64      cycles for AxStrLen

64      cycles for MbStrLen1
65      cycles for MbStrLen2
68      cycles for MbStrLen3
65      cycles for MbStrLen4a
65      cycles for MbStrLen4b
96      cycles for MbStrLen5

64      cycles for AxStrLen

84      cycles for MbStrLen1
65      cycles for MbStrLen2
86      cycles for MbStrLen3
65      cycles for MbStrLen4a
65      cycles for MbStrLen4b
96      cycles for MbStrLen5

64      cycles for AxStrLen

65      cycles for MbStrLen1
66      cycles for MbStrLen2
68      cycles for MbStrLen3
65      cycles for MbStrLen4a
65      cycles for MbStrLen4b
96      cycles for MbStrLen5

64      cycles for AxStrLen

83      cycles for MbStrLen1
65      cycles for MbStrLen2
86      cycles for MbStrLen3
65      cycles for MbStrLen4a
65      cycles for MbStrLen4b
96      cycles for MbStrLen5

64      cycles for AxStrLen


--- ok ---


As you see, if string aligned, sometimes, your shortest proc have timings near to my proc (which have size 2.53 times long).
My proc have more stable timings, but this is "price" of its size.


Next, 15bytes string, unaligned, timings:

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
58       bytes for MbStrLen1
15 - ERROR
84       bytes for MbStrLen2
15 - ERROR
73       bytes for MbStrLen3
15 - ERROR
80       bytes for MbStrLen4a
15 - ERROR
71       bytes for MbStrLen4b
15 - ERROR
78       bytes for MbStrLen5
15 - ERROR
147      bytes for AxStrLen
15 - ERROR

37      cycles for MbStrLen1
36      cycles for MbStrLen2
37      cycles for MbStrLen3
38      cycles for MbStrLen4a
38      cycles for MbStrLen4b
43      cycles for MbStrLen5

21      cycles for AxStrLen

37      cycles for MbStrLen1
36      cycles for MbStrLen2
37      cycles for MbStrLen3
38      cycles for MbStrLen4a
38      cycles for MbStrLen4b
43      cycles for MbStrLen5

21      cycles for AxStrLen

37      cycles for MbStrLen1
36      cycles for MbStrLen2
37      cycles for MbStrLen3
38      cycles for MbStrLen4a
38      cycles for MbStrLen4b
43      cycles for MbStrLen5

20      cycles for AxStrLen

37      cycles for MbStrLen1
36      cycles for MbStrLen2
37      cycles for MbStrLen3
38      cycles for MbStrLen4a
37      cycles for MbStrLen4b
43      cycles for MbStrLen5

21      cycles for AxStrLen

37      cycles for MbStrLen1
36      cycles for MbStrLen2
37      cycles for MbStrLen3
38      cycles for MbStrLen4a
38      cycles for MbStrLen4b
43      cycles for MbStrLen5

21      cycles for AxStrLen


--- ok ---



On short unaligned strings, drawbacks of two-loops looks more.
But these timings of aligned, 15 bytes string:

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
58       bytes for MbStrLen1
15 - ERROR
84       bytes for MbStrLen2
15 - ERROR
73       bytes for MbStrLen3
15 - ERROR
80       bytes for MbStrLen4a
15 - ERROR
71       bytes for MbStrLen4b
15 - ERROR
78       bytes for MbStrLen5
15 - ERROR
147      bytes for AxStrLen
15 - ERROR

21      cycles for MbStrLen1
27      cycles for MbStrLen2
29      cycles for MbStrLen3
28      cycles for MbStrLen4a
29      cycles for MbStrLen4b
32      cycles for MbStrLen5

22      cycles for AxStrLen

21      cycles for MbStrLen1
28      cycles for MbStrLen2
28      cycles for MbStrLen3
28      cycles for MbStrLen4a
29      cycles for MbStrLen4b
32      cycles for MbStrLen5

22      cycles for AxStrLen

21      cycles for MbStrLen1
27      cycles for MbStrLen2
29      cycles for MbStrLen3
27      cycles for MbStrLen4a
29      cycles for MbStrLen4b
32      cycles for MbStrLen5

22      cycles for AxStrLen

20      cycles for MbStrLen1
27      cycles for MbStrLen2
29      cycles for MbStrLen3
28      cycles for MbStrLen4a
29      cycles for MbStrLen4b
32      cycles for MbStrLen5

22      cycles for AxStrLen

21      cycles for MbStrLen1
27      cycles for MbStrLen2
29      cycles for MbStrLen3
28      cycles for MbStrLen4a
29      cycles for MbStrLen4b
32      cycles for MbStrLen5

22      cycles for AxStrLen


--- ok ---



All advantages of short code is drawed on short aligned strings. Short code fastest in this test.


So, what give analysis of test? Jochen, I think, you need to make other solution for case of unaligned strings. Because for short strings drawbacks of loops is very big (as you see, advantage of shorter code is drawed only on very large strings (2000bytes in test)).


My variant also is not very good, but I write it for testing short and long algos only.



Big ask to all peoples: test this also, please. This is interesting: how different CPUs work with a inside-loops.
If you have time, make different tests: for different string lengths and aligned/not_aligned variants. If you have no time - archive contain sources and compiled exe with unaligned 100byte string testing.



Alex
P.S. Sources not compilable by ml6.15 and earlyer. I compile them with ML8 (SSE2 movsd mnemonic).

jj2007

Hi Alex,
If you want to get rid of the ERROR, use
  movups xmm0, oword ptr Src
  .if eax!=sizeof Src-1

in the CodeSize macro.

It is true that all versions are slower for unaligned strings, but still a factor three faster than the Masm32 len() function:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
132     cycles for szLen
46      cycles for MbStrLen1
47      cycles for MbStrLen2
50      cycles for MbStrLen3
48      cycles for MbStrLen4a
53      cycles for MbStrLen4b
51      cycles for MbStrLen5


And why I preserve xmm0? Precisely to encourage people to use SSE... MasmBasic is intended to be fast and noob-friendly. The next version will be "SSE2-safe", and "SSE2-friendly", too. Try doing a Print Str$(xmm1) in ordinary assembler :wink

Antariy

Hi, Jochen!

How timings you have?
I assuraced, on CPUs with big cache unaligned access may be more fast.



Alex

Antariy

Jochen, you see results of your hex2dwon Dave's CPU?
This is incredible! What mean ONE architecture (NetBurst), if EVERY CPU in forum have different timings. Intel must write some "appendix" in each CPUs :)
For info:

14      cycles for Lingo's SSE version
13      cycles for Lingo's BIG integer version
5       cycles for Jochen's WORD-Indexed version
15      cycles for Dave's version (with minor changes)


This is fantastic!



Alex

Antariy

Jochen, I don't understand the "noob" word. What it mean? Don't forgot - I'm not good english speaker.



Alex

jj2007

Quote from: Antariy on August 16, 2010, 08:53:17 PM
Jochen, I don't understand the "noob" word. What it mean? Don't forgot - I'm not good english speaker.

noob = newbie, beginner.

Timings:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
58       bytes for MbStrLen1
80       bytes for MbStrLen4a
131      bytes for AxStrLen

Src: 100 bytes
46      cycles for MbStrLen1 (xmm0 not preserved)
48      cycles for MbStrLen4a
36      cycles for AxStrLen


Your algo is really fast, compliments :U

Antariy

Quote from: jj2007 on August 16, 2010, 09:09:52 PM
Quote from: Antariy on August 16, 2010, 08:53:17 PM
Jochen, I don't understand the "noob" word. What it mean? Don't forgot - I'm not good english speaker.

noob = newbie, beginner.

Timings:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
58       bytes for MbStrLen1
80       bytes for MbStrLen4a
131      bytes for AxStrLen

Src: 100 bytes
46      cycles for MbStrLen1 (xmm0 not preserved)
48      cycles for MbStrLen4a
36      cycles for AxStrLen


Your algo is really fast, compliments :U

No, it really BIG, not fast.
If you remove "mov eax,[esp+4]" from sources, it stand slower, try with "mov eax...".
And, for this size, it really slow.



Alex

hutch--

Can I impose on someone to include this unrolled version of Agner Fog's StrLen algo.


; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

align 4

StrLen proc item:DWORD

    mov     eax, [esp+4]            ; get pointer to string
    lea     edx, [eax+3]            ; pointer+3 used in the end
    push    ebp
    push    edi
    mov     ebp, 80808080h

  @@:     
  REPEAT 3
    mov     edi, [eax]              ; read first 4 bytes
    add     eax, 4                  ; increment pointer
    lea     ecx, [edi-01010101h]    ; subtract 1 from each byte
    not     edi                     ; invert all bytes
    and     ecx, edi                ; and these two
    and     ecx, ebp
    jnz     nxt
  ENDM

    mov     edi, [eax]              ; read first 4 bytes
    add     eax, 4                  ; 4 increment DWORD pointer
    lea     ecx, [edi-01010101h]    ; subtract 1 from each byte
    not     edi                     ; invert all bytes
    and     ecx, edi                ; and these two
    and     ecx, ebp
    jz      @B                      ; no zero bytes, continue loop

  nxt:
    test    ecx, 00008080h          ; test first two bytes
    jnz     @F
    shr     ecx, 16                 ; not in the first 2 bytes
    add     eax, 2
  @@:
    shl     cl, 1                   ; use carry flag to avoid branch
    sbb     eax, edx                ; compute length
    pop     edi
    pop     ebp

    ret     4

StrLen endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

jj2007

Here it is, Hutch:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
80       bytes for MbStrLen4a
128      bytes for AxStrLen
113      bytes for StrLenAF

Src: 100 bytes
46      cycles for MbStrLen1 (xmm0 not preserved)
48      cycles for MbStrLen4a
94      cycles for StrLenAF
36      cycles for AxStrLen


@Alex: Sorry, this is a modified version, without the fancy stuff. Much shorter and 2 cycles faster on the Celeron M, but it might be slower on other CPUs, of course. The table might interest you.

Antariy

Quote from: hutch-- on August 16, 2010, 11:27:21 PM
Can I impose on someone to include this unrolled version of Agner Fog's StrLen algo.

I make test-bed.
Timings:

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
58       bytes for MbStrLen1
100 - ERROR
84       bytes for MbStrLen2
100 - ERROR
73       bytes for MbStrLen3
100 - ERROR
80       bytes for MbStrLen4a
100 - ERROR
71       bytes for MbStrLen4b
100 - ERROR
78       bytes for MbStrLen5
100 - ERROR
147      bytes for AxStrLen
100 - ERROR

86      cycles for MbStrLen1
97      cycles for MbStrLen2
78      cycles for MbStrLen3
102     cycles for MbStrLen4a
75      cycles for MbStrLen4b
90      cycles for MbStrLen5

71      cycles for AxStrLen

247     cycles for StrLen

80      cycles for MbStrLen1
88      cycles for MbStrLen2
79      cycles for MbStrLen3
92      cycles for MbStrLen4a
87      cycles for MbStrLen4b
95      cycles for MbStrLen5

70      cycles for AxStrLen

230     cycles for StrLen

77      cycles for MbStrLen1
94      cycles for MbStrLen2
82      cycles for MbStrLen3
93      cycles for MbStrLen4a
74      cycles for MbStrLen4b
98      cycles for MbStrLen5

69      cycles for AxStrLen

230     cycles for StrLen

85      cycles for MbStrLen1
90      cycles for MbStrLen2
79      cycles for MbStrLen3
91      cycles for MbStrLen4a
88      cycles for MbStrLen4b
97      cycles for MbStrLen5

77      cycles for AxStrLen

233     cycles for StrLen

82      cycles for MbStrLen1
98      cycles for MbStrLen2
79      cycles for MbStrLen3
94      cycles for MbStrLen4a
75      cycles for MbStrLen4b
94      cycles for MbStrLen5

70      cycles for AxStrLen

244     cycles for StrLen


--- ok ---



hutch--

Gratsie,


Intel(R) Core(TM)2 Quad CPU    Q9650  @ 3.00GHz (SSE4)
58       bytes for MbStrLen1
84       bytes for MbStrLen2
73       bytes for MbStrLen3
80       bytes for MbStrLen4a
71       bytes for MbStrLen4b
78       bytes for MbStrLen5
128      bytes for AxStrLen
113      bytes for StrLenAF

Src: 100 bytes
34      cycles for MbStrLen1 (xmm0 not preserved)
42      cycles for MbStrLen4a
68      cycles for StrLenAF
24      cycles for AxStrLen

34      cycles for MbStrLen1 (xmm0 not preserved)
42      cycles for MbStrLen4a
68      cycles for StrLenAF
24      cycles for AxStrLen

34      cycles for MbStrLen1 (xmm0 not preserved)
42      cycles for MbStrLen4a
68      cycles for StrLenAF
24      cycles for AxStrLen

34      cycles for MbStrLen1 (xmm0 not preserved)
42      cycles for MbStrLen4a
67      cycles for StrLenAF
24      cycles for AxStrLen

34      cycles for MbStrLen1 (xmm0 not preserved)
42      cycles for MbStrLen4a
68      cycles for StrLenAF
24      cycles for AxStrLen


--- ok ---
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

Antariy

Quote from: jj2007 on August 16, 2010, 11:39:08 PM

@Alex: Sorry, this is a modified version, without the fancy stuff. Much shorter and 2 cycles faster on the Celeron M, but it might be slower on other CPUs, of course. The table might interest you.

No, you don't understand my nice English :)))

I mean, if you DON'T load value of [esp+4] to eax, then code almost always will be work slower, because eax - part of checking: is the string unaligned or not. And almost, proc will be work by unaligned branch. Please, post your timings for my original archive, too hard to explain in English, sorry.



Alex

Antariy

Hutch, test my version, please. Jochen made some changes, but algo works NOT in optimal way after his changes. (I see changes made by Jochen).
Test my, please.



Alex

hutch--

Alex,

This one ?


Intel(R) Core(TM)2 Quad CPU    Q9650  @ 3.00GHz (SSE4)
58       bytes for MbStrLen1
100 - ERROR
84       bytes for MbStrLen2
100 - ERROR
73       bytes for MbStrLen3
100 - ERROR
80       bytes for MbStrLen4a
100 - ERROR
71       bytes for MbStrLen4b
100 - ERROR
78       bytes for MbStrLen5
100 - ERROR
147      bytes for AxStrLen
100 - ERROR

34      cycles for MbStrLen1
42      cycles for MbStrLen2
37      cycles for MbStrLen3
42      cycles for MbStrLen4a
42      cycles for MbStrLen4b
41      cycles for MbStrLen5

25      cycles for AxStrLen

67      cycles for StrLen

34      cycles for MbStrLen1
46      cycles for MbStrLen2
45      cycles for MbStrLen3
42      cycles for MbStrLen4a
48      cycles for MbStrLen4b
41      cycles for MbStrLen5

24      cycles for AxStrLen

67      cycles for StrLen

34      cycles for MbStrLen1
42      cycles for MbStrLen2
37      cycles for MbStrLen3
41      cycles for MbStrLen4a
42      cycles for MbStrLen4b
41      cycles for MbStrLen5

24      cycles for AxStrLen

67      cycles for StrLen

34      cycles for MbStrLen1
46      cycles for MbStrLen2
46      cycles for MbStrLen3
42      cycles for MbStrLen4a
48      cycles for MbStrLen4b
41      cycles for MbStrLen5

24      cycles for AxStrLen

67      cycles for StrLen

34      cycles for MbStrLen1
42      cycles for MbStrLen2
37      cycles for MbStrLen3
42      cycles for MbStrLen4a
42      cycles for MbStrLen4b
40      cycles for MbStrLen5

24      cycles for AxStrLen

67      cycles for StrLen


--- ok ---
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

jj2007

Quote from: Antariy on August 16, 2010, 11:48:37 PM
Hutch, test my version, please. Jochen made some changes, but algo works NOT in optimal way after his changes. (I see changes made by Jochen).
Test my, please.

Alex,

Sorry, I should have split it but was too tired yesterday night. On the other hand, look at the timings: for Hutch, it's 24 cycles for both versions, for me it's 2 cycles faster without the "extras". And 128 instead of 147 bytes means 8 instead of 9 16-byte instruction cache slots.