ZeroMemory Speed Test!

Started by ecube, January 23, 2007, 03:32:37 AM

zooba

I'm impressed. For once Microsoft has done it faster than the competition  :bdg

Has someone here sold them some source code recently...

u

lol there's nothing high-tech about the "rep stosd" approach that MS obviously uses :)
Please use a smaller graphic in your signature.

zooba

There doesn't have to be: it works, and it works pretty well over the full range of sizes. There are some much faster variations for small blocks though, and small blocks are much more common than large ones. My pick is an algo that checks the size of the block and gives it to memfill if it's less than 512 bytes (though I'd do more tests to pick a better crossover) and to RtlZeroMemory if it's larger. (Checking the size of the block is an extremely cheap test, since it has to be passed as a parameter anyway.)

Cheers,

Zooba :U
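
The size-check dispatch described above can be sketched roughly as follows. This is only an illustration in portable C, not anyone's actual routine from this thread: the names hybrid_zero, zero_small, zero_large and HYBRID_CROSSOVER are hypothetical, a plain loop stands in for memfill, memset stands in for RtlZeroMemory, and the 512-byte crossover is the untuned figure from the post.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical crossover point; the post suggests 512 bytes but
   recommends further testing to pick a better value. */
#define HYBRID_CROSSOVER 512u

/* Small-block path: a simple loop standing in for an unrolled
   fill routine such as memfill (assumption, not the real code). */
static void zero_small(unsigned char *dst, size_t len)
{
    for (size_t i = 0; i < len; ++i)
        dst[i] = 0;
}

/* Large-block path: memset stands in for RtlZeroMemory / rep stosd. */
static void zero_large(unsigned char *dst, size_t len)
{
    memset(dst, 0, len);
}

/* Dispatch on the length, which the caller has to pass anyway,
   so the extra test costs one compare and one branch. */
void hybrid_zero(void *dst, size_t len)
{
    assert(dst != NULL || len == 0);
    if (len < HYBRID_CROSSOVER)
        zero_small((unsigned char *)dst, len);
    else
        zero_large((unsigned char *)dst, len);
}
```

A call such as hybrid_zero(buf, 16) takes the small path and hybrid_zero(buf, 1024) takes the large one. Where the crossover really belongs would have to be measured per CPU.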

MichaelW

Based on the surprising performance of memset for the largest block, and on zooba's idea for a hybrid, I created a procedure that combined a modified memfill with a rep stosd for blocks >= 512 bytes, and added it to the test app.

P3:

; ------------- Sample size = 16 bytes ---------------------
zero_it - hutch: 49
xzero_it - The Dude of Dudes: 41
xzero_it2 - pro3carp3: 52
RtlZeroMemory - Microsoft: 33
ZeroMemD - unknown: 56
fZeroMemory- Four-F: 60
memfill - masm32 lib: 20
AzmtMemZero - jdoe 44
RtlFillMemory - Microsoft(NT+ only)50
msvcrt memset - Microsoft: 41
_memfill - modified memfill & rep stosd : 23
; ------------- Sample size = 64 bytes ---------------------
zero_it - hutch: 146
xzero_it - The Dude of Dudes: 67
xzero_it2 - pro3carp3: 80
RtlZeroMemory - Microsoft: 61
ZeroMemD - unknown: 82
fZeroMemory- Four-F: 88
memfill - masm32 lib: 33
AzmtMemZero - jdoe 63
RtlFillMemory - Microsoft(NT+ only)77
msvcrt memset - Microsoft: 72
_memfill - modified memfill & rep stosd : 31
; ------------- Sample size = 256 bytes ---------------------
zero_it - hutch: 533
xzero_it - The Dude of Dudes: 208
xzero_it2 - pro3carp3: 222
RtlZeroMemory - Microsoft: 204
ZeroMemD - unknown: 226
fZeroMemory- Four-F: 224
memfill - masm32 lib: 103
AzmtMemZero - jdoe 130
RtlFillMemory - Microsoft(NT+ only)222
msvcrt memset - Microsoft: 212
_memfill - modified memfill & rep stosd : 100
; ------------- Sample size = 1024 bytes ---------------------
zero_it - hutch: 2100
xzero_it - The Dude of Dudes: 339
xzero_it2 - pro3carp3: 351
RtlZeroMemory - Microsoft: 333
ZeroMemD - unknown: 355
fZeroMemory- Four-F: 353
memfill - masm32 lib: 337
AzmtMemZero - jdoe 372
RtlFillMemory - Microsoft(NT+ only)350
msvcrt memset - Microsoft: 340
_memfill - modified memfill & rep stosd : 329


P4 Willamette:

; ------------- Sample size = 16 bytes ---------------------
zero_it - hutch: 42
xzero_it - The Dude of Dudes: 96
xzero_it2 - pro3carp3: 147
RtlZeroMemory - Microsoft: 112
ZeroMemD - unknown: 158
fZeroMemory- Four-F: 142
memfill - masm32 lib: 14
AzmtMemZero - jdoe 22
RtlFillMemory - Microsoft(NT+ only)120
msvcrt memset - Microsoft: 71
_memfill - modified memfill & rep stosd : 12
; ------------- Sample size = 64 bytes ---------------------
zero_it - hutch: 136
xzero_it - The Dude of Dudes: 112
xzero_it2 - pro3carp3: 174
RtlZeroMemory - Microsoft: 118
ZeroMemD - unknown: 174
fZeroMemory- Four-F: 168
memfill - masm32 lib: 38
AzmtMemZero - jdoe 30
RtlFillMemory - Microsoft(NT+ only)136
msvcrt memset - Microsoft: 78
_memfill - modified memfill & rep stosd : 25
; ------------- Sample size = 256 bytes ---------------------
zero_it - hutch: 516
xzero_it - The Dude of Dudes: 212
xzero_it2 - pro3carp3: 264
RtlZeroMemory - Microsoft: 226
ZeroMemD - unknown: 274
fZeroMemory- Four-F: 258
memfill - masm32 lib: 148
AzmtMemZero - jdoe 151
RtlFillMemory - Microsoft(NT+ only)236
msvcrt memset - Microsoft: 188
_memfill - modified memfill & rep stosd : 144
; ------------- Sample size = 1024 bytes ---------------------
zero_it - hutch: 1980
xzero_it - The Dude of Dudes: 392
xzero_it2 - pro3carp3: 444
RtlZeroMemory - Microsoft: 397
ZeroMemD - unknown: 446
fZeroMemory- Four-F: 427
memfill - masm32 lib: 548
AzmtMemZero - jdoe 549
RtlFillMemory - Microsoft(NT+ only)399
msvcrt memset - Microsoft: 343
_memfill - modified memfill & rep stosd : 337



eschew obfuscation

j_groothu

PIV - Northwood

; ------------- Sample size = 16 bytes ---------------------
zero_it - hutch: 35
xzero_it - The Dude of Dudes: 103
xzero_it2 - pro3carp3: 160
RtlZeroMemory - Microsoft: 119
ZeroMemD - unknown: 174
fZeroMemory- Four-F: 153
memfill - masm32 lib: 14
AzmtMemZero - jdoe 15
RtlFillMemory - Microsoft(NT+ only)130
msvcrt memset - Microsoft: 65
_memfill - modified memfill & rep stosd : 16
; ------------- Sample size = 64 bytes ---------------------
zero_it - hutch: 137
xzero_it - The Dude of Dudes: 119
xzero_it2 - pro3carp3: 175
RtlZeroMemory - Microsoft: 126
ZeroMemD - unknown: 187
fZeroMemory- Four-F: 168
memfill - masm32 lib: 30
AzmtMemZero - jdoe 33
RtlFillMemory - Microsoft(NT+ only)143
msvcrt memset - Microsoft: 78
_memfill - modified memfill & rep stosd : 26
; ------------- Sample size = 256 bytes ---------------------
zero_it - hutch: 511
xzero_it - The Dude of Dudes: 224
xzero_it2 - pro3carp3: 279
RtlZeroMemory - Microsoft: 237
ZeroMemD - unknown: 292
fZeroMemory- Four-F: 273
memfill - masm32 lib: 150
AzmtMemZero - jdoe 149
RtlFillMemory - Microsoft(NT+ only)247
msvcrt memset - Microsoft: 184
_memfill - modified memfill & rep stosd : 144
; ------------- Sample size = 1024 bytes ---------------------
zero_it - hutch: 2021
xzero_it - The Dude of Dudes: 393
xzero_it2 - pro3carp3: 448
RtlZeroMemory - Microsoft: 406
ZeroMemD - unknown: 471
fZeroMemory- Four-F: 446
memfill - masm32 lib: 549
AzmtMemZero - jdoe 536
RtlFillMemory - Microsoft(NT+ only)421
msvcrt memset - Microsoft: 356
_memfill - modified memfill & rep stosd : 345
Press any key to continue ...

TomRiddle

Not totally sure what this was supposed to do, but here is what I got:
Athlon K-7 (Model 2) @ 600 MHz with 256 MB
I'll go ahead and read the whole post :D


; ------------- Sample size = 16 bytes ---------------------
zero_it - hutch: 50
xzero_it - The Dude of Dudes: 31
xzero_it2 - pro3carp3: 43
RtlZeroMemory - Microsoft: 30
ZeroMemD - unknown: 48
fZeroMemory- Four-F: 41
memfill - masm32 lib: 23
AzmtMemZero - jdoe 20
RtlFillMemory - Microsoft(NT+ only)33
msvcrt memset - Microsoft: 38
_memfill - modified memfill & rep stosd : 22
; ------------- Sample size = 64 bytes ---------------------
zero_it - hutch: 147
xzero_it - The Dude of Dudes: 43
xzero_it2 - pro3carp3: 55
RtlZeroMemory - Microsoft: 42
ZeroMemD - unknown: 60
fZeroMemory- Four-F: 53
memfill - masm32 lib: 22
AzmtMemZero - jdoe 29
RtlFillMemory - Microsoft(NT+ only)45
msvcrt memset - Microsoft: 50
_memfill - modified memfill & rep stosd : 21
; ------------- Sample size = 256 bytes ---------------------
zero_it - hutch: 536
xzero_it - The Dude of Dudes: 92
xzero_it2 - pro3carp3: 104
RtlZeroMemory - Microsoft: 91
ZeroMemD - unknown: 109
fZeroMemory- Four-F: 102
memfill - masm32 lib: 58
AzmtMemZero - jdoe 76
RtlFillMemory - Microsoft(NT+ only)94
msvcrt memset - Microsoft: 99
_memfill - modified memfill & rep stosd : 59
; ------------- Sample size = 1024 bytes ---------------------
zero_it - hutch: 2128
xzero_it - The Dude of Dudes: 302
xzero_it2 - pro3carp3: 309
RtlZeroMemory - Microsoft: 285
ZeroMemD - unknown: 304
fZeroMemory- Four-F: 303
memfill - masm32 lib: 223
AzmtMemZero - jdoe 223
RtlFillMemory - Microsoft(NT+ only)288
msvcrt memset - Microsoft: 293
_memfill - modified memfill & rep stosd : 285

asmfan

I would suggest reworking TIMERS.asm: there will be fewer context switches if you add a SetThreadPriority call with THREAD_PRIORITY_TIME_CRITICAL and change HIGH_PRIORITY_CLASS to REALTIME_PRIORITY_CLASS. Then we will see more or less accurate differences among these procs.
Russia is a weird place

TomRiddle

Wait a minute...

Under mine (AMD Athlon K-7 @ 600 MHz (Model 2) (Orion)) using 1024-byte blocks:
TomRiddle: 2128, 302, 309, 285, 304, 303, 223, 223, 288, 293, 285

j_groothu using PIV Northwood (He didn't say the speed)
j_groothu: 2021, 393, 448, 406, 471, 446, 549, 536, 421, 356, 345

MichaelW using PIV Willamette (Also didn't mention it)
2100, 339, 351, 333, 355, 353, 337, 372, 350, 340, 329

Winner

Weird...

Almost forgot, I wanted to run it again to make sure it wasn't a fluke...so I ran it three times
2072, 283, 295, 282, 300, 293, 214, 220, 285, 290, 282
2068, 283, 295, 283, 300, 293, 214, 220, 286, 290, 282
2068, 283, 295, 283, 300, 293, 214, 220, 286, 290, 282

j_groothu

IMO the 600 MHz AMD would be better compared against a P3 (on which many instructions take fewer cycles than on the P4, a different animal).
After all, more cycles on a 1 GHz+ CPU is still a lot faster than fewer cycles on a 600 MHz CPU, though I would imagine it takes more sophisticated circuitry to achieve the higher clock rates.

Really, memory, cache and pipeline differences across the platforms make the difference of 100 or so cycles pretty insignificant. To me, the apparent consistency across platforms validates the tests as a comparison of the algorithms in use, rather than as a useful CPU or platform benchmark.

Relative timings could be more useful for comparing CPUs/platforms, in which case a Core 2 Duo would wipe the floor, but would probably take many more cycles to do so (if each core's cycles are summed for a total).

FYI,
Mine is the 2 GHz Northwood. I believe it is "reclassified" as a "Northwood 2.0A", partly because it is pre-Hyperthreading. It is the last core revision before Hyperthreading enters, which mucks up timings in later Northwoods and Prescotts (so I believe; I don't own a later P4 for comparison). Later Northwoods and Prescotts than mine seem to have much better SSE2 etc., but introduce other weirdnesses.
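
To put the "more cycles on a faster CPU" point in numbers, a cycle count can be converted to wall-clock time by dividing by the clock rate. The helper below is a hypothetical sketch (the name cycles_to_ns is not from the test app), and it takes the reported cycle counts and clock speeds at face value.

```c
#include <assert.h>

/* Convert a cycle count to nanoseconds for a clock given in GHz.
   At 1 GHz one cycle lasts 1 ns, so time_ns = cycles / GHz. */
double cycles_to_ns(double cycles, double clock_ghz)
{
    assert(clock_ghz > 0.0);
    return cycles / clock_ghz;
}
```

Using the 1024-byte _memfill figures posted in this thread, cycles_to_ns(345, 2.0) gives 172.5 ns for the 2 GHz Northwood, while cycles_to_ns(285, 0.6) gives about 475 ns for the 600 MHz Athlon. The Athlon needs fewer cycles but takes well over twice as long.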

Jason

Rockoon

AMD Athlon 64 X2 3800+
Manchester Core
Memory settings: 2.5 3 2 5 (1)


; ------------- Sample size = 16 bytes ---------------------
zero_it - hutch: 61
xzero_it - The Dude of Dudes: 23
xzero_it2 - pro3carp3: 35
RtlZeroMemory - Microsoft: 21
ZeroMemD - unknown: 40
fZeroMemory- Four-F: 32
memfill - masm32 lib: 15
AzmtMemZero - jdoe 14
RtlFillMemory - Microsoft(NT+ only)27
msvcrt memset - Microsoft 30
; ------------- Sample size = 64 bytes ---------------------
zero_it - hutch: 206
xzero_it - The Dude of Dudes: 29
xzero_it2 - pro3carp3: 41
RtlZeroMemory - Microsoft: 28
ZeroMemD - unknown: 46
fZeroMemory- Four-F: 39
memfill - masm32 lib: 19
AzmtMemZero - jdoe 23
RtlFillMemory - Microsoft(NT+ only)34
msvcrt memset - Microsoft 37
; ------------- Sample size = 256 bytes ---------------------
zero_it - hutch: 783
xzero_it - The Dude of Dudes: 54
xzero_it2 - pro3carp3: 65
RtlZeroMemory - Microsoft: 52
ZeroMemD - unknown: 70
fZeroMemory- Four-F: 63
memfill - masm32 lib: 51
AzmtMemZero - jdoe 59
RtlFillMemory - Microsoft(NT+ only)58
msvcrt memset - Microsoft 61
; ------------- Sample size = 1024 bytes ---------------------
zero_it - hutch: 3092
xzero_it - The Dude of Dudes: 150
xzero_it2 - pro3carp3: 162
RtlZeroMemory - Microsoft: 148
ZeroMemD - unknown: 166
fZeroMemory- Four-F: 159
memfill - masm32 lib: 182
AzmtMemZero - jdoe 216
RtlFillMemory - Microsoft(NT+ only)154
msvcrt memset - Microsoft 157
Press any key to continue ...

When C++ compilers can be coerced to emit rcl and rcr, I *might* consider using one.

EduardoS

The Riddle game...

Under Mine(AMD Athlon K-7 @ 600mhz (Model 2)(Orion)) using 1024 byte blocks
2128, 302, 309, 285, 304, 303, 223, 223, 288, 293

j_groothu using PIV Northwood (He didn't say the speed)
[b]2021[/b], 393, 448, 406, 471, 446, 549, 536, 421, 356

MichaelW using PIV Willamette (Also didn't mention it)
2100, 339, 351, 333, 355, 353, 337, 372, 350, 340

MichaelW using P3
3108, 340, 350, 334, 353, 354, 337, 371, 351, 340

Rockoon
3092, [b]150, 162, 148, 166, 159, 182, 216, 154, 157[/b]


Now the important part: is the objective long buffers? Why not use MMX/SSE?

Rockoon

Quote from: EduardoS on February 03, 2007, 12:12:32 PM
Now the important part: is the objective long buffers? Why not use MMX/SSE?

I think mainly because there is only a small benefit for the break in compatibility.

If you are in a high-performance inner loop then you aren't calling a library function, and if you are not in an inner loop the gains are negligible.

(hutch's 8-bits-per-iteration zero_it() performs fairly badly on my 64-bit system, so it's obviously a technique to be avoided in the future. On the other hand, its simplicity and footprint are probably good for the size freaks.)

jdoe

Because of the nature of masm32 memfill (only dword mov), it is a fast function, but a few more modifications can make it faster.

1) Removing the stack frame
2) Replacing "DEC ECX" by "SUB ECX, 1"
3) "SUB EDX, EAX" before the big loop to make it all dword aligned
4) Removing the useless "CMP", because "SHR" and "AND" already set the zero flag


OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

align 16

jdoe_memfill proc lpmem:DWORD,ln:DWORD,fill:DWORD

    mov edx, [esp+4]           ; buffer address
    mov eax, [esp+12]          ; fill chars

    mov ecx, [esp+8]           ; byte length
    sub edx, eax               ; bias edx so that [edx+eax] addresses the buffer
    shr ecx, 5                 ; divide by 32
    jz rmndr                   ; no full 32-byte block, go to the remainder

    align 4

  ; ------------
  ; unroll by 8
  ; ------------
  @@:
    mov [edx+eax+28], eax
    mov [edx+eax+24], eax
    mov [edx+eax+20], eax
    mov [edx+eax+16], eax
    mov [edx+eax+12], eax
    mov [edx+eax+8],  eax
    mov [edx+eax+4],  eax      ; store fill chars (address is edx+eax+offset)
    mov [edx+eax],    eax
    add edx, 32
    sub ecx, 1
    jnz @B

  rmndr:

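    and dword ptr [esp+8], 31  ; get remainder
    jz mfQuit
    mov ecx, [esp+8]
    shr ecx, 2                 ; divide by 4
                               ; NOTE: a remainder of 1-3 bytes leaves ecx = 0 here and
                               ; the loop below would wrap, so the byte length must be
                               ; a multiple of 4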

  @@:
    mov [edx+eax], eax
    add edx, 4
    sub ecx, 1
    jnz @B

  mfQuit:

    ret 12

jdoe_memfill endp

OPTION PROLOGUE:PROLOGUEDEF
OPTION EPILOGUE:EPILOGUEDEF



AMD Athlon XP 1800+ (1.53 GHz)

Quote
; ------------- Sample size = 16 bytes ---------------------

zero_it - hutch: 51
xzero_it - The Dude of Dudes: 31
xzero_it2 - pro3carp3: 43
RtlZeroMemory - Microsoft: 30
ZeroMemD - unknown: 47
fZeroMemory- Four-F: 41
AzmtMemZero - jdoe 20
RtlFillMemory - Microsoft(NT+ only)35
msvcrt memset - Microsoft: 38
_memfill - modified memfill & rep stosd : 22
memfill - masm32 lib: 23
jdoe_memfill - modified memfill by jdoe : 19

; ------------- Sample size = 64 bytes ---------------------

zero_it - hutch: 152
xzero_it - The Dude of Dudes: 43
xzero_it2 - pro3carp3: 56
RtlZeroMemory - Microsoft: 42
ZeroMemD - unknown: 60
fZeroMemory- Four-F: 54
AzmtMemZero - jdoe 29
RtlFillMemory - Microsoft(NT+ only)46
msvcrt memset - Microsoft: 51
_memfill - modified memfill & rep stosd : 21
memfill - masm32 lib: 22
jdoe_memfill - modified memfill by jdoe : 17

; ------------- Sample size = 256 bytes ---------------------

zero_it - hutch: 546
xzero_it - The Dude of Dudes: 92
xzero_it2 - pro3carp3: 103
RtlZeroMemory - Microsoft: 91
ZeroMemD - unknown: 109
fZeroMemory- Four-F: 102
AzmtMemZero - jdoe 65
RtlFillMemory - Microsoft(NT+ only)95
msvcrt memset - Microsoft: 99
_memfill - modified memfill & rep stosd : 57
memfill - masm32 lib: 58
jdoe_memfill - modified memfill by jdoe : 46

; ------------- Sample size = 1024 bytes ---------------------

zero_it - hutch: 2102
xzero_it - The Dude of Dudes: 286
xzero_it2 - pro3carp3: 297
RtlZeroMemory - Microsoft: 284
ZeroMemD - unknown: 303
fZeroMemory- Four-F: 296
AzmtMemZero - jdoe 221
RtlFillMemory - Microsoft(NT+ only)289
msvcrt memset - Microsoft: 293
_memfill - modified memfill & rep stosd : 285
memfill - masm32 lib: 217
jdoe_memfill - modified memfill by jdoe : 170

; ------------- Sample size = 2048 bytes ---------------------

zero_it - hutch: 4174
xzero_it - The Dude of Dudes: 548
xzero_it2 - pro3carp3: 557
RtlZeroMemory - Microsoft: 545
ZeroMemD - unknown: 563
fZeroMemory- Four-F: 555
AzmtMemZero - jdoe 418
RtlFillMemory - Microsoft(NT+ only)549
msvcrt memset - Microsoft: 553
_memfill - modified memfill & rep stosd : 545
memfill - masm32 lib: 411
jdoe_memfill - modified memfill by jdoe : 314




hutch--

JD,

Try using one extra register.


  mov [edx+eax+28], eax


Put the fill character in ESI or similar so that EAX is not being used on both sides of the instruction. You may have a dependency that can be removed to make it faster.

jdoe

Quote
Try using one extra register.

hutch,

I tried it and it's slower. I did a quick search in the AMD, Intel and Agner Fog documentation and didn't find anything about such a dependency. I know it can look unusual, but on my AMD it's not a slowdown. The gain grows as the memory size gets larger and larger.

I only found this little piece, which looks something like what I did in memfill and uses the same register on both sides.

; Example 5.4a. Register read stall
mov [edi + esi], eax
mov ebx, [esp + ebp]

; Example 5.4b. No register read stall
mov [edi + esi], edi
mov ebx, [edi + edi]



Could someone test the latest zeromem on an Intel processor, so we can see how jdoe_memfill performs there?

Thanks