zero_it - hutch: 3156
xzero_it - The Dude of Dudes: 151
xzero_it2 - pro3carp3: 163
RtlZeroMemory - Microsoft: 151
ZeroMemD - unknown: 169
fZeroMemory- Four-F: 163
I have added my own routine, "AzmtMemZero". Its timing is not bad on my AMD Athlon.
zero_it - hutch: 3116
xzero_it - The Dude of Dudes: 282
xzero_it2 - pro3carp3: 294
RtlZeroMemory - Microsoft: 284
ZeroMemD - unknown: 300
fZeroMemory- Four-F: 295
AzmtMemZero: 219
zero_it - hutch: 3241
xzero_it - The Dude of Dudes: 153
xzero_it2 - pro3carp3: 168
RtlZeroMemory - Microsoft: 154
ZeroMemD - unknown: 173
fZeroMemory- Four-F: 166
AzmtMemZero: 224
Sempron 3000+, DDR400
Quote
zero_it - hutch: 1996
xzero_it - The Dude of Dudes: 443
xzero_it2 - pro3carp3: 454
RtlZeroMemory - Microsoft: 448
ZeroMemD - unknown: 460
fZeroMemory- Four-F: 455
AzmtMemZero: 550
on Intel Celeron 2.53 GHz, 512MB memory
Intel PIV (Northwood) @ 2GHz
zero_it - hutch: 2003
xzero_it - The Dude of Dudes: 398
xzero_it2 - pro3carp3: 451
RtlZeroMemory - Microsoft: 407
ZeroMemD - unknown: 464
fZeroMemory- Four-F: 447
AzmtMemZero: 548
WinXP Home (32bit) @ Turion64 2GHz
zero_it - hutch: 3142
xzero_it - The Dude of Dudes: 151
xzero_it2 - pro3carp3: 160
RtlZeroMemory - Microsoft: 148
ZeroMemD - unknown: 167
fZeroMemory- Four-F: 160
AzmtMemZero: 218
This is a 2am quick play. I added the memfill proc from the masm32 library and set the timings to run 4 different lengths of fills. REP STOSD is difficult to beat at 1k and over but easy to beat under 1k. I removed the stack frame from the byte version but it will not compete with DWORD versions over very short byte counts.
Here are the timings I get from the tweaked test piece. Note that I fixed both Dude's and pro3carp3's version as they modified EDI without preserving it.
; ------------- Sample size = 16 bytes ---------------------
zero_it - hutch: 49
xzero_it - The Dude of Dudes: 88
xzero_it2 - pro3carp3: 118
RtlZeroMemory - Microsoft: 99
ZeroMemD - unknown: 126
fZeroMemory- Four-F: 112
memfill - masm32 lib: 13
; ------------- Sample size = 64 bytes ---------------------
zero_it - hutch: 148
xzero_it - The Dude of Dudes: 110
xzero_it2 - pro3carp3: 138
RtlZeroMemory - Microsoft: 120
ZeroMemD - unknown: 147
fZeroMemory- Four-F: 132
memfill - masm32 lib: 32
; ------------- Sample size = 256 bytes ---------------------
zero_it - hutch: 535
xzero_it - The Dude of Dudes: 237
xzero_it2 - pro3carp3: 271
RtlZeroMemory - Microsoft: 238
ZeroMemD - unknown: 285
fZeroMemory- Four-F: 269
memfill - masm32 lib: 147
; ------------- Sample size = 1024 bytes ---------------------
zero_it - hutch: 2071
xzero_it - The Dude of Dudes: 422
xzero_it2 - pro3carp3: 453
RtlZeroMemory - Microsoft: 429
ZeroMemD - unknown: 469
fZeroMemory- Four-F: 451
memfill - masm32 lib: 579
Press any key to continue ...
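The byte-versus-DWORD point made before these timings can be illustrated with a rough C sketch. It only counts stores and says nothing about cycles. The names zero_bytes and zero_dwords are made up for this example and are not the routines in the test piece.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Byte-at-a-time zeroing: one store per byte. Returns the number of stores. */
size_t zero_bytes(unsigned char *p, size_t n)
{
    size_t stores = 0;
    for (size_t i = 0; i < n; i++) {
        p[i] = 0;
        stores++;
    }
    return stores;
}

/* DWORD-at-a-time zeroing: one store per 4 bytes.
   Assumes n is a multiple of 4, like the DWORD routines in the thread. */
size_t zero_dwords(uint32_t *p, size_t n)
{
    assert(n % 4 == 0);
    size_t stores = 0;
    for (size_t i = 0; i < n / 4; i++) {
        p[i] = 0;
        stores++;
    }
    return stores;
}
```

For a 1024-byte block the byte loop issues 1024 stores and the DWORD loop 256, which is in line with the byte version falling behind as the block grows.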
Quote
; ------------- Sample size = 16 bytes ---------------------
zero_it - hutch: 40
xzero_it - The Dude of Dudes: 76
xzero_it2 - pro3carp3: 86
RtlZeroMemory - Microsoft: 79
ZeroMemD - unknown: 92
fZeroMemory- Four-F: 84
memfill - masm32 lib: 26
; ------------- Sample size = 64 bytes ---------------------
zero_it - hutch: 161
xzero_it - The Dude of Dudes: 98
xzero_it2 - pro3carp3: 112
RtlZeroMemory - Microsoft: 102
ZeroMemD - unknown: 118
fZeroMemory- Four-F: 110
memfill - masm32 lib: 39
; ------------- Sample size = 256 bytes ---------------------
zero_it - hutch: 511
xzero_it - The Dude of Dudes: 292
xzero_it2 - pro3carp3: 301
RtlZeroMemory - Microsoft: 299
ZeroMemD - unknown: 312
fZeroMemory- Four-F: 303
memfill - masm32 lib: 136
; ------------- Sample size = 1024 bytes ---------------------
zero_it - hutch: 1994
xzero_it - The Dude of Dudes: 452
xzero_it2 - pro3carp3: 454
RtlZeroMemory - Microsoft: 447
ZeroMemD - unknown: 471
fZeroMemory- Four-F: 457
memfill - masm32 lib: 572
still on Intel Celeron 2.53 GHz, 512MB memory :)
PIII 1GHz
; ------------- Sample size = 16 bytes ---------------------
zero_it - hutch: 49
xzero_it - The Dude of Dudes: 41
xzero_it2 - pro3carp3: 52
RtlZeroMemory - Microsoft: 33
ZeroMemD - unknown: 56
fZeroMemory- Four-F: 60
memfill - masm32 lib: 22
; ------------- Sample size = 64 bytes ---------------------
zero_it - hutch: 146
xzero_it - The Dude of Dudes: 67
xzero_it2 - pro3carp3: 80
RtlZeroMemory - Microsoft: 61
ZeroMemD - unknown: 82
fZeroMemory- Four-F: 89
memfill - masm32 lib: 33
; ------------- Sample size = 256 bytes ---------------------
zero_it - hutch: 534
xzero_it - The Dude of Dudes: 206
xzero_it2 - pro3carp3: 221
RtlZeroMemory - Microsoft: 204
ZeroMemD - unknown: 224
fZeroMemory- Four-F: 222
memfill - masm32 lib: 103
; ------------- Sample size = 1024 bytes ---------------------
zero_it - hutch: 2081
xzero_it - The Dude of Dudes: 333
xzero_it2 - pro3carp3: 347
RtlZeroMemory - Microsoft: 331
ZeroMemD - unknown: 351
fZeroMemory- Four-F: 350
memfill - masm32 lib: 338
PIV ( Northwood) @2GHz ( new test )
; ------------- Sample size = 16 bytes ---------------------
zero_it - hutch: 47
xzero_it - The Dude of Dudes: 103
xzero_it2 - pro3carp3: 166
RtlZeroMemory - Microsoft: 112
ZeroMemD - unknown: 183
fZeroMemory- Four-F: 157
memfill - masm32 lib: 18
; ------------- Sample size = 64 bytes ---------------------
zero_it - hutch: 139
xzero_it - The Dude of Dudes: 126
xzero_it2 - pro3carp3: 179
RtlZeroMemory - Microsoft: 130
ZeroMemD - unknown: 194
fZeroMemory- Four-F: 170
memfill - masm32 lib: 32
; ------------- Sample size = 256 bytes ---------------------
zero_it - hutch: 527
xzero_it - The Dude of Dudes: 225
xzero_it2 - pro3carp3: 282
RtlZeroMemory - Microsoft: 235
ZeroMemD - unknown: 301
fZeroMemory- Four-F: 285
memfill - masm32 lib: 151
; ------------- Sample size = 1024 bytes ---------------------
zero_it - hutch: 2009
xzero_it - The Dude of Dudes: 415
xzero_it2 - pro3carp3: 473
RtlZeroMemory - Microsoft: 426
ZeroMemD - unknown: 489
fZeroMemory- Four-F: 474
memfill - masm32 lib: 566
Press any key to continue ...
Quote
; ------------- Sample size = 16 bytes ---------------------
zero_it - hutch: 77
xzero_it - The Dude of Dudes: 36
xzero_it2 - pro3carp3: 42
RtlZeroMemory - Microsoft: 113
ZeroMemD - unknown: 66
fZeroMemory- Four-F: 58
memfill - masm32 lib: 39
; ------------- Sample size = 64 bytes ---------------------
zero_it - hutch: 170
xzero_it - The Dude of Dudes: 54
xzero_it2 - pro3carp3: 65
RtlZeroMemory - Microsoft: 116
ZeroMemD - unknown: 70
fZeroMemory- Four-F: 61
memfill - masm32 lib: 41
; ------------- Sample size = 256 bytes ---------------------
zero_it - hutch: 587
xzero_it - The Dude of Dudes: 77
xzero_it2 - pro3carp3: 87
RtlZeroMemory - Microsoft: 139
ZeroMemD - unknown: 91
fZeroMemory- Four-F: 84
memfill - masm32 lib: 70
; ------------- Sample size = 1024 bytes ---------------------
zero_it - hutch: 2172
xzero_it - The Dude of Dudes: 173
xzero_it2 - pro3carp3: 183
RtlZeroMemory - Microsoft: 246
ZeroMemD - unknown: 189
fZeroMemory- Four-F: 115
memfill - masm32 lib: 203
Dunno if the test is of any use; it was run under Wine on FreeBSD on a Sempron 2500+ :red
However, it seems to be faster than the original MS code :bg
I added jdoe's AzmtMemZero and RtlFillMemory, and tried to add crt_memset, but the crt one wouldn't compile. Also, my specs are AMD 64 3800, 2 GB DDR RAM.
; ------------- Sample size = 16 bytes ---------------------
zero_it - hutch: 51
xzero_it - The Dude of Dudes: 26
xzero_it2 - pro3carp3: 39
RtlZeroMemory - Microsoft: 24
ZeroMemD - unknown: 43
fZeroMemory- Four-F: 34
memfill - masm32 lib: 16
AzmtMemZero - jdoe 15
RtlFillMemory - Microsoft(NT+ only)29
; ------------- Sample size = 64 bytes ---------------------
zero_it - hutch: 154
xzero_it - The Dude of Dudes: 31
xzero_it2 - pro3carp3: 44
RtlZeroMemory - Microsoft: 30
ZeroMemD - unknown: 49
fZeroMemory- Four-F: 42
memfill - masm32 lib: 21
AzmtMemZero - jdoe 24
RtlFillMemory - Microsoft(NT+ only)35
; ------------- Sample size = 256 bytes ---------------------
zero_it - hutch: 568
xzero_it - The Dude of Dudes: 62
xzero_it2 - pro3carp3: 71
RtlZeroMemory - Microsoft: 58
ZeroMemD - unknown: 74
fZeroMemory- Four-F: 67
memfill - masm32 lib: 53
AzmtMemZero - jdoe 67
RtlFillMemory - Microsoft(NT+ only)62
; ------------- Sample size = 1024 bytes ---------------------
zero_it - hutch: 2228
xzero_it - The Dude of Dudes: 161
xzero_it2 - pro3carp3: 173
RtlZeroMemory - Microsoft: 161
ZeroMemD - unknown: 178
fZeroMemory- Four-F: 174
memfill - masm32 lib: 196
AzmtMemZero - jdoe 232
RtlFillMemory - Microsoft(NT+ only)165
The problem with memset is the [esi]. The prototype in msvcrt.inc is:
c_msvcrt typedef PROTO C :VARARG
...
externdef _imp__memset:PTR c_msvcrt
crt_memset equ <_imp__memset>
The problem might possibly have something to do with :VARARG not being preceded by a symbol. Per the MASM Programmer's Guide:
"A symbol must precede :VARARG so the procedure can access arguments as offsets from the given variable name"
;invoke crt_memset, addr szBuff, 0, [esi]
push [esi]
push 0
push offset szBuff
call crt_memset
add esp,12
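For readers more at home in C, the hand-pushed sequence above corresponds roughly to the call below. The three arguments are pushed right to left and the caller removes the 12 bytes afterwards, as usual for C-convention CRT functions. The wrapper name zero_via_memset and the len_ptr parameter (standing in for the DWORD at [esi]) are assumptions made for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical wrapper: *len_ptr plays the role of the DWORD read from [esi]. */
void zero_via_memset(char *buf, const size_t *len_ptr)
{
    assert(buf != NULL && len_ptr != NULL);
    memset(buf, 0, *len_ptr);   /* destination, fill value, byte count */
}
```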
P3:
; ------------- Sample size = 16 bytes ---------------------
zero_it - hutch: 60
xzero_it - The Dude of Dudes: 41
xzero_it2 - pro3carp3: 52
RtlZeroMemory - Microsoft: 33
ZeroMemD - unknown: 55
fZeroMemory- Four-F: 60
memfill - masm32 lib: 19
AzmtMemZero - jdoe 44
RtlFillMemory - Microsoft(NT+ only)50
msvcrt memset - Microsoft: 41
; ------------- Sample size = 64 bytes ---------------------
zero_it - hutch: 206
xzero_it - The Dude of Dudes: 67
xzero_it2 - pro3carp3: 80
RtlZeroMemory - Microsoft: 61
ZeroMemD - unknown: 82
fZeroMemory- Four-F: 89
memfill - masm32 lib: 33
AzmtMemZero - jdoe 63
RtlFillMemory - Microsoft(NT+ only)78
msvcrt memset - Microsoft: 70
; ------------- Sample size = 256 bytes ---------------------
zero_it - hutch: 785
xzero_it - The Dude of Dudes: 208
xzero_it2 - pro3carp3: 225
RtlZeroMemory - Microsoft: 205
ZeroMemD - unknown: 224
fZeroMemory- Four-F: 226
memfill - masm32 lib: 102
AzmtMemZero - jdoe 130
RtlFillMemory - Microsoft(NT+ only)222
msvcrt memset - Microsoft: 211
; ------------- Sample size = 1024 bytes ---------------------
zero_it - hutch: 3108
xzero_it - The Dude of Dudes: 340
xzero_it2 - pro3carp3: 350
RtlZeroMemory - Microsoft: 334
ZeroMemD - unknown: 353
fZeroMemory- Four-F: 354
memfill - masm32 lib: 337
AzmtMemZero - jdoe 371
RtlFillMemory - Microsoft(NT+ only)351
msvcrt memset - Microsoft: 340
P4 Willamette:
; ------------- Sample size = 16 bytes ------------------
zero_it - hutch: 44
xzero_it - The Dude of Dudes: 96
xzero_it2 - pro3carp3: 148
RtlZeroMemory - Microsoft: 109
ZeroMemD - unknown: 158
fZeroMemory- Four-F: 142
memfill - masm32 lib: 12
AzmtMemZero - jdoe 25
RtlFillMemory - Microsoft(NT+ only)120
msvcrt memset - Microsoft: 71
; ------------- Sample size = 64 bytes ------------------
zero_it - hutch: 136
xzero_it - The Dude of Dudes: 112
xzero_it2 - pro3carp3: 172
RtlZeroMemory - Microsoft: 130
ZeroMemD - unknown: 175
fZeroMemory- Four-F: 158
memfill - masm32 lib: 38
AzmtMemZero - jdoe 32
RtlFillMemory - Microsoft(NT+ only)136
msvcrt memset - Microsoft: 87
; ------------- Sample size = 256 bytes -----------------
zero_it - hutch: 552
xzero_it - The Dude of Dudes: 212
xzero_it2 - pro3carp3: 264
RtlZeroMemory - Microsoft: 231
ZeroMemD - unknown: 274
fZeroMemory- Four-F: 258
memfill - masm32 lib: 160
AzmtMemZero - jdoe 162
RtlFillMemory - Microsoft(NT+ only)237
msvcrt memset - Microsoft: 174
; ------------- Sample size = 1024 bytes ----------------
zero_it - hutch: 1992
xzero_it - The Dude of Dudes: 382
xzero_it2 - pro3carp3: 445
RtlZeroMemory - Microsoft: 399
ZeroMemD - unknown: 447
fZeroMemory- Four-F: 434
memfill - masm32 lib: 558
AzmtMemZero - jdoe 542
RtlFillMemory - Microsoft(NT+ only)398
msvcrt memset - Microsoft: 343
Thanks MichaelW. I updated the attachment in my last post with your fixes, and here's the new listing I got:
; ------------- Sample size = 16 bytes ---------------------
zero_it - hutch: 61
xzero_it - The Dude of Dudes: 23
xzero_it2 - pro3carp3: 35
RtlZeroMemory - Microsoft: 22
ZeroMemD - unknown: 40
fZeroMemory- Four-F: 32
memfill - masm32 lib: 15
AzmtMemZero - jdoe 14
RtlFillMemory - Microsoft(NT+ only)27
msvcrt memset - Microsoft 30
; ------------- Sample size = 64 bytes ---------------------
zero_it - hutch: 206
xzero_it - The Dude of Dudes: 29
xzero_it2 - pro3carp3: 41
RtlZeroMemory - Microsoft: 29
ZeroMemD - unknown: 46
fZeroMemory- Four-F: 39
memfill - masm32 lib: 19
AzmtMemZero - jdoe 23
RtlFillMemory - Microsoft(NT+ only)34
msvcrt memset - Microsoft 37
; ------------- Sample size = 256 bytes ---------------------
zero_it - hutch: 785
xzero_it - The Dude of Dudes: 53
xzero_it2 - pro3carp3: 65
RtlZeroMemory - Microsoft: 53
ZeroMemD - unknown: 70
fZeroMemory- Four-F: 63
memfill - masm32 lib: 49
AzmtMemZero - jdoe 59
RtlFillMemory - Microsoft(NT+ only)58
msvcrt memset - Microsoft 61
; ------------- Sample size = 1024 bytes ---------------------
zero_it - hutch: 3095
xzero_it - The Dude of Dudes: 150
xzero_it2 - pro3carp3: 161
RtlZeroMemory - Microsoft: 149
ZeroMemD - unknown: 166
fZeroMemory- Four-F: 159
memfill - masm32 lib: 182
AzmtMemZero - jdoe 216
RtlFillMemory - Microsoft(NT+ only)154
msvcrt memset - Microsoft 157
PIV ( Northwood) @2GHz
; ------------- Sample size = 16 bytes ---------------------
zero_it - hutch: 45
xzero_it - The Dude of Dudes: 103
xzero_it2 - pro3carp3: 160
RtlZeroMemory - Microsoft: 112
ZeroMemD - unknown: 181
fZeroMemory- Four-F: 156
memfill - masm32 lib: 16
AzmtMemZero - jdoe 15
RtlFillMemory - Microsoft(NT+ only)120
msvcrt memset - Microsoft 71
; ------------- Sample size = 64 bytes ---------------------
zero_it - hutch: 150
xzero_it - The Dude of Dudes: 128
xzero_it2 - pro3carp3: 286
RtlZeroMemory - Microsoft: 126
ZeroMemD - unknown: 187
fZeroMemory- Four-F: 170
memfill - masm32 lib: 31
AzmtMemZero - jdoe 36
RtlFillMemory - Microsoft(NT+ only)130
msvcrt memset - Microsoft 79
; ------------- Sample size = 256 bytes ---------------------
zero_it - hutch: 513
xzero_it - The Dude of Dudes: 230
xzero_it2 - pro3carp3: 281
RtlZeroMemory - Microsoft: 230
ZeroMemD - unknown: 298
fZeroMemory- Four-F: 278
memfill - masm32 lib: 154
AzmtMemZero - jdoe 157
RtlFillMemory - Microsoft(NT+ only)247
msvcrt memset - Microsoft 190
; ------------- Sample size = 1024 bytes ---------------------
zero_it - hutch: 1997
xzero_it - The Dude of Dudes: 393
xzero_it2 - pro3carp3: 460
RtlZeroMemory - Microsoft: 405
ZeroMemD - unknown: 475
fZeroMemory- Four-F: 453
memfill - masm32 lib: 548
AzmtMemZero - jdoe 547
RtlFillMemory - Microsoft(NT+ only)412
msvcrt memset - Microsoft 354
Press any key to continue ...
I'm impressed. For once Microsoft has done it faster than the competition :bdg
Has someone here sold them some source code recently...
lol there's nothing high-tech about the "rep stosd" approach, that MS obviously uses :)
There doesn't have to be, it works and it works pretty well over a full range. There are some much faster variations for small blocks though, which are much more common than larger ones. My pick is an algo which checks the size of the block, gives it to Memfill if it's less than 512 bytes (though I'd do more tests to pick a better crossover) and RtlZeroMemory if it's larger. (Checking the size of the block is an extremely cheap test, since it has to be passed as a parameter anyway)
Cheers,
Zooba :U
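The size-check idea above could be sketched in C roughly as follows. This is only an illustration under stated assumptions: zero_small stands in for a memfill-style DWORD loop, zero_large uses memset as a stand-in for a rep stosd or RtlZeroMemory path, and the 512-byte crossover is the untuned figure from the post. All names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CROSSOVER_BYTES 512   /* assumed crossover from the post; needs measuring */

/* Small-block path: plain DWORD stores (stand-in for a memfill-style loop). */
void zero_small(uint32_t *p, size_t n)
{
    for (size_t i = 0; i < n / 4; i++)
        p[i] = 0;
}

/* Large-block path: memset used here as a stand-in for rep stosd. */
void zero_large(void *p, size_t n)
{
    memset(p, 0, n);
}

/* Dispatch on the block size. Returns 0 if the small path ran, 1 if the
   large path ran (the return value exists only to make the choice visible). */
int zero_hybrid(uint32_t *p, size_t n)
{
    assert(n % 4 == 0);       /* DWORD routines assume a multiple of 4 */
    if (n < CROSSOVER_BYTES) {
        zero_small(p, n);
        return 0;
    }
    zero_large(p, n);
    return 1;
}
```

The only extra cost is one compare and branch on a value that is already passed as a parameter.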
Based on the surprising performance of memset for the largest block, and on zooba's idea for a hybrid, I created a procedure that combined a modified memfill with a rep stosd for blocks >= 512 bytes, and added it to the test app.
P3:
; ------------- Sample size = 16 bytes ---------------------
zero_it - hutch: 49
xzero_it - The Dude of Dudes: 41
xzero_it2 - pro3carp3: 52
RtlZeroMemory - Microsoft: 33
ZeroMemD - unknown: 56
fZeroMemory- Four-F: 60
memfill - masm32 lib: 20
AzmtMemZero - jdoe 44
RtlFillMemory - Microsoft(NT+ only)50
msvcrt memset - Microsoft: 41
_memfill - modified memfill & rep stosd : 23
; ------------- Sample size = 64 bytes ---------------------
zero_it - hutch: 146
xzero_it - The Dude of Dudes: 67
xzero_it2 - pro3carp3: 80
RtlZeroMemory - Microsoft: 61
ZeroMemD - unknown: 82
fZeroMemory- Four-F: 88
memfill - masm32 lib: 33
AzmtMemZero - jdoe 63
RtlFillMemory - Microsoft(NT+ only)77
msvcrt memset - Microsoft: 72
_memfill - modified memfill & rep stosd : 31
; ------------- Sample size = 256 bytes ---------------------
zero_it - hutch: 533
xzero_it - The Dude of Dudes: 208
xzero_it2 - pro3carp3: 222
RtlZeroMemory - Microsoft: 204
ZeroMemD - unknown: 226
fZeroMemory- Four-F: 224
memfill - masm32 lib: 103
AzmtMemZero - jdoe 130
RtlFillMemory - Microsoft(NT+ only)222
msvcrt memset - Microsoft: 212
_memfill - modified memfill & rep stosd : 100
; ------------- Sample size = 1024 bytes ---------------------
zero_it - hutch: 2100
xzero_it - The Dude of Dudes: 339
xzero_it2 - pro3carp3: 351
RtlZeroMemory - Microsoft: 333
ZeroMemD - unknown: 355
fZeroMemory- Four-F: 353
memfill - masm32 lib: 337
AzmtMemZero - jdoe 372
RtlFillMemory - Microsoft(NT+ only)350
msvcrt memset - Microsoft: 340
_memfill - modified memfill & rep stosd : 329
P4 Willamette:
; ------------- Sample size = 16 bytes ---------------------
zero_it - hutch: 42
xzero_it - The Dude of Dudes: 96
xzero_it2 - pro3carp3: 147
RtlZeroMemory - Microsoft: 112
ZeroMemD - unknown: 158
fZeroMemory- Four-F: 142
memfill - masm32 lib: 14
AzmtMemZero - jdoe 22
RtlFillMemory - Microsoft(NT+ only)120
msvcrt memset - Microsoft: 71
_memfill - modified memfill & rep stosd : 12
; ------------- Sample size = 64 bytes ---------------------
zero_it - hutch: 136
xzero_it - The Dude of Dudes: 112
xzero_it2 - pro3carp3: 174
RtlZeroMemory - Microsoft: 118
ZeroMemD - unknown: 174
fZeroMemory- Four-F: 168
memfill - masm32 lib: 38
AzmtMemZero - jdoe 30
RtlFillMemory - Microsoft(NT+ only)136
msvcrt memset - Microsoft: 78
_memfill - modified memfill & rep stosd : 25
; ------------- Sample size = 256 bytes ---------------------
zero_it - hutch: 516
xzero_it - The Dude of Dudes: 212
xzero_it2 - pro3carp3: 264
RtlZeroMemory - Microsoft: 226
ZeroMemD - unknown: 274
fZeroMemory- Four-F: 258
memfill - masm32 lib: 148
AzmtMemZero - jdoe 151
RtlFillMemory - Microsoft(NT+ only)236
msvcrt memset - Microsoft: 188
_memfill - modified memfill & rep stosd : 144
; ------------- Sample size = 1024 bytes ---------------------
zero_it - hutch: 1980
xzero_it - The Dude of Dudes: 392
xzero_it2 - pro3carp3: 444
RtlZeroMemory - Microsoft: 397
ZeroMemD - unknown: 446
fZeroMemory- Four-F: 427
memfill - masm32 lib: 548
AzmtMemZero - jdoe 549
RtlFillMemory - Microsoft(NT+ only)399
msvcrt memset - Microsoft: 343
_memfill - modified memfill & rep stosd : 337
PIV - Northwood
; ------------- Sample size = 16 bytes ---------------------
zero_it - hutch: 35
xzero_it - The Dude of Dudes: 103
xzero_it2 - pro3carp3: 160
RtlZeroMemory - Microsoft: 119
ZeroMemD - unknown: 174
fZeroMemory- Four-F: 153
memfill - masm32 lib: 14
AzmtMemZero - jdoe 15
RtlFillMemory - Microsoft(NT+ only)130
msvcrt memset - Microsoft: 65
_memfill - modified memfill & rep stosd : 16
; ------------- Sample size = 64 bytes ---------------------
zero_it - hutch: 137
xzero_it - The Dude of Dudes: 119
xzero_it2 - pro3carp3: 175
RtlZeroMemory - Microsoft: 126
ZeroMemD - unknown: 187
fZeroMemory- Four-F: 168
memfill - masm32 lib: 30
AzmtMemZero - jdoe 33
RtlFillMemory - Microsoft(NT+ only)143
msvcrt memset - Microsoft: 78
_memfill - modified memfill & rep stosd : 26
; ------------- Sample size = 256 bytes ---------------------
zero_it - hutch: 511
xzero_it - The Dude of Dudes: 224
xzero_it2 - pro3carp3: 279
RtlZeroMemory - Microsoft: 237
ZeroMemD - unknown: 292
fZeroMemory- Four-F: 273
memfill - masm32 lib: 150
AzmtMemZero - jdoe 149
RtlFillMemory - Microsoft(NT+ only)247
msvcrt memset - Microsoft: 184
_memfill - modified memfill & rep stosd : 144
; ------------- Sample size = 1024 bytes ---------------------
zero_it - hutch: 2021
xzero_it - The Dude of Dudes: 393
xzero_it2 - pro3carp3: 448
RtlZeroMemory - Microsoft: 406
ZeroMemD - unknown: 471
fZeroMemory- Four-F: 446
memfill - masm32 lib: 549
AzmtMemZero - jdoe 536
RtlFillMemory - Microsoft(NT+ only)421
msvcrt memset - Microsoft: 356
_memfill - modified memfill & rep stosd : 345
Press any key to continue ...
Not totally sure what this was supposed to do, but here is what I got:
Athlon K-7 (Model 2) @ 600mhz with 256mb
I'll go ahead and read the whole post :D
; ------------- Sample size = 16 bytes ---------------------
zero_it - hutch: 50
xzero_it - The Dude of Dudes: 31
xzero_it2 - pro3carp3: 43
RtlZeroMemory - Microsoft: 30
ZeroMemD - unknown: 48
fZeroMemory- Four-F: 41
memfill - masm32 lib: 23
AzmtMemZero - jdoe 20
RtlFillMemory - Microsoft(NT+ only)33
msvcrt memset - Microsoft: 38
_memfill - modified memfill & rep stosd : 22
; ------------- Sample size = 64 bytes ---------------------
zero_it - hutch: 147
xzero_it - The Dude of Dudes: 43
xzero_it2 - pro3carp3: 55
RtlZeroMemory - Microsoft: 42
ZeroMemD - unknown: 60
fZeroMemory- Four-F: 53
memfill - masm32 lib: 22
AzmtMemZero - jdoe 29
RtlFillMemory - Microsoft(NT+ only)45
msvcrt memset - Microsoft: 50
_memfill - modified memfill & rep stosd : 21
; ------------- Sample size = 256 bytes ---------------------
zero_it - hutch: 536
xzero_it - The Dude of Dudes: 92
xzero_it2 - pro3carp3: 104
RtlZeroMemory - Microsoft: 91
ZeroMemD - unknown: 109
fZeroMemory- Four-F: 102
memfill - masm32 lib: 58
AzmtMemZero - jdoe 76
RtlFillMemory - Microsoft(NT+ only)94
msvcrt memset - Microsoft: 99
_memfill - modified memfill & rep stosd : 59
; ------------- Sample size = 1024 bytes ---------------------
zero_it - hutch: 2128
xzero_it - The Dude of Dudes: 302
xzero_it2 - pro3carp3: 309
RtlZeroMemory - Microsoft: 285
ZeroMemD - unknown: 304
fZeroMemory- Four-F: 303
memfill - masm32 lib: 223
AzmtMemZero - jdoe 223
RtlFillMemory - Microsoft(NT+ only)288
msvcrt memset - Microsoft: 293
_memfill - modified memfill & rep stosd : 285
I would suggest reworking TIMERS.asm so that there are fewer context switches: add a SetThreadPriority call with THREAD_PRIORITY_TIME_CRITICAL and change HIGH_PRIORITY_CLASS to REALTIME_PRIORITY_CLASS. Then we should see more accurate differences among these procs.
Wait a minute...
On mine (AMD Athlon K-7 @ 600 MHz, Model 2, Orion), using 1024-byte blocks:
TomRiddle: 2128, 302, 309, 285, 304, 303, 223, 223, 288, 293, 285
j_groothu using PIV Northwood (He didn't say the speed)
j_groothu: 2021, 393, 448, 406, 471, 446, 549, 536, 421, 356, 345
MichaelW using PIV Willamette (Also didn't mention it)
2100, 339, 351, 333, 355, 353, 337, 372, 350, 340, 329
Winner
Weird...
Almost forgot, I wanted to run it again to make sure it wasn't a fluke...so I ran it three times
2072, 283, 295, 282, 300, 293, 214, 220, 285, 290, 282
2068, 283, 295, 283, 300, 293, 214, 220, 286, 290, 282
2068, 283, 295, 283, 300, 293, 214, 220, 286, 290, 282
IMO the 600 MHz AMD would be better compared against a P3, which executes many instructions in fewer cycles than the P4 (a different animal).
After all, more cycles on a 1 GHz+ CPU can still mean less wall-clock time than fewer cycles on a 600 MHz CPU, though I would imagine the higher clock rates require more sophisticated circuitry.
Really, memory, cache and pipeline differences across the platforms make a difference of 100 or so in cycle count pretty insignificant. To me, the apparent consistency across platforms validates the tests of the algorithms in use; it does not make them a useful CPU or platform benchmark.
Relative timings could be more useful for comparing CPUs/platforms, in which case a Core 2 Duo would wipe the floor, but would probably take many more cycles to do so (if each core's cycles are summed for a total).
FYI,
Mine is the 2 GHz Northwood. I believe it is "reclassified" as a "Northwood 2.0A", partly because it is pre-Hyperthreading. It is the last core revision before Hyperthreading arrives, which mucks up timings in later Northwoods and Prescotts (so I believe; I don't own a later P4 for comparison). Later Northwoods and Prescotts seem to have much better SSE2 etc., but introduce other weirdnesses.
Jason
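The cycles-versus-clock point above can be made concrete with a small C sketch, assuming the posted figures are clock cycles per call. The function name cycles_to_ns is made up for this example. The inputs are the 1024-byte _memfill results posted earlier: 285 on the 600 MHz Athlon and 345 on the 2 GHz Northwood.

```c
#include <assert.h>

/* Convert a cycle count to nanoseconds for a given clock rate in MHz.
   cycles / MHz gives microseconds; multiply by 1000 for nanoseconds. */
double cycles_to_ns(double cycles, double clock_mhz)
{
    assert(clock_mhz > 0.0);
    return cycles * 1000.0 / clock_mhz;
}
```

With those inputs, cycles_to_ns(285, 600) is 475 ns and cycles_to_ns(345, 2000) is 172.5 ns, so the machine that needs more cycles still finishes sooner.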
AMD Athlon 64 X2 3800+
Manchester Core
Memory settings: 2.5 3 2 5 (1)
; ------------- Sample size = 16 bytes ---------------------
zero_it - hutch: 61
xzero_it - The Dude of Dudes: 23
xzero_it2 - pro3carp3: 35
RtlZeroMemory - Microsoft: 21
ZeroMemD - unknown: 40
fZeroMemory- Four-F: 32
memfill - masm32 lib: 15
AzmtMemZero - jdoe 14
RtlFillMemory - Microsoft(NT+ only)27
msvcrt memset - Microsoft 30
; ------------- Sample size = 64 bytes ---------------------
zero_it - hutch: 206
xzero_it - The Dude of Dudes: 29
xzero_it2 - pro3carp3: 41
RtlZeroMemory - Microsoft: 28
ZeroMemD - unknown: 46
fZeroMemory- Four-F: 39
memfill - masm32 lib: 19
AzmtMemZero - jdoe 23
RtlFillMemory - Microsoft(NT+ only)34
msvcrt memset - Microsoft 37
; ------------- Sample size = 256 bytes ---------------------
zero_it - hutch: 783
xzero_it - The Dude of Dudes: 54
xzero_it2 - pro3carp3: 65
RtlZeroMemory - Microsoft: 52
ZeroMemD - unknown: 70
fZeroMemory- Four-F: 63
memfill - masm32 lib: 51
AzmtMemZero - jdoe 59
RtlFillMemory - Microsoft(NT+ only)58
msvcrt memset - Microsoft 61
; ------------- Sample size = 1024 bytes ---------------------
zero_it - hutch: 3092
xzero_it - The Dude of Dudes: 150
xzero_it2 - pro3carp3: 162
RtlZeroMemory - Microsoft: 148
ZeroMemD - unknown: 166
fZeroMemory- Four-F: 159
memfill - masm32 lib: 182
AzmtMemZero - jdoe 216
RtlFillMemory - Microsoft(NT+ only)154
msvcrt memset - Microsoft 157
Press any key to continue ...
The Riddle game...
On mine (AMD Athlon K-7 @ 600 MHz, Model 2, Orion), using 1024-byte blocks:
2128, 302, 309, 285, 304, 303, 223, 223, 288, 293
j_groothu using PIV Northwood (He didn't say the speed)
[b]2021[/b], 393, 448, 406, 471, 446, 549, 536, 421, 356
MichaelW using PIV Willamette (Also didn't mention it)
2100, 339, 351, 333, 355, 353, 337, 372, 350, 340
MichaelW using P3
3108, 340, 350, 334, 353, 354, 337, 371, 351, 340
Rockoon
3092, [b]150, 162, 148, 166, 159, 182, 216, 154, 157[/b]
Now the important part: is the objective long buffers? If so, why not use MMX/SSE?
Quote from: EduardoS on February 03, 2007, 12:12:32 PM
Now the important part: is the objective long buffers? If so, why not use MMX/SSE?
I think mainly because there is only a small benefit for the break in compatibility.
If you are in a high-performance inner loop then you aren't calling a library function, and if you are not in an inner loop the gains are negligible.
(hutch's 8-bit-per-iteration zero_it() performs fairly badly on my 64-bit system, so it's obviously a technique to be avoided in the future. On the other hand, its simplicity and footprint are probably good for the size freaks.)
Because masm32's memfill uses only DWORD moves, it is already a fast function, but a few more modifications can make it faster:
1) Removing the stack frame
2) Replacing "DEC ECX" with "SUB ECX, 1"
3) "SUB EDX, EAX" before the big loop to make it all dword aligned
4) Removing the useless "CMP", because "SHR" and "AND" already set the zero flag
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

align 16

jdoe_memfill proc lpmem:DWORD,ln:DWORD,fill:DWORD

    mov edx, [esp+4]                ; buffer address
    mov eax, [esp+12]               ; fill chars
    mov ecx, [esp+8]                ; byte length
    sub edx, eax                    ; bias the pointer so that edx+eax = buffer address
    shr ecx, 5                      ; divide by 32
    jz rmndr                        ; SHR sets ZF: no full 32-byte block to write

    align 4

  ; ------------
  ; unroll by 8
  ; ------------
  @@:
    mov [edx+eax+28], eax
    mov [edx+eax+24], eax
    mov [edx+eax+20], eax
    mov [edx+eax+16], eax
    mov [edx+eax+12], eax
    mov [edx+eax+8], eax
    mov [edx+eax+4], eax
    mov [edx+eax], eax              ; put fill chars at address edx+eax
    add edx, 32
    sub ecx, 1
    jnz @B

  rmndr:
    and dword ptr [esp+8], 31       ; get remainder (AND sets ZF)
    jz mfQuit
    mov ecx, [esp+8]
    shr ecx, 2                      ; divide by 4
    jz mfQuit                       ; remainder of 1-3 bytes: no whole DWORD left to store
  @@:
    mov [edx+eax], eax
    add edx, 4
    sub ecx, 1
    jnz @B

  mfQuit:
    ret 12                          ; no stack frame: pop the three DWORD arguments

jdoe_memfill endp

OPTION PROLOGUE:PROLOGUEDEF
OPTION EPILOGUE:EPILOGUEDEF
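The structure of the routine above (eight DWORD stores per iteration, then a DWORD-at-a-time remainder loop) can be mirrored in C as a rough sketch. It is not a translation of the register trick with EAX as both index and data. The name fill_dwords_unrolled is invented for the example, and the byte length is assumed to be a multiple of 4.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fill nbytes bytes at dst with the 32-bit pattern fill.
   Assumes nbytes is a multiple of 4. */
void fill_dwords_unrolled(uint32_t *dst, size_t nbytes, uint32_t fill)
{
    assert(nbytes % 4 == 0);

    size_t blocks = nbytes >> 5;        /* number of full 32-byte blocks */
    size_t rest   = (nbytes & 31) >> 2; /* leftover DWORDs, 0..7 */

    /* Main loop, unrolled by 8 DWORDs (32 bytes per iteration). */
    for (size_t i = 0; i < blocks; i++) {
        dst[7] = fill;
        dst[6] = fill;
        dst[5] = fill;
        dst[4] = fill;
        dst[3] = fill;
        dst[2] = fill;
        dst[1] = fill;
        dst[0] = fill;
        dst += 8;
    }

    /* Remainder loop: one DWORD per iteration. */
    for (size_t i = 0; i < rest; i++)
        dst[i] = fill;
}
```

An optimizing C compiler may reshape both loops, so this shows the control flow only and is not a way to reproduce the posted timings.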
AMD Athlon XP 1800+ (1.53 GHz)
Quote
; ------------- Sample size = 16 bytes ---------------------
zero_it - hutch: 51
xzero_it - The Dude of Dudes: 31
xzero_it2 - pro3carp3: 43
RtlZeroMemory - Microsoft: 30
ZeroMemD - unknown: 47
fZeroMemory- Four-F: 41
AzmtMemZero - jdoe 20
RtlFillMemory - Microsoft(NT+ only)35
msvcrt memset - Microsoft: 38
_memfill - modified memfill & rep stosd : 22
memfill - masm32 lib: 23
jdoe_memfill - modified memfill by jdoe : 19
; ------------- Sample size = 64 bytes ---------------------
zero_it - hutch: 152
xzero_it - The Dude of Dudes: 43
xzero_it2 - pro3carp3: 56
RtlZeroMemory - Microsoft: 42
ZeroMemD - unknown: 60
fZeroMemory- Four-F: 54
AzmtMemZero - jdoe 29
RtlFillMemory - Microsoft(NT+ only)46
msvcrt memset - Microsoft: 51
_memfill - modified memfill & rep stosd : 21
memfill - masm32 lib: 22
jdoe_memfill - modified memfill by jdoe : 17
; ------------- Sample size = 256 bytes ---------------------
zero_it - hutch: 546
xzero_it - The Dude of Dudes: 92
xzero_it2 - pro3carp3: 103
RtlZeroMemory - Microsoft: 91
ZeroMemD - unknown: 109
fZeroMemory- Four-F: 102
AzmtMemZero - jdoe 65
RtlFillMemory - Microsoft(NT+ only)95
msvcrt memset - Microsoft: 99
_memfill - modified memfill & rep stosd : 57
memfill - masm32 lib: 58
jdoe_memfill - modified memfill by jdoe : 46
; ------------- Sample size = 1024 bytes ---------------------
zero_it - hutch: 2102
xzero_it - The Dude of Dudes: 286
xzero_it2 - pro3carp3: 297
RtlZeroMemory - Microsoft: 284
ZeroMemD - unknown: 303
fZeroMemory- Four-F: 296
AzmtMemZero - jdoe 221
RtlFillMemory - Microsoft(NT+ only)289
msvcrt memset - Microsoft: 293
_memfill - modified memfill & rep stosd : 285
memfill - masm32 lib: 217
jdoe_memfill - modified memfill by jdoe : 170
; ------------- Sample size = 2048 bytes ---------------------
zero_it - hutch: 4174
xzero_it - The Dude of Dudes: 548
xzero_it2 - pro3carp3: 557
RtlZeroMemory - Microsoft: 545
ZeroMemD - unknown: 563
fZeroMemory- Four-F: 555
AzmtMemZero - jdoe 418
RtlFillMemory - Microsoft(NT+ only)549
msvcrt memset - Microsoft: 553
_memfill - modified memfill & rep stosd : 545
memfill - masm32 lib: 411
jdoe_memfill - modified memfill by jdoe : 314
JD,
Try using one extra register.
mov [edx+eax+28], eax
Make the fill character in ESI or similar so that EAX is not being used both sides of the instruction. You may have a dependency that can be removed to make it faster.
Quote
Try using one extra register.
hutch,
I tried, and it's slower. I did a quick search of the AMD, Intel and Agner Fog documentation and didn't find anything about such a dependency. I know it can look unusual, but on my AMD it's not a slowdown. Its advantage over the others grows as the memory size gets larger.
I only found this little piece, which resembles what I did in memfill and uses the same register on both sides.
; Example 5.4a. Register read stall
mov [edi + esi], eax
mov ebx, [esp + ebp]
; Example 5.4b. No register read stall
mov [edi + esi], edi
mov ebx, [edi + edi]
Could someone test the latest zeromem on an Intel processor so we can see how jdoe_memfill performs?
Thanks
P3 (all I have running ATM):
; ------------- Sample size = 16 bytes ---------------------
zero_it - hutch: 49
xzero_it - The Dude of Dudes: 41
xzero_it2 - pro3carp3: 53
RtlZeroMemory - Microsoft: 33
ZeroMemD - unknown: 55
fZeroMemory- Four-F: 60
AzmtMemZero - jdoe 44
RtlFillMemory - Microsoft(NT+ only)51
msvcrt memset - Microsoft: 42
_memfill - modified memfill & rep stosd : 23
memfill - masm32 lib: 20
jdoe_memfill - modified memfill by jdoe : 28
; ------------- Sample size = 64 bytes ---------------------
zero_it - hutch: 146
xzero_it - The Dude of Dudes: 68
xzero_it2 - pro3carp3: 79
RtlZeroMemory - Microsoft: 61
ZeroMemD - unknown: 81
fZeroMemory- Four-F: 88
AzmtMemZero - jdoe 63
RtlFillMemory - Microsoft(NT+ only)77
msvcrt memset - Microsoft: 70
_memfill - modified memfill & rep stosd : 31
memfill - masm32 lib: 34
jdoe_memfill - modified memfill by jdoe : 34
; ------------- Sample size = 256 bytes ---------------------
zero_it - hutch: 531
xzero_it - The Dude of Dudes: 208
xzero_it2 - pro3carp3: 221
RtlZeroMemory - Microsoft: 204
ZeroMemD - unknown: 225
fZeroMemory- Four-F: 224
AzmtMemZero - jdoe 130
RtlFillMemory - Microsoft(NT+ only)222
msvcrt memset - Microsoft: 211
_memfill - modified memfill & rep stosd : 100
memfill - masm32 lib: 102
jdoe_memfill - modified memfill by jdoe : 100
; ------------- Sample size = 1024 bytes ---------------------
zero_it - hutch: 2078
xzero_it - The Dude of Dudes: 338
xzero_it2 - pro3carp3: 350
RtlZeroMemory - Microsoft: 333
ZeroMemD - unknown: 354
fZeroMemory- Four-F: 353
AzmtMemZero - jdoe 371
RtlFillMemory - Microsoft(NT+ only)350
msvcrt memset - Microsoft: 340
_memfill - modified memfill & rep stosd : 328
memfill - masm32 lib: 337
jdoe_memfill - modified memfill by jdoe : 319
; ------------- Sample size = 2048 bytes ---------------------
zero_it - hutch: 4143
xzero_it - The Dude of Dudes: 510
xzero_it2 - pro3carp3: 523
RtlZeroMemory - Microsoft: 506
ZeroMemD - unknown: 527
fZeroMemory- Four-F: 525
AzmtMemZero - jdoe 693
RtlFillMemory - Microsoft(NT+ only)522
msvcrt memset - Microsoft: 514
_memfill - modified memfill & rep stosd : 501
memfill - masm32 lib: 646
jdoe_memfill - modified memfill by jdoe : 613
Thanks Michael
The results are more impressive on my processor. Looks like I'm wasting my time playing with optimization. I can impress myself on my AMD, but every time the same code is executed on Intel I am disappointed. On the other hand, code that was optimized on Intel performs well on AMD. Intel optimization seems more predictable when executed on other CPUs.
No more AMD for me :bdg
JD,
This is on my 2.8 GHz PIV. It's reasonably typical of late PIVs.
; ------------- Sample size = 16 bytes ---------------------
zero_it - hutch: 49
xzero_it - The Dude of Dudes: 89
xzero_it2 - pro3carp3: 118
RtlZeroMemory - Microsoft: 103
ZeroMemD - unknown: 129
fZeroMemory- Four-F: 112
AzmtMemZero - jdoe 12
RtlFillMemory - Microsoft(NT+ only)110
msvcrt memset - Microsoft: 53
_memfill - modified memfill & rep stosd : 16
memfill - masm32 lib: 14
jdoe_memfill - modified memfill by jdoe : 14
; ------------- Sample size = 64 bytes ---------------------
zero_it - hutch: 149
xzero_it - The Dude of Dudes: 110
xzero_it2 - pro3carp3: 138
RtlZeroMemory - Microsoft: 126
ZeroMemD - unknown: 155
fZeroMemory- Four-F: 132
AzmtMemZero - jdoe 35
RtlFillMemory - Microsoft(NT+ only)132
msvcrt memset - Microsoft: 74
_memfill - modified memfill & rep stosd : 27
memfill - masm32 lib: 33
jdoe_memfill - modified memfill by jdoe : 31
; ------------- Sample size = 256 bytes ---------------------
zero_it - hutch: 540
xzero_it - The Dude of Dudes: 237
xzero_it2 - pro3carp3: 272
RtlZeroMemory - Microsoft: 243
ZeroMemD - unknown: 286
fZeroMemory- Four-F: 269
AzmtMemZero - jdoe 152
RtlFillMemory - Microsoft(NT+ only)252
msvcrt memset - Microsoft: 201
_memfill - modified memfill & rep stosd : 147
memfill - masm32 lib: 147
jdoe_memfill - modified memfill by jdoe : 147
; ------------- Sample size = 1024 bytes ---------------------
zero_it - hutch: 2072
xzero_it - The Dude of Dudes: 423
xzero_it2 - pro3carp3: 455
RtlZeroMemory - Microsoft: 433
ZeroMemD - unknown: 469
fZeroMemory- Four-F: 448
AzmtMemZero - jdoe 563
RtlFillMemory - Microsoft(NT+ only)440
msvcrt memset - Microsoft: 382
_memfill - modified memfill & rep stosd : 374
memfill - masm32 lib: 568
jdoe_memfill - modified memfill by jdoe : 559
; ------------- Sample size = 2048 bytes ---------------------
zero_it - hutch: 4124
xzero_it - The Dude of Dudes: 646
xzero_it2 - pro3carp3: 687
RtlZeroMemory - Microsoft: 655
ZeroMemD - unknown: 705
fZeroMemory- Four-F: 686
AzmtMemZero - jdoe 1072
RtlFillMemory - Microsoft(NT+ only)671
msvcrt memset - Microsoft: 614
_memfill - modified memfill & rep stosd : 600
memfill - masm32 lib: 1107
jdoe_memfill - modified memfill by jdoe : 1088
Pentium D 940 (dual-core 3.2 GHz)
; ------------- Sample size = 16 bytes ---------------------
zero_it - hutch: 57
xzero_it - The Dude of Dudes: 87
xzero_it2 - pro3carp3: 92
RtlZeroMemory - Microsoft: 80
ZeroMemD - unknown: 94
fZeroMemory- Four-F: 87
AzmtMemZero - jdoe 22
RtlFillMemory - Microsoft(NT+ only)83
msvcrt memset - Microsoft: 81
_memfill - modified memfill & rep stosd : 24
memfill - masm32 lib: 27
jdoe_memfill - modified memfill by jdoe : 25
; ------------- Sample size = 64 bytes ---------------------
zero_it - hutch: 171
xzero_it - The Dude of Dudes: 100
xzero_it2 - pro3carp3: 114
RtlZeroMemory - Microsoft: 106
ZeroMemD - unknown: 118
fZeroMemory- Four-F: 111
AzmtMemZero - jdoe 39
RtlFillMemory - Microsoft(NT+ only)113
msvcrt memset - Microsoft: 105
_memfill - modified memfill & rep stosd : 37
memfill - masm32 lib: 39
jdoe_memfill - modified memfill by jdoe : 36
; ------------- Sample size = 256 bytes ---------------------
zero_it - hutch: 549
xzero_it - The Dude of Dudes: 286
xzero_it2 - pro3carp3: 296
RtlZeroMemory - Microsoft: 290
ZeroMemD - unknown: 307
fZeroMemory- Four-F: 300
AzmtMemZero - jdoe 138
RtlFillMemory - Microsoft(NT+ only)296
msvcrt memset - Microsoft: 73
_memfill - modified memfill & rep stosd : 138
memfill - masm32 lib: 152
jdoe_memfill - modified memfill by jdoe : 134
; ------------- Sample size = 1024 bytes ---------------------
zero_it - hutch: 2013
xzero_it - The Dude of Dudes: 470
xzero_it2 - pro3carp3: 490
RtlZeroMemory - Microsoft: 453
ZeroMemD - unknown: 482
fZeroMemory- Four-F: 462
AzmtMemZero - jdoe 570
RtlFillMemory - Microsoft(NT+ only)473
msvcrt memset - Microsoft: 367
_memfill - modified memfill & rep stosd : 436
memfill - masm32 lib: 594
jdoe_memfill - modified memfill by jdoe : 572
; ------------- Sample size = 2048 bytes ---------------------
zero_it - hutch: 4029
xzero_it - The Dude of Dudes: 757
xzero_it2 - pro3carp3: 774
RtlZeroMemory - Microsoft: 746
ZeroMemD - unknown: 766
fZeroMemory- Four-F: 751
AzmtMemZero - jdoe 1101
RtlFillMemory - Microsoft(NT+ only)811
msvcrt memset - Microsoft: 676
_memfill - modified memfill & rep stosd : 763
memfill - masm32 lib: 1128
jdoe_memfill - modified memfill by jdoe : 1051
Latest zeromem on Intel P4 1.8 GHz, 640mb RAM
; ------------- Sample size = 16 bytes ---------------------
zero_it - hutch: 45
xzero_it - The Dude of Dudes: 100
xzero_it2 - pro3carp3: 161
RtlZeroMemory - Microsoft: 115
ZeroMemD - unknown: 167
fZeroMemory- Four-F: 155
memfill - masm32 lib: 13
AzmtMemZero - jdoe 12
RtlFillMemory - Microsoft(NT+ only)116
msvcrt memset - Microsoft: 59
_memfill - modified memfill & rep stosd : 15
; ------------- Sample size = 64 bytes ---------------------
zero_it - hutch: 138
xzero_it - The Dude of Dudes: 113
xzero_it2 - pro3carp3: 167
RtlZeroMemory - Microsoft: 131
ZeroMemD - unknown: 179
fZeroMemory- Four-F: 161
memfill - masm32 lib: 31
AzmtMemZero - jdoe 36
RtlFillMemory - Microsoft(NT+ only)125
msvcrt memset - Microsoft: 75
_memfill - modified memfill & rep stosd : 26
; ------------- Sample size = 256 bytes ---------------------
zero_it - hutch: 523
xzero_it - The Dude of Dudes: 214
xzero_it2 - pro3carp3: 269
RtlZeroMemory - Microsoft: 223
ZeroMemD - unknown: 282
fZeroMemory- Four-F: 264
memfill - masm32 lib: 150
AzmtMemZero - jdoe 150
RtlFillMemory - Microsoft(NT+ only)230
msvcrt memset - Microsoft: 176
_memfill - modified memfill & rep stosd : 146
; ------------- Sample size = 1024 bytes ---------------------
zero_it - hutch: 2005
xzero_it - The Dude of Dudes: 383
xzero_it2 - pro3carp3: 438
RtlZeroMemory - Microsoft: 389
ZeroMemD - unknown: 443
fZeroMemory- Four-F: 436
memfill - masm32 lib: 554
AzmtMemZero - jdoe 548
RtlFillMemory - Microsoft(NT+ only)411
msvcrt memset - Microsoft: 349
_memfill - modified memfill & rep stosd : 337
Press any key to continue ...
I added in my own SSE code and added in NightWare's code from another thread. I have a Core 2 Duo processor, and I would guess that most people on the forums don't have one. Can someone with a P4 class processor run it so I can get some idea of how my code works on it? I don't have a P4 machine that boots.
There is still stuff I can do to speed it up, but I just wanted to post what I have so far.
; ------------- Sample size = 16 bytes ---------------------
zero_it - hutch: 50
xzero_it - The Dude of Dudes: 42
xzero_it2 - pro3carp3: 44
RtlZeroMemory - Microsoft: 31
ZeroMemD - unknown: 46
fZeroMemory- Four-F: 52
AzmtMemZero - jdoe 56
RtlFillMemory - Microsoft(NT+ only)46
msvcrt memset - Microsoft: 46
_memfill - modified memfill & rep stosd : 12
memfill - masm32 lib: 12
jdoe_memfill - modified memfill by jdoe : 23
Sse_ZeroMem_UnAligned - NightWare: 10
Mark_zeromem_SSE - Mark Larson: 2
; ------------- Sample size = 64 bytes ---------------------
zero_it - hutch: 136
xzero_it - The Dude of Dudes: 47
xzero_it2 - pro3carp3: 49
RtlZeroMemory - Microsoft: 36
ZeroMemD - unknown: 53
fZeroMemory- Four-F: 57
AzmtMemZero - jdoe 66
RtlFillMemory - Microsoft(NT+ only)52
msvcrt memset - Microsoft: 51
_memfill - modified memfill & rep stosd : 26
memfill - masm32 lib: 31
jdoe_memfill - modified memfill by jdoe : 30
Sse_ZeroMem_UnAligned - NightWare: 10
Mark_zeromem_SSE - Mark Larson: 2
; ------------- Sample size = 256 bytes ---------------------
zero_it - hutch: 299
xzero_it - The Dude of Dudes: 96
xzero_it2 - pro3carp3: 98
RtlZeroMemory - Microsoft: 85
ZeroMemD - unknown: 100
fZeroMemory- Four-F: 106
AzmtMemZero - jdoe 121
RtlFillMemory - Microsoft(NT+ only)100
msvcrt memset - Microsoft: 99
_memfill - modified memfill & rep stosd : 77
memfill - masm32 lib: 84
jdoe_memfill - modified memfill by jdoe : 89
Sse_ZeroMem_UnAligned - NightWare: 22
Mark_zeromem_SSE - Mark Larson: 17
; ------------- Sample size = 1024 bytes ---------------------
zero_it - hutch: 1094
xzero_it - The Dude of Dudes: 290
xzero_it2 - pro3carp3: 296
RtlZeroMemory - Microsoft: 283
ZeroMemD - unknown: 298
fZeroMemory- Four-F: 298
AzmtMemZero - jdoe 337
RtlFillMemory - Microsoft(NT+ only)301
msvcrt memset - Microsoft: 301
_memfill - modified memfill & rep stosd : 282
memfill - masm32 lib: 283
jdoe_memfill - modified memfill by jdoe : 297
Sse_ZeroMem_UnAligned - NightWare: 75
Mark_zeromem_SSE - Mark Larson: 67
; ------------- Sample size = 2048 bytes ---------------------
zero_it - hutch: 2180
xzero_it - The Dude of Dudes: 559
xzero_it2 - pro3carp3: 559
RtlZeroMemory - Microsoft: 548
ZeroMemD - unknown: 561
fZeroMemory- Four-F: 567
AzmtMemZero - jdoe 611
RtlFillMemory - Microsoft(NT+ only)548
msvcrt memset - Microsoft: 560
_memfill - modified memfill & rep stosd : 545
memfill - masm32 lib: 566
jdoe_memfill - modified memfill by jdoe : 559
Sse_ZeroMem_UnAligned - NightWare: 143
Mark_zeromem_SSE - Mark Larson: 138
Press any key to continue ...
P4 2.8 1gb Ram........
; ------------- Sample size = 16 bytes ---------------------
zero_it - hutch: 38
xzero_it - The Dude of Dudes: 76
xzero_it2 - pro3carp3: 85
RtlZeroMemory - Microsoft: 77
ZeroMemD - unknown: 92
fZeroMemory- Four-F: 84
AzmtMemZero - jdoe 17
RtlFillMemory - Microsoft(NT+ only)80
msvcrt memset - Microsoft: 71
_memfill - modified memfill & rep stosd : 25
memfill - masm32 lib: 25
jdoe_memfill - modified memfill by jdoe : 20
Sse_ZeroMem_UnAligned - NightWare: 16
Mark_zeromem_SSE - Mark Larson: 9
; ------------- Sample size = 64 bytes ---------------------
zero_it - hutch: 159
xzero_it - The Dude of Dudes: 98
xzero_it2 - pro3carp3: 112
RtlZeroMemory - Microsoft: 103
ZeroMemD - unknown: 118
fZeroMemory- Four-F: 110
AzmtMemZero - jdoe 38
RtlFillMemory - Microsoft(NT+ only)106
msvcrt memset - Microsoft: 98
_memfill - modified memfill & rep stosd : 36
memfill - masm32 lib: 38
jdoe_memfill - modified memfill by jdoe : 36
Sse_ZeroMem_UnAligned - NightWare: 19
Mark_zeromem_SSE - Mark Larson: 9
; ------------- Sample size = 256 bytes ---------------------
zero_it - hutch: 506
xzero_it - The Dude of Dudes: 289
xzero_it2 - pro3carp3: 299
RtlZeroMemory - Microsoft: 293
ZeroMemD - unknown: 307
fZeroMemory- Four-F: 301
AzmtMemZero - jdoe 133
RtlFillMemory - Microsoft(NT+ only)297
msvcrt memset - Microsoft: 281
_memfill - modified memfill & rep stosd : 138
memfill - masm32 lib: 141
jdoe_memfill - modified memfill by jdoe : 127
Sse_ZeroMem_UnAligned - NightWare: 42
Mark_zeromem_SSE - Mark Larson: 28
; ------------- Sample size = 1024 bytes ---------------------
zero_it - hutch: 1973
xzero_it - The Dude of Dudes: 431
xzero_it2 - pro3carp3: 442
RtlZeroMemory - Microsoft: 437
ZeroMemD - unknown: 452
fZeroMemory- Four-F: 446
AzmtMemZero - jdoe 547
RtlFillMemory - Microsoft(NT+ only)439
msvcrt memset - Microsoft: 436
_memfill - modified memfill & rep stosd : 412
memfill - masm32 lib: 564
jdoe_memfill - modified memfill by jdoe : 556
Sse_ZeroMem_UnAligned - NightWare: 325
Mark_zeromem_SSE - Mark Larson: 315
; ------------- Sample size = 2048 bytes ---------------------
zero_it - hutch: 3876
xzero_it - The Dude of Dudes: 721
xzero_it2 - pro3carp3: 729
RtlZeroMemory - Microsoft: 726
ZeroMemD - unknown: 740
fZeroMemory- Four-F: 733
AzmtMemZero - jdoe 1032
RtlFillMemory - Microsoft(NT+ only)730
msvcrt memset - Microsoft: 726
_memfill - modified memfill & rep stosd : 701
memfill - masm32 lib: 1070
jdoe_memfill - modified memfill by jdoe : 1039
Sse_ZeroMem_UnAligned - NightWare: 616
Mark_zeromem_SSE - Mark Larson: 605
Press any key to continue ...
And here are the results on my totally irrelevant AMD
------------- Sample size in bytes =          16    64   256  1024  2048
zero_it - hutch:                              64   208   790  3125  6220
xzero_it - The Dude of Dudes:                 31    43    91   286   544
xzero_it2 - pro3carp3:                        43    55   103   298   556
RtlZeroMemory - Microsoft:                    31    42    91   285   543
ZeroMemD - unknown:                           48    60   109   302   562
************ error in routine **********
fZeroMemory- Four-F:                          43    55   104   297   556
AzmtMemZero - jdoe                            20    29    65   222   416
RtlFillMemory - Microsoft(NT+ only)           34    46    94   288   547
msvcrt memset - Microsoft:                    38    50    98   292   551
_memfill - modified memfill & rep stosd :     22    21    58   285   544
************ error in routine **********
memfill - masm32 lib:                         23    22    58   216   409
jdoe_memfill - modified memfill by jdoe :     20    17    45   169   313
Sse_ZeroMem_UnAligned - NightWare:            17    19    37   156   284
Mark_zeromem_SSE - Mark Larson:                4     8    74   171   300
************ error in routine **********
Press any key to continue ...
On my P4 - 3GHz
; ------------- Sample size = 16 bytes ---------------------
zero_it - hutch: 89
xzero_it - The Dude of Dudes: 119
xzero_it2 - pro3carp3: 87
RtlZeroMemory - Microsoft: 85
ZeroMemD - unknown: 104
fZeroMemory- Four-F: 94
AzmtMemZero - jdoe 20
RtlFillMemory - Microsoft(NT+ only)88
msvcrt memset - Microsoft: 80
_memfill - modified memfill & rep stosd : 24
memfill - masm32 lib: 31
jdoe_memfill - modified memfill by jdoe : 21
Sse_ZeroMem_UnAligned - NightWare: 19
Mark_zeromem_SSE - Mark Larson: 8
; ------------- Sample size = 64 bytes ---------------------
zero_it - hutch: 169
xzero_it - The Dude of Dudes: 100
xzero_it2 - pro3carp3: 115
RtlZeroMemory - Microsoft: 104
ZeroMemD - unknown: 121
fZeroMemory- Four-F: 113
AzmtMemZero - jdoe 39
RtlFillMemory - Microsoft(NT+ only)107
msvcrt memset - Microsoft: 100
_memfill - modified memfill & rep stosd : 38
memfill - masm32 lib: 40
jdoe_memfill - modified memfill by jdoe : 37
Sse_ZeroMem_UnAligned - NightWare: 20
Mark_zeromem_SSE - Mark Larson: 8
; ------------- Sample size = 256 bytes ---------------------
zero_it - hutch: 532
xzero_it - The Dude of Dudes: 289
xzero_it2 - pro3carp3: 298
RtlZeroMemory - Microsoft: 289
ZeroMemD - unknown: 306
fZeroMemory- Four-F: 296
AzmtMemZero - jdoe 143
RtlFillMemory - Microsoft(NT+ only)292
msvcrt memset - Microsoft: 277
_memfill - modified memfill & rep stosd : 145
memfill - masm32 lib: 155
jdoe_memfill - modified memfill by jdoe : 133
Sse_ZeroMem_UnAligned - NightWare: 47
Mark_zeromem_SSE - Mark Larson: 29
; ------------- Sample size = 1024 bytes ---------------------
zero_it - hutch: 2005
xzero_it - The Dude of Dudes: 432
xzero_it2 - pro3carp3: 446
RtlZeroMemory - Microsoft: 442
ZeroMemD - unknown: 452
fZeroMemory- Four-F: 453
AzmtMemZero - jdoe 560
RtlFillMemory - Microsoft(NT+ only)447
msvcrt memset - Microsoft: 428
_memfill - modified memfill & rep stosd : 409
memfill - masm32 lib: 585
jdoe_memfill - modified memfill by jdoe : 589
Sse_ZeroMem_UnAligned - NightWare: 327
Mark_zeromem_SSE - Mark Larson: 313
; ------------- Sample size = 2048 bytes ---------------------
zero_it - hutch: 4392
xzero_it - The Dude of Dudes: 726
xzero_it2 - pro3carp3: 741
RtlZeroMemory - Microsoft: 734
ZeroMemD - unknown: 752
fZeroMemory- Four-F: 753
AzmtMemZero - jdoe 1050
RtlFillMemory - Microsoft(NT+ only)741
msvcrt memset - Microsoft: 716
_memfill - modified memfill & rep stosd : 701
memfill - masm32 lib: 1096
jdoe_memfill - modified memfill by jdoe : 1105
Sse_ZeroMem_UnAligned - NightWare: 623
Mark_zeromem_SSE - Mark Larson: 608
Press any key to continue ...
fixed typo, general cleanup of test code, added meaningless 4096 test
------------- Sample size in bytes =          16    64   256  1024  2048  4096
xzero_it - The Dude of Dudes:                 31    43    92   286   544  1061
xzero_it2 - pro3carp3:                        43    55   103   297   556  1073
RtlZeroMemory - Microsoft:                    31    43    91   285   543  1060
ZeroMemD - unknown:                           48    60   109   302   561  1078
************ error in routine **********
fZeroMemory- Four-F:                          43    55   104   297   556  1073
AzmtMemZero - jdoe                            20    29    65   222   416   803
RtlFillMemory - Microsoft(NT+ only)           34    46    94   288   547  1063
msvcrt memset - Microsoft:                    38    50    98   293   551  1068
_memfill - modified memfill & rep stosd :     22    21    57   284   543  1060
************ error in routine **********
memfill - masm32 lib:                         23    22    60   216   410   797
jdoe_memfill - modified memfill by jdoe :     20    17    45   168   314   604
Sse_ZeroMem_UnAligned - NightWare:            16    19    37   155   284   543
Mark_zeromem_SSE - Mark Larson:                4     8    73   170   299   558
************ error in routine **********
Press any key to continue ...
I was talking about TLB priming in another thread. http://www.masm32.com/board/index.php?topic=8526.msg63671#msg63671
TLB priming means touching the next page in advance so that its page table entry is already in the TLB when you get to it. To make it work you break the data up into 4096-byte chunks. I am applying this to the SSE version of the zero memory routine I wrote, so I now have two loops instead of one: an inner loop that handles 4096 bytes with MOVAPS, and an outer loop that runs once per 4096-byte page (the number of bytes divided by 4096). I use the prefetchnta instruction to touch the data one page in advance. Here is the line of code that does it.
prefetchnta [edi+4096]
I modified the new code that Jimg posted and added support for 8192, 16384, and 32768 bytes. TLB priming only works if the data spans multiple pages, which is why I picked 8192 (2 pages) as the starting point.
As you can see, the speed improvement is biggest at the largest data size. Obviously it'll hit a point where it flattens out.
------------- Sample size in bytes =        8192 16384 32768
Sse_ZeroMem_UnAligned - NightWare:           545  1064  2507
Mark_zeromem_SSE - Mark Larson:              542  1058  2450
Mark_zeromem_SSE_TLB - Mark Larson:          527  1061  2345
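To put the table in relative terms, here is a small C sketch (not from the thread) that works out how much the TLB-primed routine gains over the plain SSE one at each size. The helper names improvement_pct and print_tlb_gains are made up for illustration; the timings are the ones in the table.

```c
#include <assert.h>
#include <stdio.h>

/* Percentage by which `primed` beats `plain`; negative means it is slower.
   improvement_pct is a hypothetical helper, not part of the test program. */
static double improvement_pct(double plain, double primed)
{
    assert(plain > 0.0);
    return (plain - primed) * 100.0 / plain;
}

/* Prints one line per sample size, using the timings from the table. */
static void print_tlb_gains(void)
{
    static const int size[]   = { 8192, 16384, 32768 };
    static const int plain[]  = {  542,  1058,  2450 };  /* Mark_zeromem_SSE     */
    static const int primed[] = {  527,  1061,  2345 };  /* Mark_zeromem_SSE_TLB */

    for (int i = 0; i < 3; ++i)
        printf("%5d bytes: %+.1f%%\n", size[i],
               improvement_pct(plain[i], primed[i]));
}
```

With these inputs the gain comes out at roughly 2.8% for 8192 bytes, about -0.3% for 16384 and about 4.3% for 32768, so in this run only the smallest and largest sizes show an improvement.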
Here is the actual code.
align 16
;only call with > 4096 memory to clear, memory size needs to be divisible by 4096, we can add special code later to
; support any size.
Mark_zeromem_SSE_TLB proc
;use edi for ptr
;eax for size
;int 3
pxor xmm0,xmm0
shr eax,12 ;divide by 4096, one page size.
align 16
outer:
prefetchnta [edi+4096]
mov edx,4096/16 ;we handle 4096 bytes per inner loop, each MOVAPS handle 16 of those bytes.
align 16
inner:
movaps [edi],xmm0
movaps [edi+16],xmm0
movaps [edi+32],xmm0
movaps [edi+48],xmm0
add edi,16*4
sub edx,1*4
jnz inner
sub eax,1
jnz outer
ret
Mark_zeromem_SSE_TLB endp
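For readers who would rather follow the structure in C, the routine above can be sketched with SSE intrinsics roughly as follows. This is not Mark's code: the name zero_pages_sse, the PAGE_BYTES constant and the memset fallback are assumptions made for the example, and whether the prefetch really primes the TLB depends on the processor.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_SSE 1
#else
#define HAVE_SSE 0
#endif

#define PAGE_BYTES ((size_t)4096)   /* page size assumed by the routine above */

/* zero_pages_sse is a hypothetical C counterpart of Mark_zeromem_SSE_TLB.
   Preconditions: dst is 16-byte aligned, size is a non-zero multiple of 4096. */
static void zero_pages_sse(void *dst, size_t size)
{
    assert(size != 0 && size % PAGE_BYTES == 0);
    assert(((uintptr_t)dst & 15u) == 0);
#if HAVE_SSE
    char *p = (char *)dst;
    const __m128 zero = _mm_setzero_ps();                 /* pxor xmm0,xmm0 */

    for (size_t pages = size / PAGE_BYTES; pages != 0; --pages) {   /* outer */
        /* Hint at the start of the next page. A prefetch never faults,
           so pointing just past the last page is harmless. */
        _mm_prefetch(p + PAGE_BYTES, _MM_HINT_NTA);
        for (size_t off = 0; off < PAGE_BYTES; off += 64) {         /* inner */
            _mm_store_ps((float *)(p + off),      zero);  /* movaps, 16 bytes each */
            _mm_store_ps((float *)(p + off + 16), zero);
            _mm_store_ps((float *)(p + off + 32), zero);
            _mm_store_ps((float *)(p + off + 48), zero);
        }
        p += PAGE_BYTES;
    }
#else
    memset(dst, 0, size);   /* portable fallback when SSE is unavailable */
#endif
}
```

The two asserts spell out what the assembly version leaves to the caller: MOVAPS faults on an unaligned address, and a size that is not a multiple of 4096 would leave the tail untouched.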
Intel Core 2 Quad Q9550:
------------- Sample size in bytes =          16    64   256  1024  2048  4096
xzero_it - The Dude of Dudes:                 41    45    94   191   320   575
xzero_it2 - pro3carp3:                        41    46    94   191   322   576
RtlZeroMemory - Microsoft:                    29    34    81   181   310   566
ZeroMemD - unknown:                           43    48    96   195   323   581
************ error in routine **********
fZeroMemory- Four-F:                          38    43    91   189   317   574
AzmtMemZero - jdoe                            10    22    73   277   550  1116
RtlFillMemory - Microsoft(NT+ only)           44    49    97   194   322   580
msvcrt memset - Microsoft:                    35    41    88   186   314   570
_memfill - modified memfill & rep stosd :     15    25    75   178   307   563
************ error in routine **********
memfill - masm32 lib:                         12    28    76   278   550  1097
jdoe_memfill - modified memfill by jdoe :     13    25    77   278   546  1098
Sse_ZeroMem_UnAligned - NightWare:            10    10    22    74   141   275
Mark_zeromem_SSE - Mark Larson:                2     3    20    69   134   258
************ error in routine **********
zero_it - hutch: 4364
xzero_it - The Dude of Dudes: 664
xzero_it2 - pro3carp3: 691
RtlZeroMemory - Microsoft: 668
ZeroMemD - unknown: 698
fZeroMemory- Four-F: 708
I think this subject was bashed to death some time ago. REP STOSD beats most once the byte count exceeds about 500 bytes. If you don't mind writing SSE code you can do it faster, but it's a question of whether it matters: if you only have to fill a meg or so, it's a case of who cares, whereas if you have to repeatedly fill a gig, it's worth straining the technique to do it faster. A small buffer is easily handled by a crude byte scanner, and large blocks are handled by multithreaded SSE2 techniques. Pick the task, then pick the best method to perform it.
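As a rough illustration of "pick the task, pick the method", here is a minimal C sketch of a fill routine that chooses its strategy by size. The 500-byte crossover is the figure quoted above; the 1 MB mark, the enum and the function names are assumptions invented for the example, and memset merely stands in for the REP STOSD and SSE2 paths.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* 500 is the rough crossover quoted in the post; the 1 MB mark is an
   assumed placeholder. Measure on the target CPU before trusting either. */
#define STOS_CROSSOVER ((size_t)500)
#define LARGE_BLOCK    ((size_t)1 << 20)

typedef enum { FILL_SIMPLE_LOOP, FILL_BULK, FILL_HEAVY } fill_method;

/* Hypothetical helper: classify a request by byte count. */
static fill_method pick_fill_method(size_t n)
{
    if (n < STOS_CROSSOVER) return FILL_SIMPLE_LOOP;
    if (n < LARGE_BLOCK)    return FILL_BULK;
    return FILL_HEAVY;
}

/* Zero n bytes at dst using the method chosen above. */
static void zero_block(void *dst, size_t n)
{
    unsigned char *p = (unsigned char *)dst;
    assert(dst != NULL || n == 0);

    switch (pick_fill_method(n)) {
    case FILL_SIMPLE_LOOP:
        while (n-- != 0)        /* crude byte loop for short buffers */
            *p++ = 0;
        break;
    case FILL_BULK:
        memset(dst, 0, n);      /* stand-in for a REP STOSD routine */
        break;
    case FILL_HEAVY:
        memset(dst, 0, n);      /* placeholder: an SSE2 or threaded fill would go here */
        break;
    }
}
```

Keeping the classification in its own function means the thresholds can be retuned per processor without touching the fill code.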
Quote from: Mark_Larson on February 21, 2008, 11:07:57 PM
align 16
;only call with > 4096 memory to clear, memory size needs to be divisible by 4096, we can add special code later to
; support any size.
Mark_zeromem_SSE_TLB proc
;use edi for ptr
;eax for size
;int 3
pxor xmm0,xmm0
shr eax,12 ;divide by 4096, one page size.
align 16
outer:
prefetchnta [edi+4096]
mov edx,4096/16 ;we handle 4096 bytes per inner loop, each MOVAPS handle 16 of those bytes.
align 16
inner:
movaps [edi],xmm0
movaps [edi+16],xmm0
movaps [edi+32],xmm0
movaps [edi+48],xmm0
add edi,16*4
sub edx,1*4
jnz inner
sub eax,1
jnz outer
ret
Mark_zeromem_SSE_TLB endp
This can be better written as this:
pxor xmm0,xmm0
shr eax,12 ;divide by 4096, one page size.
mov ecx, 4
push ebx
mov ebx, 1
movd mm0, esp
mov esp, 16*4
align 16
outer:
prefetchnta [edi+4096]
mov edx,4096/16 ;we handle 4096 bytes per inner loop, each MOVAPS handle 16 of those bytes.
align 16
inner:
movaps [edi],xmm0
movaps [edi+16],xmm0
movaps [edi+32],xmm0
movaps [edi+48],xmm0
add edi, esp
sub edx, ecx
jnz inner
sub eax, ebx
jnz outer
movd esp, mm0 ;restore the stack pointer
emms ;clear the MMX state so later x87 code still works
pop ebx
ret
Mark_zeromem_SSE_TLB endp
Did you get any improvement in the performance?
"Better written" implies what in this case?
Frank
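I haven't run the test on it. It saves 3 bytes of code: the register forms of ADD and SUB are each one byte shorter than the imm8 forms, so the inner loop loses 2 bytes (add edi, esp and sub edx, ecx) and the outer loop loses 1 (sub eax, ebx).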