ZeroMemory Speed Test!

Started by ecube, January 23, 2007, 03:32:37 AM

MichaelW

P3 (all I have running ATM):

; ------------- Sample size = 16 bytes ---------------------

zero_it - hutch: 49
xzero_it - The Dude of Dudes: 41
xzero_it2 - pro3carp3: 53
RtlZeroMemory - Microsoft: 33
ZeroMemD - unknown: 55
fZeroMemory- Four-F: 60
AzmtMemZero - jdoe 44
RtlFillMemory - Microsoft(NT+ only)51
msvcrt memset - Microsoft: 42
_memfill - modified memfill & rep stosd : 23
memfill - masm32 lib: 20
jdoe_memfill - modified memfill by jdoe : 28

; ------------- Sample size = 64 bytes ---------------------

zero_it - hutch: 146
xzero_it - The Dude of Dudes: 68
xzero_it2 - pro3carp3: 79
RtlZeroMemory - Microsoft: 61
ZeroMemD - unknown: 81
fZeroMemory- Four-F: 88
AzmtMemZero - jdoe 63
RtlFillMemory - Microsoft(NT+ only)77
msvcrt memset - Microsoft: 70
_memfill - modified memfill & rep stosd : 31
memfill - masm32 lib: 34
jdoe_memfill - modified memfill by jdoe : 34

; ------------- Sample size = 256 bytes ---------------------

zero_it - hutch: 531
xzero_it - The Dude of Dudes: 208
xzero_it2 - pro3carp3: 221
RtlZeroMemory - Microsoft: 204
ZeroMemD - unknown: 225
fZeroMemory- Four-F: 224
AzmtMemZero - jdoe 130
RtlFillMemory - Microsoft(NT+ only)222
msvcrt memset - Microsoft: 211
_memfill - modified memfill & rep stosd : 100
memfill - masm32 lib: 102
jdoe_memfill - modified memfill by jdoe : 100

; ------------- Sample size = 1024 bytes ---------------------

zero_it - hutch: 2078
xzero_it - The Dude of Dudes: 338
xzero_it2 - pro3carp3: 350
RtlZeroMemory - Microsoft: 333
ZeroMemD - unknown: 354
fZeroMemory- Four-F: 353
AzmtMemZero - jdoe 371
RtlFillMemory - Microsoft(NT+ only)350
msvcrt memset - Microsoft: 340
_memfill - modified memfill & rep stosd : 328
memfill - masm32 lib: 337
jdoe_memfill - modified memfill by jdoe : 319

; ------------- Sample size = 2048 bytes ---------------------

zero_it - hutch: 4143
xzero_it - The Dude of Dudes: 510
xzero_it2 - pro3carp3: 523
RtlZeroMemory - Microsoft: 506
ZeroMemD - unknown: 527
fZeroMemory- Four-F: 525
AzmtMemZero - jdoe 693
RtlFillMemory - Microsoft(NT+ only)522
msvcrt memset - Microsoft: 514
_memfill - modified memfill & rep stosd : 501
memfill - masm32 lib: 646
jdoe_memfill - modified memfill by jdoe : 613

eschew obfuscation

jdoe


Thanks Michael

The results are more impressive on my processor. Looks like I'm wasting my time playing with optimization. I can impress myself on my AMD, but every time the same code is executed on Intel I am disappointed. On the other hand, code that was optimized on Intel performs well on AMD. Intel optimization seems more predictable when executed on other CPUs.

No more AMD for me   :bdg


hutch--

JD,

This is on my 2.8 gig PIV. It's reasonably typical of late PIVs.


; ------------- Sample size = 16 bytes ---------------------

zero_it - hutch: 49
xzero_it - The Dude of Dudes: 89
xzero_it2 - pro3carp3: 118
RtlZeroMemory - Microsoft: 103
ZeroMemD - unknown: 129
fZeroMemory- Four-F: 112
AzmtMemZero - jdoe 12
RtlFillMemory - Microsoft(NT+ only)110
msvcrt memset - Microsoft: 53
_memfill - modified memfill & rep stosd : 16
memfill - masm32 lib: 14
jdoe_memfill - modified memfill by jdoe : 14

; ------------- Sample size = 64 bytes ---------------------

zero_it - hutch: 149
xzero_it - The Dude of Dudes: 110
xzero_it2 - pro3carp3: 138
RtlZeroMemory - Microsoft: 126
ZeroMemD - unknown: 155
fZeroMemory- Four-F: 132
AzmtMemZero - jdoe 35
RtlFillMemory - Microsoft(NT+ only)132
msvcrt memset - Microsoft: 74
_memfill - modified memfill & rep stosd : 27
memfill - masm32 lib: 33
jdoe_memfill - modified memfill by jdoe : 31

; ------------- Sample size = 256 bytes ---------------------

zero_it - hutch: 540
xzero_it - The Dude of Dudes: 237
xzero_it2 - pro3carp3: 272
RtlZeroMemory - Microsoft: 243
ZeroMemD - unknown: 286
fZeroMemory- Four-F: 269
AzmtMemZero - jdoe 152
RtlFillMemory - Microsoft(NT+ only)252
msvcrt memset - Microsoft: 201
_memfill - modified memfill & rep stosd : 147
memfill - masm32 lib: 147
jdoe_memfill - modified memfill by jdoe : 147

; ------------- Sample size = 1024 bytes ---------------------

zero_it - hutch: 2072
xzero_it - The Dude of Dudes: 423
xzero_it2 - pro3carp3: 455
RtlZeroMemory - Microsoft: 433
ZeroMemD - unknown: 469
fZeroMemory- Four-F: 448
AzmtMemZero - jdoe 563
RtlFillMemory - Microsoft(NT+ only)440
msvcrt memset - Microsoft: 382
_memfill - modified memfill & rep stosd : 374
memfill - masm32 lib: 568
jdoe_memfill - modified memfill by jdoe : 559

; ------------- Sample size = 2048 bytes ---------------------

zero_it - hutch: 4124
xzero_it - The Dude of Dudes: 646
xzero_it2 - pro3carp3: 687
RtlZeroMemory - Microsoft: 655
ZeroMemD - unknown: 705
fZeroMemory- Four-F: 686
AzmtMemZero - jdoe 1072
RtlFillMemory - Microsoft(NT+ only)671
msvcrt memset - Microsoft: 614
_memfill - modified memfill & rep stosd : 600
memfill - masm32 lib: 1107
jdoe_memfill - modified memfill by jdoe : 1088

Download site for MASM32: https://masm32.com
New MASM Forum: https://masm32.com/board/index.php

GregL

Pentium D 940 (dual-core 3.2 GHz)


; ------------- Sample size = 16 bytes ---------------------

zero_it - hutch: 57
xzero_it - The Dude of Dudes: 87
xzero_it2 - pro3carp3: 92
RtlZeroMemory - Microsoft: 80
ZeroMemD - unknown: 94
fZeroMemory- Four-F: 87
AzmtMemZero - jdoe 22
RtlFillMemory - Microsoft(NT+ only)83
msvcrt memset - Microsoft: 81
_memfill - modified memfill & rep stosd : 24
memfill - masm32 lib: 27
jdoe_memfill - modified memfill by jdoe : 25

; ------------- Sample size = 64 bytes ---------------------

zero_it - hutch: 171
xzero_it - The Dude of Dudes: 100
xzero_it2 - pro3carp3: 114
RtlZeroMemory - Microsoft: 106
ZeroMemD - unknown: 118
fZeroMemory- Four-F: 111
AzmtMemZero - jdoe 39
RtlFillMemory - Microsoft(NT+ only)113
msvcrt memset - Microsoft: 105
_memfill - modified memfill & rep stosd : 37
memfill - masm32 lib: 39
jdoe_memfill - modified memfill by jdoe : 36

; ------------- Sample size = 256 bytes ---------------------

zero_it - hutch: 549
xzero_it - The Dude of Dudes: 286
xzero_it2 - pro3carp3: 296
RtlZeroMemory - Microsoft: 290
ZeroMemD - unknown: 307
fZeroMemory- Four-F: 300
AzmtMemZero - jdoe 138
RtlFillMemory - Microsoft(NT+ only)296
msvcrt memset - Microsoft: 73
_memfill - modified memfill & rep stosd : 138
memfill - masm32 lib: 152
jdoe_memfill - modified memfill by jdoe : 134

; ------------- Sample size = 1024 bytes ---------------------

zero_it - hutch: 2013
xzero_it - The Dude of Dudes: 470
xzero_it2 - pro3carp3: 490
RtlZeroMemory - Microsoft: 453
ZeroMemD - unknown: 482
fZeroMemory- Four-F: 462
AzmtMemZero - jdoe 570
RtlFillMemory - Microsoft(NT+ only)473
msvcrt memset - Microsoft: 367
_memfill - modified memfill & rep stosd : 436
memfill - masm32 lib: 594
jdoe_memfill - modified memfill by jdoe : 572

; ------------- Sample size = 2048 bytes ---------------------

zero_it - hutch: 4029
xzero_it - The Dude of Dudes: 757
xzero_it2 - pro3carp3: 774
RtlZeroMemory - Microsoft: 746
ZeroMemD - unknown: 766
fZeroMemory- Four-F: 751
AzmtMemZero - jdoe 1101
RtlFillMemory - Microsoft(NT+ only)811
msvcrt memset - Microsoft: 676
_memfill - modified memfill & rep stosd : 763
memfill - masm32 lib: 1128
jdoe_memfill - modified memfill by jdoe : 1051


Sameer

Latest zeromem on Intel P4 1.8 GHz, 640 MB RAM

; ------------- Sample size = 16 bytes ---------------------
zero_it - hutch: 45
xzero_it - The Dude of Dudes: 100
xzero_it2 - pro3carp3: 161
RtlZeroMemory - Microsoft: 115
ZeroMemD - unknown: 167
fZeroMemory- Four-F: 155
memfill - masm32 lib: 13
AzmtMemZero - jdoe 12
RtlFillMemory - Microsoft(NT+ only)116
msvcrt memset - Microsoft: 59
_memfill - modified memfill & rep stosd : 15
; ------------- Sample size = 64 bytes ---------------------
zero_it - hutch: 138
xzero_it - The Dude of Dudes: 113
xzero_it2 - pro3carp3: 167
RtlZeroMemory - Microsoft: 131
ZeroMemD - unknown: 179
fZeroMemory- Four-F: 161
memfill - masm32 lib: 31
AzmtMemZero - jdoe 36
RtlFillMemory - Microsoft(NT+ only)125
msvcrt memset - Microsoft: 75
_memfill - modified memfill & rep stosd : 26
; ------------- Sample size = 256 bytes ---------------------
zero_it - hutch: 523
xzero_it - The Dude of Dudes: 214
xzero_it2 - pro3carp3: 269
RtlZeroMemory - Microsoft: 223
ZeroMemD - unknown: 282
fZeroMemory- Four-F: 264
memfill - masm32 lib: 150
AzmtMemZero - jdoe 150
RtlFillMemory - Microsoft(NT+ only)230
msvcrt memset - Microsoft: 176
_memfill - modified memfill & rep stosd : 146
; ------------- Sample size = 1024 bytes ---------------------
zero_it - hutch: 2005
xzero_it - The Dude of Dudes: 383
xzero_it2 - pro3carp3: 438
RtlZeroMemory - Microsoft: 389
ZeroMemD - unknown: 443
fZeroMemory- Four-F: 436
memfill - masm32 lib: 554
AzmtMemZero - jdoe 548
RtlFillMemory - Microsoft(NT+ only)411
msvcrt memset - Microsoft: 349
_memfill - modified memfill & rep stosd : 337
Press any key to continue ...

Mark_Larson

I added in my own SSE code and added in NightWare's code from another thread.  I have a Core 2 Duo processor.  I would be willing to guess that most people on the forums don't have one.  Can someone with a P4 class processor run it, so I can get some idea how my code works on it?  I don't have a booting P4 processor.

There is still stuff I can do to speed it up.  But I just wanted to post what I have so far.


; ------------- Sample size = 16 bytes ---------------------

zero_it - hutch: 50
xzero_it - The Dude of Dudes: 42
xzero_it2 - pro3carp3: 44
RtlZeroMemory - Microsoft: 31
ZeroMemD - unknown: 46
fZeroMemory- Four-F: 52
AzmtMemZero - jdoe 56
RtlFillMemory - Microsoft(NT+ only)46
msvcrt memset - Microsoft: 46
_memfill - modified memfill & rep stosd : 12
memfill - masm32 lib: 12
jdoe_memfill - modified memfill by jdoe : 23
Sse_ZeroMem_UnAligned - NightWare: 10
Mark_zeromem_SSE - Mark Larson: 2

; ------------- Sample size = 64 bytes ---------------------

zero_it - hutch: 136
xzero_it - The Dude of Dudes: 47
xzero_it2 - pro3carp3: 49
RtlZeroMemory - Microsoft: 36
ZeroMemD - unknown: 53
fZeroMemory- Four-F: 57
AzmtMemZero - jdoe 66
RtlFillMemory - Microsoft(NT+ only)52
msvcrt memset - Microsoft: 51
_memfill - modified memfill & rep stosd : 26
memfill - masm32 lib: 31
jdoe_memfill - modified memfill by jdoe : 30
Sse_ZeroMem_UnAligned - NightWare: 10
Mark_zeromem_SSE - Mark Larson: 2

; ------------- Sample size = 256 bytes ---------------------

zero_it - hutch: 299
xzero_it - The Dude of Dudes: 96
xzero_it2 - pro3carp3: 98
RtlZeroMemory - Microsoft: 85
ZeroMemD - unknown: 100
fZeroMemory- Four-F: 106
AzmtMemZero - jdoe 121
RtlFillMemory - Microsoft(NT+ only)100
msvcrt memset - Microsoft: 99
_memfill - modified memfill & rep stosd : 77
memfill - masm32 lib: 84
jdoe_memfill - modified memfill by jdoe : 89
Sse_ZeroMem_UnAligned - NightWare: 22
Mark_zeromem_SSE - Mark Larson: 17

; ------------- Sample size = 1024 bytes ---------------------

zero_it - hutch: 1094
xzero_it - The Dude of Dudes: 290
xzero_it2 - pro3carp3: 296
RtlZeroMemory - Microsoft: 283
ZeroMemD - unknown: 298
fZeroMemory- Four-F: 298
AzmtMemZero - jdoe 337
RtlFillMemory - Microsoft(NT+ only)301
msvcrt memset - Microsoft: 301
_memfill - modified memfill & rep stosd : 282
memfill - masm32 lib: 283
jdoe_memfill - modified memfill by jdoe : 297
Sse_ZeroMem_UnAligned - NightWare: 75
Mark_zeromem_SSE - Mark Larson: 67

; ------------- Sample size = 2048 bytes ---------------------

zero_it - hutch: 2180
xzero_it - The Dude of Dudes: 559
xzero_it2 - pro3carp3: 559
RtlZeroMemory - Microsoft: 548
ZeroMemD - unknown: 561
fZeroMemory- Four-F: 567
AzmtMemZero - jdoe 611
RtlFillMemory - Microsoft(NT+ only)548
msvcrt memset - Microsoft: 560
_memfill - modified memfill & rep stosd : 545
memfill - masm32 lib: 566
jdoe_memfill - modified memfill by jdoe : 559
Sse_ZeroMem_UnAligned - NightWare: 143
Mark_zeromem_SSE - Mark Larson: 138

Press any key to continue ...

BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
http://www.website.masmforum.com/mark/index.htm

Draakie

P4 2.8, 1 GB RAM:

; ------------- Sample size = 16 bytes ---------------------

zero_it - hutch: 38
xzero_it - The Dude of Dudes: 76
xzero_it2 - pro3carp3: 85
RtlZeroMemory - Microsoft: 77
ZeroMemD - unknown: 92
fZeroMemory- Four-F: 84
AzmtMemZero - jdoe 17
RtlFillMemory - Microsoft(NT+ only)80
msvcrt memset - Microsoft: 71
_memfill - modified memfill & rep stosd : 25
memfill - masm32 lib: 25
jdoe_memfill - modified memfill by jdoe : 20
Sse_ZeroMem_UnAligned - NightWare: 16
Mark_zeromem_SSE - Mark Larson: 9

; ------------- Sample size = 64 bytes ---------------------

zero_it - hutch: 159
xzero_it - The Dude of Dudes: 98
xzero_it2 - pro3carp3: 112
RtlZeroMemory - Microsoft: 103
ZeroMemD - unknown: 118
fZeroMemory- Four-F: 110
AzmtMemZero - jdoe 38
RtlFillMemory - Microsoft(NT+ only)106
msvcrt memset - Microsoft: 98
_memfill - modified memfill & rep stosd : 36
memfill - masm32 lib: 38
jdoe_memfill - modified memfill by jdoe : 36
Sse_ZeroMem_UnAligned - NightWare: 19
Mark_zeromem_SSE - Mark Larson: 9

; ------------- Sample size = 256 bytes ---------------------

zero_it - hutch: 506
xzero_it - The Dude of Dudes: 289
xzero_it2 - pro3carp3: 299
RtlZeroMemory - Microsoft: 293
ZeroMemD - unknown: 307
fZeroMemory- Four-F: 301
AzmtMemZero - jdoe 133
RtlFillMemory - Microsoft(NT+ only)297
msvcrt memset - Microsoft: 281
_memfill - modified memfill & rep stosd : 138
memfill - masm32 lib: 141
jdoe_memfill - modified memfill by jdoe : 127
Sse_ZeroMem_UnAligned - NightWare: 42
Mark_zeromem_SSE - Mark Larson: 28

; ------------- Sample size = 1024 bytes ---------------------

zero_it - hutch: 1973
xzero_it - The Dude of Dudes: 431
xzero_it2 - pro3carp3: 442
RtlZeroMemory - Microsoft: 437
ZeroMemD - unknown: 452
fZeroMemory- Four-F: 446
AzmtMemZero - jdoe 547
RtlFillMemory - Microsoft(NT+ only)439
msvcrt memset - Microsoft: 436
_memfill - modified memfill & rep stosd : 412
memfill - masm32 lib: 564
jdoe_memfill - modified memfill by jdoe : 556
Sse_ZeroMem_UnAligned - NightWare: 325
Mark_zeromem_SSE - Mark Larson: 315

; ------------- Sample size = 2048 bytes ---------------------

zero_it - hutch: 3876
xzero_it - The Dude of Dudes: 721
xzero_it2 - pro3carp3: 729
RtlZeroMemory - Microsoft: 726
ZeroMemD - unknown: 740
fZeroMemory- Four-F: 733
AzmtMemZero - jdoe 1032
RtlFillMemory - Microsoft(NT+ only)730
msvcrt memset - Microsoft: 726
_memfill - modified memfill & rep stosd : 701
memfill - masm32 lib: 1070
jdoe_memfill - modified memfill by jdoe : 1039
Sse_ZeroMem_UnAligned - NightWare: 616
Mark_zeromem_SSE - Mark Larson: 605

Press any key to continue ...


Does this code make me look bloated ? (wink)

Jimg

And here are the results on my totally irrelevant AMD

     ------------- Sample size in bytes  =  16   64   256  1024 2048

zero_it - hutch:                           64   208  790  3125 6220

xzero_it - The Dude of Dudes:              31   43   91   286  544

xzero_it2 - pro3carp3:                      43   55   103  298  556

RtlZeroMemory - Microsoft:                  31   42   91   285  543

ZeroMemD - unknown:                         48   60   109  302  562
************  error in routine **********
fZeroMemory- Four-F:                        43   55   104  297  556

AzmtMemZero - jdoe                          20   29   65   222  416

RtlFillMemory - Microsoft(NT+ only)         34   46   94   288  547

msvcrt memset - Microsoft:                  38   50   98   292  551

_memfill - modified memfill & rep stosd :   22   21   58   285  544
************  error in routine **********
memfill - masm32 lib:                       23   22   58   216  409

jdoe_memfill - modified memfill by jdoe :   20   17   45   169  313

Sse_ZeroMem_UnAligned - NightWare:          17   19   37   156  284

Mark_zeromem_SSE - Mark Larson:             4    8    74   171  300
************  error in routine **********

Press any key to continue ...


RuiLoureiro

On my P4 - 3GHz

; ------------- Sample size = 16 bytes ---------------------

zero_it - hutch: 89
xzero_it - The Dude of Dudes: 119
xzero_it2 - pro3carp3: 87
RtlZeroMemory - Microsoft: 85
ZeroMemD - unknown: 104
fZeroMemory- Four-F: 94
AzmtMemZero - jdoe 20
RtlFillMemory - Microsoft(NT+ only)88
msvcrt memset - Microsoft: 80
_memfill - modified memfill & rep stosd : 24
memfill - masm32 lib: 31
jdoe_memfill - modified memfill by jdoe : 21
Sse_ZeroMem_UnAligned - NightWare: 19
Mark_zeromem_SSE - Mark Larson: 8

; ------------- Sample size = 64 bytes ---------------------

zero_it - hutch: 169
xzero_it - The Dude of Dudes: 100
xzero_it2 - pro3carp3: 115
RtlZeroMemory - Microsoft: 104
ZeroMemD - unknown: 121
fZeroMemory- Four-F: 113
AzmtMemZero - jdoe 39
RtlFillMemory - Microsoft(NT+ only)107
msvcrt memset - Microsoft: 100
_memfill - modified memfill & rep stosd : 38
memfill - masm32 lib: 40
jdoe_memfill - modified memfill by jdoe : 37
Sse_ZeroMem_UnAligned - NightWare: 20
Mark_zeromem_SSE - Mark Larson: 8

; ------------- Sample size = 256 bytes ---------------------

zero_it - hutch: 532
xzero_it - The Dude of Dudes: 289
xzero_it2 - pro3carp3: 298
RtlZeroMemory - Microsoft: 289
ZeroMemD - unknown: 306
fZeroMemory- Four-F: 296
AzmtMemZero - jdoe 143
RtlFillMemory - Microsoft(NT+ only)292
msvcrt memset - Microsoft: 277
_memfill - modified memfill & rep stosd : 145
memfill - masm32 lib: 155
jdoe_memfill - modified memfill by jdoe : 133
Sse_ZeroMem_UnAligned - NightWare: 47
Mark_zeromem_SSE - Mark Larson: 29

; ------------- Sample size = 1024 bytes ---------------------

zero_it - hutch: 2005
xzero_it - The Dude of Dudes: 432
xzero_it2 - pro3carp3: 446
RtlZeroMemory - Microsoft: 442
ZeroMemD - unknown: 452
fZeroMemory- Four-F: 453
AzmtMemZero - jdoe 560
RtlFillMemory - Microsoft(NT+ only)447
msvcrt memset - Microsoft: 428
_memfill - modified memfill & rep stosd : 409
memfill - masm32 lib: 585
jdoe_memfill - modified memfill by jdoe : 589
Sse_ZeroMem_UnAligned - NightWare: 327
Mark_zeromem_SSE - Mark Larson: 313

; ------------- Sample size = 2048 bytes ---------------------

zero_it - hutch: 4392
xzero_it - The Dude of Dudes: 726
xzero_it2 - pro3carp3: 741
RtlZeroMemory - Microsoft: 734
ZeroMemD - unknown: 752
fZeroMemory- Four-F: 753
AzmtMemZero - jdoe 1050
RtlFillMemory - Microsoft(NT+ only)741
msvcrt memset - Microsoft: 716
_memfill - modified memfill & rep stosd : 701
memfill - masm32 lib: 1096
jdoe_memfill - modified memfill by jdoe : 1105
Sse_ZeroMem_UnAligned - NightWare: 623
Mark_zeromem_SSE - Mark Larson: 608

Press any key to continue ...

Jimg

fixed typo, general cleanup of test code, added meaningless 4096 test

     ------------- Sample size in bytes  =  16   64   256  1024 2048 4096

xzero_it - The Dude of Dudes:              31   43   92   286  544  1061

xzero_it2 - pro3carp3:                     43   55   103  297  556  1073

RtlZeroMemory - Microsoft:                 31   43   91   285  543  1060

ZeroMemD - unknown:                        48   60   109  302  561  1078
************  error in routine **********
fZeroMemory- Four-F:                       43   55   104  297  556  1073

AzmtMemZero - jdoe                         20   29   65   222  416  803

RtlFillMemory - Microsoft(NT+ only)        34   46   94   288  547  1063

msvcrt memset - Microsoft:                 38   50   98   293  551  1068

_memfill - modified memfill & rep stosd :  22   21   57   284  543  1060
************  error in routine **********
memfill - masm32 lib:                      23   22   60   216  410  797

jdoe_memfill - modified memfill by jdoe :  20   17   45   168  314  604

Sse_ZeroMem_UnAligned - NightWare:         16   19   37   155  284  543

Mark_zeromem_SSE - Mark Larson:            4    8    73   170  299  558
************  error in routine **********

Press any key to continue ...



Mark_Larson

I was talking about TLB priming in another thread.  http://www.masm32.com/board/index.php?topic=8526.msg63671#msg63671

TLB priming means pre-reading a page-table entry in advance.  To make it work you break up the data into 4096-byte chunks.  I am applying this to the SSE version of the zero memory routine I wrote.  So I have two loops now instead of one.  I have an inner loop that handles 4096 bytes of MOVAPS, and an outer loop that runs once per page (the number of bytes divided by 4096).  I use the prefetchnta instruction to pre-read the data one page ahead.  Here is the line of code that does it.


         prefetchnta [edi+4096]


I modified the new code that Jimg posted and added support for 8192, 16384, and 32768 bytes.  The TLB priming only works if the data spans multiple pages, which is why I picked 8192 (2 pages) as the starting point.

As you can see, the larger the data size the bigger the speed improvement.  Obviously it'll hit a point where it'll flatten out.



     ------------- Sample size in bytes  =  8192   16384  32768

Sse_ZeroMem_UnAligned - NightWare:         545    1064   2507

Mark_zeromem_SSE - Mark Larson:            542    1058   2450

Mark_zeromem_SSE_TLB - Mark Larson:        527    1061   2345



Here is the actual code.


align 16
;only call with > 4096 memory to clear, memory size needs to be divisible by 4096, we can add special code later to
; support any size.
Mark_zeromem_SSE_TLB proc
;use edi for ptr
;eax for size
;int 3

pxor xmm0,xmm0
shr eax,12 ;divide by 4096, one page size.

align 16
outer:
prefetchnta [edi+4096]
mov edx,4096/16 ;we handle 4096 bytes per inner loop, each MOVAPS handle 16 of those bytes.

align 16
inner:
movaps [edi],xmm0
movaps [edi+16],xmm0
movaps [edi+32],xmm0
movaps [edi+48],xmm0
add edi,16*4
sub edx,1*4
jnz inner

sub eax,1
jnz outer

ret
Mark_zeromem_SSE_TLB endp
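
The comment at the top of the routine leaves sizes that are not a multiple of 4096 for later. One way that could look is sketched below. This is an assumption-laden sketch, not Mark's code: the name ZeroMem_SSE_AnySize and the tail handling are hypothetical, it keeps the same register convention (edi = pointer, eax = byte count), and it assumes edi is 16-byte aligned because MOVAPS faults on unaligned addresses. The align 16 directives are left out for brevity.

```asm
.686
.xmm
.model flat, stdcall
option casemap:none

.code

; Hypothetical routine, not from the thread.
; In:  edi = 16-byte-aligned pointer, eax = byte count (any value, including 0)
; Clobbers eax, ecx, edx, edi, xmm0.
ZeroMem_SSE_AnySize proc
    pxor xmm0, xmm0
    mov ecx, eax
    and ecx, 4095           ; ecx = bytes left over after the whole pages
    shr eax, 12             ; eax = number of whole 4096-byte pages
    jz tail16               ; less than one page: skip the paged loop

page_loop:
    prefetchnta [edi+4096]  ; touch the next page, as in the original routine
    mov edx, 4096/64        ; 64 bytes are cleared per inner iteration
block_loop:
    movaps [edi], xmm0
    movaps [edi+16], xmm0
    movaps [edi+32], xmm0
    movaps [edi+48], xmm0
    add edi, 64
    sub edx, 1
    jnz block_loop
    sub eax, 1
    jnz page_loop

tail16:
    mov edx, ecx
    shr edx, 4              ; edx = whole 16-byte blocks in the remainder
    jz tail1
tail16_loop:
    movaps [edi], xmm0
    add edi, 16
    sub edx, 1
    jnz tail16_loop

tail1:
    and ecx, 15             ; ecx = last 0..15 bytes
    xor eax, eax
    cld
    rep stosb               ; does nothing when ecx is 0
    ret
ZeroMem_SSE_AnySize endp

end
```

The tail is kept deliberately simple (one MOVAPS per 16 bytes, then REP STOSB), so any timing difference against the page-multiple version would have to be measured rather than assumed.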

katsyonak

Intel Core 2 Quad Q9550:


    ------------- Sample size in bytes  =  16   64   256  1024 2048 4096

xzero_it - The Dude of Dudes:              41   45   94   191  320  575

xzero_it2 - pro3carp3:                     41   46   94   191  322  576

RtlZeroMemory - Microsoft:                 29   34   81   181  310  566

ZeroMemD - unknown:                        43   48   96   195  323  581
************  error in routine **********
fZeroMemory- Four-F:                       38   43   91   189  317  574

AzmtMemZero - jdoe                         10   22   73   277  550  1116

RtlFillMemory - Microsoft(NT+ only)        44   49   97   194  322  580

msvcrt memset - Microsoft:                 35   41   88   186  314  570

_memfill - modified memfill & rep stosd :  15   25   75   178  307  563
************  error in routine **********
memfill - masm32 lib:                      12   28   76   278  550  1097

jdoe_memfill - modified memfill by jdoe :  13   25   77   278  546  1098

Sse_ZeroMem_UnAligned - NightWare:         10   10   22   74   141  275

Mark_zeromem_SSE - Mark Larson:            2    3    20   69   134  258
************  error in routine **********

2-Bit Chip

zero_it - hutch: 4364
xzero_it - The Dude of Dudes: 664
xzero_it2 - pro3carp3: 691
RtlZeroMemory - Microsoft: 668
ZeroMemD - unknown: 698
fZeroMemory- Four-F: 708

hutch--

I think this subject was bashed to death some time ago. REP STOSD beats most once the byte count exceeds about 500 bytes. If you don't mind writing SSE code you can do it faster, but whether it matters depends on the job: if you only have to fill a meg or so, who cares, whereas if you have to repeatedly fill a gig, you will strain the technique to do it faster. A small buffer is easily handled by a crude byte scanner, and large blocks are handled by multithreaded SSE2 techniques. Pick the task, then pick the best method to perform it.
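
For reference, a minimal REP STOSD zero fill of the kind referred to above might look like the following. It is a sketch only: the name ZeroMemStosd and the stdcall parameters pMem and cbSize are hypothetical, and it is not the source of the `_memfill` or masm32 `memfill` routines timed in this thread, which is not shown here.

```asm
.686
.model flat, stdcall
option casemap:none

.code

; Hypothetical routine, not from the thread.
; Zeroes cbSize bytes starting at pMem.
ZeroMemStosd proc uses edi pMem:DWORD, cbSize:DWORD
    mov edi, pMem           ; destination pointer
    mov ecx, cbSize         ; byte count
    xor eax, eax            ; fill value = 0
    mov edx, ecx            ; keep a copy of the count for the tail
    shr ecx, 2              ; number of whole DWORDs
    cld                     ; store forward
    rep stosd               ; zero ecx DWORDs at [edi]
    mov ecx, edx
    and ecx, 3              ; 0..3 leftover bytes
    rep stosb               ; zero the tail
    ret
ZeroMemStosd endp

end
```

It would be called as `invoke ZeroMemStosd, ADDR buffer, SIZEOF buffer`, given a matching PROTO and a `buffer` defined by the caller.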

zemtex

Quote from: Mark_Larson on February 21, 2008, 11:07:57 PM

align 16
;only call with > 4096 memory to clear, memory size needs to be divisible by 4096, we can add special code later to
; support any size.
Mark_zeromem_SSE_TLB proc
;use edi for ptr
;eax for size
;int 3

pxor xmm0,xmm0
shr eax,12 ;divide by 4096, one page size.

align 16
outer:
prefetchnta [edi+4096]
mov edx,4096/16 ;we handle 4096 bytes per inner loop, each MOVAPS handle 16 of those bytes.

align 16
inner:
movaps [edi],xmm0
movaps [edi+16],xmm0
movaps [edi+32],xmm0
movaps [edi+48],xmm0
add edi,16*4
sub edx,1*4
jnz inner

sub eax,1
jnz outer

ret
Mark_zeromem_SSE_TLB endp


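This can be better written like this:

align 16
Mark_zeromem_SSE_TLB proc
;use edi for ptr
;eax for size

pxor xmm0,xmm0
shr eax,12 ;divide by 4096, one page size.

mov ecx, 4 ;inner-loop decrement kept in a register
push ebx
mov ebx, 1 ;outer-loop decrement kept in a register
movd mm0, esp ;save esp in mm0 so esp can be used as a scratch register
mov esp, 16*4 ;inner-loop pointer increment; no push/pop/call until esp is restored

align 16
outer:
prefetchnta [edi+4096]
mov edx,4096/16 ;we handle 4096 bytes per inner loop, each MOVAPS handles 16 of those bytes.

align 16
inner:
movaps [edi],xmm0
movaps [edi+16],xmm0
movaps [edi+32],xmm0
movaps [edi+48],xmm0
add edi, esp
sub edx, ecx
jnz inner

sub eax, ebx
jnz outer

movd esp, mm0 ;restore the stack pointer
emms ;mm0 was used, so clear the MMX state before any later x87 code runs
pop ebx

ret
Mark_zeromem_SSE_TLB endp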

I have been puzzling with lego bricks all my life. I know how to do this. When Peter, at age 6 is competing with me, I find it extremely necessary to show him that I can puzzle bricks better than him, because he is so damn talented that all that is called rational has gone haywire.