
rep & ecx ?

Started by Thomas_1110, August 26, 2009, 04:12:43 AM


Thomas_1110

Hello
The MASM32 reference says that the count value for rep is stored in cx. But cx is only 16 bits wide, so I tried it with ecx, and it works fine beyond the 16-bit range (Win Vista).

push ds
pop es
mov esi, MemCopie_
mov edi, edx
mov ecx, filesize
rep movsb

My test was to read a file into memory, copy that memory to another buffer, and then save the copy to another file. File size > 500 MB.
Has anyone other experience with this?

dedndave

in 32-bit world, it is ecx - must be an old masm manual
also, you don't have to mess with ds and es - they are all the same in a flat model program
however, "mov esi, MemCopie_" is that a string ? or a pointer
it wants to be the offset of a string
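As a minimal sketch of the two points above (ecx as the full 32-bit count, and no ds/es handling in a flat model program), something like the following should assemble with the MASM32 SDK. SrcBuf, DstBuf and BUF_LEN are hypothetical names chosen for illustration; they are not from the original post.

```asm
; Sketch only: SrcBuf, DstBuf and BUF_LEN are assumed names.
    include \masm32\include\masm32rt.inc

    BUF_LEN equ 100000              ; above 65535, so 16-bit cx could not hold it

    .data?
      SrcBuf db BUF_LEN dup (?)     ; source buffer
      DstBuf db BUF_LEN dup (?)     ; destination buffer

    .code

start:
    cld                             ; clear direction flag: esi/edi count upward
    mov esi, OFFSET SrcBuf          ; address of the source data
    mov edi, OFFSET DstBuf          ; address of the destination
    mov ecx, BUF_LEN                ; full 32-bit byte count
    rep movsb                       ; ds and es are left untouched (flat model)

    exit

end start
```

If the source were instead a DWORD variable holding a pointer (for example the return value of an allocation), a plain "mov esi, variable" without OFFSET would load that pointer, which is the other case asked about above.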

but !!!
all that isn't necessary if you are just trying to copy a file
no need to rep movsb anything
read it into a buffer
write it out from the same buffer

hutch--

Dave is right here: in the FLAT memory model you don't touch the segment registers at all. Have a look at the memcopy procedure in the MASM32 library; it uses REP MOVSD for most of the file and REP MOVSB for the balance, which is a lot faster than a byte copy. You can get faster versions again with MMX and, with a late enough processor, XMM instructions; it depends on what you need to do.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

Thomas_1110

Quote: in 32-bit world, it is ecx - must be an old masm manual
Yes, I have 2 books: one from 2001, the other from 2003.
Quote: all that isn't necessary if you are just trying to copy a file
I know, I can do it more easily with read_disk_file and write_disk_file.
It was just a test. I began MASM32 programming only 2 weeks ago. I have so much to learn.
Thanks for the answers.

japheth

Quote from: hutch-- on August 26, 2009, 04:23:10 AM
... it uses REP MOVSD for most of the file and uses REP MOVSB for the balance, it's a lot faster than byte copy.

IIRC the P4 and later cpus have a "string byte move optimization" feature implemented, which eliminates the speed difference between MOVSD and MOVSB. The feature can be enabled / disabled by writing a certain MSR register, usually it's enabled.

jj2007

Quote from: japheth on August 26, 2009, 05:31:56 AM
IIRC the P4 and later cpus have a "string byte move optimization" feature implemented, which eliminates the speed difference between MOVSD and MOVSB. The feature can be enabled / disabled by writing a certain MSR register, usually it's enabled.

Good to know, although it seems to kick in only at higher byte counts - results for a Prescott P4:

176240  cycles for rep movsd, ct=400000
167652  cycles for rep movsb, ct=400000

14311   cycles for rep movsd, ct=40000
14673   cycles for rep movsb, ct=40000

1267    cycles for rep movsd, ct=4000
1481    cycles for rep movsb, ct=4000

307     cycles for rep movsd, ct=400
487     cycles for rep movsb, ct=400

59      cycles for rep movsd, ct=40
275     cycles for rep movsb, ct=40

dedndave

hiyas Jochen
for rep movsb, the count should be 4x that used for rep movsd
we want to compare moving the same amount of data

EDIT - i am trying to make sense of the numbers - lol
nothing works in my head - more coffee
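A small sketch of what "moving the same amount of data" means for the two instructions: the count in ecx is in elements, so REP MOVSD takes the byte count divided by 4 while REP MOVSB takes the byte count itself. NBYTES, SrcBuf and DstBuf are hypothetical names used only for this illustration, and NBYTES is assumed to be a multiple of 4.

```asm
; Sketch only: NBYTES, SrcBuf and DstBuf are assumed names.
    include \masm32\include\masm32rt.inc

    NBYTES equ 4000                 ; assumed size, a multiple of 4

    .data?
      SrcBuf db NBYTES dup (?)
      DstBuf db NBYTES dup (?)

    .code

start:
    cld                             ; copy upward in memory

    mov esi, OFFSET SrcBuf
    mov edi, OFFSET DstBuf
    mov ecx, NBYTES / 4             ; 1000 DWORD moves = 4000 bytes
    rep movsd

    mov esi, OFFSET SrcBuf
    mov edi, OFFSET DstBuf
    mov ecx, NBYTES                 ; 4000 byte moves = the same 4000 bytes
    rep movsb

    exit

end start
```

For a fair timing comparison, each of the two copies would be wrapped in the same timing code, with only the instruction and its count differing.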

hutch--

Here is a simple test piece that verifies Japheth's information.

I get these timings on my PIV.


2859 ms REP MOVSD
2797 ms REP MOVSB
2797 ms REP MOVSD
2812 ms REP MOVSB
2797 ms REP MOVSD
2797 ms REP MOVSB
2797 ms REP MOVSD
2781 ms REP MOVSB
Press any key to continue ...


Running this test piece.


; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
    include \masm32\include\masm32rt.inc
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

comment * -----------------------------------------------------
                        Build this  template with
                       "CONSOLE ASSEMBLE AND LINK"
        ----------------------------------------------------- *

    MemCopyD PROTO :DWORD,:DWORD,:DWORD
    MemCopyB PROTO :DWORD,:DWORD,:DWORD

    .code

start:
   
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    call main
    inkey
    exit

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

main proc

    LOCAL hMem1 :DWORD
    LOCAL hMem2 :DWORD

    meg64 equ <1024*1024*64>

    mov hMem1, alloc(meg64)
    mov hMem2, alloc(meg64)

    push ebx

  REPEAT 4

    invoke GetTickCount
    push eax

    mov ebx, 50

  @@:
    invoke MemCopyD,hMem1,hMem2,meg64
    sub ebx, 1
    jnz @B


    invoke GetTickCount
    pop ecx
    sub eax, ecx
    print str$(eax)," ms REP MOVSD",13,10


    invoke GetTickCount
    push eax

    mov ebx, 50

  @@:
    invoke MemCopyB,hMem1,hMem2,meg64
    sub ebx, 1
    jnz @B


    invoke GetTickCount
    pop ecx
    sub eax, ecx
    print str$(eax)," ms REP MOVSB",13,10

  ENDM

    pop ebx

    free hMem2
    free hMem1

    ret

main endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

align 4

MemCopyD proc public uses esi edi Source:DWORD,Dest:DWORD,ln:DWORD

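    cld                     ; clear direction flag so ESI/EDI count upward
    mov esi, [Source]
    mov edi, [Dest]
    mov ecx, [ln]

    shr ecx, 2              ; byte count / 4 = number of whole DWORDs
    rep movsd               ; copy the bulk 4 bytes at a time

    mov ecx, [ln]
    and ecx, 3              ; remaining 0 to 3 bytes
    rep movsb               ; copy the balance byte by byte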

    ret

MemCopyD endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

align 4

MemCopyB proc public uses esi edi Source:DWORD,Dest:DWORD,ln:DWORD

    cld
    mov esi, Source
    mov edi, Dest
    mov ecx, ln

    rep movsb

    ret

MemCopyB endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

end start


I would still exercise caution in using the BYTE copy, as the PIV's special-case circuitry is not universal in available hardware.


jj2007

Quote from: dedndave on August 26, 2009, 12:49:50 PM
for rep movsb, the count should be 4x that used for rep movsd
we want to compare moving the same amount of data

That's what I did, otherwise timings would not be so close for the large counts. ct means bytes copied.