SzCpy vs. lstrcpy

Mark_Larson · May 16, 2005, 08:30:54 PM

Quote from: AeroASM on May 15, 2005, 05:06:03 PM
93 cycles with MMX
96 cycles with XMM

Why longer with XMM?

Test piece attached: I nicked the MMX algorithm from Mark Larson and optimised it and commented it and converted it to XMM.
Timings are for my Pentium M 1.5GHz

It took me a while to run your code because I've been so busy. I forgot earlier that you said you had a Pentium 4 M. They are more like a P3 than a P4 when it comes to optimization. A lot of the long-latency instructions on a P4 have lower latency on a P4 M. However, you still might want to try doing 2 per loop to break dependencies. I'll play with it if I get a chance.

Code Select


117 cycles with MMX
97 cycles with XMM

Your modified MMX routine of my code runs 10 cycles slower on my P4 than my original code. But I am sure it runs faster on your P4 M or you wouldn't have posted it. I am better at optimizing for a standard P4, since that's what I have :)

AeroASM · May 16, 2005, 08:37:52 PM

Not Pentium 4-M, but Pentium M. After the P3, Intel souped it up to make the P4, added mobility (battery saving) stuff and made the Pentium M, then put some of the mobility stuff on the P4 to make the P4-M (AFAIK).

You can save clocks by organising the qword loop like this:

Instead of this:

Code Select


;preliminary stuff
qword_copy:
;stuff 1
jnz somewhere_else
;stuff 2
jmp qword_copy
somewhere_else:

Put this:

Code Select


;preliminary stuff
jmp copy_start
qword_copy:
;stuff 2
copy_start:
;stuff 1
jz qword_copy
somewhere_else:

This is faster because each iteration ends in a single conditional jump back to the top, instead of falling through the jnz and then executing an extra jmp.
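
For anyone who wants to see the same rearrangement outside assembler, here is a rough C sketch that uses goto to mirror the labels above. It is only an illustration, not the routine from the attachment: copy_rotated and has_zero_byte are made-up names, an 8-byte block stands in for one MMX qword, and it assumes the source buffer is readable for up to 7 bytes past the terminator.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Non-zero if any of the 8 bytes in v is zero (classic bit trick). */
static int has_zero_byte(uint64_t v)
{
    return ((v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL) != 0;
}

/* Hypothetical helper: copies a zero-terminated string with a rotated loop. */
char *copy_rotated(char *dst, const char *src)
{
    char *d = dst;
    uint64_t block;

    assert(dst != NULL && src != NULL);

    /* preliminary stuff done, enter the loop at its test */
    goto copy_start;
qword_copy:
    /* stuff 2: store the block and advance both pointers */
    memcpy(d, &block, 8);
    d += 8;
    src += 8;
copy_start:
    /* stuff 1: load the next block and look for the terminator */
    memcpy(&block, src, 8);
    if (!has_zero_byte(block))
        goto qword_copy;            /* the only branch per iteration */
    /* somewhere_else: finish the last partial block byte by byte */
    while ((*d++ = *src++) != '\0') {
    }
    return dst;
}
```

The loop body now ends in the one conditional branch. The only extra cost is the single jump to copy_start before the first block.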

Mark_Larson · May 16, 2005, 08:59:01 PM

The Pentium 4-M and Pentium M are the same processor. I just like using "Pentium 4 M" for people who aren't familiar with the different processor lines, so they can see it's a P4 part (has SSE2 support), though it was designed more from a P3 than a P4, whereas the P4 is a significant departure from the P3. Intel has gotten worse and worse with their processors. Prescott has even longer instruction latencies (eww) than the original P4, and slower L1 and L2 cache access times. One of the things that Intel has always beaten AMD on is L1 cache access latency (2 clock cycles versus 3 for AMD). The current Prescotts have 3 cycles of L1 cache latency. The just-released Intel parts are even worse, with 4 cycles of L1 cache latency. So reads from memory are getting more expensive even when the data is in the L1 cache.

AeroASM · May 16, 2005, 09:01:17 PM

I swear they are not, but I am probably mistaken.

Mark_Larson · May 16, 2005, 09:07:29 PM

Currently there are no desktop systems based on a Pentium M (which is what you were saying). Tom's Hardware modified a desktop motherboard to plug in one of the Pentium M chips, so they could use it on the desktop. That's the closest a desktop version of it has come. The Pentium 4 M and Pentium M are the same processor. Intel has made several comments in the press that they will eventually release a Pentium M for the desktop, but none has come out yet. Because of my job I have to know low-level hardware very well.

You might find this link interesting. I did a search on "Pentium 4-M" on Google. Tom's Hardware is comparing 4 mobile (read: for laptops) Pentium 4-M systems (it calls them by that exact name).

http://www.tomshardware.com/mobile/20030212/

I also did a search on "Pentium M", and found another link on Tom's Hardware doing reviews of laptop systems, where they call it "Pentium M"

http://www.tomshardware.com/mobile/20030205/

Both reviews are on the same site, and both reviews use both names.

lingo · May 16, 2005, 10:11:22 PM

still slow (without preserving esi) :lol

Code Select



 Timing routines by MichaelW from MASM32 forums, http://www.masmforum.com/
 Please terminate any high-priority tasks and press ENTER to begin.


 128-byte string copy timing results:

SzCpy10   - >   Lingo ->  MMX: 123 clocks
szCopyMMX ->Mark Larson ->MMX: 135 clocks
SzCpy11 -- >     Lingo ->     MMX  ->    Fast: 113 clocks
szCopyMMX1  - >  Mark Larson   ->  MMX-> Fast: 124 clocks

 Press ENTER to exit...

[attachment deleted by admin]

Jimg · May 17, 2005, 02:21:41 PM

Quote from: Jimg on May 16, 2005, 07:20:29 PM
Quote from: lingo on May 16, 2005, 06:14:04 PM

1. What is the length of the buffer with source string and is it OK to read the data past the end of the source string?

That's a very good question. Is it possible to have the source string right at the end of your allotted space and cause a page fault by trying to read past it? Every one of these routines has this possible problem :eek

Would someone please address this for the clueless?

Mark_Larson · May 17, 2005, 03:05:28 PM

Quote from: Jimg on May 17, 2005, 02:21:41 PM
Quote from: Jimg on May 16, 2005, 07:20:29 PM
Quote from: lingo on May 16, 2005, 06:14:04 PM

1. What is the length of the buffer with source string and is it OK to read the data past the end of the source string?

That's a very good question. Is it possible to have the source string right at the end of your allotted space and cause a page fault by trying to read past it? Every one of these routines has this possible problem :eek

Would someone please address this for the clueless?

I missed this the first time it was posted. I've been super busy at work and working a lot of overtime. That always happens when you are at a new place, because you spend a lot of time ramping up. So that cuts down on my ability to optimize. I can't spend as much time on it as I'd like to.

I have heard of cases where reading past the end of a buffer can cause a fault (dynamically allocated buffer). However, I have not experienced it myself. And Agner Fog's string length routine actually reads past the end of the buffer, since it reads data a dword at a time. In general you should be much safer doing a read past the end of the buffer than doing a write. I'd never do a write that could potentially go past the end of the buffer. In the Intel Optimization manual I believe it says something about reads going past a page boundary causing a fault. But I could be misremembering. It's been a while, and I've never observed the behavior. Agner Fog's strlen routine has been in use in MASM32 for some time. Hutch-- made some modifications, and I've never heard of anyone reporting a problem. So if you do read past the end of some data, just be aware that there could be a problem, and be cautious.
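
To put some numbers on the page-boundary worry: protection works on whole pages, so an over-read can only fault if it touches a page that the string itself does not. Here is a small C sketch of that check, assuming 4096-byte pages (the usual x86 size). read_crosses_page is a hypothetical helper for illustration, not something from the Intel manual or MASM32.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ASSUMED_PAGE_SIZE 4096u   /* assumption: standard 4 KB x86 page */

/* Returns 1 if an n-byte read starting at addr touches more than one page. */
int read_crosses_page(uintptr_t addr, size_t n)
{
    uintptr_t offset;

    assert(n <= ASSUMED_PAGE_SIZE);           /* only models reads up to one page */
    offset = addr & (ASSUMED_PAGE_SIZE - 1);  /* position inside the page */
    return offset + n > ASSUMED_PAGE_SIZE;
}
```

A dword read at page offset 4092 stays inside the page, while one at offset 4094 does not. An aligned 4, 8 or 16 byte read never crosses, so a routine that aligns its source pointer first keeps any over-read inside the page that holds the terminator.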

Mark_Larson · May 17, 2005, 04:24:46 PM

Quote from: lingo on May 16, 2005, 10:11:22 PM
still slow (without preserving esi) :lol

Code Select

 Timing routines by MichaelW from MASM32 forums, http://www.masmforum.com/
 Please terminate any high-priority tasks and press ENTER to begin.

 128-byte string copy timing results:

SzCpy10   - >   Lingo ->  MMX: 123 clocks
szCopyMMX ->Mark Larson ->MMX: 135 clocks
SzCpy11 -- >     Lingo ->     MMX  ->    Fast: 113 clocks
szCopyMMX1  - >  Mark Larson   ->  MMX-> Fast: 124 clocks

 Press ENTER to exit...

Your code is using SSE2, not SSE. SSE doesn't support the full 16-byte integer instructions such as MOVDQA, PXOR, etc. I had been limiting myself to SSE since that's what I thought you were using, and I wanted to keep apples to apples. I just converted my SSE code to use SSE2 like yours does, and now I am running in 62 cycles. Yours runs in 76 cycles. You can do the same conversion yourself on my code by simply converting the instructions to use SSE2 (movdqa, etc).

For those that want a visual aid, here's the main loop.

Code Select


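align 16
dqword_copy:
   movdqa xmm0,[eax]        ; load 16 source bytes (address must be 16-byte aligned)
   pxor xmm1,xmm1           ; xmm1 = 0
   pcmpeqb xmm1,xmm0        ; 0FFh in every byte position where the source byte is 0
   add eax,16               ; advance source pointer
   pmovmskb ecx,xmm1        ; gather the 16 compare results into a bit mask

   test ecx,ecx
   jnz finish_rest          ; terminator found in this block: leave the loop

   movdqa [esi],xmm0        ; no terminator: store all 16 bytes (aligned destination)
   add esi,16               ; advance destination pointer
   jmp dqword_copy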

lingo · May 17, 2005, 05:16:39 PM

I don't know what you are talking about? :naughty:
In my zip file I have just MMX code...

Where is your zip file with tests, so I can understand what you mean?

MichaelW · May 17, 2005, 06:21:02 PM

Reading past the end of a buffer (under Windows 2000 SP4) causes no problems that I can see, at least without a debugger. I didn't have time to test buffers allocated by other means, or writes past the end.

Code Select


; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .486                       ; create 32 bit code
    .model flat, stdcall       ; 32 bit memory model
    option casemap :none       ; case sensitive

    include \masm32\include\windows.inc
    include \masm32\include\masm32.inc
    include \masm32\include\kernel32.inc
    includelib \masm32\lib\masm32.lib
    includelib \masm32\lib\kernel32.lib
    include \masm32\macros\macros.asm
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
        teststr db 4097 dup('X'),0
        lpMem   dd 0
        junkstr db 0,0,0,0,0
    .code
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start: 
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    mov   eax,input("Read past end of statically allocated buffer, press enter to continue...")

    invoke StrLen,ADDR teststr
    mov   ebx,eax
    print chr$("Returned string length :")
    print ustr$(ebx)
    print chr$(13,10)
    mov   ebx,OFFSET teststr
    add   ebx,4097+16
    mov   eax,[ebx]

    invoke GlobalAlloc,GMEM_FIXED or GMEM_ZEROINIT,4096
    mov   lpMem,eax
    invoke GlobalSize,lpMem
    mov   ebx,eax
    print chr$("Size of dynamically allocated buffer :")
    print ustr$(ebx)
    print chr$(13,10)

    invoke memfill,lpMem,4096,'LLIF'

    mov   eax,input("Read past end of dynamically allocated buffer, press enter to continue...")    

    mov   ebx,lpMem
    add   ebx,4095
    mov   eax,[ebx]
    mov   edx,[ebx+16]
    mov   DWORD PTR junkstr,eax
    print ADDR junkstr

    free lpMem

    mov   eax,input(13,10,"Press enter to exit...")
    exit
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start

Mark_Larson · May 17, 2005, 07:42:30 PM

Quote from: lingo on May 17, 2005, 05:16:39 PM
Where is your zip file with tests, so I can understand what you mean?

For timing purposes I added my routine to your code, so that we used the same test data. I just modified the timing to use REALTIME and added 10 times as many loops. That way I can get more accurate benchmarking against your code since we are using the same test data.

Quote from: lingo on May 17, 2005, 05:16:39 PM
I don't know what you are talking about? :naughty:
In my zip file I have just MMX code...

Not true. Here's the text you print for your szCpy11 routine:

Code Select


SzSzCpy11    DB  "SzCpy11 (Lingo-> SSE): ",0

So you are saying it is just SSE. However here is some code from your szCpy11 routine.

Code Select


comment * XMM by Lingo*
align 16
@@:
		movdqa		[edx+eax],XMM0
		add				eax,16
@@XM:
		pcmpeqb		XMM1, [ecx+eax]
		movdqa		XMM0, [ecx+eax]

MOVDQA is an SSE2 instruction, not SSE. Doing PCMPEQB with an XMM register is an SSE2 instruction, not SSE. Doing any of the old MMX instructions on an XMM register instead of an MMX register requires SSE2 support. If you are unsure in the future, check Appendix B in the 2nd instruction manual for the P4. It gives a complete listing of which instructions were added with each type of support. It has a section for all the MMX instructions, SSE, and SSE2. So you can clearly see what processor you need to run your code when you use a given instruction. If someone without SSE2 support tries to run your code, it'll die the big death (invalid opcode). That means most people with AMDs, since unless you have an AMD64 you won't have SSE2 support, and anyone with a P3 or older Intel processor.
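
If a routine has to run on machines without SSE2, the usual guard is a CPUID check before picking the code path. Below is a hedged C sketch of just the bit test: CPUID function 1 reports MMX in EDX bit 23, SSE in bit 25 and SSE2 in bit 26. The macro names and has_feature are invented for the example, and executing CPUID to obtain EDX is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Feature bits in EDX after CPUID function 1 (names are illustrative). */
#define CPU_MMX   (1u << 23)
#define CPU_SSE   (1u << 25)
#define CPU_SSE2  (1u << 26)

/* Returns 1 if the given feature bit is set in the CPUID.1 EDX value. */
int has_feature(uint32_t cpuid1_edx, uint32_t feature_bit)
{
    assert(feature_bit != 0);   /* caller must pass one of the masks above */
    return (cpuid1_edx & feature_bit) != 0;
}
```

A P3 reports MMX and SSE but not SSE2, so a dispatcher built on this test would fall back to the MMX or plain SSE routine there instead of hitting an invalid opcode.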

lingo · May 17, 2005, 08:35:08 PM

Bla, bla... blah..... :naughty:

I'm afraid you mean my older zip file with AeroASM's code in it...

Use my last zip file (named StillSlow.zip) at this page (page 7)

Where is your zip file with tests, so that everyone can see how slow your code is?

No file, just bla, bla blah :naughty:

Mark_Larson · May 17, 2005, 08:50:12 PM

Quote from: lingo on May 17, 2005, 08:35:08 PM
Bla, bla... blah..... :naughty:

I'm afraid you mean my older zip file with AeroASM's code in it...

Use my last zip file (named StillSlow.zip) at this page (page 7)

Where is your zip file with tests, so that everyone can see how slow your code is?

No file, just bla, bla blah :naughty:

I got better things to do than waste my time on someone who is rude and arrogant. You are just upset because my SSE2 code is faster than yours. Grow up. And BTW the timings on my computer for the latest code you posted are the same speed. And you still haven't gotten it right, your latest code says it's MMX, and it's not. It's SSE. If you spent more time listening instead of being rude, you would have already learned that fact.

Jimg · May 17, 2005, 08:55:46 PM

Quote from: Jimg on May 17, 2005, 02:21:41 PM
Quote from: Jimg on May 16, 2005, 07:20:29 PM
Quote from: lingo on May 16, 2005, 06:14:04 PM

1. What is the length of the buffer with source string and is it OK to read the data past the end of the source string?

That's a very good question. Is it possible to have the source string right at the end of your allotted space and cause a page fault by trying to read past it? Every one of these routines has this possible problem :eek

Would someone please address this for the clueless?

Well, I had to answer my own question....

Code Select

; assemble and link parameters
;Assemble=/c /coff /Cp /nologo /Sn
;Link=/SUBSYSTEM:WINDOWS /RELEASE /VERSION:4.0

.586
.model	flat, stdcall
option	casemap :none   ; case sensitive
include user32.inc
includelib user32.lib
include kernel32.inc
includelib kernel32.lib
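.data
bln db ' ',0			; 2 bytes
	db 4087 dup (0)		; padding so that badstring lands 2 bytes before offset 4096
done db 'done',0		; 5 bytes
badstring db '0'		; offset 4094, assuming .data starts on a page boundary
.code
Program:
	invoke	GetModuleHandle, 0
	mov edx,offset badstring
@@:	mov eax,[edx]	; get 4 bytes as most of these routines are doing
	invoke MessageBox,0,addr done,addr bln,0
	invoke	ExitProcess, eax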
end Program

The mov eax,[edx] bombs off because it tries to read past its own address space.

This could be a very hard problem to debug.

So, what to do about all these general purpose routines????
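
One possible answer, sketched in C rather than MASM: copy byte by byte until the source pointer is aligned, then read only aligned blocks. An aligned 8-byte block never straddles a 4096-byte page, so the over-read stays in the page that already holds the terminator. Assumptions here: 4 KB pages, 8-byte blocks standing in for a qword, and invented names (copy_aligned_reads, block_has_zero). Strictly speaking C still treats the over-read as undefined, so take this as a model of what an assembler routine would do, not as a drop-in replacement.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Non-zero if any of the 8 bytes in v is zero. */
static int block_has_zero(uint64_t v)
{
    return ((v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL) != 0;
}

/* Hypothetical helper: block copy whose block reads are always 8-byte aligned. */
char *copy_aligned_reads(char *dst, const char *src)
{
    char *d = dst;
    uint64_t block;

    assert(dst != NULL && src != NULL);

    /* Head: single bytes until src sits on an 8-byte boundary. */
    while (((uintptr_t)src & 7) != 0) {
        if ((*d++ = *src++) == '\0')
            return dst;
    }

    /* Body: aligned 8-byte reads, which cannot cross a page boundary. */
    for (;;) {
        memcpy(&block, src, 8);
        if (block_has_zero(block))
            break;
        memcpy(d, &block, 8);   /* the destination may be unaligned */
        d += 8;
        src += 8;
    }

    /* Tail: the block containing the terminator, byte by byte. */
    while ((*d++ = *src++) != '\0') {
    }
    return dst;
}
```

The price is the byte-wise head loop, up to 7 iterations, which is why the fast routines in this thread prefer to assume an aligned source instead.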

[attachment deleted by admin]
