I am trying to assemble:
"movlps xmm0, [eax]"
with MASM (ver 9.0xx) but I get "error A2070:invalid instruction operands".
With JWASM it assembles just fine.
Can someone confirm it's a bug, or am I missing something here?! ::)
Using:
.686p
.xmm
.model flat
Thanks
Same thing
Quote
MOVLPS xmm, mem64    0F 12 /r    Moves two packed single-precision floating-point values from a 64-bit memory location to an XMM register.
MOVLPS mem64, xmm    0F 13 /r    Moves two packed single-precision floating-point values from an XMM register to a 64-bit memory location.
working
.data
truc qword ?
.code
movlps xmm0,truc ;[eax]
oops
Thanks, so I guess I have 3 choices:
1.) Hard-code it.
2.) Change to JWASM for good.
3.) Change my code. ::)
I think I'll go with 2.
It seems to be a fine alternative and may even be better than MASM. :U
It assembles for me using ML 6.15 or 7.00.
Have you tried:
movlps xmm0, QWORD PTR [eax]
You are the man, Michael! :U
I didn't think of that one, since 'movlps' can't take anything other than a QWORD, and usually MASM figures such things out by itself.
I am downgrading the 'bug' to 'annoyance'. :P
I should have made it clear that I did not have to add the QWORD PTR. A plausible explanation for the change, other than it just being a mistake, might be that the instruction actually supports more than the two forms officially listed.
Movaps and movups do not require the size.
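For reference, here is a minimal sketch that assembles with the explicit override (the entry point and the bare ret are just scaffolding; it only demonstrates what ML accepts, not a runnable routine):

.686p
.xmm
.model flat
.code
start:
    movlps xmm0, QWORD PTR [eax]    ; explicit size override satisfies ML 9.0
    movlps QWORD PTR [eax], xmm0    ; store form, same override
    movaps xmm1, [eax]              ; movaps/movups need no override
    ret
end start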
Quote
.nolist
include \masm32\include\masm32rt.inc
.686
.xmm
include \masm32\macros\timers.asm ; get them from the Masm32 Laboratory (http://www.masm32.com/board/index.php?topic=770.0)
buffersize = 10000 ; don't go higher than 100000, ml.exe would slow down
LOOP_COUNT = 10000
.data
align 16
Buffer16 db 1, 2, 3, 4, 5, 6, 7 ; try adding 8, then 9
buffer dd buffersize dup(?)
.code
start:
REPEAT 3
  counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
  mov esi, offset Buffer16
  REPEAT 100
    movaps xmm1, [esi]
    lea esi, [esi+16]
  ENDM
  counter_end
  print str$(eax), 9, "cycles for movaps", 13, 10

  counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
  mov esi, offset buffer
  REPEAT 100
    movups xmm1, [esi]
    lea esi, [esi+16]
  ENDM
  counter_end
  print str$(eax), 9, "cycles for movups", 13, 10

  counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
  mov esi, offset buffer
  REPEAT 100
    movlps xmm0, QWORD PTR [esi]
    movhps xmm0, QWORD PTR [esi+8]
    lea esi, [esi+16]
  ENDM
  counter_end
  print str$(eax), 9, "cycles for movlps/movhps", 13, 10, 10
ENDM
inkey chr$(13, 10, "--- ok ---", 13)
exit
end start
Interesting that the pair movlps/movhps is 20% faster than the single movups (Celeron M):
197 cycles for movaps
397 cycles for movups
321 cycles for movlps/movhps
197 cycles for movaps
403 cycles for movups
322 cycles for movlps/movhps
197 cycles for movaps
397 cycles for movups
321 cycles for movlps/movhps
Thanks JJ2007,
Nice timing. :thumbu
Just for fun, can you put up a comparison with:
"fild qword ptr [eax]"
"fistp qword ptr [eax]"
I'm just curious how horrible (or not) the FPU is for QWORD data transfers.
Since the MMX regs are mapped onto the FPU regs, maybe it isn't that bad at all?! ::)
Quote from: Ficko on June 03, 2009, 09:38:28 AM
Just for fun, can you put up a comparison with "fild qword ptr [eax]" / "fistp qword ptr [eax]"?
It depends.. it depends, as usual. Here are P4 timings, 100*16 bytes mem to mem each:
1377 cycles for fild/fistp
515 cycles for movdqa (aligned 16)
625 cycles for rep movsd (aligned 16)
1206 cycles for movdqu
3391 cycles for movlps/movhps
1411 cycles for fild/fistp
490 cycles for movdqa (aligned 16)
668 cycles for rep movsd (aligned 16)
1744 cycles for movdqu
3486 cycles for movlps/movhps
1342 cycles for fild/fistp
605 cycles for movdqa (aligned 16)
956 cycles for rep movsd (aligned 16)
1489 cycles for movdqu
3572 cycles for movlps/movhps
1979 cycles for fild/fistp
776 cycles for movdqa (aligned 16)
955 cycles for rep movsd (aligned 16)
1606 cycles for movdqu
3490 cycles for movlps/movhps
1440 cycles for fild/fistp
673 cycles for movdqa (aligned 16)
971 cycles for rep movsd (aligned 16)
1741 cycles for movdqu
3583 cycles for movlps/movhps
Note this is comparing apples and oranges: movdqa and movsd are aligned to a 16-byte boundary, while the others are badly misaligned, i.e. +7 for src and +9 for dest (I made 5 repeats to show the variance).
Surprised?
:bg
EDIT: Fixed a bug - fild/fistp count was too low (it's 8 bytes, not 16 as for movdqa)
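For reference, the fild/fistp leg presumably moves each 16 bytes as two qword round-trips, something like this sketch (esi/edi as src/dest pointers are my assumption):

fild  qword ptr [esi]     ; load 64-bit integer - lossless, the 80-bit format holds any int64 exactly
fistp qword ptr [edi]     ; store it back as a 64-bit integer and pop
fild  qword ptr [esi+8]
fistp qword ptr [edi+8]
lea   esi, [esi+16]
lea   edi, [edi+16]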
Yes, I am surprised indeed. :bg
Secretly I was thinking it would be much worse, because the FPU should do an "int" to "float" conversion on load,
but as so often, things aren't always the way you think they are. :wink
Or maybe the SSE instructions are doing conversions as well; that would explain the closeness of the "movdqu" and "fild/fistp" results,
which are about the same caliber (moving qwords with no alignment)!?
Cool. I have been using "fild/fistp" quite often but always thought there was a huge penalty to it, but it seems that isn't true. :dance:
Win XP Pro x64
AMD Athlon(tm) 64 X2 Dual Core Processor 4000+ (SSE3)
409 cycles for movdqa (src + dest aligned 16)
613 cycles for movlps/movhps srcalign=7, destalign=9
609 cycles for movdqu srcalign=7, destalign=9
906 cycles for rep movsd srcalign=7, destalign=9
620 cycles for fild/fistp srcalign=7, destalign=9
409 cycles for movdqa (src + dest aligned 16)
612 cycles for movlps/movhps srcalign=7, destalign=9
609 cycles for movdqu srcalign=7, destalign=9
907 cycles for rep movsd srcalign=7, destalign=9
617 cycles for fild/fistp srcalign=7, destalign=9
408 cycles for movdqa (src + dest aligned 16)
612 cycles for movlps/movhps srcalign=7, destalign=9
611 cycles for movdqu srcalign=7, destalign=9
906 cycles for rep movsd srcalign=7, destalign=9
618 cycles for fild/fistp srcalign=7, destalign=9
Edit: Revision 2 timings.
Quote from: Ficko on June 03, 2009, 03:51:51 PM
Cool. I have been using "fild/fistp" quite often but always thought there was a huge penalty to it, but it seems that isn't true. :dance:
It is still a good option if you don't have SSE2. It is slightly slower than rep movsd, though (but movsd ties up esi and edi...).
Unfortunately I found a little bug in the last routine:
movlps qword ptr [esi], xmm0
should use edi, not esi
... which gave unfair treatment to the movlps/movhps pair.
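In context, the fixed inner loop of the movlps/movhps copy presumably looks like this (a sketch reconstructed from the snippets above):

movlps xmm0, qword ptr [esi]      ; read src, low qword
movhps xmm0, qword ptr [esi+8]    ; read src, high qword
lea    esi, [esi+16]
movlps qword ptr [edi], xmm0      ; write dest - edi, not esi
movhps qword ptr [edi+8], xmm0
lea    edi, [edi+16]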
I also replaced the assembly-time REPEATs with run-time .Repeats, and the results change:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
308 cycles for movdqa (src + dest aligned 16)
408 cycles for movlps/movhps srcalign=8, destalign=8
712 cycles for movdqu srcalign=8, destalign=8
458 cycles for rep movsd srcalign=8, destalign=8
1220 cycles for fild/fistp srcalign=8, destalign=8
310 cycles for movdqa (src + dest aligned 16)
408 cycles for movlps/movhps srcalign=8, destalign=8
712 cycles for movdqu srcalign=8, destalign=8
459 cycles for rep movsd srcalign=8, destalign=8
1220 cycles for fild/fistp srcalign=8, destalign=8
308 cycles for movdqa (src + dest aligned 16)
408 cycles for movlps/movhps srcalign=8, destalign=8
721 cycles for movdqu srcalign=8, destalign=8
459 cycles for rep movsd srcalign=8, destalign=8
1220 cycles for fild/fistp srcalign=8, destalign=8
Now what is really surprising is that the movlps/movhps pair is consistently a lot faster than movdqu. Try fumbling with the srcalign and destalign equates at the top of the source - movlps/movhps is always faster.
Try in particular to set
srcalign = 8
destalign = 8
Of course, movdqa can't be beaten, but remember that HeapAlloc guarantees only 8-byte aligned memory - no good for movdqa...
@Mark: Thanks for the AMD timings - and sorry for the bug in the movlps row.
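By the way, if you need movdqa on heap memory despite that limitation, a common workaround is to over-allocate and round the pointer up - a sketch with hypothetical names (BUFLEN, pRaw, pAligned), keeping the raw pointer for HeapFree:

invoke GetProcessHeap
invoke HeapAlloc, eax, 0, BUFLEN+15   ; 15 spare bytes leave room for alignment
mov pRaw, eax                         ; keep the raw pointer - HeapFree wants this one
add eax, 15
and eax, -16                          ; round up to the next 16-byte boundary
mov pAligned, eax                     ; 16-byte aligned, safe for movdqa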
Prescott dual-core
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
480 cycles for movdqa (src + dest aligned 16)
3303 cycles for movlps/movhps srcalign=7, destalign=9
3247 cycles for movdqu srcalign=7, destalign=9
3590 cycles for rep movsd srcalign=7, destalign=9
3292 cycles for fild/fistp srcalign=7, destalign=9
483 cycles for movdqa (src + dest aligned 16)
3336 cycles for movlps/movhps srcalign=7, destalign=9
3276 cycles for movdqu srcalign=7, destalign=9
3572 cycles for rep movsd srcalign=7, destalign=9
3287 cycles for fild/fistp srcalign=7, destalign=9
486 cycles for movdqa (src + dest aligned 16)
3306 cycles for movlps/movhps srcalign=7, destalign=9
3292 cycles for movdqu srcalign=7, destalign=9
3582 cycles for rep movsd srcalign=7, destalign=9
3342 cycles for fild/fistp srcalign=7, destalign=9
big numbers :(
Quote from: dedndave on June 03, 2009, 08:46:40 PM
big numbers :(
Small numbers when aligned to 8 bytes :bg
And again, the movlps/movhps pair beats them all...
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
586 cycles for movdqa (src + dest aligned 16)
712 cycles for movlps/movhps srcalign=8, destalign=8
1377 cycles for movdqu srcalign=8, destalign=8
802 cycles for rep movsd srcalign=8, destalign=8
1175 cycles for fild/fistp srcalign=8, destalign=8
520 cycles for movdqa (src + dest aligned 16)
758 cycles for movlps/movhps srcalign=8, destalign=8
1323 cycles for movdqu srcalign=8, destalign=8
845 cycles for rep movsd srcalign=8, destalign=8
1160 cycles for fild/fistp srcalign=8, destalign=8
505 cycles for movdqa (src + dest aligned 16)
797 cycles for movlps/movhps srcalign=8, destalign=8
1242 cycles for movdqu srcalign=8, destalign=8
798 cycles for rep movsd srcalign=8, destalign=8
1047 cycles for fild/fistp srcalign=8, destalign=8
Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz (SSE4)
216 cycles for movdqa (src + dest aligned 16)
1241 cycles for movlps/movhps srcalign=7, destalign=9
1408 cycles for movdqu srcalign=7, destalign=9
1485 cycles for rep movsd srcalign=7, destalign=9
1458 cycles for fild/fistp srcalign=7, destalign=9
212 cycles for movdqa (src + dest aligned 16)
1242 cycles for movlps/movhps srcalign=7, destalign=9
1403 cycles for movdqu srcalign=7, destalign=9
1479 cycles for rep movsd srcalign=7, destalign=9
1456 cycles for fild/fistp srcalign=7, destalign=9
213 cycles for movdqa (src + dest aligned 16)
1239 cycles for movlps/movhps srcalign=7, destalign=9
1402 cycles for movdqu srcalign=7, destalign=9
1479 cycles for rep movsd srcalign=7, destalign=9
1462 cycles for fild/fistp srcalign=7, destalign=9
big numbers indeed...
213 cycles for movdqa (src + dest aligned 16)
312 cycles for movlps/movhps srcalign=8, destalign=8
614 cycles for movdqu srcalign=8, destalign=8
280 cycles for rep movsd srcalign=8, destalign=8
417 cycles for fild/fistp srcalign=8, destalign=8
213 cycles for movdqa (src + dest aligned 16)
313 cycles for movlps/movhps srcalign=8, destalign=8
614 cycles for movdqu srcalign=8, destalign=8
280 cycles for rep movsd srcalign=8, destalign=8
417 cycles for fild/fistp srcalign=8, destalign=8
213 cycles for movdqa (src + dest aligned 16)
313 cycles for movlps/movhps srcalign=8, destalign=8
618 cycles for movdqu srcalign=8, destalign=8
282 cycles for rep movsd srcalign=8, destalign=8
417 cycles for fild/fistp srcalign=8, destalign=8
better :bg
And even better... we have a new champion for MemCopy with addresses that are not aligned to 16:
mov esi, offset src
mov edi, offset dest
mov ecx, 100
.Repeat
    movdqu xmm0, [esi]              ; read src, unaligned 16-byte load
    lea esi, [esi+16]
    movlps qword ptr [edi], xmm0    ; write dest, low qword
    movhps qword ptr [edi+8], xmm0  ; write dest, high qword
    dec ecx
    lea edi, [edi+16]
.Until Zero?
The trick is reading with movdqu and writing with movlps/movhps...
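For clarity, the "movlps+movdqu" rows in the tables below are presumably the reverse pairing, reading in halves and writing with one unaligned store - a sketch:

movlps xmm0, qword ptr [esi]      ; read src in two qword halves
movhps xmm0, qword ptr [esi+8]
lea    esi, [esi+16]
movdqu [edi], xmm0                ; single unaligned 16-byte store
lea    edi, [edi+16]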
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
488 cycles for movdqa (src + dest aligned 16)
671 cycles for movdqu+movlps srcalign=8, destalign=8
714 cycles for movlps/movhps srcalign=8, destalign=8
1240 cycles for movlps+movdqu srcalign=8, destalign=8
1256 cycles for movdqu srcalign=8, destalign=8
818 cycles for rep movsd srcalign=8, destalign=8
1060 cycles for fild/fistp srcalign=8, destalign=8
484 cycles for movdqa (src + dest aligned 16)
3107 cycles for movdqu+movlps srcalign=7, destalign=9
3118 cycles for movlps/movhps srcalign=7, destalign=9
3163 cycles for movlps+movdqu srcalign=7, destalign=9
3151 cycles for movdqu srcalign=7, destalign=9
3186 cycles for rep movsd srcalign=7, destalign=9
3203 cycles for fild/fistp srcalign=7, destalign=9
For addresses not aligned to 8 bytes, it is only marginally slower than the others, though.
:bg
:U
Genuine Intel(R) CPU T2400 @ 1.83GHz (SSE3)
316 cycles for movdqa (src + dest aligned 16)
411 cycles for movdqu+movlps srcalign=8, destalign=8
408 cycles for movlps/movhps srcalign=8, destalign=8
714 cycles for movlps+movdqu srcalign=8, destalign=8
723 cycles for movdqu srcalign=8, destalign=8
462 cycles for rep movsd srcalign=8, destalign=8
1219 cycles for fild/fistp srcalign=8, destalign=8
361 cycles for movdqa (src + dest aligned 16)
422 cycles for movdqu+movlps srcalign=8, destalign=8
428 cycles for movlps/movhps srcalign=8, destalign=8
712 cycles for movlps+movdqu srcalign=8, destalign=8
719 cycles for movdqu srcalign=8, destalign=8
469 cycles for rep movsd srcalign=8, destalign=8
1222 cycles for fild/fistp srcalign=8, destalign=8
309 cycles for movdqa (src + dest aligned 16)
427 cycles for movdqu+movlps srcalign=8, destalign=8
424 cycles for movlps/movhps srcalign=8, destalign=8
737 cycles for movlps+movdqu srcalign=8, destalign=8
726 cycles for movdqu srcalign=8, destalign=8
481 cycles for rep movsd srcalign=8, destalign=8
1273 cycles for fild/fistp srcalign=8, destalign=8
310 cycles for movdqa (src + dest aligned 16)
420 cycles for movdqu+movlps srcalign=8, destalign=8
417 cycles for movlps/movhps srcalign=8, destalign=8
720 cycles for movlps+movdqu srcalign=8, destalign=8
713 cycles for movdqu srcalign=8, destalign=8
459 cycles for rep movsd srcalign=8, destalign=8
1256 cycles for fild/fistp srcalign=8, destalign=8
--- ok ---