The MASM Forum Archive 2004 to 2012

General Forums => The Campus => Topic started by: Ficko on June 02, 2009, 05:38:51 PM

Title: MASM movlps problem?
Post by: Ficko on June 02, 2009, 05:38:51 PM
I am trying to compile:

"movlps xmm0, [eax]"

with MASM (ver 9.0xx), but I am getting "error A2070: invalid instruction operands".

With JWASM it compiles just fine.

Can someone confirm it's a bug, or am I missing something here?! ::)

Using:
.686p
.xmm
.model flat

Thanks
Title: Re: MASM movlps problem?
Post by: ToutEnMasm on June 02, 2009, 06:19:06 PM
Same thing
Quote
MOVLPS xmm, mem64   0F 12 /r   Moves two packed single-precision floating-point values from a 64-bit memory location to an XMM register.
MOVLPS mem64, xmm   0F 13 /r   Moves two packed single-precision floating-point values from an XMM register to a 64-bit memory location.

working
.data
truc qword ?
.code
movlps xmm0,truc ;[eax]
Title: Re: MASM movlps problem?
Post by: dedndave on June 02, 2009, 06:30:41 PM
oops
Title: Re: MASM movlps problem?
Post by: Ficko on June 02, 2009, 06:40:50 PM
Thanks, so I guess I have 3 choices:

1.) Hard code it.
2.) Change to JWASM for good.
3.) Change my code. ::)

I think I'll go with 2.
It seems to be a fine alternative and may even be better than MASM. :U
Title: Re: MASM movlps problem?
Post by: MichaelW on June 02, 2009, 07:35:04 PM
It assembles for me using ML 6.15 or 7.00.

Have you tried:

movlps xmm0, QWORD PTR [eax]

Title: Re: MASM movlps problem?
Post by: Ficko on June 02, 2009, 07:53:24 PM
You are the man Michael!  :U

I didn't think about that one, since 'movlps' can't take anything other than a QWORD, and usually MASM figures such things out by itself.

I am downgrading the 'bug' to 'annoyance'. :P
Title: Re: MASM movlps problem?
Post by: MichaelW on June 03, 2009, 12:11:55 AM
I should have made it clear that I did not have to add the QWORD PTR. A plausible explanation for the change, other than it just being a mistake, might be that the instruction actually supports more than the two forms officially listed.
Title: Re: MASM movlps problem?
Post by: jj2007 on June 03, 2009, 01:02:17 AM
Movaps and movups do not require the size.

Quote
.nolist
include \masm32\include\masm32rt.inc
.686
.xmm
include \masm32\macros\timers.asm      ; get them from the Masm32 Laboratory (http://www.masm32.com/board/index.php?topic=770.0)
   buffersize      = 10000               ; don't go higher than 100000, ml.exe would slow down
   LOOP_COUNT   = 10000

.data?
align 16
Buffer16   db 1, 2, 3, 4, 5, 6, 7      ; try adding 8, then 9
buffer   dd buffersize dup(?)

.code
start:
   REPEAT 3
   counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
      mov esi, offset Buffer16
      REPEAT 100
         movaps xmm1, [esi]
         lea esi, [esi+16]
      ENDM
   counter_end
   print str$(eax), 9, "cycles for movaps", 13, 10

   counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
      mov esi, offset buffer
      REPEAT 100
         movups xmm1, [esi]
         lea esi, [esi+16]
      ENDM
   counter_end
   print str$(eax), 9, "cycles for movups", 13, 10

   counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
      mov esi, offset buffer
      REPEAT 100
         movlps xmm0, QWORD PTR [esi]
         movhps xmm0, QWORD PTR [esi+8]
         lea esi, [esi+16]
      ENDM
   counter_end
   print str$(eax), 9, "cycles for movlps/movhps", 13, 10, 10
   ENDM

   inkey chr$(13, 10, "--- ok ---", 13)
   exit
end start

Interesting that the pair movlps/movhps is 20% faster than the single movups (Celeron M):
197     cycles for movaps
397     cycles for movups
321     cycles for movlps/movhps

197     cycles for movaps
403     cycles for movups
322     cycles for movlps/movhps

197     cycles for movaps
397     cycles for movups
321     cycles for movlps/movhps
Title: Re: MASM movlps problem?
Post by: Ficko on June 03, 2009, 09:38:28 AM
Thanks JJ2007,

Nice timing. :thumbu

Just for fun can you put up a comparison with:

"fild qword ptr [eax]"
"fistp qword ptr [eax]"

just being curious how horrible, or not, the FPU is for QWORD data transfer.
Since the MMX regs are mapped to the FPU regs, maybe it isn't that bad at all?! ::)
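
For clarity, the kind of copy loop I mean is sketched below (the labels src and dest and the count of 100 are just made up for illustration):

    mov esi, offset src        ; hypothetical source buffer
    mov edi, offset dest       ; hypothetical destination buffer
    mov ecx, 100               ; number of qwords to move
@@: fild qword ptr [esi]       ; load 8 bytes through the FPU
    fistp qword ptr [edi]      ; store the same 8 bytes to the destination
    add esi, 8
    add edi, 8
    dec ecx
    jnz @B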

Title: Re: MASM movlps problem?
Post by: jj2007 on June 03, 2009, 12:43:36 PM
Quote from: Ficko on June 03, 2009, 09:38:28 AM
Thanks JJ2007,

Nice timing. :thumbu

Just for fun can you put up a comparison with:

"fild qword ptr [eax]"
"fistp qword ptr [eax]"

just being curious how horrible, or not, the FPU is for QWORD data transfer.
Since the MMX regs are mapped to the FPU regs, maybe it isn't that bad at all?! ::)


It depends.. it depends, as usual. Here are P4 timings, 100*16 bytes mem to mem each:

1377    cycles for fild/fistp
515     cycles for movdqa (aligned 16)
625     cycles for rep movsd (aligned 16)
1206    cycles for movdqu
3391    cycles for movlps/movhps

1411    cycles for fild/fistp
490     cycles for movdqa (aligned 16)
668     cycles for rep movsd (aligned 16)
1744    cycles for movdqu
3486    cycles for movlps/movhps

1342    cycles for fild/fistp
605     cycles for movdqa (aligned 16)
956     cycles for rep movsd (aligned 16)
1489    cycles for movdqu
3572    cycles for movlps/movhps

1979    cycles for fild/fistp
776     cycles for movdqa (aligned 16)
955     cycles for rep movsd (aligned 16)
1606    cycles for movdqu
3490    cycles for movlps/movhps

1440    cycles for fild/fistp
673     cycles for movdqa (aligned 16)
971     cycles for rep movsd (aligned 16)
1741    cycles for movdqu
3583    cycles for movlps/movhps


Note this is comparing apples and oranges: movdqa and movsd are aligned to a 16-byte boundary, while the others are badly misaligned, i.e. +7 for src and +9 for dest (I made 5 repeats to show the variance).

Surprised?
:bg

EDIT: Fixed a bug - fild/fistp count was too low (it's 8 bytes, not 16 as for movdqa)

[attachment deleted by admin]
Title: Re: MASM movlps problem?
Post by: Ficko on June 03, 2009, 03:51:51 PM
Yes, I am surprised indeed. :bg

Secretly I was thinking it would be quite a bit worse, because the FPU should do an "int" to "float" conversion on load,
but as so often, things aren't always the way you think they are. :wink

Or maybe the SSE instructions are doing conversions as well; that would explain the closeness of the results for "movdqu" and "fild/fistp",
which are about the same caliber, moving qwords with no alignment!?

Cool, I have been using "fild/fistp" quite often, but I always thought there was a huge penalty to it; it seems that isn't true. :dance:
Title: Re: MASM movlps problem?
Post by: Mark Jones on June 03, 2009, 05:14:53 PM
Win XP Pro x64
AMD Athlon(tm) 64 X2 Dual Core Processor 4000+ (SSE3)
409     cycles for movdqa               (src + dest aligned 16)
613     cycles for movlps/movhps        srcalign=7, destalign=9
609     cycles for movdqu               srcalign=7, destalign=9
906     cycles for rep movsd            srcalign=7, destalign=9
620     cycles for fild/fistp           srcalign=7, destalign=9

409     cycles for movdqa               (src + dest aligned 16)
612     cycles for movlps/movhps        srcalign=7, destalign=9
609     cycles for movdqu               srcalign=7, destalign=9
907     cycles for rep movsd            srcalign=7, destalign=9
617     cycles for fild/fistp           srcalign=7, destalign=9

408     cycles for movdqa               (src + dest aligned 16)
612     cycles for movlps/movhps        srcalign=7, destalign=9
611     cycles for movdqu               srcalign=7, destalign=9
906     cycles for rep movsd            srcalign=7, destalign=9
618     cycles for fild/fistp           srcalign=7, destalign=9


Edit: Revision 2 timings.
Title: Re: MASM movlps problem?
Post by: jj2007 on June 03, 2009, 08:02:26 PM
Quote from: Ficko on June 03, 2009, 03:51:51 PM
Yes I am surprised indeed. :bg

Secretly I was thinking it would be quite a bit worse, because the FPU should do an "int" to "float" conversion on load,
but as so often, things aren't always the way you think they are. :wink

Or maybe the SSE instructions are doing conversions as well; that would explain the closeness of the results for "movdqu" and "fild/fistp",
which are about the same caliber, moving qwords with no alignment!?

Cool, I have been using "fild/fistp" quite often, but I always thought there was a huge penalty to it; it seems that isn't true. :dance:


It is still a good option if you don't have SSE2. It is slightly slower than rep movsd, though (but movsd binds esi and edi...).

Unfortunately I found a little bug in the last routine:
movlps qword ptr [esi], xmm0
should read edi instead of esi
... which gave unfair treatment to the movlps/movhps pair.
I also replaced the assembly-time REPEATs with run-time .Repeats, and the results change:

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
308     cycles for movdqa               (src + dest aligned 16)
408     cycles for movlps/movhps        srcalign=8, destalign=8
712     cycles for movdqu               srcalign=8, destalign=8
458     cycles for rep movsd            srcalign=8, destalign=8
1220    cycles for fild/fistp           srcalign=8, destalign=8

310     cycles for movdqa               (src + dest aligned 16)
408     cycles for movlps/movhps        srcalign=8, destalign=8
712     cycles for movdqu               srcalign=8, destalign=8
459     cycles for rep movsd            srcalign=8, destalign=8
1220    cycles for fild/fistp           srcalign=8, destalign=8

308     cycles for movdqa               (src + dest aligned 16)
408     cycles for movlps/movhps        srcalign=8, destalign=8
721     cycles for movdqu               srcalign=8, destalign=8
459     cycles for rep movsd            srcalign=8, destalign=8
1220    cycles for fild/fistp           srcalign=8, destalign=8


Now what is really surprising is that the movlps/movhps pair is consistently a lot faster than movdqu. Try fumbling with the srcalign and destalign equates on top of the source - movlps/movhps is always faster.

Try in particular to set
srcalign = 8
destalign = 8

Of course, movdqa can't be beaten, but remember that HeapAlloc guarantees only 8-byte aligned memory - no good for movdqa...
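
A common workaround is sketched below, just as an assumption (over-allocate, round the pointer up to the next 16-byte boundary, and stash the original pointer just below the aligned block so it can be handed back to HeapFree later; hHeap and nBytes are hypothetical names):

    invoke HeapAlloc, hHeap, 0, nBytes+16+4   ; extra room for alignment + saved pointer
    mov edx, eax                              ; original pointer, needed later for HeapFree
    add eax, 4+15
    and eax, -16                              ; eax is now 16-byte aligned
    mov [eax-4], edx                          ; original pointer saved just below the block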

@Mark: Thanks for the AMD timings - and sorry for the bug in the movlps row.

[attachment deleted by admin]
Title: Re: MASM movlps problem?
Post by: dedndave on June 03, 2009, 08:46:40 PM
Prescott dual-core
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
480     cycles for movdqa               (src + dest aligned 16)
3303    cycles for movlps/movhps        srcalign=7, destalign=9
3247    cycles for movdqu               srcalign=7, destalign=9
3590    cycles for rep movsd            srcalign=7, destalign=9
3292    cycles for fild/fistp           srcalign=7, destalign=9

483     cycles for movdqa               (src + dest aligned 16)
3336    cycles for movlps/movhps        srcalign=7, destalign=9
3276    cycles for movdqu               srcalign=7, destalign=9
3572    cycles for rep movsd            srcalign=7, destalign=9
3287    cycles for fild/fistp           srcalign=7, destalign=9

486     cycles for movdqa               (src + dest aligned 16)
3306    cycles for movlps/movhps        srcalign=7, destalign=9
3292    cycles for movdqu               srcalign=7, destalign=9
3582    cycles for rep movsd            srcalign=7, destalign=9
3342    cycles for fild/fistp           srcalign=7, destalign=9

big numbers   :(
Title: Re: MASM movlps problem?
Post by: jj2007 on June 04, 2009, 07:28:11 AM
Quote from: dedndave on June 03, 2009, 08:46:40 PM
Prescott dual-core
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
big numbers   :(
Small numbers when aligned to 8 bytes :bg
And again, the movlps/movhps pair beats them all...

Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
586     cycles for movdqa               (src + dest aligned 16)
712     cycles for movlps/movhps        srcalign=8, destalign=8
1377    cycles for movdqu               srcalign=8, destalign=8
802     cycles for rep movsd            srcalign=8, destalign=8
1175    cycles for fild/fistp           srcalign=8, destalign=8

520     cycles for movdqa               (src + dest aligned 16)
758     cycles for movlps/movhps        srcalign=8, destalign=8
1323    cycles for movdqu               srcalign=8, destalign=8
845     cycles for rep movsd            srcalign=8, destalign=8
1160    cycles for fild/fistp           srcalign=8, destalign=8

505     cycles for movdqa               (src + dest aligned 16)
797     cycles for movlps/movhps        srcalign=8, destalign=8
1242    cycles for movdqu               srcalign=8, destalign=8
798     cycles for rep movsd            srcalign=8, destalign=8
1047    cycles for fild/fistp           srcalign=8, destalign=8
Title: Re: MASM movlps problem?
Post by: sinsi on June 04, 2009, 07:46:44 AM

Intel(R) Core(TM)2 Quad CPU    Q6600  @ 2.40GHz (SSE4)
216     cycles for movdqa               (src + dest aligned 16)
1241    cycles for movlps/movhps        srcalign=7, destalign=9
1408    cycles for movdqu               srcalign=7, destalign=9
1485    cycles for rep movsd            srcalign=7, destalign=9
1458    cycles for fild/fistp           srcalign=7, destalign=9

212     cycles for movdqa               (src + dest aligned 16)
1242    cycles for movlps/movhps        srcalign=7, destalign=9
1403    cycles for movdqu               srcalign=7, destalign=9
1479    cycles for rep movsd            srcalign=7, destalign=9
1456    cycles for fild/fistp           srcalign=7, destalign=9

213     cycles for movdqa               (src + dest aligned 16)
1239    cycles for movlps/movhps        srcalign=7, destalign=9
1402    cycles for movdqu               srcalign=7, destalign=9
1479    cycles for rep movsd            srcalign=7, destalign=9
1462    cycles for fild/fistp           srcalign=7, destalign=9

big numbers indeed...


213     cycles for movdqa               (src + dest aligned 16)
312     cycles for movlps/movhps        srcalign=8, destalign=8
614     cycles for movdqu               srcalign=8, destalign=8
280     cycles for rep movsd            srcalign=8, destalign=8
417     cycles for fild/fistp           srcalign=8, destalign=8

213     cycles for movdqa               (src + dest aligned 16)
313     cycles for movlps/movhps        srcalign=8, destalign=8
614     cycles for movdqu               srcalign=8, destalign=8
280     cycles for rep movsd            srcalign=8, destalign=8
417     cycles for fild/fistp           srcalign=8, destalign=8

213     cycles for movdqa               (src + dest aligned 16)
313     cycles for movlps/movhps        srcalign=8, destalign=8
618     cycles for movdqu               srcalign=8, destalign=8
282     cycles for rep movsd            srcalign=8, destalign=8
417     cycles for fild/fistp           srcalign=8, destalign=8

better  :bg
Title: Re: MASM movlps problem?
Post by: jj2007 on June 04, 2009, 08:47:50 AM
And even better... we have a new champion for MemCopy and addresses that are not aligned 16:

mov esi, offset src
mov edi, offset dest
mov ecx, 100
.Repeat
movdqu xmm0, [esi] ; read src
lea esi, [esi+16]
movlps qword ptr [edi], xmm0 ; write dest
movhps qword ptr [edi+8], xmm0
dec ecx
lea edi, [edi+16]
.Until Zero?


The trick is reading with movdqu and writing with movlps/movhps...

Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
488     cycles for movdqa               (src + dest aligned 16)
671     cycles for movdqu+movlps        srcalign=8, destalign=8
714     cycles for movlps/movhps        srcalign=8, destalign=8
1240    cycles for movlps+movdqu        srcalign=8, destalign=8
1256    cycles for movdqu               srcalign=8, destalign=8
818     cycles for rep movsd            srcalign=8, destalign=8
1060    cycles for fild/fistp           srcalign=8, destalign=8

484     cycles for movdqa               (src + dest aligned 16)
3107    cycles for movdqu+movlps        srcalign=7, destalign=9
3118    cycles for movlps/movhps        srcalign=7, destalign=9
3163    cycles for movlps+movdqu        srcalign=7, destalign=9
3151    cycles for movdqu               srcalign=7, destalign=9
3186    cycles for rep movsd            srcalign=7, destalign=9
3203    cycles for fild/fistp           srcalign=7, destalign=9


For addresses not aligned to 8 bytes, it is only marginally slower than the others, though.
:bg

[attachment deleted by admin]
Title: Re: MASM movlps problem?
Post by: UtillMasm on June 04, 2009, 09:10:12 AM
 :U
Genuine Intel(R) CPU           T2400  @ 1.83GHz (SSE3)
316     cycles for movdqa               (src + dest aligned 16)
411     cycles for movdqu+movlps        srcalign=8, destalign=8
408     cycles for movlps/movhps        srcalign=8, destalign=8
714     cycles for movlps+movdqu        srcalign=8, destalign=8
723     cycles for movdqu               srcalign=8, destalign=8
462     cycles for rep movsd            srcalign=8, destalign=8
1219    cycles for fild/fistp           srcalign=8, destalign=8

361     cycles for movdqa               (src + dest aligned 16)
422     cycles for movdqu+movlps        srcalign=8, destalign=8
428     cycles for movlps/movhps        srcalign=8, destalign=8
712     cycles for movlps+movdqu        srcalign=8, destalign=8
719     cycles for movdqu               srcalign=8, destalign=8
469     cycles for rep movsd            srcalign=8, destalign=8
1222    cycles for fild/fistp           srcalign=8, destalign=8

309     cycles for movdqa               (src + dest aligned 16)
427     cycles for movdqu+movlps        srcalign=8, destalign=8
424     cycles for movlps/movhps        srcalign=8, destalign=8
737     cycles for movlps+movdqu        srcalign=8, destalign=8
726     cycles for movdqu               srcalign=8, destalign=8
481     cycles for rep movsd            srcalign=8, destalign=8
1273    cycles for fild/fistp           srcalign=8, destalign=8

310     cycles for movdqa               (src + dest aligned 16)
420     cycles for movdqu+movlps        srcalign=8, destalign=8
417     cycles for movlps/movhps        srcalign=8, destalign=8
720     cycles for movlps+movdqu        srcalign=8, destalign=8
713     cycles for movdqu               srcalign=8, destalign=8
459     cycles for rep movsd            srcalign=8, destalign=8
1256    cycles for fild/fistp           srcalign=8, destalign=8


--- ok ---