64-bit and SSE2

Started by Seb, November 20, 2006, 06:06:01 PM

Seb

Hey again.

I'm looking for 64-bit assembler programming tutorials and SSE2 programming tutorials, and was hoping some of you were sitting on a fine piece of work you'd like to share with me. To clarify, both "32-bit to 64-bit" and pure 64-bit tutorials are welcome; I'm interested in anything related to 64-bit and SSE2 programming. Thanks.

Regards,
Seb

dsouza123

An extremely short SSE2 overview.

It works with either integers or floating point values.

The registers are 128 bits wide, and there are eight of them:
xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7

Mainly doing various math operations in parallel on the following data types :

Integer types
  16 bytes
  8 words
  4 dwords
  2 qwords

Floating point types
  4 single precision
  2 double precision

And a few bit manipulation operations on 128 bit values (oword),
along with some transfer and compare instructions.

If your algorithm allows parallel operations on the data types described,
with no dependencies between them, then SSE2 can produce a decent speedup.

The integer SSE2 instructions largely duplicate the MMX instructions,
but work on 128-bit registers instead of MMX's 64-bit ones.
SSE2 also has some extra ones that MMX doesn't have.
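
To make the overview above concrete, here is a minimal sketch in C using compiler intrinsics rather than MASM, so it can be built and checked with an ordinary compiler. The function names are made up for illustration. Each intrinsic stands for the instruction named in its comment, and on 32-bit x86 GCC or Clang would need -msse2.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (also pulls in the SSE ones) */

/* Add four single-precision values at once: one addps does all four lanes. */
void add4_single(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);             /* movups: load 4 floats */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));  /* addps, then store 4 floats */
}

/* Add two double-precision values at once: one addpd does both lanes. */
void add2_double(const double *a, const double *b, double *out)
{
    __m128d va = _mm_loadu_pd(a);            /* movupd: load 2 doubles */
    __m128d vb = _mm_loadu_pd(b);
    _mm_storeu_pd(out, _mm_add_pd(va, vb));  /* addpd, then store 2 doubles */
}
```

The same 128-bit register holds either four singles or two doubles; the instruction chosen decides how the bits are interpreted.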

Seb

Hi dsouza,

thanks for your answer. What about the "packing" and "unpacking" instructions? What are they meant for?

Regards,
Seb

dsouza123

Packed means that several values of the same data type sit side by side in one SSE2 register and are worked on in parallel.

example

psubb  subtract packed bytes

psubb xmm0, xmm7  ; dest = dest - source  ; subtract each of the 16 bytes in xmm7 from xmm0  ; xmm0 = xmm0 - xmm7

This is equivalent to the following ALU instruction on a pair of byte registers, but done on 16 bytes in parallel (with no carries or borrows between the bytes):

sub dl, al  ; dl = dl - al
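
The "no carries or borrows" point can be checked with a small C sketch using the intrinsic that corresponds to psubb. The function name is hypothetical and the code is only an illustration, not part of the original answer.

```c
#include <emmintrin.h>
#include <stdint.h>

/* dst[i] = dst[i] - src[i] for 16 bytes at once; every byte wraps on its own. */
void sub16_bytes(uint8_t *dst, const uint8_t *src)
{
    __m128i d = _mm_loadu_si128((const __m128i *)dst);  /* movdqu: load 16 bytes */
    __m128i s = _mm_loadu_si128((const __m128i *)src);
    d = _mm_sub_epi8(d, s);                              /* psubb */
    _mm_storeu_si128((__m128i *)dst, d);                 /* movdqu: store 16 bytes */
}
```

With dst[0] = 0, dst[1] = 5 and src[0] = src[1] = 1, the result is dst[0] = 255 and dst[1] = 4. A 16-bit subtraction over the same two bytes would have borrowed from the second byte and left 3 there.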

dsouza123

In 64-bit mode there are 8 more 128-bit SSE2 registers, xmm8..xmm15.

The 64-bit ALU also gets 8 more registers, r8..r15 (64 bit).
The original 8 ALU registers have 64-bit versions (rax, rbx, rcx, etc.),
so rax, for example, can be accessed as rax, eax, ax, ah, al.
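
C has no register names, so the following sketch only mimics which bits of a 64-bit value each sub-register name covers. The view_* helpers are hypothetical names used for illustration.

```c
#include <stdint.h>

/* Each helper returns the bits that the matching sub-register name exposes. */
uint32_t view_eax(uint64_t rax) { return (uint32_t)rax; }         /* bits 0..31 */
uint16_t view_ax(uint64_t rax)  { return (uint16_t)rax; }         /* bits 0..15 */
uint8_t  view_al(uint64_t rax)  { return (uint8_t)rax; }          /* bits 0..7  */
uint8_t  view_ah(uint64_t rax)  { return (uint8_t)(rax >> 8); }   /* bits 8..15 */
```

For rax = 0x1122334455667788 this gives eax = 0x55667788, ax = 0x7788, al = 0x88 and ah = 0x77.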

Seb

Hi dsouza,

thanks once again for your answer. It sure cleared up a few things for me! Keep up the good work!

Regards,
Seb

Mark_Larson


I've done a lot of SSE and SSE2 programming over the years. I have
an optimization website that goes over some basic tricks to speed up
code with SSE/SSE2 (along with other tricks).
http://www.mark.masmcode.com/

On the Intel side, P4s and up run SSE/SSE2 code very fast, and
I've used that advantage a lot to make code run extremely fast.

converting a string to a qword using SSE2
http://www.oldboard.assemblercode.com/index.php?topic=4253.msg28940#m...

SSE2 quaternion multiply
http://www.oldboard.assemblercode.com/index.php?topic=3469.0

Mersenne Twister Random Number Generator in SSE2
http://www.oldboard.assemblercode.com/index.php?topic=3565.0

My account on masmforum got messed up (all these links are for
masmforum), so some messages will say they are from hutch-- instead of
marklarson. The way to tell it's really me is that it'll say "guest"
under "hutch--".

Counting the number of lines in a file using SSE2
http://www.oldboard.assemblercode.com/index.php?topic=2692.msg18800#m...

string copy using SSE2
http://www.oldboard.assemblercode.com/index.php?topic=2632.msg18047#m...

Computing MD5 using SSE2
http://www.oldboard.assemblercode.com/index.php?topic=2921.0

I am working on a raytracer that I haven't finished yet. You can use
scalar SSE code just like FP code (you don't do stuff in parallel;
each operation works on a single floating point value).
Scalar code is faster on a P4 (not sure about AMD).
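
As a rough illustration of scalar SSE (not the raytracer code itself), here is a small C sketch with intrinsics. The function name is hypothetical. The _ss intrinsics map to mulss and addss, which only operate on the lowest lane of the register.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Computes a * b + c with scalar SSE instructions instead of the FPU. */
float scalar_madd(float a, float b, float c)
{
    __m128 va = _mm_set_ss(a);   /* a in lane 0, upper lanes zero */
    __m128 vb = _mm_set_ss(b);
    __m128 vc = _mm_set_ss(c);
    __m128 r  = _mm_add_ss(_mm_mul_ss(va, vb), vc);  /* mulss, addss: lane 0 only */
    return _mm_cvtss_f32(r);     /* read lane 0 back as a float */
}
```

Swapping the _ss intrinsics for their _ps counterparts turns the same arithmetic into the four-lane packed form.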

http://www.masm32.com/board/index.php?topic=1140.0

Line counting again, but here I actually have 2 different versions using
2 different algorithms. If you scroll down, the second one posted is
done in a non-intuitive manner.

http://www.masm32.com/board/index.php?topic=5434.msg40666#msg40666
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
http://www.website.masmforum.com/mark/index.htm

Seb

Hi Mark,

thanks a LOT for those links. I've bookmarked them all now.

If anyone else has any good links to provide, don't hesitate to post them here. Thanks.

Regards,
Seb

Alloy

Along with the extra, larger registers, SSE2, unlike SSE1, can directly copy data between the XMM, MMX, general-purpose and some flag registers. It can be any kind of data as long as it fits into the register. If done carefully, this gives useful extra storage for variables without using cache, stack or memory outside the CPU. Be careful of the zero extending that some instructions do automatically, for no apparent good reason, when moving data from register to register.
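
A minimal C sketch of the register-to-register copy and the zero extension mentioned above, using the intrinsics that correspond to movd. The function names are hypothetical and only serve the illustration.

```c
#include <emmintrin.h>
#include <stdint.h>

/* Park a 32-bit value in an XMM register and read it back. */
uint32_t roundtrip_via_xmm(uint32_t value)
{
    __m128i x = _mm_cvtsi32_si128((int)value);   /* movd xmm, r32 */
    return (uint32_t)_mm_cvtsi128_si32(x);       /* movd r32, xmm */
}

/* Dump all four dwords of the register after movd xmm, r32.
   Lane 0 holds the value; the upper 96 bits have been zeroed. */
void lanes_after_movd(uint32_t value, uint32_t lanes[4])
{
    __m128i x = _mm_cvtsi32_si128((int)value);   /* movd xmm, r32 */
    _mm_storeu_si128((__m128i *)lanes, x);       /* movdqu: store 16 bytes */
}
```

Because the move clears the rest of the register, anything kept in the upper lanes has to be saved or merged back explicitly.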
We all used to be something else. Nature has always recycled.

Seb

Hi Alloy,

thanks for the tips. I'll take that last one into consideration when writing SSE2 code.

Regards,
Seb

Alloy

Quote from: Mark_Larson on November 21, 2006, 04:22:49 PM

I am working on a raytracer that I haven't finished yet.

How far along are you on the ray tracer, Mark? I've been looking over some of the newer techniques lately and might try to convert some code to asm next year. I need to relearn a lot since I haven't touched a calculus or differential equations book in almost 20 years.

FlySky

Hey there guys,

I have a question; I am new to SSE2. How do I use the move instructions to get values into xmm0 - xmm7? I can't figure it out. For example, movss [xmm3], 0 keeps giving me errors like "must be index or base register".

dsouza123


.686
.model flat,stdcall
option casemap:none
.xmm

include \masm32\include\windows.inc
include \masm32\include\kernel32.inc
include \masm32\include\user32.inc
includelib \masm32\lib\kernel32.lib
includelib \masm32\lib\user32.lib

.data
align 16
  fourSP real4 0.0, 1.0, 2.0, 3.0  ; real4 is a 32 bit single precision floating point value

.code
start:
   movss  xmm0, fourSP  ; xmm0 == 0.0 0.0 0.0 0.0              get one  real4 (upper three zeroed)
   movaps xmm1, fourSP  ; xmm1 == 3.0 2.0 1.0 0.0      aligned get four real4
   movups xmm2, fourSP  ; xmm2 == 3.0 2.0 1.0 0.0    unaligned get four real4
   invoke ExitProcess, NULL
end start
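
On the error in the question: the brackets in movss [xmm3], 0 ask the assembler to use xmm3 as a memory address, and XMM registers cannot serve as base or index registers. SSE moves also take no immediate operand, so constants come from memory, and zero is usually made with xorps or pxor. The sketch below shows the same loads in C with intrinsics; the function names are hypothetical.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* movss xmm, m32: loads one float into lane 0 and zeroes lanes 1..3. */
void load_one(const float *src, float out[4])
{
    _mm_storeu_ps(out, _mm_load_ss(src));
}

/* movups xmm, m128: loads four floats, with no alignment requirement. */
void load_four(const float *src, float out[4])
{
    _mm_storeu_ps(out, _mm_loadu_ps(src));
}

/* A zeroed register; compilers usually emit xorps xmm, xmm for this. */
void zero_register(float out[4])
{
    _mm_storeu_ps(out, _mm_setzero_ps());
}
```

The aligned form, movaps (_mm_load_ps), works the same way but faults if the address is not a multiple of 16, which is why the listing above puts align 16 before fourSP.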

Mark_Larson

Quote from: Alloy on November 25, 2006, 05:29:31 AM
Quote from: Mark_Larson on November 21, 2006, 04:22:49 PM

   I am working on a raytracer that I haven't finished yet.

How far along are you on the ray tracer, Mark? I've been looking over some of the newer techniques lately and might try to convert some code to asm next year. I need to relearn a lot since I haven't touched a calculus or differential equations book in almost 20 years.

  I have working code in C that raytraces a spinning earth (it has an earth BMP), with a moon that circles the earth.  There is one light in the scene (the moon).

  I am converting it piece by piece; the piece above was my conversion of the ray/sphere intersection routine.  I spent a lot of time downloading different raytracing tutorials.  I have ones that go over kd-trees and other scene management tools, which I will add much later.  I am also planning on shooting 4 rays at once using SSE2, and I want to add threading support.  I chose scalar SSE for the ray/sphere intersection code so that it would be easy to convert to 4 parallel rays.  I am using Intel's Approximate Math Library to do the trig functions using SSE registers.  They have it running faster than standard sin/cos through the FPU.  I'm going to interpolate the scene instead of drawing every pixel: I check every Nth pixel, and if it is within a certain color range of the previous Nth pixel, I plot it using the previous Nth pixel's color.  By default I am doing a 4x4 grid, so I am only drawing every 16th pixel. If it looks ugly, I'll drop back and try 2x2.
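
For readers wondering what "4 rays at once" could look like, here is a hedged C sketch of a ray/sphere test on a packet of four rays. It is not the routine from this post: the RayPacket4 layout and the hit_sphere4 name are assumptions, directions are assumed to be unit length, and ray origins are assumed to lie outside the sphere. The packed single-precision operations used here (mulps, addps, sqrtps and so on) come from the original SSE set.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Four rays in structure-of-arrays form: lane i of each field is ray i. */
typedef struct {
    __m128 ox, oy, oz;   /* origins */
    __m128 dx, dy, dz;   /* unit-length directions */
} RayPacket4;

/* Returns a 4-bit mask; bit i is set when ray i hits the sphere in front of it. */
int hit_sphere4(const RayPacket4 *r, float cx, float cy, float cz, float radius)
{
    /* oc = origin - centre, for all four rays */
    __m128 ocx = _mm_sub_ps(r->ox, _mm_set1_ps(cx));
    __m128 ocy = _mm_sub_ps(r->oy, _mm_set1_ps(cy));
    __m128 ocz = _mm_sub_ps(r->oz, _mm_set1_ps(cz));

    /* b = dot(d, oc) */
    __m128 b = _mm_add_ps(_mm_add_ps(_mm_mul_ps(r->dx, ocx),
                                     _mm_mul_ps(r->dy, ocy)),
                          _mm_mul_ps(r->dz, ocz));
    /* c = dot(oc, oc) - radius^2 */
    __m128 c = _mm_sub_ps(_mm_add_ps(_mm_add_ps(_mm_mul_ps(ocx, ocx),
                                                _mm_mul_ps(ocy, ocy)),
                                     _mm_mul_ps(ocz, ocz)),
                          _mm_set1_ps(radius * radius));

    /* t^2 + 2bt + c = 0, so disc = b^2 - c and the nearest root is -b - sqrt(disc) */
    __m128 zero  = _mm_setzero_ps();
    __m128 disc  = _mm_sub_ps(_mm_mul_ps(b, b), c);
    __m128 real  = _mm_cmpge_ps(disc, zero);               /* line meets the sphere */
    __m128 root  = _mm_sqrt_ps(_mm_max_ps(disc, zero));    /* clamp to avoid NaN lanes */
    __m128 t     = _mm_sub_ps(_mm_sub_ps(zero, b), root);
    __m128 front = _mm_cmpgt_ps(t, zero);                  /* hit lies ahead of the origin */

    return _mm_movemask_ps(_mm_and_ps(real, front));       /* movmskps: one bit per ray */
}
```

Keeping x, y and z in separate registers means every instruction does useful work in all four lanes, which is the same reasoning as writing the scalar version first and widening it later.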

Mark