Hello everybody. I've been a little busy recently with
learning C, so my MASM32 experiments are waiting
for available time.
But I want to do some experiments nevertheless, and
because I've already done some ADD, MOV, dereferencing
and so on, I'm wondering what else I can do, what
kind of instructions I can try, and what registers are available
on my machine, other than the 8 general-purpose 32-bit registers and
their sub-registers, so to speak, the FLAGS register and the stack registers.
According to the Intel Processor Identification Utility, my PC
has a Core 2 Duo CPU E6600 2.40 GHz, which is
an x64-class processor and can use SSE3 instructions.
I'm asking, of somebody who knows better than me,
for some basic information about how many registers there are
in that machine and of what type, what they are used for,
whether I can move data from EAX, for example, to an MMX register,
and things like that.
I think all this info is available in the Intel manuals, but before
diving into them and getting lost, I'd like a general explanation,
if you can help me or give me a link to this info.
One wonderful thing would be a complete 3-line example of
the general idea:
««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««« *
.686 ; create 32 bit code [?]
.model flat, stdcall ; 32 bit memory model
option casemap :none ; case sensitive
include \masm32\include\windows.inc ; always first
include \masm32\macros\macros.asm ; MASM support macros
; -----------------------------------------------------------------
; include files that have MASM format prototypes for function calls
; -----------------------------------------------------------------
include \masm32\include\masm32.inc
include \masm32\include\gdi32.inc
include \masm32\include\user32.inc
include \masm32\include\kernel32.inc
; ------------------------------------------------
; Library files that have definitions for function
; exports and tested reliable prebuilt code.
; ------------------------------------------------
includelib \masm32\lib\masm32.lib
includelib \masm32\lib\gdi32.lib
includelib \masm32\lib\user32.lib
includelib \masm32\lib\kernel32.lib
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start: ; The CODE entry point to the program
some code here to add or move a register X64 to another
and print the result, or the use of MMX, EMMX and the like
exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start ; Tell MASM where the program ends
I mean something I can assemble with the masm32 package. I'm using
ML from VS2010, so it should assemble the latest available instructions, I think.
Thanks for your patience.
Frank
As long as you run a 32-bit OS and/or target a 32-bit executable, you will not be able to access the additional 64-bit general-purpose registers. Basically, when the CPU is in 32-bit mode or executes a 32-bit executable, the x64 registers are off limits even if your CPU is x64 capable.
To put it bluntly, in 32 bits you do not have access to RAX, RCX, RDX, ... or R8 to R15, nor to XMM8 ... XMM15. You can only use EAX, ECX, EDX, EBX, ESP, EBP, ESI, EDI and XMM0 to XMM7, even if your CPU is x64 capable. Access to the FPU and MMX registers is much the same in 32-bit and 64-bit code.
Assuming that you run a 64-bit OS like Windows 7 x64, you could use the x64 general-purpose registers, but only if you build a PE32+ (in fact PE64) executable.
The MASM assembler provided in the MASM32 package cannot do this. You will have to use the 64-bit version of ML. Unfortunately, the 64-bit version of ML does not yet support invoke and many of the more advanced features of the 32-bit ML, like .IF, .ELSEIF, .WHILE, etc.
Hence you can try JWASM, or GoASM, or humbly my own assembler: SOL_ASM.
JWASM is the most compatible with MASM/MASM32.
GoASM is kind of different.... too much for my taste, but some people here swear by it and it is part of these forums.
Sol_ASM is somewhere in the middle (you have invoke like in MASM, but other things like the include/structures/db/PROC format are more like in TASM).
Then on other sites you can also find FASM or NASM or YASM, with even more syntax differences when compared to MASM.
I suggest that you gain confidence in the 32-bit world with MASM32 and move to x64 only later, because the transition is not exactly an easy ride, though also not very complicated once you know 32 bits well enough. Anyway, you might get confused unless you have a solid 32-bit conceptual base to fall back on.
IMHO JWASM is your best option now if you want to try x64 with a MASM32-like syntax. Or, as a biased opinion, my own Sol_Asm :D
As for moving data between the GPR registers and the FPU/MMX/XMM registers, some restrictions do apply, but you will just have to learn them... your question is too vague to be answered briefly and clearly.
Thanks, BogdanOntanu.
My machine is X64 and my OS is WIN7/64 bit, but I'm not trying
to shift to 64 bit Assembly for the time being. I'm just a beginner
so I'm going to stay on 32 bit MASM for a while.
According to Wikipedia:
Quote
MMX defined eight registers, known as MM0 through MM7 (henceforth referred to as MMn). To avoid compatibility problems with the context switch mechanisms in existing operating systems, these registers were aliases for the existing x87 FPU stack registers (so no new registers needed to be saved or restored). Hence, anything that was done to the floating point stack would also affect the MMX registers and vice versa. However, unlike the FP stack, the MMn registers are directly addressable (random access).
Each of the MMn registers holds 64 bits (the mantissa-part of a full 80-bit FPU register). The main usage of the MMX instruction set is based on the concept of packed data types, which means that instead of using the whole register for a single 64-bit integer, two 32-bit integers, four 16-bit integers, or eight 8-bit integers may be processed concurrently.
The mapping of the MMX registers onto the existing FPU registers made it somewhat difficult to work with floating point and SIMD data in the same application. To maximize performance, programmers often used the processor exclusively in one mode or the other, deferring the relatively slow switch between them as long as possible.
Because the FPU stack registers are 80 bits wide, the upper 16 bits of the stack registers go unused in MMX, and these bits are set to all ones, which makes them NaNs or infinities in the floating point representation. This can be used to decide whether a particular register's content is intended as floating point or SIMD data.
MMX provides only integer operations. When originally developed, for the Intel i860, the use of integer math made sense (both 2D and 3D calculations required it), but as graphics cards that did much of this became common, integer SIMD in the CPU became somewhat redundant for graphical applications. On the other hand, the saturation arithmetic operations in MMX could significantly speed up some digital signal processing applications.
I've got some 8 registers I never use, but that could be useful
for integer operations, or just to store data, or using SSE instructions
on the 64 bit area they provide.
You are right, the question is too vague; I need to express my
idea better.
Let's make a couple of practical examples:
if I run short of 32 bit GP registers, how do I use the MMX registers
to store the content of some 32 bit GP register?
Is it better to push them on the stack and pop them afterwards?
If so, why?
When I have to deal with 4 16 bit integer numbers and I want
to perform some SIMD on them, like adding 1 to each of them
is it possible to do it with an MMX register?
I hope I am a little bit clearer now. I'm just trying to figure out
what I can do with the MMX registers, which have existed since the Pentium MMX,
without moving to X64 Assembly. It is not time yet :P
Quote from: BogdanOntanu on July 15, 2010, 05:27:48 PM
Basically, when the CPU is in 32-bit mode or executes a 32-bit executable, the x64 registers are off limits even if your CPU is x64 capable.
Hey Bogdan, I was wondering: is this an OS design choice or a CPU setting? At the lowest level (OS) you could have switching, right? Rather than having to have 2 separate exes like now, though this would have to be factored into the design of the PE equivalents.
Also, does SOL OS work off the PE format or do you have your own format?
google for Tommesani SSE2 - best intro you can get.
Quote from: jj2007 on July 15, 2010, 05:59:00 PM
google for Tommesani SSE2 - best intro you can get.
Thanks JJ, I'll have a look. :8)
Could anyone post a 3-line example as well?
I mean, is the .686 directive to MASM necessary?
Do I have to declare some other directive to use the MMX registers and the SIMD/
SSE/SSE2/SSE3 instructions?
Thanks
Quote from: frktons on July 15, 2010, 05:45:19 PM
...
Thanks, BogdanOntanu.
My machine is X64 and my OS is WIN7/64 bit, but I'm not trying
to shift to 64 bit Assembly for the time being. I'm just a beginner
so I'm going to stay on 32 bit MASM for a while.
If you already run the x64 version of Windows 7 (I also do), then you could concentrate on 32 bits for learning and once in a while test some x64 code, just to get your skills introduced to the "new" x64 world...
At least that is what I do: I keep my main focus on 32 bits for now, but I also take long and deep excursions into the x64 world and test programs / applications whenever I feel like it.
Quote
I've got some 8 registers I never use, but that could be useful
for integer operations, or just to store data, or using SSE instructions
on the 64 bit area they provide.
Yes, most of today's machines have the extra MMX and XMM registers. The problem with MMX is that the registers are aliased over the FPU registers, and the FPU is also needed by many of today's applications. If you can keep yourself restrained to the integer world, then you could use MMX...
However, using XMM is a much better choice... personally I kind of ignore MMX and go directly for XMM when I want to use SSE.
Quote
You are right, the question is too vague; I need to express my
idea better.
Let's make a couple of practical examples:
if I run short of 32 bit GP registers, how do I use the MMX registers
to store the content of some 32 bit GP register?
Is it better to push them on the stack and pop them afterwards?
If so, why?
Come on... such questions are naive at best. You cannot gain this kind of experience by asking "what is better and why - make me a list" kind of questions :D
First you need to read the Intel manuals and some tutorials (as proposed here by jj2007) on these SSE instructions.
Then try some simple hands-on tests... then you will understand the basics and gain the much-needed neuronal paths and experience, and then you can come up with much more relevant questions if you hit some roadblock, or, if no questions arise, improve your simple hands-on tests iteratively (by adding complexity).
If you get a predigested answer, then you have data but you have not improved internally. It might be useful in a robotic or production way, but it is not useful for your future neuronal development.
Simple hands-on tests and coming back with more exact questions is a better way to learn IMHO.
However, in order to hint at your questions conceptually:
1) How do you propose to push a 64-bit register on a 32-bit stack?
2) Another thing to note is the Single Instruction Multiple Data aspect: MMX/SSE instructions usually operate on multiple smaller data items packed together inside a single MMX/XMM register. It depends on your skills to prepare or to handle data packed this way.
3) They can also help you with "saturation", thus avoiding ".IF eax > 255 / mov eax, 255 / .ENDIF" kind of code, which can be time-consuming inside an inner loop.
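To make point 3 concrete, here is a minimal sketch of a saturating add, assuming the MASM32 package (masm32rt.inc, the print and str$ macros) as used elsewhere in this thread. The labels pixels, delta and result and their values are made up for illustration. PADDUSB adds eight unsigned bytes at once and clamps each result at 255 instead of wrapping around.

```asm
; Sketch only: unsigned saturating byte add with MMX (PADDUSB).
include \masm32\include\masm32rt.inc
.686
.mmx

.data
pixels  db 250, 100, 0, 255, 10, 20, 30, 40   ; eight unsigned bytes (illustrative values)
delta   db 10, 10, 10, 10, 10, 10, 10, 10     ; amount added to every byte
result  db 8 dup(0)

.code
start:
    movq    mm0, qword ptr pixels    ; load all eight bytes in one go
    paddusb mm0, qword ptr delta     ; 250+10 clamps to 255 instead of wrapping to 4
    movq    qword ptr result, mm0    ; store the eight results
    emms                             ; leave MMX state before calling other code

    movzx   eax, byte ptr result     ; first byte, clamped at 255
    print   str$(eax), " = first byte after the saturating add", 13, 10
    exit
end start
```

No compare or branch is needed per byte, which is where the gain inside an inner loop comes from. PADDUSW does the same for 16-bit words, and PADDSB / PADDSW are the signed variants.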
Quote
When I have to deal with 4 16 bit integer numbers and I want
to perform some SIMD on them, like adding 1 to each of them
is it possible to do it with an MMX register?
Better with an XMM register ;) but yes, generally you can do this, and with saturation.
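A minimal sketch of the "add 1 to each of four 16-bit integers" case with an MMX register, again assuming the MASM32 package. The labels values and ones are hypothetical names chosen for this example.

```asm
; Sketch only: add 1 to four packed 16-bit integers with PADDW.
include \masm32\include\masm32rt.inc
.686
.mmx

.data
values  dw 10, 20, 30, 40        ; four 16-bit integers (illustrative values)
ones    dw 1, 1, 1, 1            ; one increment per 16-bit lane

.code
start:
    movq    mm0, qword ptr values    ; four words fill one 64-bit MMX register
    paddw   mm0, qword ptr ones      ; four additions in a single instruction
    movq    qword ptr values, mm0    ; write the four results back
    emms                             ; leave MMX state before calling other code

    movzx   eax, word ptr values     ; first lane: 10 + 1
    print   str$(eax), " "
    movzx   eax, word ptr values+6   ; last lane: 40 + 1
    print   str$(eax), 13, 10
    exit
end start
```

With SSE2 the same PADDW mnemonic also accepts XMM registers and then works on eight words per instruction; PADDSW and PADDUSW are the saturating forms.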
Quote
I hope I am a little bit clearer now. I'm just trying to figure out
what I can do with the MMX registers, which have existed since the Pentium MMX,
without moving to X64 Assembly. It is not time yet :P
x64 will give you access to the extra XMM8...XMM15 registers, but otherwise the instructions behave pretty much the same as in 32 bits.
Since you already use an x64 OS, you do not have to restrain yourself because of this.
The real problems with x64 are that it uses another calling convention and that the tools are not mature enough yet (for example, OllyDbg does not work), and for a beginner it is of no use to deal with 2 (two) problems at the same time. You would not know whether a problem comes from x64 or from your handling of SSE.
However, as I have said above, by all means do take a peek into x64 since you are there ;)
004014F8 0F6EC1 movd mm0,ecx
004014FB 0F7EC1 movd ecx,mm0
http://www.tommesani.com/MMXDataTransfer.html
Quote from: frktons on July 15, 2010, 06:41:53 PM
Could anyone post a 3-line example as well?
Not me :))
Quote
I mean, is the .686 directive to MASM necessary?
Do I have to declare some other directive to use the MMX registers and the SIMD/
SSE/SSE2/SSE3 instructions?
Thanks
Fast search on the forums reveals: .XMM
Quote from: oex on July 15, 2010, 05:52:11 PM
Hey Bogdan, I was wondering: is this an OS design choice or a CPU setting?
It is a CPU design setting. A design choice by AMD.
Quote
At the lowest level (OS) you could have switching, right?
Not sure what you are asking... but if I guess right, the answer is NO.
However, you do not have to, since you can run both 32-bit and 64-bit executables on a 64-bit OS. You just cannot mix them because the CPU forbids it.
Quote
Rather than having to have 2 separate exes like now, though this would have to be factored into the design of the PE equivalents.
This is more an issue with the CPU than with PE. Hence not possible.
Quote
Also, does SOL OS work off the PE format or do you have your own format?
For now SOL_OS can load, map and run PE32 files for compatibility reasons (with existing compilers and tools). You can also load and run plain binary files under certain circumstances (older interfaces).
However, in SOL_OS you (the programmer) have full control over the machine, and hence there is nothing blocking you from switching the CPU into x64 mode, using the 64-bit registers, and then returning to 32-bit mode and/or loading and running your own favorite executable.
I do plan to use my own executable format later but this is not a priority at this moment (a format is pre-designed)
However:
1) One needs experience and deep understanding of existing formats in order to invent another "better" format.
2) There is a huge burden that a new executable format would place on potential developers and tool chains.
Note: this is kind of off-topic for the OP's question.
Hence if you intend to ask further questions on SOL_OS then you can do so on SOL_OS forums or here BUT in another thread. Emails or PM are not recommended with me :D
ty for reply it answers my questions :)
Quote from: BogdanOntanu on July 15, 2010, 07:11:09 PM
Quote
When I have to deal with 4 16 bit integer numbers and I want
to perform some SIMD on them, like adding 1 to each of them
is it possible to do it with an MMX register?
Better with an XMM register
Google for
mmx fpu emms to understand the reason.
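In short, the MMX registers share storage with the x87 FPU stack, and every MMX instruction except EMMS marks all eight x87 registers as in use. The sketch below, which assumes the MASM32 package and uses made-up labels (three, four, sum), shows where EMMS belongs when MMX code is followed by FPU code.

```asm
; Sketch only: EMMS between MMX code and x87 FPU code.
include \masm32\include\masm32rt.inc
.686
.mmx

.data
three   REAL4 3.0
four    REAL4 4.0
sum     dq 0                ; receives the result as a 64-bit double

.code
start:
    mov     ecx, 12345
    movd    mm0, ecx        ; MMX use tags every x87 register as occupied
    movd    ecx, mm0
    emms                    ; tag them empty again so the FPU stack is usable

    fld     three           ; without the EMMS above, this load can overflow
    fadd    four            ;   the x87 stack and yield a NaN instead of 7.0
    fstp    sum
    invoke  crt_printf, cfm$("sum = %f\n"), sum
    exit
end start
```

The XMM registers are separate from the FPU stack, so SSE code needs no EMMS; that is one practical reason to prefer XMM over MMX.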
Thanks everybody.
All your suggestions will be meditated upon. :U
What I got so far from your pointers is:
1) I have to declare that I'm using MMX this way:
.686
option casemap:none
.mmx
.xmm
.model flat, stdcall
2) I can move data from a 32 bit register to the low 32 bits of an MMX
register and vice versa this way:
movd mm0,ecx
movd ecx,mm0
So a complete program could be:
««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««« *
.686 ; create 32 bit code
option casemap :none ; case sensitive
.mmx
.xmm
.model flat, stdcall ; 32 bit memory model
include \masm32\include\windows.inc ; always first
include \masm32\macros\macros.asm ; MASM support macros
; -----------------------------------------------------------------
; include files that have MASM format prototypes for function calls
; -----------------------------------------------------------------
include \masm32\include\masm32.inc
include \masm32\include\gdi32.inc
include \masm32\include\user32.inc
include \masm32\include\kernel32.inc
; ------------------------------------------------
; Library files that have definitions for function
; exports and tested reliable prebuilt code.
; ------------------------------------------------
includelib \masm32\lib\masm32.lib
includelib \masm32\lib\gdi32.lib
includelib \masm32\lib\user32.lib
includelib \masm32\lib\kernel32.lib
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start: ; The CODE entry point to the program
mov ecx, 12345
movd mm0,ecx
print str$(mm0)," value of mm0",13,10
movd ecx,mm0
print str$(ecx)," value of ecx",13,10
exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start ; Tell MASM where the program ends
Or do I still need something else? It doesn't assemble with masm32. :(
The general take away is that using MMX or SSE registers as additional scratch registers for the processor is not a great plan.
Specifically, they are designed to pull vector data out of memory and process it/them in parallel. Intel calls this SIMD (Single Instruction Multiple Data)
The Software Optimization Cookbook, Richard Gerber, Intel Press, ISBN 0-9712887-1-4
http://www.alibris.com/booksearch?qisbn=9780971288713&qwork=
A small example that assembles.
ml -Fl -c -coff test32.asm
TEST32.ASM
.686
.XMM
.MODEL FLAT
.CODE
_start:
mov ecx, 12345
movd mm0,ecx
movd ecx,mm0
ret
END _start
TEST32.LST
Microsoft (R) Macro Assembler Version 6.15.8803 07/15/10 15:01:09
test32.asm Page 1 - 1
.686
.XMM
.MODEL FLAT
00000000 .CODE
00000000 _start:
00000000 B9 00003039 mov ecx, 12345
00000005 0F 6E C1 movd mm0,ecx
00000008 0F 7E C1 movd ecx,mm0
0000000B C3 ret
END _start
OK clive, I got that.
What if I want to display the content of ecx and mm0?
Why is my example not assembling? What's wrong?
Quote from: clive on July 15, 2010, 07:58:15 PM
The general take away is that using MMX or SSE registers as additional scratch registers for the processor is not a great plan.
Specifically, they are designed to pull vector data out of memory and process it/them in parallel. Intel calls this SIMD (Single Instruction Multiple Data)
The Software Optimization Cookbook, Richard Gerber, Intel Press, ISBN 0-9712887-1-4
http://www.alibris.com/booksearch?qisbn=9780971288713&qwork=
Good to know. If I want faster code, I'll have to take that into account. :U
Well, one of the things MASM32 doesn't like is:
print str$(mm0)," value of mm0",13,10
maybe the print macro doesn't accept this kind of
data to be displayed.
What else?
Microsoft (R) Macro Assembler Version 10.00.30319.01
Copyright (C) Microsoft Corporation. All rights reserved.
Assembling: C:\masm32\examples\mmx_usage.asm
C:\masm32\examples\mmx_usage.asm(1) : error A2044:invalid character in file
C:\masm32\examples\mmx_usage.asm(37) : error A2034:must be in segment block
C:\masm32\examples\mmx_usage.asm(39) : error A2034:must be in segment block
C:\masm32\examples\mmx_usage.asm(40) : error A2034:must be in segment block
C:\masm32\examples\mmx_usage.asm(42) : error A2034:must be in segment block
C:\masm32\examples\mmx_usage.asm(49) : error A2006:undefined symbol : start
C:\masm32\examples\mmx_usage.asm(49) : error A2148:invalid symbol type in expres
sion : start
_
Assembly Error
Premere un tasto per continuare . . .
Well I found something else, I was missing .data and .code
Still something wrong:
C:\masm32\examples\mmx_usage.asm(1) : error A2044:invalid character in file
And the last one: I was missing a ";" at the very first line of comment.
Quote from: frktons on July 15, 2010, 08:23:47 PM
maybe the print macro doesn't accept this kind of
data to be displayed.
What else?
Well, many times in ASM you are on your own and you have to create your own tools and routines (unlike in HLLs).
MASM32 spoils you a little with its stock of ready-to-use macros and routines, BUT it is possible that the macros provided with MASM32 do not support printing 64-bit or 128-bit registers... after all, it is designed for 32 bits.
Hence there might be nothing else.
Use the example kindly provided by Clive and consider that maybe now is a good time for you to learn how to write your own simple routine to convert a 64-bit / 128-bit integer into its ASCII equivalent and print the resulting string on screen.
Alternatively, you could store the MMX / XMM register into a memory location / variable and then do two consecutive prints on its low and high parts in order to see the value on screen and avoid writing your own code ...
Quote from: frktons
C:\masm32\examples\mmx_usage.asm(37) : error A2034:must be in segment block
C:\masm32\examples\mmx_usage.asm(49) : error A2006:undefined symbol : start
Needs a .CODE, _start is probably what you should be using. Pretty sure you can't use MM0..MM7 for those STR$ macros, being as the registers live inside the FPU.
Quote from: BogdanOntanu on July 15, 2010, 08:34:55 PM
Well, many times in ASM you are on your own and you have to create your own tools and routines (unlike in HLLs).
MASM32 spoils you a little with its stock of ready-to-use macros and routines, BUT it is possible that the macros provided with MASM32 do not support printing 64-bit or 128-bit registers... after all, it is designed for 32 bits.
Hence there might be nothing else.
Use the example kindly provided by Clive and consider that maybe now is a good time for you to learn how to write your own simple routine to convert a 64-bit / 128-bit integer into its ASCII equivalent and print the resulting string on screen.
Alternatively, you could store the MMX / XMM register into a memory location / variable and then do two consecutive prints on its low and high parts in order to see the value on screen and avoid writing your own code ...
Thanks, I'll try to do it by myself.
For the time being I have figured out how to use the MMX registers,
and this is the first thing I was trying to understand.
The final code that assembles and runs could be the starting
point for doing what you suggest. :P
;««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««« *
; Example of MMX register usage
;««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««« *
include \masm32\include\masm32rt.inc
.686
.mmx
.xmm
include \masm32\macros\macros.asm ; MASM support macros
.data
.code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start: ; The CODE entry point to the program
mov ecx, 12345
movd mm0,ecx
; print str$(mm0)," value of mm0",13,10
movd ecx,mm0
print str$(ecx)," value of ecx",13,10
print "Press a key to close the program"
call wait_key
print chr$(13,10)
exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start ; Tell MASM where the program ends
Quote from: clive on July 15, 2010, 09:08:33 PM
Quote from: frktons
C:\masm32\examples\mmx_usage.asm(37) : error A2034:must be in segment block
C:\masm32\examples\mmx_usage.asm(49) : error A2006:undefined symbol : start
Needs a .CODE, _start is probably what you should be using. Pretty sure you can't use MM0..MM7 for those STR$ macros, being as the registers live inside the FPU.
Yes clive, after some experiments I got that, now I'll try to find a way to display
the mm0 content. :U
Edit: something like this would be sufficient in this case:
mov ecx, 12345
movd mm0, ecx
movd eax, mm0
print str$(eax)," value of 32 lower bit mm0",13,10
Printf can handle 64-bit integers directly, and the formatting is easy. For 128-bit values the only easy method I can see is to display them as back to back 64-bit values, in hex.
;==============================================================================
include \masm32\include\masm32rt.inc
.586
.MMX
.XMM
;==============================================================================
.data
i64 dq 1122334455667788h
dq 8877665544332211h
rmm0 dq 0
rxmm0 dq 0,0
.code
;==============================================================================
start:
;==============================================================================
movq mm0, i64
movq rmm0, mm0
movups xmm0, i64
movups rxmm0, xmm0
invoke crt_printf, cfm$("MM0 = %I64Xh\n\n"), rmm0
invoke crt_printf, cfm$("XMM0 = %I64X%I64Xh\n\n"), rxmm0, rxmm0+8
inkey "Press any key to exit..."
exit
;==============================================================================
end start
GoASM is the best, the more I use it, the more I love it, it-just-makes-sense, and it saves me a ton of time even in the 32bit world vs masm.
Hi Michael, thanks for the example.
I've tried to assemble it with MASM32 but I get some errors:
Microsoft (R) Macro Assembler Version 10.00.30319.01
Copyright (C) Microsoft Corporation. All rights reserved.
Assembling: C:\masm32\examples\print64bit.asm
C:\masm32\examples\print64bit.asm(20) : error A2070:invalid instruction operands
C:\masm32\examples\print64bit.asm(21) : error A2070:invalid instruction operands
_
Assembly Error
Press a key . . .
What's wrong?
It is pointing at these instructions I think:
movups xmm0, i64
movups rxmm0, xmm0
because I added a couple of comment lines on the top.
Quote from: E^cube on July 16, 2010, 05:49:41 AM
GoASM is the best, the more I use it, the more I love it, it-just-makes-sense, and it saves me a ton of time even in the 32bit world vs masm.
Well, GoAsm could be the next tool to get, but for now I
prefer to stick with MASM32 in order to learn enough
Assembly; afterwards I'll see....
Quote
I've tried to assemble it with MASM32 but I get some errors
I tested with ML 6.15 only. Try adding OWORD PTR in front of the memory operands.
http://msdn.microsoft.com/en-us/library/2det2cf1(VS.71).aspx
Thanks Michael it now works this way:
movups xmm0, OWORD PTR i64
movups OWORD PTR rxmm0, xmm0
and outputs:
MM0 = 1122334455667788h
XMM0 = 11223344556677888877665544332211h
Press any key to exit...
Is that what we expected? OWORD means we are using 16-byte (128-bit) operands?
Yes, and yes. I initially included the OWORD PTR because the defined size of the data does not match the register size. I removed it when 6.15 did not complain, but that clearly was a bad choice.
Quote from: MichaelW on July 16, 2010, 03:05:21 PM
Yes, and yes. I initially included the OWORD PTR because the defined size of the data does not match the register size. I removed it when 6.15 did not complain, but that clearly was a bad choice.
:U
According to these small experiments and code kindly posted
by some of you, I can now reply to my own question that there
are quite a few registers I can use on my box, 8 general purpose
registers, 8 MMX, 16 XMM and probably some more.
How to use them, and when it is convenient to do it, well that
is a long path to go :P
Quote from: frktons on July 16, 2010, 09:22:26 PM
16 XMM and probably some more.
In 32 bits mode you only have access to 8 XMM registers.
In 64bits mode you also have access to 16 GPR registers.
Quote
How to use them, and when it is convenient to do it, well that
is a long path to go :P
This is not really important from a conceptual point of view.
Most software algorithms can be expressed and perform very well with just a few GPR registers available. At some point the number of registers becomes a market issue and a drag on CPU speed (selecting from 16 registers instead of 8 registers is slower in electronics).
Also, internally the CPU performs a few extra tricks like register aliasing/renaming, so even if you overuse the same register the CPU knows better.
Hence, for a start I would not concentrate too much on using registers intensively.
It is more important to learn how to express code and algorithms in a "register / memory / jumps / calls and returns" based "world" rather than to contemplate optimum register usage.
As a syntax example, understanding the difference between EAX and [EAX] is much more important ... and to a certain extent the difference between the OFFSET and ADDR operators is also important.
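A minimal sketch of those two distinctions, assuming the MASM32 package (dwtoa and StdOut from masm32.lib, the print and chr$ macros). The names demo, number and buffer are hypothetical. EAX is a value held in the register, [EAX] is the memory that value points to; OFFSET yields an address fixed at assembly/link time, while ADDR inside invoke also works for a LOCAL, whose address is only known at run time.

```asm
; Sketch only: EAX versus [EAX], and OFFSET versus ADDR.
include \masm32\include\masm32rt.inc

.data
number  dd 12345                    ; global: has a fixed address

.code
demo proc
    LOCAL buffer[16]:BYTE           ; lives on the stack, no fixed address

    mov     eax, OFFSET number      ; EAX   = the address of number
    mov     ecx, [eax]              ; [EAX] = the dword stored there, 12345

    invoke  dwtoa, ecx, ADDR buffer ; ADDR makes invoke emit LEA for the LOCAL
    invoke  StdOut, ADDR buffer     ; print the converted text
    print   chr$(13,10)
    ret
demo endp

start:
    call    demo
    exit
end start
```

Writing OFFSET buffer for the LOCAL would be rejected by the assembler, because a stack variable has no address until the procedure is running; for a global like number, ADDR and OFFSET are interchangeable inside invoke.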
Frank,
With registers in 32 bit, about 95% of the work is done in the 8 GP registers. The additional registers with their matching instruction sets tend to be more focused on a particular type of work: FP for maths; MMX, which shares the FP registers, was the early multimedia extension, later superseded by the extended multimedia instructions and registers (XMM or SIMD), which have since developed through about 4 families of additions (SSE to SSE4.2 on Intel).
Each register type has its own instruction set, and the most extensive belongs to the original general purpose integer registers (EAX ECX EDX EBX EBP ESP ESI EDI). 32 bit and later processors removed many restrictions on how the 8 GP registers were used, but some legacy instructions still require specific registers to work: XLAT, MOVS, STOS, etc.
(http://images.anandtech.com/reviews/cpu/amd/hammer/x86-64.gif)
If you have dual core+ you still technically have exactly the same registers though they are silently doubled+
i was waiting for someone to bring up HTT :P
i have thought some of playing around with "hogging both threads" of my prescott - lol
i don't think it is a great idea in practice, though
seems like you tie up the machine by doing that - better to let the OS manage threads
but - it could give you a whole extra set of registers to play with
not that you could exchange from one set to another efficiently, but you might be able to find some advantage in there
Thanks everybody for your suggestions. :U
Actually my CPU doesn't support Hyper-Threading Technology so
HTT is not an issue for the time being, and it is probably too advanced
a subject for n00bs of my level. :P
Algorithms in Assembly, well, that's the matter I'd really like
to grasp a little. I have seen a lot of good books on algorithms in
C/C++/Java and the like. Probably the C/C++ category is the
closest to the machine.
From C, which I'm learning right now, I'll try to pick up some
algorithmic attitude, so to speak, and then go all the way to translating
or adapting the algorithms in MASM/GoAsm, whatever.
It's quite a long way though, and the sources are overwhelming :eek
One slow step at a time, no other choice. :lol
One of the things I'd like to test is the use of 64 bit registers
to perform division, which is quite resource consuming, as
many of you have explained to me.
This short mixed code, which I use for dividing a number by ten,
is an example I'd like to improve a little with a better algorithm,
maybe a divide by multiply and shift, and/or with the use of
some 64 bit Assembly trick I'm not aware of:
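long div_result = 0;
long remain = 0;
const long ten = 10;
num2 = rand() % 10000;      /* never negative, so a zeroed EDX is a valid sign extension */
__asm{
    xor  edx, edx           ; high half of the EDX:EAX dividend (cdq after the load if num2 could be negative)
    mov  eax, num2
    mov  ecx, ten
    idiv ecx                ; EAX = quotient, EDX = remainder
    mov  div_result, eax
    mov  remain, edx
}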
Probably MMX registers are not well suited for this purpose,
or are slower than GPR, I actually don't know. Surely if I use
the following code, that is obviously in C language:
num2 = rand() % 10000;
div_result = (num2 * 6554UL) >> 16;
remain = num2 - div_result * ten;
I get better performance because the algorithm is smarter:
it doesn't use division, but multiplies the number to divide by a
magic number and then shifts the product right by a given
number of positions.
Of course 6554 works for numbers not bigger than 9999,
and I'd have to calculate the magic number depending on the range
I'm going to use.
So I was wondering what performance could we get using methods
like this with 64 bit registers and a full set of magic numbers
to use. ::)
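How far a given magic number can be trusted is easy to check by brute force. The sketch below is plain C with made-up function names; it compares the multiply by 6554 and shift by 16 against a real division and reports the first input where they disagree. The 64 bit intermediate is only there so the search itself cannot overflow.

```c
#include <assert.h>
#include <stdint.h>

/* The shortcut from the post: multiply by 6554, then drop the low 16 bits. */
static uint32_t div10_magic16(uint32_t n) {
    return (uint32_t)(((uint64_t)n * 6554u) >> 16);
}

/* Returns the smallest n below limit where the shortcut differs from n / 10,
   or limit itself if the shortcut is exact over the whole range. */
static uint32_t first_mismatch(uint32_t limit) {
    assert(limit > 0);
    for (uint32_t n = 0; n < limit; n++) {
        if (div10_magic16(n) != n / 10u) {
            return n;
        }
    }
    return limit;
}
```

With this pair, `first_mismatch(10000)` returns 10000, so the 0 to 9999 range used above is safe; searching further, the first wrong quotient shows up at 16389.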
I translated the C code for divide by multiply and shift:
div_result = (num2 * 6554UL) >> 16;
remain = num2 - div_result * ten;
in Assembly this way:
mov eax, num2
imul eax, 6554
shr eax, 16
mov div_result, eax
mov ecx, num2
imul eax, ten
sub ecx, eax
mov remain, ecx
But the performance is about the same, and I don't
know whether that depends on how good the compiler is at
translating the code, or how bad I am at doing the same. :P
Any suggestions to improve the above code?
Magic numbers are good for dividing by means of a multiply and a shift, but you get no remainder directly, you still need the shift, and they are usually used for dividing by constants, not by variables. For variables, you would need a table of all possible magic numbers, or a table of number/magic_number pairs that has to be searched for a matching number to get the magic number to use. The full table would exceed allowable memory (especially for 64 bit). The search would take more time than you would save with the magic number multiply.
Until you start using 64 bit processing, you do not have 64 bit GP registers (rax, rdx). I do not see any MMX 64 bit register instructions that do divides. Some MMX 64 bit register packed multiplies exist, but nothing that you cannot do with a multiply of eax and edx. Note: to save a register, put one value in eax, the other in edx, then mul edx; the 64 bit result is in edx:eax (high 32 bits:low 32 bits).
Dave.
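For the constant divisor 10, the 32 x 32 -> 64 bit multiply Dave describes is enough to cover the whole unsigned 32 bit range. The sketch below uses the well known constant 0xCCCCCCCD, which is 2^35 / 10 rounded up; the function names are invented here, and the remainder has to be rebuilt with a multiply and a subtract, exactly because the trick itself yields only the quotient.

```c
#include <assert.h>
#include <stdint.h>

/* Quotient of n / 10 for any 32 bit unsigned n.
   The 64 bit product is what MUL leaves in EDX:EAX; shifting it right by 35
   means taking EDX and shifting that right by 3. */
static uint32_t div10_u32(uint32_t n) {
    uint64_t product = (uint64_t)n * 0xCCCCCCCDu;
    return (uint32_t)(product >> 35);
}

/* Remainder rebuilt from the quotient: n - (n / 10) * 10. */
static uint32_t mod10_u32(uint32_t n) {
    uint32_t remainder = n - div10_u32(n) * 10u;
    assert(remainder < 10u);
    return remainder;
}
```

A different divisor needs a different constant and shift, which is why this is normally done for divisors known in advance.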
Thanks Dave.
I was making naive assumptions, typical beginner stuff :P
By the way, is the code I used to translate the C code good enough,
or could I do better in some ways?
Quote from: frktons on July 19, 2010, 02:20:14 AM
Any suggestion to improve the above code?
You could swap memory for registers though it really does depend on the surrounding code.... ie I see no need for this line in current code:
mov div_result, eax
you could also:
mov eax, num2
mov ecx, eax
Quote from: oex on July 19, 2010, 03:23:00 AM
You could swap memory for registers though it really does depend on the surrounding code.... ie I see no need for this line in current code:
mov div_result, eax
well I need the div_result variable to use in the C code.
Quote
you could also:
mov eax, num2
mov ecx, eax
Well, this is good :U I can spare some cycles this way. Thanks:
mov eax, num2
mov ecx, eax
imul eax, 6554
shr eax, 16
mov div_result, eax
imul eax, ten
sub ecx, eax
mov remain, ecx
Nevertheless I'm not able to beat the Pelles'C compiler.
The C code is as fast as the Assembly. :eek
Most of the time is taken up in the imuls.... If you can find a way to remove or combine them you should be in luck but it's too late for me to do that math :lol
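One way to drop the second imul is to note that multiplying by ten is just two shifts and an add, since 10 = 8 + 2. The C sketch below (function names are illustrative) shows the identity and uses it to rebuild the remainder; whether it is actually faster than imul on a given CPU is something only a measurement can tell, and a compiler may well make this substitution on its own.

```c
#include <assert.h>
#include <stdint.h>

/* q * 10 written as (q * 8) + (q * 2), using shifts only. */
static uint32_t times10_shifts(uint32_t q) {
    return (q << 3) + (q << 1);
}

/* Remainder of n / 10 for the 0..9999 range used in the thread,
   with the multiply by ten replaced by the shift version. */
static uint32_t remainder10_small(uint32_t n) {
    assert(n <= 9999u);                     /* the 6554 shortcut is only relied on in this range */
    uint32_t quotient = (n * 6554u) >> 16;
    return n - times10_shifts(quotient);
}
```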
Quote from: oex on July 19, 2010, 04:56:23 AM
Most of the time is taken up in the imuls...
imuls are actually pretty fast, much faster than normal muls, so don't waste too much effort on finding a workaround.
I was working off the MASM opcodes manual which has them at 13-42 clocks each.... Is there a better ref?
mov, sub and shr are down as 1-3 clocks....
I don't know for sure and it's been a VERY long night but shr eax, 16 would be:
movzx ebx, ax
I think.... (maybe the other way round.... bswap first) being 16 bit this might be slightly faster?
Quote from: oex on July 19, 2010, 07:09:36 AM
I was working off the MASM opcodes manual which has them at 13-42 clocks each.... Is there a better ref?
mov, sub and shr are down as 1-3 clocks....
I dont know for sure and it's been a VERY long night but shr, 16 would be:
movzx ebx, ax
I think.... (maybe the other way round.... bswap first) being 16 bit this might be slightly faster?
Thanks oex, this is another option to try:
movzx ebx, ax
or the code that works for it, I still don't know. ::)
Back home, on my pc, I'll try it and see if it performs any better. :P
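Before timing it, it is worth checking what the two instructions return. Expressed in C (names invented for the sketch), shr eax, 16 keeps the upper 16 bits of the product, while movzx ebx, ax keeps the lower 16 bits, so the second one is not a drop-in replacement in the divide-by-ten code.

```c
#include <assert.h>
#include <stdint.h>

/* What "shr eax, 16" leaves behind: the upper half of the 32 bit value. */
static uint32_t upper_half(uint32_t value) {
    return value >> 16;
}

/* What "movzx ebx, ax" produces: the lower half, zero extended. */
static uint32_t lower_half(uint32_t value) {
    return value & 0xFFFFu;
}

/* Quotient of n / 10 via the 6554 shortcut; it needs the upper half. */
static uint32_t div10_small(uint32_t n) {
    assert(n <= 9999u);
    return upper_half(n * 6554u);
}
```

For 1234 the product is 8087636; its upper half is 123, the wanted quotient, and its lower half is 26708.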
Frank and oex, forget old timing manuals in cycles on anything later than a 386, as later processors have pipelines that "SCHEDULE" instructions, and on some of the later processors the latency of any single instruction, even without a stall, may be 40 to 50 cycles from entry to retirement.
Think of one or more pipelines as instruction assembly production lines like in a factory: performance is measured by the output, not the individual component.
Oh OK so how many conveyor belts?.... I take it you mean something like....
mov eax, 7
add eax, 3
mov ebx, 5
add ebx, 5
add eax, ebx
Pipeline 1:
  belt A:  mov eax, 7
           add eax, 3
  belt B:  mov ebx, 5
           add ebx, 5
Pipeline 2 (processed on completion of P1):
           add eax, ebx
????
If so how many 'component' conveyor belts in each pipeline (ie 2 in this example in P1 and 1 in P2)
Maybe I'm way off again.... I guess this looks more like FPGA but it's been a long night....
Maybe someone could point me in the direction of a diagram and/or code example explanation?
Quote from: hutch-- on July 19, 2010, 08:45:54 AM
Frank and oex, forget old timing manuals in cycles on anything later than a 386
Indeed.
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
5434 cycles for 100*div
470 cycles for 100*mul
174 cycles for 100*imul
177 cycles for 100*shl
On my pc I get different results:
Intel(R) Core(TM)2 Duo CPU E4500 @ 2.20GHz (SSE4)
175 cycles for 100*mul
95 cycles for 100*imul
62 cycles for 100*shl
--- ok ---
shl looks faster.
Probably because you only shifted 1 position:
mov eax, 2
shl eax, 1
hmmmm that just creates even more questions for me :lol.... Hutch just said to forget output of individual instructions so what do those timings tell us *out of context*?
From what I can see here now no code can be judged by any means other than testing it?
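Testing in context can be done from the C side as well. The following is a rough sketch, not the harness used for the numbers in this thread: it runs a whole loop of divisions and a whole loop of multiply-and-shift operations, checks that both give the same checksum, and reports CPU time through the standard `clock()` call. The helper names are hypothetical, and the `volatile` divisor is an assumption made so the compiler does not quietly turn the division into its own magic multiply.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* volatile forces a real division instruction on every iteration. */
static volatile uint32_t divisor = 10;

static uint32_t sum_by_division(uint32_t reps) {
    uint32_t sum = 0;
    for (uint32_t r = 0; r < reps; r++)
        for (uint32_t n = 0; n < 10000u; n++)
            sum += n / divisor;
    return sum;
}

static uint32_t sum_by_magic(uint32_t reps) {
    uint32_t sum = 0;
    for (uint32_t r = 0; r < reps; r++)
        for (uint32_t n = 0; n < 10000u; n++)
            sum += (n * 6554u) >> 16;
    return sum;
}

/* Runs one variant, stores its checksum, and returns the CPU seconds it used. */
static double time_variant(uint32_t (*variant)(uint32_t), uint32_t reps,
                           uint32_t *checksum) {
    assert(variant != 0 && checksum != 0);
    clock_t start = clock();
    *checksum = variant(reps);
    clock_t stop = clock();
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

The checksums matter as much as the times: a variant that is fast but returns a different sum is simply wrong. The compiler is also free to rearrange the second loop heavily, so the result says more about the whole build than about one instruction.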
Quote from: hutch
on some of the later processors the latency of any single instruction, even without a stall, may be 40 to 50 cycles from entry to retirement
Quote from: frktons on July 19, 2010, 09:08:50 AM
shl looks faster.
Probably because you only shifted 1 position:
That "shl reg, 1 is faster than shl reg, 15" might be valid for very old CPUs... see updated attachment above.
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
4728 cycles for 100*div
468 cycles for 100*mul
172 cycles for 100*imul
183 cycles for 100*shl 1
181 cycles for 100*shl 2
Not many changes, actually:
Intel(R) Core(TM)2 Duo CPU E4500 @ 2.20GHz (SSE4)
1221 cycles for 100*div
179 cycles for 100*mul
95 cycles for 100*imul
63 cycles for 100*shl 1
63 cycles for 100*shl 2
1226 cycles for 100*div
179 cycles for 100*mul
95 cycles for 100*imul
62 cycles for 100*shl 1
63 cycles for 100*shl 2
--- ok ---
shl still looks 50% faster than imul ::)
Well, I'm working on Win XP Pro/32 bit with a Core 2 Duo; I don't
know whether that matters, but it seems to make a difference.
Quote from: frktons on July 19, 2010, 09:21:02 AM
shl still looks 50% faster than imul ::)
Well I'm working on a Win XP pro/32 bit with a Core 2 duo, I don't
know if, but it seems to make difference.
We are talking 0.95 cycles instead of 0.63 cycles per multiplication. And you still have not explained how you want to replace imul eax, 6554 with some intelligent shift, add etc. operations that perform in less than 0.95 cycles...
Quote from: jj2007 on July 19, 2010, 09:25:49 AM
We are talking 0.95 cycles instead of 0.63 cycles per multiplication. And you still have not explained how you want to replace imul eax, 6554 with some intelligent shift, add etc operations that perform in less than 0.95 cycles...
Sorry JJ, I was just showing you what I get. As for changing
imul eax, 6554 into something smarter, I have no clue
for the time being; I have to think about that for a while. That is
a magic number and I don't know how to deal with them
without offending them :lol
Could you suggest something?
By the way, have you any idea why on your machine the imul
and the shift perform differently? ::) Your machine should
be faster than mine according to what is displayed.
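To see what replacing imul eax, 6554 with shifts and adds would involve, the constant can be split into powers of two: 6554 = 4096 + 2048 + 256 + 128 + 16 + 8 + 2. The C sketch below (the function name is made up) writes the product that way. It takes seven shifts and six adds, which makes it plausible that the single imul measured above is hard to beat, though only a timing run can confirm that on a given CPU.

```c
#include <assert.h>
#include <stdint.h>

/* x * 6554 written out as one shifted copy of x per set bit of 6554 (0x199A). */
static uint32_t times6554_shifts(uint32_t x) {
    assert(x <= 9999u);   /* range used in the thread; keeps the product well inside 32 bits */
    return (x << 12) + (x << 11) + (x << 8) + (x << 7)
         + (x << 4)  + (x << 3)  + (x << 1);
}
```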
Let's get some AMD representation:
AMD Phenom(tm) II X6 1055T Processor (SSE3)
1899 cycles for 100*div
193 cycles for 100*mul
96 cycles for 100*imul
61 cycles for 100*shl 1
61 cycles for 100*shl 2
1896 cycles for 100*div
193 cycles for 100*mul
96 cycles for 100*imul
61 cycles for 100*shl 1
61 cycles for 100*shl 2
But honestly, the timing of individual instructions is useless. When choosing between SHL and IMUL (and hey, why wasn't LEA represented here?) the other instructions in the pipeline mean everything. IMUL takes over a different execution unit than SHL does on the latest from both Intel and AMD.
oex,
The concept of 1 or more pipelines is not something you can control very well independently; it comes down more to understanding how they work. You have 2 basic classes of instructions, the RISC preferred set and the old junk, mainly stored in microcode, and what recent processors do is present an interface with the x86 instruction set. From a variety of sources you get a reasonably good idea of what the preferred instruction set is, and it's usually the simpler instructions: MOV ADD SUB TEST CMP. Then you have more complex instructions that get slower, and this varies from one processor to another: shifts and rotates are usually off the pace on late hardware, XCHG is a lemon, string instructions without REP are worth avoiding, but there is special case circuitry when used with REP that cuts in after about 500 bytes. On older hardware IMUL and MUL were very slow and still are in comparison to preferred instructions, but later hardware is getting faster with multiplications as it has additional execution units to do stuff like this.
You get the fastest code for the data size by using preferred instructions and avoiding stalls from a variety of situations, dependency being one of the bad ones that will stop a pipeline until the result it depends on is available. Earlier processors had problems with alignment and some had problems with data sizes other than the native unit size, 32 bit and, on later stuff, 64 bit.
LEA was fast on everything from a 486 up to the early PIVs, where it was off the pace and could be replaced by a number of ADDs in some contexts; on the Core series and later LEA is fast again.
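The dependency point can be illustrated from C with a small sketch (names invented here). The first function adds everything into one total, so each add has to wait for the previous one. The second keeps two running totals that do not depend on each other, which gives the processor two chains it can overlap, much like the two independent mov/add pairs in oex's example. Whether this helps on a particular CPU and compiler is an assumption that has to be measured.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One long dependency chain: every add needs the result of the one before it. */
static uint32_t sum_one_chain(const uint32_t *values, size_t count) {
    assert(values != 0 || count == 0);
    uint32_t total = 0;
    for (size_t i = 0; i < count; i++)
        total += values[i];
    return total;
}

/* Two independent chains that are only joined at the very end. */
static uint32_t sum_two_chains(const uint32_t *values, size_t count) {
    assert(values != 0 || count == 0);
    uint32_t even_total = 0;
    uint32_t odd_total = 0;
    size_t i = 0;
    for (; i + 1 < count; i += 2) {
        even_total += values[i];
        odd_total += values[i + 1];
    }
    if (i < count)               /* leftover element when count is odd */
        even_total += values[i];
    return even_total + odd_total;
}
```

Both functions return the same sum; only the shape of the dependencies differs.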
Quote from: Rockoon on July 19, 2010, 11:35:11 AM
Lets get some AMD representation:
But honestly, the timing of individual instructions is useless. When choosing between SHL and IMUL (and hey, why wasnt LEA represented here?) the other instructions in the pipeline mean everything. IMUL takes over a different execution unit than the SHL does on the latest from both Intel and AMD.
What about lea, where is she?
Post some representative code with lea please, let's have
a taste of her :lol
Quote from: hutch-- on July 19, 2010, 11:58:32 AM
oex,
The concept of 1 or more pipelines is not something you can control very well independently, it comes more in understanding how they work. You have 2 basic classes of instructions, the RISC preferred set and the old junk, mainly stored in microcode and what recent processors do is present an interface with the x86 instruction set. From a variety of sources you get a reasonably good idea of what the preferred instruction set is and its usually the simpler instructions. MOV ADD SUB TEST CMP, then you have more complex instructions that get slower and this varies from one processor to another, shifts, rotates are usually off the pace on late hardware, XCHG is a lemon, string instructions without REP are worth avoiding but there is special case circuitry when used with REP that cut in after about 500 bytes. On older hardware IMUL MUL were very slow and still are in comparison to preferred instructions but later hardware is getting faster with multiplications as they have additional execution units to do stuff like this.
You get the fastest code for the data size by using preferred instructions and avoiding stalls from a variety of situations, dependency being one of the bad ones that will stop a pipeline until the result it depends on is available. Earlier processors had problems with alignment and some had problems with different data sizes apart from the native unit size, 32 bit and on later stuff, 64 bit.
LEA was fast on everything from a 486 up to the early PIVs where it was off the pace and could be replaced by a number of ADDs in some contexts, on the Core series and later LEA is fast again.
It looks like you never get any rest with CPU modifications and upgrades.
Probably you have to stick with whatever is best for a timeframe
and be ready to change as soon as it is needed. ::)
Quote
From what I can see here now no code can be judged by any means other than testing it?
even testing it is only valid if you test it on a variety of CPU's
P4 cores are quickly becoming obsolete, in spite of the fact that they may not be all that old
here is the method i use...
(http://img839.imageshack.us/img839/4458/carnac.jpg)
Quote from: Rockoon on July 19, 2010, 11:35:11 AM
and hey, why wasnt LEA represented here?
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
5440 cycles for 100*div
471 cycles for 100*mul
173 cycles for 100*imul
426 cycles for 100*lea, 2*eax
277 cycles for 100*lea, 2*eax+eax
426 cycles for 100*lea, 2*eax+eax+99
177 cycles for 100*shl 1
177 cycles for 100*shl 2
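For readers wondering what the three lea lines compute: assuming the labels stand for the operand forms lea eax, [eax*2], lea eax, [eax*2+eax] and lea eax, [eax*2+eax+99] (the test source is not shown in this part of the thread, so that is a guess), each one is a small multiply-and-add done by the address unit. In C terms, with illustrative function names:

```c
#include <assert.h>
#include <stdint.h>

/* lea eax, [eax*2]        -> 2 * x */
static uint32_t lea_2x(uint32_t x) {
    return x * 2u;
}

/* lea eax, [eax*2+eax]    -> 3 * x */
static uint32_t lea_3x(uint32_t x) {
    return x * 2u + x;
}

/* lea eax, [eax*2+eax+99] -> 3 * x + 99 */
static uint32_t lea_3x_plus_99(uint32_t x) {
    uint32_t result = x * 2u + x + 99u;
    assert(result == lea_3x(x) + 99u);   /* same value as the two step form */
    return result;
}
```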
FWIW,
Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz (SSE4)
1219 cycles for 100*div
178 cycles for 100*mul
94 cycles for 100*imul
94 cycles for 100*lea, 2*eax
94 cycles for 100*lea, 2*eax+eax
94 cycles for 100*lea, 2*eax+eax+99
62 cycles for 100*shl 1
62 cycles for 100*shl 2
1217 cycles for 100*div
178 cycles for 100*mul
94 cycles for 100*imul
94 cycles for 100*lea, 2*eax
94 cycles for 100*lea, 2*eax+eax
94 cycles for 100*lea, 2*eax+eax+99
62 cycles for 100*shl 1
62 cycles for 100*shl 2
No different to the earlier test (I do keep an eye on you mr jj :bg)
Tests should operate on the same data :naughty:
:bg
Frank,
Quote
It looks like you never get rest with CPU modifications and upgrades.
Probably you have to stick with whatever is the best for a timeframe
and be ready to change as far as it is needed. ::)
Welcome to mixed mode or balanced mode assembler programming. :P
Demonstrating the reason why I suggested LEA be tested:
AMD Phenom(tm) II X6 1055T Processor (SSE3)
1896 cycles for 100*div
193 cycles for 100*mul
96 cycles for 100*imul
61 cycles for 100*lea, 2*eax
61 cycles for 100*lea, 2*eax+eax
61 cycles for 100*lea, 2*eax+eax+99
61 cycles for 100*shl 1
61 cycles for 100*shl 2
1896 cycles for 100*div
194 cycles for 100*mul
96 cycles for 100*imul
61 cycles for 100*lea, 2*eax
61 cycles for 100*lea, 2*eax+eax
61 cycles for 100*lea, 2*eax+eax+99
61 cycles for 100*shl 1
61 cycles for 100*shl 2
--- ok ---
AMD never gave up on LEA performance.
ty guys for your input.... I'm reasonably confident that my code is about as fast as it can be; outside of imul being faster, everything else you've been saying seems to be pretty much in keeping with the current rules I implement. I haven't used imul up until now so I might be able to tease a few cycles out of my code yet :bg.... I'll take on board what you have said and see what improvements I can make :bg
Quote from: Rockoon on July 19, 2010, 01:04:09 PM
Demonstrating the reason why I suggested LEA be tested:
AMD never gave up on LEA performance.
Good to see lea is fine. ;) Could you post the ASM
as well, I live on it for the time being. :P
Quote from: frktons on July 19, 2010, 05:29:52 PM
Good to see lea is fine. ;) Could you post the ASM
as well, I live on it for the time being. :P
See JJ's post.
Oh! Oh! I skipped a couple of posts :P
Miss lea did not improve that much on my CPU:
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz (SSE4)
1216 cycles for 100*div
178 cycles for 100*mul
94 cycles for 100*imul
94 cycles for 100*lea, 2*eax
94 cycles for 100*lea, 2*eax+eax
94 cycles for 100*lea, 2*eax+eax+99
62 cycles for 100*shl 1
62 cycles for 100*shl 2
1217 cycles for 100*div
178 cycles for 100*mul
94 cycles for 100*imul
94 cycles for 100*lea, 2*eax
94 cycles for 100*lea, 2*eax+eax
94 cycles for 100*lea, 2*eax+eax+99
62 cycles for 100*shl 1
62 cycles for 100*shl 2
--- ok ---
Thanks JJ for providing all these fine examples. :clap:
AMD Athlon(tm) 4 Processor (SSE1)
4217 cycles for 100*div
310 cycles for 100*mul
260 cycles for 100*imul
78 cycles for 100*lea, 2*eax
66 cycles for 100*lea, 2*eax+eax
88 cycles for 100*lea, 2*eax+eax+99
74 cycles for 100*shl 1
67 cycles for 100*shl 2
4221 cycles for 100*div
310 cycles for 100*mul
260 cycles for 100*imul
78 cycles for 100*lea, 2*eax
66 cycles for 100*lea, 2*eax+eax
88 cycles for 100*lea, 2*eax+eax+99
75 cycles for 100*shl 1
66 cycles for 100*shl 2
Queue
Intel(R) Pentium(R) 4 CPU 3.20GHz (SSE2)
5918 cycles for 100*div
1020 cycles for 100*mul
460 cycles for 100*imul
213 cycles for 100*lea, 2*eax
91 cycles for 100*lea, 2*eax+eax
193 cycles for 100*lea, 2*eax+eax+99
95 cycles for 100*shl 1
90 cycles for 100*shl 2
5838 cycles for 100*div
1013 cycles for 100*mul
466 cycles for 100*imul
196 cycles for 100*lea, 2*eax
99 cycles for 100*lea, 2*eax+eax
197 cycles for 100*lea, 2*eax+eax+99
87 cycles for 100*shl 1
87 cycles for 100*shl 2
--- ok ---
AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ (SSE3)
4344 cycles for 100*div
210 cycles for 100*mul
79 cycles for 100*imul
70 cycles for 100*lea, 2*eax
85 cycles for 100*lea, 2*eax+eax
78 cycles for 100*lea, 2*eax+eax+99
28 cycles for 100*shl 1
61 cycles for 100*shl 2
4425 cycles for 100*div
193 cycles for 100*mul
96 cycles for 100*imul
76 cycles for 100*lea, 2*eax
62 cycles for 100*lea, 2*eax+eax
61 cycles for 100*lea, 2*eax+eax+99
78 cycles for 100*shl 1
81 cycles for 100*shl 2
--- ok ---
Quote from: BogdanOntanu on July 15, 2010, 07:31:57 PM
Quote from: oex on July 15, 2010, 05:52:11 PM
Hey Bogdan, I was wondering is this an OS design choice or a CPU setting?....
It is a CPU design setting. A design choice of AMD.
Quote
At the lowest level (OS) you could have switching, right?
Not sure what you are asking... but if I guess right the answer is NO.
However, you do not have to, since you can run both 32-bit and 64-bit executables in a 64-bit OS. You just cannot mix them because the CPU forbids it.
I know this answer from "Bogdan" is a little bit old, however I just ran into this: http://vxheavens.com/lib/vrg16.html
Looks like YASM has this "trick" already built in. I wonder how this can be done in MASM ::) to mix 32/64-bit source code?