Just want to ask some advice really.
I've done assembly for quite a while and I can write some pretty complex applications, though I don't know all of the opcodes (I haven't even touched MMX or SSE) and I haven't played much with the compiler options, like libraries and custom .data-type sections. So, in order to further my assembly programming ability (so I can write code faster and more easily, or I might be tempted to switch to C :( ), I was wondering if anyone had some more advanced working tutorials, specifically using perhaps opcodes I haven't yet used (just ones that aren't so common), libraries and other extra dongles you can get the compiler to do that I should know about (I only really know the instruction to compile a normal .exe), other areas like optimising, really small executables, memory management/accessibility, the format of PEs, and any other tricks or anything else you can think of.
There's bound to be something somewhere, but I keep coming across more and more newbie tutorials for assembly, so I figured I'd ask here what people recommend to further my ability.
Cheers :U
RedXVII
Hello!
Reading your post, two things came to mind.
Optimizing, i.e. making code smaller and/or faster, does not have to mean introducing the complex MMX, SSE or SSE2 instructions. Of course, if you are coding a complicated mathematical calculation such as vector and matrix operations, then those units can be used.
I have learnt a lot from this optimization tutorial by Mark Larson: http://www.mark.masmcode.com/
I had an astonishing experience when I managed to speed up a loop 7 times!!! :eek by following 4 recommendations from this tutorial. (This was rather simple code; the tips I followed were dword alignment, taking care of jump prediction, instruction pairing, and breaking up data dependencies between instructions.)
I'm sorry, on the matter of compilation and linking settings I can't say anything wise, but there are many people here who experiment with those switches. :U
Good luck with your work!
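To make the last tip concrete, here is a minimal sketch in C (not taken from the tutorial; the names sum_single and sum_split are made up for illustration). The first loop carries one dependency chain through every add; the second splits the work over four independent accumulators so consecutive adds do not have to wait for each other. Whether this helps on a given CPU, or whether an optimizing compiler already does it for you, is something you would have to measure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One accumulator: each add depends on the result of the previous add. */
uint32_t sum_single(const uint32_t *a, size_t n)
{
    uint32_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Four accumulators: the four adds inside one iteration are independent
   of each other, so the dependency chain is a quarter as long. */
uint32_t sum_split(const uint32_t *a, size_t n)
{
    uint32_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)          /* leftover elements when n is not a multiple of 4 */
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}
```

Both functions return the same value for the same input; only the shape of the dependency chain differs.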
Greets, Gábor
Quote from: RedXVII on August 09, 2006, 12:58:10 AM
I was wondering if anyone had some more advanced working tutorials, specifically using perhaps opcodes I haven't yet used (just ones that aren't so common), libraries and other extra dongles you can get the compiler to do that I should know about (I only really know the instruction to compile a normal .exe), other areas like optimising, really small executables, memory management/accessibility, the format of PEs, and any other tricks or anything else you can think of.
Unfortunately there are a lot of newbie tutorials and far fewer advanced ones.
1) There is a bunch of example code for OpenGL and DirectX you can dig up to learn that. There's even an OpenGL forum on this board.
2) I think learning MMX/SSE/SSE2 programming is very important for writing highly optimized code. I saw a great MMX routine that counts the number of lines in a file that is a lot faster than any ALU version. A lot of people think you can only use MMX/SSE/SSE2 for math-type stuff, and that's simply not true. It has a lot of uses. If you learn more about it, you can take better advantage of it to get better optimized code. Intel also has libraries you can download that do cos/sin/tan/etc., since SSE/SSE2 doesn't have support for trig functions.
3) You can learn how to program the FPU. Raymond has a very in-depth tutorial online on the masm website. You can also do scalar SSE/SSE2, which is a lot like doing the FPU: you do a single floating point instruction on one piece of data, instead of doing one instruction on multiple pieces of data. I have a tutorial here that goes into detail about converting a VC++ FP app to scalar SSE.
http://www.masm32.com/board/index.php?topic=1140.0
4) Gabor already posted my link to my optimization page ( thanks Gabor!), but I also have several tutorials that go into optimizing specific algorithms.
Using SSE2 to speed up Quaternions
- http://www.oldboard.assemblercode.com/index.php?topic=3469.0
Optimizing Mersenne Twister Random Number Generator. Agner Fog has the same routine optimized in assembler on his website. My version is a bit over 3 times faster than Agner's. I go into detail about how I sped it up.
- http://www.oldboard.assemblercode.com/index.php?topic=3565.0
MD5 SSE2 code used in the md5crk project. The project was an open source distributed project trying to prove md5 is insecure. The code was over 10 times faster than the C code.
- http://www.oldboard.assemblercode.com/index.php?topic=2921.0
As for the extra dongles you speak of: search for threads here on linker switches that result in a smaller .exe, and look in the masm help file for the switches for ml.exe, which can output an assembly listing with cycle counts. There are also advanced subjects such as writing your own manual stack handling to free up all eight regs.
On opcodes, you can learn what the different prefixes do; I even investigated what the prefixes do for MMX/SSE/SSE2.
MMX makes it easy to write your own personal filter for Photoshop or mess with sounds; only your imagination limits it.
The closest thing to shader programming is to process 4 floats at once with SSE, with a final conversion to the screen's integer format.
I combine hardware-accelerated D3D9 with OOP and SSE/MMX.
hello,
I saw that Mark_Larson wrote in his post:
Quote
2) I think learning MMX/SSE/SSE2 programming is very important for writing highly optimized code. I saw a great MMX routine that counts the number of lines in a file that is a lot faster than any ALU version.
Is a sample of the "great MMX routine that counts the number of lines" downloadable somewhere?
ToutEnMasm
I went to the old forum and tried to register, but I get this error:
**Sorry, registration is currently disabled.**
How do you download files from the old forum if you never registered in the past? I tried my masm32forum name and password and it did not work.
I will have a look at that a bit later to see if it can be easily changed without leaving the forum open to postings.
Thanks hutch--
Please do, but take your time. I've been reading the Soap Box and other forums here and around the world... Hell, has the world gone MAD... I understand your caution.
You've now got too much on your plate and no one wants to give you a break for some reason. But WTH (you started it all for the sake of common sense back in the day).
That's the life of a true warrior, I guess. As long as you are in good health, we are happy; if not, we will CODE it back up :)
RedXVII is right, we need a controlled section or something for advanced assembler programmers who are moving on. How about a detailed section containing only MMX coding for starters, if feasible, provided MMX has a future in upcoming processors? I guess things are hard because you have to see 30 - 100 years into the future at minimum, and who knows what OS and API snow job will stop programmers from seeing asm code for the sake of selling high-level programming tools to a future world of fools. Hee hee. Even the thought drives me nuts.
Thanks again
What happened with the old forum was that the idiot fringe kept hacking it because it was an old PHPBB2 forum. After fixing it a few times I converted it to an SMF forum, but the conversion did not handle the attachments properly. I have a link at the top with an Apache listing of all the attachments that you can download; unfortunately I cannot do much more than this.
@hutch--
Have you seen the new phpBB3?
Quote from: ToutEnMasm on August 09, 2006, 08:37:37 PM
hello,
I saw that Mark_Larson wrote in his post:
Quote
2) I think learning MMX/SSE/SSE2 programming is very important for writing highly optimized code. I saw a great MMX routine that counts the number of lines in a file that is a lot faster than any ALU version.
Is a sample of the "great MMX routine that counts the number of lines" downloadable somewhere?
ToutEnMasm
count_table is a lookup table that holds the number of bits set in each byte value from 0 to 255. Each loop iteration reads two 64-bit quadwords, so it handles 16 bytes at a time. The code looks for 0ah instead of 0dh so that it will also work under Linux (Linux uses 0ah as the line terminator in files; Windows uses 0dh,0ah).
If you do a search on MMX/SSE/SSE2 on the old forum you will find lots of good stuff, a good bit isn't even math related :)
markl_CountFileLines proc near pszFile:ptr byte
; This routine just counts the lines in a file
; in reality it just counts character 0Ah (line feed)
mov eax,0a0a0a0ah
movd mm7,eax
pshufw mm7,mm7,00000000b
xor eax,eax ; initialize line count to zero.
; pad pFilebuffer with 16 0's on the end, because cbFile might not be divisible by 16.
mov edi,[pFilebuffer]
mov esi,[cbFile]
lea esi,[esi+edi]
align 16 ; for P3 and below.
L1:
movq mm0,[edi]
movq mm1,[edi+8]
add edi,16
pcmpeqb mm0,mm7
pcmpeqb mm1,mm7
pmovmskb ebx,mm0
pmovmskb ecx,mm1
;eax is running count.
movzx edx,bl
movzx ebx,cl
add eax,dword ptr [count_table+edx*4]
add eax,dword ptr [count_table+ebx*4]
cmp edi,esi
jb L1
emms ; leave MMX state so later FPU code works
RET
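For anyone who wants to follow the idea without reading MMX, here is a rough C sketch of the same scheme, under a few assumptions of mine: the helper names init_count_table, lf_mask8 and count_lines are made up, the table holds bytes rather than the dwords the listing indexes with *4, and the tail is handled byte by byte instead of padding the buffer. lf_mask8 plays the role of pcmpeqb followed by pmovmskb: it produces one bit per byte that equals 0Ah, and the table turns that mask into a count.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* count_table[m] = number of set bits in the byte value m (0..255). */
static uint8_t count_table[256];

void init_count_table(void)
{
    for (int m = 0; m < 256; m++) {
        int bits = 0;
        for (int b = 0; b < 8; b++)
            bits += (m >> b) & 1;
        count_table[m] = (uint8_t)bits;
    }
}

/* Scalar stand-in for pcmpeqb + pmovmskb on 8 bytes:
   bit k of the result is set when p[k] is a line feed (0Ah). */
static unsigned lf_mask8(const unsigned char *p)
{
    unsigned mask = 0;
    for (int k = 0; k < 8; k++)
        mask |= (unsigned)(p[k] == 0x0A) << k;
    return mask;
}

/* Counts 0Ah bytes, 16 at a time like the MMX loop.
   init_count_table() must have been called first. */
size_t count_lines(const unsigned char *buf, size_t len)
{
    size_t lines = 0;
    size_t i = 0;
    for (; i + 16 <= len; i += 16) {
        lines += count_table[lf_mask8(buf + i)];
        lines += count_table[lf_mask8(buf + i + 8)];
    }
    for (; i < len; i++)        /* tail that does not fill a 16-byte block */
        lines += (buf[i] == 0x0A);
    return lines;
}
```

The scalar mask loop is of course much slower than the two MMX instructions it imitates; the sketch is only meant to show what the table lookup is counting.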
When I converted the above MMX code to SSE2 it ran faster on my P4. Here's an SSE2 version of a line counter. It doesn't use the same algorithm as the previous one. See if you can figure out what it does
mov eax,0a0a0a0ah
movd xmm7,eax
mov edx,[pFilebuffer]
pshufd xmm7,xmm7,00000000b
mov ecx,[cbFile]
pxor xmm6, xmm6
shr ecx,11 ;divide by 128*16 = 2048 bytes per outer pass (CF = bit 10 of the length only)
jnc evenly_divisible
inc ecx
evenly_divisible:
@@:
pxor xmm5, xmm5
;i think it is the unrolling that is hosing things.
i = 0
WHILE i LT (16*128)
movdqa xmm0, [edx + i + 0]
movdqa xmm1, [edx + i + 16]
movdqa xmm2, [edx + i + 32]
movdqa xmm3, [edx + i + 48]
pcmpeqb xmm0, xmm7
pcmpeqb xmm1, xmm7
pcmpeqb xmm2, xmm7
pcmpeqb xmm3, xmm7
paddb xmm0, xmm1
paddb xmm2, xmm3
paddb xmm0, xmm2
;can't you do this step outside the loop? I am pretty sure you can.
psubb xmm5, xmm0 ; total 128*8 max = 1K
i=i+16*4
ENDM
; sum the 16 byte counters in XMM5 to get the count for this 2K block
pxor xmm0, xmm0
psadbw xmm5, xmm0
paddd xmm6, xmm5
dec ecx
lea edx, [edx + 128*16]
jne @B
movd eax, xmm6
psrldq xmm6,8 ;shift Right 8 bytes
movd ebx, xmm6
add eax, ebx
markl_CountFileLines ENDP
good luck
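One way to read the SSE2 listing, sketched in C (this is my interpretation, not the author's explanation, and count_lines_lanes, LANES and BLOCKS_PER_FLUSH are made-up names): keep one small byte-wide counter per lane, bump it for every matching byte, and only every 2048 bytes add the 16 counters into a wide total. Each lane gains at most 1 per 16-byte block, so after 128 blocks it holds at most 128 and cannot overflow a byte. The final lane sum corresponds to the psadbw step.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 16              /* one counter per byte lane of an XMM register */
#define BLOCKS_PER_FLUSH 128  /* 128 blocks * 16 bytes = 2048 bytes per outer pass */

size_t count_lines_lanes(const unsigned char *buf, size_t len)
{
    size_t total = 0;
    size_t i = 0;

    while (len - i >= LANES) {
        uint8_t lane[LANES] = {0};

        /* Inner pass: at most 128 blocks, so every lane stays <= 128. */
        for (int b = 0; b < BLOCKS_PER_FLUSH && len - i >= LANES; b++, i += LANES)
            for (int k = 0; k < LANES; k++)
                lane[k] = (uint8_t)(lane[k] + (buf[i + k] == 0x0A));

        /* Flush: add the 16 byte counters into the wide total. */
        for (int k = 0; k < LANES; k++)
            total += lane[k];
    }

    for (; i < len; i++)        /* tail shorter than one 16-byte block */
        total += (buf[i] == 0x0A);
    return total;
}
```

Unlike the listing, this sketch does not need the length to be a multiple of 2048 or the buffer to be 16-byte aligned.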
Better late than never, thanks for the help.
Want something to bend your mind? Try GPGPU, using the GPU to do your math. I looked into it a bit for a project and had to lie down until the pain went away...
http://www.gpgpu.org/
For a lighter subject matter look here:
http://www.df.lth.se/~john_e/gems.html
Edgar,
Looking through the list, I see this:
;
; fast strlen()
;
; input:
; eax = offset to string
;
; output:
; ecx = length
;
; destroys:
; ebx
; eflags
;
lea ecx,[eax-1]
l1: inc ecx
test ecx,3
jz l2
cmp [byte ptr ecx],0
jne l1
jmp l6
l2: mov ebx,[ecx] ; U
add ecx,4 ; V
test bl,bl ; U
jz l5 ; V
test bh,bh ; U
jz l4 ; V
test ebx,0ff0000h ; U
jz l3 ; V
test ebx,0ff000000h ; U
jnz l2 ; V +1brt
inc ecx
l3: inc ecx
l4: inc ecx
l5: sub ecx,4
l6: sub ecx,eax
What is your impression of this algo?
Paul
Hi Paul,
I have to admit that I have never really looked through any of the algorithms except the population count, which I needed for a project many moons ago. The one you posted would be faster than a byte scan, but it is not optimal: it has too many conditional jumps.
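As an illustration of how the per-byte jumps can be reduced, here is a C sketch of a well-known bit trick that tests all four bytes of a dword for zero with a single branch. It is not from the gems page, and has_zero_byte and strlen_words are names I made up. The sketch assumes the string lives in a buffer that extends at least to the next 4-byte boundary after the terminator, because it reads whole aligned dwords.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Nonzero exactly when at least one of the four bytes of v is zero. */
static uint32_t has_zero_byte(uint32_t v)
{
    return (v - 0x01010101u) & ~v & 0x80808080u;
}

/* strlen that scans aligned dwords. Assumes the buffer is readable up to
   the next 4-byte boundary past the terminating zero. */
size_t strlen_words(const char *s)
{
    const char *p = s;

    /* Byte scan until p is dword aligned, as in the posted routine. */
    while (((uintptr_t)p & 3) != 0) {
        if (*p == 0)
            return (size_t)(p - s);
        p++;
    }

    /* One test and one branch per dword instead of four. */
    for (;;) {
        uint32_t w;
        memcpy(&w, p, 4);       /* aligned 4-byte load */
        if (has_zero_byte(w))
            break;
        p += 4;
    }

    /* The zero is somewhere in this dword; find its exact position. */
    while (*p != 0)
        p++;
    return (size_t)(p - s);
}
```

The inner loop now has one conditional jump per four bytes; the byte-level search only runs once, on the final dword.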
(CUDA) GPU acceleration is a neat idea... and can produce some astonishing speed-ups. Unfortunately this also creates quite a hardware dependency. Still, check out the C-based SDK:
http://www.nvidia.com/object/cuda_learn.html