Taking assembly further

Started by RedXVII, August 09, 2006, 12:58:10 AM

RedXVII

Just want to ask some advice really.

I've done assembly for quite a while and I can write some pretty complex applications, though I don't fully know all of the opcodes (I haven't even touched MMX or SSE) and I haven't really played much with all the compiler options, like libraries and custom .data type sections. So, in order to further my assembly programming ability (so I can write code faster and more easily, or I might be tempted to switch to C  :( ), I was wondering if anyone had some more advanced working tutorials, specifically ones using perhaps opcodes I haven't yet used (just ones that aren't so common), libraries and other extra dongles you can get the compiler to do that I should know about (I only really know the instruction to compile a normal .exe), other areas like optimising, really small executables, memory management/accessibility, the format of PEs, and any other tricks or anything else you can think of.

There's bound to be something somewhere, but I keep coming across more and more newbie tutorials for assembly, so I figured I'd ask here what people recommend to further my ability.

Cheers  :U
RedXVII

gabor

Hello!

Reading your post, two things came to mind.
Optimizing, that is, making code smaller and/or faster, does not have to mean introducing the complex instructions of MMX, SSE or SSE2. Okay, if you are coding a complicated mathematical calculation such as vector and matrix operations, then those units can be used.
I have learnt a lot from this optimization tutorial by Mark Larson: http://www.mark.masmcode.com/
I had an astonishing experience when I managed to make a loop 7 times faster!!!  :eek by following 4 recommendations from this tutorial. (This was rather simple code; the tips I followed were: dword alignment, taking care of jump prediction, instruction pairing, and removing dependencies between instructions.)
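
As an illustration of the last tip (removing dependencies between instructions), here is a minimal sketch in C rather than MASM, so it is easy to compile and check. The function names are made up for this example and are not taken from Mark Larson's tutorial. The first loop forms one long chain where every add waits on the previous one; the second keeps four independent running sums that the CPU can overlap. Whether it is actually faster depends on the processor and on what the compiler already does, so measure before relying on it.

```c
#include <stddef.h>
#include <stdint.h>

/* One accumulator: each add depends on the result of the previous add. */
uint32_t sum_serial(const uint32_t *v, size_t n)
{
    uint32_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Four accumulators: the four adds in each iteration are independent
   of each other, so they do not have to wait on one another. */
uint32_t sum_split(const uint32_t *v, size_t n)
{
    uint32_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < n; i++)      /* leftover elements when n is not a multiple of 4 */
        s0 += v[i];

    return s0 + s1 + s2 + s3;
}
```

Both functions return the same sum for the same input; only the shape of the dependency chain differs.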

I'm sorry, on the matter of compilation and linking settings I can't say anything wise, but there are many people here who experiment with those switches.  :U

Good luck with your work!

Greets, Gábor

Mark_Larson

Quote from: RedXVII on August 09, 2006, 12:58:10 AM
I was wondering if anyone had some more advanced working tutorials, specifically ones using perhaps opcodes I haven't yet used (just ones that aren't so common), libraries and other extra dongles you can get the compiler to do that I should know about (I only really know the instruction to compile a normal .exe), other areas like optimising, really small executables, memory management/accessibility, the format of PEs, and any other tricks or anything else you can think of.

  Unfortunately, there are a lot of newbie tutorials and far less advanced material.

1) There is a bunch of example code for OpenGL and DirectX that you can dig up to learn from.  There's even an OpenGL forum on this board.

2) I think learning MMX/SSE/SSE2 programming is very important for writing highly optimized code.  I saw a great MMX routine that counts the number of lines in a file that is a lot faster than any ALU version.  A lot of people think you can only use MMX/SSE/SSE2 for math-type stuff, and that's simply not true.  They have a lot of uses.  If you learn more about them, you can take better advantage of them to get better-optimized code.  Intel also has libraries you can download that do cos/sin/tan/etc., since SSE/SSE2 don't have support for trig functions.

3) You can learn how to program the FPU; Raymond has a very in-depth tutorial online on the masm website.  You can also do scalar SSE/SSE2, which is a lot like programming the FPU: each instruction operates on a single piece of floating-point data instead of on multiple pieces of data at once.  I have a tutorial here that goes into detail about converting a VC++ FP app to scalar SSE.

http://www.masm32.com/board/index.php?topic=1140.0

4) Gabor already posted the link to my optimization page (thanks, Gabor!), but I also have several tutorials that go into optimizing specific algorithms.

Using SSE2 to speed up Quaternions
- http://www.oldboard.assemblercode.com/index.php?topic=3469.0

Optimizing Mersenne Twister Random Number Generator.  Agner Fog has the same routine optimized in assembler on his website.  My version is a bit over 3 times faster than Agner's.  I go into detail about how I sped it up.
- http://www.oldboard.assemblercode.com/index.php?topic=3565.0

MD5 SSE2 code used in the md5crk project.  The project was an open source distributed project trying to prove md5 is insecure.  The code was over 10 times faster than the C code.
- http://www.oldboard.assemblercode.com/index.php?topic=2921.0
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
http://www.website.masmforum.com/mark/index.htm

daydreamer

For the extra dongles you speak of, search the threads here for linker switches that result in a smaller .exe, and look in the MASM help file for the ml.exe switches, which can output an assembly listing with cycle counts. There are also advanced subjects such as writing your own manual stack handling to free up all eight registers.
On opcodes, you can learn what the different prefixes do; I even investigated what the prefixes do for MMX/SSE/SSE2.
MMX makes it easy to write your own personal filter for Photoshop, or to mess with sounds; only your imagination limits it.
The closest thing to shader programming is to process 4 floats at once with SSE and do a final conversion to the screen's integer format.

I combine hardware accelerated d3d9 with oop and SSE/MMX

ToutEnMasm

Hello,
I saw that Mark_Larson wrote in his post:

Quote
2) I think learning MMX/SSE/SSE2 programming is very important for writing highly optimized code.  I saw a great MMX routine that counts the number of lines in a file that is a lot faster than any ALU version.

Is there a downloadable sample somewhere of the "great MMX routine that counts the number of lines"?
ToutEnMasm

ic2

I went to the old forum and tried to register, but I get this error:

**Sorry, registration is currently disabled.**

How do you download files from the old forum if you never registered in the past?  I tried my masm32forum name and password and it did not work.

hutch--

I will have a look at that a bit later to see if it can be easily changed without leaving the forum open to postings.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

ic2

Thanks hutch--

Please do, but take your time.  I've been reading the Soap Box and other forums here and around the world... Hell, has the world gone MAD...  I understand your caution.

You've now got too much on your plate and no one wants to give you a break for some reason.  But WTH (you started it all for the sake of common sense back in the day).

That's the life of a true warrior, I guess.  As long as you are in good health, we are happy; if not, we will CODE it back up  :)

RedXVII is right, we need a controlled section or something for advanced assembler programmers moving on.  How about a detailed section containing only MMX coding for starters, if feasible, provided MMX has a future in upcoming processors?  I guess things are hard because you've got to see 30 - 100 years into the future at minimum, and who knows what OS and API snow job will stop programmers from seeing asm code for the sake of selling high-level programming tools to a future world of fools.  Hee hee.  Even the thought drives me nuts.

Thanks again

hutch--

What happened with the old forum was that the idiot fringe kept hacking it because it was an old phpBB2 forum, so after fixing it a few times I converted it to an SMF forum, but in the process the conversion did not handle the attachments properly. I have a link at the top that has an Apache listing of all the attachments that you can download; unfortunately, I cannot do much more than this.

trodon

@hutch--
Have you seen the new phpBB3?

Mark_Larson

Quote from: ToutEnMasm on August 09, 2006, 08:37:37 PM
Hello,
I saw that Mark_Larson wrote in his post:

Quote
2) I think learning MMX/SSE/SSE2 programming is very important for writing highly optimized code.  I saw a great MMX routine that counts the number of lines in a file that is a lot faster than any ALU version.

Is there a downloadable sample somewhere of the "great MMX routine that counts the number of lines"?
ToutEnMasm

count_table is a lookup table that holds the number of bits set in each byte value from 0 to 255.  Each loop iteration reads two 64-bit words, so it handles 16 bytes at a time.  The code looks for 0ah rather than 0dh so that it will also work under Linux (Linux uses 0ah as the line terminator in files, while Windows uses 0dh,0ah).


If you do a search on MMX/SSE/SSE2 on the old forum you will find lots of good stuff, a good bit isn't even math related :)



markl_CountFileLines proc near pszFile:ptr byte
  ; This routine just counts the lines in a file;
  ; in reality it just counts occurrences of the character 0Ah.

  mov eax,0a0a0a0ah
  movd mm7,eax
  pshufw mm7,mm7,00000000b

  xor eax,eax ; initialize line count to zero.
; pad pFilebuffer with 16 zero bytes at the end, because the file size might not be divisible by 16.
  mov edi,[pFilebuffer]
  mov esi,[cbFile]

  lea  esi,[esi+edi]

align 16 ; for P3 and below.
L1:
  movq mm0,[edi]
  movq mm1,[edi+8]
  add edi,16

  pcmpeqb mm0,mm7
  pcmpeqb mm1,mm7
  pmovmskb ebx,mm0
  pmovmskb ecx,mm1
;eax is running count.

  movzx  edx,bl
  movzx  ebx,cl

  add    eax,dword ptr [count_table+edx*4]
  add    eax,dword ptr [count_table+ebx*4]

  cmp edi,esi
  jb L1

  emms ; clear the MMX state before returning so later FPU code works
  RET
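
The routine above relies on count_table, which is not shown in the post. The following is a minimal sketch in C of how such a table could be built and used, under these assumptions: the helper names (init_count_table, newline_mask8, count_lines_table) are invented for this example, and newline_mask8 is only a scalar stand-in for what pcmpeqb followed by pmovmskb produces for 8 bytes. As in the assembly, the buffer length is assumed to be a multiple of 16, padded with zero bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* count_table[v] = number of set bits in the byte value v (0..255).
   Entries are 32 bits wide, matching the [count_table+edx*4] indexing. */
static uint32_t count_table[256];

void init_count_table(void)
{
    count_table[0] = 0;
    for (unsigned v = 1; v < 256; v++)
        count_table[v] = count_table[v >> 1] + (v & 1u);
}

/* Scalar stand-in for pcmpeqb + pmovmskb on 8 bytes:
   bit i of the result is set when p[i] is 0Ah. */
static unsigned newline_mask8(const unsigned char *p)
{
    unsigned mask = 0;
    for (unsigned i = 0; i < 8; i++)
        if (p[i] == 0x0A)
            mask |= 1u << i;
    return mask;
}

/* Counts 0Ah bytes, 16 bytes per iteration like the MMX loop.
   len must be a multiple of 16 (zero-padded buffer). */
uint32_t count_lines_table(const unsigned char *buf, size_t len)
{
    uint32_t lines = 0;
    for (size_t i = 0; i < len; i += 16) {
        lines += count_table[newline_mask8(buf + i)];
        lines += count_table[newline_mask8(buf + i + 8)];
    }
    return lines;
}
```

Call init_count_table() once before counting. The table turns each 8-bit match mask into a match count with a single load, which is why the loop needs no per-byte branch.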




When I converted the above MMX code to SSE2 it ran faster on my P4.  Here's an SSE2 version of a line counter.  It doesn't use the same algorithm as the previous one.  See if you can figure out what it does



  mov eax,0a0a0a0ah

  movd xmm7,eax
  mov edx,[pFilebuffer]
  pshufd xmm7,xmm7,00000000b
  mov ecx,[cbFile]

pxor xmm6, xmm6

add ecx,128*16-1 ;round up so a partial final block is counted too (the buffer must be zero-padded to a multiple of 128*16)
shr ecx,11 ;divide by 128*16

@@:
pxor xmm5, xmm5
;i think it is the unrolling that is hosing things.
i = 0
WHILE i LT (16*128)
movdqa xmm0, [edx + i +  0]
movdqa xmm1, [edx + i + 16]
movdqa xmm2, [edx + i + 32]
movdqa xmm3, [edx + i + 48]

pcmpeqb xmm0, xmm7
pcmpeqb xmm1, xmm7
pcmpeqb xmm2, xmm7
pcmpeqb xmm3, xmm7

paddb xmm0, xmm1
paddb xmm2, xmm3

paddb xmm0, xmm2

;can't you do this step outside the loop?  I am pretty sure you can. 
psubb xmm5, xmm0 ; total 128*8 max = 1K
i=i+16*4
ENDM

; add up the 16 byte counters in XMM5 to get the count for this block
pxor xmm0, xmm0
psadbw xmm5, xmm0
paddd xmm6, xmm5

dec ecx
lea edx, [edx + 128*16]
jne @B

movd eax, xmm6
psrldq xmm6,8 ;shift Right 8 bytes
movd ebx, xmm6
add eax, ebx
ret
markl_CountFileLines ENDP
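
For readers who want to check their answer to the puzzle, here is one reading of the SSE2 version, written as a scalar C model rather than real SSE2 code. The names count_lines_lanes, LANES and BLOCK are invented for this sketch. The idea it models: pcmpeqb produces 0FFh (that is, -1) in every byte that matches, so subtracting the compare result from a register of 16 byte-wide counters adds 1 to each matching lane. Each lane sees only 128 bytes per block, so a byte counter cannot overflow, and the lanes are summed once per block (the job psadbw does against a zero register). As in the assembly, the length is assumed to be a multiple of the block size, padded with zero bytes.

```c
#include <stddef.h>
#include <stdint.h>

#define LANES 16          /* bytes in one XMM register */
#define BLOCK (128 * 16)  /* bytes handled per outer loop iteration */

/* Scalar model of the SSE2 line counter.
   len must be a multiple of BLOCK (zero-padded buffer). */
uint32_t count_lines_lanes(const unsigned char *buf, size_t len)
{
    uint32_t total = 0;                      /* plays the role of xmm6 */

    for (size_t base = 0; base < len; base += BLOCK) {
        uint8_t lane[LANES] = {0};           /* plays the role of xmm5 */

        for (size_t i = 0; i < BLOCK; i += LANES) {
            for (size_t j = 0; j < LANES; j++) {
                /* pcmpeqb gives 0xFF on a match; subtracting 0xFF
                   from a byte counter is the same as adding 1. */
                uint8_t eq = (buf[base + i + j] == 0x0A) ? 0xFF : 0x00;
                lane[j] = (uint8_t)(lane[j] - eq);
            }
        }

        /* Each lane is at most 128 here, so nothing has wrapped.
           Summing the 16 bytes is what psadbw against zero provides. */
        for (size_t j = 0; j < LANES; j++)
            total += lane[j];
    }
    return total;
}
```

The point of the design is that the expensive horizontal sum happens once per 2048 bytes instead of once per 16 bytes.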


good luck

ToutEnMasm


Better late than never, thanks for the help.

donkey

Want something to bend your mind? Try GPGPU, using the GPU to do your math. I looked into it a bit for a project and had to lie down until the pain went away...

http://www.gpgpu.org/

For a lighter subject matter look here:

http://www.df.lth.se/~john_e/gems.html
"Ahhh, what an awful dream. Ones and zeroes everywhere...[shudder] and I thought I saw a two." -- Bender
"It was just a dream, Bender. There's no such thing as two". -- Fry
-- Futurama

Donkey's Stable

PBrennick

Edgar,

Looking through the list, I see this:


;
; fast strlen()
;
; input:
;   eax = offset to string
;
; output:
;   ecx = length
;
; destroys:
;   ebx
;   eflags
;

        lea     ecx,[eax-1]
l1:     inc     ecx
        test    ecx,3
        jz      l2
        cmp     [byte ptr ecx],0
        jne     l1
        jmp     l6
l2:     mov     ebx,[ecx]       ; U
        add     ecx,4           ;   V
        test    bl,bl           ; U
        jz      l5              ;   V
        test    bh,bh           ; U
        jz      l4              ;   V
        test    ebx,0ff0000h    ; U
        jz      l3              ;   V
        test    ebx,0ff000000h  ; U
        jnz     l2              ;   V +1brt
        inc     ecx
l3:     inc     ecx
l4:     inc     ecx
l5:     sub     ecx,4
l6:     sub     ecx,eax


What is your impression of this algo?
Paul
The GeneSys Project is available from:
The Repository or My crappy website

donkey

Hi Paul,

I have to admit that I have never really looked through any of the algorithms except the population count, which I needed for a project many moons ago. The one you posted would be faster than a byte scan, but it would not be optimal: too many conditional jumps.
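
One common way to cut down the conditional jumps in a dword-at-a-time strlen is to test all four bytes of a word with a single expression. The sketch below is in C and uses the well-known bit trick `(v - 0x01010101) & ~v & 0x80808080`, which is nonzero exactly when some byte of v is zero. This is a different technique from the posted routine, not a claim about what donkey had in mind, and the names has_zero_byte and strlen_swar are invented for this example. Like the posted assembly, it reads whole aligned dwords and so may read up to 3 bytes past the terminator, which strict C does not guarantee to be valid; treat it as an illustration only.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Nonzero exactly when at least one of the four bytes of v is zero. */
static uint32_t has_zero_byte(uint32_t v)
{
    return (v - 0x01010101u) & ~v & 0x80808080u;
}

size_t strlen_swar(const char *s)
{
    const char *p = s;

    /* Step byte by byte until p is 4-byte aligned. */
    while (((uintptr_t)p & 3u) != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        p++;
    }

    /* One test and one branch per 4 bytes. */
    for (;;) {
        uint32_t v;
        memcpy(&v, p, sizeof v);   /* aligned 4-byte load */
        if (has_zero_byte(v))
            break;
        p += 4;
    }

    /* The terminator is somewhere in this dword; find the exact byte. */
    while (*p != '\0')
        p++;
    return (size_t)(p - s);
}
```

Compared with the posted routine, the main loop has a single conditional jump per dword instead of four; the byte-level search only runs once, on the final dword.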