Raycasting Engine

Started by LithiumDex, January 19, 2007, 12:53:10 AM


LithiumDex

Thanks,

Although my previous raycaster was stable, I couldn't get more than a steady 40fps out of it on my 1GHz machine. But contrary to what my good friend thinks (i.e. that floats are just as fast as fixed point these days), in my last release, where I converted just the raycasting loop from floating point to fixed point, I gained about 10fps. That's not a huge difference, but it's notable.

As far as this engine goes, I'm thinking about creating a checkpoint now and at some point adapting it to allow for variable wall heights (i.e. Doom). Because this requires checking beyond the closest ray intersection point, it will be slower, and I haven't figured out exactly how to implement raised floors with speed, but it's something to think about anyway.

Back to the current work, though: getting the right perspective with the floors/ceilings, and having it fast, is an issue. I noticed Permadi's tutorial doesn't cover this in much detail. If I remember correctly, they were the biggest slowdown in my first engine... (although it was entirely fixed point, with LUTs where possible, I think I still had a mul/div for each pixel)

daydreamer

Quote from: j_groothu on January 23, 2007, 03:27:50 PM
No cracks here, and a bit faster too, I think. :cheekygreen: Let me explain a little why I can appreciate exploring soft rendering like this. The tendency these days is, quite justifiably, to chuck as much of the graphics processing as possible out to the specialised GPU via DirectX or some other abstracted mechanism. What this does is save you from reinventing the wheel, and it hides some of the gory mechanisms you are exploring (on purpose).

But there are some of us who would like to understand, say, how a line is drawn, and even try our own implementation of Bresenham's algorithm. I think DirectX IS designed for games programming, not for understanding and learning how graphics algorithms actually work. I think exploring your own implementation will better equip you to do the same thing using DirectX (or any other available API), because you'll have a better understanding of the data structures and processing going on within the black box.
No, you can put up a quad and render with programmable pixel shaders; I can run RTRT at 20 fps. But raycasting is also designed for that special case of squares at 90 degrees, and the GPU's programmable 128-bit instructions make SSE look like crap, because it has so many built-in matrix functions etc. that SSE lacks. Look at some shaders and the circle is closed: they are inspired by the old assembly water-ripple shaders, fire shaders, etc.
I would think that using both the GPU and CPU to their full potential, rather than leaving the 2+ GHz CPU idle (and perhaps handling a few mouse events), would be worth a try.
Use the GPU to its full potential; you want the CPU for all the AI etc., unless you want to go for an asm RTRT demo.

[Later: i.e. perhaps a hybrid asmcast + Direct3D: Direct3D (GPU) for hardware textures etc., asmcast for the projection and collision physics (CPU). Could be quite a setup with some potential.]
Jason

I was rewriting my homebrew raycaster to 80 rays, preparing it for an experiment: only cast a few rays and let Direct3D draw whole textured/lit/3D-transformed walls and objects, trace through transparent windows, store where they hit, and blend a window on top afterwards.

j_groothu

Quote: No, you can put up a quad and render with programmable pixel shaders; I can run RTRT at 20 fps. But raycasting is also designed for that special case of squares at 90 degrees, and the GPU's programmable 128-bit instructions make SSE look like crap, because it has so many built-in matrix functions etc. that SSE lacks. Look at some shaders and the circle is closed: they are inspired by the old assembly water-ripple shaders, fire shaders, etc.
....
Use the GPU to its full potential; you want the CPU for all the AI etc., unless you want to go for an asm RTRT demo.
....
I was rewriting my homebrew raycaster to 80 rays, preparing it for an experiment: only cast a few rays and let Direct3D draw whole textured/lit/3D-transformed walls and objects, trace through transparent windows, store where they hit, and blend a window on top afterwards.

Nice approach, that; I'll certainly be looking more into pixel shaders myself (a lot to learn there). I assume it would avoid the bandwidth limitations that would be dominant with my older ATI AGP 8x card, and free up the CPU some more, as you mention. Taking a closer look at some of the demos with source on the ATI website is an eye-opener for me (I like the terrain one that uses some interesting data structures); I hadn't really considered that my old Radeon 9550 was capable of that.

Jason

LithiumDex

Alright, here's today's work:

http://lithium.zext.net/asmcast.zip



I've designed it so you can use any power-of-two texture size, up to 1024. There is also a basic blit function and a BMP load function...
the BMP function only loads 24-bit uncompressed bitmaps, and the blit function doesn't do transparency or clipping.
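For anyone curious what those constraints amount to, here is a minimal Python sketch of parsing a 24-bit uncompressed BMP. The engine's actual loader is MASM32 assembly; the function name and return layout here are made up for illustration.

```python
import struct

def load_bmp24(data: bytes):
    """Parse a 24-bit uncompressed BMP from a byte string.
    Returns (width, height, rows); rows are bottom-up as stored in the
    file, each pixel a (b, g, r) tuple."""
    if data[:2] != b"BM":
        raise ValueError("not a BMP file")
    pixel_offset, = struct.unpack_from("<I", data, 10)
    width, height = struct.unpack_from("<ii", data, 18)
    bpp, compression = struct.unpack_from("<HI", data, 28)
    if bpp != 24 or compression != 0:
        raise ValueError("only 24-bit uncompressed bitmaps supported")
    stride = (width * 3 + 3) & ~3          # rows are padded to 4-byte multiples
    rows = []
    for y in range(height):
        base = pixel_offset + y * stride
        rows.append([tuple(data[base + 3*x : base + 3*x + 3])
                     for x in range(width)])
    return width, height, rows
```

The two details that usually bite are visible here: each row is padded up to a multiple of 4 bytes, and pixels are stored bottom-up in BGR order.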

Also, this screenshot shows the engine running at 640x480 resolution.

LithiumDex

I stayed up all night trying to code a floor-caster that wouldn't have any muls in the innermost loop. My first attempt failed, so I went back to my old method, and in the process found some other issues and fixed them... and now I think if I tried my optimization again it might work...

Anyway, I've recoded some small, non-speed-critical parts of the engine using the FPU... I was a little wary of learning it, but it's not so bad now.

So, there's now a textured floor in the demo. Right now it's just one texture repeated, but it won't be hard to extend it to a floor map,
and as for the ceiling, it's just a matter of copy and paste with a little editing.

I think I will do depth shading next... then sprites... then movable blocks and doors... and somewhere in between I will write a map editor (and improve the existing 2D drawing functions).

You can download this version here: http://lithium.zext.net/asmcast.zip

A screenie:


But since this is the laboratory (and not the workshop, as you would think from these last two posts), I'll briefly discuss my floor-mapping algorithms...

The current algorithm
For each frame, the distance from the origin for each y is computed and stored in a LUT.
The wall slices are drawn a column at a time, scaling the directional vector for that x by the distance stored in the LUT for each y, adding the camera position, and finding the map coordinates accordingly... So basically there are two muls for each pixel (bad, but how bad?)
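As a sketch of that per-pixel cost, one floor pixel under the current scheme looks like this in Python (the engine itself is assembly, and the names here are illustrative). The two vector-scale muls are the ones under discussion; the power-of-two texture size at least turns the coordinate wrap into a cheap AND:

```python
import math

def floor_texel(cam_x, cam_y, dir_x, dir_y, dist, tex_size=64):
    """Map one floor pixel: scale this column's direction vector by the
    LUT distance for this y, add the camera position, then wrap into a
    power-of-two texture."""
    world_x = cam_x + dir_x * dist        # mul #1
    world_y = cam_y + dir_y * dist        # mul #2
    # in fixed point the tex_size scale is a shift, and with a
    # power-of-two size the wrap is an AND rather than a modulo
    u = math.floor(world_x * tex_size) & (tex_size - 1)
    v = math.floor(world_y * tex_size) & (tex_size - 1)
    return u, v
```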

The faster algorithm
I'm not entirely sure it will work, but this algorithm is based on my assumption that for every row of mapped texture on the screen, the difference in map coordinates from (screen) x to x+1 will be the same as from x+1 to x+2, x+2 to x+3, and so on...

So I would, for each y:
Calculate the distance and position of the left-most ray (x=0) for this y
Calculate the distance and position of the right-most ray (x=screen width) for this y
Take the difference of those two vectors and divide by the screen width to normalize

Then set the map coordinate to the position of the left-most ray, increment it by that vector, increment x by 1, get the texture coordinate, draw the pixel, and loop until x = screen width.

EDIT: I couldn't get my mind off it... so I tried it, and it worked. The entire problem with it wasn't actually a problem with it; in fact it worked fine in the first place. It was another part of my engine that was making it look incorrect... now that that's fixed, it works great... that makes this all-nighter worthwhile ;)
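The equivalence behind the assumption can be checked directly. Here is a Python sketch of both versions (hypothetical names; the engine itself is assembly): along a row the distance is constant, so scaling interpolated ray directions per pixel gives the same world coordinates as stepping by one precomputed delta.

```python
def floor_row_direct(cam, left_dir, right_dir, dist, width):
    """Per-pixel version: interpolate the ray direction across the row
    and scale by this row's distance -- two muls for every pixel."""
    out = []
    for x in range(width):
        t = x / width
        dx = left_dir[0] + (right_dir[0] - left_dir[0]) * t
        dy = left_dir[1] + (right_dir[1] - left_dir[1]) * t
        out.append((cam[0] + dx * dist, cam[1] + dy * dist))
    return out

def floor_row_incremental(cam, left_dir, right_dir, dist, width):
    """Faster version: compute the row's two endpoint positions, divide
    their difference by the screen width, then just add that step per
    pixel -- no muls left in the inner loop."""
    x0, y0 = cam[0] + left_dir[0] * dist, cam[1] + left_dir[1] * dist
    x1, y1 = cam[0] + right_dir[0] * dist, cam[1] + right_dir[1] * dist
    step_x, step_y = (x1 - x0) / width, (y1 - y0) / width
    out, wx, wy = [], x0, y0
    for _ in range(width):
        out.append((wx, wy))
        wx += step_x
        wy += step_y
    return out
```

The direct version expands to cam + left*dist + (right-left)*dist*(x/width), which is exactly the incremental version's start-plus-x-steps, so the two only differ by floating-point accumulation error.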

LithiumDex

Added depth shading today... It uses LUTs totalling about 192KB of memory... the only unfortunate part is that it requires 3 bytes to be read separately for each pixel from the LUT (which is actually three LUTs). Is mov'ing a byte faster or slower than a dword?
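To make the question concrete, here is a hypothetical Python sketch of that layout. The thread doesn't give the exact dimensions, but one 256x256 byte table per channel matches the 192KB total; in the engine these would be raw byte tables, hence the three separate byte reads per pixel.

```python
def build_shade_lut(levels=256):
    """table[shade][value]: an 8-bit colour value scaled by a shade
    level. One such table per channel: 3 x 256 x 256 bytes = 192KB."""
    return [[value * shade // (levels - 1) for value in range(256)]
            for shade in range(levels)]

def shade_pixel(lut, r, g, b, shade):
    # three separate per-channel lookups -- one byte read each
    return lut[shade][r], lut[shade][g], lut[shade][b]
```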

Anyway, here's a screenshot:


And the download link is the same:
http://lithium.zext.net/asmcast.zip

j_groothu

Me being a relative noob at the optimisation thing, I would have to say it depends on the instructions around it. You might only get 1 byte or word move per cycle (or 2...), or you might be able to squeeze out several per cycle. If you could point out a particular critical code section and list it here (perhaps an inner loop or something), someone much, much better than me would likely point you in the right direction. Have you taken a look at Agner Fog's optimisation docs yet?

Jason,

P.S. Wish I had more time to play with this at the moment myself, oh well back to study

stanhebben

One dword operation is faster than three byte operations; try using 32 bits per pixel.

daydreamer

I was trying to change it to use 1024x1024 textures, but it breaks the skymapping and outer walls.
No fps counter? It could be useful when you let many people with different CPUs test it. I was interested to see if it slowed down much if you use a different hi-res texture for each and every block.

LithiumDex

Unfortunately I can't use one dword operation... I was trying to set it up as such, but it would require an enormous LUT, unless I sacrificed a lot of colour depth.

My plan of attack now is to decrease the colour depth by a factor of four, and the number of depth levels by a factor of two... which will decrease the size of my LUT from 192KB to 24KB. Then I will have it allocated as static memory, as opposed to dynamic.
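A quick sanity check on those numbers (assuming a byte per entry and three per-channel tables; the exact table layout isn't spelled out in the thread):

```python
# hypothetical layout: 3 channels x 256 depth levels x 256 colour values
channels, depth_levels, colour_values = 3, 256, 256
before = channels * depth_levels * colour_values            # bytes
# colour depth / 4, depth levels / 2:
after = channels * (depth_levels // 2) * (colour_values // 4)
```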

As for the 1024x1024 textures... I theorized that it would be possible, but I didn't test it... (oops). I'll have to run some tests on that and see if I can fix it.

Oh, and: http://lithium.zext.net/asmcast_test.zip -- press F to get a MessageBox with the framerate. A warning, though: the frame limiter is turned off, so be careful not to run outside of the level, or it will crash.

Rockoon


Avoid floats in the casting simply because of the rounding errors, and not just because they are less efficient. Absolutely take advantage of power-of-two sizes and power-of-two fixed point (Signed 15.16 fixed point would be my choice)
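Rockoon's suggestion can be sketched in Python (the real thing would be asm with imul/idiv and shifts; the helper names here are made up): a 15.16 value keeps 16 fractional bits, a multiply needs the full-width product shifted back down, and a divide needs the dividend pre-shifted up.

```python
FRAC_BITS = 16
ONE = 1 << FRAC_BITS              # 1.0 in signed 15.16 fixed point

def to_fixed(f):
    return int(round(f * ONE))

def from_fixed(v):
    return v / ONE

def fixed_mul(a, b):
    # full-width product, then shift back down; in x86 asm this is
    # imul (64-bit result in edx:eax) followed by shrd
    return (a * b) >> FRAC_BITS

def fixed_div(a, b):
    # pre-shift the dividend up before dividing (shld + idiv in asm)
    return (a << FRAC_BITS) // b
```

Power-of-two sizes pay off here too: wrapping a 15.16 coordinate into a 64-texel texture is just a shift and an AND, with no rounding surprises.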

The casting shouldn't be a big efficiency issue with only ScreenWidth casts per frame (maybe double that if you allow for some mirrored walls, something I've never seen in a raycaster but certainly possible). If you have (or are expecting) a lot of wide-open spaces in the voxel map, then you might want to store the closest distance to the next voxel (so you can skip several at a time), but I don't think that will be a major benefit.
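For anyone following along, the casting being costed here is the standard grid DDA: one ray per screen column, advancing to the next cell boundary each step until a solid cell is hit. A Python sketch of a single cast (not LithiumDex's actual code):

```python
import math

def cast_ray(grid, px, py, angle, max_dist=64.0):
    """Step a ray through a square tile map with a DDA.
    Returns the distance to the first solid cell, or None."""
    dx, dy = math.cos(angle), math.sin(angle)
    map_x, map_y = int(px), int(py)
    delta_x = abs(1.0 / dx) if dx else float("inf")
    delta_y = abs(1.0 / dy) if dy else float("inf")
    if dx < 0:
        step_x, side_x = -1, (px - map_x) * delta_x
    else:
        step_x, side_x = 1, (map_x + 1.0 - px) * delta_x
    if dy < 0:
        step_y, side_y = -1, (py - map_y) * delta_y
    else:
        step_y, side_y = 1, (map_y + 1.0 - py) * delta_y
    while True:
        # advance to whichever grid boundary is closer
        if side_x < side_y:
            dist = side_x
            side_x += delta_x
            map_x += step_x
        else:
            dist = side_y
            side_y += delta_y
            map_y += step_y
        if dist > max_dist:
            return None
        if grid[map_y][map_x]:
            return dist
```

With only ScreenWidth iterations of this per frame, each terminating at the first wall, the cast itself is cheap next to the per-pixel strip rendering.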

The rendering of the strips to your backbuffer (system RAM, I assume) could take advantage of the SSE2 "streaming store" movnti instruction. Additionally, you could be prefetchnta'ing the texels as you run down the strip.

The biggest issue is going to be the presentation method. There are lots of choices, but none of them should be much better than another, because you are going across the AGP/PCI bus. I'd stick with GDI and DIBSections or CreateBitmap() myself, simply because DirectDraw doesn't have the greatest support anymore, and DirectX/OpenGL is way overcomplicated with little benefit for a simple presentation routine. SDL is an option, but you will run into the same DirectDraw issues, since that's what it'll be using on the back end (unless SDL uses OpenGL on Windows now? It's been a while...)

When C++ compilers can be coerced to emit rcl and rcr, I *might* consider using one.

daydreamer

Quote from: Rockoon on February 02, 2007, 09:45:31 PM
The biggest issue is going to be the presentation method. There are lots of choices, but none of them should be much better than another, because you are going across the AGP/PCI bus. I'd stick with GDI and DIBSections or CreateBitmap() myself, simply because DirectDraw doesn't have the greatest support anymore, and DirectX/OpenGL is way overcomplicated with little benefit for a simple presentation routine. SDL is an option, but you will run into the same DirectDraw issues, since that's what it'll be using on the back end (unless SDL uses OpenGL on Windows now? It's been a while...)

Why not an OpenGL/DX solution? Let the CPU cast the rays and tell the GPU what coordinates and UV coordinates to use to render each of 640 quads. The latest GPUs have support for bump mapping, which is what you need for realistic-looking brick walls.
Or you could render to system RAM, upload it as a texture from memory, and turn on all the antialiasing, texture filtering, trilinear filtering, etc.

Rockoon

Quote from: daydreamer on February 12, 2007, 05:48:57 AM

Why not an OpenGL/DX solution? Let the CPU cast the rays and tell the GPU what coordinates and UV coordinates to use to render each of 640 quads. The latest GPUs have support for bump mapping, which is what you need for realistic-looking brick walls.
Or you could render to system RAM, upload it as a texture from memory, and turn on all the antialiasing, texture filtering, trilinear filtering, etc.


Because there is little advantage to using OpenGL/DX besides having cleaner control over screen resolution. Each ray is a point sample, not an area sample, so it would be hard to generate quad texture vertices from the raycast data.

daydreamer

Quote from: Rockoon on February 12, 2007, 08:07:33 PM
Quote from: daydreamer on February 12, 2007, 05:48:57 AM

Why not an OpenGL/DX solution? Let the CPU cast the rays and tell the GPU what coordinates and UV coordinates to use to render each of 640 quads. The latest GPUs have support for bump mapping, which is what you need for realistic-looking brick walls.
Or you could render to system RAM, upload it as a texture from memory, and turn on all the antialiasing, texture filtering, trilinear filtering, etc.


Because there is little advantage to using OpenGL/DX besides having cleaner control over screen resolution. Each ray is a point sample, not an area sample, so it would be hard to generate quad texture vertices from the raycast data.


You can tell the hardware to point-sample the texture.
Are you confusing it with ray tracing, where each ray is a point sample? Each ray results in rendering a 1-pixel-wide slice of the wall; you just address the texture with a value between 0.0f and 1.0f instead of his crappy 0-63 int. Vertically, you could set the tiled-texture flag and use 0.0f at the top and 6.0f at the bottom, which means the texture repeats 6 times.
The x values could be initialized for 640 tiles, while ytop and ybottom are set differently from the usual vertical size, which is what makes it pseudo-3D.

Rockoon

Quote from: daydreamer on February 17, 2007, 02:43:27 PM

You can tell the hardware to point-sample the texture.
Are you confusing it with ray tracing, where each ray is a point sample? Each ray results in rendering a 1-pixel-wide slice of the wall; you just address the texture with a value between 0.0f and 1.0f instead of his crappy 0-63 int. Vertically, you could set the tiled-texture flag and use 0.0f at the top and 6.0f at the bottom, which means the texture repeats 6 times.
The x values could be initialized for 640 tiles, while ytop and ybottom are set differently from the usual vertical size, which is what makes it pseudo-3D.


I think you are confusing screen space with texel space.

Each ray cast relates to an infinitely thin world space strip, which is used as a point sample for a 1 pixel wide strip of screen space, and this will not map linearly to a 1-pixel anything in texel space.

In texel space that strip could represent an area 1 texel wide, 100 texels wide, or 1/100 texels wide, or any other value based on distance, scaling, and orientation factors.

The end result being that you are making slow draw calls to hardware (yes, draw calls and state changes are slow... that's why we batch geometry together) to do exactly what you would be doing rapidly in software... and without any of the control software gives.

Now if you wanted to raycast on the other side of the draw call (such as with displacement mapping), then I could probably agree with you.


As for his "crappy" 0..63... if he is using a fixed-point-to-voxel resolution of 64:1, then no amount of GPU tricks will get him (or you) more resolution from it... 64 points per voxel, and that's it.