Raytracing, Integer / FP conversion

Started by Mark_Larson, September 16, 2008, 09:39:02 PM


Integer / FP conversion

Postby phkahler on Tue Jul 08, 2008 7:59 am
I'm considering changing my octree traversal code to use 64-bit integer math. I assume this will make it faster, no? It should even increase precision in most cases. The issue I see is that ray generation will still be done in double precision and converted to floats (about 6 conversions). Also, the primitive intersection tests will remain FP, so their results will have to be compared to the integer values used in structure traversal. I could convert each intersection distance to an integer. How bad are FP->integer and integer->FP conversions on Intel and AMD processors these days?

Another idea would be to convert to something like 32.32 fixed point throughout the code, but that would be another story altogether, and would not have the same effect I'm expecting from the traversal code.

For testing I may also render the leaf node of the octree rather than doing any primitive intersections at all. This would still leave ray setup in FP, with conversions.

So how slow are conversions? Any other thoughts on this before I try it?

Thanks,
Paul
--Paul
The OctTree Guy
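[An illustrative sketch, not from the post, of what the 32.32 fixed-point idea above could look like with int64_t throughout; the fix_* names are made up:]

```c
#include <stdint.h>

typedef int64_t fix32_32;               /* 32 integer bits, 32 fraction bits */

/* hypothetical helpers -- the scale factor is 2^32 = 4294967296 */
static fix32_32 fix_from_double(double d) { return (fix32_32)(d * 4294967296.0); }
static double   fix_to_double(fix32_32 f) { return (double)f / 4294967296.0; }

/* Traversal-style operations stay pure integer: add, subtract, compare,
   and a right shift to halve a cell size during octree descent. */
static fix32_32 fix_half(fix32_32 f) { return f >> 1; }
```

Addition, subtraction, and comparison work directly on the raw int64_t values; only multiplication would need extra care, which is why it suits traversal-only code.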


Re: Integer / FP conversion

Postby ingenious on Tue Jul 08, 2008 8:34 am
Well, it is slow :( Actually, that's what I'm dealing with right now. Previously I used an SDL surface as a frame buffer, and SDL surfaces are always integer. This meant that I had to convert the (float, float, float) colors from my ray tracer to a single integer before writing them to the framebuffer. On my latest-generation mobile Core 2, writing to the framebuffer became a real bottleneck at some point. After measuring, I found out that generating a sample on the image plane, initializing a color to black and writing it to the SDL frame buffer took 25ms. This essentially meant that I was limited to 40fps, excluding the ray tracing itself!

Now I've switched to a simple float frame buffer (still using the SDL window) and sending it to the GPU with glDrawPixels. The above operations now take 7ms. The only single difference between the two versions is that in the previous one I converted the float values to an integer before writing to the framebuffer (even the loops and everything are completely the same). Now the conversion is done on the GPU and writing to the frame buffer is not that expensive any more. So beware - it can really impact your performance... much!
http://ompf.org/forum/rss.php - the best RTRT resource on the web 8)
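[For reference, the kind of per-pixel (float, float, float) -> int conversion being described might look like this illustrative sketch (clamping to [0,1] omitted) -- three float->int conversions per pixel, which is what the switch to a float framebuffer avoids:]

```c
#include <stdint.h>

/* Pack three [0,1] floats into one 0xRRGGBB integer. Each channel costs
   one float->int conversion, paid once per pixel per frame. */
static uint32_t pack_rgb(float r, float g, float b) {
    uint32_t ri = (uint32_t)(r * 255.0f + 0.5f);
    uint32_t gi = (uint32_t)(g * 255.0f + 0.5f);
    uint32_t bi = (uint32_t)(b * 255.0f + 0.5f);
    return (ri << 16) | (gi << 8) | bi;
}
```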


Re: Integer / FP conversion

Postby gfyffe on Tue Jul 08, 2008 5:12 pm
Well, you don't have to do it too often right? When you create your ray, you can precompute the integer information needed for the tree traversal. You only need to update this information when you actually hit a primitive, in which case you update the FP information and recompute some of the integer information. So the number of FP->int conversions depends only on how many primitives you actually hit during the traversal. I wouldn't worry about it :D
- Graham Fyffe


Re: Integer / FP conversion

Postby Michael77 on Tue Jul 08, 2008 11:43 pm

    ingenious wrote:Now I've switched to a simple float frame buffer (still using the SDL window) and sending it to the GPU with glDrawPixels. The above operations now take 7ms.



Which resolution? I think the bottleneck is glDrawPixels, as it is always pretty slow. Writing directly to a PBO and drawing a single quad is a lot faster (if I remember correctly, by about a factor of 3 or something like that). Also, using PBOs gives you more control over what format you use, since 8-bit RGBA integer is slower to upload than 8-bit BGRA integer.

On the original question: quite a lot of integer operations are still missing from SSE, or require multiple instructions. Integer multiplication, for example, doesn't work the same as in normal C++, since multiplying two 32-bit integers yields a 64-bit result. So I would stick to float/double.
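[For illustration, not from the thread: the usual SSE2 workaround for the missing 32-bit low-half multiply, built from pmuludq (_mm_mul_epu32), which is exactly the 32x32 -> 64-bit multiply described above. SSE4.1 later added pmulld to do this in one instruction.]

```c
#include <emmintrin.h>

/* Emulate a per-lane 32-bit low-half multiply with SSE2 only:
   pmuludq multiplies lanes 0 and 2, producing two 64-bit products,
   so we do even and odd lanes separately and re-interleave the low
   32 bits of each product. */
static __m128i mullo_epi32_sse2(__m128i a, __m128i b) {
    __m128i even = _mm_mul_epu32(a, b);                   /* lanes 0,2 */
    __m128i odd  = _mm_mul_epu32(_mm_srli_si128(a, 4),
                                 _mm_srli_si128(b, 4));   /* lanes 1,3 */
    even = _mm_shuffle_epi32(even, _MM_SHUFFLE(0, 0, 2, 0)); /* lows -> slots 0,1 */
    odd  = _mm_shuffle_epi32(odd,  _MM_SHUFFLE(0, 0, 2, 0));
    return _mm_unpacklo_epi32(even, odd);
}
```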


Re: Integer / FP conversion

Postby ingenious on Wed Jul 09, 2008 2:51 am

    Michael77 wrote:

        ingenious wrote:Now I've switched to a simple float frame buffer (still using the SDL window) and sending it to the GPU with glDrawPixels. The above operations now take 7ms.



    Which resolution? I think the bottleneck is glDrawPixels, as it is always pretty slow. Writing directly to a PBO and drawing a single quad is a lot faster (if I remember correctly, by about a factor of 3 or something like that). Also, using PBOs gives you more control over what format you use, since 8-bit RGBA integer is slower to upload than 8-bit BGRA integer.



Yes, you are right. It is slow; I noticed that afterwards. The thing is that I don't measure the time it takes to display the framebuffer :) (ideally I should, of course). My point was that removing the (float, float, float) -> int conversion made writing to the frame buffer a whole lot faster. The resolution is 1024x1024 pixels. So a million conversions like these on my Core 2 @ 2.6GHz take about 18ms (I haven't tried any SIMD conversion tricks).
http://ompf.org/forum/rss.php - the best RTRT resource on the web 8)


Re: Integer / FP conversion

Postby toxie on Wed Jul 09, 2008 3:23 am

    Michael77 wrote:Also, using PBOs gives you more control about what format you use since 8-Bit RGBA Integer (8Bit Int) is slower to upload than 8-Bit BGRA Integer.



is this valid for nvidia AND ati/amd?
what about FP formats? is it still slower (bandwidth)?
I say destroy the cosmos, ask questions later.


Re: Integer / FP conversion

Postby Michael77 on Wed Jul 09, 2008 4:47 am

    toxie wrote:is this valid for nvidia AND ati/amd?
    what about FP formats? is it still slower (bandwidth)?



Sadly, I don't know about AMD/ATI. There was a whitepaper available from Nvidia (I can't find it anymore after they changed their developer page) that compares different upload (or, as they call it, download) rates. In general: FP formats should be RGBA, integer should be BGRA, for whatever reason.


Re: Integer / FP conversion

Postby toxie on Wed Jul 09, 2008 8:05 am
that's why i like GPUs..





not.. ;)
I say destroy the cosmos, ask questions later.


Re: Integer / FP conversion

Postby tbp on Wed Jul 09, 2008 8:51 am
... and I thought such topics (conversions, upload etc...) were already discussed to death...
Better than chasing outdated whitepapers, why don't you just measure?
"Never try to teach a pig to sing. It wastes time and annoys the pig." M. Twain
radius | ompf | stuff


Re: Integer / FP conversion

Postby syoyo on Wed Jul 09, 2008 9:14 am

    phkahler wrote: How bad are FP->integer and integer->FP conversion on Intel and AMD processors these days?



SSE2 provides the cvttpd2pi (double -> uint64_t) and cvtpi2ps (uint64_t -> double) instructions (float versions are also provided), and according to Intel's and AMD's manuals these instructions have a throughput of 1 cycle (latency is less than 10 cycles) on Core 2 and K10.

And recent compilers, for example gcc with -mfpmath=sse, use these instructions for FP <-> int conversion.

Thus, FP <-> int conversion is extremely fast on recent Intel and AMD x86 CPUs.
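[A small illustration of the point above: with SSE math (the default on x86-64, or gcc -mfpmath=sse on 32-bit), each of these plain C casts compiles to a single cvttsd2si, which truncates toward zero regardless of rounding mode:]

```c
#include <stdint.h>

/* Each cast is one cvttsd2si under SSE math -- no x87 rounding-mode
   save/restore dance. Conversion truncates toward zero. */
static int32_t to_i32(double d) { return (int32_t)d; }
static int64_t to_i64(double d) { return (int64_t)d; }  /* 64-bit form on x86-64 */
```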


Re: Integer / FP conversion

Postby phkahler on Wed Jul 09, 2008 11:59 am

    syoyo wrote:SSE2 provides the cvttpd2pi (double -> uint64_t) and cvtpi2ps (uint64_t -> double) instructions (float versions are also provided), and according to Intel's and AMD's manuals these instructions have a throughput of 1 cycle (latency is less than 10 cycles) on Core 2 and K10.

    And recent compilers, for example gcc with -mfpmath=sse, use these instructions for FP <-> int conversion.

    Thus, FP <-> int conversion is extremely fast on recent Intel and AMD x86 CPUs.



That is almost the best possible answer I could have hoped for (except the latency). I tell GCC to use SSE3 these days, so I should be all set.

To Michael77: since I'm only going to convert the traversal code to integer, there will be no multiplications. Only addition, subtraction, comparison, and right shift (single bit).

About 8 conversions to int will be needed before traversal of each ray, and then each primitive intersection result will need to be converted for comparisons. I predict the resulting performance will be somewhere between spectacular and disappointing :-)
--Paul
The OctTree Guy


Re: Integer / FP conversion

Postby ingenious on Thu Jul 10, 2008 11:41 am

    syoyo wrote:And recent compiler, for example gcc -mfpmath=sse, uses this instruction for FP <-> Int conversion.

    Thus, FP <-> Int conversion is extremely fast in recent Intel's and AMD's x86 CPUs.



Do you know if there's a similar option for ICC or MSVC?
http://ompf.org/forum/rss.php - the best RTRT resource on the web 8)


Re: Integer / FP conversion

Postby syoyo on Sat Jul 12, 2008 8:17 am
To phkahler:

    phkahler wrote:

        syoyo wrote:SSE2 provides the cvttpd2pi (double -> uint64_t) and cvtpi2ps (uint64_t -> double) instructions (float versions are also provided), and according to Intel's and AMD's manuals these instructions have a throughput of 1 cycle (latency is less than 10 cycles) on Core 2 and K10.

        And recent compilers, for example gcc with -mfpmath=sse, use these instructions for FP <-> int conversion.

        Thus, FP <-> int conversion is extremely fast on recent Intel and AMD x86 CPUs.



    That is almost the best possible answer I could have hoped for (except the latency). I tell GCC to use SSE3 these days, so I should be all set.



Latency can probably be hidden by pipelining and out-of-order execution.

To ingenious:

    ingenious wrote:Do you know if there's a similar option for ICC or MSVC?



I don't know the specific option for these compilers, but they should have such an optimization flag, since using SSE2 instead of x87 for floating-point operations and math functions is the default behavior in AMD64 (EM64T) environments.
Anyway, x87 is a source of hell; we are encouraged to use SSE2 FP whenever possible (in 32-bit environments).


Re: Integer / FP conversion

Postby phkahler on Wed Aug 27, 2008 7:40 pm
I just thought I'd follow up on this since I finally did some measurements.
All timings are on an AMD64 2GHz single core (socket 939 dual core, but only one is used),
running the incoherent bunny benchmark with an octree for acceleration.

1) stock double-precision FP hits 1.39M rays per second.
2) change to just tracing the voxels (leaf nodes of the octree) results in 2.99M rays per second !!!!
3) slight change to avoid multiplication in traversal gives about 2.83M rps.
4) now switching to pure integer traversal gives 2.73M rps.... bah!

So then I'm poking around and think - gee I don't need to normalize direction vectors since I'm not shading....

5) drop the normalization. 3.0M rps integer performance.
6) back to step 2 plus elimination of the normalization: 3.21Mrps.

So using 64bit integer traversal got me nothing and along the way I discovered about 7 to 10 percent of my tracing time is in this:

Code: Select all
        inline void Normalize() {
            double l = 1.0 / sqrt(e1*e1 + e2*e2 + e3*e3);
            e1 *= l;
            e2 *= l;
            e3 *= l;
        }



That's in my vec3 class and is called once per ray. All values are doubles. It could be deferred until shading, but for indoor scenes it's going to happen every ray anyway. I'm using GCC 4.3 with SSE3. Does anyone have a drop-in replacement for this function?
--Paul
The OctTree Guy


Re: Integer / FP conversion

Postby tbp on Wed Aug 27, 2008 9:39 pm
Using my mad divination skills, i'd say it's all dominated by the (r)sqrt latency. You could use an approximate rsqrt+NR to mostly get that done a tad faster, if only those were floats, not doubles (no rsqrtsd, Intel has gripes about symmetry).
So, you're left with plan B which is about batching and/or hiding that (r)sqrt latency (ie doing it early enough that it's ready for consumption on the spot).
PS: just for the kick, try 'float l = 1/std::sqrt(float(e1*e1 + e2*e2 + e3*e3));' with -ffast-math -mrecip ;)
PS²: also, see that mildly relevant thread viewtopic.php?f=11&t=333
"Never try to teach a pig to sing. It wastes time and annoys the pig." M. Twain
radius | ompf | stuff
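[A sketch of the rsqrt+NR plan above for the float case (illustrative only; as noted, there is no rsqrtsd for doubles). The estimate from rsqrtss is ~12 bits; one Newton-Raphson step r' = r * (1.5 - 0.5 * x * r * r) refines it to roughly float precision:]

```c
#include <xmmintrin.h>

/* Approximate 1/sqrt(x): rsqrtss estimate plus one Newton-Raphson step. */
static float rsqrt_nr(float x) {
    float r = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));  /* ~12-bit estimate */
    return r * (1.5f - 0.5f * x * r * r);                  /* one NR refinement */
}
```

To normalize a vector with it, compute rsqrt_nr(e1*e1 + e2*e2 + e3*e3) once and scale the three components, avoiding the full-latency divide and sqrt.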


Re: Integer / FP conversion

Postby Mark_Larson on Thu Aug 28, 2008 5:07 am

    syoyo wrote:

        phkahler wrote: How bad are FP->integer and integer->FP conversion on Intel and AMD processors these days?



    SSE2 provides cvttpd2pi(double -> uint64_t) and cvtpi2ps(uin64_t -> double) instruction(float version is also provided), and according to Intel's and AMD's manual, these instruction has just 1 cycle / throughput(Latency is less than 10 cycles) in case of Core2 and K10.

    And recent compiler, for example gcc -mfpmath=sse, uses this instruction for FP <-> Int conversion.

    Thus, FP <-> Int conversion is extremely fast in recent Intel's and AMD's x86 CPUs.



You are using the wrong instruction: you want cvtpi2pd; you accidentally used the one for converting packed int to packed float. On my Core 2 Duo it has a 4-cycle latency and 1-cycle throughput. The fastest way to pump it out using GCC is to break the buffer up into 4 groups, each doing one CVT to a different XMM register. That breaks up the stalls and allows 3 of the instructions to execute in parallel on the Core 2 Duo.

Excuse any bugs in my code. It's 5:30 in the morning here and I haven't woken up yet :) I am only doing the conversion part of the loop. I am not sure what the intrinsics are, so I am going to use pseudocode for that. Can someone time this and see how fast it is? If not, I'll do it once I wake up. It was optimized for the Core 2 Duo.

Code: Select all
    int arr_i[1024*1024];
    double arr_d[1024*1024];

    for (int t = 0; t < 1024*1024; t += 16) {
        // load 16 integers, 4 per XMM register;
        // the independent loads should execute in parallel
        const __m128i value1 = load arr_i[t..t+3]
        const __m128i value2 = load arr_i[t+4..t+7]
        const __m128i value3 = load arr_i[t+8..t+11]
        const __m128i value4 = load arr_i[t+12..t+15]

        // one CVT per register, so up to 3 of these independent
        // instructions can execute in parallel on the Core 2
        const __m128d cvt1 = cvtpi2pd value1
        const __m128d cvt2 = cvtpi2pd value2
        const __m128d cvt3 = cvtpi2pd value3
        const __m128d cvt4 = cvtpi2pd value4

        // store the resulting doubles
        store arr_d[t..t+15] = cvt1, cvt2, cvt3, cvt4
    }

BIOS programmers do it fastest ;)


Re: Integer / FP conversion

Postby Mark_Larson on Thu Aug 28, 2008 5:13 am

    phkahler wrote:I just thought I'd follow up on this since I finally did some measurements.
    All timings are on a AMD64 2GHz single core (socket 939 dual core but only one is used)
    Running the incoherent bunny benchmark with an octree for acceleration.

    1) stock double precisions FP hits 1.39M rays per second.
    2) change to just tracing the voxels (leaf nodes of the octree) results in 2.99M rays per second !!!!
    3) slight change to avoid multiplication in traversal gives about 2.83M rps.
    4) now switching to pure integer traversal gives 2.73M rps.... bah!

    So then I'm poking around and think - gee I don't need to normalize direction vectors since I'm not shading....

    5) drop the normalization. 3.0M rps integer performance.
    6) back to step 2 plus elimination of the normalization: 3.21Mrps.

    So using 64bit integer traversal got me nothing and along the way I discovered about 7 to 10 percent of my tracing time is in this:

    Code: Select all
            inline void Normalize() {

            double l;
               l = 1.0/sqrt(e1*e1 + e2*e2 + e3*e3);
               e1 *= l;
               e2 *= l;
               e3 *= l;
            }



    That's in my vec3 class and is called once per ray. All values are doubles. It could be deferred until shading, but for indoor scenes it's going to happen every ray anyway. I'm using GCC 4.3 with SSE3. Does anyone have a drop-in replacement for this function?




You can also try SSE2's SQRTPD, which lets you do two double-precision square roots in parallel. Timings for it on my Core 2 Duo:

make sure the value is in an XMM register when you do the sqrtpd
latency 6-58 cycles for two of them, so that comes to 3-29 for one
reciprocal throughput 6-58 cycles

For a normal fsqrt the latency is
6-69, so as you can see it's best to go for the SSE2 version. It is more than twice as fast.
BIOS programmers do it fastest ;)
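[In intrinsics form, a minimal sketch of the two-at-a-time square root described above (the SSE2 mnemonic is sqrtpd, exposed as _mm_sqrt_pd); the sqrt2 wrapper name is made up:]

```c
#include <emmintrin.h>

/* One sqrtpd computes two double-precision square roots at once. */
static void sqrt2(const double in[2], double out[2]) {
    _mm_storeu_pd(out, _mm_sqrt_pd(_mm_loadu_pd(in)));
}
```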


Re: Integer / FP conversion

Postby Mark_Larson on Thu Aug 28, 2008 5:38 pm
I got it converted, but I have a bug. I am sure it's in the way I cast. arr_d was an array, but I couldn't get it to compile, so I allocated memory instead.

on the first store_pd I get

    asf.cpp:323: error: cannot convert 'double' to 'double*' for argument '1' to 'void _mm_store_pd(double*, double __vector__)'



the second store_pd actually compiles but I get a segmentation fault. I verified that my allocated array was 16 byte aligned. I also added an exit(1) after the second store_pd, to see if it would just execute it once. And it still died. any ideas on how to get it to work? I have two doubles in a XMM register I want to write to memory. The code WITHOUT the store_pd runs in 0.375558973 cycles / int. So the code is running 3 instructions in parallel for every group of 3.

it's probably something easy. My gcc intrinsic ability is very new. I also never used it when I had Windows.

Code: Select all
        for ( i = 0, d2 = 0; i < 1024*1024; i += 16, d2 += 16) {
        //16 integers at a time
        //the next 3 lines should execute in parallel
    //movdqa
    //__m128i _mm_load_si128 ( __m128i *p)
               const __m128i value1 = _mm_load_si128( (__m128i *)arr_i);
               const __m128i value2 = _mm_load_si128( (__m128i *)arr_i[i+4]);
               const __m128i value3 = _mm_load_si128( (__m128i *)arr_i[i+8]);
       //the next 3 lines should execute in parallel
               const __m128i value4 = _mm_load_si128( (__m128i *)arr_i[i+12]);

    // do 4 shifts.
    //__m128i _mm_srli_si128 ( __m128i a, int imm)
             const   __m128i shift1 = _mm_srli_si128 ( value1, 64);
             const   __m128i shift2 = _mm_srli_si128 ( value2, 64);
       //the next 3 lines should execute in parallel
             const   __m128i shift3 = _mm_srli_si128 ( value3, 64);
             const   __m128i shift4 = _mm_srli_si128 ( value4, 64);


    //__m128d _mm_cvtepi32_pd(__m128i a)
             const   __m128d cvt1 = _mm_cvtepi32_pd(value1);         //converts 2 values at a time.
       //the next 3 lines should execute in parallel
             const   __m128d cvt2 = _mm_cvtepi32_pd(value2);         //converts 2 values at a time.
             const   __m128d cvt3 = _mm_cvtepi32_pd(value3);         //converts 2 values at a time.
             const   __m128d cvt4 = _mm_cvtepi32_pd(value4);         //converts 2 values at a time.
       //the next 3 lines should execute in parallel
             const   __m128d cvt5 = _mm_cvtepi32_pd(shift1);         //converts 2 values at a time.
             const   __m128d cvt6 = _mm_cvtepi32_pd(shift2);         //converts 2 values at a time.
             const   __m128d cvt7 = _mm_cvtepi32_pd(shift3);         //converts 2 values at a time.
       //the next 3 lines should execute in parallel
             const   __m128d cvt8 = _mm_cvtepi32_pd(shift4);         //converts 2 values at a time.

    //_mm_store_pd(double *p, __m128 a)
    //arr_d is a pointer to a double array.  I also tried just doing the array, but I get the same problem.
          _mm_store_pd(arr_d[d2], cvt1);
          _mm_store_pd(arr_d, cvt2);
        }
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
http://www.website.masmforum.com/mark/index.htm
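[A corrected sketch of the loop above, for reference. The compile error and the segfault both come from passing values where the intrinsics expect pointers: _mm_store_pd wants a double* (&arr_d[d2], not arr_d[d2]), and _mm_load_si128 needs the address (__m128i *)&arr_i[i+4] rather than the int value arr_i[i+4] cast to a pointer. Also, _mm_srli_si128 shifts by bytes, so moving the top two ints down is a shift of 8, not 64. The 4-way unrolling is omitted for clarity; arrays are assumed 16-byte aligned.]

```c
#include <emmintrin.h>

#define N (1024 * 1024)

/* Convert N 32-bit ints to doubles: each __m128i holds 4 ints, and
   _mm_cvtepi32_pd converts the low two, so the upper pair is shifted
   down by 8 *bytes* and converted separately. */
static void convert(const int *arr_i, double *arr_d) {
    for (int i = 0; i < N; i += 4) {
        __m128i v  = _mm_load_si128((const __m128i *)&arr_i[i]);
        __m128i hi = _mm_srli_si128(v, 8);                 /* bytes, not bits */
        _mm_store_pd(&arr_d[i],     _mm_cvtepi32_pd(v));   /* ints 0,1 */
        _mm_store_pd(&arr_d[i + 2], _mm_cvtepi32_pd(hi));  /* ints 2,3 */
    }
}
```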