Paper on Performance Optimization

Neo · April 12, 2010, 04:43:31 AM

Hey guys! Long time no see, but I figured I'd let you know about a paper I recently submitted for peer review, about the optimizations I did to get a 12x speedup on the AQUA@Home quantum computer simulations. A whole lot of painstaking assembly went into it, especially having 6 platforms to support. :lol

Importance of Explicit Vectorization for CPU and GPU Software Performance

It basically walks through several optimizations some of you are probably familiar with, along with a very funky way of vectorizing Metropolis Monte Carlo simulations, and shows the performance at different levels of optimization. It also features an Intel Core i7 outperforming an NVIDIA GTX-285 by a factor of 2 when both are running their respective best-optimized code versions.

I'm expecting it to be fairly controversial, as most papers comparing CPU vs. GPU don't optimize the CPU code and thus the GPU wins hands down, and similarly, most people don't believe me when I tell them that a 10x speedup is often possible on the CPU on top of multi-threading. Hopefully it doesn't get outright rejected by the reviewers. :U
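
For anyone wondering how a Metropolis sweep can be vectorized at all, here is a minimal sketch of one textbook approach, the checkerboard update. This is not necessarily the scheme used in the paper, which isn't described in this thread. It assumes a 1D Ising ring with coupling J = 1; the names `half_sweep`, `p_cost4`, `N`, and the little xorshift generator are all made up for illustration. On the ring, a flip can only change the energy by -4, 0, or +4, so the caller passes in the acceptance probability for the +4 case (exp(-4*beta)) instead of calling exp() per spin.

```c
#include <assert.h>
#include <stdint.h>

#define N 64  /* number of spins on the ring; must be even */

/* Tiny xorshift32 generator so the sketch is self-contained (hypothetical choice). */
static uint32_t rng_state = 2463534242u;

static float rand_uniform(void) {
    rng_state ^= rng_state << 13;
    rng_state ^= rng_state >> 17;
    rng_state ^= rng_state << 5;
    return (float)(rng_state >> 8) * (1.0f / 16777216.0f);  /* in [0, 1) */
}

/* Attempt a Metropolis flip on every site of one parity (0 = even, 1 = odd).
   Sites of the same parity are never neighbours on an even-length ring, so no
   iteration of the second loop reads a spin that another iteration writes.
   That independence is what lets the loop be mapped onto SIMD lanes.
   p_cost4 is the probability of accepting a flip that raises the energy by 4. */
static void half_sweep(int *spin, int parity, float p_cost4) {
    assert(parity == 0 || parity == 1);

    /* Draw all random numbers up front so the update loop has no serial state. */
    float r[N / 2];
    for (int k = 0; k < N / 2; k++)
        r[k] = rand_uniform();

    for (int k = 0; k < N / 2; k++) {
        int i = 2 * k + parity;
        int left = spin[(i + N - 1) % N];
        int right = spin[(i + 1) % N];
        int dE = 2 * spin[i] * (left + right);        /* energy change if spin i flips */
        int accept = (dE <= 0) | (r[k] < p_cost4);    /* branch-free accept test */
        spin[i] = accept ? -spin[i] : spin[i];
    }
}
```

A full sweep is `half_sweep(spin, 0, p)` followed by `half_sweep(spin, 1, p)`. A hand-vectorized version would replace the second loop with SIMD compares and blends over several sites at once; the plain C here only shows why that is legal.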

dedndave · April 12, 2010, 04:57:32 AM

well - from what i know, the paper looks very nice :P
i will try to struggle through it
i wonder what Larrabee and his buddies will say :bg
i guess he was being gently swept aside, anyways

Neo · April 12, 2010, 05:34:05 AM

Thanks, and feel free to let me know if you have any questions. If there's something you don't understand from it, it's unlikely that miscellaneous professors will understand it. Plus, it will probably need revisions before it gets accepted anyway. :wink

We had a 48-core machine running our project at one point and we wondered if it was someone at Intel testing a Larrabee, or just someone faking their system specs. :lol

hutch-- · April 12, 2010, 08:34:24 AM

Looks good Neo, I confess to knowing little of the architecture of Nvidia video cards but your general theory and testing methods hang together well.

Neo · April 12, 2010, 02:44:16 PM

Thanks! The NVIDIA architecture is complicated... very complicated... like 3 or more levels of manually-managed memory spaces and 3 or more levels of parallelism complicated. It makes me wonder why people thought general-purpose computing on a GPU was a good idea.

Some detail:
Each GPU (one of their cards has 2 GPUs) has "device global memory", "constant memory", and around 30 "multi-processors", each of which has its own separate 16KB "shared memory" (ironically not shared) space and 8 "streaming processors", each of which is 4x hyper-threaded. There's no cache, except on newer Fermi cards, and it's only like 32KB of shared memory on those cards. If a group of 32 threads, a "warp", running on a multi-processor tries to access non-adjacent locations in any of the memory spaces at the same time, there's a huge penalty, analogous to the CPU doing scalar accesses of main memory vs. vector access of cache.

In other words... pretty darn complicated to manage. :lol
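
To put rough numbers on the non-adjacent access penalty, here is a small sketch in plain C that counts how many aligned memory segments a 32-thread warp touches when thread t reads element t * stride. It is a simplified model, not NVIDIA's actual coalescing rules (those differ between GPU generations), and the function name `segments_touched` and the 128-byte segment size used in the example are assumptions for illustration. It also assumes the array starts on a segment boundary and that the segment size is a multiple of the element size.

```c
#include <assert.h>
#include <stddef.h>

#define WARP_SIZE 32  /* threads per warp, as described above */

/* Count the distinct aligned segments touched when thread t of a warp reads
   element t * stride from an array of elem_bytes-sized items.
   Addresses increase with t, so a new segment shows up exactly when the
   segment index changes from one thread to the next. */
static int segments_touched(size_t stride, size_t elem_bytes, size_t segment_bytes) {
    assert(stride > 0 && elem_bytes > 0 && segment_bytes >= elem_bytes);

    int count = 0;
    size_t last = (size_t)-1;  /* sentinel: no segment seen yet */
    for (size_t t = 0; t < WARP_SIZE; t++) {
        size_t seg = (t * stride * elem_bytes) / segment_bytes;
        if (seg != last) {
            count++;
            last = seg;
        }
    }
    return count;
}
```

With 4-byte floats and 128-byte segments, `segments_touched(1, 4, 128)` gives 1 (the whole warp is served by one segment), while `segments_touched(32, 4, 128)` gives 32 (every thread needs its own segment). In this model, that ratio is the kind of gap the penalty comes from.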

baltoro · April 14, 2010, 12:57:49 AM

Years ago, I used to search the Los Alamos Pre-Print Archive.
It's really great if you need some highly technical input, but are a complete moron.

Neo · April 14, 2010, 01:10:23 AM

It looks like it's just a mirror of arxiv.org now. Was it something else back then?

baltoro · April 14, 2010, 11:08:34 PM

...It IS a mirror site,...
But, several years ago, the Los Alamos site was a unique archive. Looks like they've amalgamated.
Anyway, GREAT site to search. Very entertaining stuff,...

Neo · April 19, 2010, 12:53:14 AM

The paper's been rejected without review for not being enough about larger-scale parallelism. They do publish papers showing supposed good performance results from GPUs, which they consider parallel enough, but evidently they won't even review something showing a less-parallel CPU beating a more-parallel GPU. ::)

Anyone know of a journal that'll publish papers on real performance optimization instead of just papers that follow marketing hype?

hutch-- · April 19, 2010, 01:04:32 AM

That's bad luck as your paper is fundamentally sound. Allowing for the hype and assumptions of reviewers, could you tweak the content to use a higher thread/core count?

dedndave · April 19, 2010, 01:12:14 AM

that's a shame, Neil - we were all pulling for ya

Hutch may have the right idea
pick the processor manufacturer that best suits your needs, showing why that architecture is superior :U
throw their name in there a couple times and they will have it all over the web in 1 day

may be greasing the wheels a bit, but be prepared to entertain a job offer :bdg

Neo · April 19, 2010, 01:26:55 AM

Quote from: hutch-- on April 19, 2010, 01:04:32 AM
That's bad luck as your paper is fundamentally sound. Allowing for the hype and assumptions of reviewers, could you tweak the content to use a higher thread/core count?

Nah, it didn't even get to reviewers. Two editors of the journal say it's not suitable for their journal as the paper's focus isn't on large-scale parallelism, so I'm unlikely to do anything but annoy them by resubmitting it. The journal is called "Parallel Computing" after all, so they're reasonable to want papers on computation with more than 16 CPU cores; I'm just a bit miffed that they do accept papers on GPU computing, when GPU cores are cores in name only. The main editor did at least call it a "worthwhile contribution".

@dedndave: Ironically enough, within 24 hours of posting the paper on arxiv, I got a recruitment email from Intel. It may not be related at all, but I thought it was pretty funny. :wink

hutch-- · April 19, 2010, 01:34:35 AM

They tend to snap up talent if they can find it so the paper may have done more for you than you bargained for. Intel are starting production of mixed CPU/GPU chips these days so if they offer to pave your way to Oregon with gold, it may be hard to resist. :bg

dedndave · April 19, 2010, 02:10:10 AM

even if those types of job offers are not accepted, they can be used as leverage to boost your salary with the current employer
you just have to "let it slip out" to a couple select employees :P
when i worked at MA-COM, one of the guys got 2 raises by walking around the hallways with a Motorola job app sticking out of his pocket :bg

Rockoon · June 02, 2010, 11:31:50 PM

That 48-core machine would be an AMD Opteron system with:

4 cpus, 12 cores each (the Opteron 6168 solution)

or

8 cpus, 6 cores each (the Opteron 8435 solution)

It's probably one of the same guys that submitted benchmark results to http://www.cpubenchmark.net/multi_cpu.html
