Hey guys! Long time no see, but I figured I'd let you know about a paper I recently submitted for peer review on the optimizations I did to get a 12x speedup on the AQUA@Home (http://aqua.dwavesys.com/) quantum computer simulations. A whole lot of painstaking assembly went into it, especially with 6 platforms to support. :lol
Importance of Explicit Vectorization for CPU and GPU Software Performance (http://arxiv.org/ftp/arxiv/papers/1004/1004.0024.pdf)
It basically walks through several optimizations some of you are probably familiar with, along with a very funky way of vectorizing Metropolis Monte Carlo simulations, and shows the performance at different levels of optimization. It also features an Intel Core i7 outperforming an NVIDIA GTX-285 by a factor of 2 when both are running their respective best-optimized code versions.
I'm expecting it to be fairly controversial, as most papers comparing CPU vs. GPU don't optimize the CPU code and thus the GPU wins hands down, and similarly, most people don't believe me when I tell them that a 10x speedup is often possible on the CPU on top of multi-threading. Hopefully it doesn't get outright rejected by the reviewers. :U
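In case anyone here hasn't played with SIMD intrinsics: below is a tiny sketch of what "explicit vectorization" looks like in C-style code, processing four floats per instruction instead of one. It's purely illustrative and has nothing to do with the paper's actual Monte Carlo code; the function names are made up for this post.

#include <xmmintrin.h>  /* SSE intrinsics */

/* Scalar version: one multiply-add per loop iteration. */
void saxpy_scalar(float a, const float *x, float *y, int n)
{
    for (int i = 0; i < n; i++)
        y[i] = a*x[i] + y[i];
}

/* Explicitly vectorized version: four multiply-adds per iteration.
   Assumes n is a multiple of 4 and that x and y are 16-byte aligned. */
void saxpy_sse(float a, const float *x, float *y, int n)
{
    __m128 va = _mm_set1_ps(a);          /* broadcast a into all 4 lanes */
    for (int i = 0; i < n; i += 4)
    {
        __m128 vx = _mm_load_ps(x + i);  /* load 4 floats from x */
        __m128 vy = _mm_load_ps(y + i);  /* load 4 floats from y */
        _mm_store_ps(y + i, _mm_add_ps(_mm_mul_ps(va, vx), vy));
    }
}

A compiler can often auto-vectorize a loop that simple, but once there are branches, table lookups, or shuffles involved (as in a Metropolis inner loop), you generally end up writing the SIMD code by hand, which is where the extra factor on top of multi-threading comes from.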
well - from what i know, the paper looks very nice :P
i will try to struggle through it
i wonder what Larrabee and his buddies will say :bg
i guess he was being gently swept aside, anyways
Thanks, and feel free to let me know if you have any questions. If there's something you don't understand from it, it's unlikely that miscellaneous professors will understand it. Plus, it will probably need revisions before it gets accepted anyway. :wink
We had a 48-core machine running our project at one point and we wondered if it was someone at Intel testing a Larrabee, or just someone faking their system specs. :lol
Looks good Neo. I confess to knowing little of the architecture of Nvidia video cards, but your general theory and testing methods hang together well.
Thanks! The NVIDIA architecture is complicated... very complicated... like 3 or more levels of manually-managed memory spaces and 3 or more levels of parallelism complicated. It makes me wonder why people thought general-purpose computing on a GPU was a good idea.
Some detail:
Each GPU (one of their cards has 2 GPUs) has "device global memory", "constant memory", and around 30 "multi-processors", each of which has its own separate 16KB "shared memory" space (ironically not shared with the other multi-processors) and 8 "streaming processors", each of which is 4-way hyper-threaded. There's no cache, except on the newer Fermi cards, and even those only have something like 32KB of it, carved out of the same on-chip block as the shared memory. If a group of 32 threads, a "warp", running on a multi-processor tries to access non-adjacent locations in any of the memory spaces at the same time, there's a huge penalty, analogous to the CPU doing scalar accesses of main memory vs. vector accesses of cache.
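To make that last point concrete, here's a toy pair of CUDA kernels (not from our simulation code, just an illustration I'm throwing in): in the first, adjacent threads of a warp touch adjacent floats, so the hardware can merge the accesses into a few wide transactions; in the second, a stride pushes the addresses apart and each thread ends up paying for its own transaction.

__global__ void copy_coalesced(const float *in, float *out, int n)
{
    /* Adjacent threads -> adjacent addresses: accesses coalesce. */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    /* Adjacent threads -> addresses 'stride' floats apart: little or no coalescing. */
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        out[i] = in[i];
}

Same instruction count in both kernels, wildly different memory throughput, and you have to lay your data out so the first pattern is even possible.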
In other words... pretty darn complicated to manage. :lol
Years ago, I used to search the Los Alamos Pre-Print Archive (http://xxx.lanl.gov/).
It's really great if you need some highly technical input, but are a complete moron.
It looks like it's just a mirror of arxiv.org now. Was it something else back then?
It IS a mirror site...
But, several years ago, the Los Alamos site was a unique archive. Looks like they've amalgamated.
Anyway, GREAT site to search. Very entertaining stuff...
The paper's been rejected without review for not being enough about larger-scale parallelism. They do publish papers showing supposedly good performance results from GPUs, which they consider parallel enough, but evidently they won't even review something showing a less-parallel CPU beating a more-parallel GPU. ::)
Anyone know of a journal that'll publish papers on real performance optimization instead of just papers that follow marketing hype?
That's bad luck, as your paper is fundamentally sound. Allowing for the hype and assumptions of reviewers, could you tweak the content to use a higher thread/core count?
that's a shame, Neil - we were all pulling for ya
Hutch may have the right idea
pick the processor manufacturer that best suits your needs, showing why that architecture is superior :U
throw their name in there a couple times and they will have it all over the web in 1 day
may be greasing the wheels a bit, but be prepared to entertain a job offer :bdg
Quote from: hutch-- on April 19, 2010, 01:04:32 AM
That's bad luck, as your paper is fundamentally sound. Allowing for the hype and assumptions of reviewers, could you tweak the content to use a higher thread/core count?
Nah, it didn't even get to reviewers. Two of the journal's editors say it isn't suitable because the paper's focus isn't on large-scale parallelism, so I'm unlikely to do anything but annoy them by resubmitting it. The journal is called "Parallel Computing", after all, so they're reasonable to want papers on computation with more than 16 CPU cores; I'm just a bit miffed that they do accept papers on GPU computing, when GPU cores are cores in name only. The main editor did at least call it a "worthwhile contribution".
@dedndave: Ironically enough, within 24 hours of posting the paper on arxiv, I got a recruitment email from Intel. It may not be related at all, but I thought it was pretty funny. :wink
They tend to snap up talent if they can find it so the paper may have done more for you than you bargained for. Intel are starting production of mixed CPU/GPU chips these days so if they offer to pave your way to Oregon with gold, it may be hard to resist. :bg
even if those type of job offers are not accepted, they can be used as leverage to boost your salary with the current employer
you just have to "let it slip out" to a couple select employees :P
when i worked at MA-COM, one of the guys got 2 raises by walking around the hallways with a Motorola job app sticking out of his pocket :bg
That 48-core machine would be an AMD Opteron system with:
4 CPUs, 12 cores each (the Opteron 6168 solution)
or
8 CPUs, 6 cores each (the Opteron 8435 solution)
It's probably one of the same guys that submitted benchmark results to http://www.cpubenchmark.net/multi_cpu.html
Cool rigs. :U
An update on this paper: It's now been sitting with an editor at Journal of Parallel and Distributed Computing for 6 weeks still not being reviewed. I think (though I'm just speculating) he doesn't want it published but doesn't have a legitimate argument against it, so he's just making us wait until we get fed up and retract it. :( At least it sounds like the journal manager is none too pleased about the situation.
Quote from: Neo on June 03, 2010, 06:55:59 AM
An update on this paper: It's now been sitting with an editor at Journal of Parallel and Distributed Computing for 6 weeks still not being reviewed.
Neo, don't despair. One of my papers has been lying around for three years, and is now in print. Another one was received about one year ago, and I am pretty sure it will be accepted. Six weeks is a *very* short delay :P
Quote from: jj2007 on June 03, 2010, 07:38:23 AM
Quote from: Neo on June 03, 2010, 06:55:59 AM
An update on this paper: It's now been sitting with an editor at Journal of Parallel and Distributed Computing for 6 weeks still not being reviewed.
Neo, don't despair. One of my papers has been lying around for three years, and is now in print. Another one was received about one year ago, and I am pretty sure it will be accepted. Six weeks is a *very* short delay :P
Six weeks is a short delay for reviewers, but all the editor has to do is pick a few names and send the paper to those people. We even sent a list of possible reviewers after 4 weeks, in case he's legitimately looking for reviewers. Barring any catastrophes, this editor has simply been refusing to send the paper out for 6 weeks. A second, much less significant, paper we sent to the same journal 2 weeks ago was sent out for review after about 4 days. :(
To further put it in perspective, another paper we had on a big improvement to our preprocessing (http://arxiv4.library.cornell.edu/abs/1004.2840) was fully reviewed 2 days after submitting it. Another one (http://arxiv.org/abs/1004.0023) took a few months, which is more the norm, but it was still sent out for review after only a couple days.
Speaking of cool rigs, AMD has recently released an affordable line of 6-core processors. Currently there are just two models, the $200 1055T at 2.8GHz and the $295 1090T at 3.2GHz. Both are socket AM3 processors.
The 1090T has an unlocked multiplier while the 1055T does not, but both are overclockable. Many reviewers have pushed the 1055T to a stable 4GHz using the stock heatsink and fan that comes with the processor. The advantage of the unlocked multiplier of the 1090T is that you can overclock the processor without also overclocking the system bus.
That $200 price point of the 1055T is very attractive IMHO... you can put together a very nice system for under $500 (~$100 motherboard, ~$100 memory, ~$100 case and power supply).
Hmm... it'll still be tough for them to compete with the 8-logical-core Intel Core i7's, which you can get for $260. The Core i7's also made a big leap in per-core performance, so hopefully AMD has something to counter.
Actually, the $280 i7 860's and 920's are ranking on par with the $200 1055T in benchmarks.
The i7 930 does rank a bit higher, but that's $290.
The 1090T beats all of these for $295.
http://www.cpubenchmark.net/high_end_cpus.html
REJECTED!... again, after another 5 months of waiting ::)
Looks like academia is fine with sticking its head in the sand. One reviewer said the optimizations were obvious, but if that were true, countless authors of published papers and books would be guilty of academic fraud for blatantly ignoring them.
:tdown
I am sorry to hear that, but I have a cynical view that academia is a plot by the obscure to try and look profound in the absence of talent. Mediocrity abounds in a world where being seen to be politically correct is more important than delivering in terms of performance or output.
Find another context to publish your designs and ideas and don't allow the idiot fringe access to it.