I've ported my NTHashTickler program to NVIDIA CUDA... poorly.
As before, my motivation for writing this program is not because I really care about finding NT hashes per se, but just teaching myself more about multithreaded programming and in this case, NVIDIA CUDA. If you know anything at all about CUDA, please feel free to criticize my work and tell me what I'm doing wrong. (Source code is on Github.)
So CUDA is basically standard C with a few extensions. The challenge is that "host" code and "device" code commingle in the same source files: host code can only run on your CPU and access main memory, while device code can only run on your GPU and access GPU memory. What this means is that you're totally on your own when writing device code. You can't use C standard library functions like memcmp or strtol, and you definitely can't call anything as luxurious as a Windows API function from device code. The sole exception, as far as I know, is that CUDA does allow you to use printf from within device code for debugging purposes...
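To make the host/device split concrete, here's a minimal sketch (not from NTHashTickler itself): the `__global__` qualifier marks a function as device code that the host launches, and printf is the one standard-library luxury available inside it.

```cuda
#include <cstdio>

// Device code: runs on the GPU. No C standard library here;
// printf is the lone exception, allowed for debugging.
__global__ void hello_kernel()
{
    printf("Hello from GPU thread %d\n", threadIdx.x);
}

// Host code: runs on the CPU and launches the kernel.
int main()
{
    hello_kernel<<<1, 4>>>();    // launch 1 block of 4 threads
    cudaDeviceSynchronize();     // wait for the GPU to finish
    return 0;
}
```

The triple-angle-bracket launch syntax is the CUDA extension that hands work from host to device; everything else is plain C.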
Traditionally, you could only transfer data between the host and the device through CUDA API calls such as cudaMalloc and cudaMemcpy. But starting with CUDA 6, on devices of compute capability 3.0 or later, there's what I like to call "CUDA Easy-Mode": variables that can be seamlessly accessed by both host and device, making your code look a lot cleaner and simpler. These are the __device__ __managed__ variables, and they provide what CUDA calls a "unified" view of memory. I'm guessing it's just the CUDA runtime doing the dirty work under the hood that you used to have to do yourself. You can still cause the same kinds of problems reading and writing these variables from different threads that apply to any multithreaded code, but it's still so much easier to work with than cudaMalloc, cudaFree, etc.
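A quick sketch of what "Easy-Mode" looks like in practice (again, just an illustration, not code from the project): a managed counter that GPU threads update and the host then reads directly, with no cudaMemcpy anywhere.

```cuda
#include <cstdio>

// A managed variable, visible to both host and device.
// Requires CUDA 6+ and a device of compute capability 3.0 or later.
__device__ __managed__ unsigned long long counter = 0;

__global__ void bump_counter()
{
    // atomicAdd avoids the usual read-modify-write races between threads
    atomicAdd(&counter, 1ULL);
}

int main()
{
    bump_counter<<<4, 256>>>();   // 1024 GPU threads each add 1
    cudaDeviceSynchronize();      // make sure the GPU is done before reading
    printf("counter = %llu\n", counter);  // host reads the same variable directly
    return 0;
}
```

Note that the multithreading caveat still applies: without the atomicAdd, those 1024 threads would race on the increment just like CPU threads would.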
So the idea is that if you have a problem that can be parallelized sufficiently, it can take advantage of your GPU's ability to run thousands of parallel threads, instead of your CPU's mere 4 or 8 simultaneous threads.
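The usual way to carve up a brute-force search like this is to have each thread compute a unique global index from its block and thread IDs and work on its own candidate. A sketch under that assumption (try_candidate is a hypothetical per-thread unit of work, not a real function in the project):

```cuda
// Each thread derives a unique global index and handles one candidate.
__global__ void search_kernel(unsigned long long base)
{
    unsigned long long idx =
        base + blockIdx.x * (unsigned long long)blockDim.x + threadIdx.x;
    // try_candidate(idx);  // hypothetical: hash candidate #idx and compare
}

// Host side: launch enough blocks to keep thousands of threads in flight.
// search_kernel<<<1024, 256>>>(0);   // 262,144 candidates per launch
```

A single launch like that puts more threads in flight than a CPU could ever run simultaneously, which is the whole appeal.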
In practice, though, when I finally got the code working, I was getting only about 750,000 hashes per second of throughput, which is slower than the ~5 million hashes/second my previous C version achieved running on just the CPU!
So I'm definitely doing something wrong. I think I should be seeing a billion hashes/sec of throughput or better. I have a lot more work to do.
Luckily, the CUDA Toolkit comes with some really amazing profiling and tracing tools, and I think that's a good place to start looking for optimization opportunities.