NTHashTickler_CUDA v1.0

I've ported my NTHashTickler program to NVIDIA CUDA... poorly.

As before, my motivation for writing this program is not because I really care about finding NT hashes per se, but just teaching myself more about multithreaded programming and in this case, NVIDIA CUDA. If you know anything at all about CUDA, please feel free to criticize my work and tell me what I'm doing wrong. (Source code is on Github.)



So CUDA is basically like standard C, but with a few extensions.  The challenge is that you have "host" code, and you have "device" code comingling, with host code only being able to run on your CPU and access your main memory, while "device" code can only run on your GPU and only access GPU memory.  What this means is that you are totally on your own when you're writing device code to run on your GPU. You can't even use C standard library functions like memcmp or strtol, etc.  And you definitely cannot call something as luxurious as a Windows API function from device code. The sole exception as far as I know is that CUDA does allow you to use printf from within device code for debugging purposes...

Traditionally, you could only transfer data from the host to the device through CUDA API calls such as cudaMalloc, etc., but in CUDA compute capability 3.0 they introduced what I like to call "CUDA Easy-Mode," or essentially, variables that can be seamlessly accessed by both host and device simultaneously, making your code look a lot cleaner and simpler.  These are the __device__ __managed__ variables.  They provide what CUDA calls a "unified" view of memory.  I'm guessing it's probably just the CUDA runtime doing the dirty work for you under the hood that you used to have to do yourself.  You can still cause the same kind of problems reading and writing to the variables from different threads that applies to any multithreaded code, but it's still so much easier to work with than cudaMalloc, cudaFree, etc.

So the idea is that if you have a problem that can be parallelized sufficiently, it can take advantage of your GPU's ability to run thousands of parallel threads, instead of your CPU's mere 4 or 8 simultaneous threads.

In practice though, when I finally got the code working, I get about 750,000 hashes per second of throughput, which is slower than the ~ 5 million hashes/second of throughput by my previous version in C that just ran on the CPU!

So I'm definitely doing something wrong.  I think I should be seeing a billion hashes/sec of throughput or better.  I have a lot more work to do.

Luckily, the CUDA Toolkit comes with some really amazing profiling and tracing tools, and I think that's a good place to start looking for optimization opportunities.

(Source code is on Github.)

Comments (2) -

Karthikeyan 3/24/2016 10:25:09 AM

I had glance at your code. You are launching block with 8x8x8 threads (x, y, z) but you are utilizing only x direction. similary in grid also, you are using x threads only. But you are lunching 64x64x1. Either launch only in x. or start utilizing all dimensions. If you are not using them in code, but launching them, it's duplicate work which goes waste. (If you have used this code just to calculate throughput, it's fine then).
You are declaring character arrays. In general, all local arrays are stored in DRAM. So, the access is slow. Start utilizing shared memory which is much faster. (as fast as cache). specifically for arrays which are frequently accessed.
Another point to take care - Memory coalescing. With careful planning, this will give great speedup. Read about it.

Thank you -- this is very useful info. Hopefully I will get back around to this project and fix my code.

Comments are closed