This project should be very easy for any CUDA developer. Its purpose is to demonstrate the difference in run time of a simple matrix multiplication program when it is written once without shared memory and once with shared memory. I will provide specific details and files later on.
Hi, is a comparison between shared and global memory all that you want for the matrix multiplication, or do you also want to test other types of memory? Thanks, Nicu