CUDA
- Hello World
- Everything
```cuda
// Error-check helper for all CUDA calls
void e(cudaError_t result) {
    if (result != cudaSuccess) {
        printf("%s\n", cudaGetErrorString(result));
    }
}

// Sanity checks after launching a kernel
e(cudaGetLastError());
e(cudaDeviceSynchronize());

// Common functions (direction is ONE of cudaMemcpyHostToDevice / cudaMemcpyDeviceToHost)
cudaMemcpy(dest, source, size, cudaMemcpyHostToDevice);
cudaMalloc(&pointer, size);

// Calling a kernel
kernel<<<blocks, threadsPerBlock>>>();
threadIdx.x  // thread index within its block
blockIdx.x   // block index within the grid
blockDim.x   // threads in a block in that direction
gridDim.x    // blocks in the grid in that direction

// Shared memory (dynamic): size set by the third launch parameter,
// e.g. kernel<<<blocks, threads, sharedBytes>>>()
extern __shared__ int s[];
```
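A minimal, compilable hello-world sketch using the helper pattern above (the kernel name and launch shape are my own; build with `nvcc hello.cu -o hello`):

```cuda
#include <cstdio>

// Error-check helper (same pattern as the cheat sheet above)
void e(cudaError_t result) {
    if (result != cudaSuccess) printf("%s\n", cudaGetErrorString(result));
}

// Each thread prints its own coordinates
__global__ void hello() {
    printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main() {
    hello<<<2, 4>>>();          // 2 blocks of 4 threads
    e(cudaGetLastError());      // catch launch errors
    e(cudaDeviceSynchronize()); // wait for the kernel (and flush its printf)
    return 0;
}
```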
 - References
 
 - Everything
 - Thread Block
- https://developer.nvidia.com/blog/cuda-refresher-cuda-programming-model/
 - group of threads is called a thread block
 - CUDA blocks are grouped into a grid
 - Kernel is executed as a grid of blocks of threads
 - each block is executed by one Streaming Multiprocessor (SM)
 - One SM can run several CUDA blocks (depending on resources)
 - CUDA limits threads per block to 1024
 kernel_call<<<numBlocks, threadsPerBlock>>>(...)
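Putting the launch syntax together, a sketch of the usual global-index pattern (kernel and variable names are my own):

```cuda
// Each thread handles one element of the arrays
__global__ void add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];                  // guard: the grid may overshoot n
}

// Host side: round up so every element gets a thread
int threadsPerBlock = 256;  // must be <= 1024
int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;
add<<<numBlocks, threadsPerBlock>>>(d_a, d_b, d_c, n);
```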
 - Warp
- Groups of threads with consecutive thread indexes are bundled into warps
 - A warp is 32 threads executed in lockstep on an SM (not one warp per CUDA core)
 - blocks are divided into multiple warps for execution on the cores of an SM
 - SIMT – single instruction, multiple threads
 - https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/
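A sketch of a warp-level sum using the shuffle primitives covered in the linked post (kernel name is my own; assumes a full warp of active threads):

```cuda
// Each shuffle step halves the number of lanes holding partial sums;
// after 5 steps (32 -> 1) lane 0 holds the warp's total.
__global__ void warpSum(const int *in, int *out) {
    int val = in[threadIdx.x];
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    if (threadIdx.x % 32 == 0) atomicAdd(out, val);  // one write per warp
}
```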
 
 - Parallel Reduce
 - Shared memory
 - Printf in CUDA
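For the Parallel Reduce and Shared memory topics above, a sketch of a block-level tree reduction in dynamic shared memory (names are my own; launch as `reduce<<<blocks, threads, threads * sizeof(int)>>>(...)` so the third parameter sizes `s[]`):

```cuda
// Tree reduction: each step halves the active threads, summing pairs
// in shared memory; thread 0 writes one partial sum per block.
__global__ void reduce(const int *in, int *out, int n) {
    extern __shared__ int s[];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    s[threadIdx.x] = (i < n) ? in[i] : 0;  // pad the tail with zeros
    __syncthreads();                       // shared memory is per-block
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            s[threadIdx.x] += s[threadIdx.x + stride];
        __syncthreads();                   // all sums done before next step
    }
    if (threadIdx.x == 0) out[blockIdx.x] = s[0];
}
```

The host then sums the per-block partials (or launches `reduce` again on them).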