⭅ Previous (Intro to GPUs)

CUDA 101 - Write a simple program from scratch

In this article we’ll look at writing a CUDA Program from scratch. As we learned in the previous article, CUDA allows us to write programs that use the GPU to run computations requested by the CPU.

CUDA is developed by Nvidia and can only run on Nvidia GPUs. If you would like to follow along with this article, you can run your CUDA programs online in our GPU Lab.

We also plan on covering OpenCL soon. The CUDA API is simpler, and once you understand CUDA you can learn OpenCL easily.

There is also a lower-level way of interacting with CUDA, without the “runtime api” we’ll use here. This lower level interface is easier to understand once we have the big picture, so we’ll save it for a later article.

A Simple Problem for GPUs: Filling an array.

For our simple end-to-end example, we’ll fill an array, so that each element holds its own position. For a length 5 array, this would be: [0.0 1.0 2.0 3.0 4.0]

While it isn’t particularly compute heavy, this will allow us to walk through the workflow of interacting with a GPU, and give us a template for future tasks.

This task is suitable for running on a GPU, since we can think of each element as its own task. If we know the position, we can compute the value that needs to go into that slot.

Splitting a workload : grid and blocks and threads

Last time we saw that GPU compute units(cores) are slower than cores of a CPU. But with a GPU we have thousands of them instead of about ten. So if we want our code to run faster on a GPU, we need to find a way to split it up into many little tasks that can be run independently.

Before we write any GPU code, we need to plan on how to partition it. We call the full problem the “grid”. We then split this grid up into a number of “blocks”. Each block then has multiple “threads”. Each thread corresponds with work that can be run in parallel on a hardware thread.

This three level hierarchy might be a bit different from what you expect. Surely we want to split a problem into a number of tasks, but why have these “blocks” in the middle? This is a clever piece of design that allows us to achieve better locality when running our programs. We’ll look at hardware architecture in the future, but for now it is enough to know that a block is the unit that gets scheduled onto some hardware element. If we partition our problem so that “blocks” can share data, we can usually make our kernels more efficient.

Both the grid and the block dimensions can be between 1 and 3 dimensions. You are free to use as many as naturally fit your problem. We’re going to be filling a 1D array, so only need one dimension (x).

For image processing, you might want to chunk your image into 2 dimensions.

For this task, we’ll split our problem into blocks of size 2, so each contains just two threads.

Now lets see how to start this parallel task.

Starting a task

If you’re using the nvcc compiler and writing a .cu file, we have a bit of special syntax available for starting a parallel task. This looks like:

    vector_fill<<<blocksPerGrid, threadsPerBlock>>>(output, length);

vector_fill is the name of a special “kernel” function we’ll write soon. The special syntax is the «< and »>, or “triple chevron” (useful for searching). This takes two parameters, one is the size of the grid in terms of blocks, and the other is the size of the block in terms of threads.

These can either be 3d vectors, or can be integers. If an integer is passed, it is assumed to be a 1d vector with x=N.

Lets take our array filling task, where we take an array of length N and want 1x1x1 blocks as mentioned earlier. Here’s how we might calculate our dimensions:

void run(float* output, int length) {
    const int threads_per_block = 2;
    // This is equivalent to blocksPerGrid = ceil(length / threads_per_block).
    // You'll probably see this often.
    int blocksPerGrid = (length + threads_per_block - 1) / threads_per_block;
    vector_fill<<<blocksPerGrid, threadsPerBlock>>>(output, length);
    cudaDeviceSynchronize();
}

This corresponds with running vector_fill with the parameters output and length, across each “position” in our grid. The cudaDeviceSynchronize blocks until the kernel function complets. This is helpful in case we wanted to start several different tasks, or do other work on the CPU while it is running. For now we’re just running the kernel, so we’ll block immediately after starting it.

Now how do we write that function?

What is a CUDA kernel?

A CUDA Kernel is the code that actually runs on the GPU. It is a function written in CUDA, which is a special flavor of the C++ programming language.

CUDA Kernel functions are declared with the special __global__ annotation, which tells the compiler that this function needs to be available for the GPU.

When this kernel runs, each is called with the same parameters. But it will get some extra information to know which instance of the problem(block + thread) it is running. Lets take a look:

A Simple CUDA Kernel

With the setup out of the way, we can finally examine a basic CUDA kernel.

__global__ void vector_fill(float* A, int len) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i >= len) return;

    A[i] = float(i);
}

This first line is the only “tricky” piece here, the rest is standard C++.

blockIdx, blockDim, and threadIdx are special variables that are available to read in your kernel function. blockDim is fixed across a run of our function, and tells us how many threads are in each dimension of our block. For our 1d blocks we set up earlier, this will be 2.

blockIdx tells us which block we are running. This varies based on our run() calculation earlier. Lets say we have an array of size 10 with our 2 thread blocks. We’ll need ceil(10/2) = 5 blocks in order to cover out grid. This means blockIdx.x will range from 0 up to including 4.

threadIdx tells us which block we are within. Since we have blockDim.x = 2, threadIdx.x will range from 0 to 1, inclusively.

The formula for int i gives us a way to take these block/thread coordinates and linearize them into 1d coordinate. This is similar to the matrix packing expression in the previous article.

This guarantees that each thread will have its unique index. One thing we need to watch out for, is that CUDA will spawn full blocks, regardless of the size of our task. It is up to us to ensure that our kernel functions dont read/write out of bounds. Here we just return if not within our array of length len.

Host vs Device Memory

We have one more step before we can run our GPU code. We need to pass our data into and out of the GPU.

GPUs have their own dedicated memory, independant of the memory available to your CPU. This video ram, or VRAM, is used by the GPU and its kernel functions. Though we passed a float * to our vector_fill function, this needs to be a pointer in device(GPU) memory, not host(CPU) memory.

We copy it there using a series of functions very similar to the malloc/free used for managing memory in C.

    const int len = 10;
    const int bytes = len * sizeof(float);
    float *arr_host = (float*)malloc(bytes);  // Allocate host storage for the result.

    float *arr_dev;
    cudaMalloc(&arr_dev, bytes); // Allocate device storage.
    run(arr_dev, bytes);

    // Copy the result from gpu to host when done.
    cudaMemcpy(arr_host, arr_dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(arr_dev);
    free(arr_host);

While the signatures of these new cudaMemcpy and cudaMalloc are a bit different, these effectively do the same thing as their C counterparts. The final parameter to cudaMemcpy tells us the direction of the copy. In our example, we produce data on the GPU but have no inputs. If we wanted to pass in data, we would also need to use the cudaMemcpyHostToDevice in order to copy data from system memory onto the GPU.

Putting it all together:

Here is the full code listing. Once you compile this with the cuda compiler, nvcc, you will have what is called a “fat binary”, or cubin. This file contains both the code that runs on the cpu, and the compiled code for the GPU. The program starts running on the CPU, and it will upload the compiled kernel code to the GPU so that it can run.

// Try it on the GPU Lab: gpu.emulationonline.com
#include <stdio.h>
#include <cuda_runtime.h>
#include <stdlib.h>

__global__ void vector_fill(float* A, int len) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i >= len) return;

    A[i] = float(i);
}

// A is a device pointer (a pointer to memory on the GPU)
extern "C" void run(float* A, int N) {
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;

    vector_fill<<<blocksPerGrid, threadsPerBlock>>>(A, N);
    cudaDeviceSynchronize();
}


int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    printf("Found %d cuda devices\n", count);

    const int length = 5;
    float *arr = malloc(sizeof(float)*length);

    float *arr_dev;
    cudaMalloc(&arr_dev, bytes);
    run(arr_dev, 5);
    cudaMemcpy(arr, arr_dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(arr_dev);

    printf("Got: ");
    for (int i = 0; i < length; i++) {
        printf("%f ", arr[i]);
    }
    printf("\n");

    return 0;
}

Output:

We’ll run this on our gpu lab, and observe the following output:

Found 1 cuda devices
Got: 0.000000 1.000000 2.000000 3.000000 4.000000

real	0m1.664s
user	0m0.049s
sys	0m0.538s

Our kernel runs!

Experiment ideas

Here are a few experiments you can run, to help understand or generate questions for further investigation:

Try passing a host pointer (ex: arr instead of arr_dev) to our run() function. Why doesn’t this work?
Change the kernel invocation, to run only a single block with 2 threads. What happens to your output?
Try changing the int i calculation to int i = threadIdx.x; and observing the output.

Next up: Basic GPU Architecture

With this knowledge, you should be able to write pretty much any function you would want to run on a GPU. However there’s a lot more to know about CUDA and GPUs if you want to write functions that run quickly and efficiently.

Next up we’ll cover the basics of GPU Architecture, and use this knowledge to optimize a matrix multiplication.