⭅ Previous (CUDA 101)

OpenCL 101 - Write an OpenCL program from scratch

In this article we’ll write parallel code that can run on a CPU or GPU, for both AMD and NVidia devices. This is the OpenCL version of our CUDA 101 article.

Even if you plan on writing code for CPU or an AMD GPU, it you might find it helpful to first skip the CUDA article above. The OpenCL interface is unfortunately much more verbose, so it can be helpful to understand the high level concepts first.

What is OpenCL

OpenCL is an industry standard for running parallel computations on accelerator devices. It is a standard managed by Khronos Group, the same organization behind OpenGL.

OpenCL is a generic interface that lets you write code to run on an accelerator. It also manages transferring the data from a Host (cpu+memory) to the accelerator device (GPU etc).

(These next two sections are identical to the CUDA article, feel free to skip these if you’ve already read that one.)

A Simple Problem for GPUs: Filling an array.

For our simple end-to-end example, we’ll fill an array, so that each element holds its own position. For a length 5 array, this would be: [0.0 1.0 2.0 3.0 4.0]

While it isn’t particularly compute heavy, this will allow us to walk through the workflow of interacting with a GPU, and give us a template for future tasks.

This task is suitable for running on a GPU, since we can think of each element as its own task. If we know the position, we can compute the value that needs to go into that slot.

Splitting a workload : grid and blocks and threads

Last time we saw that GPU compute units(cores) are slower than cores of a CPU. But with a GPU we have thousands of them instead of about ten. So if we want our code to run faster on a GPU, we need to find a way to split it up into many little tasks that can be run independently.

Before we write any GPU code, we need to plan on how to partition it. We call the full problem the “grid”. We then split this grid up into a number of “blocks”. Each block then has multiple “threads”. Each thread corresponds with work that can be run in parallel on a hardware thread.

This three level hierarchy might be a bit different from what you expect. Surely we want to split a problem into a number of tasks, but why have these “blocks” in the middle? This is a clever piece of design that allows us to achieve better locality when running our programs. We’ll look at hardware architecture in the future, but for now it is enough to know that a block is the unit that gets scheduled onto some hardware element. If we partition our problem so that “blocks” can share data, we can usually make our kernels more efficient.

Both the grid and the block dimensions can be between 1 and 3 dimensions. You are free to use as many as naturally fit your problem. We’re going to be filling a 1D array, so only need one dimension (x).

For image processing, you might want to chunk your image into 2 dimensions.

For this task, we’ll split our problem into chunks of size 2, so each contains just two threads.

Now lets see how to start this parallel task.

One notation note, blocks vs work items in CUDA vs OpenCL

In CUDA, the grid is split into blocks, and blocks are split into threads. We then need to calculate a coordinate in the global space, using block id and thread id.

In OpenCL, we specify the size of “local work item”, and the global space of work items. Just like CUDA, each of these can be multi-dimensional. But unlike CUDA, OpenCL gives you the global position directly.

Starting an OpenCL task

Before we look at code, lets outline the process for running the code on a GPU:

Select the device we want to run on.
Create device queues for issueing commands (ex: run kernel, copy data) to the device.
Compile our kernel for our device.
Create buffers to hold data on the device.
Copy inputs from host to device
Set work size(global + local) for our parallel task
Start the kernel
Copy outputs from device back to host

Our full example below is heavily commented, to help follow along.

Full OpenCL Example Code

And here it is all together.

#include <CL/opencl.h>

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

static const char* all_kernels = R"(
__kernel void fill_arr(__global float* a, int N) {
    size_t i = get_global_id(0);
    if (i >= N) return;
    a[i] = (float)i;
}
)";

// host driver
void fill_cl(float* a, size_t N) {
    // fill A (N)
    const char* kernel_source = all_kernels;
    printf("Kernel:\n-----%s-------\n", kernel_source);

    // task per input row.
    const int NUM_ENTRIES = N;
	size_t local_work_size[1] = { 2 };
	size_t global_work_size[1] = { 2 };  // Not block id, but total threads.
    while (global_work_size[0] <= NUM_ENTRIES) {
        // global work size must be a multiple of local work size.
        global_work_size[0] <<= 1;
    }

    // Looking up the available GPUs
	const cl_uint num = 1;
	clGetDeviceIDs(NULL, CL_DEVICE_TYPE_GPU, 0, NULL, (cl_uint*)&num);

	cl_device_id devices[1];
	clGetDeviceIDs(NULL, CL_DEVICE_TYPE_GPU, num, devices, NULL);

	// create a compute context with GPU device
    puts("creating context");
    cl_int error = 0;
	cl_context context = clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU, NULL, NULL, &error);
    assert(error == 0);

	// create a command queue
    puts("creating queue");
	clGetDeviceIDs(NULL, CL_DEVICE_TYPE_DEFAULT, 1, devices, NULL);
	cl_command_queue queue = clCreateCommandQueue(context, devices[0], 0, &error);
    assert(error == 0);

    cl_mem out_dev = clCreateBuffer(context, CL_MEM_READ_WRITE, sizeof(float) * NUM_ENTRIES, NULL, &error);
    printf("out err: %d\n", error);
    assert(error == 0);

	// create the compute program
	cl_program program = clCreateProgramWithSource(context, 1, (const char **)& kernel_source, NULL, &error);
    assert(error == 0);

	// build the compute program executable
	error = clBuildProgram(program, 1, devices, NULL, NULL, NULL);
    printf("build program returns: %d\n", error);
    cl_build_status build_status;
    clGetProgramBuildInfo(program, devices[0], CL_PROGRAM_BUILD_STATUS, sizeof(cl_build_status), &build_status, NULL);
    if (build_status != CL_BUILD_SUCCESS) {
        // Check the build log if something went wrong
        char log[2048];
        clGetProgramBuildInfo(program, devices[0], CL_PROGRAM_BUILD_LOG, sizeof(log), log, NULL);
        printf("Build Log:\n%s\n", log);
        exit(1);
    }

	// create the compute kernel
    puts("creating kernel...");
	cl_kernel kernel = clCreateKernel(program, "fill_arr", &error);
    printf("kernel error: %d\n", error);
    assert(error != CL_INVALID_KERNEL_DEFINITION);
    assert(error != CL_INVALID_PROGRAM_EXECUTABLE);
    assert(error == 0);

	clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&out_dev);
	clSetKernelArg(kernel, 1, sizeof(int), (void*)&N);

    puts("enqueing kernel");
	error = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, global_work_size, local_work_size, 0, NULL, NULL);
    printf("queue returns: %d\n", error);
    assert(error == 0);

    // wait for kernel to finish.
    clFinish(queue);

    int err = clEnqueueReadBuffer(
            queue, out_dev, /*blocking_read=*/CL_TRUE,
            /*offset=*/0,
            /*size=*/sizeof(float)*NUM_ENTRIES,
            /*ptr=*/a,
            /*num_events_in_wait_list=*/0,
            /*event_wait_list=*/NULL,
            NULL);
}

int main() {
    int count = 0;

    const int length = 5;
    float *arr = (float*)malloc(sizeof(float)*length);

    fill_cl(arr, 5);

    printf("Got: ");
    for (int i = 0; i < length; i++) {
        printf("%f ", arr[i]);
    }
    printf("\n");
    free(arr);

    return 0;
}

On linux, you can build with:

g++ -O3 fill.cl.cc -lOpenCL -o fill_cl