Intro to GPUs

In this article we’ll introduce the GPU, and compare it with computing on a CPU. We’ll take a high-level look at the design of a GPU, and sees how it varies with a CPU. We’ll also write a basic program and compare performance of Javascript, Python, C, and CUDA.

The GPU

GPU stands for “Graphics Processing Unit”. It is an evolution of the graphics card. Its naming is a reference to the “CPU”, or central processing unit, that came before it.

In the early days of computers, all graphical processing was performed by the CPU. A CPU is designed to run programs which consist of a large number of simple steps, and to run each step as quickly as possible. While a CPU is very specialized for running a series of steps quickly, it primarily operates by running one step at a time.

This is great for complex sequential logic that you might see in desktop applications like a web browser.

(There’s a longer and more interesting history here, involving the evolution of discrete logic -> cpu -> graphics card -> gpu. I may write more in the future, send an email if you’re interested in this).

But there are many types of problems where a CPU is not the best solution. In computer graphics, the task is often of running some relatively simple logic, but running the same equations over each pixel on your screen. There is no dependence between the pixel at (1,1) and the pixel at (1,2), so these could theoretically be computed at the same time. (This applies only to some graphics tasks, like shading or raytracing.)

And this is precisely the sort of task the GPU was designed for. While each computing element or “core” within a GPU may be slower and less sophisticated than a core from a CPU, there are many many more than would be in a typical CPU. Consider these devices, two CPUs and two GPUS:

Device	Type	Cores x Max Speed/Frequency	Cost
Intel i3 14100F	CPU (low end)	4x3.5Ghz	$100
Intel Ultra 9 295K	CPU (high end)	8x5.4GHz (Perf) + 16x4.6Ghz (Efficient)	$550
NVidia RTX 5060	GPU (low end)	3,840 x 2.41Ghz	$300
NVidia RTX 5090	GPU (high end)	21,760 x 2.41Ghz	$3600

At the low end, we see GPUs for 3x the cost of a CPU but with 3840/4 = 960 times the number of compute cores. And the comparison of the high end is similar. There’s a lot more to the differences in the design, but we’ll cover that in a future article. For now, it is enough to know that GPU cores are typically a bit slower, but there are MANY more than you could find in a CPU.

A Problem for GPUs: Matrix Multiplication

GPUs do best operating on a large number of relatively simple problems in parallel. Rather than compute each one sequentially, it is designed to queue up a large number of tasks and compute these over slightly different data.

A classic example is multiplying two matrices of numbers. This problem is often useful for large simulations (research and physics), graphics (computing positions of objects), and machine learning ( applying a model to some input).

We’ll have a quick refresher on the problem, followed by implementations in Javascript, Python, C, and CUDA. If you’re familiar with the problem or would rather read code, jump to your language of choice.

Matrix Multiplication Refresher

In Matrix Multiplication wiki, we have a matrix A and B, each is a 2D grid of numbers. Using notation Rows x Cols, our inputs must have compatible sizes: A (MxN) x B (NxK) -> C (MxK)

Each number in the output matrix, (Y,X) is the dot product of row Y from A and column X from B. The dot product is the Sum of the products of each pair. Ex:

1 2  4 5    p q   16 19
3 4 x6 7 => r s =>36 43

p = dot([1,2], [4,6]) = 1*4 + 2*6 = 16
q = dot([1,2], [5,7]) = 1*5 + 2*7 = 19
r = dot([3,4], [4,6]) = 3*4 + 4*6 = 36
s = dot([3,4], [5,7]) = 3*5 + 4*7 = 43

Output size is MxK, and computing each of these elements takes N multiplications and sums, so the work is O(M*N*K)

We’ve used both 2x2 matrices in the example, but in practice these can be very large. Now lets take a look at a few basic implementations. The goal has been to translate the simplest algorithm into each language. There are many optimizations that can be applied to make it even faster, but we’ll address this in a future article.

Implementation note: packed row-major matrices.

Some high level programming languages support two dimensional matrices. Languages can either implement this as a packed array, or by an outer array that itself holds pointers to arrays.

The packed array is more efficient, and most similar to what we can implement on a GPU, so we’ll use that. This means that our example matrices will be stored just like a 1d array, with the second row following the first and so forth. So:

1 2
3 4 -> [1, 2, 3, 4]

4 5
6 7 -> [4, 5, 6, 7]

And so forth. Since this loses the dimensional information, we’ll also need to pass along the dimensions M, N, and K.

Translating from a 2d dimension into a 1d index uses the following formula. For each row or y value, we essentially add the width of the matrix times the y value. The element 7 comes 2 elements after 5, since the width of that matrix is 2.

idx1D(x, y, w) = w*y + x

And now for the implementations:

Python Matrix Multiplication:

#!/usr/bin/env python3
def matmul(a, b, M, N, K, out):
    '''
        Multiply a * b -> out
        A is (MxN)
        B is (NxK)
        Out is (MxK)
    '''
    assert(len(a) == M*N)
    assert(len(b) == N*K)
    assert(len(out) == M*K)
    for y in range(0, M):
        for x in range(0, K):
            sum = 0
            for i in range(0, N):
                ai = a[y*N + i]
                bi = b[i*K + x]
                sum += ai*bi
            out[y*K + x] = sum

if __name__ == "__main__":
    M = 4000
    N = 4000
    K = 4000
    a = [0]*(M*N)
    b = [0]*(N*K)
    c = [0]*(M*K)
    matmul(a, b, M, N, K, c)
    print("done")

Javascript Matrix Multiplication:

#!/usr/bin/env node
function matmul(a, b, M, N, K, out) {
    /*
        Multiply a * b -> out
        A is (MxN)
        B is (NxK)
        Out is (MxK)
    */
    // assert(a.length == M*N)
    // assert(b.length == N*K)
    // assert(out.length == M*K)
    for(let y = 0; y < M; y++) {
        for(let x = 0; x < K; x++) {
            let sum = 0;
            for(let i = 0; i < N; i++) {
                let ai = a[y*N + i];
                let bi = b[i*K + x];
                sum += ai*bi;
            }
            out[y*K + x] = sum;
        }
    }
}
M = 4000
N = 4000
K = 4000
a = new Array(M*N)
b = new Array(N*K)
c = new Array(M*K)
matmul(a, b, M, N, K, c)
console.log("done")

C Matrix Multiplication

void matmul_naive_cpu(const vector<float>& a, const vector<float>& b, size_t M, size_t N, size_t K, vector<float> *out) {
    // Naive baseline matrix multiplication.
    assert(a.size() == M*N);
    assert(b.size() == N*K);
    assert(out->size() == M*K);
    for (int y = 0; y < M; y++) {
        for (int x = 0; x < K; x++) {
            float sum = 0.0;
            for (int n = 0; n < N; n++) {
                sum += a[idx2D(N, n, y)] * b[idx2D(K, x, n)];
            }
            (*out)[idx2D(K, x, y)] = sum;
        }
    }
}

int main() {
    int M = 4000;
    int N = 4000;
    int K = 4000;
    vector<float> a(M*N);
    vector<float> b(N*K);
    vector<float> c(M*K);

    matmul_naive_cpu(a, b, M, N, K, &c);
    puts("done");
    return 0;
}

CUDA Matrix Multiplication

CUDA is the language created by NVidia for interacting with their GPUs.

We’ll walk through this implementation in detail in the next article, but for now I just want to give a taste of what working with CUDA/GPUS is like. For now you can observe that the code is very similar to the C, with some extra keywords that help parallelize the computation.

__global__ void matrix_multiplication_kernel(const float* A, const float* B, float* C, int M, int N,
                                             int K) {
    int x = blockIdx.x*blockDim.x + threadIdx.x;
    int y = blockIdx.y*blockDim.y + threadIdx.y;
    float sum = 0.0;
    const int AW = N;
    const int AH = M;
    const int BW = K;
    const int BH = N;
    if (x >= K || y >= M) return;
    for (int i = 0; i < N; i++) {
        int ay = y;
        int ax = i;
        int by = i;
        int bx = x;
        sum += A[AW * ay + ax] * B[BW * by + bx];
    }
    C[y * K + x] = sum;
}

Performance Comparison

I’ve run these each with two 4000x4000 matrices. Lets compare running time, to appreciate the power of parallel programming on a GPU:

Language	Running time	Speed relative to Python (Higher is better)
Python (3.12)	174m41s = 10491s	1.0
Javascript (node v22.20)	38m51s = 2331s	4.5x
C++ (g++ 13.3 -O3)	5m39 = 339s	30.9x
CUDA	1.8s	5828x

This was actually a fairly small workload for the GPU, so it wasnt fully utilized. In practice we could expect an even larger speedup relative to single threaded C++, for problems that lend themselves to parallel computing.

Hardware used:

Python3, Javascript, and C++ are all single threaded and ran on the following CPU: Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz

CUDA ran on an L4 NVidia GPU(7424 cuda cores @ 795 MHz) on GCP.

Up next: Basic GPU Programming with CUDA and OpenCL

Next time we’ll walk through building a simple GPU accelerated program. We’ll explain the popular CUDA language used in this article, as well as the open standard OpenCL which supports many additional device types.