Why Is LLM Inference So Much Slower on CPU? A Deep Dive

Researchers and engineers have long known that training models on a GPU is much faster than on a CPU. But a much wider audience has recently seen the performance gap between CPUs and GPUs first-hand when trying to run LLMs locally. So why are GPUs so much faster? They are not faster at every task, and the easy answer we often hear, that they parallelize more, is true but a bit simplistic. So I wanted to dig deeper into this topic, and this article is the result of that deep dive.

We will do a complete walkthrough of how data actually moves through a CPU+RAM system versus a GPU+VRAM system. We'll start from the basics (cache lines, SIMD, GPU warps), build up through shared memory tiling, and arrive at the full picture of why streaming 14 GB of model weights through a GPU's wide memory bus is so much faster than through a CPU's narrow one.

I hope that you'll find it interesting.


Part 1: The Two Worlds

CPUs and GPUs solve different problems, and their memory systems reflect that. They take opposite positions on the latency vs. throughput trade-off, a trade-off that shows up throughout computer infrastructure.

A CPU is optimized for latency. A CPU has to complete a huge number of small, heterogeneous tasks, so each one needs to finish quickly. It has a small number of powerful cores (8–24 on a typical desktop). The goal is to make one thread go as fast as possible. When a CPU core requests data from RAM, it needs it now. A cache miss that takes 100 nanoseconds means the core sits idle for hundreds of cycles.

A GPU is optimized for throughput. A GPU works on a small number of big tasks, so a little overhead per task is acceptable if it means the big task runs fast. It has thousands of small cores (e.g. 16,384 on an RTX 4090[2]) designed to run the same instruction across many data elements simultaneously. No single thread is fast, but the aggregate work done per second is huge. When a GPU thread requests data from VRAM, it doesn't need it now. It can wait, because the GPU will switch to other threads in the meantime.

These different philosophies lead to different memory designs:

                    CPU + RAM                          GPU + VRAM
Optimized for       Low latency (fast single access)   High throughput (many accesses in flight)
Bus width           Narrow: 64–128 bit                 Wide: 256–5120 bit
Bandwidth           ~100 GB/s (Apple M2, LPDDR5)[1]    ~1 TB/s (GDDR6X)[2] to ~3 TB/s (HBM3)[4]
Latency strategy    Reduce it (caches, prefetching)    Hide it (thread switching)
Capacity            16–128 GB typical                  8–80 GB typical

And they're connected by a PCIe bus, a comparatively narrow bridge (~32 GB/s for PCIe 4.0 x16)[3] that becomes the bottleneck whenever data must travel between the two worlds.

[Diagram: CPU + RAM (128-bit bus, ~100 GB/s) and GPU + VRAM (384-bit bus, ~1 TB/s), linked by PCIe 4.0 at ~32 GB/s.]
Data must travel across the PCIe bridge between the two memory worlds.

With that context, let's look at each side in detail.


Part 2: The CPU + RAM Side

2.1 A simple loop

Let's start with the most basic operation: adding two arrays element by element.

// vector_add.c - scalar CPU version
void vector_add(float* a, float* b, float* c, int n) {
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

But what actually happens when the CPU executes a[i]?

The CPU doesn't talk to RAM directly. It goes through a cache hierarchy:

  1. L1 cache (~32 KB, ~1 ns): tiny, per-core, extremely fast
  2. L2 cache (~256 KB–1 MB, ~5 ns): per-core, still very fast
  3. L3 cache (~8–32 MB, ~20 ns): shared across cores, 20x slower than L1
  4. RAM (~64+ GB, ~100 ns): the main memory, 100x slower than L1

When the CPU needs a[0], it checks L1 first. On a miss, it checks L2, then L3, then finally goes to RAM. And when it does fetch from RAM, it doesn't fetch just 4 bytes (one float). It fetches an entire cache line: 64 bytes, which corresponds to 16 floats at once.

[Diagram: a 64-byte cache line in RAM holding a[0] through a[15], fetched as a unit into L1 and handed to the CPU core.]
When the CPU reads a[0], it fetches an entire 64-byte cache line: a[0] through a[15]. The next 15 reads are served from L1 — no RAM access needed.

This is where CPU memory optimization happens: spatial locality. If you access a[0], the hardware bets that you'll access a[1] through a[15] soon. For a sequential loop, that bet pays off: one RAM access serves 16 iterations.

2.2 Sequential vs strided access

See benchmark

What happens when the bet doesn't pay off? Consider these two loops:

// Sequential: accesses a[0], a[1], a[2], ... - cache-friendly
float sum = 0;
for (int i = 0; i < n; i++) {
    sum += a[i];
}

// Strided: accesses a[0], a[16], a[32], ... - one float per 64-byte cache line
float sum = 0;
for (int i = 0; i < n; i += 16) {
    sum += a[i];
}

In the sequential version, every cache line fetch is fully utilized: all 16 floats get read. In the strided version, we fetch 64 bytes from RAM but only use 4 bytes, meaning 93.75% of the bandwidth is wasted.

On real hardware, the difference is huge. For a large array that doesn't fit in cache:

Access pattern                Effective bandwidth   Utilization
Sequential                    ~80 GB/s              ~100%
Stride-16 (every 64th byte)   ~5 GB/s               ~6%

So with the exact same hardware, just changing how we walk through memory changes effective bandwidth by a factor of 16.

[Diagram: two cache lines side by side. Sequential access uses 16/16 floats per line (~80 GB/s effective, prefetcher active); stride-16 access uses 1/16 floats (~5 GB/s effective, prefetcher defeated, 93.75% of fetched bytes wasted).]
Same RAM, same hardware — only the access pattern changes. Sequential access uses the full cache line; stride-16 wastes almost everything.

The CPU also has a hardware prefetcher: it detects sequential access patterns and starts fetching future cache lines before you need them. This means sequential access can effectively hide RAM latency: the data is already in L1 by the time the loop gets to it. Strided or random access defeats the prefetcher, and every access stalls.

[Diagram: timeline of a sequential scan. The first access stalls ~100 ns on a cache miss; the hardware prefetcher then detects the pattern and fetches future cache lines in the background, so later accesses hit L1.]
The first access to a[0] misses the cache and stalls. The prefetcher detects the sequential pattern and fetches future cache lines from RAM in the background, while the CPU processes the current one. By the time the CPU needs a[16], it is already in L1.

2.3 Structure of Arrays vs Array of Structures

See benchmark

This cache line principle has direct implications for how you lay out data. Let's take the implementation of a particle system as a toy example:

// Array of Structures (AoS) - the "natural" layout
struct Particle {
    float x, y, z;       // position
    float vx, vy, vz;    // velocity
    float mass;
    float charge;         // 32 bytes total
};
struct Particle particles[1000000];

// To update positions:
for (int i = 0; i < n; i++) {
    particles[i].x += particles[i].vx * dt;
    particles[i].y += particles[i].vy * dt;
    particles[i].z += particles[i].vz * dt;
}
// Structure of Arrays (SoA) - the cache-friendly layout
struct Particles {
    float x[1000000], y[1000000], z[1000000];
    float vx[1000000], vy[1000000], vz[1000000];
    float mass[1000000];
    float charge[1000000];
};
struct Particles p;

// To update positions:
for (int i = 0; i < n; i++) {
    p.x[i] += p.vx[i] * dt;
    p.y[i] += p.vy[i] * dt;
    p.z[i] += p.vz[i] * dt;
}

With AoS, each cache line (64 bytes) holds 2 complete particles, including mass and charge that we don't need for position updates, so we load data we never use.

With SoA, a cache line holds 16 consecutive x values, and we use every single one. When the loop processes x values, the data stream is perfectly sequential and the prefetcher is happy.

[Diagram: with AoS, each 64-byte cache line holds two full particles, so the mass and charge fields are fetched but unused; with SoA, a cache line holds 16 consecutive x values and every byte is used, while vx lives on a separate line that is only loaded when needed.]
When updating only positions (x, y, z), AoS wastes bandwidth loading velocity, mass, and charge. SoA keeps each field contiguous — every byte in the cache line is useful.

In real-world code, switching from AoS to SoA often gives a 2–4x speedup for data-parallel workloads. Game engines, physics simulations, and database systems all use SoA layouts for hot paths.
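As a sanity check, here's a minimal runnable version of both layouts, scaled down to 4 particles (the names update_aos / update_soa are illustrative). Both compute bit-identical results; the layouts differ only in which bytes each cache line drags in:

```c
// aos_vs_soa.c - sketch: identical math, different memory layout.
#define NP 4  // scaled down; the real win appears at ~1M particles

struct ParticleAoS { float x, y, z, vx, vy, vz, mass, charge; };

struct ParticlesSoA {
    float x[NP], y[NP], z[NP];
    float vx[NP], vy[NP], vz[NP];
    float mass[NP], charge[NP];
};

void update_aos(struct ParticleAoS *p, int n, float dt) {
    for (int i = 0; i < n; i++) {
        p[i].x += p[i].vx * dt;   // every access also drags mass/charge
        p[i].y += p[i].vy * dt;   // into the cache
        p[i].z += p[i].vz * dt;
    }
}

void update_soa(struct ParticlesSoA *p, int n, float dt) {
    for (int i = 0; i < n; i++) {
        p->x[i] += p->vx[i] * dt; // x[] and vx[] stream sequentially
        p->y[i] += p->vy[i] * dt;
        p->z[i] += p->vz[i] * dt;
    }
}
```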


Part 3: CPU SIMD: Doing More Per Clock

3.1 From scalar to SIMD

See benchmark

So far, our CPU is processing one float at a time. But modern CPUs have SIMD (Single Instruction, Multiple Data) registers that can hold and operate on multiple values simultaneously:

SIMD generation   Architecture           Register width   Floats per op
NEON              ARM (Apple Silicon)    128 bit          4
SSE               x86                    128 bit          4
AVX2              x86                    256 bit          8
AVX-512           x86                    512 bit          16

Here's our vector add, rewritten with ARM NEON intrinsics:

#include <arm_neon.h>

void vector_add_neon(float* a, float* b, float* c, int n) {
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        // Load 4 floats from a and b into 128-bit registers
        float32x4_t va = vld1q_f32(&a[i]);
        float32x4_t vb = vld1q_f32(&b[i]);

        // Add all 4 pairs in a single instruction
        float32x4_t vc = vaddq_f32(va, vb);

        // Store 4 results
        vst1q_f32(&c[i], vc);
    }
    // Handle remaining elements
    for (; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

One vaddq_f32 instruction does the work of 4 scalar additions. The CPU issues one instruction and the ALU performs the operation across all 4 lanes of the 128-bit register in parallel.

[Diagram: a scalar ADD consumes one pair of floats and produces 1 result per instruction; a NEON ADD consumes float32x4_t registers and produces 4 results per instruction across 4 lanes.]
One NEON vaddq_f32 does the same work as 4 scalar additions, using 4 parallel lanes in a 128-bit register.

3.2 SIMD and memory

SIMD makes the memory bandwidth question even more critical. A scalar loop might process 1 float per cycle. At 4 bytes per float and a 3.5 GHz clock (Apple M2 performance core), that's ~14 GB/s of data consumed per input array. A NEON loop processes 4 floats per cycle: ~56 GB/s per input array. An operation like vector add that reads two arrays and writes one demands roughly ~168 GB/s of total memory traffic. But Apple Silicon's unified memory delivers ~100 GB/s[1].

This means that for simple operations like vector add, the CPU becomes memory-bound once SIMD is enabled. The compute units can process data faster than memory can deliver it, so they spend cycles waiting. On x86 systems with DDR5 (~50 GB/s) and wider SIMD (AVX2 at 8 floats/op, AVX-512 at 16), the imbalance is even larger.
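That back-of-the-envelope can be written down as a one-liner (simd_traffic_gbs is a hypothetical helper, counting 4 bytes per float across all read and write streams, and assuming one SIMD op issued per cycle with no cache reuse):

```c
// Demand on the memory system for a streaming SIMD loop, in GB/s.
double simd_traffic_gbs(double clock_ghz, int floats_per_op, int streams) {
    return clock_ghz * floats_per_op * 4 /* bytes per float */ * streams;
}
```

With the M2 numbers from the text, simd_traffic_gbs(3.5, 4, 3) gives 168 GB/s for the NEON vector add (read a, read b, write c), against ~100 GB/s of available bandwidth.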

Modern compilers can often auto-vectorize simple loops like this, so even the scalar implementation may end up matching the hand-written intrinsics version.

3.3 The limits of CPU parallelism

Let's count the total parallelism available on an Apple M2 CPU:

  • 4 performance cores x 4 floats per NEON instruction x 2 FMA units per core = 32 float operations per cycle

At 3.5 GHz, that's ~0.1 TFLOPS peak (we are only talking about the CPU here; the Apple M2 also has a GPU, and the combined CPU+GPU peak is around 3.6 TFLOPS). A GPU like the RTX 4090 delivers ~83 TFLOPS (FP32)[2], roughly 800x more raw compute. Even a high-end x86 CPU with AVX-512 (16 cores x 16 floats x 2 FMA x 4 GHz ≈ 2 TFLOPS) is still ~40x behind. The GPU wins by having thousands of simpler cores instead of a few powerful ones.

The CPU's strength is flexibility: it can switch between a multitude of different tasks and complete each one quickly. The GPU's strength is raw data parallelism: when you have the same operation applied to millions of elements, nothing beats it.

This is the bridge to the GPU world. SIMD is the CPU's way of doing data-parallel work. The GPU takes this concept and pushes it to the extreme.


Part 4: The GPU + VRAM Side

4.1 The GPU execution model

A GPU doesn't run code the way a CPU does. Instead of a few powerful threads, it runs thousands of lightweight threads organized in a strict hierarchy:

  • Thread: The smallest unit. Each thread runs the same program (the kernel) but on different data.
  • Warp (NVIDIA) / Wavefront (AMD): 32 threads that execute in lockstep, running the same instruction at the same time. This is SIMT (Single Instruction, Multiple Threads), the GPU's version of SIMD.
  • Block (a.k.a. thread block): A group of warps (up to 1024 threads) that share fast on-chip memory and can synchronize with each other.
  • Grid: The entire collection of blocks that execute a kernel.

Here's the same vector add as a CUDA kernel:

// vector_add.cu - GPU version
__global__ void vector_add(float* a, float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

// Launch with 256 threads per block
int threads = 256;
int blocks = (n + threads - 1) / threads;
vector_add<<<blocks, threads>>>(d_a, d_b, d_c, n);

There is no loop in that code. Each thread computes one element. If you have 10 million elements, you launch 10 million threads. The GPU hardware schedules them across its streaming multiprocessors (SMs).
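The index arithmetic is worth staring at. Here's a plain-C emulation of the launch (count_coverage is an illustrative helper, not a CUDA API) showing that the grid covers every element exactly once, with the if (i < n) guard trimming the last partial block:

```c
// Emulate blockIdx.x * blockDim.x + threadIdx.x on the CPU.
// Returns the number of threads launched; hits[i] counts how many
// threads processed element i (should be exactly 1 for all i < n).
#include <string.h>

int count_coverage(int n, int threads_per_block, int *hits) {
    int blocks = (n + threads_per_block - 1) / threads_per_block;  // round up
    memset(hits, 0, (size_t)n * sizeof(int));
    int launched = 0;
    for (int b = 0; b < blocks; b++) {
        for (int t = 0; t < threads_per_block; t++) {
            int i = b * threads_per_block + t;  // the kernel's global index
            launched++;
            if (i < n) hits[i]++;               // the kernel's bounds check
        }
    }
    return launched;
}
```

For n = 1000 and 256 threads per block, 4 blocks (1024 threads) are launched and the last 24 threads do nothing, which is why the guard matters.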

[Diagram: a grid contains blocks 0, 1, 2, ...; each block contains warps of 32 threads.]
A kernel launch creates a grid of blocks. Each block contains warps of 32 threads that execute in lockstep (SIMT). The GPU schedules warps across its streaming multiprocessors.

4.2 VRAM and the wide bus

When those thousands of threads request data, VRAM must deliver. And it does, through a fundamentally different bus architecture than RAM.

An NVIDIA RTX 4090 has a 384-bit memory bus to its GDDR6X VRAM[2]. That means each memory transaction can move 48 bytes in a single cycle. At 21 Gbps per pin[5], this yields ~1 TB/s of bandwidth, roughly 10x more than Apple Silicon's LPDDR5 (and ~20x more than typical x86 DDR5).

HBM (High Bandwidth Memory), used in data center GPUs like the A100 and H100, goes even further: a 5120-bit bus delivering ~3.35 TB/s on the H100[3][4].

This massive bandwidth exists because the GPU needs it. A warp of 32 threads, each reading a 4-byte float, needs 128 bytes per memory access. With thousands of warps active, the demand is enormous. The wide bus is what makes it work.

But bandwidth alone isn't enough. How those threads access memory determines whether you use that bandwidth efficiently.

4.3 Coalesced vs uncoalesced access

This is a very important GPU optimization concept.

When 32 threads in a warp access memory, the GPU memory controller tries to coalesce their requests into the minimum number of transactions. If adjacent threads access adjacent addresses, a single 128-byte transaction serves the entire warp:

// Coalesced: thread 0 reads a[0], thread 1 reads a[1], ...
int i = blockIdx.x * blockDim.x + threadIdx.x;
float val = a[i];  // ✅ 1 transaction for 32 threads
// Uncoalesced: thread 0 reads a[0], thread 1 reads a[32], ...
int i = (blockIdx.x * blockDim.x + threadIdx.x) * 32;
float val = a[i];  // ❌ 32 separate transactions!

In the coalesced case, the 32 threads read a[0] through a[31], a contiguous 128-byte block. One transaction.

In the uncoalesced case, the 32 threads read a[0], a[32], a[64], scattered across memory. Each read hits a different 128-byte segment, so the hardware must issue up to 32 separate transactions, serializing them. The warp waits 32x longer.

[Diagram: coalesced case, 32 adjacent reads served by one 128-byte VRAM transaction with full bandwidth utilization; uncoalesced case, 32 stride-32 reads scattered across memory, requiring up to 32 serialized transactions with most fetched bytes wasted.]
Coalesced access: adjacent threads read adjacent addresses → 1 wide transaction. Uncoalesced: threads read scattered addresses → many serialized transactions.

Sound familiar? This is the GPU equivalent of the CPU's sequential-vs-strided access problem, but the penalty is even more severe because there's no hardware prefetcher to save you. The rule is simple: adjacent threads should access adjacent memory addresses.
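You can sanity-check those transaction counts with a small host-side model of the coalescer (warp_transactions is a made-up helper, and real hardware is more nuanced, e.g. recent GPUs coalesce at 32-byte sector granularity, but the principle holds):

```c
// Count distinct 128-byte segments touched by a warp of 32 threads,
// where thread t reads the 4-byte float a[t * stride_floats].
int warp_transactions(int stride_floats) {
    long seen[32];
    int nseen = 0;
    for (int t = 0; t < 32; t++) {
        long seg = (long)t * stride_floats * 4 / 128;  // 128-byte segment id
        int found = 0;
        for (int s = 0; s < nseen; s++)
            if (seen[s] == seg) found = 1;
        if (!found) seen[nseen++] = seg;               // new transaction needed
    }
    return nseen;
}
```

Stride 1 needs a single transaction; stride 32 (floats) puts every thread in its own 128-byte segment and needs all 32.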

4.4 Latency hiding

Here's the key insight that makes GPUs work despite VRAM having higher latency than RAM (~400 ns vs ~100 ns):

When warp 0 issues a memory request and has to wait 400 ns for the data to arrive, the GPU doesn't stall. It switches to warp 1, which has data ready to compute. When warp 1 stalls, it switches to warp 2. And so on.

If you have enough warps in flight (enough occupancy), there's always a warp ready to execute, and the memory latency is completely hidden. The cores never go idle.

Time ──────────────────────────────────────────────►

Warp 0: [compute] [    waiting for memory...    ] [compute]
Warp 1:           [compute] [  waiting for memory...  ] [compute]
Warp 2:                     [compute] [ waiting for memory... ] [compute]
Warp 3:                               [compute] [   wait...   ] [compute]
         ▲                  ▲                    ▲
         GPU always has     No idle              100% utilization
         a warp to run      cycles               despite high latency

This is fundamentally different from the CPU approach:

Time ──────────────────────────────────────────────►

Core 0: [compute] [  STALL - waiting for RAM  ] [compute] [  STALL  ]

                   Core is idle. Nothing to do.
                   That's why CPUs need caches.

The CPU uses big caches to reduce the frequency of stalls. The GPU uses massive parallelism to tolerate stalls. Two different solutions to the same problem.

[Diagram: a single CPU core alternates work and ~100 ns RAM stalls, at roughly 40% utilization; a GPU SM switches between warps W0 through W3 and stays around 95% utilized.]
The CPU stalls on cache misses. The GPU hides latency by switching between warps — while one waits for memory, another computes. With enough warps (occupancy), the SM is never idle.

Part 5: Shared Memory and Tiling

5.1 The GPU memory hierarchy

Just like the CPU has L1/L2/L3 caches, the GPU has its own memory hierarchy:

Memory          Scope        Size               Latency       Managed by
Registers       Per thread   ~256 KB per SM     ~1 cycle      Compiler
Shared memory   Per block    48–100 KB per SM   ~5 cycles     Programmer
L2 cache        Global       4–50 MB            ~200 cycles   Hardware
VRAM (global)   Global       8–80 GB            ~400 cycles   Programmer

The critical difference from CPUs is that shared memory is explicitly managed by the programmer. It's like having a programmable L1 cache: you decide what to load into it, when to load it, and how threads access it. This is very powerful and unlocks enormous optimizations, but you have to do the work yourself.

[Diagram: the hierarchy from registers (fastest, smallest, per thread) through shared memory (per block, programmer-managed) and L2 (hardware-managed) down to VRAM (slowest, largest, ~1–3 TB/s).]
The GPU memory hierarchy mirrors the CPU's — keep hot data close to compute. The key difference: shared memory is explicitly managed by the programmer, unlike CPU caches.

5.2 Matrix multiply: naive vs tiled

Matrix multiplication is the canonical example of why shared memory matters. For C = A × B where all matrices are N × N:

    C[i][j] = Σ (k = 0 … N−1)  A[i][k] · B[k][j]

Each element of C requires reading an entire row of A and an entire column of B: 2N reads from global memory. For N² elements, that's 2N³ total reads. But A and B only have N² elements each, meaning every element is read N times on average.

The naive kernel, where every thread reads what it needs from VRAM:

__global__ void matmul_naive(float* A, float* B, float* C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; k++) {
            sum += A[row * N + k] * B[k * N + col];
            //     ^^^^^^^^^^^^^^^^   ^^^^^^^^^^^^^
            //     Each read from global memory (~400 cycles)
        }
        C[row * N + col] = sum;
    }
}

For a 4096x4096 matrix, each thread reads 8192 floats from VRAM. With 4096² ≈ 16.7M threads, that's ~137 billion global memory reads.

The tiled kernel, where threads cooperatively load tiles into shared memory:

#define TILE 16

__global__ void matmul_tiled(float* A, float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];  // shared memory tile for A
    __shared__ float Bs[TILE][TILE];  // shared memory tile for B

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    // Slide a TILExTILE window across A's row and B's column
    for (int t = 0; t < N / TILE; t++) {
        // Each thread loads ONE element of each tile from global memory
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];

        __syncthreads();  // Wait for all threads to finish loading

        // Now multiply using fast shared memory
        for (int k = 0; k < TILE; k++) {
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            //     ^^^^^^^^^^^^^^^^^^   ^^^^^^^^^^^^^^^^^^
            //     From shared memory (~5 cycles, not ~400)
        }

        __syncthreads();  // Wait before loading next tile
    }

    C[row * N + col] = sum;
}

The key insight: each TILExTILE block of threads cooperatively loads two small tiles (16x16 = 256 elements each) into shared memory. Then every thread in the block reads from that fast local copy instead of going to VRAM. Each element of A and B is loaded from VRAM once per tile, but read from shared memory by 16 threads.

Global memory reads drop from 2N³ to 2N³/TILE, a 16x reduction for TILE = 16.
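Tiling reorders memory traffic; it doesn't change the math. This plain-C sketch mirrors both loop structures (scaled down to 8x8 matrices with 4-wide tiles so it runs anywhere; the names and MATN/MATTILE constants are illustrative) and produces identical results:

```c
#define MATN 8
#define MATTILE 4

void matmul_naive_cpu(float A[MATN][MATN], float B[MATN][MATN], float C[MATN][MATN]) {
    for (int i = 0; i < MATN; i++)
        for (int j = 0; j < MATN; j++) {
            float s = 0;
            for (int k = 0; k < MATN; k++)
                s += A[i][k] * B[k][j];     // a fresh "global" read every time
            C[i][j] = s;
        }
}

void matmul_tiled_cpu(float A[MATN][MATN], float B[MATN][MATN], float C[MATN][MATN]) {
    for (int i = 0; i < MATN; i++)
        for (int j = 0; j < MATN; j++)
            C[i][j] = 0;
    for (int t = 0; t < MATN / MATTILE; t++)        // slide the tile window
        for (int i = 0; i < MATN; i++)
            for (int j = 0; j < MATN; j++)
                for (int k = 0; k < MATTILE; k++)   // partial sums, tile by tile
                    C[i][j] += A[i][t * MATTILE + k] * B[t * MATTILE + k][j];
}
```

On the GPU, the inner MATTILE loop is where the reads come from shared memory instead of VRAM; here both versions read the same arrays, which is exactly the point: only the access order differs.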

[Diagram: naive kernel, every thread reads rows of A and columns of B straight from VRAM at ~400 cycles per read, 2N³ total reads with massive redundancy; tiled kernel, the block cooperatively loads 16x16 tiles As/Bs into shared memory once, then all threads read them at ~5 cycles, for 2N³/TILE VRAM reads.]
Naive: every thread independently reads from slow VRAM. Tiled: threads cooperatively load a tile into fast shared memory, then all read from it. Dramatically fewer global memory accesses.

Part 6: The Full Journey

6.1 End-to-end CUDA program

Let's put it all together. Here's the complete flow, from CPU memory allocation to GPU computation and back:

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void vector_add(float* a, float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    int n = 1 << 20;  // ~1 million elements
    size_t size = n * sizeof(float);

    // 1. Allocate in RAM (host memory)
    float *h_a = (float*)malloc(size);
    float *h_b = (float*)malloc(size);
    float *h_c = (float*)malloc(size);

    // Initialize on CPU
    for (int i = 0; i < n; i++) {
        h_a[i] = 1.0f;
        h_b[i] = 2.0f;
    }

    // 2. Allocate in VRAM (device memory)
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, size);
    cudaMalloc(&d_b, size);
    cudaMalloc(&d_c, size);

    // 3. Copy RAM → VRAM (over PCIe bus)
    cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);

    // 4. Launch kernel (GPU reads/writes VRAM)
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vector_add<<<blocks, threads>>>(d_a, d_b, d_c, n);

    // 5. Copy VRAM → RAM (over PCIe bus)
    cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);

    // 6. Verify on CPU
    printf("c[0] = %f (expected 3.0)\n", h_c[0]);

    // Cleanup
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}

Each numbered step maps to a distinct phase of the data journey:

[Diagram: the numbered steps as a pipeline: (1) malloc in RAM, (2) cudaMalloc in VRAM, (3) cudaMemcpy host-to-device over PCIe (~32 GB/s, the bottleneck), (4) kernel launch with GPU cores reading/writing VRAM (~1 TB/s), (5) cudaMemcpy device-to-host back over PCIe.]
The complete lifecycle: allocate, transfer to GPU, compute, transfer back. The PCIe bridge is the bottleneck — minimize crossings.

6.2 The same thing in PyTorch

All of the above (malloc, cudaMalloc, cudaMemcpy, kernel launch) is what frameworks do under the hood. Here's the mapping:

CUDA C
int n = 1 << 20;
size_t size = n * sizeof(float);

// Allocate on CPU
float *h_a = (float*)malloc(size);
float *h_b = (float*)malloc(size);
float *h_c = (float*)malloc(size);

// Allocate on GPU
float *d_a, *d_b, *d_c;
cudaMalloc(&d_a, size);
cudaMalloc(&d_b, size);
cudaMalloc(&d_c, size);

// Copy CPU → GPU
cudaMemcpy(d_a, h_a, size,
           cudaMemcpyHostToDevice);
cudaMemcpy(d_b, h_b, size,
           cudaMemcpyHostToDevice);

// Run kernel
int threads = 256;
int blocks = (n + threads - 1) / threads;
vector_add<<<blocks, threads>>>(d_a, d_b, d_c, n);

// Copy GPU → CPU
cudaMemcpy(h_c, d_c, size,
           cudaMemcpyDeviceToHost);
PyTorch
import torch

n = 1 << 20

# Allocate on CPU
a = torch.randn(n)
b = torch.randn(n)

# Allocate + copy to GPU
# (cudaMalloc + cudaMemcpy)
a_gpu = a.to("cuda")
b_gpu = b.to("cuda")

# Run operation
# (kernel launch hidden)
c_gpu = a_gpu + b_gpu

# Copy GPU → CPU
# (cudaMemcpy back)
c = c_gpu.cpu()

The framework handles memory allocation, transfers, and kernel selection. But the same principles apply: data lives in RAM or VRAM, transfers go over PCIe, and your GPU code's performance depends on coalesced access, occupancy, and shared memory usage, whether you write CUDA or call torch.matmul.

6.3 The bandwidth reality

Here's the full picture, the real bandwidth at every level of the system:

Path                     Bandwidth    What it means
L1 cache ↔ CPU core      ~2 TB/s      ~64 bytes/cycle x 4 GHz
L2 cache ↔ L1            ~800 GB/s    Per-core
L3 cache ↔ L2            ~400 GB/s    Shared
RAM ↔ L3                 ~100 GB/s    Apple M2, LPDDR5 [1]
PCIe 4.0 x16             ~32 GB/s     The CPU↔GPU bridge [3]
PCIe 5.0 x16             ~64 GB/s     Newer systems [3]
VRAM ↔ GPU (GDDR6X)      ~1 TB/s      RTX 4090 [2]
VRAM ↔ GPU (HBM3)        ~3.35 TB/s   H100 [4]
Shared memory ↔ SM       ~19 TB/s     Per-SM, 128 bytes/cycle

Two things stand out:

  1. VRAM bandwidth is ~10x RAM bandwidth (or ~20x on x86 systems with DDR5): GPUs can feed their cores far faster.

  2. PCIe is the bottleneck. At 32 GB/s, transferring data between RAM and VRAM is 30x slower than reading from VRAM. This is why GPU programs try to minimize transfers and keep data on the GPU as long as possible. It's why PyTorch's .to("cuda") should be called once, not in a loop.


Part 7: Why Local LLM Inference Is 10–20x Faster on GPU

Now we have all the pieces to answer the question from the introduction.

When an LLM generates text, there are two phases. During prefill, the model processes the entire input prompt at once. This is a matrix-matrix multiply, which is compute-bound and where GPUs' raw TFLOPS advantage shines. During decode, the model produces one token at a time: the current hidden state (a vector) is multiplied by the weight matrices of every layer (a matrix-vector multiply).

For a 7B parameter model in FP16 (2 bytes per weight), decode means reading ~14 GB of weights from memory for every single token. Each weight is used exactly once per token: read, multiply, move on.

This makes the decode phase (the one that determines how fast text appears) memory-bandwidth-bound, not compute-bound. The arithmetic intensity is roughly 1 FLOP per byte loaded, far below the threshold where compute speed matters. Token generation speed is almost entirely determined by how fast you can stream weights from memory.

The bandwidth gap explains the speed gap

                       CPU (Apple M2, LPDDR5)   GPU (RTX 4090, GDDR6X)
Memory bandwidth       ~100 GB/s                ~1,000 GB/s
Time to stream 14 GB   ~140 ms                  ~14 ms
Tokens per second      ~7                       ~70 (~10x faster)

The ~10x speed ratio matches the ~10x bandwidth ratio almost exactly. On an x86 system with DDR5 (~50 GB/s), the gap widens to ~20x. Either way, the principle is the same: the GPU's memory bus can deliver weights to the compute units far faster, and that's what matters for this workload.
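The arithmetic behind that table is just a division (tokens_per_sec is a hypothetical helper; it assumes pure bandwidth-limited decoding and ignores KV-cache reads, prefill, and every other overhead):

```c
// Upper-bound decode speed: every token streams all weights once,
// so tokens/s = (GB/s of bandwidth) / (GB of weights).
double tokens_per_sec(double bandwidth_gb_s, double model_gb) {
    return bandwidth_gb_s / model_gb;
}
```

tokens_per_sec(100, 14) gives ~7 for the M2 and tokens_per_sec(1000, 14) gives ~71 for the 4090: the speed ratio is exactly the bandwidth ratio.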

Every mechanism we've covered plays a role in this:

  • The GPU's 384-bit bus streams weights ~10x faster than Apple Silicon's 128-bit LPDDR5 bus (~20x faster than a typical x86 DDR5 setup)
  • Weight matrices are stored contiguously in VRAM, so warp reads are perfectly coalesced (one transaction serves 32 threads)
  • While one warp waits for the next chunk of weights, another is already multiplying, so latency is hidden despite VRAM's ~400 ns access time
  • The actual GEMM kernels (cuBLAS, used by PyTorch and llama.cpp) use tiled algorithms with shared memory, reducing redundant VRAM reads
  • The model must be loaded into VRAM before inference starts. Streaming over PCIe (~32 GB/s) during generation would be 30x slower than reading from VRAM
[Diagram: CPU inference streams the 14 GB of weights in RAM over a 128-bit ~100 GB/s bus to 16 cores for a matrix-vector multiply (~140 ms/token, ~7 tokens/sec, the bus is the bottleneck); GPU inference streams the same 14 GB in VRAM over a 384-bit ~1,000 GB/s bus to thousands of cores (~14 ms/token, ~70 tokens/sec). Over one second, the CPU emits ~7 tokens and the GPU ~70.]
LLM inference is memory-bandwidth-bound: every token requires streaming the full model weights. The GPU's ~10× higher bandwidth translates almost directly to ~10× faster generation (up to ~20× vs x86 DDR5).

What about quantization?

Quantization (INT8, INT4) reduces the model size: a 7B model goes from 14 GB (FP16) to ~3.5 GB (INT4). This directly speeds up inference because there's less data to stream:

Precision   Model size (7B)   GPU tok/s   CPU tok/s
FP16        14 GB             ~70         ~7
INT8        7 GB              ~140        ~14
INT4        3.5 GB            ~280        ~28

Note: These numbers are idealized upper bounds assuming pure bandwidth-limited streaming. In practice, dequantization overhead (unpacking INT4 → FP16 for computation), attention over a growing KV cache, and other fixed costs eat into the gains. Real-world INT4 speedups over FP16 are closer to ~2–3x than a clean 4x. The GPU/CPU ratio also isn't perfectly constant: at very low precision, compute and dequantization can start to matter more than raw bandwidth. The GPU advantage stays at roughly ~10x regardless of quantization, because the bandwidth ratio doesn't change. But quantization lets smaller GPUs (or even CPUs) reach usable speeds by reducing the data that needs to stream through the bus.

This is why llama.cpp defaults to Q4 quantization: it makes CPU inference on Apple Silicon quite usable (~28 tok/s for 7B) and GPU inference very fast (~280 tok/s).


Conclusion

LLM token generation is memory-bandwidth-bound. The GPU's ~10x wider memory bus translates almost directly into ~10x faster inference, and every mechanism we covered (coalesced access, latency hiding, tiling) has been developed to keep that wide bus saturated.

That said, the gap is shrinking from the other direction. Quantization and distillation are making models small enough that CPU inference on Apple Silicon is becoming genuinely usable for everyday tasks. Whether that trend can close the gap enough for compute-heavy workloads like coding assistance remains an open question, but now you know exactly what the gap is made of.


References

  1. Apple unveils M2 - Apple Newsroom — "100GB/s of unified memory bandwidth"
  2. NVIDIA GeForce RTX 4090 - NVIDIA — 16,384 CUDA cores, 384-bit memory bus, 83 TFLOPS FP32, GDDR6X
  3. NVIDIA Hopper Architecture In-Depth - NVIDIA Developer Blog — H100 SXM5: 5 HBM3 stacks, 10 x 512-bit memory controllers (5120-bit), over 3 TB/s bandwidth; PCIe Gen 5: 128 GB/s total (64 GB/s each direction)
  4. NVIDIA H100 - NVIDIA Data Center — 3.35 TB/s GPU memory bandwidth (SXM5)
  5. GDDR6X - Micron — 19–24 Gb/s per pin