SBVR Kernels | Wonjun Bang

First, I recommend reading the full paper on SBVR available at arXiv. Here, I will discuss how we designed a GPU-friendly kernel for SBVR-formatted weights, focusing on the code-level details.

We defined a struct template to store the bitvectors in the SBVR format. Each bitvector is divided into 32-bit integers, and one bitvector is stored in a struct instance named bvrs.

Struct template bvrs (truncated at target-bit 6)

The interface of the kernel that actually computes the GEMV for the SBVR format is as follows.
It takes the following inputs:

l/r_bvr: the set of bitvectors representing the LLM weights,
l/r_coeff_cache: the corresponding coefficients for the bitvectors in the SBVR representation,
l/r_coeff_idx: the indices indicating the coefficient positions for each bitvector in coeff_cache.

SBVR GEMV interface

The actual computation starts from the inner dimension, K (1×K, K×N, GEMV). Here, we divide the K dimension according to the size of the bitvector used for quantization (typically 128).
The inner product is computed between two K_PER_BVR*32-sized vectors and accumulated to produce one final element of the GEMV. This process is executed in parallel across multiple thread blocks, each handling different parts of the outer dimension, N.

Computation order of the kernel

Now, here comes the interesting part. As mentioned in the paper, we replace the fused multiply-add operation in GEMV with a series of logical bitwise operations (& and popcount), followed by a reduction along the K dimension.

SBVR Inner Product Computation

This part can be implemented as follows. First, fetch the target bitvectors from the parameters l_bvr and r_bvr. Then compute the popcount of the bitwise AND. As noted above, each bitvector is stored in the bvrs struct. In the innermost loop, take one uint32_t segment from each bitvector and accumulate the partial result __popc(l_i & r_i) into the uchar4 array pop_cache. After the outer loop sweeps the K dimension—covering all K_PER_BVR segments—the pop_cache contains all combinations of the logical-operation results for the participating bitvectors.

Handling Bitwise-Operations

Finally, we multiply the corresponding coefficients of the bitvectors with the logical-operation results stored in pop_cache.
The final reduction result is accumulated in a variable named sum, and the threads within a warp—each computing different parts of the K dimension—are reduced using __shfl_down_sync().
The outcome of these computations is then written to the output vector out.

This concludes the computation of the SBVR GEMV kernel.

Final Reduction

You can find more details about the SBVR implementation—such as the parameters used to launch the kernel (grid and block sizes) and how this kernel is integrated into the SBVR forward pass—on GitHub.