The RISC-V Vector extension (RVV 1.0) is one of the most important pieces of the ISA: the foundation of RISC-V's push into AI and HPC. But it works differently from the SIMD you may know from x86 (AVX) or ARM (NEON). This hands-on guide explains the vector-length-agnostic model and walks through a runnable example.

The Big Idea: Length-Agnostic Vectors
Traditional SIMD bakes the register width into the instruction set. AVX-512 means 512-bit registers; NEON means 128-bit. When a new, wider generation arrives, you rewrite or recompile your kernels. That is a recurring tax.
RISC-V takes a different path. With RVV, you do not hardcode the width. Instead, your code asks the hardware how many elements it can handle right now, processes that many, and loops. The same binary runs on a 128-bit implementation and a multi-kilobit one; each just does more or fewer elements per iteration. This is vector-length-agnostic (VLA) programming, and it is RVV's defining feature.
The Core Mechanism: vsetvl
The instruction that makes VLA work is vsetvli (set vector length). You tell it how many elements you want to process and the element type; it returns how many it will process this iteration (the "vl"), capped by the hardware's capability.
# a0 = number of elements remaining
# request that many 32-bit (e32) elements, grouping registers x1 (m1)
vsetvli t0, a0, e32, m1, ta, ma # t0 = granted vector length (vl)
Key parameters:
- SEW (Selected Element Width): e8, e16, e32, e64. The size of each element in bits.
- LMUL (vector register grouping): m1, m2, m4, m8 (or fractional, such as mf2). How many vector registers to gang together for more elements per pass.
- Tail/mask policy: ta/tu and ma/mu control what happens to unused tail and masked elements.
You then operate on whole vectors and advance your pointers by vl elements. This loop pattern is called stripmining.
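To make the grant concrete, here is a small plain-C model of what vsetvli hands back. It is a sketch under stated assumptions, not a description of any particular core: the names vlmax, grant_vl and stripmine_passes are invented for illustration, only integer LMUL is handled, and the grant is modelled as min(AVL, VLMAX). The specification also allows an implementation to return a smaller vl when the request lies between VLMAX and 2*VLMAX, which this model ignores.

```c
#include <assert.h>
#include <stddef.h>

/* VLMAX = (VLEN / SEW) * LMUL: how many elements one register group holds.
   Assumption: integer LMUL only; fractional settings are left out. */
size_t vlmax(size_t vlen_bits, size_t sew_bits, size_t lmul) {
    assert(sew_bits == 8 || sew_bits == 16 || sew_bits == 32 || sew_bits == 64);
    assert(lmul == 1 || lmul == 2 || lmul == 4 || lmul == 8);
    return vlen_bits / sew_bits * lmul;
}

/* Simplified grant: vl = min(AVL, VLMAX). Real hardware may choose
   differently when VLMAX < AVL < 2*VLMAX. */
size_t grant_vl(size_t avl, size_t vlen_bits, size_t sew_bits, size_t lmul) {
    size_t max = vlmax(vlen_bits, sew_bits, lmul);
    return avl < max ? avl : max;
}

/* Number of stripmining passes needed to cover n elements. */
size_t stripmine_passes(size_t n, size_t vlen_bits, size_t sew_bits, size_t lmul) {
    size_t passes = 0;
    while (n > 0) {
        n -= grant_vl(n, vlen_bits, sew_bits, lmul);  /* consume vl elements */
        passes++;
    }
    return passes;
}
```

Under this model, 1000 elements of type e32 take 250 passes at VLEN=128 with m1, 32 passes with m8, and 2 passes on a 2048-bit machine with m8. The loop itself never changes; only the grant does.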
A Worked Example: SAXPY
SAXPY (y = a*x + y, single-precision) is the "hello world" of vector code. Here is the VLA loop in RISC-V vector assembly:
# void saxpy(size_t n, float a, const float *x, float *y)
# a0 = n, fa0 = a, a1 = x, a2 = y
saxpy:
.loop:
vsetvli t0, a0, e32, m8, ta, ma # t0 = elements this pass
vle32.v v0, (a1) # load x[i..i+vl]
sub a0, a0, t0 # n -= vl
slli t1, t0, 2 # bytes = vl * 4
add a1, a1, t1 # x += vl
vle32.v v8, (a2) # load y[i..i+vl]
vfmacc.vf v8, fa0, v0 # y = a*x + y (fused multiply-add)
vse32.v v8, (a2) # store y back
add a2, a2, t1 # y += vl
bnez a0, .loop # repeat until done
ret
Notice there is no hardcoded width anywhere. The same routine runs unmodified whether the chip has 128-bit or 2048-bit vector registers; vsetvli adapts every iteration.
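If you want to convince yourself of the control flow without a RISC-V toolchain, the loop above can be modelled in portable C. This is only an illustrative sketch: saxpy_stripmined and its vlen_bits parameter are hypothetical stand-ins for the hardware, the grant is simplified to min(n, VLMAX) for e32/m8, and the inner loop uses a separate multiply and add where vfmacc would round once.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical software model of the e32/m8 SAXPY loop.
   vlen_bits plays the role of the hardware vector register width. */
void saxpy_stripmined(size_t n, float a, const float *x, float *y,
                      size_t vlen_bits) {
    assert(vlen_bits >= 32);
    size_t max_vl = vlen_bits / 32 * 8;          /* VLMAX for e32, m8 */
    while (n > 0) {
        size_t vl = n < max_vl ? n : max_vl;     /* models vsetvli */
        for (size_t i = 0; i < vl; i++)          /* models load, vfmacc, store */
            y[i] = a * x[i] + y[i];
        n -= vl;                                 /* n -= vl */
        x += vl;                                 /* x += vl */
        y += vl;                                 /* y += vl */
    }
}
```

Calling it with n = 37 and vlen_bits = 128 takes two passes (32 elements, then 5), while vlen_bits = 2048 finishes in one. The results in y are the same either way, which is the portability property the assembly relies on.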
You Usually Donβt Write Assembly
In practice you let the compiler vectorize. Modern GCC and Clang/LLVM auto-vectorize for RVV, and you can also use intrinsics for control without dropping to assembly:
#include <riscv_vector.h>
void saxpy(size_t n, float a, const float *x, float *y) {
for (size_t vl; n > 0; n -= vl, x += vl, y += vl) {
vl = __riscv_vsetvl_e32m8(n);
vfloat32m8_t vx = __riscv_vle32_v_f32m8(x, vl);
vfloat32m8_t vy = __riscv_vle32_v_f32m8(y, vl);
vy = __riscv_vfmacc_vf_f32m8(vy, a, vx, vl);
__riscv_vse32_v_f32m8(y, vy, vl);
}
}
Compile it with the vector extension enabled:
riscv64-linux-gnu-gcc -march=rv64gcv -O3 -c saxpy.c
# or with Clang:
clang --target=riscv64-linux-gnu -march=rv64gcv -O3 -c saxpy.c
(See Build a RISC-V Toolchain if you do not have a cross-compiler yet.)
Running It Without Hardware
You do not need a vector board to experiment. Enable RVV in QEMU:
qemu-system-riscv64 -machine virt -cpu rv64,v=true,vlen=128 ...
Or run user-mode binaries with a vector-capable CPU model. For real silicon, the Banana Pi BPI-F3 (SpacemiT K1) implements RVV 1.0 and is the most accessible vector board in 2026.
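One way to check which vector length the emulator or board actually exposes is to ask for the maximum vl at e8/m1, which equals the register size in bytes. The sketch below assumes a toolchain that provides riscv_vector.h and defines __riscv_vector when the vector extension is enabled; detect_vlen_bits and report_vlen are names made up for this example. When built for a target without RVV it simply reports that and returns 0.

```c
#include <stddef.h>
#include <stdio.h>

#if defined(__riscv_vector)
#include <riscv_vector.h>
#endif

/* Returns VLEN in bits, or 0 when compiled without the vector extension. */
size_t detect_vlen_bits(void) {
#if defined(__riscv_vector)
    /* VLMAX for e8/m1 is VLEN/8, i.e. the vector register size in bytes. */
    return __riscv_vsetvlmax_e8m1() * 8;
#else
    return 0;
#endif
}

/* Prints the detected vector length, or a note if RVV is not compiled in. */
void report_vlen(void) {
    size_t bits = detect_vlen_bits();
    if (bits == 0)
        printf("built without RVV; no vector length to report\n");
    else
        printf("VLEN = %zu bits\n", bits);
}
```

Built with -march=rv64gcv and run under a CPU model configured with vlen=128, report_vlen would be expected to print 128; changing the vlen= value should change the number without recompiling.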
Why VLA Wins for AI and HPC
The length-agnostic model is exactly what a fast-moving field like AI wants: write the kernel once, and it scales as vector hardware grows, with no per-generation rewrites. This is why RVV underpins the Monte Cimone HPC results and why the ecosystem is layering matrix extensions (IME/VME) on top, with a unified LLVM/MLIR lowering path so AI frameworks target them automatically.
Tips and Pitfalls
- Pick LMUL deliberately. m8 processes the most elements per instruction but leaves only four register groups (32 registers / 8); m1 leaves room for other live vectors. Profile both.
- Mind the tail. VLA loops handle remainder elements naturally; no scalar cleanup loop is needed. That is a real advantage over fixed-width SIMD.
- Use FMA. vfmacc fuses the multiply and add into one operation with a single rounding, which helps both accuracy and throughput.
- Let the compiler try first. -O3 -march=rv64gcv auto-vectorizes a lot; reach for intrinsics only where it falls short.
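The LMUL tip comes down to register arithmetic: RVV has 32 vector registers, so an LMUL of k leaves 32/k usable groups. As a rough planning aid, the sketch below picks the largest integer LMUL that still fits a given number of simultaneously live vector values. The function name is hypothetical, and the model is deliberately naive: it ignores mask registers (v0), widening operations, and anything the compiler's register allocator might do.

```c
/* Largest integer LMUL whose 32/LMUL register groups can hold `live`
   vector values at once; returns 0 if even LMUL=1 is not enough.
   Assumption: every live value uses the same LMUL and no masks are needed. */
unsigned max_lmul_for_live_values(unsigned live) {
    for (unsigned lmul = 8; lmul >= 1; lmul /= 2) {
        if (live <= 32 / lmul)
            return lmul;
    }
    return 0;
}
```

SAXPY keeps two vector values live (x and y), so m8 fits comfortably. A kernel with five live values would need to drop to m4 under this model, and profiling remains the real arbiter.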
The Bottom Line
RVV 1.0 reframes SIMD around a simple, powerful idea: ask the hardware how wide it is, then loop. That single decision delivers binary portability across vector widths and makes RISC-V genuinely future-proof for AI and HPC. Start with the compiler, drop to intrinsics when you must, and verify in QEMU before moving to silicon.
Part of my RISC-V series. See also RISC-V Extensions Explained and RISC-V AI Accelerators.



