Fun and Hackable Tensors in Rust, From Scratch

Tensors from Scratch, Part 1

May 01, 2023

Over the past year, deep learning has been a hot topic with the release of new language and generative models like ChatGPT, Llama, Stable Diffusion, and Midjourney. These models are impressive, and I recommend trying them out if you haven't already. I use GitHub CoPilot daily, and now I feel lost when it's not accessible, despite my initial hesitation towards it.

But how does deep learning work? Let's find out in the craziest way possible: by building a PyTorch from the ground up.

My imposter syndrome compels me to inform you that I have little real-world experience with deep learning or programming GPUs, or working with tensor libraries like NumPy or PyTorch. I hope to provide a useful entry point for your explorations if you're in a similar spot. The library I'm trying to build, which I've called Tensorken, is not finished at the time of this writing. It's all about the journey!

In this post, we will explore a crucial aspect of deep learning frameworks, which is executing tensor operations. Along the way, I'll provide an overview of these operations and how they function, giving you a tour of a subset of NumPy.

Prerequisites: intermediate programming experience. Some Rust knowledge is useful, especially if you want to play around with Tensorken! The concepts are orthogonal, and I won't be using advanced Rust features.

No knowledge about tensors as used in NumPy or PyTorch s is expected: we'll find out about tensors, broadcasting, shapes, dimensions, and other strange beasts as we go along.

All code for Tensorken is on GitHub. The version this post discusses is tagged v0.1, and all further links are that version.

What does it take to build a neural network?

The main three real contenders at the time of this writing are Tensorflow (Google), PyTorch (Meta), and JAX (Google). If you're wondering why Google produced two libraries, they seem to be phasing out Tensorflow for JAX, after TensorFlow lost the popularity battle with PyTorch.

While these libraries are gazillions of lines of code each, my understanding is they consist of the following main parts:

A tensor library. A tensor is a multi-dimensional array - something programmers are pretty familiar with. However, as we'll see, multi-dimensional arrays in programming languages are pretty boring: you can just index in them and get a number out. Tensor libraries have many more higher-level (and frankly bewildering) operations to slice, dice, and apply bulk operations to tensors.
One or more accelerators. While it's possible to execute tensor operations on the CPU, operating on huge sets of floating point numbers is what GPUs and specialized hardware like TPUs (Tensor Processing Units) excel at. All deep learning libraries can transparently execute tensor operations on the GPU, accelerating them greatly.
Automatic differentiation. Neural networks are trained via an optimization process called gradient descent. Efficiently calculating gradients, i.e. derivatives, is of critical importance. Every library has a component that automatically calculates the gradient of a program expressed in terms of tensor operations.
Neural network building blocks. Interestingly, this is not where the meat is in terms of implementation: once you have the three things above, you're 90% there.

This post focuses on implementing a tensor library on the CPU and laying the groundwork for extending it with accelerators and automatic differentiation. We'll become familiar with commonly used operations on tensors, and how they are implemented.

I'll introduce what a tensor is, and what it's used for in the context of neural networks. I'll give a brief tour of some commonly used tensor operations. Then I'll describe a minimal yet powerful set of fundamental tensor operations. The idea is to compose useful user-facing tensor operations from these simple, lower-level operations. I'll then move on to describe an implementation strategy for representing tensors, and I'll implement the operations on the CPU. Finally, to convince you that the minimal operations are indeed powerful, I'll show how to implement matrix multiplication and create an "eye" matrix (a matrix with ones on the diagonal and zeros elsewhere).

The design of Tensorken is greatly inspired by tinygrad. Tinygrad is much more advanced and runs real models like Stable Diffusion.

Ten-what?

One tensor library you've probably heard about is NumPy, which describes itself as "...a Python library that provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays ...". From a programmer's perspective, tensors are the most boring of data structures: an n-dimensional array of homogenous primitive type, with a size fixed at creation. They typically contain floating point numbers. The power of tensor libraries is the efficient implementation of a collection of useful tensor operations.

In the context of neural networks, tensor operations are everywhere. Neural networks don't have much to do with networks or neurons. A better name would be "tensor multiplication layers", but that's decidedly less catchy. (I was fired from the Naming Things committee ages ago).

My mental model of neural networks is as programs that take a tensor of floating point numbers, typically called X, as input. They spit out another tensor of floating point numbers Y as output. Y is produced by applying various operations to X, mostly matrix multiplications and elementwise function applications. Neural networks are organized in layers - "deep" in deep learning refers to the number of "hidden", or intermediate layers that are involved in eventually producing the Y. So we have something like X -> hidden layer -> H1 -> hidden layer -> H2 ... -> Y. Each hidden layer typically consists of a linear operation - i.e. a bunch of matrix multiplications - followed by a non-linear "activation function". The term activation again seems to refer to the idea that each output of a layer represents activating neurons, but I would say that has about as much to do with real neurons as my wedding ring with the Crown Jewels.

As an aside, a fascinating take on how this came to be the dominant architecture for AI is Sara Hooker's Hardware Lottery.

If you were completely unfamiliar with neural networks before reading this I'm fully expecting you to have all kinds of questions. I'm only hoping to convey that neural networks are short programs that mainly manipulate tensors. This allows them to run very efficiently on specialized, highly parallel hardware.

Don't just stand there, tensor, operate

Hopefully, that justifies spending some time learning how tensors work.

There are a few terms you need to be aware of. A tensor can have any number of dimensions or axes. Each axis is indexable - conventionally zero-based - and has a length. The shape of an n-dimensional tensor is an n-tuple that represents the length of each dimension. The product of the lengths of the shape is the total number of elements in the tensor - called the tensor's size.

Here are a few examples:

A tensor t with shape (37) is a one-dimensional row vector with 37 elements. You index for the first element using t[0]. This is interpreted equivalently to shape (1, 37) as a row vector or 1×37 matrix, except to get the first element you'd have to use t[0, 0].
A tensor t with shape (37, 1) is interpreted as a column vector or a 37x1 matrix with 37 elements. Its first element is t[0, 0].
A tensor t with shape (5, 5) is a square 5 by 5 matrix, which contains 25 elements.
A tensor t with shape (2, 3, 4) is a three-dimensional tensor with 24 elements. Its first element is t[0, 0, 0] and its last t[1, 2, 3].

Conventionally people think of the last two dimensions as rows and columns.

I'll illustrate further by showing some usages of Tensorken. The names and operations - and occasionally mind-boggling semantics - are similar to NumPy's. To follow along, clone the repo and check out the tour.rs example.

One difference with NumPy (but not PyTorch) is that tensors in Tensorken can live on the CPU or the GPU, and for the moment this is tracked in their type.

Further, tensors in NumPy have a dtype, short for datatype which is the type of their elements. In Tensorken, I tracked this in the static type of the tensor.

Here's the workhorse of tensor creation, new. It accepts a shape and a 1-dimensional Rust slice (if you're not familiar with Rust slices, you can think of them as read-only views on arrays):

>>> Tensor::<CpuRawTensor<f32>>::new(&[3, 2], &[0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
[ 0  1]
[ 2  3]
[ 4  5]

Due to the given shape [3,2] Tensorken interprets the slice as a 3×2 matrix and arranges the slice's contents conventionally in row-major order. In other words, it maps the indices of the slice to indices of the matrix as follows:

slice[0] -> matrix[0, 0]
slice[1] -> matrix[0, 1]
slice[2] -> matrix[1, 0]
slice[3] -> matrix[1, 1]
slice[4] -> matrix[2, 0]
slice[5] -> matrix[2, 1]

This concept is important to understand, and I'll revisit it several times throughout this post, so don't worry if it doesn't fully click yet. It may be helpful to think of the index in the tensor as "counting up" in a number system with bases given by the shape. From that perspective, indexing in a tensor with shape [10, 10, 10] is like counting in decimal up to 999, and should feel very familiar.

Tensor::<CpuRawTensor<f32>> is quite a mouthful. Tensorken has a few type aliases to make it shorter, like Cpu32. Since we'll only work with tensors on the CPU containing 32-bit floats, I've aliased that again to Tr:

type Tr = Cpu32

For people familiar with NumPy, you can think of this as the counterpart for import numpy as np.

Now that we can create tensors, let's crack on with examples.

Unary operations

Tensor operations can be broadly categorized into a handful of classes. The easiest to understand are unary operations, which take a tensor and produce a new tensor by applying a function to each element.

>>> let t = Tr::new(&[3, 2], &[0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
[ 0  1]
[ 2  3]
[ 4  5]

>>> t.exp()
[ 1         2.7182817]
[ 7.389056  20.085537]
[ 54.59815  148.41316]

>>> t.log()
[ -inf        0        ]
[ 0.69314724  1.0986124]
[ 1.3862945   1.6094381]

There is no in-place mutation in Tensorken - all operations return a new tensor.

In functional programming terms, these operations are maps: they apply a function to each element of an input tensor, returning a new tensor while preserving the input tensor's shape.

Binary operations and broadcasting

Next, a few binary operations, which take two tensors as input.

>>> let t1 = &Tr::new(&[2, 2], &[0.0, 1.0, 2.0, 3.0])
[ 0  1]
[ 2  3]

>>> let t2 = &Tr::new(&[2, 2], &[6.0, 7.0, 8.0, 9.0])
[ 6  7]
[ 8  9]

>>> t1 + t2
[ 6   8 ]
[ 10  12]

>>> t1 * t2
[ 0   7 ]
[ 16  27]

The basic binary operations are addition, subtraction, multiplication, and division. They operate element-wise - in particular t1 * t2 is not matrix multiplication. Since they operate element-wise, their shapes must match.

In functional programming terms, these operations are like map2 or zip: they apply a binary function elementwise to two tensors.

I said the input tensors' shape must match, but I didn't explain what that means. Trivially, two shapes match if they are identical. However, tensor operations support broadcasting which allows different shapes to match under certain constraints. The simplest example is adding a scalar to a row vector:

>>> let t1 = &Tr::new(&[6], &[2.0, 1.0, 4.0, 2.0, 8.0, 4.0])
[ 2  1  4  2  8  4]

>>> let s1 = &Tr::scalar(2.0)
[ 2]

>>> t1 + s1
[ 4  3  6  4  10  6]

Tensorken has done the sensible thing of adding the scalar to each element of the row vector t1. However the shape of t1 is [6], while the shape of s1 is [1]. These shapes match according to the first rule of broadcasting:

Two shapes match if:

they have the same number of dimensions, and
their dimensions are pairwise of equal length or either is of length 1.

More mechanically, a dimension of length 1 is "broadcasted" as many times as necessary to match the greater length.

Broadcasting generalizes to more than one dimension:

>>> let t1 = &Tr::new(&[3, 2], &[2.0, 1.0, 4.0, 2.0, 8.0, 4.0])
[ 2  1]
[ 4  2]
[ 8  4]

>>> let t2 = &Tr::new(&[1, 2], &[10.0, 100.0])
[ 10  100]

>>> t1 + t2 // broadcast row [ 10  100] 3 times
[ 12  101]
[ 14  102]
[ 18  104]

We can also broadcast along the last dimension:

>>> let t3 = &Tr::new(&[3, 1], &[10.0, 100.0, 1000.])
[ 10  ]
[ 100 ]
[ 1000]

>>> t1 + t3 // broadcast column 2 times
[ 12    11  ]
[ 104   102 ]
[ 1008  1004]

Pretty confusing, and we're not done yet.

We can not only add a scalar to a vector, but to tensors with any shape:

>>> let t1 = &Tr::new(&[2, 3], &[2.0, 1.0, 4.0, 2.0, 8.0, 4.0])
[ 2  1  4]
[ 2  8  4]

>>> let s1 = &Tr::scalar(2.0)
[ 2]

>>> (t1.shape(), s1.shape())
([2, 3], [1])

>>> t1 + s1
[ 4  3   6]
[ 4  10  6]

How is it adding two shapes that have different dimensions? Again, the scalar value is added to each element of the matrix, which seems sensible.

First realize that you can add any number of dimensions of length 1 anywhere in a shape, without substantially affecting the tensor. Row vectors may become column vectors or vice versa, but that's just a matter of interpretation. The number of elements and their order remains the same. So we can think of the shape [1] as [1, 1], and the latter shape does match with [2, 3] according to the first rule of broadcasting. The question remains, where do we add these dimensions of length 1? The second rule of broadcasting clarifies that:

If two shapes have a different number of dimensions, add dimensions of length 1 to the front of the shortest shape.

Let's see another example of that in action.

>>> let t1 = &Tr::new(&[3, 2], &[2.0, 1.0, 4.0, 2.0, 8.0, 4.0])
[ 2  1]
[ 4  2]
[ 8  4]

>>> let t2 = &Tr::new(&[2], &[10.0, 100.0])
[ 10  100]

>>> (t1.shape(), t2.shape())
([3, 2], [2])

>>> t1 + t2
[ 12  101]
[ 14  102]
[ 18  104]

Here we add a shape [3, 2] to shape [2]. If we align them, and apply the broadcast rules:

3 2
  2

=> Apply rule 2: add dimensions of length 1 to the front until the number of dimensions matches

3 2
1 2

=> Check rule 1: these shapes match. Broadcast row [ 10 100] 3 times.

Resulting shape:

3 2

Looks like a massive foot gun right? I agree. Even Andrej Karpathy thinks so.

Broadcasting is implemented efficiently: it does not work by copying data. I'll come back to this in a later section, rest assured for now that neither input tensor allocates memory during broadcasting.

Reduction operations

The third category is reduction operations. These take a single tensor as input, but change its shape by accumulating elements along one or more dimensions. The dimensions that are accumulated become of length 1.

Here's a simple example:

>>> let t = &Tr::new(&[4], &[0.0, 1.0, 2.0, 3.0])
[ 0  1  2  3]

>>> t.sum(&[0])
[ 6]

We can also accumulate along more than one dimension:

>>> let t = &Tr::new(&[2, 2], &[0.0, 1.0, 2.0, 3.0])
[ 0  1]
[ 2  3]

>>> t.sum(&[0, 1])
[ 6]

The foot gun comes out again when we don't accumulate all dimensions - using the same tensor t:

>>> t.sum(&[0])
[ 2  4]

>>> t.sum(&[1])
[ 1]
[ 5]

In the first sum we accumulate over dimension 0, which is rows. So we added rows together, and end up with a tensor with one row. In the second, we accumulated over dimension 1, which is columns, hence we added columns together and ended up with a tensor with one column. All dimensions that you reduce over become length 1 in the result: the first sum has shape [1, 2], and the second has shape [2, 1].

The foot guns of broadcasting and reduce explosively combine to a truly impressive BFG (Big Foot Gun. What were you thinking?). For example, say you have a tensor P of shape [26, 26] that counts how many times each letter in the alphabet is followed by another, in a particular dataset (these are called bigrams). For example, the element at index [0, 0] indicates how many times the letter 'a' was followed by another 'a'. Now we'd like to normalize this tensor, by dividing each element in each row by the sum of counts of that row. The result should give us transition probabilities per row: how likely is it that 'a' is followed by 'a', 'b', 'c',... etc. Now there are a couple of possibilities: P / P.sum(&[0]) or P / P.sum(&[1]). Note that thanks to broadcasting both of these work, but have different effects. Can you work out which one has the desired effect?

Movement operations

The fourth and last category is movement operations. These don't change the content of tensors, but allow changing their shape.

A useful operation is taking slices, to make smaller tensors from existing ones:

>>> let t = &Tr::new(&[2, 2], &[0.0, 1.0, 2.0, 3.0])
[ 0  1]
[ 2  3]

>>> t.at(1)
[ 2  3]

>>> t.at(&[1, 0])
2

These should be pretty self-explanatory. at with a single argument slices the first (leftmost) dimension at the given index out of the tensor. at with a slice argument gets a single element. A more practical tensor library would have many more options here, all of which Tensorken in principle supports, but I haven't put in the effort of exposing that on Tensor yet. For Rust-specific reasons, it's some work to get everything working with a nice slicing syntax.

Slicing does not copy the underlying data - the underlying buffer is shared.

Another useful movement operation is reshape:

>>> let t = Tr::linspace(0.0, 23.0, 24)
[ 0  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23]

>>> let t6x4 = t.reshape(&[6, 4])
[ 0   1   2   3 ]
[ 4   5   6   7 ]
[ 8   9   10  11]
[ 12  13  14  15]
[ 16  17  18  19]
[ 20  21  22  23]

>>> let t3x8 = t6x4.reshape(&[3, 8])
[ 0   1   2   3   4   5   6   7 ]
[ 8   9   10  11  12  13  14  15]
[ 16  17  18  19  20  21  22  23]

First, we've created a row vector with 24 elements using linspace, another name blatantly stolen from NumPy. linspace takes a range, the desired number of steps in that range, and returns a row vector with evenly spaced numbers in the range.

Then we reshape to a 6×4 matrix, and again to a 3×8 matrix. Neither of these reshapes copy the data, they all share a single buffer. In some cases reshape does have to copy. Unfortunately, it's not easy to explain when without going into implementation details, so I'll come back to this in the next section.

The final operation we'll consider is permute, which is a generalized matrix transposition. It applies a permutation to a tensor's dimensions. Permute re-orders dimensions but keeps the order of elements in each dimension the same. Illustrated by an example:

>>> t3x8.permute(&[1, 0])
[ 0  8   16]
[ 1  9   17]
[ 2  10  18]
[ 3  11  19]
[ 4  12  20]
[ 5  13  21]
[ 6  14  22]
[ 7  15  23]

In the original t3x8 tensor, keeping the first index constant at 0 and increasing the second index, you get elements 0, 1, 2, 3, ... 7. In the permuted tensor, you get the same elements if you keep the second index constant and increase the first index. Since permute is a re-ordering of the dimensions, it never needs to copy the underlying data.

Phew, you're about halfway through! This is a good time to take a break. The next section dives into Tensorken's internals.

mug with coffee and two cookies on brown coaster — Photo by Rumman Amin on Unsplash

It's a tensor, Jim, but not as we know it

After that whirlwind tour, let's move on to Tensorken's design. Tensorken is split into two layers:

A set of fundamental tensor operations, embodied in the RawTensor trait.
User-facing tensor operations, built out of the fundamental operations, in the Tensor type. Tensor is parametrized (via a generic type parameter) with a RawTensor implementation, and translates higher-level operations to the fundamental RawTensor operations. Tensor in addition implements all the necessary traits like Add, Neg etc so tensors work with all the usual rust operators.

The idea is that to implement a new kind of accelerator - whether it's via WebGPU, cuda, torch, XLA, or any other - we only need to implement the operations in RawTensor. The backend then becomes immediately available to use with all the high-level tensor operations in Tensor! This is quite powerful and is borne out in Tensor's tests. You can write code that's generic on RawTensor, and so works with any backend:

fn fun<'t, RT: RawTensor>(t1: &'t Tensor<RT>, t2: &'t Tensor<RT>) -> Tensor<RT> {
    let r1 = t1.exp();
    let r2 = t2.log();
    let r3 = t1 + r1;
    // ...more tensor operations
    r6 + r5 + r7
}

And then run it on the CPU or GPU:

let t_cpu = &Cpu32::linspace(1.0, 6.0, 6).reshape(shape);
let r_cpu = fun(t_cpu, t_cpu);

let t_wgpu = &Wgpu32::linspace(1.0, 6.0, 6).reshape(shape);
let r_gpu = fun(t_wgpu, t_wgpu);

assert_tensor_eq(&r_cpu, &r_gpu);

Another way to think of this is that RawTensor is a final-style embedded DSL for accelerators, and enjoys all its advantages. In particular, it's relatively easy to extend RawTensor with optimized operations on a particular backend, and then make those available on Tensor only if you select that backend. If that doesn't make sense to you, check out my previous post.

I'll leave the WebGPU-based backend for another post, but you can of course look at the code. I'll first explain the RawTensor operations and why they are there, and then go into some more implementation details.

Tensors, raw

The operations on RawTensor can be classified in the same way as the high-level operations explained above.

The trait itself is simple:

pub trait RawTensor {
    type Elem: Num;

    // ...fns...
}

It has an associated type Elem, a placeholder for whatever you can put into tensors. Currently only f32 is supported, as there's only a single implementation for Num. Num is a simple trait that groups a handful of existing traits like Add and Mul, and adds a few other operations that the Rust standard library doesn't have a trait for, like log and pow.

I'll start with the straightforward unary and binary operations:

fn exp(&self) -> Self;
fn log(&self) -> Self;

fn add(&self, other: &Self) -> Self;
fn sub(&self, other: &Self) -> Self;
// + mul, div, pow, eq

All these operations borrow their inputs and return a new RawTensor. As a result, no mutation is allowed by the rust compiler (except unsafe code or interior mutability, neither of which is used.)

The unary operations do what they say on the tin. For a tensor library that's more widely targeted than neural networks, there'd probably be more here, but this is pretty much what we need for activation functions.

The same goes for the binary operations. Note that these operations are all elementwise, and don't need to support broadcasting - they can assume both their arguments have the same shape.

Next are the reduce operations:

fn sum(&self, axes: &[usize]) -> Self;
fn max(&self, axes: &[usize]) -> Self;

Next creation and elimination operations:

fn new(shape: &[usize], data: &[Self::Elem]) -> Self;
fn shape(&self) -> &[usize];
fn ravel(&self) -> Vec<Self::Elem>;

new and shape we've already seen. Ravel (again, the name is the same as NumPy) returns a contiguous copy of the tensor's buffer as a Vec. Remember how I wrote in the previous section that it'd be important to understand how to "count up" the indices in a tensor? The symmetry is that new creates a tensor from a shape and a contiguous buffer, interpreting the order of the index in the buffer into a particular order of the indices in the tensor. ravel goes the opposite way: it creates a contiguous buffer by "counting up" indices in the tensor and placing them into the buffer. We thus have the invariant new(shape, buffer).ravel() === buffer.

Finally, movement operations:

fn reshape(&self, shape: &[usize]) -> Self;
fn permute(&self, permutation: &[usize]) -> Self;
fn expand(&self, shape: &[usize]) -> Self;
fn crop(&self, limits: &[(usize, usize)]) -> Self;
fn pad(&self, padding: &[(usize, usize)]) -> Self;

reshape and permute we've already seen in action earlier. expand is a primitive to allow broadcasting: it can replicate dimensions of length 1 to any desired length, but doesn't add any dimensions of length 1:

>>> let t = &Tr::new(&[1, 2, 2], &[0.0, 1.0, 2.0, 3.0])
+---------+
|         |
| [ 0  1] |
| [ 2  3] |
|         |
+---------+

>>> t.expand(&[5, 2, 2])
+---------------------------------------------+
|                                             |
| [ 0  1]  [ 0  1]  [ 0  1]  [ 0  1]  [ 0  1] |
| [ 2  3]  [ 2  3]  [ 2  3]  [ 2  3]  [ 2  3] |
|                                             |
+---------------------------------------------+

To implement broadcasted binary operations on the Tensor level, the process is roughly as follows:

Use reshape to add dimensions of length 1 to the front of the shortest shape.
Use expand on both shapes to increase any dimensions of length 1 to the length needed of the corresponding dimension in the other tensor.
Use the desired binary operator on the result.

Full code here.

Finally, we have crop and pad. crop is a limited slice operator: it creates a new tensor with a given sub-range of indexes in each dimension of the source tensor. crop is what underpins the implementation of at.

>>> let t = &Tr::new(&[3, 2], &[2.0, 1.0, 4.0, 2.0, 8.0, 4.0])
[ 2  1]
[ 4  2]
[ 8  4]

>>> t.crop(&[(0, 2), (1, 2)])
[ 1]
[ 2]

pad is the opposite of crop: it adds a given number of 0.0 values before and after each dimension, increasing the tensor's shape:

>>> let t = &Tr::new(&[3, 2], &[2.0, 1.0, 4.0, 2.0, 8.0, 4.0])
[ 2  1]
[ 4  2]
[ 8  4]

>>> t.pad(&[(1, 2), (1, 3)])
[ 0  0  0  0  0  0]
[ 0  2  1  0  0  0]
[ 0  4  2  0  0  0]
[ 0  8  4  0  0  0]
[ 0  0  0  0  0  0]
[ 0  0  0  0  0  0]

pad is currently unused, but it will be once I introduce reverse-mode automatic differentiation, as all the operations will then need a suitable reverse.

Strides, shapes, and shenanigans

Now we come to CpuRawTensor, an implementation of RawTensor on the CPU. This part goes quite deep into implementation details - you may want to skim it on first reading.

The most constraining aspect of the implementation is the movement operations, and in particular, a desire to minimize copying of the data buffer. The idea is to have a shared set of read-only, contiguous buffers:

struct Buffer<T> {
  data: Vec<T>,
}

The CpuRawTensor struct has a reference counted pointer to a buffer and a mysterious ShapeStrider:

pub struct CpuRawTensor<T> {
  buffer: Rc<Buffer<T>>,
  strider: ShapeStrider,
}

The buffer is reference counted so we can share it among many tensors: this allows Tensorken to do certain operations without copying (part of) the buffer.

To understand what ShapeStrider does, we must get to the essence of what tensors are. Their backing data is stored in a contiguous buffer in memory, while tensors are addressable as dynamically-shaped multi-dimensional arrays. ShapeStrider translates between these two representations, by mapping indices in the tensor to an index in the buffer. It's a more malleable and extended version of what compilers do when you access an index in an array: they find the address in memory by multiplying the index with the size of the array items (and possibly add an offset, if the array index is not zero-based).

To a first approximation, ShapeStrider is defined as:

pub(crate) struct ShapeStrider {
  shape: Vec<usize>,
  strides: Vec<usize>,
}

The shape is a vector that has as many elements as there are dimensions in the tensor, and stores the length of each dimension. The strides vector also has as many elements as there are dimensions and stores the offset in the buffer between successive elements in that dimension.

Here's an illustration for a tensor of shape (3, 3) with strides (3, 1):

As we go left to right within a row, the stride 𝚜₁ is 1. As we go top to bottom along a column, the stride 𝚜₀ is 3.

Let's be more precise. We need to define a mapping 𝚋 from an n-dimensional index (𝚝₀, 𝚝₁, ..., 𝚝ₙ₋₁) in the tensor to a single index in the buffer. Given n-dimensional shape lengths (l₀, l₁, ..., lₙ₋₁) and strides (𝚜₀ , 𝚜₁ , ..., 𝚜ₙ₋₁), that's:

𝚋(𝚝₀,𝚝₁,...,𝚝ₙ₋₁) = ∑ᵢ 𝚜ᵢ⋅𝚝ᵢ

Or in code:

fn buffer_index(&self, index: &[usize]) -> usize {
  index.iter().zip(self.strides.iter()).map(|(&t, &s)| t * s).sum::<usize>()
}

Now let's turn to the problem of how to choose strides. The basic tensor constructor takes a shape and a contiguous buffer as arguments, so what we're missing are the strides. We're interpreting the buffer in row-major order, which means that consecutive elements in the tensor's rows are consecutive in the buffer. As a result, we know that the last (rightmost) stride sₙ₋₁ = 1.

We can now figure out the rest of the strides, right to left:

sₙ₋₁ = 1 sₙ₋₂ = sₙ₋₁⋅lₙ₋₁ ... s₀ = s₁⋅l₁

If the shape is well-formed with all lengths strictly positive, then all strides are strictly positive: ∀ᵢ sᵢ > 0.

In code:

let mut strides = vec![1; shape.len()];
for i in (0..shape.len() - 1).rev() {
    strides[i] = strides[i + 1] * shape[i + 1];
}

This code is kind of kludgy - to my slight surprise Rust's standard library appears to be missing rscan (a reverse fold that returns a sequence of the intermediate results), while it does have scan, fold, and rfold. Also, I find (0..shape.len() - 1).rev() mentally taxing to read, I'd prefer a C-style loop here.

With this, it turns out we can implement reshape, expand, and permute without copying the underlying buffer (spoiler: for the most part...). pad always copies the underlying buffer to a new buffer with space for the padding, and updates the shape accordingly. It's possible to pad virtually by keeping an additional Vec of paddings in ShapeStrider, but I've not implemented that. We can now also implement all unary, binary, and reduce operations.

Crop and pad

Interestingly, we get stuck with crop - remember that crop reduces the length of dimensions by removing a given number of elements at the beginning and end of each dimension. To crop the end of a dimension, there's no issue: we just change the length for that axis by subtracting the crop amount.

A naïve way to crop at the beginning is to keep an additional Vec<usize> of cropped offsets for each of the dimensions. We don't need to do that, because this vector corresponds to a unique buffer index, and we can just calculate and store that single buffer index. The full ShapeStrider struct is then:

pub(crate) struct ShapeStrider {
    shape: Vec<usize>,
    strides: Vec<usize>,
    offset: usize,
}

And we need to update our function 𝚋 accordingly to take offset 𝚘 into account:

𝚋(𝚝₀,𝚝₁,...,𝚝ₙ₋₁) = 𝚘 + ∑ᵢ 𝚜ᵢ⋅𝚝ᵢ

And in buffer_index:

self.offset 
+ index.iter().zip(self.strides.iter()).map(|(&i, &s)| i * s).sum::<usize>()

Now, to crop we keep strides the same, and update shape and offset:

fn crop(&self, limits: &[(usize, usize)]) -> ShapeStrider {
  let offset = self.buffer_index(&limits.iter().map(|&(start, _)| start).collect::<Vec<_>>());
  let shape = limits
      .iter()
      .map(|&(start, end)| end - start)
      .collect::<Vec<_>>();
  // ... return new ShapeStrider with shape and offset
}

Note how we can use buffer_index to calculate the new offset - it will automatically take the current offset into account. The new shape is simply the difference between the given limits. I've put validation in place separately, so at this point it's safe to assume this gives sensible results.

crop's counterpart pad is relatively uninteresting - it calculates the new shape by adding before and after padding to each dimension length, allocates a new buffer of the new size, and then copies the relevant elements from the existing buffer.

Expand

An interesting development occurs when implementing expand. Recall that expand changes the shape by repeating dimensions of length 1. For example, expanding a row vector of shape (1, 3) to shape (10, 3) yields 10 rows with copies of the original row. We can do that without actually copying out data to a new buffer by allowing strides to become zero, i.e. we relax our earlier constraint to ∀ᵢ sᵢ ≥ 0. expand is then a no-copy operation: the shape is expanded to the desired size, and strides for the expanded dimensions are set to 0. Even when the original strides for dimensions of length 1 were non-zero, the only index they could have been multiplied with is 0 (otherwise the index would be out of bounds), so changing the stride to 0 does not affect the resulting buffer index - it just allows passing any index to that dimension, where the 0 stride will now absorb it.

Permute

As for permute, it simply changes the shape and the strides according to the given permutation. You may be able to convince yourself that this is correct by thinking of the two-dimensional case - transposing a matrix means "flipping" the order of the indices. We can leave the offset alone since it was calculated by the addition of strides times indexes, which is commutative.

Here's an illustration of a 3×3 matrix that can share the same buffer with the earlier illustration, but the tensor is transposed thanks to the clever use of strides.

Reshape

The final movement operation is reshape, which is subtle and has a surprising interplay with permute. First, let's introduce the concept of a contiguous tensor. You can guess what a contiguous buffer is: a buffer whose elements are laid out contiguously in memory, i.e. all elements are next to each other. A contiguous tensor is similar, in that the tensor elements as enumerated in increasing row-major index order (0, ..., 0, 0), (0, ..., 0, 1), (0, ..., 0, 2), (0, ..., 1, 0), and so on, are stored in the buffer contiguously.

Why does this matter? Because it turns out that, if a tensor is contiguous, we can reshape it without copying, and the resulting tensor is also contiguous. That's because reshape is unable to alter the order or the total number of elements - it can only split or join dimensions. Remember that the product of the shape's lengths is the total number of elements, and reshape cannot change this product. From a different perspective, the shape of a tensor is a factorization of the total number of elements. So, reshape either:

Splits ("factors") a dimension into two or more dimensions, e.g. changing a shape (..., 6, ...) to (..., 3, 2, ...); or
Joins ("multiplies") two or more dimensions into one, e.g. changing a shape (..., 3, 6, ...) to (..., 18, ...).

If you accept that, then it should be understandable why reshaping a contiguous tensor doesn't need a copy: neither joining nor splitting dimensions affects the order of the elements. So we "simply" need to update the shape and strides to effect the reshape. Simply is in scare quotes there because the actual algorithm is relatively subtle - we're only given the new expected shape, and we have to figure out appropriate joins and splits from the old and new shape.

At this point, it might be worth independently trying to think of the one movement operation that can make a tensor non-contiguous. Your options are reshape, permute, pad, crop, or expand.

(spoiler alert)

It is permute. This should become obvious if you think of a transposed matrix. We know that calling permute on a tensor re-orders the shape and strides. For example, if we start with a tensor of shape (6, 2), then its strides are (2, 1). As we increase the last (column) index, we increase the buffer index by 1. As we increase the first (row) index, we increase the buffer index by 2, because each row contains 2 elements. This tensor is contiguous because the mapping from tensor to buffer index is nice and clean:

(0, 0) -> 0
(0, 1) -> 1
(1, 0) -> 2
(1, 1) -> 3
(2, 0) -> 4
...
(5, 1) -> 11

Now let's transpose the matrix by calling permute(&[1, 0]). The resulting tensor has shape (2, 6) and strides (1, 2). Let's check if this is contiguous:

(0, 0) -> 0
(0, 1) -> 2
(0, 2) -> 4
(0, 3) -> 6
(0, 4) -> 8
(0, 5) -> 10
(1, 0) -> 1
...
(1, 5) -> 11

It's not! Now let's reshape this transposed tensor to shape (2, 2, 3), and try to work out what the strides would be. Reshaping does not change the order, so we can duplicate the second column of our schema, and update the tensor indexes:

(0, 0, 0) -> 0 <--
(0, 0, 1) -> 2
(0, 0, 2) -> 4
(0, 1, 0) -> 6 <--
(0, 1, 1) -> 8
(0, 1, 2) -> 10
(1, 0, 0) -> 1 <--
...
(1, 1, 2) -> 11

It appears that for the last stride, we can choose 2 - so far so good. But for the second to last stride, indicated with the left arrows, there seems to be a problem: there isn't a single stride that gets us through the sequence 0, 6, 1, ... Hence, in this case, reshape first copies the original tensor to a contiguous tensor, and then does a no-copy reshape.

Unary, binary, and reduce

Unary and binary operations are pretty straightforward. They always allocate a new buffer for the result. Reduce operations are more involved - the best way to think about reduce is that it's the opposite of expand. expand takes dimensions of length 1 and replicates them to dimensions of any length, reduce operations max and sum collapse some or all dimensions to length 1 by applying the appropriate accumulation function. The details are tricky but don't reveal any additional insights.

Phew. That was quite the ride.

Eying matrix multiplication

Two examples of how to build interesting operations out of the basic operations.

Creating an eye matrix

An eye matrix of a given dimension d is a d×d matrix with 1s on the diagonal and zeros elsewhere. You can follow along with this via the eye.rs example.

>>> let dim = 3
>>> &Tr::eye(dim)
[ 1  0  0]
[ 0  1  0]
[ 0  0  1]

Here's how it is created - this approach is copied from tinygrad. eye could also be easily created using new, but this is more fun:

>>> let t = &Tr::scalar(1.0)
[ 1]

>>> let t = t.pad(&[(0, dim)])
[ 1  0  0  0]

>>> let t = t.reshape(&[1, dim + 1])
[ 1  0  0  0]

>>> let t = t.expand(&[dim, dim + 1])
[ 1  0  0  0]
[ 1  0  0  0]
[ 1  0  0  0]

>>> let t = t.reshape(&[dim * (dim + 1)])
[ 1  0  0  0  1  0  0  0  1  0  0  0]

>>> let t = t.crop(&[(0, dim * dim)])
[ 1  0  0  0  1  0  0  0  1]

>>> let t = t.reshape(&[dim, dim])
[ 1  0  0]
[ 0  1  0]
[ 0  0  1]

Matrix multiplication

You may have been surprised that matrix multiplication is not part of the basic operations in RawTensor. For efficiency, it probably should be, but again inspired by tinygrad, Tensorken favors small over fast: we don't have to implement matrix multiplication separately for each accelerator.

As a secondary reason, showing how to create matrix multiplication out of the existing elementwise operations produces some interesting insights.

Here's the high-level interface to matrix multiplication, via matmul. Follow along via the matmul.rs example

>>> let l = Tr::linspace(0.0, 11.0, 12).reshape(&[3, 4])
[ 0  1  2   3 ]
[ 4  5  6   7 ]
[ 8  9  10  11]

>>> let r = Tr::linspace(12.0, 23.0, 12).reshape(&[4, 3])
[ 12  13  14]
[ 15  16  17]
[ 18  19  20]
[ 21  22  23]

>>> &l.matmul(&r)
[ 114  120  126]
[ 378  400  422]
[ 642  680  718]

Buckle up, we're going to use the combined firepower of reshape, permute, and broadcasting. Hopefully, aiming away from our feet.

The idea is to implement matrix multiplication using broadcasted elementwise multiplication, and then sum the appropriate dimensions.

First, let's massage the left matrix's shape:

>>> let s = l.shape()
[3, 4]
>>> let l_shape = [&s[..s.ndims() - 1], &[1, s[s.ndims() - 1]]].concat()
[3, 1, 4]
>>> let l = l.reshape(&l_shape)
+-----------------------------------------------+
|                                               |
| [ 0  1  2  3]  [ 4  5  6  7]  [ 8  9  10  11] |
|                                               |
+-----------------------------------------------+

Then, the right matrix:

>>> let s = r.shape()
[4, 3]
>>> let r_shape = [&s[..s.ndims() - 2], &[1], &s[s.ndims() - 2..]].concat()
[1, 4, 3]
>>> let r =
r.reshape(&r_shape).transpose(r_shape.ndims() - 1, r_shape.ndims() - 2)
+-------------------+
|                   |
| [ 12  15  18  21] |
| [ 13  16  19  22] |
| [ 14  17  20  23] |
|                   |
+-------------------+

We now have shapes as follows:

l's shape: [3, 1, 4]
r's shape: [1, 3, 4] (note the transpose after the reshape to [1, 4, 3])

These two shapes match per the broadcast rules: they have the same number of dimensions, and where the lengths differ one of them is 1.

As if by magic, if we elementwise broadcast-multiply l and r, we're multiplying the correct elements:

>>> let prod = &l * &r
+--------------------------------------------------------------+
|                                                              |
| [ 0  15  36  63]  [ 48  75  108  147]  [ 96   135  180  231] |
| [ 0  16  38  66]  [ 52  80  114  154]  [ 104  144  190  242] |
| [ 0  17  40  69]  [ 56  85  120  161]  [ 112  153  200  253] |
|                                                              |
+--------------------------------------------------------------+

This tensor has shape [3, 3, 4]. We're multiplying a 3×4 matrix with a 4×3 matrix, so we're expecting a 3×3 matrix. We sum over the last dimension to get us closer to that shape:

>>> let sum = prod.sum(&[prod.shape().ndims() - 1])
+------------------------+
|                        |
| [ 114]  [ 378]  [ 642] |
| [ 120]  [ 400]  [ 680] |
| [ 126]  [ 422]  [ 718] |
|                        |
+------------------------+
>>> let s = sum.shape()
[3, 3, 1]

This matrix has the right numbers! But its shape [3, 3, 1] is still slightly wrong. That part at least is easy - we "squeeze" out the last dimension:

>>> sum.reshape(&s[..s.ndims() - 1])
[ 114  120  126]
[ 378  400  422]
[ 642  680  718]

Mind blown yet?

Improvements

I've hopefully qualified my explanations enough that you by now understand that Tensorken has many possible improvements - without even mentioning extending it to run on GPU or adding automatic differentiation.

I'll list a few I have thought of, but the main idea of Tensorken is that it's a testbed for your ideas. Clone it and hack away!

Functionality

One RawTensor operation I haven't implemented is strides which allows setting the strides directly. In addition, strides can be negative numbers as well. Both of these are needed to implement operations like flip that re-order axes back-to-front. Also, it allows slicing using a step, e.g. slicing every second element out of a row vector. I don't think this comes with any deep changes, but it is probably somewhat fiddly. For example, if a stride is negative you'd have to complexify the index to buffer mapping further by adding offsets for the inverted axes.

Ergonomics

Tensorken now panics whenever something is wrong, for example, if you try to add two tensors with incompatible shapes. I initially took this approach because it seemed simpler (fine fine, it was laziness). But it makes Tensorken pretty hard to use in an interactive setting, like a Jupyter notebook, because every panic takes out the kernel and you lose all state. It's worth trying to return Result instead from tensor operations that may fail.

Secondly, the slicing API is limiting. On the one hand, there are some issues with using Rust's built-in indexer, because the trait to implement for that returns a reference to the result, while all tensor operations return a new struct by value. Additionally and annoyingly, quite a few movement operations on tensors are relatively nice to express in Python's more expressive slicing syntax. Especially the lack of a "reverse index" like Python's -1 to indicate the last element of a list, is quite limiting. Rust's ndarry has a purpose-built macro for slicing, which I think is overkill for Tensorken. Perhaps there is a nice compromise to be found.

Efficiency

As the previous section makes clear, Tensorken's high-level API is stitched together from the low-level RawTensor operations. Ideally, we should be able to do this with reckless abandon, performance-wise. This is currently certainly not the case: for example, in matmul there is at least one buffer allocation too many (the allocation of the intermediate elementwise multiplication, which is immediately reduced).

In a better world, Tensorken would fuse operations in a single compute kernel, allocating a single new buffer (if necessary) for the result. Fusing is even more important for GPU operations, as it avoids allocations as well as the launch of multiple compute kernels. GPUs are heavily optimized for bandwidth, so it's extra important to have chunky data and compute.

This could be done by writing a virtualized RawTensor implementation that lazily executes operations: we only need to compute the result if someone calls ravel or to_cpu. Once we have a graph of operations, we can normalize and fuse it however we wish, at least in principle. For example, if we can figure out that the result of a unary operation is used only by a subsequent unary operation, we could only allocate a single buffer and fuse the two operations.

This is a good chunk of work to be sure, but perhaps not as insurmountable as may seem.

One key idea is that it's straightforward to track shapes and strides through all operations, without executing them. When we know the expected shape and strides of the result, we can then work backward to find the elements from the input tensors to combine. We can work backward because the tensor to buffer index function 𝚋 is invertible - i.e. given a buffer index 𝚎, we can figure out which tensor index or indices map to it:

𝚋⁻¹(e) = (..., ((𝚎-𝚘) ÷ sᵢ) % 𝚕ᵢ, ... )

Recall 𝚘 is the offset, and sᵢ and 𝚕ᵢ are the stride and length for dimension 𝚒 respectively.

In the implementation, it'd be nice to do this all in the final style, somewhat similar to how we pushed down negations in the interpreters from my last post.

You made it

My sincere congratulations for making it this far.

In future - hopefully shorter! - posts, I plan to give an overview of the GPU support via WebGPU and more precisely wgpu, which is Mozilla's implementation of the emerging WebGPU standard. I also want to add automatic differentiation to Tensorken - I have an advanced prototype in a branch, but it's not ready for prime time.

I hopefully made clear I'm not an expert in any of this, but I've made a significant effort to learn how existing libraries work. If you know better, please let me know any comments, suggestions, or improvements!

References

tinygrad: a wonderful, small deep learning framework in Python. Currently the main inspiration and guide for Tensorken. The fundamental operations in RawTensor are tinygrad's, as well as the separation in unary, binary, reduce, and movement operations.
micrograd: an even tinier implementation of a toy neural net library by Andrej Karpathy. Only works on scalars on the CPU, but regardless very readable and hackable.
Neural Networks: Zero to Hero: A course by Andrej Karpathy on building neural networks, from scratch, in code. Not "from scratch" enough for me apparently, because for the real work he uses PyTorch. I've begun reproducing this course with Tensorken, and it drives features for the moment.
An Illustrated Guide to Shapes and Strides: a detailed, in-depth, and beautifully illustrated look at shapes and strides in NumPy arrays.
MiniTorch: Yet another small neural network library implementation, in Python, but this one focuses on the internals so is truly from scratch. The lesson notes have a decent overview of tensors and broadcasting. I didn't look at the code for this much, but the explanations were useful.
Online NumPy playground from w3schools: for quickly running some numpy functions.
NumPy: Tensors in Python.
ndarray: Tensors in Rust.

Harshad

Nov 24, 2023Edited

Beautiful, amazing post. It cleared out lot of things for me. There are very few resources which are putting things out core ideas in plain words (like `just set strides to 0 for broadcasting` etc) like this. Kudos!

Any idea how Tinygrad is achieving fusing of operations? I mean I can understand superficially that you'll do some kind of tracing ala compilers and figure out when ops can be fused. But is there a similar resource which explains how fusing is actually done? Thanks!

2 replies by Kurt and others

2 more comments...

Get Code

Discussion about this post

Ready for more?