
The Tensor Calculus You Need for Deep Learning

Deriving the gradient for the backward pass using tensor calculus and index notation


When reading about deep learning theory, I was incredibly dissatisfied with the literature explaining how to derive gradients for backpropagation. It mostly focuses on vectors (and rarely matrices), but in deep learning we use tensors everywhere! In part, this is probably because working with tensors mathematically isn't easy. This post is a spiritual successor to the excellent article The Matrix Calculus You Need For Deep Learning, but for functions that work with tensors.

In this post, I introduce a complete and consistent framework for deriving gradients for backpropagation using tensor calculus. The first section describes tensors (the type we use in deep learning) and introduces index notation, a mathematical framework for working with them. Then we can get to tensor calculus itself and show some examples.

This article forms part of a series on differentiating and calculating gradients in deep learning. Part 1 introduced backpropagation and multivariable calculus, which sets up some of the ideas used in this article. Here, we get to the meat of the theory: tensors and tensor calculus using index notation.

Tensors

A tensor is a multi-dimensional ordered array of numbers, expanding the concept of a matrix into N dimensions. Here, and in deep learning generally, we are specifically talking about N-dimensional Cartesian tensors, which are simpler than the tensors typically discussed in physics or mathematics. Focusing on Cartesian tensors removes the need to distinguish between covariant and contravariant indices, and the usual tensor transformation laws need not apply.

A tensor \mathcal{T} has components t_{i \ldots j}, where i \ldots j means an arbitrary list of indices, including the indices i and j. The number of indices indicates the number of axes the tensor has, known as its rank or order; for example, t_{i j} is an order-2 tensor. Tensors are denoted using an uppercase calligraphic font, with the corresponding components in lowercase and with indices using lowercase Latin symbols.

An order-1 tensor is analogous to a vector, and an order-2 tensor is analogous to a matrix. You can also have an order-0 tensor, which is a scalar or single number.

Representing three-dimensional or higher objects on paper is challenging, so we need a method to flatten their structure. One approach is to utilize basis vectors. For example, a vector can be expressed as a linear combination of basis vectors:

\hat{x}=1 \hat{e}_{1}+2 \hat{e}_{2}=\binom{1}{2}

Where a basis vector is defined as:

\hat{e}_{1}=\binom{1}{0}

The basis of a matrix can be constructed by applying the tensor product (\otimes) to two basis vectors. Don't worry about the details of how the tensor product works - we won't be using it much:

\hat{e}_{1} \otimes \hat{e}_{2}=\hat{e}_{1,2}=\binom{1}{0} \otimes\binom{0}{1}=\left(\begin{array}{ll} 0 & 1 \\ 0 & 0 \end{array}\right)

Then a matrix can be expressed using its basis:

\begin{aligned} X & =1 \hat{e}_{1} \otimes \hat{e}_{1}+2 \hat{e}_{1} \otimes \hat{e}_{2}+3 \hat{e}_{2} \otimes \hat{e}_{1}+4 \hat{e}_{2} \otimes \hat{e}_{2} \\ & =\left(\begin{array}{ll} 1 & 2 \\ 3 & 4 \end{array}\right) \end{aligned}

Now, to represent an order-3 tensor, we can write it as a linear combination of its basis:

\begin{aligned} \mathcal{T}= & 1 \hat{e}_{1,1,1}+2 \hat{e}_{1,1,2}+3 \hat{e}_{1,2,1}+4 \hat{e}_{1,2,2} \\ & +5 \hat{e}_{2,1,1}+6 \hat{e}_{2,1,2}+7 \hat{e}_{2,2,1}+8 \hat{e}_{2,2,2} \\ = & \sum_{i, j, k} t_{i j k} \hat{e}_{i j k} \\ = & \hat{e}_{1} \otimes\left(\begin{array}{ll} 1 & 2 \\ 3 & 4 \end{array}\right)+\hat{e}_{2} \otimes\left(\begin{array}{ll} 5 & 6 \\ 7 & 8 \end{array}\right) \end{aligned}

As you can see, the notation gets unwieldy very quickly when the number of dimensions grows larger than 2. We also need a way of defining operations between tensors. We will, therefore, turn to index notation to provide a more convenient way to represent tensors of any order.

Index Notation

Index notation provides a convenient algebra to work with individual elements or components of a tensor. It can also be used to easily define operations between tensors, for example, a matrix multiplication can be expressed as:

c_{i k}=a_{i j} b_{j k}
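As a quick aside (my own illustration, not from the original article), NumPy's einsum mirrors this notation almost literally: the subscript string names the indices, repeated letters are summed, and the letters after `->` are the free indices:

```python
import numpy as np

# c_ik = a_ij b_jk: j appears twice (dummy index, summed over),
# while i and k appear once (free indices, giving the result's shape).
a = np.arange(6).reshape(2, 3)
b = np.arange(12).reshape(3, 4)

c = np.einsum("ij,jk->ik", a, b)
assert np.array_equal(c, a @ b)  # identical to a matrix multiplication
```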

Below, I explain the rules of index notation and how to understand the above expression.

Index notation (not to be confused with multi-index notation) is a simplified version of Einstein notation or Ricci calculus that works with Cartesian tensors.

Not all the normal algebra rules apply with index notation, so we must take care when first using it.

There are many flavours of index notation with different subtleties. When reading other texts, look at the details of the rules they use. The notation presented here has been adapted for deep learning.

Index Notation Rulebook

Tensor Indices

  • A tensor is written using an uppercase calligraphic font \mathcal{T}. Its components are written in lowercase, with its indices in subscript: t_{i j k}. You can obtain the components of a tensor using square brackets, e.g. \left[\mathcal{T}\right]_{i j k}=t_{i j k}.

  • The number of indices represents the order of the tensor. For example, t_{i j k} has three indices i, j and k, and so is an order-3 tensor.
  • The indices can use any letters, so without further context t_{i j k} and t_{a b c} refer to the same tensor and components.
  • The range of the indices is either determined by context or can be kept arbitrary.
  • Indices are labelled using lowercase Latin symbols.

Free Indices

  • Free indices describe a system of equations and imply that the expression is repeated, enumerating the index range. For example, if the index i has the range i=\{1,2,3\}, then:
a_{i}=b_{i} c_{i} \Leftrightarrow\left\{\begin{array}{l} a_{1}=b_{1} c_{1} \\ a_{2}=b_{2} c_{2} \\ a_{3}=b_{3} c_{3} \end{array}\right.
  • A free index must appear in every additive term in the equation. Most of the time you can determine the free indices by checking an equation’s left-hand side (LHS) or the subject of an equation.
  • A free index can be renamed if it is renamed in all additive terms.
  • An equation can have multiple free indices, representing the system of equations that enumerate all indices.

Dummy Indices

  • Dummy indices appear twice or more in an additive term and do not appear on the LHS or equation subject. Dummy indices imply a summation, also known as a contraction:
a=b_{i} c_{i} \Leftrightarrow a=\sum_{i} b_{i} c_{i}
  • A dummy index can take any label not being used by a free index.
  • A dummy index is “local” to each additive term, so it can be renamed in one term and not others.
  • Dummy indices and free indices can be used together. For example, below i appears on the LHS and k does not, which means that i is a free index and k is a dummy index:
a_{i}=b_{i k} c_{i k} d_{k}
  • Sometimes, it is ambiguous whether an index is a dummy or free index; on these occasions, free indices can be stated explicitly by writing "no sum". For example, below i is a free index and j is a dummy index:
a_{i j} b_{i j}=\alpha_{i j} \beta_{i j} \quad\text{(no sum } i\text{)}

Typically, in Physics, you use Einstein notation to represent tensors, which prevents dummy indices from appearing more than twice and guarantees that the resulting tensor conforms to the transformation laws. We are using Cartesian tensors, which don’t have this restriction, and in deep learning, it is common to sum over more than two tensors, so here, we allow a dummy index to appear more than twice in an expression.
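As a sketch of this in code (my own illustration), np.einsum happily contracts an index that appears more than twice, matching the relaxed rule used here. This reproduces the earlier example a_i = b_{ik} c_{ik} d_k:

```python
import numpy as np

rng = np.random.default_rng(0)
b = rng.standard_normal((3, 4))
c = rng.standard_normal((3, 4))
d = rng.standard_normal(4)

# a_i = b_ik c_ik d_k: i is free, k is a dummy index appearing
# three times -- allowed here, unlike in strict Einstein notation.
a = np.einsum("ik,ik,k->i", b, c, d)

# The same contraction spelled out with explicit sums:
a_loop = np.array(
    [sum(b[i, k] * c[i, k] * d[k] for k in range(4)) for i in range(3)]
)
assert np.allclose(a, a_loop)
```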

Kronecker Delta

  • The Kronecker delta is an order-2 tensor whose components are 1 when the indices are equal and 0 otherwise:
\delta_{i j}= \begin{cases}1 & \text{if } i=j \\ 0 & \text{otherwise}\end{cases}
  • The Kronecker delta is analogous to the identity matrix in matrix notation.
  • The tensor is symmetric: \delta_{i j}=\delta_{j i}.
  • The Kronecker delta can be used to select a specific term in a summation. For example, we can select the k'th term of tensor a in index i using a contraction (only the i=k term survives):
\delta_{k i} a_{i j}=\delta_{k 1} a_{1 j}+\delta_{k 2} a_{2 j}+\ldots=a_{k j}
  • More generally, if index i is a dummy index and k is a dummy or free index, contracting with the Kronecker delta \delta_{i k} replaces index i with index k in all terms. Consider these examples:

    • Replace i with k for multiple tensors:
    \delta_{k i} b_{i j} c_{i j}=b_{k j} c_{k j}
    • Replace i with j or j with i:
    \delta_{i j} b_{i p} c_{q j}=b_{j p} c_{q j}=b_{i p} c_{q i}
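A small NumPy sketch of the selection rule, representing the Kronecker delta with an identity matrix (my own illustration):

```python
import numpy as np

a = np.arange(12, dtype=float).reshape(3, 4)
delta = np.eye(3)  # Kronecker delta delta_ki as an identity matrix

# delta_ki a_ij = a_kj: contracting with the delta just renames i to k.
out = np.einsum("ki,ij->kj", delta, a)
assert np.allclose(out, a)
```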

One-Tensor

  • The one-tensor is a tensor with the value 1 for all its components. It can be of any order or shape:
\mathbf{1}_{i \ldots j}=1
  • The one-tensor can be used to contract components of another tensor, for example, summing all components of b:
a=\mathbf{1}_{i} b_{i}
  • The one-tensor can be used to broadcast components of another tensor, for example, broadcasting an order-1 tensor into an order-2 tensor (the values of b are repeated for all values of i):
a_{i j}=\mathbf{1}_{i} b_{j}
  • The contraction of a one-tensor with itself equals the cardinality of the index. For example, if n has a range between 1 and N, then:
a=\mathbf{1}_{n} \mathbf{1}_{n}=N

This is true for any number of one-tensors contracted with each other:

a=\mathbf{1}_{n} \mathbf{1}_{n} \ldots \mathbf{1}_{n}=N
  • The broadcast of an order-M one-tensor with an order-N one-tensor is equal to an order-(M+N) one-tensor:
a_{i j}=\mathbf{1}_{i} \mathbf{1}_{j}=\mathbf{1}_{i j}
  • A one-tensor can be dropped if a term already has a tensor with the same free or dummy index as the one-tensor, as this amounts to multiplying the term by 1:
a_{i}=\mathbf{1}_{i j} b_{i j} c_{j}=b_{i j} c_{j}
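The contraction and broadcast uses of the one-tensor can be sketched in NumPy (my own illustration, using np.ones for the one-tensor):

```python
import numpy as np

b = np.array([1.0, 2.0, 3.0])
ones = np.ones(3)

# Contraction: a = 1_i b_i sums the components of b.
assert np.einsum("i,i->", ones, b) == 6.0

# Broadcast: a_ij = 1_i b_j repeats b along a new axis i.
a = np.einsum("i,j->ij", np.ones(2), b)
assert a.shape == (2, 3)
assert np.allclose(a, np.stack([b, b]))

# Contraction of a one-tensor with itself gives the index cardinality N.
assert np.einsum("n,n->", np.ones(5), np.ones(5)) == 5.0
```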

Substitution

When substituting one equation into another…

  • You need to use the same free indices. For example, given a_{i}=b_{i} c_{i} and c_{j}=d_{j j}, it would be wrong to conclude that a_{i}=b_{i} d_{j j}. Instead, we must change the free indices so they match, giving a_{i}=b_{i} d_{i i}.
  • You need to use different dummy indices. For example, given a_{i}=b_{i j} \mathbf{1}_{j} c_{i} and c_{i}=d_{i j} e_{j}, we need to change j to a new index k in one of the equations to get a_{i}=b_{i j} \mathbf{1}_{j} d_{i k} e_{k}; otherwise, we get a_{i}=b_{i j} \mathbf{1}_{j} d_{i j} e_{j}, which is wrong.
  • You cannot substitute into a contraction. For example, given a=b_{k} c_{k} d_{k} and \beta=c_{k} d_{k}, it would be wrong to conclude a=b_{k} \beta because b_{k} is also part of the contraction.

Addition and Multiplication

  • Only tensors of the same order can be added together; for example, a_{i j}+b_{i j} is valid while a_{i j}+b_{i} is not. The exception is adding a tensor and a scalar, a_{i j}+b, where the scalar is implicitly broadcast to the appropriate order: a_{i j}+b \mathbf{1}_{i j}.
  • The usual commutative and associative properties of addition and multiplication remain true (see appendix).
  • Tensors don't distribute or factorise in the usual way. An expression can only be factorised provided that all additive terms end up with the same tensor order. For example:
\begin{aligned} a_{i} & =b_{i j} c_{i j}+b_{i j} d_{j} \\ & \neq b_{i j}\left(c_{i j}+d_{j}\right) \end{aligned}

But a one-tensor can be used to maintain the tensor-order of the terms:

\begin{aligned} a_{i} & =b_{i j} c_{i j}+b_{i j} d_{j} \\ & =b_{i j}\left(c_{i j}+\mathbf{1}_{i} d_{j}\right) \end{aligned}

Algebra of Index Notation

  • Provided an equation, it is valid to apply a function to both sides:
\begin{aligned} a_{i} & =b_{i k} c_{k} \\ f\left(a_{i}\right) & =f\left(b_{i k} c_{k}\right) \end{aligned}
  • This includes adding a tensor to both sides, for example adding \gamma_{i}:
\begin{aligned} a_{i} & =b_{i k} c_{k} \\ a_{i}+\gamma_{i} & =b_{i k} c_{k}+\gamma_{i} \end{aligned}
  • Or multiplying both sides by a scalar \lambda:
\begin{aligned} a_{i} & =b_{i k} c_{k} \\ a_{i} \lambda & =b_{i k} c_{k} \lambda \end{aligned}
  • Or multiplying both sides by a tensor with new free indices, \gamma_{j}:
\begin{aligned} a_{i} & =b_{i k} c_{k} \\ a_{i} \gamma_{j} & =b_{i k} c_{k} \gamma_{j} \end{aligned}
  • Or multiplying both sides by a tensor with an existing free index, contracting both sides, \gamma_{i}:
\begin{aligned} a_{i} & =b_{i k} c_{k} \\ a_{i} \gamma_{i} & =b_{i k} c_{k} \gamma_{i} \end{aligned}
  • It is not valid to contract an index that is already a dummy index, for example multiplying by \gamma_{k}:
\begin{gathered} a_{i}=b_{i k} c_{k} \\ a_{i} \gamma_{k} \neq b_{i k} c_{k} \gamma_{k} \end{gathered}

To show this further, consider another example where a_{k}=(2,-1)^{T}; b_{k}=(0,1)^{T}; \gamma_{k}=(0,1)^{T}:

\begin{aligned} a_{k} \mathbf{1}_{k} & =b_{k} \mathbf{1}_{k} \\ a_{k} \mathbf{1}_{k} \gamma_{k} & \neq b_{k} \mathbf{1}_{k} \gamma_{k} \\ 2 \cdot 0-1 \cdot 1 & \neq 0 \cdot 0+1 \cdot 1 \\ -1 & \neq 1 \end{aligned}
  • It is valid to use existing function identities, such as logarithmic or exponential identities, e.g. \log(ab)=\log(a)+\log(b) or (ab)^{2}=a^{2}b^{2}, but they should not change the components of a contraction. For example, if a_{i j}=b_{i} c_{j}, then a_{i j}^{2}=b_{i}^{2} c_{j}^{2} is valid; but if a_{i j}=b_{i k} c_{j k}, then a_{i j}^{2}=b_{i k}^{2} c_{j k}^{2} is not valid.

Summary

We now have a way of representing order-N tensors using a convenient notation. It also allows us to define new operations between tensors of arbitrary order, which was previously difficult to do using matrix notation without inventing new symbols.

Often, instead of referring to a_{i j} as the components of an order-2 tensor, we will simply refer to a_{i j} as the tensor. However, it should always be remembered that a_{i j} are components and not the data structure itself.

Matrix Expressions using Index Notation

Here are some common matrix operations expressed using index notation.

| Operation | Matrix notation | Index notation |
| --- | --- | --- |
| Matrix addition | A=B+C | a_{i j}=b_{i j}+c_{i j} |
| Matrix transpose | A=B^{T} | a_{i j}=b_{j i} |
| Matrix trace | \lambda=\operatorname{tr}(A) | \lambda=a_{i i} |
| Scalar-matrix multiplication | A=\lambda B | a_{i j}=\lambda b_{i j} |
| Matrix multiplication | A=B C | a_{i j}=b_{i k} c_{k j} |
| Matrix Hadamard (element-wise) multiplication | A=B \odot C | a_{i j}=b_{i j} c_{i j} |
| Vector outer product | A=\hat{b} \hat{c}^{T} | a_{i j}=b_{i} c_{j} |
| Vector dot product | \lambda=\hat{b} \cdot \hat{c} | \lambda=b_{i} c_{i} |
| Vector Euclidean norm (L2 norm) | \lambda=\|\hat{x}\| | \lambda=\sqrt{x_{i} x_{i}} |

We often want to convert back and forth between the two notations, and transposes are quite common. So a shorthand notation is often used: a_{i j}=(a^{T})_{j i}, whereby a^{T} are the components of the tensor A^{T}.
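A few of the correspondences in the table can be checked directly with np.einsum (a sketch with random matrices; my own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((3, 3))
C = rng.standard_normal((3, 3))

assert np.allclose(np.einsum("ji->ij", B), B.T)          # a_ij = b_ji
assert np.isclose(np.einsum("ii->", B), np.trace(B))     # lambda = a_ii
assert np.allclose(np.einsum("ik,kj->ij", B, C), B @ C)  # a_ij = b_ik c_kj
assert np.allclose(np.einsum("ij,ij->ij", B, C), B * C)  # Hadamard product
```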

Example: Layer Normalisation using Index Notation

PyTorch defines the layer normalization (layer norm) operation for an input matrix X, with shape batch size B by hidden size H, as:

y=\frac{x-\mathrm{E}[x]}{\sqrt{\operatorname{Var}[x]+\epsilon}} * \gamma+\beta

Where the mean \mathrm{E}[x] and variance \operatorname{Var}[x] are calculated for each sample in a batch, and \gamma and \beta are learnable vector weights with lengths equal to the hidden size. \epsilon is a constant, usually equal to 1e-05.

Let's first look at the mean \mathrm{E}[x], which is defined as:

\mathrm{E}[x]=\frac{1}{H} \sum_{i=1}^{H} x_{i}

The mean is calculated for each sample, so i ranges over all x values for a sample. With index notation, we can use the one-tensor to sum the components of x, drop the summation sign, and explicitly state that we have a value per batch by using a tensor m_{b} to store the values of the mean:

m_{b}=\frac{1}{H} \mathbf{1}_{h} x_{b h}

b is a free index which ranges from 1 to B, and h is a dummy index ranging from 1 to H. Similarly, we can apply the same idea to the variance, which is defined as:

\operatorname{Var}[x]=\frac{1}{H} \sum_{i=1}^{H}\left(x_{i}-\mathrm{E}[x]\right)^{2}

We can represent this using index notation:

v_{b}=\frac{1}{H} \mathbf{1}_{h}\left(x_{b h}-\mathbf{1}_{h} m_{b}\right)^{2}

Notice how we use the one-tensor for two different purposes here. The first sums the terms, as we used it for the mean. The second broadcasts the mean m_{b} to an order-2 tensor \mathbf{1}_{h} m_{b}, as the mean is invariant to the hidden dimension, and we want to subtract it from the order-2 tensor x_{b h}.

Now, we move to the definition of y. \gamma and \beta are learnable vector weights with length equal to the hidden size, which means they are order-1 tensors with length H. Using index notation, we can represent y as:

y_{b h}=\frac{x_{b h}-\mathbf{1}_{h} m_{b}}{\sqrt{v_{b}+\epsilon}} \gamma_{h}+\mathbf{1}_{b} \beta_{h}

We use the one-tensor twice to broadcast m_{b} and \beta_{h} to order-2 tensors.
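The three index-notation equations above can be transcribed almost symbol-for-symbol with np.einsum. This is a NumPy sketch (my own illustration, not PyTorch's implementation), checked against the direct per-row formula:

```python
import numpy as np

B, H, eps = 4, 8, 1e-5
rng = np.random.default_rng(0)
x = rng.standard_normal((B, H))
gamma = rng.standard_normal(H)
beta = rng.standard_normal(H)

# m_b = (1/H) 1_h x_bh -- contract over the hidden index h.
m = np.einsum("h,bh->b", np.ones(H), x) / H
# v_b = (1/H) 1_h (x_bh - 1_h m_b)^2 -- broadcast m with the one-tensor,
# square, then contract over h.
v = np.einsum("h,bh->b", np.ones(H), (x - np.outer(m, np.ones(H))) ** 2) / H
# y_bh = (x_bh - 1_h m_b) / sqrt(v_b + eps) * gamma_h + 1_b beta_h
y = (x - np.outer(m, np.ones(H))) / np.sqrt(v + eps)[:, None] * gamma + beta

# Cross-check against the per-row mean/variance formula.
ref = (x - x.mean(1, keepdims=True)) / np.sqrt(x.var(1, keepdims=True) + eps) * gamma + beta
assert np.allclose(y, ref)
```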

Tensor Calculus

Tensor calculus allows us to consider how to differentiate a tensor with respect to a tensor.

Provided a function \mathcal{Y}=\mathcal{F}(\mathcal{X}), which takes as input an order-M tensor \mathcal{X} with components x_{p \ldots q} and outputs an order-N tensor \mathcal{Y} with components y_{i \ldots j}, the derivative or Jacobian tensor is then of order-(M+N):

\frac{\partial \mathcal{Y}}{\partial \mathcal{X}}=\left[\frac{\partial \mathcal{Y}}{\partial \mathcal{X}}\right]_{i \ldots j p \ldots q} \hat{e}_{i \ldots j p \ldots q}=\frac{\partial y_{i \ldots j}}{\partial x_{p \ldots q}} \hat{e}_{i \ldots j p \ldots q}

We are going to focus on index notation, so the derivative of a tensor with respect to a tensor is the quantity \frac{\partial y_{i \ldots j}}{\partial x_{p \ldots q}}, and \frac{\partial}{\partial x_{p \ldots q}} is the tensor derivative operator. Below, we list the rules for applying the operator, and then we can get onto some examples.
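To make the order-(M+N) claim concrete, here is a small sketch (my own illustration) that builds a Jacobian numerically with central finite differences. The input and output are both order-2, so the Jacobian is order-4:

```python
import numpy as np

# Jacobian of Y = sin(X) for a 2x3 matrix X: X is order-2 (M=2) and
# Y is order-2 (N=2), so dY/dX is an order-4 tensor.
rng = np.random.default_rng(0)
X = rng.standard_normal((2, 3))
eps = 1e-6

jac = np.zeros((2, 3, 2, 3))  # indices (i, j, p, q) for dy_ij / dx_pq
for p in range(2):
    for q in range(3):
        Xp = X.copy(); Xp[p, q] += eps
        Xm = X.copy(); Xm[p, q] -= eps
        jac[:, :, p, q] = (np.sin(Xp) - np.sin(Xm)) / (2 * eps)

assert jac.ndim == 4  # order M + N
# Element-wise functions give a "diagonal" Jacobian: delta_ip delta_jq cos(x_ij).
assert np.isclose(jac[0, 1, 0, 1], np.cos(X[0, 1]), atol=1e-4)
assert abs(jac[0, 0, 1, 2]) < 1e-4
```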

Tensor Calculus Rulebook

Tensor Derivative Operator

  • The derivative operator should always use new free indices. For example, given the equation:
y_{i \ldots j}=x_{i \ldots j}

We pick the new free indices a \ldots b when applying the derivative operator:

\frac{\partial y_{i \ldots j}}{\partial x_{a \ldots b}}=\frac{\partial x_{i \ldots j}}{\partial x_{a \ldots b}}
  • The derivative of a tensor with respect to itself is a product of Kronecker deltas. For example, for an order-1 tensor:
\frac{\partial x_{i}}{\partial x_{p}}=\delta_{i p}

This is because when i \neq p, x_{i} is constant with respect to x_{p}. In general, for an order-N tensor:

\frac{\partial x_{i \ldots j}}{\partial x_{p \ldots q}}=\underbrace{\delta_{i p} \ldots \delta_{j q}}_{N}

Product and Quotient Rule

  • We can use the product rule to obtain the tensor derivative of a product:
\frac{\partial u_{a b} v_{c d}}{\partial x_{p q}}=\frac{\partial u_{a b}}{\partial x_{p q}} v_{c d}+u_{a b} \frac{\partial v_{c d}}{\partial x_{p q}}

This can also be applied to a contraction (b is a dummy index):

\frac{\partial u_{a b} v_{b c}}{\partial x_{p q}}=\frac{\partial u_{a b}}{\partial x_{p q}} v_{b c}+u_{a b} \frac{\partial v_{b c}}{\partial x_{p q}}

And the quotient rule for division:

\frac{\partial}{\partial x_{p q}}\left(\frac{u_{a b}}{v_{c d}}\right)=\frac{1}{v_{c d}^{2}}\left(\frac{\partial u_{a b}}{\partial x_{p q}} v_{c d}-u_{a b} \frac{\partial v_{c d}}{\partial x_{p q}}\right)

This also works for contractions (u and v can include dummy indices).

Chain Rule

  • We can also use the chain rule. For example, if a function outputs y and takes u as input, and u is itself a function of x, i.e. y_{a \ldots b}\left(u_{c \ldots d}\left(x_{i \ldots j}\right)\right), then:
\frac{\partial y_{a \ldots b}}{\partial x_{p \ldots q}}=\frac{\partial y_{a \ldots b}}{\partial u_{c \ldots d}} \frac{\partial u_{c \ldots d}}{\partial x_{p \ldots q}}

Importantly, the two derivatives are contracted, i.e. u_{c \ldots d} has the same indices in the first and second derivative terms. This mimics the summation shown in the multivariable calculus section, providing the weighted sum of all x contributions to the change in y. Also note that we apply the derivative operator with new free indices p \ldots q.

  • For the more general case of n arbitrary nested functions y_{a \ldots b}\left(u_{c \ldots d}\left(\ldots v_{e \ldots f}\left(x_{i \ldots j}\right)\right)\right), the chain rule is:
\frac{\partial y_{a \ldots b}}{\partial x_{p \ldots q}}=\frac{\partial y_{a \ldots b}}{\partial u_{c \ldots d}} \frac{\partial u_{c \ldots d}}{\partial v_{e \ldots f}} \ldots \frac{\partial v_{e \ldots f}}{\partial x_{p \ldots q}}

Importantly, the two contractions in the expression use different dummy indices and are separate contractions.

  • If a function takes multiple tensors as input, this must be considered when applying the chain rule. For example, provided with a function that takes two tensors, y_{a \ldots b}\left(u_{c \ldots d}\left(x_{i \ldots j}\right), v_{e \ldots f}\left(x_{i \ldots j}\right)\right), the chain rule would be:
\frac{\partial y_{a \ldots b}}{\partial x_{p \ldots q}}=\frac{\partial y_{a \ldots b}}{\partial u_{c \ldots d}} \frac{\partial u_{c \ldots d}}{\partial x_{p \ldots q}}+\frac{\partial y_{a \ldots b}}{\partial v_{e \ldots f}} \frac{\partial v_{e \ldots f}}{\partial x_{p \ldots q}}

Summary

We are now at the point where we can combine tensor calculus with backpropagation! We have learned how to compute derivatives and apply the chain rule to tensors of any order.

In summary, to derive the backpropagated gradients of a tensor function, the steps are:

  1. Translate the function into index notation.
  2. Calculate the derivative tensor of the function with respect to each of its inputs.
  3. Assume the outputs are a dependency of a downstream scalar function l, and that we are provided with the gradients of l with respect to each of the outputs.
  4. Use the chain rule to determine the gradients of l with respect to each of the inputs of the function.

Great! With this foundation, we’re ready to move on to some practical examples.

Example: Element-Wise Functions

An element-wise or pointwise function applies a univariable function to all tensor components. For example, we could apply the trigonometric function sine to all tensor values. Let's consider an arbitrary pointwise function point:

y_{i}=\operatorname{point}\left(x_{i}\right)

Above, x_{i} and y_{i} are order-1 tensors, but the following maths would apply to a tensor of any order N.

Firstly, to calculate the derivative of y_{i} with respect to x_{p}, we differentiate using a new free index p:

\begin{aligned} \frac{\partial y_{i}}{\partial x_{p}} & =\frac{\partial \operatorname{point}\left(x_{i}\right)}{\partial x_{p}} \\ & =\frac{\partial \operatorname{point}\left(x_{i}\right)}{\partial x_{i}} \frac{\partial x_{i}}{\partial x_{p}} \\ & =\operatorname{point}^{\prime}\left(x_{i}\right) \delta_{i p} \end{aligned}

Where \operatorname{point}^{\prime} is the derivative of the pointwise function; for example, if the pointwise function is sine, then its derivative is cosine.

Secondly, we assume y_{i} is a dependency of a downstream scalar function l and that we are provided with the gradient \partial l / \partial y_{i}. We then use the chain rule to derive the backpropagated gradient of the input, \partial l / \partial x_{p}:

\begin{aligned} \frac{\partial l}{\partial x_{p}} & =\frac{\partial l}{\partial y_{i}} \frac{\partial y_{i}}{\partial x_{p}} \\ & =\frac{\partial l}{\partial y_{i}} \operatorname{point}^{\prime}\left(x_{i}\right) \delta_{i p} \\ & =\frac{\partial l}{\partial y_{p}} \operatorname{point}^{\prime}\left(x_{p}\right) \end{aligned}
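We can sanity-check this result numerically. Below is a sketch (my own illustration) using sine as the pointwise function, an arbitrary upstream gradient standing in for \partial l / \partial y, and a finite-difference comparison:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(5)
dl_dy = rng.standard_normal(5)  # pretend upstream gradient dl/dy

# dl/dx_p = dl/dy_p * point'(x_p); for sine, point' is cosine.
dl_dx = dl_dy * np.cos(x)

# Finite-difference check, using l = dl_dy . sin(x) as a linear stand-in
# whose gradient with respect to y is exactly dl_dy.
eps = 1e-6
fd = np.zeros_like(x)
for p in range(x.size):
    xp = x.copy(); xp[p] += eps
    xm = x.copy(); xm[p] -= eps
    fd[p] = (dl_dy @ np.sin(xp) - dl_dy @ np.sin(xm)) / (2 * eps)

assert np.allclose(dl_dx, fd, atol=1e-5)
```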

Example: Matrix Multiplication

Given Y=XA, we first need to convert it to index notation:

y_{i j}=x_{i k} a_{k j}

We have two gradients to derive, as the function has two matrix inputs: X and A.

Gradient of X

Let's first obtain the derivative with respect to X, remembering to use new free indices for the derivative operator:

\begin{aligned} \frac{\partial y_{i j}}{\partial x_{p q}} & =\frac{\partial\left(x_{i k} a_{k j}\right)}{\partial x_{p q}} \\ & =\frac{\partial x_{i k}}{\partial x_{p q}} a_{k j} \\ & =\delta_{i p} \delta_{k q} a_{k j} \\ & =\delta_{i p} a_{q j} \end{aligned}

Then, we obtain the backpropagated gradient using the chain rule:

\begin{aligned} \frac{\partial l}{\partial x_{p q}} & =\frac{\partial l}{\partial y_{i j}} \frac{\partial y_{i j}}{\partial x_{p q}} \\ & =\frac{\partial l}{\partial y_{i j}} \delta_{i p} a_{q j} \\ & =\frac{\partial l}{\partial y_{p j}} a_{q j} \end{aligned}

We can then convert this back to matrix notation:

\begin{aligned} \left[\frac{\partial l}{\partial X}\right]_{p q} & =\left[\frac{\partial l}{\partial Y}\right]_{p j}\left[A^{T}\right]_{j q} \\ \frac{\partial l}{\partial X} & =\frac{\partial l}{\partial Y} A^{T} \end{aligned}

Note that we have to transpose the matrix A so that the indices of the two expressions resemble a matrix multiplication.

Gradient of A

First, obtain the derivative with respect to A:

\begin{aligned} \frac{\partial y_{i j}}{\partial a_{p q}} & =\frac{\partial x_{i k} a_{k j}}{\partial a_{p q}} \\ & =x_{i k} \frac{\partial a_{k j}}{\partial a_{p q}} \\ & =x_{i k} \delta_{k p} \delta_{j q} \\ & =x_{i p} \delta_{j q} \end{aligned}

Then, we obtain the backpropagated gradient:

\begin{aligned} \frac{\partial l}{\partial a_{p q}} & =\frac{\partial l}{\partial y_{i j}} \frac{\partial y_{i j}}{\partial a_{p q}} \\ & =\frac{\partial l}{\partial y_{i j}} x_{i p} \delta_{j q} \\ & =\frac{\partial l}{\partial y_{i q}} x_{i p} \end{aligned}

We can then convert this back to matrix notation:

\begin{aligned} \left[\frac{\partial l}{\partial A}\right]_{p q} & =\left[X^{T}\right]_{p i}\left[\frac{\partial l}{\partial Y}\right]_{i q} \\ \frac{\partial l}{\partial A} & =X^{T} \frac{\partial l}{\partial Y} \end{aligned}
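Both backpropagated gradients can be verified numerically; a sketch with random matrices (my own illustration), writing the index-notation results with np.einsum and comparing against the matrix-notation forms:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 3))
A = rng.standard_normal((3, 4))
dl_dY = rng.standard_normal((2, 4))  # upstream gradient dl/dY

# Index-notation results:
# dl/dx_pq = dl/dy_pj a_qj   and   dl/da_pq = dl/dy_iq x_ip
dl_dX = np.einsum("pj,qj->pq", dl_dY, A)
dl_dA = np.einsum("iq,ip->pq", dl_dY, X)

# The matrix-notation equivalents derived above.
assert np.allclose(dl_dX, dl_dY @ A.T)
assert np.allclose(dl_dA, X.T @ dl_dY)
```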

Next

Other examples of using tensor calculus to calculate gradients:


Appendix

Commutative and Associative Properties of Index Notation

The following usual commutative and associative properties for addition and multiplication apply to index notation:

\begin{aligned} a_{i}+b_{i} & =b_{i}+a_{i} \\ \left(a_{i}+b_{i}\right)+c_{i} & =a_{i}+\left(b_{i}+c_{i}\right) \\ a_{i} b_{j} & =b_{j} a_{i} \\ a_{i} b_{i} & =b_{i} a_{i} \\ \left(a_{i} b_{j}\right) c_{k} & =a_{i}\left(b_{j} c_{k}\right) \\ \left(a_{i} b_{j}\right) c_{j} & =a_{i}\left(b_{j} c_{j}\right) \end{aligned}
This post is copyrighted by Josh Levy-Kramer.