Fusing Epilog Operations with Matrix Multiplication using nvmath-python

Optimizing the Forward Pass with the RELU_BIAS Epilog

In this section, I demonstrate how to use epilogs to implement the forward pass of a simple linear layer. This layer first multiplies the input vectors by a weights matrix, then adds a bias vector to each column of the resulting matrix, and finally applies the ReLU activation function.

ReLU, short for Rectified Linear Unit, is a commonly used activation function that replaces negative values with zeros while leaving positive values unchanged.
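
In other words, relu(x) = max(x, 0), applied elementwise.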

In terms of matrix operations, the layer can be expressed as follows:

relu(Wx + B)

In this equation, the terms are defined as follows:

  • W represents the weights matrix
  • x represents the input vector
  • B represents the bias vector
  • relu represents the ReLU activation function

Assume that you have your inputs, weights, and bias as CuPy arrays:

import cupy

num_inputs, num_outputs = 784, 100
batch_size = 256

weights = cupy.random.rand(num_outputs, num_inputs)  # W: (num_outputs, num_inputs)
bias = cupy.random.rand(num_outputs)                 # B: (num_outputs,)
x = cupy.zeros((num_inputs, batch_size))             # x: one column per sample in the batch

In the most basic version, you can implement this linear layer by using nvmath-python for calculating Wx, and then handling bias and ReLU manually, as in the following code example.

from nvmath.linalg.advanced import Matmul

mm = Matmul(weights, x)
mm.plan()

def forward():
    y = mm.execute()            # y = W @ x
    y += bias[:, cupy.newaxis]  # add the bias to each column
    y[y < 0] = 0                # apply ReLU in place
    return y

To improve the performance of the code, take advantage of the RELU_BIAS epilog to perform all three operations in a single, fused cuBLAS operation. This epilog first adds the bias to the result of the multiplication and then applies the ReLU function.

You can specify the epilog using the `epilog` argument of the `Matmul.plan` method. Some epilogs, including RELU_BIAS, take extra inputs, which can be specified in the `epilog_inputs` dictionary. For more information about epilogs, see nvmath.linalg.advanced.Matmul.

from nvmath.linalg.advanced import MatmulEpilog

mm = Matmul(weights, x)
mm.plan(epilog=MatmulEpilog.RELU_BIAS, epilog_inputs={"bias": bias})

def forward():
    y = mm.execute()
    return y
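
As a quick sanity check, you can compare the fused result with a plain CuPy implementation of the same layer. This is a minimal sketch, assuming the `weights`, `bias`, `x`, and fused `mm` defined above:

# Reference implementation of relu(Wx + B) using plain CuPy operations.
y_fused = forward()
y_reference = cupy.maximum(weights @ x + bias[:, cupy.newaxis], 0)
assert cupy.allclose(y_fused, y_reference)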

Optimizing the Backward Pass with the DRELU_BGRAD Epilog

In backpropagation, once you know how the loss function L is affected by the intermediate result t_3, that is, once you know \frac{\partial L}{\partial t_3}, you can calculate the gradients with respect to the other parameters.

For more information about the derivations of the formulas used to compute the gradients, see Automatic Differentiation and Neural Networks.

Assuming that the gradient \frac{\partial L}{\partial t_3} is available in a CuPy array grad, the operations required to compute \frac{\partial L}{\partial B} and \frac{\partial L}{\partial t_1} can be implemented naively by using Matmul only for the matrix multiplication, and then handling the masking and the batch sum manually:

mm = Matmul(weights.T, grad)
mm.plan()

def backward():
    grad_t1 = mm.execute()                 # propagate the gradient through the weights
    grad_t1[mask] = 0                      # assuming that `mask = (t1 < 0)`
    grad_bias = cupy.sum(grad_t1, axis=1)  # sum over the batch dimension
    return grad_t1, grad_bias

To optimize your backward pass, use the DRELU_BGRAD epilog. It expects one input, `relu_aux`, containing the mask returned by the RELU_AUX_BIAS epilog, and applies this mask to the result of the multiplication. It also returns an auxiliary output with the result summed over the batch dimension (the equivalent of the cupy.sum call above), which happens to be \frac{\partial L}{\partial B}.
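
The `relu_mask` array comes from the forward pass of the layer in question. If you do not already have it, one way to obtain it is to plan that forward pass with the RELU_AUX_BIAS epilog, which behaves like RELU_BIAS but additionally returns the ReLU mask as an auxiliary output. The following is a minimal sketch with placeholder arrays `layer_weights`, `layer_bias`, and `layer_input`, assuming the mask is exposed under the "relu_aux" key of the auxiliary outputs returned by `Matmul.execute`:

# Hypothetical forward pass for the layer whose backward pass is shown below.
mm_fwd = Matmul(layer_weights, layer_input)
mm_fwd.plan(epilog=MatmulEpilog.RELU_AUX_BIAS, epilog_inputs={"bias": layer_bias})

# With an auxiliary-output epilog, execute() returns the result together
# with a dictionary of auxiliary outputs, including the ReLU mask.
layer_output, aux_outputs = mm_fwd.execute()
relu_mask = aux_outputs["relu_aux"]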

mm = Matmul(weights.T, grad)
mm.plan(epilog=MatmulEpilog.DRELU_BGRAD, epilog_inputs={"relu_aux": relu_mask})

def backward():
    grad_t1, aux_outputs = mm.execute()     # masking is applied by the epilog
    grad_bias = aux_outputs["drelu_bgrad"]  # dL/dB, returned as an auxiliary output
    return grad_t1, grad_bias

Conclusion

With the epilogs of nvmath-python, you can fuse common deep learning computations in your Python code, which can greatly improve performance. For more information, see the nvmath-python: Unleashing the Full Capabilities of NVIDIA Math Libraries within Python documentation. For an example of an end-to-end implementation of a simple neural network with nvmath-python, see the Backpropagation Jupyter notebook on GitHub.

nvmath-python is an open-source library, so feel free to visit the /NVIDIA/nvmath-python GitHub repo and reach out to us there.

Frequently Asked Questions

Q1: What is nvmath-python?

nvmath-python is an open-source Python library that provides Python programmers with access to high-performance mathematical operations from NVIDIA CUDA-X math libraries.

Q2: What is an epilog?

An epilog is an operation that can be fused with a mathematical operation being performed, like FFT or matrix multiplication. Available epilogs cover the most common deep-learning computations.

Q3: How do I use epilogs?

You can use epilogs by specifying the epilog argument of the Matmul.plan method. Some epilogs, including RELU_BIAS, take extra inputs, which can be specified in the epilog_inputs dictionary.

Q4: What is the benefit of using epilogs?

The benefit of using epilogs is that they enable you to fuse common deep-learning computations in your Python code, which can greatly improve performance.

Q5: Where can I find more information about nvmath-python?

You can find more information about nvmath-python in the nvmath-python: Unleashing the Full Capabilities of NVIDIA Math Libraries within Python documentation and the Backpropagation Jupyter notebook on GitHub.
