Computation of the Moore-Penrose inverse with CUDA / C ++


The Moore-Penrose inverse matrix or pseudo-inverse matrix is defined by:

It especially used in the normal equation to determine the coefficients of a linear regression.

In this context, it is possible to calculate the results in different ways by using an optimizer algorithm that approaches by iteration  the optimal result or by using the normal equation, which gives us the result directly.

so far, it was rarely used, because  computation was time-consuming with the methods of sequential computation resulting from the architecture of the CPU.

On the other hand, it is particularly suitable and easy to optimize for parallel computing on GPUs. Indeed the Moore-Penrose Matrix is composed of a number of independent matrix operations, such as (matrix multiplication, transpose, ect …) which are particularly optimizable on GPU.

To understand intuitively the interest of using the GPU to do this type of calculation. Here is an example of how a simple sum of two vectors is performed with a CPU in a sequential manner, and with a GPU (parallel).


GPU parallel programming follows the SIMT (Single Instruction Multiple Thread) model. This means that thread groups execute the same instructions at the same time.


Compute Unified Device Architecture (CUDA) is a C / C ++ extension that allows you to code directly to a graphics card and take advantage of its parallelization capabilities. Released in 2007, the language has evolved a lot. In this post, the code is executed with the version of CUDA 8.0.

First, we will code the Moore-Penrose matrix in a fairly classical way in C ++, before optimizing the code in CUDA.
In order to simplify the readability of the code, we will content ourselves initially with a small square matrix. Also in order to facilitate the passage in CUDA, it is necessary to linearize the matrices. Let us take the example of a matrix M(m * n) usually transcribed in the practical form Array [I] [J] in C ++, it will be transformed into a 1D array of equivalent dimension Array [I * n + J].

  • Moore–Penrose inverse with C++
#include <iostream>
#include <cstdlib>
#include <time.h>
using namespace std;

// C++ program to find Moore-Penrose inverse  matrix

#define N 7

void Trans_2D_1D(float matrix_2D[N][N], float *matrix) {

    for (int i = 0; i < N; i++){
        for (int j = 0; j < N; j++){
            matrix[i * N + j] = matrix_2D[i][j];



void Transpose(float *matrix, float *t_matrix) {

	for (int i = 0; i<N; i++) {
		for (int j = 0; j<N; j++) {
			 t_matrix[j * N + i]=matrix[i * N + j];




void MatrixMult(float *matrix_1, float *matrix_2, float *matrix_product) {
	int k;
	for (int i = 0; i<N; i++) {
		for (int j = 0; j<N; j++) {             // not j<M
			matrix_product[i * N + j] = 0;
			for (k = 0; k<N; k++) {
				matrix_product[i * N + j] += matrix_1[i * N + k] * matrix_2[k * N + j];


// Function to get cofactor 
void getCofactor(float *A, float *temp, int p, int q, int n)
	int i = 0, j = 0;

	// Looping for each element of the matrix
	for (int row = 0; row < n; row++)
		for (int col = 0; col < n; col++)
			// Copying into temporary matrix only those element
			// which are not in given row and column
			if (row != p && col != q)
				temp[i * N + j++] = A[row * N + col];

				// Row is filled, so increase row index and
				// reset col index
				if (j == n - 1)
					j = 0;

// Recursive function for finding determinant of matrix.
int determinant(float *A, int n)
	int D = 0; // Initialize result

	// Base case : if matrix contains single element
	if (n == 1)
		return A[0];

	float temp[N*N]; // To store cofactors

	int sign = 1; // To store sign multiplier

	// Iterate for each element of first row
	for (int f = 0; f < n; f++)
		// Getting Cofactor of A[0][f]
		getCofactor(A, temp, 0, f, n);
		D += sign * A[0 * N + f] * determinant(temp, n - 1);

		// terms are to be added with alternate sign
		sign = -sign;

	return D;

// Function to get adjoint
void adjoint(float *A,float *adj)
	if (N == 1)
		adj[0] = 1;

	// temp is used to store cofactors 
	int sign = 1;
	float temp[N*N];

	for (int i=0; i<N; i++)
		for (int j=0; j<N; j++)
			// Get cofactor
			getCofactor(A, temp, i, j, N);

			// sign of adj positive if sum of row
			// and column indexes is even.
			sign = ((i+j)%2==0)? 1: -1;

			// Interchanging rows and columns to get the
			// transpose of the cofactor matrix
			adj[j * N + i] = (sign)*(determinant(temp, N-1));

// Function to calculate and store inverse, returns false if
// matrix is singular
bool inverse(float *A, float *inverse)
	// Find determinant of A[][]
	int det = determinant(A, N); 
	if (det == 0)
		cout << "Singular matrix, can't find its inverse";
		return false;

	// Find adjoint
	float adj[N*N];
	adjoint(A, adj);

	// Find Inverse using formula "inverse(A) = adj(A)/det(A)"
	for (int i=0; i<N; i++)
		for (int j=0; j<N; j++)
			inverse[i * N + j]= adj[i * N + j]/float(det);

	return true;

// Generic function to display the matrix. We use it to display
// both adjoin and inverse. adjoin is integer matrix and inverse
// is a float.
template<class T>
void display(T *A)
	for (int i=0; i<N; i++)
		for (int j=0; j<N; j++)
			cout << A[i * N + j] << " ";
		cout << endl;

// Driver program
int main()
	float A[N][N] = { {5, -2, 2, 7,9,8,0},
                          {1, 0, 0, 3,1,0,9},
		          {-3, 1, 5, 0,2,1,7},
		          {3, -1, -9, 4,6,5,2},
			  {1, 0, 4, 4,1,0,9},
		          {8, 0, 3, 8,6,5,2},
		          {5, 6, 4, 1,3,2,0}

    float *matrix = new float[N*N];
	float *t_matrix= new float[N*N];
	float *matrix_mult= new float[N*N];
	float *pseudoinverse= new float[N*N];
	float *adj= new float[N*N]; // To store adjoint 
	float *inv= new float[N*N]; // To store inverse 

	cout << "\nThe Transpose is :\n";

    cout << "The product of the matrix is: " << endl;

	cout << "\nThe Inverse is :\n";
	if (inverse(matrix_mult, inv))

	cout << "\nThe Monroe-penrose inverse is :\n";

	return 0;

Memory management
Graphics cards have their own memory. it is therefore necessary to explicitly organize the data transfers between the RAM and the GPU. These transfers have been greatly simplified since CUDA 6, using ‘Unified Memory’ a simple declaration allows to reference the same pointer and consequently to have access to the data both from the GPU and the CPU.

Example of a pointer declaration in CUDA:

float *x, 
cudaMallocManaged(&x, N*sizeof(float));

While this is convenient, it is good practice  to develop using cudaMallocManaged and to optimize memory management in a finer way by using cudamalloc / cudaMemcpy

Finally,to call the kernel

Myfunction<<<NUMBER OF BLOCKS, NUMBER OF THREAD>>>(var1, var2 , var…);

To use all the cores of the GPU it is necessary to use the blocks and the threads simultaneously. The values to apply depend on the problem to be solved but also on the material used. According to the cuda compute capability of your GPU the max number of threads per block is 512 or 1024.

In our case two functions are quite convenient to optimize matrix transpose and matrix multiplication.

Optimization of the transpose matrix function.

void Transpose(float *matrix, float *t_matrix) {

int i = blockIdx.x*blockDim.x + threadIdx.x; //index row
int j =  blockIdx.y * blockDim.y  + threadIdx.y; //index col
int stride = gridDim.x *blockDim.x;

while(i<N && j<N ){
		while(j<N && i<N){ // loop in case memory is not enough to do it one shot
			t_matrix[i*N +j] = matrix[j*N+ i];
	j=blockIdx.y*blockDim.y + threadIdx.y;

#check if transpose has been performed correctly and return % of error
 void CheckErrorTranspose(float *matrix, float *t_matrix,int PerToCheck){

int Nb_test = N*N * PerToCheck/ 100;
int index_x;
int index_y;
int cmpt=0;

	for (int i = 0;  i< Nb_test;  i++){
		index_x=(float)(rand()%N); #check random index
		if ((t_matrix[index_x*N +index_y] != matrix[index_y*N+ index_x]) || t_matrix[index_x*N +index_y]==0 ){
	if (cmpt==0){
			printf("Estimated Error %d%%\n",0);
		printf("Estimated error %f%\n",(float)cmpt/Nb_test*100);


Stride usage can be especially useful when the size of the array is larger than the size of the grid

  • Benchmark performance
Matrix size100 * 1001000 *100010.000 * 10.000
CUDA/C++3 µs 90 µs41 ms
C++70 µs 4,2 ms840 ms
R180 ms400 ms10,5 s

*average after running ten time

As can be seen here the performances are far better by using CUDA. This benchmark does not take into account, however, transfer times between the motherboard and the GPU. This time must of course be taken into account to measure the relevance of its use.

Optimization of the matrix multiplication function

void Matrix_Mul(float *matrix_1, float *matrix_2, float *matrix_m) {
int i = blockIdx.y * blockDim.y + threadIdx.y;
int j = blockIdx.x * blockDim.x + threadIdx.x;
if(i > N || j > N) return;
	for (int k = 0; k < N; ++e){
		matrix_m[i * N + j] =+ (matrix_1[i * N + k]) * (matrix_2[k * N + j]);



This technology opens interesting perspectives, allowing to open or reopen chapters so far closed for computation time performance issues. The optimization presented here is nevertheless relatively basic but quite simple to implement. There is other way, more effective to optimize this type of code, but requires to introduce other notion. Just like there is a very efficient library that implements most linear algebra operations such as cuBLAS or most recently CUTLASS



[1] CUDA Toolkit Documentation [link]

[2] COURRIEU, Pierre. Fast computation of Moore-Penrose inverse matrices. arXiv preprint arXiv:0804.4809, 2008. [link] 

[3] RYOO, Shane, RODRIGUES, Christopher I., BAGHSORKHI, Sara S., et al. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA. In : Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming. ACM, 2008. p. 73-82.”>[link] 


Leave a Reply