The MoorePenrose inverse matrix or pseudoinverse matrix is defined by:
It especially used in the normal equation to determine the coefficients of a linear regression.
In this context, it is possible to calculate the results in different ways by using an optimizer algorithm that approaches by iteration the optimal result or by using the normal equation, which gives us the result directly.
so far, it was rarely used, because computation was timeconsuming with the methods of sequential computation resulting from the architecture of the CPU.
On the other hand, it is particularly suitable and easy to optimize for parallel computing on GPUs. Indeed the MoorePenrose Matrix is composed of a number of independent matrix operations, such as (matrix multiplication, transpose, ect …) which are particularly optimizable on GPU.
To understand intuitively the interest of using the GPU to do this type of calculation. Here is an example of how a simple sum of two vectors is performed with a CPU in a sequential manner, and with a GPU (parallel).
Compute Unified Device Architecture (CUDA) is a C / C ++ extension that allows you to code directly to a graphics card and take advantage of its parallelization capabilities. Released in 2007, the language has evolved a lot. In this post, the code is executed with the version of CUDA 8.0.
First, we will code the MoorePenrose matrix in a fairly classical way in C ++, before optimizing the code in CUDA.
In order to simplify the readability of the code, we will content ourselves initially with a small square matrix. Also in order to facilitate the passage in CUDA, it is necessary to linearize the matrices. Let us take the example of a matrix M(m * n) usually transcribed in the practical form Array [I] [J] in C ++, it will be transformed into a 1D array of equivalent dimension Array [I * n + J].

Moore–Penrose inverse with C++
#include <iostream> #include <cstdlib> #include <time.h> using namespace std; // C++ program to find MoorePenrose inverse matrix #define N 7 void Trans_2D_1D(float matrix_2D[N][N], float *matrix) { for (int i = 0; i < N; i++){ for (int j = 0; j < N; j++){ matrix[i * N + j] = matrix_2D[i][j]; } cout<<endl; } return; } void Transpose(float *matrix, float *t_matrix) { for (int i = 0; i<N; i++) { for (int j = 0; j<N; j++) { t_matrix[j * N + i]=matrix[i * N + j]; } cout<<endl; } return; } void MatrixMult(float *matrix_1, float *matrix_2, float *matrix_product) { int k; for (int i = 0; i<N; i++) { for (int j = 0; j<N; j++) { // not j<M matrix_product[i * N + j] = 0; for (k = 0; k<N; k++) { matrix_product[i * N + j] += matrix_1[i * N + k] * matrix_2[k * N + j]; } } } return; } // Function to get cofactor void getCofactor(float *A, float *temp, int p, int q, int n) { int i = 0, j = 0; // Looping for each element of the matrix for (int row = 0; row < n; row++) { for (int col = 0; col < n; col++) { // Copying into temporary matrix only those element // which are not in given row and column if (row != p && col != q) { temp[i * N + j++] = A[row * N + col]; // Row is filled, so increase row index and // reset col index if (j == n  1) { j = 0; i++; } } } } } // Recursive function for finding determinant of matrix. int determinant(float *A, int n) { int D = 0; // Initialize result // Base case : if matrix contains single element if (n == 1) return A[0]; float temp[N*N]; // To store cofactors int sign = 1; // To store sign multiplier // Iterate for each element of first row for (int f = 0; f < n; f++) { // Getting Cofactor of A[0][f] getCofactor(A, temp, 0, f, n); D += sign * A[0 * N + f] * determinant(temp, n  1); // terms are to be added with alternate sign sign = sign; } return D; } // Function to get adjoint void adjoint(float *A,float *adj) { if (N == 1) { adj[0] = 1; return; } // temp is used to store cofactors int sign = 1; float temp[N*N]; for (int i=0; i<N; i++) { for (int j=0; j<N; j++) { // Get cofactor getCofactor(A, temp, i, j, N); // sign of adj positive if sum of row // and column indexes is even. sign = ((i+j)%2==0)? 1: 1; // Interchanging rows and columns to get the // transpose of the cofactor matrix adj[j * N + i] = (sign)*(determinant(temp, N1)); } } } // Function to calculate and store inverse, returns false if // matrix is singular bool inverse(float *A, float *inverse) { // Find determinant of A[][] int det = determinant(A, N); if (det == 0) { cout << "Singular matrix, can't find its inverse"; return false; } // Find adjoint float adj[N*N]; adjoint(A, adj); // Find Inverse using formula "inverse(A) = adj(A)/det(A)" for (int i=0; i<N; i++) for (int j=0; j<N; j++) inverse[i * N + j]= adj[i * N + j]/float(det); return true; } // Generic function to display the matrix. We use it to display // both adjoin and inverse. adjoin is integer matrix and inverse // is a float. template<class T> void display(T *A) { for (int i=0; i<N; i++) { for (int j=0; j<N; j++) cout << A[i * N + j] << " "; cout << endl; } } // Driver program int main() { float A[N][N] = { {5, 2, 2, 7,9,8,0}, {1, 0, 0, 3,1,0,9}, {3, 1, 5, 0,2,1,7}, {3, 1, 9, 4,6,5,2}, {1, 0, 4, 4,1,0,9}, {8, 0, 3, 8,6,5,2}, {5, 6, 4, 1,3,2,0} }; float *matrix = new float[N*N]; float *t_matrix= new float[N*N]; float *matrix_mult= new float[N*N]; float *pseudoinverse= new float[N*N]; float *adj= new float[N*N]; // To store adjoint float *inv= new float[N*N]; // To store inverse Transpose(matrix,t_matrix); cout << "\nThe Transpose is :\n"; display(t_matrix); cout << "The product of the matrix is: " << endl; MatrixMult(t_matrix,matrix,matrix_product); display(matrix_product); cout << "\nThe Inverse is :\n"; if (inverse(matrix_mult, inv)) display(inv); MatrixMult(inv,t_matrix,pseudoinverse); cout << "\nThe Monroepenrose inverse is :\n"; display(pseudoinverse); return 0; }
Memory management
Graphics cards have their own memory. it is therefore necessary to explicitly organize the data transfers between the RAM and the GPU. These transfers have been greatly simplified since CUDA 6, using ‘Unified Memory’ a simple declaration allows to reference the same pointer and consequently to have access to the data both from the GPU and the CPU.
Example of a pointer declaration in CUDA:
float *x, cudaMallocManaged(&x, N*sizeof(float));
While this is convenient, it is good practice to develop using cudaMallocManaged and to optimize memory management in a finer way by using cudamalloc / cudaMemcpy
Finally,to call the kernel
Myfunction<<<NUMBER OF BLOCKS, NUMBER OF THREAD>>>(var1, var2 , var…);
To use all the cores of the GPU it is necessary to use the blocks and the threads simultaneously. The values to apply depend on the problem to be solved but also on the material used. According to the cuda compute capability of your GPU the max number of threads per block is 512 or 1024.
In our case two functions are quite convenient to optimize matrix transpose and matrix multiplication.
Optimization of the transpose matrix function.
_global__ void Transpose(float *matrix, float *t_matrix) { int i = blockIdx.x*blockDim.x + threadIdx.x; //index row int j = blockIdx.y * blockDim.y + threadIdx.y; //index col int stride = gridDim.x *blockDim.x; while(i<N && j<N ){ while(j<N && i<N){ // loop in case memory is not enough to do it one shot t_matrix[i*N +j] = matrix[j*N+ i]; j+=stride; } j=blockIdx.y*blockDim.y + threadIdx.y; i+=stride; } return; } #check if transpose has been performed correctly and return % of error void CheckErrorTranspose(float *matrix, float *t_matrix,int PerToCheck){ int Nb_test = N*N * PerToCheck/ 100; int index_x; int index_y; int cmpt=0; for (int i = 0; i< Nb_test; i++){ index_x=(float)(rand()%N); #check random index index_y=(float)(rand()%N); if ((t_matrix[index_x*N +index_y] != matrix[index_y*N+ index_x])  t_matrix[index_x*N +index_y]==0 ){ cmpt++; } } if (cmpt==0){ printf("Estimated Error %d%%\n",0); }else{ printf("Estimated error %f%\n",(float)cmpt/Nb_test*100); } }
Stride usage can be especially useful when the size of the array is larger than the size of the grid
 Benchmark performance
Matrix size  100 * 100  1000 *1000  10.000 * 10.000 

CUDA/C++  3 µs  90 µs  41 ms 
C++  70 µs  4,2 ms  840 ms 
R  180 ms  400 ms  10,5 s 
*average after running ten time
As can be seen here the performances are far better by using CUDA. This benchmark does not take into account, however, transfer times between the motherboard and the GPU. This time must of course be taken into account to measure the relevance of its use.
Optimization of the matrix multiplication function
__global__ void Matrix_Mul(float *matrix_1, float *matrix_2, float *matrix_m) { int i = blockIdx.y * blockDim.y + threadIdx.y; int j = blockIdx.x * blockDim.x + threadIdx.x; if(i > N  j > N) return; for (int k = 0; k < N; ++e){ matrix_m[i * N + j] =+ (matrix_1[i * N + k]) * (matrix_2[k * N + j]); } }
Conclusion
This technology opens interesting perspectives, allowing to open or reopen chapters so far closed for computation time performance issues. The optimization presented here is nevertheless relatively basic but quite simple to implement. There is other way, more effective to optimize this type of code, but requires to introduce other notion. Just like there is a very efficient library that implements most linear algebra operations such as cuBLAS or most recently CUTLASS
References
[1] CUDA Toolkit Documentation [link]
[2] COURRIEU, Pierre. Fast computation of MoorePenrose inverse matrices. arXiv preprint arXiv:0804.4809, 2008. [link]
[3] RYOO, Shane, RODRIGUES, Christopher I., BAGHSORKHI, Sara S., et al. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA. In : Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming. ACM, 2008. p. 7382.”>[link]