Parallel SVD decomposition with OpenMP does not perform as expected

I recently wrote a parallel SVD decomposition routine based on the "one-sided Jacobi rotations" algorithm. The code works correctly but is tremendously slow. In fact it is supposed to exploit the parallelism of the inner for(int g=0;g<n;g++) loop, yet when I comment out the #pragma omp parallel for directive I can see only a slight decrease in performance. In other words there is no appreciable speed-up from going parallel (the code is run in parallel with 4 threads).

Note 1: almost all of the work is concentrated in the three following loops, which involve the relatively large matrices A and V.

for(h=0;h<N;h++) 

{

p+=A[h+N*i]*A[h+N*j];//columns dot product:Ai * Aj

qi+=A[h+N*i]*A[h+N*i];// ||Ai||^2

qj+=A[h+N*j]*A[h+N*j];// ||Aj||^2

}

double Ahi,Vhi; 

for(h=0;h<N;h++)//...rotate Ai & Aj (only columns i & j are changed)

{

Ahi=A[h+N*i];

A[h+N*i]=cs*A[h+N*i]+sn*A[h+N*j];

A[h+N*j]=-sn*Ahi+cs*A[h+N*j];

}

//store & update rotation matrix V (only columns i & j are updated)

for(h=0;h<N;h++)

{

Vhi=V[h+N*i];

V[h+N*i]=cs*V[h+N*i]+sn*V[h+N*j];

V[h+N*j]=-sn*Vhi+cs*V[h+N*j];

}

All of that parallelism should be exploited, but it is not, and I cannot understand why.

Note 2: the same happens on both Windows (Cygwin compiler) and Linux (GCC) platforms.

Note 3: the matrices are stored as column-major arrays.

So I'm hoping for some help in finding out why the parallelism is not being exploited. Am I missing something? Is there some hidden overhead in the parallel loop that I cannot see?
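For reference, this is roughly how the performance can be measured (a minimal sketch; the sweep() routine is the one posted in full below, and the matrix set-up is omitted):

#include <omp.h>

#include <cstdio>

int sweep(double* A,double* V,int N,double tol); //posted in full below

//Times the Jacobi sweeps with the OpenMP wall clock and reports the result.

void time_sweeps(double* A,double* V,int N,double tol)

{

    double t0=omp_get_wtime();            //wall-clock time before the sweeps

    int swp=0;

    while(sweep(A,V,N,tol)==0) swp++;     //iterate until convergence

    double t1=omp_get_wtime();            //wall-clock time after the sweeps

    printf("%d sweeps in %f s with %d threads\n",swp+1,t1-t0,omp_get_max_threads());

}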

Thank you very much for any suggestion.

#include <math.h>   //for ceil(), floor(), sqrt()

#include <iostream> //std::cout appears in the pragma's shared clause

int sweep(double* A,double*V,int N,double tol)

{

static int*I=new int[(int)ceil(0.5*(N-1))];

static int*J=new int[(int)ceil(0.5*(N-1))];

int ntol=0;

for(int r=0;r<N;r++) //fill in i,j indexes of parallel rotations in vectors I & J

{

int k=r+1;

if (k==N)

{

for(int i=2;i<=(int)ceil(0.5*N);i++){

I[i-2]=i-1;

J[i-2]=N+2-i-1;

}

}

else

{

for(int i=1;i<=(int)ceil(0.5*(N-k));i++)I[i-1]=i-1;

for(int i=1;i<=(int)ceil(0.5*(N-k));i++)J[i-1]=N-k+2-i-1;

if(k>2)

{

int j=(int)ceil(0.5*(N-k));

for(int i=N-k+2;i<=N-(int)floor(0.5*k);i++){

I[j]=i-1;

J[j]=2*N-k+2-i-1;

j++;

}

}

}

int n=(k%2==0)?(int)floor(0.5*(N-1)):(int)floor(0.5*N);

#pragma omp parallel for schedule(dynamic,5) reduction(+:ntol) default(none) shared(std::cout,I,J,A,V,N,n,tol)

for(int g=0;g<n;g++)

{

int i=I[g];

int j=J[g];

double p=0;

double qi=0;

double qj=0;

double cs,sn,q,c;

int h;

for(h=0;h<N;h++)

{

p+=A[h+N*i]*A[h+N*j];//columns dot product:Ai * Aj

qi+=A[h+N*i]*A[h+N*i];// ||Ai||^2

qj+=A[h+N*j]*A[h+N*j];// ||Aj||^2

}

q=qi-qj;

if(p*p/(qi*qj)<tol) ntol++; //if Ai & Aj are orthogonal enough...

else //if Ai & Aj are not orthogonal enough then... rotate them

{

c=sqrt(4*p*p+q*q);

if(q>=0){

cs=sqrt((c+q)/(2*c));

sn=p/(c*cs);

}

else{

sn=(p>=0)?sqrt((c-q)/2/c):-sqrt((c-q)/2/c);

cs=p/(c*sn);

}

//...rotate Ai & Aj (only columns i & j are changed)

double Ahi,Vhi;

for(h=0;h<N;h++)

{

Ahi=A[h+N*i];

A[h+N*i]=cs*A[h+N*i]+sn*A[h+N*j];

A[h+N*j]=-sn*Ahi+cs*A[h+N*j];

}

//store & update rotation matrix V (only columns i & j are updated)

for(h=0;h<N;h++)

{

Vhi=V[h+N*i];

V[h+N*i]=cs*V[h+N*i]+sn*V[h+N*j];

V[h+N*j]=-sn*Vhi+cs*V[h+N*j];

}

}

}

}

if(2*ntol==(N*(N-1)))return(1);//if each column of A is orthogonal enough to every other column, stop sweeping

return(0);

}

Answer:

The FLOPS for singular value decomposition (SVD) should scale as O(N^3), whereas the reads scale only as O(N^2). This means it should be possible to parallelize the algorithm and have it scale well with the number of cores.

However, your implementation of SVD is memory-bandwidth bound, which means it cannot scale well with the number of cores on a single-socket system. Your three nested loops currently each run over the entire range of N, which causes much of the data to be evicted from the cache before it can be reused.

In order to be compute bound you will have to change your algorithm so that, instead of working on long strips/dot products of size N, it uses loop tiling to maximize the n^3 computations (where n is the block size) done on data of size n^2 that fits in the cache.

Here is a paper for parallel SVD computation from a Google search, which says in the abstract:

The key point of our proposed block JRS algorithm is reusing the loaded data into the cache memory by performing computations on matrix blocks (b rows) instead of on strips of vectors as in the JRS iteration algorithms.

I used loop tiling/blocks to achieve good scaling with the number of cores for cholesky decomposition.

Note that using tiles/blocks will improve the performance of your single-threaded code as well.
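To make the idea concrete, here is a rough, generic sketch of loop tiling (cache blocking) on a plain column-major matrix product, not the block JRS algorithm itself; the block size NB is an assumed, hardware-dependent tuning parameter:

#include <algorithm>

//Generic cache-blocking sketch: C = C + A*B for NxN column-major matrices,

//i.e. M(i,j) = M[i+N*j]. Each (jb,kb,ib) tile performs ~NB^3 flops on ~3*NB^2

//doubles that stay resident in cache, instead of streaming full columns of length N.

void gemm_tiled(const double* A,const double* B,double* C,int N,int NB)

{

    for(int jb=0;jb<N;jb+=NB)

      for(int kb=0;kb<N;kb+=NB)

        for(int ib=0;ib<N;ib+=NB)

          for(int j=jb;j<std::min(jb+NB,N);j++)

            for(int k=kb;k<std::min(kb+NB,N);k++)

            {

                double bkj=B[k+N*j];                //reused across the whole i loop

                for(int i=ib;i<std::min(ib+NB,N);i++)

                    C[i+N*j]+=A[i+N*k]*bkj;         //contiguous column access

            }

}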

Answer:

Thanks to Z boson's remarks, I managed to write a much better performing parallel SVD decomposition. It runs many times faster than the original one, and I guess it could still be improved by using SIMD instructions. I post the code in case anyone should find it useful. All the tests I performed gave me correct results, but in any case there is no guarantee for its use. I'm really sorry I did not have the time to comment the code properly for a better understanding.

#include <math.h>

inline int imin(int a,int b); //defined at the bottom of this post

int sweep(double* A,double*V,int N,double tol,int M, int n)

{

/********************************************************************************************

This routine performs a parallel "sweep" of the SVD factorization algorithm for the matrix A.

It implements a single sided Jacobi rotation algorithm, described by Michael W. Berry,

Dani Mezher, Bernard Philippe, and Ahmed Sameh in "Parallel Algorithms for the Singular Value

Decomposition".

At each sweep the A matrix becomes a little more orthogonal, until each column of A is orthogonal

to each other within a given tolerance. At this point the sweep() routine returns 1 and convergence

is attained.

Arguments:

A : on input the square matrix to be orthogonalized, on exit a more "orthogonal" matrix

V : on input the accumulated rotation matrix, on exit it is updated with the current rotations

N : dimension of the matrices.

tol : tolerance for convergence of the orthogonalisation. See the explanations for the SVD() routine

M : number of blocks (it is calculated from the given block size n)

n : block size (number of columns on each block). It should be an optimal value according to the hardware available

returns : 1 if in the last sweep convergence is attained, 0 if not and one more sweep is necessary

Author : Renato Talucci 2015

*************************************************************************************************/


int ntol=0;

//STEP 1 : INTERNAL BLOCK ORTHOGONALISATION

#pragma omp parallel for reduction(+:ntol) shared(A,V,M,n,tol,N) default(none)

for(int a=0;a<M;a++)

{

for(int i=a*n;i<a*n+imin(n,N-a*n)-1;i++)

{

for(int j=i+1;j<a*n+imin(n,N-a*n);j++)

{

double p=0;

double qi=0;

double qj=0;

double cs,sn,q,c;

for(int h=0;h<N;h++)

{

p+=A[h+N*i]*A[h+N*j];//columns dot product:Ai * Aj

qi+=A[h+N*i]*A[h+N*i];// ||Ai||^2

qj+=A[h+N*j]*A[h+N*j];// ||Aj||^2

}

q=qi-qj;

if((p*p/(qi*qj)<tol)||(qi<tol)||(qj<tol))ntol++; //if Ai & Aj are orthogonal enough...

else //if Ai & Aj are not orthogonal enough then... rotate them

{

c=sqrt(4*p*p+q*q);

if(q>=0){

cs=sqrt((c+q)/(2*c));

sn=p/(c*cs);

}

else{

sn=(p>=0)?sqrt((c-q)/2/c):-sqrt((c-q)/2/c);

cs=p/(c*sn);

}

//...rotate Ai & Aj

double Ahi,Vhi;

for(int h=0;h<N;h++)

{

Ahi=A[h+N*i];

A[h+N*i]=cs*A[h+N*i]+sn*A[h+N*j];

A[h+N*j]=-sn*Ahi+cs*A[h+N*j];

}

//store & update rotation matrix V (only columns i & j are updated)

for(int h=0;h<N;h++)

{

Vhi=V[h+N*i];

V[h+N*i]=cs*V[h+N*i]+sn*V[h+N*j];

V[h+N*j]=-sn*Vhi+cs*V[h+N*j];

}

}

}

}

}

//STEP 2 : PARALLEL BLOCK MUTUAL ORTHOGONALISATION

static int*I=new int[(int)ceil(0.5*(M-1))];

static int*J=new int[(int)ceil(0.5*(M-1))];

for(int h=0;h<M;h++)

{

//fill in i,j indexes of blocks to be mutually orthogonalized at each turn

int k=h+1;

if (k==M)

{

for(int i=2;i<=(int)ceil(0.5*M);i++){

I[i-2]=i-1;

J[i-2]=M+2-i-1;

}

}

else

{

for(int i=1;i<=(int)ceil(0.5*(M-k));i++)I[i-1]=i-1;

for(int i=1;i<=(int)ceil(0.5*(M-k));i++)J[i-1]=M-k+2-i-1;

if(k>2)

{

int j=(int)ceil(0.5*(M-k));

for(int i=M-k+2;i<=M-(int)floor(0.5*k);i++){

I[j]=i-1;

J[j]=2*M-k+2-i-1;

j++;

}

}

}

int ng=(k%2==0)?(int)floor(0.5*(M-1)):(int)floor(0.5*M);

#pragma omp parallel for schedule(static,5) shared(A,V,I,J,n,tol,N,ng) reduction(+:ntol) default(none)

for(int g=0;g<ng;g++)

{

int block_i=I[g];

int block_j=J[g];

for(int i=block_i*n;i<block_i*n+imin(n,N-block_i*n);i++)

{

for(int j=block_j*n;j<block_j*n+imin(n,N-block_j*n);j++)

{

double p=0;

double qi=0;

double qj=0;

double cs,sn,q,c;

int h;

for(h=0;h<N;h++)

{

p+=A[h+N*i]*A[h+N*j];//columns dot product:Ai * Aj

qi+=A[h+N*i]*A[h+N*i];// ||Ai||^2

qj+=A[h+N*j]*A[h+N*j];// ||Aj||^2

}

q=qi-qj;

if((p*p/(qi*qj)<tol)||(qi<tol)||(qj<tol))ntol++; //if Ai & Aj are orthogonal enough...

else //if Ai & Aj are not orthogonal enough then... rotate them

{

c=sqrt(4*p*p+q*q);

if(q>=0){

cs=sqrt((c+q)/(2*c));

sn=p/(c*cs);

}

else{

sn=(p>=0)?sqrt((c-q)/2/c):-sqrt((c-q)/2/c);

cs=p/(c*sn);

}

//...rotate Ai & Aj

double Ahi,Vhi;

for(h=0;h<N;h++)

{

Ahi=A[h+N*i];

A[h+N*i]=cs*A[h+N*i]+sn*A[h+N*j];

A[h+N*j]=-sn*Ahi+cs*A[h+N*j];

}

//store & update rotation matrix V (only columns i & j are updated)

for(h=0;h<N;h++)

{

Vhi=V[h+N*i];

V[h+N*i]=cs*V[h+N*i]+sn*V[h+N*j];

V[h+N*j]=-sn*Vhi+cs*V[h+N*j];

}

}

}

}

}

}

if(2*ntol==(N*(N-1)))return(1);//if each column of A is orthogonal enough to every other column, stop sweeping

return(0);

}

int SVD(double* A,double* U,double*V,int N,double tol,double* sigma)

{

/********************************************************************************************

This routine calculates the SVD decomposition of the square matrix A [NxN]

A = U * S * V'

Arguments :

A : on input NxN square matrix to be factorized, on exit contains the

rotated matrix A*V=U*S.

V : on input an identity NxN matrix, on exit is the right orthogonal matrix

of the decomposition A = U*S*V'

U : NxN matrix, on exit is the left orthogonal matrix of the decomposition A = U*S*V'.

sigma : array of dimension N. On exit contains the singular values, i.e. the diagonal

elements of the matrix S.

N : Dimension of the A matrix.

tol : Tolerance for the convergence of the orthogonalisation of A. Given any two columns Ai and Aj

of A, convergence is attained when Ai*Aj/(|Ai|*|Aj|) < tol for each i,j=0,..,N-1 (i!=j)

The software returns the number of sweeps needed for convergence.

NOTE : ALL MATRICES ARE ASSUMED TO HAVE COLUMN MAJOR ORDERING I.E. M(i,j)=M[i+N*j]

Author: Renato Talucci 2015

*************************************************************************************************/

int n=24;//this is the dimension of block submatrices, you shall enter an optimal value for your hardware

int M=N/n+int(((N%n)!=0)?1:0);

int swp=0;//sweeps counter

int converged=0;

while(converged==0) {

converged=sweep(A,V,N,tol,M,n);

swp++;

}

#pragma omp parallel for default(none) shared(sigma,A,U,N)

for(int i=0;i<N;i++)

{

double si=0;

for(int j=0;j<N;j++) si+=A[j+N*i]*A[j+N*i];

si=sqrt(si);

for(int k=0;k<N;k++) U[k+N*i]=A[k+N*i]/si;

sigma[i]=si;

}

return(swp);

}
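For clarity, here is a minimal usage sketch of the driver above (the matrix size, the tolerance value and the way A is filled are purely illustrative):

#include <cstdlib>

//Illustrative driver: factorize a random NxN column-major matrix as A = U*S*V'.

void example()

{

    int N=512;

    double *A=new double[N*N],*U=new double[N*N],*V=new double[N*N],*sigma=new double[N];

    for(int j=0;j<N;j++)

        for(int i=0;i<N;i++)

        {

            A[i+N*j]=2.0*rand()/RAND_MAX-1.0;   //column-major fill, A(i,j)=A[i+N*j]

            V[i+N*j]=(i==j)?1.0:0.0;            //V must enter SVD() as the identity matrix

        }

    int sweeps=SVD(A,U,V,N,1e-15,sigma);        //on exit A holds A*V=U*S, sigma the singular values

    //...use U, V, sigma...

    delete[]A;delete[]U;delete[]V;delete[]sigma;

}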

Note: if some users prefer to leave A unchanged on exit, it is sufficient to compute U = A*V before entering the while loop and, instead of passing the A matrix to the sweep() routine, to pass the obtained U matrix. The U matrix then results orthonormal after sweep() has converged, and the #pragma omp directives must include U among the shared variables instead of A.
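A minimal sketch of that variant could look as follows (the name SVD_keepA is just illustrative; as in SVD(), V is assumed to enter as the identity matrix, so U = A*V is initially a plain copy of A):

//Variant that leaves A untouched: the rotations are applied to U = A*V instead of A.

int SVD_keepA(double* A,double* U,double* V,int N,double tol,double* sigma)

{

    int n=24;                              //block size, tune for your hardware

    int M=N/n+int(((N%n)!=0)?1:0);         //number of blocks

    for(int j=0;j<N;j++)

        for(int i=0;i<N;i++) U[i+N*j]=A[i+N*j];   //U = A*V = A, since V is the identity

    int swp=0;

    int converged=0;

    while(converged==0){converged=sweep(U,V,N,tol,M,n);swp++;}   //sweep U, A stays unchanged

    #pragma omp parallel for default(none) shared(sigma,U,N)

    for(int i=0;i<N;i++)

    {

        double si=0;

        for(int j=0;j<N;j++) si+=U[j+N*i]*U[j+N*i];

        si=sqrt(si);

        for(int k=0;k<N;k++) U[k+N*i]/=si;  //normalise the columns: U becomes orthonormal

        sigma[i]=si;

    }

    return swp;

}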

Note 2: if (as in my case) you have to factorize a sequence of A(k) matrices, where each A(k) can be considered a perturbation of the previous A(k-1) -- as happens, for example, with the Jacobian matrices of a multistep Newton solver -- the driver can easily be modified to update the factorization from A(k-1) to A(k) instead of computing A(k) from scratch. Here is the code:

void matmat_col_col(double* A,double* B,double* C,int N,int n); //defined at the bottom of this post

int updateSVD(double* DA,double* U,double*V,int N,double tol,double* sigma)

{

/********************************************************************************************

Given a previously factorization

A(k-1) = U(k-1) * S(k-1) * V(k-1)'

and given a perturbation DA(k) of A(k-1), i.e.

A(k) = A(k-1) + DA(k)

this routine calculates the SVD factorization of A(k), starting from the factorization of A(k-1)

Arguments:

DA : on input NxN perturbation matrix, unchanged on exit

U : on input NxN orthonormal left matrix of the previous (k-1) factorization, on exit

orthonormal left matrix of the current factorization

V : on input NxN orthonormal right matrix of the previous (k-1) factorization, on exit

orthonormal right matrix of the current factorization

N : dimension of the matrices

tol : Tolerance for the convergence of the orthogonalisation of A. Given any two columns Ai and Aj

of A, convergence is attained when Ai*Aj/(|Ai|*|Aj|) < tol for each i,j=0,..,N-1 (i!=j)

sigma : on input, array with the N singular values of the previous factorization, on exit

array with the N singular values of the current factorization

NOTE : ALL MATRICES ARE ASSUMED TO HAVE COLUMN MAJOR ORDERING I.E. M(i,j)=M[i+N*j]

Author: Renato Talucci 2015

*************************************************************************************************/

for(int i=0;i<N;i++) for(int j=0;j<N;j++) U[i+N*j]*=sigma[j];

int n=24; //this is the dimension of block submatrices, you shall enter an optimal value for your hardware

matmat_col_col(DA,V,U,N,n); //U =U(k-1)*S(k-1) + DA(k)*V(k-1) = A(k)*V(k-1)

int M=N/n+int(((N%n)!=0)?1:0); //number of blocks

int swp=0;//sweeps counter

int converged=0;

while(converged==0) {

converged=sweep(U,V,N,tol,M,n);

swp++;

}

#pragma omp parallel for default(none) shared(sigma,U,N)

for(int i=0;i<N;i++)

{

double si=0;

for(int j=0;j<N;j++) si+=U[j+N*i]*U[j+N*i];

si=sqrt(si);

for(int k=0;k<N;k++) U[k+N*i]=U[k+N*i]/si;

sigma[i]=si;

}

return(swp);

}

Finally, the routine matmat_col_col(DA,V,U,N,n) is a parallel blocked matrix-matrix product. Here is the code:

inline int imin(int a,int b) {return((a<=b)?a:b);} 

void matmat_col_col(double* A,double* B,double*C,int N,int n)

/********************************************************************************************

square matrix block product NxN :

C = C + A * B

n is the optimal block dimension

N is the dimension of the matrices

NOTE : ALL MATRICES ARE ASSUMED TO HAVE COLUMN MAJOR ORDERING M(i,j) = M[i+N*j]

Author: Renato Talucci 2015

*************************************************************************************************/

{

int M=N/n+int(((N%n)!=0)?1:0);

#pragma omp parallel for shared(M,A,B,C)

for(int a=0;a<M;a++)

{

for(int b=0;b<M;b++)

{

for(int c=0;c<M;c++)

{

for(int i=a*n;i<imin((a+1)*n,N);i++)

{

for(int j=b*n;j<imin((b+1)*n,N);j++)

{

for(int k=c*n;k<imin((c+1)*n,N);k++)

{

C[i+N*j]+=A[i+N*k]*B[k+N*j];

}

}

}

}

}

}

return;

}

I hope no typos have crept in from all the copying & pasting it took to put this together.

It would be nice if anyone could improve the code.
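One improvement hinted at above would be to help the compiler vectorise the innermost rotation loops. A minimal sketch using the OpenMP 4.0 simd directive on the rotation loop inside sweep() could look like this (only an illustration, not a tested replacement):

//Innermost rotation loop of sweep(), annotated for vectorisation.

//cs, sn, i, j, N and A are the same variables already in scope inside sweep();

//each iteration is independent, so the loop is a valid simd candidate.

#pragma omp simd

for(int h=0;h<N;h++)

{

    double Ahi=A[h+N*i];

    A[h+N*i]=cs*A[h+N*i]+sn*A[h+N*j];

    A[h+N*j]=-sn*Ahi+cs*A[h+N*j];

}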

