H44134794, 1. SAS Code for calculating a covariance matrix, 2. Variance vs Eigenvalue and Eigenvector

Last post 09-21-2008 8:16 AM by pwestfal. 1 replies.
  • 09-19-2008 11:48 PM

    H44134794, 1. SAS Code for calculating a covariance matrix, 2. Variance vs Eigenvalue and Eigenvector

    1. (Specific Q.) SAS Code for calculating a covariance matrix
    I am trying to understand the following code for drawing a Q-Q plot. At the 9th line, Xd is defined with X – (one / n)` * X. I think this calculation is strange because the matrix rank is not equal for the first two matrixes. For instance, X is a 55x7 matrix, and (one/n) is 55x55 with 1/55 in each element. The transpose of (one/n) is also 55x55 with 1/55 in each element. Thus, the matrix calculation is (55x7) – (55x55) * (55x7). However, in the (55x7) – (55x55) subtraction, what happens to the other 55x48 elements? The 55x7 part holds the difference between X and 1/55, but the remaining 55x48 part holds -1/55.

    X – (one/n)` * X
    (55x7) – (55x55) * (55x7)

    In addition, I want to know the meaning of Xd. From your comment “mean-centered data matrix”, I expect the deviation of each element from the mean. However, I get different values. For example, take X to be a (7x3) matrix.

    X =
    1 3 6
    2 2 1
    3 4 4
    4 6 2
    5 2 9
    6 5 4
    7 9 5

    (one/n)` =
    1/7    1/7    1/7    1/7    1/7    1/7    1/7
    1/7    1/7    1/7    1/7    1/7    1/7    1/7
    1/7    1/7    1/7    1/7    1/7    1/7    1/7
    1/7    1/7    1/7    1/7    1/7    1/7    1/7
    1/7    1/7    1/7    1/7    1/7    1/7    1/7
    1/7    1/7    1/7    1/7    1/7    1/7    1/7
    1/7    1/7    1/7    1/7    1/7    1/7    1/7

    X – (one/n) =
    0.86  2.86  5.86  -0.14  -0.14  -0.14  -0.14
    1.86  1.86  0.86  -0.14  -0.14  -0.14  -0.14
    2.86  3.86  3.86  -0.14  -0.14  -0.14  -0.14
    3.86  5.86  1.86  -0.14  -0.14  -0.14  -0.14
    4.86  1.86  8.86  -0.14  -0.14  -0.14  -0.14
    5.86  4.86  3.86  -0.14  -0.14  -0.14  -0.14
    6.86  8.86  4.86  -0.14  -0.14  -0.14  -0.14

    X – (one/n) * X =
    21.00  28.57  28.57
    5.00  9.57  13.57
    19.00  28.57  33.57
    18.00  27.57  33.57
    32.00  50.57  63.57
    24.00  39.57  52.57
    36.00  54.57  66.57

    However, I expect the following deviation values.
    -3 -1.428571429 1.571428571
    -2 -2.428571429 -3.428571429
    -1 -0.428571429 -0.428571429
    0 1.571428571 -2.428571429
    1 -2.428571429 4.571428571
    2 0.571428571 -0.428571429
    3 4.571428571 0.571428571

    Also, I have seen two kinds of SAS code for calculating a covariance matrix. One is this direct method, which calculates the covariance using Xd and S. The other is the indirect method, which borrows the covariance matrix from the output of “proc corr”. Why do you calculate the covariance matrix directly rather than borrow the result from “proc corr” in this example?

    /* sas program for generating data for chi-square q-q plots */
    1  %let inputdata = isqs6348.t1_7;  /* this line must be edited */
    2  %let varlist   = m100 m200 m400 m800 m1500 m3000 marathon ;  /* this line must be edited */
    3  proc iml;
    4    use &inputdata; 
    5    read all var { &varlist } into X;
    6    n = nrow(X);
    7    p = ncol(X);
    8    One = J(n,n,1);           /* just a n x n square matrix full of 1s (nxn)*/
    9    Xd = X - (One / n)` * X;  /* mean-centered data matrix (nxp)*/
    10   S = (1 / (n-1)) * Xd`*Xd; /* covariance matrix  (pxp) */
    11   Sinv = inv(S);
    12   chisq = j(n,1,0);
    13     do i = 1 to n;
    14     chisq[i] = Xd[i,] * Sinv * Xd[i,]`;  /* squared distance from obs i to the mean */
    15     end;
    16   probs = (rank(chisq) - j(n,1,.5))/n;   /* contains (r-.5)/n  values */
    17   quants = 2*gaminv(probs, p/2);      /* contains chi-square quantiles */
    18   plotdata = quants||chisq;
    19   create chisqqdata(rename=(col1=chiquant col2=distsq)) from plotdata;  
    20   append from plotdata;
    21  quit;


    2. (General Q.) Variance vs Eigenvalue and Eigenvector
    In the last class, we studied that variance plays a crucial role in the two-sample test. For example, in the univariate two-sample test, we suppose that group 1 and group 2 have the same variance, sigma. Also, in the multivariate two-sample test, we suppose that two groups have the same number of variables and the same variance, capital sigma. However, I think that this kind of test has a problem. For example, consider a univariate case and the following graphs, where (a) has mean = 5 and variance = 0.4 and (b) has mean = 5 and variance = 0.4, so the two groups have the same mean and variance and differ only in direction.


    Although the two samples have different shapes, the null hypothesis cannot be rejected because the two-sample test only considers the difference between the means of the two groups. I think the result will be the same for the multivariate two-sample test. In contrast, the eigenvalues and eigenvectors of a covariance matrix show the length and direction of the matrix. So, is there any way or test to compare eigenvalues between two groups, and to compare eigenvectors between two groups? Or is there any test that uses eigenvalues and eigenvectors? I think that a test using eigenvalues and eigenvectors would be more robust than a test using variance only.

     

  • 09-21-2008 8:16 AM

    Re: H44134794, 1. SAS Code for calculating a covariance matrix, 2. Variance vs Eigenvalue and Eigenvector

    Anonymous:

    1. (Specific Q.) SAS Code for calculating a covariance matrix

    You said,

    " At the 9th line, Xd is defined with X – (one / n)` * X. I think this calculation is strange because the matrix rank is not equal for the first two matrixes."

    Corrections:  1.  "matrices"  not "matrixes".  2.  "dimension" not "rank".  

    Operator precedence works the same way in ordinary math as it does for matrices.  For example, if I write

    100-6*2

    I mean

    100-12 = 88.

    This is because operator precedence implies that the "multiplication" operation precedes the "subtraction" operation.  We understand that

    100-6*2 does not mean 

    94*2=188.

    It works the same for matrices.

    A-B*C

    means

    A-(B*C)

    not

    (A-B)*C.

    so

    X – (one/n)` * X
    (55x7) – (55x55) * (55x7) 

    is equal to

    X – { (one/n)` * X }

    which is a difference of two 55x7 matrices.

    I think that answers your other question as well.
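
As a check, here is a minimal PROC IML sketch that applies line 9 of the program to the 7x3 example matrix from the question. The name Means is introduced only for this illustration; splitting the statement in two just makes the order of operations visible. The product (One / n)` * X is formed first, and every row of it holds the three column means (4, 31/7, 31/7), so the subtraction is between two 7x3 matrices.

```sas
proc iml;
  /* the 7x3 example matrix from the question */
  X = {1 3 6,
       2 2 1,
       3 4 4,
       4 6 2,
       5 2 9,
       6 5 4,
       7 9 5};
  n     = nrow(X);
  One   = j(n, n, 1);        /* n x n matrix of 1s */
  Means = (One / n)` * X;    /* product first: (7x7) * (7x3) gives 7x3 */
  Xd    = X - Means;         /* (7x3) - (7x3), same as X - (One / n)` * X */
  print Means, Xd;
quit;
```

The printed Xd should match the deviation values the question expected, starting with the row -3, -1.43, 1.57.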

    As for "why use PROC CORR to calculate the cov" versus "why not calculate it directly in PROC IML?", there is no particular reason to prefer one.  Understand them both! They just show different modes of operation.  PROC IML shows you the formula more clearly, and it helps you to understand what PROC CORR is doing "behind the scenes."

    Also, recall the discussion of missing values from class.  The hand calculation in IML gives the same result as PROC CORR when there are no missing values, but not otherwise.
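
To see the two routes side by side, the following sketch computes the covariance matrix both ways on a small data set with no missing values. The data set name example and the variable names x1, x2, x3 are hypothetical, chosen only for this illustration; the values are the 7x3 matrix from the question.

```sas
/* hypothetical data set holding the 7x3 example matrix */
data example;
  input x1 x2 x3;
  datalines;
1 3 6
2 2 1
3 4 4
4 6 2
5 2 9
6 5 4
7 9 5
;
run;

/* route 1: let PROC CORR print the covariance matrix */
proc corr data=example cov nocorr nosimple;
  var x1 x2 x3;
run;

/* route 2: compute it directly from the formula in PROC IML */
proc iml;
  use example;
  read all var {x1 x2 x3} into X;
  close example;
  n  = nrow(X);
  Xd = X - (j(n, n, 1) / n) * X;     /* mean-centered data (n x p) */
  S  = (1 / (n-1)) * Xd` * Xd;       /* covariance matrix (p x p) */
  print S;
quit;
```

With complete data the two matrices should agree. When values are missing, PROC CORR by default uses, for each pair of variables, all observations where both are nonmissing (the NOMISS option switches it to complete cases only), which is one reason the results can differ.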

    100 100 100 100 100  

     

    2. (General Q.) Variance vs Eigenvalue and Eigenvector

    You said, "the same variance, capital sigma. "    Correction:  "the same covariance matrix, capital sigma."

    Your example is a little confused.  Your graphs show two samples of bivariate data, yet you say "univariate".

    Anyway, yes, it is possible that the assumption of equal covariance is violated.  You can indeed compare the cov matrices through their eigenstructures, or directly through comparisons of correlations.  (Your picture suggests to me that we might want to compare the correlation between the two variables for the two different groups.  In group 1 the correlation is negative and in group 2 it is positive.)  This is all pretty easy to do, but it is likely that we won't discuss it.  If you have a need to do this for some project, just ask, and I can show you how.
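
As a rough illustration of comparing eigenstructures, the sketch below uses two made-up 2x2 covariance matrices (hypothetical values, not taken from any course data) with equal variances but correlations of opposite sign, similar to the pictures described. It is only a descriptive comparison under those assumptions, not a formal test.

```sas
proc iml;
  /* hypothetical covariance matrices for two groups */
  S1 = {1.0 -0.8,
        -0.8 1.0};                  /* group 1: negative correlation */
  S2 = {1.0 0.8,
        0.8 1.0};                   /* group 2: positive correlation */
  call eigen(vals1, vecs1, S1);     /* eigenvalues and eigenvectors of S1 */
  call eigen(vals2, vecs2, S2);     /* eigenvalues and eigenvectors of S2 */
  print vals1 vals2, vecs1 vecs2;
quit;
```

Both matrices have eigenvalues 1.8 and 0.2, so the "lengths" agree, but the leading eigenvector points along (1, -1) for group 1 and along (1, 1) for group 2 (up to sign). For a formal test of equal covariance matrices, one existing option in SAS is PROC DISCRIM with POOL=TEST.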

    Interesting discussion and nice pictures, but an external context is missing here.  What external material are you tying this to?   What variables are you thinking of when you draw those pictures?  What kind of study?  Make it more concrete. 

    90 90 70 100

    Professor