By Dr. Mike Clark, Research and Statistical Support Services Consultant
Researchers in the social sciences often employ multiple regression (MR) to examine how several variables predict another variable. A linear combination of the independent variables (IVs) is created that minimizes the squared errors of prediction. The squared correlation between that linear combination and the dependent variable (DV) is the proportion of variance in the DV accounted for by the predictors.
Although it is easy to think of the independent variables as a set that one believes has some relation to the dependent variable, researchers less often think of a set of dependent variables that they wish to predict. Canonical correlation analyzes the relationship between two sets of variables, with one set typically seen as the independent set and the other as the dependent set, though the causal arrow is not necessarily specified. In a sense it can be thought of as multivariate regression, though multiple regression is actually a special case of canonical correlation.
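Two of the claims above can be checked numerically. The following is a minimal sketch in Python with NumPy, not part of the SPSS workflow described here; the data are simulated and all variable names are hypothetical. It shows that the squared correlation between the fitted linear combination and the DV equals the proportion of variance accounted for, and that with a single DV the canonical correlation reduces to the multiple correlation.

```python
import numpy as np

# Simulated data (an assumption for illustration only): three IVs, one DV
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))
y = X @ np.array([0.5, -0.3, 0.2]) + rng.normal(size=n)

# Multiple regression: least-squares weights (with intercept) define
# the linear combination of the IVs
design = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)
y_hat = design @ beta

# Correlation between the linear combination and the DV
multiple_r = np.corrcoef(y_hat, y)[0, 1]

# Proportion of variance accounted for, computed from the residuals
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
variance_accounted_for = 1 - ss_res / ss_tot

# Canonical correlation when the DV "set" holds a single variable:
# the length of the projection of the centered, normalized DV onto
# the space spanned by the centered IVs
Qx, _ = np.linalg.qr(X - X.mean(axis=0))
yc = y - y.mean()
canonical_r = np.linalg.norm(Qx.T @ yc) / np.linalg.norm(yc)

print(round(multiple_r ** 2, 6), round(variance_accounted_for, 6))
print(round(canonical_r, 6), round(multiple_r, 6))
```

Each printed pair should agree, which is what it means for MR to be a special case of canonical correlation.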
To begin with, it helps to visualize what we’re about to do. The figure below gives us an idea of what is going to happen.
Just like in MR we want to create linear combinations of the set of IVs (X1-X3). However, now we have a set of DVs and will want to create a linear combination of those also (Y1-Y3). Canonical correlation analysis will create linear combinations (variates, X* and Y* above) of the two sets that will have maximum correlation with one another.
The advantage that canonical correlation has over typical MR is that it can take into account the complex nature of data: we don’t have to restrict ourselves to one DV, and it also allows for the possibility that the two sets of variables are related along more than one dimension. In other words, we may find other linear combinations of the two sets of variables whose variates have a sizable (though smaller) correlation that would also be of practical significance. In a given analysis, the number of canonical correlations provided equals the number of variables in the smaller set.
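To make the idea of several canonical correlations concrete, here is a minimal sketch in Python with NumPy. It uses a standard QR-plus-SVD approach, which is an assumption of this illustration and not necessarily how the SPSS macro computes its results; the data are simulated and the function name is hypothetical.

```python
import numpy as np

def canonical_correlations(X, Y):
    """Return the canonical correlations between the columns of X and Y."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Orthonormal bases for the spaces spanned by each centered set
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    # Singular values of Qx'Qy are the canonical correlations,
    # returned in descending order
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(s, 0.0, 1.0)

# Simulated data (illustration only): a set of 3 and a set of 2 variables
# that share one common source of variation
rng = np.random.default_rng(1)
n = 300
shared = rng.normal(size=(n, 1))
X = shared + rng.normal(size=(n, 3))
Y = shared + rng.normal(size=(n, 2))

rc = canonical_correlations(X, Y)

# Simple correlations between each X variable and each Y variable
cross = np.corrcoef(X, Y, rowvar=False)[:3, 3:]

print(len(rc))   # number of canonical correlations
print(rc)        # largest first
```

With three variables in one set and two in the other, two canonical correlations are returned, and the first is at least as large as any simple correlation between a variable in one set and a variable in the other.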
The mechanics of canonical correlation are covered in many multivariate texts (see references below for some examples). Our focus here is on its use in SPSS. To begin with, the menu system will not be able to assist us this time. The macro involved must be called via syntax; however, there isn’t much to it. Once we specify the macro to be used (it is available in the SPSS folder), we then just note which variables go with each set (one can think of set 1 as the IVs). The general format is as follows:
include file 'c:\Program Files\SPSS\Canonical correlation.sps'.
cancorr set1= varlist/
set2= varlist.
The example provided here regards the association between a set of job characteristics and measures of employee satisfaction. The raw data can be found by following the SAS example link below.
Three variables associated with job characteristics are:
task variety: degree of variety involved in tasks, expressed as a percent
feedback: degree of feedback required in job tasks, expressed as a percent
autonomy: degree of autonomy required in job tasks, expressed as a percent
Three variables associated with job satisfaction are:
career track satisfaction: employee satisfaction with career direction and the possibility of future advancement, expressed as a percent
management and supervisor satisfaction: employee satisfaction with supervisor's communication and management style, expressed as a percent
financial satisfaction: employee satisfaction with salary and other benefits, using a scale measurement from 1 to 10 (1=unsatisfied, 10=satisfied)
So our syntax will look something like:
include file 'c:\Program Files\SPSS\Canonical correlation.sps'.
cancorr set1= Variety Feedback Autonomy/
set2= Career Supervisor Financial.
Unfortunately our output in SPSS is not in the familiar neat table form but rather regular text format. As such I often paste it into MS Word to make it a little easier to move around in. So what are we looking at?
Correlation: we get the correlation coefficients for items within each set, and also the correlations among all the variables involved.
Canonical Correlation: depending on the number of variables involved, we will see two or more canonical correlations between the variates created for each set.
Significance test: Bartlett’s chi-square based on Wilks’ lambda. Note that each test does not correspond to a single canonical correlation; instead, it regards all the remaining canonical correlations at the same time, leaving out any previous larger ones. Essentially it is a test of whether the eigenvalues are greater than zero. Also be aware that, as with regular correlation coefficients, we are typically more interested in the size of the correlation than in its statistical significance. Here it looks like the first solution is both very large and statistically significant (R = .92, p = .02).
Coefficients: Standardized and raw coefficients used to create the linear combinations. The true ‘raw’ coefficients, the eigenvectors, are not provided.
Loadings: these are the structure coefficients (be sure when seeing the term ‘loading’ it is clear what coefficients are being interpreted). They are the correlations between the variables in the set and the variate created from the linear combination. Here you have regular and cross loadings (loadings regarding the other variate). Our largest loadings for this correlation for job characteristics are autonomy and feedback, and for job satisfaction are career track satisfaction and management/supervisor satisfaction. Note that financial considerations do not seem as important in the relationship of these sets of variables.
Redundancy: each set gets a pair of outputs with regard to ‘redundancy’. The proportions of variance of each set explained by its own canonical variates will add to 1. In other words, the entire canonical solution extracts all the variance seen with respect to the variables involved. Regarding the individual values (i.e. ‘Explained by its own Can. Var.’), these are adequacy coefficients, or the average squared loading of the set's variables on that particular variate (e.g. with set 1, squaring and averaging the loadings = .446). The amount explained by the opposite variate is the redundancy, which can be seen in some sense as a measure of predictive validity. However, some caution should be exercised regarding its interpretation as it has limited utility within the canonical correlation framework. Canonical correlation does not try to maximize this value, but instead the correlation among the variates. If one is more interested in redundancy, one should instead perform ‘redundancy analysis’, which searches for linear combinations of variables in one group that maximize the variance of the other group that is explained by the linear combination. Such a procedure is available in SAS and R. See the Thompson references for more on this matter.
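The pieces of output described above can be reproduced outside SPSS to see how they fit together. The following is a minimal sketch in Python with NumPy under stated assumptions: the data are simulated (not the job satisfaction data), the function names are hypothetical, the QR-plus-SVD computation is a standard approach and not necessarily the macro's, and Bartlett's approximation is taken in its usual form, chi-square = -(n - 1 - (p + q + 1)/2) ln(lambda), with (p - k)(q - k) degrees of freedom after removing the k largest correlations.

```python
import math
import numpy as np

def corr_between(A, B):
    """Correlations between each column of A and each column of B."""
    Az = (A - A.mean(axis=0)) / A.std(axis=0)
    Bz = (B - B.mean(axis=0)) / B.std(axis=0)
    return Az.T @ Bz / len(A)

def cca_summary(X, Y):
    n, p = X.shape
    q = Y.shape[1]
    Qx, _ = np.linalg.qr(X - X.mean(axis=0))
    Qy, _ = np.linalg.qr(Y - Y.mean(axis=0))
    U, rc, Vt = np.linalg.svd(Qx.T @ Qy, full_matrices=False)
    rc = np.clip(rc, 0.0, 1.0)
    x_var, y_var = Qx @ U, Qy @ Vt.T   # canonical variate scores

    # Bartlett's chi-square from Wilks' lambda: test k covers the
    # canonical correlations from k onward, all at once
    tests = []
    for k in range(len(rc)):
        wilks = float(np.prod(1 - rc[k:] ** 2))
        chi_sq = -(n - 1 - (p + q + 1) / 2) * math.log(wilks)
        tests.append({"wilks": wilks, "chi_sq": chi_sq,
                      "df": (p - k) * (q - k)})

    loadings = corr_between(X, x_var)         # set 1 vs. its own variates
    cross_loadings = corr_between(X, y_var)   # set 1 vs. set 2 variates
    adequacy = (loadings ** 2).mean(axis=0)   # explained by own variate
    redundancy = (cross_loadings ** 2).mean(axis=0)  # by opposite variate
    return {"rc": rc, "tests": tests, "loadings": loadings,
            "cross_loadings": cross_loadings,
            "adequacy": adequacy, "redundancy": redundancy}

# Simulated data (illustration only): two sets of three variables each
rng = np.random.default_rng(2)
n = 200
shared = rng.normal(size=(n, 1))
X = shared + rng.normal(size=(n, 3))
Y = shared + rng.normal(size=(n, 3))

out = cca_summary(X, Y)
print(out["rc"])
print([t["df"] for t in out["tests"]])
print(out["adequacy"], out["adequacy"].sum())
print(out["redundancy"])
```

In this sketch the adequacy coefficients for set 1 sum to 1 across the three variates, each cross loading equals the corresponding loading multiplied by the canonical correlation, and each redundancy value equals the adequacy coefficient multiplied by the squared canonical correlation. This makes clear why a large canonical correlation does not by itself guarantee a large redundancy.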
So there you have a basic introduction to canonical correlation; one can find the procedures in other packages below. The analysis is often thought of as exploratory, but if your hypotheses regard sets of continuous variables, canonical correlation may be a more suitable alternative to running a multiple regression for each DV under consideration, and so well worth utilizing.
- Proc cancorr in SAS (includes data set used above)
- Lattin, Carroll, & Green (2003). Analyzing Multivariate Data.
- Tatsuoka (1971). Multivariate Analysis.
- Thompson, B. (1984). Canonical Correlation Analysis.
- Thompson (1991). A primer on the logic and use of canonical correlation analysis.