Different CCA methods
Exemplifies different CCA methods.
Import necessary libraries.
import pandas as pd
import numpy as np
from numpy.linalg import svd
from statsmodels.multivariate.cancorr import CanCorr
from sparsecca import cca_ipls
from sparsecca import cca_pmd
from sparsecca import multicca_pmd
from sparsecca import pmd
Load the penguins toy dataset from the seaborn data repository.
path = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv"
df = pd.read_csv(path)
df = df.dropna()
X = df[['bill_length_mm', 'bill_depth_mm']]
Z = df[['flipper_length_mm', 'body_mass_g']]
X = ((X - np.mean(X)) / np.std(X)).to_numpy()
Z = ((Z - np.mean(Z)) / np.std(Z)).to_numpy()
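As a quick sanity check, the columns should now have zero mean and unit variance:
# Means should be ~0 and standard deviations ~1 after standardization.
print(np.round(X.mean(axis=0), 6), np.round(X.std(axis=0), 6))
print(np.round(Z.mean(axis=0), 6), np.round(Z.std(axis=0), 6))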
Define a helper function that prints the weights of the first component, normalized so that the largest absolute weight is 1.
def print_weights(name, weights):
    # Scale the first component so that the largest absolute weight is 1,
    # since canonical weights are only identified up to scale and sign.
    first = weights[:, 0] / np.max(np.abs(weights[:, 0]))
    print(name + ': ' + ', '.join(['{:.3f}'.format(item) for item in first]))
First, let's try the CanCorr function from the statsmodels package.
stats_cca = CanCorr(Z, X)
print(stats_cca.corr_test().summary())
print_weights('X', stats_cca.x_cancoef)
print_weights('Z', stats_cca.y_cancoef)
Out:
Cancorr results
=====================================================================
Canonical Correlation Wilks' lambda Num DF Den DF F Value Pr > F
---------------------------------------------------------------------
0 0.7876 0.3768 4.0000 658.0000 103.4837 0.0000
1 0.0864 0.9925 1.0000 330.0000 2.4812 0.1162
---------------------------------------------------------------------
---------------------------------------------------------------------
Multivariate Statistics and F Approximations
-----------------------------------------------------------------------
Value Num DF Den DF F Value Pr > F
-----------------------------------------------------------------------
Wilks' lambda 0.3768 4.0000 658.0000 103.4837 0.0000
Pillai's trace 0.6278 4.0000 660.0000 75.4943 0.0000
Hotelling-Lawley trace 1.6416 4.0000 393.7623 134.8873 0.0000
Roy's greatest root 1.6341 2.0000 330.0000 269.6262 0.0000
=====================================================================
X: -1.000, 0.826
Z: -1.000, 0.027
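As a cross-check, the first canonical correlation from the table can be recovered by projecting the data onto the first pair of canonical coefficients; up to sign, this is the same quantity printed as the corrcoef for the methods below.
x_scores = X @ stats_cca.x_cancoef[:, 0]
z_scores = Z @ stats_cca.y_cancoef[:, 0]
# Should match the first canonical correlation above (0.7876), possibly with flipped sign.
print(np.corrcoef(x_scores, z_scores)[0, 1])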
Next, use the CCA algorithm from Witten et al.
U, V, D = cca_pmd(X, Z, penaltyx=1.0, penaltyz=1.0, K=2, standardize=False)
x_weights = U[:, 0]
z_weights = V[:, 0]
corrcoef = np.corrcoef(np.dot(x_weights, X.T), np.dot(z_weights, Z.T))[0, 1]
print("Corrcoef for comp 1: " + str(corrcoef))
print_weights('X', U)
print_weights('Z', V)
Out:
Corrcoef for comp 1: 0.7629782437780296
X: -1.000, 0.848
Z: -1.000, -0.866
As the unpenalized CCA algorithm in Witten et al. is essentially a fast way of computing the SVD of X.T @ Z, try that directly.
U, D, V = svd(X.T @ Z)
x_weights = U[:, 0]
z_weights = V[0, :]
corrcoef = np.corrcoef(np.dot(x_weights, X.T), np.dot(z_weights, Z.T))[0, 1]
print("Corrcoef for comp 1: " + str(corrcoef))
print_weights('X', U)
print_weights('Z', V.T)
Out:
Corrcoef for comp 1: 0.7629782437780295
X: -1.000, 0.848
Z: -1.000, -0.866
The novelty in Witten et al. is a matrix decomposition similar to the SVD, but one that allows adding convex penalties (here, the lasso). Applying it to X.T @ Z without penalties gives the same result as above.
U, V, D = pmd(X.T @ Z, K=2, penaltyu=1.0, penaltyv=1.0, standardize=False)
x_weights = U[:, 0]
z_weights = V[:, 0]
corrcoef = np.corrcoef(np.dot(x_weights, X.T), np.dot(z_weights, Z.T))[0, 1]
print("Corrcoef for comp 1: " + str(corrcoef))
print_weights('X', U)
print_weights('Z', V)
Out:
Corrcoef for comp 1: 0.7629782437780296
X: -1.000, 0.848
Z: -1.000, -0.866
However, when you add penalties, you get a sparse version of CCA.
U, V, D = pmd(X.T @ Z, K=2, penaltyu=0.8, penaltyv=0.9, standardize=False)
x_weights = U[:, 0]
z_weights = V[:, 0]
corrcoef = np.corrcoef(np.dot(x_weights, X.T), np.dot(z_weights, Z.T))[0, 1]
print("Corrcoef for comp 1: " + str(corrcoef))
print_weights('X', U)
print_weights('Z', V)
Out:
Corrcoef for comp 1: 0.7038297679392981
X: -1.000, 0.143
Z: -1.000, -0.347
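print_weights shows only the normalized first component; to see the sparsity pattern across all components, print the raw weight matrices. With stronger penalties, individual entries are driven exactly to zero.
# Raw weight matrices from the penalized run above, one column per component.
print(np.round(U, 3))
print(np.round(V, 3))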
PMD is a remarkably simple and powerful idea, and as seen, it can be used to implement sparse CCA. However, for SVD(X.T @ Z) to be equivalent to CCA, cov(X) and cov(Z) should be diagonal, which can sometimes cause problems. Another CCA algorithm allowing convex penalties, one that does not require cov(X) and cov(Z) to be diagonal, was presented in Mai et al. (2019). It is based on an iterative least squares formulation, and as it is solved with a GLM, it allows elastic-net-style weighting of the L1 and L2 norms for each dataset separately.
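As a quick check of that caveat on this data, look at the sample covariances of X and Z; the nonzero off-diagonal entries show that neither covariance is diagonal here, which is consistent with the PMD weights above differing from the CanCorr ones.
# Off-diagonal entries indicate cov(X) and cov(Z) are not diagonal.
print(np.round(np.cov(X, rowvar=False), 3))
print(np.round(np.cov(Z, rowvar=False), 3))
Now run the iterative penalized least squares algorithm: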
X_weights, Z_weights = cca_ipls(X, Z, alpha_lambda=0.0, beta_lambda=0.0, standardize=False,
n_pairs=2, glm_impl='glmnet_python')
x_weights = X_weights[:, 0]
z_weights = Z_weights[:, 0]
corrcoef = np.corrcoef(np.dot(x_weights, X.T), np.dot(z_weights, Z.T))[0, 1]
print("Corrcoef for comp 1: " + str(corrcoef))
print_weights("X", X_weights)
print_weights("Z", Z_weights)
Out:
Corrcoef for comp 1: 0.7876314577102576
X: -1.000, 0.825
Z: -1.000, 0.026
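Finally, multicca_pmd, imported at the top, extends the PMD approach to more than two datasets, though it is not exercised above. Below is a minimal sketch of how it might be called; the third dataset is random placeholder data, and the call signature, penalty convention, and return value are assumptions to be checked against the sparsecca documentation.
# Hypothetical third view: random data, used only to illustrate the call shape.
rng = np.random.default_rng(0)
W = rng.standard_normal((X.shape[0], 2))
# Assumed interface: a list of datasets and one penalty per dataset,
# returning one weight matrix per dataset.
ws = multicca_pmd([X, Z, W], penalties=[1.0, 1.0, 1.0], K=1, standardize=False)
for name, w in zip(['X', 'Z', 'W'], ws):
    print_weights(name, w)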
Total running time of the script: ( 0 minutes 0.882 seconds)