.. DO NOT EDIT. .. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY. .. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE: .. "auto_examples/plot_cca.py" .. LINE NUMBERS ARE GIVEN BELOW. .. only:: html .. note:: :class: sphx-glr-download-link-note Click :ref:`here ` to download the full example code .. rst-class:: sphx-glr-example-title .. _sphx_glr_auto_examples_plot_cca.py: Different CCA methods ===================== Exempliefies different CCA methods .. GENERATED FROM PYTHON SOURCE LINES 10-11 Import necessary libraries. .. GENERATED FROM PYTHON SOURCE LINES 11-22 .. code-block:: default import pandas as pd import numpy as np from numpy.linalg import svd from statsmodels.multivariate.cancorr import CanCorr from sparsecca import cca_ipls from sparsecca import cca_pmd from sparsecca import multicca_pmd from sparsecca import pmd .. GENERATED FROM PYTHON SOURCE LINES 23-24 Get toy data example from seaborn .. GENERATED FROM PYTHON SOURCE LINES 24-35 .. code-block:: default path = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv" df = pd.read_csv(path) df = df.dropna() X = df[['bill_length_mm', 'bill_depth_mm']] Z = df[['flipper_length_mm', 'body_mass_g']] X = ((X - np.mean(X)) / np.std(X)).to_numpy() Z = ((Z - np.mean(Z)) / np.std(Z)).to_numpy() .. GENERATED FROM PYTHON SOURCE LINES 36-37 Define function for printing weights .. GENERATED FROM PYTHON SOURCE LINES 37-41 .. code-block:: default def print_weights(name, weights): first = weights[:, 0] / np.max(np.abs(weights[:, 0])) print(name + ': ' + ', '.join(['{:.3f}'.format(item) for item in first])) .. GENERATED FROM PYTHON SOURCE LINES 42-43 First, let's try CanCorr function from statsmodels package. .. GENERATED FROM PYTHON SOURCE LINES 43-50 .. code-block:: default stats_cca = CanCorr(Z, X) print(stats_cca.corr_test().summary()) print_weights('X', stats_cca.x_cancoef) print_weights('Z', stats_cca.y_cancoef) .. rst-class:: sphx-glr-script-out Out: .. code-block:: none Cancorr results ===================================================================== Canonical Correlation Wilks' lambda Num DF Den DF F Value Pr > F --------------------------------------------------------------------- 0 0.7876 0.3768 4.0000 658.0000 103.4837 0.0000 1 0.0864 0.9925 1.0000 330.0000 2.4812 0.1162 --------------------------------------------------------------------- --------------------------------------------------------------------- Multivariate Statistics and F Approximations ----------------------------------------------------------------------- Value Num DF Den DF F Value Pr > F ----------------------------------------------------------------------- Wilks' lambda 0.3768 4.0000 658.0000 103.4837 0.0000 Pillai's trace 0.6278 4.0000 660.0000 75.4943 0.0000 Hotelling-Lawley trace 1.6416 4.0000 393.7623 134.8873 0.0000 Roy's greatest root 1.6341 2.0000 330.0000 269.6262 0.0000 ===================================================================== X: -1.000, 0.826 Z: -1.000, 0.027 .. GENERATED FROM PYTHON SOURCE LINES 51-52 Next, use CCA algorithm from Witten et al. .. GENERATED FROM PYTHON SOURCE LINES 52-63 .. code-block:: default U, V, D = cca_pmd(X, Z, penaltyx=1.0, penaltyz=1.0, K=2, standardize=False) x_weights = U[:, 0] z_weights = V[:, 0] corrcoef = np.corrcoef(np.dot(x_weights, X.T), np.dot(z_weights, Z.T))[0, 1] print("Corrcoef for comp 1: " + str(corrcoef)) print_weights('X', U) print_weights('Z', V) .. rst-class:: sphx-glr-script-out Out: .. code-block:: none Corrcoef for comp 1: 0.7629782437780296 X: -1.000, 0.848 Z: -1.000, -0.866 .. GENERATED FROM PYTHON SOURCE LINES 64-66 As the CCA algorithm in Witten et al is faster version of computing SVD of X.T @ Z, try that. .. GENERATED FROM PYTHON SOURCE LINES 66-77 .. code-block:: default U, D, V = svd(X.T @ Z) x_weights = U[:, 0] z_weights = V[0, :] corrcoef = np.corrcoef(np.dot(x_weights, X.T), np.dot(z_weights, Z.T))[0, 1] print("Corrcoef for comp 1: " + str(corrcoef)) print_weights('X', U) print_weights('V', V.T) .. rst-class:: sphx-glr-script-out Out: .. code-block:: none Corrcoef for comp 1: 0.7629782437780295 X: -1.000, 0.848 V: -1.000, -0.866 .. GENERATED FROM PYTHON SOURCE LINES 78-81 The novelty in Witten et al is developing matrix decomposition similar to SVD, but which allows to add convex penalties (here lasso). Using that to X.T @ Z without penalty results to same as above. .. GENERATED FROM PYTHON SOURCE LINES 81-92 .. code-block:: default U, V, D = pmd(X.T @ Z, K=2, penaltyu=1.0, penaltyv=1.0, standardize=False) x_weights = U[:, 0] z_weights = V[:, 0] corrcoef = np.corrcoef(np.dot(x_weights, X.T), np.dot(z_weights, Z.T))[0, 1] print("Corrcoef for comp 1: " + str(corrcoef)) print_weights('X', U) print_weights('Z', V) .. rst-class:: sphx-glr-script-out Out: .. code-block:: none Corrcoef for comp 1: 0.7629782437780296 X: -1.000, 0.848 Z: -1.000, -0.866 .. GENERATED FROM PYTHON SOURCE LINES 93-94 However, when you add penalties, you get a sparse version of CCA. .. GENERATED FROM PYTHON SOURCE LINES 94-105 .. code-block:: default U, V, D = pmd(X.T @ Z, K=2, penaltyu=0.8, penaltyv=0.9, standardize=False) x_weights = U[:, 0] z_weights = V[:, 0] corrcoef = np.corrcoef(np.dot(x_weights, X.T), np.dot(z_weights, Z.T))[0, 1] print("Corrcoef for comp 1: " + str(corrcoef)) print_weights('X', U) print_weights('Z', V) .. rst-class:: sphx-glr-script-out Out: .. code-block:: none Corrcoef for comp 1: 0.7038297679392981 X: -1.000, 0.143 Z: -1.000, -0.347 .. GENERATED FROM PYTHON SOURCE LINES 106-114 PMD is really fantastically simple and powerful idea, and as seen, can be used to implement sparse CCA. However, for SVD(X.T @ Z) to be equivalent to CCA, cov(X) and cov(Z) should be diagonal, which can sometimes give problems. Another CCA algorithm allowing convex penalties that does not require cov(X) and cov(Z) to be diagonal, was presented in Mai et al (2019). It is based on iterative least squares formulation, and as it is solved with GLM, it allows elastic net -like weighting of L1 and L2 -norms for both datasets separately. .. GENERATED FROM PYTHON SOURCE LINES 114-126 .. code-block:: default X_weights, Z_weights = cca_ipls(X, Z, alpha_lambda=0.0, beta_lambda=0.0, standardize=False, n_pairs=2, glm_impl='glmnet_python') x_weights = X_weights[:, 0] z_weights = Z_weights[:, 0] corrcoef = np.corrcoef(np.dot(x_weights, X.T), np.dot(z_weights, Z.T))[0, 1] print("Corrcoef for comp 1: " + str(corrcoef)) print_weights("X", X_weights) print_weights("Z", Z_weights) .. rst-class:: sphx-glr-script-out Out: .. code-block:: none Corrcoef for comp 1: 0.7876314577102576 X: -1.000, 0.825 Z: -1.000, 0.026 .. rst-class:: sphx-glr-timing **Total running time of the script:** ( 0 minutes 0.882 seconds) .. _sphx_glr_download_auto_examples_plot_cca.py: .. only :: html .. container:: sphx-glr-footer :class: sphx-glr-footer-example .. container:: sphx-glr-download sphx-glr-download-python :download:`Download Python source code: plot_cca.py ` .. container:: sphx-glr-download sphx-glr-download-jupyter :download:`Download Jupyter notebook: plot_cca.ipynb ` .. only:: html .. rst-class:: sphx-glr-signature `Gallery generated by Sphinx-Gallery `_