Correlation
- class pyspark.ml.stat.Correlation
- Compute the correlation matrix for the input dataset of Vectors using the specified method. Methods currently supported: pearson (default), spearman.
- New in version 2.2.0.
- Notes
- For Spearman, a rank correlation, we need to create an RDD[Double] for each column and sort it in order to retrieve the ranks and then join the columns back into an RDD[Vector], which is fairly costly. Cache the input Dataset before calling corr with method = 'spearman' to avoid recomputing the common lineage.
- Methods
- corr(dataset, column[, method]): Compute the correlation matrix with specified method using dataset.
- Methods Documentation
- static corr(dataset, column, method='pearson')
- Compute the correlation matrix with specified method using dataset.
- New in version 2.2.0.
- Parameters
- dataset : pyspark.sql.DataFrame
- A DataFrame.
- column : str
- The name of the column of vectors for which the correlation coefficient needs to be computed. This must be a column of the dataset, and it must contain Vector objects.
- method : str, optional
- String specifying the method to use for computing correlation. Supported: pearson (default), spearman.
 
- Returns
- A DataFrame that contains the correlation matrix of the column of vectors. This DataFrame contains a single row and a single column named METHODNAME(COLUMN), for example pearson(features).
 
- Examples
>>> from pyspark.ml.linalg import DenseMatrix, Vectors
>>> from pyspark.ml.stat import Correlation
>>> dataset = [[Vectors.dense([1, 0, 0, -2])],
...            [Vectors.dense([4, 5, 0, 3])],
...            [Vectors.dense([6, 7, 0, 8])],
...            [Vectors.dense([9, 0, 0, 1])]]
>>> dataset = spark.createDataFrame(dataset, ['features'])
>>> pearsonCorr = Correlation.corr(dataset, 'features', 'pearson').collect()[0][0]
>>> print(str(pearsonCorr).replace('nan', 'NaN'))
DenseMatrix([[ 1.        ,  0.0556...,         NaN,  0.4004...],
             [ 0.0556...,  1.        ,         NaN,  0.9135...],
             [        NaN,         NaN,  1.        ,         NaN],
             [ 0.4004...,  0.9135...,         NaN,  1.        ]])
>>> spearmanCorr = Correlation.corr(dataset, 'features', method='spearman').collect()[0][0]
>>> print(str(spearmanCorr).replace('nan', 'NaN'))
DenseMatrix([[ 1.        ,  0.1054...,         NaN,  0.4       ],
             [ 0.1054...,  1.        ,         NaN,  0.9486... ],
             [        NaN,         NaN,  1.        ,         NaN],
             [ 0.4       ,  0.9486... ,         NaN,  1.        ]])
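To make the Notes about Spearman more concrete, here is a single-machine sketch in plain Python that treats Spearman correlation as Pearson correlation applied to ranks. It is an illustration only, not Spark's distributed implementation. The helper names average_ranks, pearson, and spearman are hypothetical, and averaging the ranks of tied values is an assumption that happens to reproduce the off-diagonal values in the example above.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; NaN when either input has zero variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman correlation: rank each column (one sort each), then Pearson."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Columns of the four example vectors.
col0 = [1, 4, 6, 9]
col1 = [0, 5, 7, 0]
col2 = [0, 0, 0, 0]
col3 = [-2, 3, 8, 1]

print(spearman(col0, col3))  # about 0.4
print(spearman(col0, col1))  # about 0.1054
print(spearman(col0, col2))  # nan, because col2 is constant
```

Each call to average_ranks sorts one column, which mirrors the per-column sort described in the Notes. The constant third column has zero variance, which is why its off-diagonal entries appear as NaN in the example output.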