Generalized Soft Impute for Matrix Completion

Max Turgeon

University of Manitoba

Motivation

Missing data is common in every data science domain
Often, methods assume the data is complete or (silently) perform a complete-case analysis.
- Data imputation
Can we improve performance by leveraging structure in the data?

Common methods

Complete-case analysis
Mean imputation
- Also median/mode imputation
Often leads to imprecise and/or biased results

Example

Removing missing data can be misleading

Outline

Matrix factorization and matrix completion
Generalized SVD
Generalized Soft Impute
Simulations

Matrix factorization and matrix completion

Non-Negative Matrix Factorization

Let \(A\) be a \(n\times p\) non-negative matrix
- For NNMF, all entries are non-negative.
Want: \(G, H\) non-negative such that \(A \approx GH\).
- Distance? Frobenius norm of difference
- Can take \(G,H\) of rank \(r < \min(n,p)\).
Lee and Seung (2001) gave iterative algorithm with entrywise updates.
- We can skip over missing data!

Iterative PCA

PCA gives us decomposition of \(A = Z V^T\), with \(V\) containing the eigenvectors of \(\mathrm{Cov}(A)\).
Missing data?
- Randomly impute.
- Compute PCA and use decomposition to update imputation.
- Repeat until convergence.
Control overfitting via 1) Choosing a small number \(r\) of eigenvectors/PCs and 2) \(L_2\) regularization

Soft Impute

Define a projection operator:

\[ [P_\Omega(X)]_{ij} = \begin{cases} X_{ij} & (i,j)\in \Omega\\ 0 & (i,j) \notin \Omega\end{cases}.\]

\(\Omega\): indices of missing entries.
Define objective function:

\[ f(X) = \frac{1}{2} \| P_\Omega(A) - P_\Omega(X) \|_F^2 + \lambda \|X\|_\ast.\]

We have:
- \(\| \cdot \|_F\): Frobenius norm
- \(\| \cdot \|_\ast\): Nuclear norm (i.e. sum of singular values)
- \(\lambda\): Tuning parameter
Convex optimization! But nuclear norm is not differentiable…
Mazumder et al. (2010) proposed an efficient algorithm based on proximal gradient descent.

Summary

Matrix factorization = structure
Learn structure = make good guesses
Avoid overfitting with small rank and regularization

Generalized Singular Value Decomposition

Generalized SVD

Generalized SVD differs from SVD by introducing positive-definite matrices representing row and column constraints.
\(A\) is \(n\times p\) matrix, \(M\) represents row constraints, \(W\) represents column constraints.
We want matrices U, D, V, with D diagonal, such that

\[ A = UDV^T, \qquad U^TMU = V^TWV = I.\]

Define \(\tilde{A} = M^{1/2}A W^{1/2}\).
Compute SVD: \(\tilde{A} = P D Q^T\).
Rescale: \(U=M^{-1/2}P\) and \(V = W^{-1/2}Q\).

We can get a rank \(r\) approximation by taking the first \(r\) columns of \(U,V\), and the first \(r\) diagonal entries of \(D\):

\[A \approx U_{[r]} D_{[r]} V_{[r]}^T.\]

Multiple Correspondence Analysis

Think PCA for categorical data.
Use one-hot encoding to get 0/1 matrix (observed counts).
Compute matrix of expected counts under independence (row frequencies times column frequencies)
Run generalized SVD on the difference matrix.
- \(M\) is a diagonal matrix of inverse row frequencies
- \(W\) is a diagonal matrix of inverse column frequencies

Generalized Soft Impute

Our method

The rank \(r\) matrix \(U_{[r]} D_{[r]} V_{[r]}^T\) (for fixed \(r> 0\)) minimizes the following:

\[f(X) = \frac{1}{2}\mathrm{trace}\left(M(A-X) W (A-X)^T\right),\quad \mathrm{rank}(X) \leq r.\]

Key idea: Compute \(f(X)\) for non-missing values of \(A\), and penalize the nuclear norm (i.e. sum of singular values).

\[X_\mathrm{compl} = \mathrm{\arg\!\min}\, \left\{f(X) + \lambda\|X\|_*\right\}\]

Algorithm

Iterate until convergence:
- Fill missing entries of \(A\) using previous iteration.
- Compute generalized SVD for \(A = UDV^T\).
- Soft-threshold singular values: \(X = US_\lambda(D)V^T\).

Recall: \(S_\lambda(\sigma) = \max(\sigma - \lambda, 0)\).

Aside: Proximal gradient descent

Convex and differentiable objective function \(g(X)\):
- Gradient descent: \(X^{t+1} = X^t - \gamma \nabla g(X^t)\)
What can we do if non-differentiable? It depends

Assume we can write \(g = f + h\) in two parts:
- \(f\) is convex and differentiable
- \(h\) is only convex

Define the proximal map:

\[ \mathrm{prox}_h(X) = \mathrm{argmin}_{\Theta} \left\{\frac{1}{2}\|X - \Theta \|_2^2 + h(\Theta)\right\}. \]

Sometimes, the proximal map is easy to compute:
- \(h\) = indicator function of convex set \(\to\) \(\mathrm{prox}_h\) = projection operator
- \(h\) = L₁ norm \(\to\) \(\mathrm{prox}_h\) = soft-thresholding (e.g. Lasso)
- \(h\) = nuclear norm \(\to\) \(\mathrm{prox}_h\) = matrix soft-thresholding

Proximal gradient descent
- \(X^{t+1} = \mathrm{prox}_{\gamma h}\left(X^t - \gamma \nabla g(X^t)\right)\)

Summary

Proximal gradient descent is useful when objective is non-differentiable, with simple non-differentiable part.
Convergence guarantees when \(g\) is Lipshitz
Can be accelerated just like GD

Simulations

Simulation scenario

We use data from ONDRI.¹
- Sex, NIH stroke scale score, APOE genotype, MAPT diplotype
We randomly remove a fixed proportion of data (\(\pi_{miss}\)).
We compare imputed data to original data.
We compare four methods:
- Generalized Soft Impute, Soft Impute, Iterative PCA, and Regularized Iterative PCA.

Results

\(\pi_{miss}\)	GSI	SI	iPCA	riPCA
0.05	0.0974	0.1053	0.1032	0.1032
0.10	0.2040	0.2284	0.2152	0.2152
0.15	0.3037	0.3688	0.3115	0.3115
0.20	0.4318	0.4925	0.4085	0.4085
0.25	0.5796	0.6251	0.5145	0.5145

Discussion

Our method outperforms Soft Impute.
For small \(\pi_{miss}\), GSI is better than iterative PCA.
For larger \(\pi_{miss}\), riPCA is better than GSI.
- BUT requires the right rank!

Overall, leveraging structure in data does improve performance.

Discussion (cont’d)

We assumed data is MCAR.
- Reasonable when data missing because of technical failure
Need more simulations to understand behaviour with MAR data.

Summary

We presented a matrix completion algorithm specifically designed for methods built on generalized SVD (e.g. weighted PCA, MCA).
We achieved robustness by penalizing the nuclear norm of the approximating matrix.
Proximal gradient descent theory guarantees convergence.
Especially useful for compositional data (e.g. ecology, microbiome data).

Next steps

Hyperparameter selection (\(\lambda\) and \(r\))
Application to microbiome data
- Missing data a consequence of low read depth
Improve computational efficiency by leveraging sparsity
Release GenSoftImpute package

Generalized Soft Impute for Matrix Completion

Motivation

Common methods

Example

Outline

Matrix factorization and matrix completion

Non-Negative Matrix Factorization

Iterative PCA

Soft Impute

Summary

Generalized Singular Value Decomposition

Generalized SVD

Multiple Correspondence Analysis

Generalized Soft Impute

Our method

Algorithm

Aside: Proximal gradient descent

Summary

Simulations

Simulation scenario

Results

Discussion

Discussion (cont’d)

Summary

Next steps

Questions?

Slides can be found at maxturgeon.ca/talks