Max Turgeon
University of Manitoba
Missing data is common in every data science domain
Often, methods assume the data is complete or (silently) perform a complete-case analysis.
Can we improve performance by leveraging structure in the data?
Removing missing data can be misleading
\[ [P_\Omega(X)]_{ij} = \begin{cases} X_{ij} & (i,j)\in \Omega\\ 0 & (i,j) \notin \Omega\end{cases}.\]
\[ f(X) = \frac{1}{2} \| P_\Omega(A) - P_\Omega(X) \|_F^2 + \lambda \|X\|_\ast.\]
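The projection \(P_\Omega\) and the objective \(f\) can be sketched in a few lines of NumPy. This is a minimal illustration, assuming \(\Omega\) is stored as a boolean mask with `True` for observed entries; the names `project` and `objective` are hypothetical and not taken from any package.

```python
import numpy as np

def project(X, mask):
    """P_Omega(X): keep observed entries (mask == True), zero out the rest."""
    return np.where(mask, X, 0.0)

def objective(A, X, mask, lam):
    """0.5 * ||P_Omega(A) - P_Omega(X)||_F^2 + lam * ||X||_* (nuclear norm)."""
    resid = project(A, mask) - project(X, mask)
    return 0.5 * np.linalg.norm(resid, "fro") ** 2 + lam * np.linalg.norm(X, "nuc")

# Missing entries of A may be stored as NaN: the mask keeps them out of the fit.
A = np.array([[1.0, np.nan], [3.0, 4.0]])
mask = ~np.isnan(A)
value_at_zero = objective(A, np.zeros_like(A), mask, lam=1.0)
```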
\[ A = UDV^T, \qquad U^TMU = V^TWV = I.\]
We can get a rank \(r\) approximation by taking the first \(r\) columns of \(U,V\), and the first \(r\) diagonal entries of \(D\):
\[A \approx U_{[r]} D_{[r]} V_{[r]}^T.\]
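One way to compute this decomposition is to reduce it to an ordinary SVD of \(M^{1/2} A W^{1/2}\). The sketch below assumes \(M\) and \(W\) are symmetric positive definite (an assumption not stated above); the function names are illustrative only.

```python
import numpy as np

def sym_power(S, p):
    """S**p for a symmetric positive-definite matrix S, via its eigendecomposition."""
    vals, vecs = np.linalg.eigh(S)
    return (vecs * vals**p) @ vecs.T

def gsvd(A, M, W):
    """Return U, d, V with A = U diag(d) V^T and U^T M U = V^T W V = I."""
    M_half, M_inv_half = sym_power(M, 0.5), sym_power(M, -0.5)
    W_half, W_inv_half = sym_power(W, 0.5), sym_power(W, -0.5)
    # Ordinary SVD of the transformed matrix M^{1/2} A W^{1/2}.
    P, d, Qt = np.linalg.svd(M_half @ A @ W_half, full_matrices=False)
    # Map the singular vectors back to the original coordinates.
    return M_inv_half @ P, d, W_inv_half @ Qt.T

def rank_r_approx(U, d, V, r):
    """U_[r] D_[r] V_[r]^T: first r columns of U and V, first r entries of d."""
    return (U[:, :r] * d[:r]) @ V[:, :r].T
```

With \(M = W = I\) this reduces to the usual truncated SVD.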
\[f(X) = \frac{1}{2}\mathrm{trace}\left(M(A-X) W (A-X)^T\right),\quad \mathrm{rank}(X) \leq r.\]
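To see why the truncated decomposition is the natural candidate here, assume \(M\) and \(W\) are symmetric positive definite (an assumption not stated above). The objective is then an ordinary Frobenius norm in transformed coordinates:
\[ f(X) = \frac{1}{2}\left\| M^{1/2}(A - X)W^{1/2} \right\|_F^2. \]
Multiplying by the invertible matrices \(M^{1/2}\) and \(W^{1/2}\) preserves rank, so the Eckart-Young theorem applied to \(M^{1/2} A W^{1/2}\) shows that \(X = U_{[r]} D_{[r]} V_{[r]}^T\) is a minimizer.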
\[X_\mathrm{compl} = \mathrm{argmin}_{X} \left\{f(X) + \lambda\|X\|_*\right\}.\]
Recall: \(S_\lambda(\sigma) = \max(\sigma - \lambda, 0)\).
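As a quick illustration, the soft-thresholding operator applied elementwise to a vector of singular values could look like this (the function name is hypothetical):

```python
import numpy as np

def soft_threshold(sigma, lam):
    """S_lambda(sigma) = max(sigma - lambda, 0), applied elementwise."""
    return np.maximum(sigma - lam, 0.0)

singular_values = np.array([3.0, 1.0, 0.5])
shrunk = soft_threshold(singular_values, 1.0)  # -> [2. 0. 0.]
```

Singular values at or below \(\lambda\) are set to zero, which is what lowers the rank of the solution.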
\[ \mathrm{prox}_h(X) = \mathrm{argmin}_{\Theta} \left\{\frac{1}{2}\|X - \Theta \|_F^2 + h(\Theta)\right\}. \]
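For \(h(\Theta) = \lambda\|\Theta\|_\ast\), the proximal operator soft-thresholds the singular values of \(X\). The sketch below uses it in the standard (unweighted) Soft-Impute iteration for \(\frac{1}{2}\|P_\Omega(A) - P_\Omega(X)\|_F^2 + \lambda\|X\|_\ast\); it is not the generalized method, and the function names, mask convention (`True` for observed entries), and stopping rule are assumptions made for illustration.

```python
import numpy as np

def prox_nuclear(X, lam):
    """prox of lam * ||.||_*: soft-threshold the singular values of X."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(d - lam, 0.0)) @ Vt

def soft_impute(A, mask, lam, n_iter=200, tol=1e-8):
    """Standard Soft-Impute: fill missing entries with the current fit, then shrink."""
    A_obs = np.where(mask, A, 0.0)      # P_Omega(A); NaNs outside the mask are dropped
    X = np.zeros_like(A_obs)
    for _ in range(n_iter):
        filled = A_obs + np.where(mask, 0.0, X)   # observed data + current guesses
        X_new = prox_nuclear(filled, lam)
        converged = np.linalg.norm(X_new - X) <= tol * max(1.0, np.linalg.norm(X))
        X = X_new
        if converged:
            break
    return X
```

Each pass is a proximal gradient step with unit step size, so the objective does not increase from one iteration to the next.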
\(\pi_{miss}\) | GSI | SI | iPCA | riPCA |
---|---|---|---|---|
0.05 | 0.0974 | 0.1053 | 0.1032 | 0.1032 |
0.10 | 0.2040 | 0.2284 | 0.2152 | 0.2152 |
0.15 | 0.3037 | 0.3688 | 0.3115 | 0.3115 |
0.20 | 0.4318 | 0.4925 | 0.4085 | 0.4085 |
0.25 | 0.5796 | 0.6251 | 0.5145 | 0.5145 |
Overall, leveraging structure in data does improve performance.
GenSoftImpute package
Generalized Soft Impute