Max Turgeon
University of Manitoba
Missing data is common in every data science domain
Often, methods assume the data is complete or (silently) perform a complete-case analysis.
Can we improve performance by leveraging structure in the data?
Removing observations with missing values can be misleading
\[ A = UDV^T, \qquad U^TMU = V^TWV = I.\]
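One way to compute a decomposition with these normalizations, assuming \(M\) and \(W\) are symmetric positive definite, is to take Cholesky factors of both, run an ordinary SVD on the transformed matrix, and map the singular vectors back. The sketch below is illustrative only: the function name `generalized_svd` and the toy matrices are assumptions, not part of the talk or of any package.

```python
import numpy as np

def generalized_svd(A, M, W):
    """Return U, d, V with A = U diag(d) V^T, U^T M U = I and V^T W V = I.

    Assumes M and W are symmetric positive definite (hypothetical helper).
    """
    L_M = np.linalg.cholesky(M)          # M = L_M L_M^T
    L_W = np.linalg.cholesky(W)          # W = L_W L_W^T
    B = L_M.T @ A @ L_W                  # transformed matrix
    P, d, Qt = np.linalg.svd(B, full_matrices=False)
    U = np.linalg.solve(L_M.T, P)        # U = L_M^{-T} P
    V = np.linalg.solve(L_W.T, Qt.T)     # V = L_W^{-T} Q
    return U, d, V

# Toy data (made up for illustration)
rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))
M = np.diag(rng.uniform(0.5, 2.0, size=6))   # row weights
G = rng.normal(size=(4, 4))
W = G @ G.T + 4 * np.eye(4)                  # column metric, positive definite
U, d, V = generalized_svd(A, M, W)
```

With \(M = W = I\) this reduces to the ordinary SVD.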
\[f(X) = \frac{1}{2}\mathrm{trace}\left(M(A-X) W (A-X)^T\right),\quad \mathrm{rank}(X) \leq r.\]
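For positive definite \(M\) and \(W\), the rank-constrained minimizer of this loss is obtained by keeping the first \(r\) terms of the generalized SVD, which is the weighted analogue of the Eckart-Young theorem. The following minimal sketch checks this numerically on made-up data; `weighted_loss` and `truncated_gsvd` are hypothetical helper names introduced here for illustration.

```python
import numpy as np

def weighted_loss(A, X, M, W):
    """f(X) = 0.5 * trace(M (A - X) W (A - X)^T)."""
    E = A - X
    return 0.5 * np.trace(M @ E @ W @ E.T)

def truncated_gsvd(A, M, W, r):
    """Rank-r approximation of A from the generalized SVD (M, W assumed SPD)."""
    L_M = np.linalg.cholesky(M)
    L_W = np.linalg.cholesky(W)
    P, d, Qt = np.linalg.svd(L_M.T @ A @ L_W, full_matrices=False)
    Y_r = (P[:, :r] * d[:r]) @ Qt[:r, :]     # truncated SVD of transformed matrix
    Z = np.linalg.solve(L_M.T, Y_r)          # L_M^{-T} Y_r
    X_r = np.linalg.solve(L_W.T, Z.T).T      # ... times L_W^{-1}
    return X_r, d

# Toy data (made up for illustration)
rng = np.random.default_rng(1)
A = rng.normal(size=(6, 4))
M = np.diag(rng.uniform(0.5, 2.0, size=6))
G = rng.normal(size=(4, 4))
W = G @ G.T + 4 * np.eye(4)
r = 2

X_r, d = truncated_gsvd(A, M, W, r)
loss_opt = weighted_loss(A, X_r, M, W)
# Any other rank-r matrix should do no better
X_rand = rng.normal(size=(6, r)) @ rng.normal(size=(r, 4))
loss_rand = weighted_loss(A, X_rand, M, W)
```

The optimal loss equals half the sum of the squared discarded generalized singular values.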
\[X_\mathrm{compl} = \operatorname*{arg\,min}_{X}\; f(X) + \lambda\|X\|_*\]
Recall: \(S_\lambda(\sigma) = \max(\sigma - \lambda, 0)\).
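As a point of reference, the sketch below applies \(S_\lambda\) to the singular values of an ordinary SVD and uses it in the standard Soft-Impute iteration, which corresponds to the special case \(M = W = I\). It is not the generalized method from the talk. The function names and the toy matrices are assumptions made for illustration.

```python
import numpy as np

def soft_threshold(sigma, lam):
    """S_lambda(sigma) = max(sigma - lambda, 0), applied elementwise."""
    return np.maximum(sigma - lam, 0.0)

def svt(Z, lam):
    """Soft-threshold the singular values of Z (ordinary SVD)."""
    P, d, Qt = np.linalg.svd(Z, full_matrices=False)
    return (P * soft_threshold(d, lam)) @ Qt

def soft_impute(A, lam, n_iter=100):
    """Standard Soft-Impute with identity weights; NaN marks missing entries."""
    observed = ~np.isnan(A)
    X = np.zeros_like(A)
    for _ in range(n_iter):
        Z = np.where(observed, A, X)   # keep observed entries, fill the rest from X
        X = svt(Z, lam)
    return X

# Thresholding a diagonal matrix with singular values 3, 1, 0.5
shrunk = svt(np.diag([3.0, 1.0, 0.5]), 1.0)

# Toy low-rank matrix with two entries removed (made up for illustration)
rng = np.random.default_rng(2)
A_full = rng.normal(size=(8, 2)) @ rng.normal(size=(2, 5))
A_miss = A_full.copy()
A_miss[0, 1] = np.nan
A_miss[3, 4] = np.nan
X_hat = soft_impute(A_miss, lam=0.1)
```

Larger values of \(\lambda\) set more singular values to zero and therefore yield lower-rank completions.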
\(\pi_\mathrm{miss}\) | GSI | SI | iPCA | riPCA |
---|---|---|---|---|
0.05 | 0.0974 | 0.1053 | 0.1032 | 0.1032 |
0.10 | 0.2040 | 0.2284 | 0.2152 | 0.2152 |
0.15 | 0.3037 | 0.3688 | 0.3115 | 0.3115 |
0.20 | 0.4318 | 0.4925 | 0.4085 | 0.4085 |
0.25 | 0.5796 | 0.6251 | 0.5145 | 0.5145 |
Overall, leveraging structure in data does improve performance.
GenSoftImpute package (Generalized Soft Impute)