# Bare bones of Matrix Completion – I (7 mins read)

These last three weeks I was in Utah for a summer math school, or conference, whichever you prefer, organized by the Institute for Advanced Study (IAS), Princeton, New Jersey. Basically we had three weeks of lectures, problem sessions, and endless fun discussions day in and day out, with free food and free stay – the resort was amazing; we stayed at the Zermatt resort (if you’re using headphones you might wanna turn the volume down a bit). Anyway, the point of this post is the lectures given by Roman Vershynin that I sat through on random matrices and applications (basically this is a scribe), specifically about the famous problem called matrix completion. Later (hopefully soon) I’ll write about another topic that definitely got my attention, called Differential Privacy.

The problem is very simple to state: let’s say we have a matrix $M$ of size $m\times n$; this simply means we have a 2-D data array with $m$ rows and $n$ columns. The rows and columns are indexed over, say, users and movies respectively. The famous Netflix prize problem had a matrix like this, and of course, not every user has seen every movie. So the question is: can we fill in the missing entries of the otherwise incomplete matrix? Indeed, they had a test dataset (or simply test set) on which your algorithm would be evaluated; whoever gets the best accuracy wins the challenge, and with it lots of \$. This problem is obviously not well posed (why?), but as with most machine learning problems, mathematical ill-posedness doesn’t really matter as long as we do well on the test set (this is not 100% true). In fact, the name of the game is to find mathematical theory/algorithms/models that can explain the problem/solution in practice. There are actually very many cool examples of this; we will probably see more of them in later posts. Anyway, to analyze the matrix completion problem, we will need some tools from, of course, random matrix theory (I was actually surprised that these tools were developed very recently, I mean within the last 10–15 years at most; if I’m wrong, please comment). So in this post we will look at some basic concentration inequalities.
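To make the setup concrete, here is a minimal sketch of what the data looks like: a low-rank "users × movies" matrix where only some entries are observed. The sizes, the rank, and the 50% observation rate are illustrative assumptions, not anything from the Netflix problem itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# A hypothetical toy instance: a rank-2 "users x movies" matrix.
m, n, r = 6, 5, 2
M = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))

# Observe each entry independently with probability 0.5 (an illustrative
# choice); the rest are "missing" and are what we'd like to fill in.
mask = rng.random((m, n)) < 0.5
observed = np.where(mask, M, np.nan)

print(np.isnan(observed).sum(), "of", m * n, "entries are missing")
```

The low-rank structure is exactly what makes the problem tractable: without some such assumption, the missing entries could be anything, which is the ill-posedness mentioned above.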

Definition of subgaussian random variable: Let $x$ be a random variable. Then, the following are equivalent (TFAE):
1. Tails: $Pr(|x|\geq t)\leq 2\exp(-t^2/K_1^2)$
2. Moments: $\|x\|_p=(\mathbb{E}|x|^p)^{1/p}\leq K_2\sqrt{p}~\forall~p\geq 1$
3. MGF of $x^2$: $\mathbb{E} \exp(x^2/K_3^2)\leq 2$
4. MGF of $x$: $\mathbb{E}\exp(\lambda x)\leq \exp(\lambda^2K_4^2)~\forall~\lambda\in\mathbb{R}$

We will refer to these numbered properties throughout this post. $Pr$ denotes probability and $\mathbb{E}$ denotes expectation.
In these statements, the $K_i$’s can differ from each other by constant multiplicative factors, but this really doesn’t matter since we are interested in the asymptotics. Examples include: Gaussian, Bernoulli; nonexamples include exponential, Poisson, etc. Finally, we will denote by $\|x\|_{\psi_2}$ the smallest value of $K_3$ in 3 (MGF of $x^2$).
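As a quick sanity check of property 1, we can compare the empirical tail of a standard Gaussian against the claimed bound. For a standard Gaussian the tail property holds with $K_1=\sqrt{2}$, i.e. $Pr(|x|\geq t)\leq 2\exp(-t^2/2)$; the sample size below is just an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(1_000_000)

# For a standard Gaussian, property 1 holds with K_1 = sqrt(2):
# Pr(|x| >= t) <= 2 exp(-t^2 / 2).
for t in [1.0, 2.0, 3.0]:
    empirical = np.mean(np.abs(x) >= t)
    bound = 2 * np.exp(-t**2 / 2)
    assert empirical <= bound
    print(f"t={t}: empirical tail {empirical:.4f} <= bound {bound:.4f}")
```

Running this shows the empirical tail sitting comfortably under the subgaussian bound, with the gap widening as $t$ grows.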

Proposition 1: If $x_i$’s are independent subgaussian random variables with mean 0, then $\sum_i x_i$ is also a subgaussian random variable.

Proof: We will show 4 for $\sum_ix_i$.
$\mathbb{E}\exp(\lambda\sum_ix_i)=\prod_i\mathbb{E}\exp(\lambda x_i)\leq \prod_i\exp(c\lambda^2\|x_i\|^2_{\psi_2})=\exp(c\lambda^2\sum_i\|x_i\|^2_{\psi_2})$

The first equality is true because of the independence of the random variables $x_i$, and the inequality is due to 4 applied to each $x_i$. Choosing $K_4^2$ to be $c\sum_i\|x_i\|_{\psi_2}^2$, we are done. $\square$

Hoeffding’s Inequality: Let $x_i$’s be independent subgaussian random variables with mean 0. Then $\forall~t>0, ~Pr\{|\sum_ix_i|\geq t\}\leq 2\exp\left( -\frac{ct^2}{\sum_i\|x_i\|^2_{\psi_2}}\right)$

Proof: This follows from proposition 1 and the equivalence of the definitions. $\square$
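To see Hoeffding’s inequality in action, we can use sums of independent Rademacher ($\pm 1$) signs, the simplest mean-zero subgaussian variables. For bounded $\pm 1$ variables the bound takes the explicit form $Pr\{|\sum_i x_i|\geq t\}\leq 2\exp(-t^2/2N)$; the choice of $N$ and the number of trials below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

# Sums of N independent Rademacher (+/-1) signs. For these bounded
# variables, Hoeffding's bound has the explicit form
# Pr(|sum x_i| >= t) <= 2 exp(-t^2 / (2N)).
N, trials = 100, 50_000
sums = rng.choice([-1, 1], size=(trials, N)).sum(axis=1)

for t in [10, 20, 30]:
    empirical = np.mean(np.abs(sums) >= t)
    bound = 2 * np.exp(-t**2 / (2 * N))
    assert empirical <= bound
    print(f"t={t}: {empirical:.4f} <= {bound:.4f}")
```

Note the quadratic $t^2$ in the exponent: doubling $t$ squares the suppression, which is the subgaussian signature.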

OK, now we need to know a bit about subexponential random variables.

Definition of subexponential random variable: Let $x$ be a random variable. Then, the following are equivalent (TFAE):
1′. Tails: $Pr(|x|\geq t)\leq 2\exp(-t/K_1)$
2′. Moments: $\|x\|_p=(\mathbb{E}|x|^p)^{1/p}\leq K_2 p~\forall~p\geq 1$
3′. MGF: If $\mathbb{E} x=0$, then $\mathbb{E}\exp(\lambda x)\leq \exp(\lambda^2K_3^2)~\forall~|\lambda|\leq 1/K_3$

In this case we denote by $\|x\|_{\psi_1}$ the smallest such $K$ (analogously to $\|x\|_{\psi_2}$, one can take the smallest $K$ with $\mathbb{E}\exp(|x|/K)\leq 2$). Note the differences from the subgaussian case: the tails decay only like $\exp(-t)$ and the moments grow like $p$ rather than $\sqrt{p}$.
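A classic example of a subexponential (but not subgaussian) variable is the square of a standard Gaussian. Its tail satisfies $Pr(x^2\geq t)\leq \exp(-t/2)$, i.e. property 1′ with $K_1=2$ (up to the factor 2 in front), which we can verify empirically; the sample size is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(3)

# The square of a standard Gaussian is subexponential but not subgaussian:
# its tail decays like exp(-t/2) rather than exp(-t^2 / K^2).
y = rng.standard_normal(1_000_000) ** 2

for t in [4.0, 8.0, 12.0]:
    empirical = np.mean(y >= t)
    bound = np.exp(-t / 2)  # property 1' with K_1 = 2
    assert empirical <= bound
    print(f"t={t}: {empirical:.5f} <= {bound:.5f}")
```

This is exactly the kind of variable that shows up when one studies quadratic forms of Gaussians, which is why subexponential tools become unavoidable in random matrix theory.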

Bernstein’s Inequality: Let $x_i$’s be independent subexponential random variables with mean 0. Then $\forall~t>0, ~Pr\{|\sum_ix_i|\geq t\}\leq 2\exp\left\{-c\min \left(\frac{t^2}{\sum_i\|x_i\|^2_{\psi_1}},\frac{t}{\max_i\|x_i\|_{\psi_1}} \right)\right\}$

Proof (sketch): This requires applying Markov’s inequality to the moment generating function of the sum $\sum_ix_i$ and then using 3′. $\square$

Corollary: $Pr\{\frac{1}{N}|\sum_{i=1}^Nx_i|\geq t\}\leq 2\exp\{-cN\min\left( \frac{t^2}{K^2},\frac{t}{K}\right)\}$ where $K=\max_i\|x_i \|_{\psi_1}$.
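We can illustrate the corollary numerically with centered exponential variables, $x_i=\mathrm{Exp}(1)-1$, which are subexponential with $\|x_i\|_{\psi_1}$ of order 1. The constants $c=1/4$ and $K=1$ below are illustrative choices made generous enough for the demo, not the sharp constants of the theorem.

```python
import numpy as np

rng = np.random.default_rng(4)

# Centered Exp(1) variables are subexponential. The corollary predicts
# Pr{ (1/N)|sum x_i| >= t } <= 2 exp(-c N min(t^2/K^2, t/K)).
# c = 1/4 and K = 1 here are illustrative, deliberately loose constants.
N, trials = 100, 50_000
x = rng.exponential(1.0, size=(trials, N)) - 1.0
means = np.abs(x.mean(axis=1))

for t in [0.2, 0.3, 0.5]:
    empirical = np.mean(means >= t)
    bound = 2 * np.exp(-0.25 * N * min(t**2, t))
    assert empirical <= bound
    print(f"t={t}: {empirical:.5f} <= {bound:.5f}")
```

For small $t$ the quadratic term $t^2/K^2$ is the minimum (the Gaussian-like regime), while for $t>K$ the linear term takes over — this two-regime behavior is the hallmark of Bernstein-type bounds.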

Synopsis: We have Hoeffding’s inequality for subgaussians and Bernstein’s inequality for subexponentials – so what? Well, if you look at the expression inside the exponential, you’ll notice that subgaussians are much better behaved than subexponentials, since their tails go down (concentrate) faster (quadratic $t^2$ for subgaussians vs. linear $t$ for subexponentials, at least for large $t$). In the next post we will see how this can be extended to the case where the $x_i$ are matrices.