# NIPS 2017: Themes and Takeaways

Ideas that caught my eye, some that went over my head and few that permeated straight through to the brain. All while at NIPS ’17.

Disclaimer: This list is in no way exhaustive. There was a LOT of information all around me, and I could only make note of a limited themes that I either found interesting or relevant.

1. #### Phase retrieval/solving random quadratic system of equations.

I’ll jump straight to this topic, because it was the most relevant one to me, and I found a couple of interesting papers on this, namely:
i. Solving Most Systems of Random Quadratic Equations [Poster]
(uses iteratively reweighted gradient descent approach, to achieve information theoretically optimal Gaussian sample complexity).
ii. Convolutional Phase Retrieval [Poster]
(proposes a new sensing procedure composed of convolutions of a Gaussian distributed filter).
iii. A Local Analysis of Block Coordinate Descent for Gaussian Phase Retrieval [Workshop]
(uses block coordinate descent for alternating minimization based recovery procedure from phaseless Gaussian measurements).
iv. Fast, Sample-efficient Algorithms for Structured Phase Retrieval [Poster]
(using alternating minimization to recover structurally sparse signals from phaseless Gaussian measurements). (this was mine, so I can’t not publicize it 😀 )

The main takeaways, in my opinion would be
– experimenting with structure of the signals to be recovered,
– experimenting with the measurement setup, and
– modifying the two standard optimization approaches:
Wirtinger flow based gradient descent and Alternating minimization.

2. #### Looking beyond gradient descent for training neural networks.

i. Gradient Descent Can Take Exponential Time to Escape Saddle Points [Spotlight]

Takeaway:
-Gradient descent in itself is ill-equipped to escape saddle points. Perturbed/accelerated versions perform better.

3. #### Sparse Bayesian learning.

i. From Bayesian Sparsity to Gated Recurrent Nets [Orals]
(the authors connect the sparse Bayesian learning problem to RNNs and present an LSTM model that can be used for sparse signal estimation).
ii. DNNs for sparse coding and dictionary learning [Workshop]
(learning sparse regularizers for sparse signal estimation, via deep networks).

Takeaway:
-Interesting connections between Bayesian sparse signal estimation and deep nets.

4. #### Optimization techniques.

i. A Conservation Law Method in Optimization [Workshop]
(an interesting parallel between non-convex optimization and Newton’s second law)
ii. Faster Non-Convex Optimization than SDG [Workshop]
(using ε-approximations of local minima of smooth nonconvex functions)
vi. The marginal value of adaptive gradient methods in machine learning [Orals]
(SGD can perform better, with adequately chosen learning rate, as compared to adaptive optimizers like ADAM. One needs to rethink the optimizers being used for training deep networks) (this paper seems to have sparked a debate and even has a dedicated Reddit thread)
v. Implicit Regularization in Matrix Factorization [Spotlight]
(theoretical guarantees for convergence of gradient descent to minimum nuclear norm solution for matrix factorization problem, under firm initialization and step size constraints).
vi. Generalized Linear Model Regression under Distance-to-set Penalties [Spotlight]
(introduces a new penalty method to overcome drawback of shrinkage, while using Lasso).
vii. Unbiased estimates for linear regression via volume sampling [Spotlight]
(interesting technique to obtain (fat) matrix pseudo-inverse by picking only a subset of columns, hence speeding up the pseudo-inverse operation).

Takeaway:
-new techniques with refined theoretical guarantees for convergence of optimization procedures.

5. #### New directions/miscellaneous

i. Deep Sets [Orals] (design objective functions defined on sets that are invariant to permutations).
ii. Unsupervised object learning from dense equivarient image labeling. [Orals]
(using a large number of images of an object and no other supervision, to
extract a dense object-centric coordinate frame, for 3D modelling).
iii. Geometric deep learning on graphs and manifolds [Tutorial]
iv. A Unified Approach to Interpreting Model Predictions [Orals]
v. Diving into the shallows: a computational perspective on large-scale shallow learning. [Spotlight] (demonstrates that only a vanishingly small fraction of the function space is reachable after a polynomial number of gradient descent iterations when used in conjunctions with smooth kernels/shallow methods, hence exposing the limitation of shallow networks on large-scale data).

i. Gradient Descent GANs are Locally Stable [Orals] (utilizes non-linear systems theory to show local exponential stability of GAN optimization)
ii. Unsupervised image-to-image translation networks [Spotlight]
iii. Dual discriminator GANs [Spotlight] (theoretical analysis to show that, given the maximal discriminators, optimizing the generator of 2-discriminator GAN helps avoiding the mode collapsing problem).
iv. Bridging the gap between theory and practice of GANs [Workshop]

Takeaways:
-new applications
-new breakthroughs in terms of theoretical results for convergence and solving the “mode collapse” problem.

I think overall I was exposed to a lot of interesting ideas, and hopefully I will be able to make time to go through each of these papers in further detail.

# Optimization in Life

I’ve been doing a lot of optimization related work and courses, for my PhD, most notably in convex optimization and non-linear programming. They say that the best way to learn theory is to implement it in real life, and so I thought that it wouldn’t hurt to find ways to optimize… life… eh? On that optimistic enough thought, here we go:

1. The steepest descent is not necessarily the fastest. A common thing that people do when they are in an unwanted situation is to do starkly opposite of what they were initially doing, i.e. $-\nabla f(x)$. This seems to be a go-to solution for minimizing conflict. However, it is well known that to reach the point of minimum (conflict), steepest descent can take far more number of iterations than other gradient based methods. So take it easier, guys. Extremeness is not a smart option.
2. When bogged down by multiple issues, solve one-problem at-a-time. Coordinate descent is an approach in which the objective (life’s problems) is minimized w.r.t a fixed coordinate at a time. It’s known for its simplicity of implementation.
3. The apple does not fall far from the tree. So when Newton came up with his method for optimizing functions, the initial estimates did not fall far from the optimum, most notably in the case of quadratic functions. Turns out, it helps to approximate functions at each point with quadratic estimates, and then to minimize that quadratic estimate. Basically, take a problem and convert it into an easier sub-problem that has a known minimum. Move on to the next sub-problem. This fetches you the global optimum much faster.
4. While positive definiteness is ideal, positive semi-definiteness is good too. If the Hessian of the function to be minimized is positive semi-definite, then the function is convex and can be minimized easily (its local optimum is the global optimum). So keep calm (and kinda positive) and minimize issues.
5. Often when there are too many parameters to handle, we tend to overfit a fairly complicated model to our life. In such cases, it is a good idea to penalize over-complication by adding a regularization term. Regularization also helps in solving an ill-posed problem. If we tend to focus on only a specific set of problems, we forget other facets of life, which leads us into making poorer choices. The key is to find the right balance or trade-off.
6. Some problems actually have closed-form or unique solutions. There’s just one possible answer which is apparent enough. In that scenario the optimal strategy should be to stop optimizing. Stop contemplating, just go-get-it!
7. On a closing note, heuristically speaking, one would need to try out a bunch of optimizing techniques to find the optimal optimization technique.

XKCD

To make this post even more meta, how optimal would it be if the moral of this post converged to this statement?

# Signal Measurements: From linear to non-linear

The best kind of problems to solve are linear measurements

$y = Ax$

where $A$ is a square matrix, and essentially requires linear programming (or simple matrix calculations).

minimize  $c$

subject to $Ax = y$

which equivalently, has the closed form solution given by

$x = A^{-1} y$.

If $A$ is full rank and square, we have a unique solution that satisfies a one-one mapping between inputs $x_i$ and outputs $y_i$. If we go slightly out of comfort zone, we come to the easiest class of nonlinear programming, for linear measurements, which is convex optimization. Why? Because in convex functions, local optimum is global optimum. They are nice that way: as long as you can ensure that with each iteration, you are reducing your objective value, at an appropriate enough rate, you will converge to the correct solution.

An example of this type of problem is still $y=Ax$, with $A$ being a tall matrix, which can be formulated as a least-squares problem. If $A$ is full rank, it has a unique solution that can be computed using a matrix pseudo-inverse. (Note: In this model, there is no one-one mapping between $x_i$‘s and $y_i$‘s. There are more equations than variables, which means we may not be able to find a set of $x_i$‘s which satisfy all of the equations. We can however find an $x$  which comes close enough to the measurements, by minimizing the deviation of the measurement model from the actual measurements).

minimize  $\|Ax-y\|^2_2$

which equivalently, has the closed form solution given by

$x = (A^TA)^{-1} A^Ty$.

What about linear-measurement models that are inherently non-convex? Consider $A$ to be a fat matrix, we can have infinitely many solutions ($A$ is rank deficient). We further impose a restriction on $x$ to be k-sparse (of course, this imposition is natural enough. We are collecting less information about our signal than we’re supposed to. If we want to uniquely recover the signal, we need to utilize certain properties about the structure of the signal). The problem is called the compressed sensing problem, and can be stated as

minimize $\|y-Ax\|_2$

subject to $\|x\|_0 \leq k$.

This problem is np-hard to solve. However, if we choose $A$ such that it satisfies restricted isometry property depending on parameter $k$, then we can still uniquely recover $x$, using convex optimization. Of course there are non-convex approaches as well, but a common convex approximation is the lasso problem

minimize  $\|Ax-y\|^2_2 + \lambda \|x\|_1$.

This is harder than our vanilla least-squares formulation, but still, very much in the realms of the well established convex optimization algorithms. Now let’s get even more out of our comfort zone. Non-linear problems that are non-convex. So we come to phase retrieval

$y = |Ax|$

which is equivalent to

$y_j = |a_j^T x|$ for $j=1,2\dots m$

or

$y_j = (a_j^T x)^2$.

Insane! Phase contains most of the information! As you might have noticed the trend, we’re trying to recover more using lesser and lesser information. A common analysis suggests that most of the information about a signal is contained in the phase of the measurements, not the magnitudes.

Now this function is highly non convex. Why? For convex functions all sub-level sets are convex. What that means is, take a convex curve. Think parabola symmetric about y axis. And now cut it with $y = t$ , for any $t$, and look at the part of the parabola that lies below $y=t$ ( y-axis is $f(x)$). We want to look at all $x$ such that $f(x) < t$, for any $t$. It just forms a line segment on the x-axis. Now a line segment is a convex set. This parabola hence has a unique minimum.

Now think of a highly irregular curve with say 5-6 local minima, 5-6 global maxima and 1 global minimum. Such curves are not convex, because when you project them on to the x-axis, they don’t form convex sets.

Of course, we then resort to using the first trick from our new signal processing handbook. Make things linear and convex. A clever way to look at

$y = |Ax|$.

is to “linearize” it .

$y_{ph} = phase(Ax)$

then

$y \circ y_{ph} = Ax$

where were looking at an element-wise product, on the left side. This can be better written as

$Cy = Ax$

where $C$ is a diagonal matrix $C = diag(y_{ph})$. And so the problem turns into

minimize over $x,C\quad\|Ax-Cy\|_2$.

It’s easy to see that this problem is not convex. Because the entries of diagonal of $C$ are restricted to be phase values. They have to have unit magnitude. This isn’t a convex set. So now what do we do ?

What AltMinPhase does is that it looks at it as 2 alternating minimizations with $C$ and $x$ being variables. For constant $C$, this is convex and is a least-square problem if $A$ is full rank. For constant $x$, it is easy to see that the optimal $C$ is given by

$C = phase( y_{apparent}) = phase(Ax)$.

They then alternatively, minimise over these 2 variables. Now think back to the highly non-convex function with multiple local minima and maxima. If you initialize wrongly, you’re very likely to hit a local minimum. And that’s the end of it. Wrong solution!

Initialize correctly, close enough to the global minimum (which should ideally be very close to the true $x = x^*$, for a noiseless case), and bam! It’s a very fast convergence algorithm. If $A$ is iid Gaussian, we can design an initial vector which has an expected value equal to the true value $x^*$. Beauty of random matrix theory (the same thing that helped us solve the compressed sensing problem).

*******

In the next few posts, I will write about the intuition and big picture of some of the current topics in signal processing, like structured sparsity, phase retrieval, random matrix theory and some applications in machine learning. In doing so, I hope to create a good database of ideas related to my research.