# NIPS 2017: Themes and Takeaways

Ideas that caught my eye, some that went over my head, and a few that permeated straight through to the brain. All while at NIPS ’17.

Disclaimer: This list is in no way exhaustive. There was a LOT of information all around me, and I could only make note of a limited number of themes that I found either interesting or relevant.

1. #### Phase retrieval/solving random quadratic systems of equations.

I’ll jump straight to this topic, because it was the most relevant one to me, and I found a couple of interesting papers on this, namely:
i. Solving Most Systems of Random Quadratic Equations [Poster]
(uses an iteratively reweighted gradient descent approach to achieve information-theoretically optimal Gaussian sample complexity).
ii. Convolutional Phase Retrieval [Poster]
(proposes a new sensing procedure composed of convolutions of a Gaussian distributed filter).
iii. A Local Analysis of Block Coordinate Descent for Gaussian Phase Retrieval [Workshop]
(uses block coordinate descent within an alternating-minimization-based procedure for recovery from phaseless Gaussian measurements).
iv. Fast, Sample-efficient Algorithms for Structured Phase Retrieval [Poster]
(uses alternating minimization to recover structurally sparse signals from phaseless Gaussian measurements). (this was mine, so I can’t not publicize it 😀 )

The main takeaways, in my opinion, would be
– experimenting with the structure of the signals to be recovered,
– experimenting with the measurement setup, and
– modifying the two standard optimization approaches:
Wirtinger flow based gradient descent and alternating minimization.
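
To make the gradient descent option a bit more concrete, here is a minimal sketch of plain gradient descent on an intensity loss for real-valued phase retrieval. It is not taken from any of the papers above: the loss, the fixed step size, the toy dimensions and every function name are my own assumptions, and it skips the careful (spectral) initialization that Wirtinger flow style methods normally rely on.

```python
import random


def dot(a, x):
    return sum(ai * xi for ai, xi in zip(a, x))


def intensity(a, x):
    """Phaseless measurement (a . x)^2: the sign of a . x is lost."""
    d = dot(a, x)
    return d * d


def intensity_loss(x, A, y):
    """f(x) = 1/(2m) * sum_i ((a_i . x)^2 - y_i)^2."""
    return sum((intensity(a, x) - yi) ** 2 for a, yi in zip(A, y)) / (2 * len(A))


def intensity_gradient(x, A, y):
    """Gradient of f: (2/m) * sum_i ((a_i . x)^2 - y_i) * (a_i . x) * a_i."""
    m = len(A)
    grad = [0.0] * len(x)
    for a, yi in zip(A, y):
        coeff = 2.0 * (intensity(a, x) - yi) * dot(a, x) / m
        for j, aj in enumerate(a):
            grad[j] += coeff * aj
    return grad


def gradient_descent(A, y, x0, step=0.01, iters=500):
    """Fixed-step gradient descent (step size chosen by hand for this toy)."""
    x = list(x0)
    for _ in range(iters):
        g = intensity_gradient(x, A, y)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x


# Hypothetical toy problem: Gaussian sensing vectors, noiseless measurements.
rng = random.Random(0)
n, m = 3, 30
x_true = [1.0, -0.5, 0.25]
A = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]
y = [intensity(a, x_true) for a in A]

x0 = [0.9, -0.4, 0.3]  # assumed to be a reasonably good starting point
x_hat = gradient_descent(A, y, x0)
print("loss at start:", intensity_loss(x0, A, y))
print("loss at end:  ", intensity_loss(x_hat, A, y))
```

Note that `x_true` and its negation give exactly the same measurements, so the signal can only ever be recovered up to a global sign.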

2. #### Looking beyond gradient descent for training neural networks.

i. Gradient Descent Can Take Exponential Time to Escape Saddle Points [Spotlight]

Takeaway:
-Gradient descent in itself is ill-equipped to escape saddle points. Perturbed/accelerated versions perform better.
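
As a toy illustration of this takeaway (and not the construction used in the paper), the sketch below runs plain and perturbed gradient descent on a hand-picked two-dimensional function with a saddle at the origin. The objective, step size, perturbation radius and function names are all assumptions of mine, and starting exactly on the line `y = 0` is a deliberately degenerate choice that makes plain gradient descent slide straight into the saddle.

```python
import math
import random


def f(x, y):
    # Toy objective: saddle point at (0, 0), minima at (0, 1) and (0, -1).
    return x * x + y ** 4 / 4.0 - y * y / 2.0


def grad(x, y):
    return 2.0 * x, y ** 3 - y


def gradient_descent(x, y, step=0.1, iters=2000):
    for _ in range(iters):
        gx, gy = grad(x, y)
        x, y = x - step * gx, y - step * gy
    return x, y


def perturbed_gradient_descent(x, y, step=0.1, iters=2000,
                               tol=1e-3, radius=1e-2, seed=0):
    """Same update, but add a small random kick whenever the gradient is tiny."""
    rng = random.Random(seed)
    for _ in range(iters):
        gx, gy = grad(x, y)
        if math.hypot(gx, gy) < tol:
            x += rng.uniform(-radius, radius)
            y += rng.uniform(-radius, radius)
            gx, gy = grad(x, y)
        x, y = x - step * gx, y - step * gy
    return x, y


plain = gradient_descent(1.0, 0.0)
perturbed = perturbed_gradient_descent(1.0, 0.0)
print("plain GD ends at    ", plain, "with f =", f(*plain))
print("perturbed GD ends at", perturbed, "with f =", f(*perturbed))
```

Because the kicks are also applied near a minimum (where the gradient is tiny too), the perturbed iterate only hovers around a minimum instead of converging to it exactly; a more careful version would limit how often it perturbs.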

3. #### Sparse Bayesian learning.

i. From Bayesian Sparsity to Gated Recurrent Nets [Orals]
(the authors connect the sparse Bayesian learning problem to RNNs and present an LSTM model that can be used for sparse signal estimation).
ii. DNNs for sparse coding and dictionary learning [Workshop]
(learning sparse regularizers for sparse signal estimation, via deep networks).

Takeaway:
-Interesting connections between Bayesian sparse signal estimation and deep nets.

4. #### Optimization techniques.

i. A Conservation Law Method in Optimization [Workshop]
(an interesting parallel between non-convex optimization and Newton’s second law)
ii. Faster Non-Convex Optimization than SGD [Workshop]
(using ε-approximations of local minima of smooth nonconvex functions)
iv. The Marginal Value of Adaptive Gradient Methods in Machine Learning [Orals]
(with an adequately chosen learning rate, SGD can perform better than adaptive optimizers like Adam, so one needs to rethink the optimizers being used for training deep networks) (this paper seems to have sparked a debate and even has a dedicated Reddit thread)
v. Implicit Regularization in Matrix Factorization [Spotlight]
(theoretical guarantees for convergence of gradient descent to the minimum nuclear norm solution of the matrix factorization problem, under constraints on initialization and step size).
vi. Generalized Linear Model Regression under Distance-to-set Penalties [Spotlight]
(introduces a new penalty method to overcome the shrinkage drawback that comes with using the Lasso).
vii. Unbiased estimates for linear regression via volume sampling [Spotlight]
(an interesting technique to obtain the pseudo-inverse of a (fat) matrix by picking only a subset of columns, hence speeding up the pseudo-inverse operation).

Takeaway:
-new techniques with refined theoretical guarantees for convergence of optimization procedures.
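
For the SGD versus adaptive optimizer discussion above, it helps to see what "adaptive" means at the level of a single update. The sketch below writes out a scalar SGD step next to the standard Adam update with its usual default hyperparameters. The class and function names are my own, and this only illustrates the update rules; it does not reproduce any experiment from the paper.

```python
import math


def sgd_step(x, g, lr):
    """Plain SGD: the step is proportional to the gradient."""
    return x - lr * g


class Adam:
    """Scalar Adam update with bias-corrected moment estimates."""

    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = 0.0  # running mean of gradients
        self.v = 0.0  # running mean of squared gradients
        self.t = 0    # number of steps taken so far

    def step(self, x, g):
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * g
        self.v = self.beta2 * self.v + (1 - self.beta2) * g * g
        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)
        return x - self.lr * m_hat / (math.sqrt(v_hat) + self.eps)


lr = 0.001
for g in (100.0, 0.01):
    x_sgd = sgd_step(0.0, g, lr)
    x_adam = Adam(lr=lr).step(0.0, g)
    print("gradient", g, "| SGD moves to", x_sgd, "| Adam moves to", x_adam)
```

On its very first step Adam moves by roughly `lr` whatever the size of the gradient, while the SGD step scales with the gradient. This rescaling is the sense in which Adam adapts, and it is also why the two optimizers need their learning rates tuned separately before they can be compared fairly.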

5. #### New directions/miscellaneous

i. Deep Sets [Orals] (design objective functions defined on sets that are invariant to permutations).
ii. Unsupervised object learning from dense equivariant image labeling. [Orals]
(using a large number of images of an object and no other supervision, to
extract a dense object-centric coordinate frame, for 3D modelling).
iii. Geometric deep learning on graphs and manifolds [Tutorial]
iv. A Unified Approach to Interpreting Model Predictions [Orals]
v. Diving into the shallows: a computational perspective on large-scale shallow learning. [Spotlight] (demonstrates that only a vanishingly small fraction of the function space is reachable after a polynomial number of gradient descent iterations when used in conjunction with smooth kernels/shallow methods, hence exposing the limitation of shallow networks on large-scale data).

6. #### GANs.

i. Gradient Descent GANs are Locally Stable [Orals] (uses nonlinear systems theory to show local exponential stability of GAN optimization)
ii. Unsupervised image-to-image translation networks [Spotlight]
iii. Dual discriminator GANs [Spotlight] (theoretical analysis to show that, given the maximal discriminators, optimizing the generator of a 2-discriminator GAN helps avoid the mode collapse problem).
iv. Bridging the gap between theory and practice of GANs [Workshop]

Takeaways:
-new applications
-new breakthroughs in terms of theoretical results for convergence and solving the “mode collapse” problem.

I think overall I was exposed to a lot of interesting ideas, and hopefully I will be able to make time to go through each of these papers in further detail.

# The NIPS experience: Newbie edition

Over the past week, I attended (and presented at) one of the biggest conferences in Machine Learning: Neural Information Processing Systems (NIPS) 2017 in Long Beach, California, and the experience was nothing short of exhilarating. There were a number of themes that I made note of, and one blog post is not enough to illustrate them all. So I’ll try to enforce some structural sparsity here to reduce the complexity of this text.

1. NIPS 2017 was humongous.

About 8000 people from academia and industry, thronging to the Long Beach Convention (epi)Center to talk about groundbreaking research. It was chaotic. Took me an hour of standing in line to just get my registration badge!

2. Star studded. Both in terms of people and sponsor companies.

3. GANs were an audience favorite. You know something is the new buzzword when companies turn it into a catchphrase and print it on a t-shirt (yes, I did manage to get one for myself!).

You can’t find a better endorsement! There was an entire track of talks specially dedicated to recent advancements in GANs.

4. Bridging the gap between Theory and Practice.

Ali Rahimi’s Test of Time award talk was recommended to me by multiple people, for multiple viewings. The entire focus of the talk was on improving the current brittle algorithmic frameworks by theoretically analyzing the entire optimization problem, rather than treating it like alchemy. There was also an entire workshop dedicated to this theme.

5. “Where’s the party tonight?”

I was asked by at least 5 different people if I was attending a certain sponsor after-party. I had actually got invites to most and RSVP-ed as well, but I found myself extremely exhausted (also, running out of mingling-with-random-strangers stamina). In fact, there were people who were particularly interested in the parties and had no clue what the next talk was about. I guess beyond a point, a certain level of sponsor involvement could get worrisome.

6. “Do you want some swag?”

With so many sponsor booths, they had to try different strategies to attract the best minds around. Which meant, flashy sponsor swag (translation: goodies). You could collect enough t-shirts to get through 2 weeks without laundry. These companies certainly know their target, deprived grad students, well.

7. Orals, spotlights and posters.

So much information to gather! NIPS this year had a record 678 accepted papers, with the main themes being Algorithms, Theory, Optimization, Reinforcement Learning, Applications, and GANs.

8. Even more orals and posters, in the form of numerous workshops. Also guest appearances from the Women in Machine Learning (WiML) community.

9. Debates and panels.

An interesting debate on the relevance of studying the interpretability problem sparked a conversation on the various interpretations of the term itself, and on whether the problem was well motivated to begin with.

10. A free flow of ideas from every corner.

Some highlights were talks (that I managed to attend) by Bertsekas, Goodfellow, etc. Couldn’t attend some of the morning ones though! And of course, there was a lot to take away from several of the poster sessions. I think I also learnt how to sell an idea better, through my own poster presentation.

I think overall, it was a great learning experience and incredible exposure for a first-timer like me. Hopefully, I will get a chance to visit again! I’m also going to write a part 2 of my experience, which will focus more on some of the more technical ideas that caught my attention at NIPS 2017. Should make a good follow-up read after this one! Watch out!

# The Chaos of Deep Learning

Source: xkcd

But maybe something that deepens our understanding of deep neural networks? And transports us to an uber-cool science fantasy? Well, maybe not the latter.

I have a background in physics, and I’ve been pursuing problems in machine learning for quite some time now. So my brain often tries to make connections (eh? eh? neural network puns anyone?) between the glorious physics literature, and what seems to be engineers (including myself) struggling to wade through a math dump and explain why deep networks work, theoretically.

My first step towards this was borrowing Nonlinear Dynamics and Chaos by Strogatz from my campus library, something I’ve been meaning to read for the longest time. And the next step (though this should have been the first one) was to see if there were other people who had been making these connections before me. And there were! So here are some interesting articles that I came across:

Now, I don’t know if I can do as good a job as these guys in simplifying the text, but I’ll surely be posting something on this shortly. Till then, do check these articles out!

# How interpretable is data?

I am finally done with my second semester towards my PhD, which means it’s time for sum-mer and some-more (or a-lot-more) research!

I happened to have two course projects that I only recently wrapped up, and they turned out to be somewhat related! The two topics were sparse principal component analysis (SPCA) and non-negative matrix factorization (NMF). Both are key tools that help interpret data better.

So wait. Given a set of data points, can’t we as humans do the intelligible task of interpretation? What do these data-interpretation tools do that we can’t?

The answer: they don’t do anything we can’t. They are just better at interpreting data at a larger scale. They’re like a self-organizing library. The librarian no longer has to assign books to particular sections; the books do that themselves (not that we want to put librarians out of business)!

Those familiar with machine learning will recognize this problem formulation as that of unsupervised learning: employ algorithms that make sense out of data! Principal component analysis does just that. It finds directions in the data, ordered by how much of the variation each one captures. The first principal direction has the maximum variation in data. Usually the first few principal components (at most $r$, where $r$ is the rank of the data matrix) are sufficient to explain most of the (variation in) data. Each of these “directions” is a weighted combination of the original features, with the weights reflecting the relative “importance” of each feature.
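
Here is a minimal sketch of that idea in plain Python: it estimates the first principal direction with power iteration on the sample covariance matrix and reports how much of the total variance that direction explains. The tiny dataset, the function names and the choice of power iteration are assumptions made for illustration (in practice one would call an SVD routine), and each data point is stored as a row here.

```python
import math


def covariance_matrix(points):
    """Sample covariance of a list of points (one point per row)."""
    n, d = len(points), len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(d)]
    centered = [[p[j] - means[j] for j in range(d)] for p in points]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def first_principal_direction(cov, iters=100):
    """Power iteration: repeatedly apply cov and renormalize.

    Starts from the all-ones direction, which must not be orthogonal to
    the top eigenvector for the iteration to converge to it.
    """
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    # Variance captured along v (the Rayleigh quotient v^T cov v).
    variance = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
    return v, variance


# Hypothetical 2-D data lying close to the line y = x.
points = [[-2.0, -2.1], [-1.0, -0.9], [0.0, 0.1], [1.0, 1.0], [2.0, 1.9]]
cov = covariance_matrix(points)
direction, variance = first_principal_direction(cov)
total_variance = sum(cov[i][i] for i in range(len(cov)))
explained_ratio = variance / total_variance
print("first principal direction:", direction)
print("fraction of variance explained:", explained_ratio)
```

The two entries of `direction` are the weights on the two features. Here they come out nearly equal, since the data varies almost entirely along the diagonal.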

Mathematically speaking, the PCA problem boils down to the singular value decomposition,

$M_{d\times n} = (U\Sigma)_{d\times r} V^T_{r \times n}$

where our data matrix $M$ is assumed to have rank $r$, so that its columns lie in an $r$-dimensional subspace. Sparse PCA additionally assumes that the right singular vectors, which are the columns of $V$, are sparse.

The non-negative matrix factorization problem is similar. A non-negative matrix $M$ is decomposed into a product of non-negative matrices $W$ and $H$,

$M_{d\times n} = W_{d\times r} H_{r \times n}$
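
One common way to compute such a factorization (not necessarily the one I used in my project) is the multiplicative update rule of Lee and Seung, which keeps every entry of $W$ and $H$ non-negative by construction. The sketch below is a plain-Python version for small matrices; the helper names, the random initialization and the small constant `eps` guarding against division by zero are my own assumptions.

```python
import random


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def frobenius_error(M, W, H):
    """Squared Frobenius norm of M - W H."""
    WH = matmul(W, H)
    return sum((m - p) ** 2 for rm, rp in zip(M, WH) for m, p in zip(rm, rp))


def nmf(M, r, iters=200, eps=1e-12, seed=0):
    """Multiplicative updates for M (d x n) ~ W (d x r) H (r x n)."""
    rng = random.Random(seed)
    d, n = len(M), len(M[0])
    W = [[rng.uniform(0.1, 1.0) for _ in range(r)] for _ in range(d)]
    H = [[rng.uniform(0.1, 1.0) for _ in range(n)] for _ in range(r)]
    errors = [frobenius_error(M, W, H)]
    for _ in range(iters):
        # H <- H * (W^T M) / (W^T W H), elementwise
        Wt = transpose(W)
        num, den = matmul(Wt, M), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(rh, ra, rb)]
             for rh, ra, rb in zip(H, num, den)]
        # W <- W * (M H^T) / (W H H^T), elementwise
        Ht = transpose(H)
        num, den = matmul(M, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(rw, ra, rb)]
             for rw, ra, rb in zip(W, num, den)]
        errors.append(frobenius_error(M, W, H))
    return W, H, errors


# Hypothetical 4 x 3 non-negative matrix that has an exact rank-2 factorization.
M = [[1.0, 2.0, 0.0],
     [0.0, 1.0, 3.0],
     [1.0, 3.0, 3.0],
     [2.0, 5.0, 3.0]]
W, H, errors = nmf(M, r=2)
print("error before:", errors[0])
print("error after: ", errors[-1])
```

Since the updates only ever multiply by non-negative ratios, no projection step is needed to keep the factors non-negative. The factorization itself is not unique, so a different seed can produce different $W$ and $H$.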

The basic concept utilized in both of these methods is the same: most data has an underlying structure. Imposing the knowledge of this structure should help us extract meaningful information about this data.

Like what? For example, in a text dataset, most articles focus on a few core topics. Further, these core topics can be represented using a few core words. This has spurred several cool applications, such as detection of trends on social media. In image processing, it has useful applications in segmentation, where images are represented as a sum or weighted sum of components. Another application is the demixing of audio signals. The list goes on and on, and I bet you can already sense the theme in this one.