The Chaos of Deep Learning

[Image: xkcd comic on chaos. Source: xkcd]

But maybe something that deepens our understanding of deep neural networks? And transports us to an uber-cool science fantasy? Well, maybe not the latter.

I have a background in physics, and I’ve been pursuing problems in machine learning for quite some time now. So my brain often tries to make connections (eh? eh? neural network puns, anyone?) between the glorious physics literature and what often looks like engineers (myself included) struggling to wade through a math dump to explain, theoretically, why deep networks work.

My first step towards this was borrowing a book from my campus library, Nonlinear Dynamics and Chaos by Strogatz, something I’ve been meaning to read for the longest time. The next step (though this should probably have been the first one) was to see if other people had been making these connections before me. And they had! So here are some interesting articles that I came across:

  1. How to Explain Deep Learning using Chaos and Complexity
  2. Understanding the depth in deep learning through the lense of chaos theory

Now, I don’t know if I can do as good a job as these guys in simplifying the text, but I’ll surely be posting something on this shortly. Till then, do check these articles out!


How interpretable is data?

I am finally done with my second semester towards my PhD, which means it’s time for sum-mer and some-more (or a-lot-more) research!

I happened to have two course projects that I only recently wrapped up, and they turned out to be somewhat related! The two topics: sparse principal component analysis (SPCA) and non-negative matrix factorization (NMF). Both are key tools for interpreting data better.

So wait. Given a set of data points, can’t we as humans do the intelligent task of interpretation ourselves? What do these data-interpretation tools do that we can’t?

The answer: they don’t do anything we can’t. They are just better at interpreting data at a much larger scale. They’re like a self-organizing library: the librarian no longer has to assign books to particular sections, because the books do that themselves (not that we want to put librarians out of business)!

Those familiar with machine learning will immediately recognize this problem formulation as unsupervised learning: employ algorithms that make sense out of data! Principal component analysis does just that. It represents the variation in the data along directions ordered by decreasing importance: the first principal direction captures the maximum variation in the data. Usually the first few principal components (this number is typically \leq r, where r is the rank of the data matrix) are sufficient to explain most of the (variation in the) data. Now, these “directions” are composed of the relative “importance” of their constituent features.

Mathematically speaking, the PCA problem boils down to the singular value decomposition,

M_{d\times n} = (U\Sigma)_{d\times r} V^T_{r \times n}

where our data matrix M is assumed to lie in a lower-dimensional subspace of rank r. Sparse PCA additionally assumes that the right singular vectors, the columns of V, are sparse.
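
If you want to play with this, here’s a minimal numpy sketch of plain PCA via the SVD (sparse PCA would add a sparsity penalty on the columns of V, e.g. scikit-learn’s SparsePCA). The data matrix and the rank below are made up purely for illustration:

```python
import numpy as np

# Toy data matrix M (d features x n samples), built to have rank r.
d, n, r = 5, 100, 2
rng = np.random.default_rng(0)
M = rng.standard_normal((d, r)) @ rng.standard_normal((r, n))

# Center each feature before PCA.
Mc = M - M.mean(axis=1, keepdims=True)

# SVD: Mc = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(Mc, full_matrices=False)

# Keep the first r principal directions: U[:, :r] * S[:r] plays the role of
# (U Sigma), and Vt[:r, :] the role of V^T in M = (U Sigma) V^T.
M_approx = (U[:, :r] * S[:r]) @ Vt[:r, :]

# Fraction of total variance explained by the first r components.
explained = (S[:r] ** 2).sum() / (S ** 2).sum()
print(f"Reconstruction error: {np.linalg.norm(Mc - M_approx):.2e}")
print(f"Variance explained by the first {r} components: {explained:.3f}")
```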

The non-negative matrix factorization problem is similar: a non-negative matrix can be (approximately) decomposed into non-negative matrices W, H,

M_{d\times n} = W_{d\times r} H_{r \times n}
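
And a corresponding sketch for NMF, this time leaning on scikit-learn; the matrix sizes and the nndsvd initialization are just illustrative choices:

```python
import numpy as np
from sklearn.decomposition import NMF

# Non-negative toy data of size d x n, approximately of rank r.
d, n, r = 20, 50, 3
rng = np.random.default_rng(0)
M = rng.random((d, r)) @ rng.random((r, n))  # non-negative by construction

# Factor M ~ W H with W (d x r) and H (r x n) both non-negative.
model = NMF(n_components=r, init="nndsvd", max_iter=500)
W = model.fit_transform(M)  # d x r
H = model.components_       # r x n

print(W.shape, H.shape)
print(f"Relative error: {np.linalg.norm(M - W @ H) / np.linalg.norm(M):.3e}")
```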

The basic concept utilized in both of these methods is the same: most data has an underlying structure. Imposing the knowledge of this structure should help us extract meaningful information about this data.

Like what? For example, in a text dataset, most articles focus on a few core topics, and these core topics can in turn be represented using a few core words. This has spurred several cool applications, such as detecting trends on social media. In image processing, it is useful for segmentation, where images are represented as a sum or weighted sum of components. Then there is the demixing of audio signals. The list goes on and on, and I bet you can already sense the theme in this one.


Optimization in Life

I’ve been doing a lot of optimization-related work and courses for my PhD, most notably in convex optimization and non-linear programming. They say that the best way to learn theory is to implement it in real life, so I thought it wouldn’t hurt to find ways to optimize… life… eh? On that optimistic enough thought, here we go:

  1. The steepest descent is not necessarily the fastest. A common thing people do when they are in an unwanted situation is the stark opposite of what they were doing before, i.e. -\nabla f(x). This seems to be a go-to solution for minimizing conflict. However, it is well known that to reach the point of minimum (conflict), steepest descent can take far more iterations than other gradient-based methods. So take it easier, guys. Extremeness is not a smart option.
  2. When bogged down by multiple issues, solve one-problem at-a-time. Coordinate descent is an approach in which the objective (life’s problems) is minimized w.r.t. one coordinate at a time, keeping the others fixed. It’s known for its simplicity of implementation.
  3. The apple does not fall far from the tree. So when Newton came up with his method for optimizing functions, the successive estimates did not fall far from the optimum, most notably in the case of quadratic functions. It turns out that it helps to approximate the function at each point with a quadratic estimate, and then to minimize that quadratic estimate. Basically, take a problem and convert it into an easier sub-problem that has a known minimum, then move on to the next sub-problem. This fetches you the optimum much faster (see the sketch after this list).
  4. While positive definiteness is ideal, positive semi-definiteness is good too. If the Hessian of the function to be minimized is positive semi-definite, then the function is convex and can be minimized easily (any local optimum is a global optimum). So keep calm (and kinda positive) and minimize issues.
  5. Often when there are too many parameters to handle, we tend to overfit a fairly complicated model to our life. In such cases, it is a good idea to penalize over-complication by adding a regularization term. Regularization also helps in solving an ill-posed problem. If we tend to focus on only a specific set of problems, we forget other facets of life, which leads us into making poorer choices. The key is to find the right balance or trade-off.
  6. Some problems actually have closed-form or unique solutions. There’s just one possible answer which is apparent enough. In that scenario the optimal strategy should be to stop optimizing. Stop contemplating, just go-get-it!
  7. On a closing note, heuristically speaking, one would need to try out a bunch of optimizing techniques to find the optimal optimization technique.
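
To make points 1 and 3 concrete, here’s a small numpy sketch on a made-up, badly conditioned quadratic f(x) = \frac{1}{2}x^T A x - b^T x: steepest descent with a fixed step needs on the order of a couple of thousand iterations, while Newton’s method lands on the minimizer in a single step.

```python
import numpy as np

# A made-up, badly conditioned convex quadratic: f(x) = 0.5 x^T A x - b^T x.
A = np.diag([1.0, 100.0])       # positive definite Hessian, condition number 100
b = np.array([1.0, 1.0])
x_star = np.linalg.solve(A, b)  # the true minimizer

def grad(x):
    return A @ x - b

# Point 1: steepest descent with a fixed step 1/L (L = largest eigenvalue).
x = np.zeros(2)
step = 1.0 / 100.0
for k in range(5000):
    x = x - step * grad(x)
    if np.linalg.norm(x - x_star) < 1e-8:
        break
print(f"Steepest descent: {k + 1} iterations to reach the minimum")

# Point 3: Newton's method, x <- x - H^{-1} grad f(x); exact in one step on a quadratic.
x = np.zeros(2)
x = x - np.linalg.solve(A, grad(x))
print(f"Newton's method: error after one step = {np.linalg.norm(x - x_star):.2e}")
```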
[Image: xkcd comic on optimization. Source: XKCD]

To make this post even more meta, how optimal would it be if the moral of this post converged to this statement?

Why Murphy was probably right

So, there’s this law by Murphy that most of you must be aware of.

If anything can go wrong, it will.

Now, the origin of Murphy’s law is quite well explained here.

And so goes the original Murphy’s law:

If there are two or more ways to do something, and one of those ways can result in a catastrophe, then someone will do it.

Now the situation that gave rise to this quote is something like this.

Edward A. Murphy, Jr. was one of the engineers on the rocket-sled experiments that were done by the U.S. Air Force in 1949 to test human acceleration tolerances (USAF project MX981). One experiment involved a set of 16 accelerometers mounted to different parts of the subject’s body. There were two ways each sensor could be glued to its mount, and somebody methodically installed all 16 the wrong way around. Murphy then made the original form of his pronouncement, which the test subject (Major John Paul Stapp) quoted at a news conference a few days later. (Source)

I’d think the odds of failure were quite high. How?

The person in charge of installing the accelerometers can be called Mike. Why? It’s a standard enough name. Now, Mike probably wasn’t the smartest guy. The possibilities:

  1. He did not know which side of each sensor went where and installed all the accelerometers at random, using no common sense, and failed to hit the right combination = 0.5 \times (1 - 0.5^{16}) [FAIL]
  2. He did not know which side went where and installed all the accelerometers at random, using no common sense, but luckily hit the right combination = 0.5 \times 0.5^{16} [SUCCESS]
  3. He had the sides interchanged but installed them all the same way. Well, at least he had the common sense to be consistent = 0.5 \times 0.5 [FAIL]
  4. Mike was smart. He got the sides right and had the sense to install them all the right way = 0.5 \times 0.5 [SUCCESS]

Let’s give Mike the benefit of the doubt. Maybe he was smart; let’s assign that a probability of 0.5. The probability that Mike was dumb is then also 0.5.
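
Adding up the two failing cases from the list above (cases 1 and 3):

P(\text{failure}) = 0.5 \times (1 - 0.5^{16}) + 0.5 \times 0.5 \approx 0.5 + 0.25 = 0.75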

The probability of failure is then approximately 0.75. And in any situation, even an ideal one, the probability of the whole chain of events leading to a success, when multiplied out, is quite low.

Let’s look at it this way. The event: Me getting a sound night’s sleep. Shouldn’t be hard right?

Why it doesn’t work: I have a roommate who keeps talking loudly on the phone till the wee hours. Why would I have a roommate? I am a research assistant; we don’t get paid well enough for me to afford a better room. Why am I a research assistant? I want to do a PhD. Why do I want to do a PhD? You get the drift.

Turns out, I was almost destined to have a painful right ear, being subjected to continuous loud mindless rants in the middle of the night. The consequences of a lot of our actions aren’t really predictable until events transpire in due course of time. But when they do happen, it’s not that hard to chart out the trajectory of what might have caused them. And so, if anything wrong can happen, perhaps your brain is able to trace that trajectory in advance to forecast what will go wrong.

Here’s the catch though. When things are expected to go wrong and they don’t, we are so happy with the outcome, we barely recognize it as a failure of the law. So Murphy was a genius, in framing a law whose exceptions would go easily unnoticed. Whoever thought that something so iconic would come out of so much pessimism?

Then again, as Phil Dunphy from Modern Family would say…

 

The Quantum Key to Understanding

I had taken a course on Quantum Information and Computation during my undergrad, and I learnt a lot of cool encryption strategies. For those new to the field of cryptography, the objective is to encode information using a shared key between an encoder (Alice) and a decoder (Bob). Anyone who doesn’t have the shared key should not be able to decode this information. Of course, an eavesdropper (Eve) may iteratively try out several different keys to decode the message. The ease of decoding by a third party determines the robustness (or lack thereof) of the encryption strategy.

Now, in quantum information theory, a bit of information is encoded in the spin of a particle (for instance, an electron can have a spin quantum number of +\frac{1}{2} or -\frac{1}{2}).

These two states are orthogonal to each other, as if the electron suffers from a split-personality disorder. It can have either a positive spin (0) or a negative spin (1), but not both at the same time. The glass is either full or empty. There are probabilities associated with both events. Since these two events constitute a partition, their probabilities add up to 1; here, they are \frac{1}{2} each.

Now suppose the electron decides to get a better perspective. For half the time it has a positive spin, and for the remaining half, a negative one (0H and 1H). A glass-half-full or glass-half-empty kind of situation.

This forms the Hadamard basis. I suppose Hadamard was a rational guy*.

[Image: the Hadamard basis states]
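
For reference, the two Hadamard basis states are equal superpositions of the two standard basis states:

|0_H\rangle = \frac{1}{\sqrt{2}}\left(|0\rangle + |1\rangle\right), \qquad |1_H\rangle = \frac{1}{\sqrt{2}}\left(|0\rangle - |1\rangle\right)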

Let’s figure out the whole encoding and decoding strategy.

[Figure: ALICE’s encoding strategy]

Now, the probability that the eavesdropper Eve picks the same measurement basis as the encoder Alice is \frac{1}{2} (because there are only two possible bases: the standard basis and the Hadamard basis). If she does, she correctly decodes the bit encoded by Alice. If instead Eve chooses the wrong basis (again with probability \frac{1}{2}), the probability that she still guesses the right bit is \frac{1}{2}.

So, Probability ( Eve guesses correctly ) = \frac{1}{2} + \frac{1}{2} \times \frac{1}{2} = \frac{3}{4}.

Probability ( failure of encryption ) = Probability ( Eve guessing correctly ) = \frac{3}{4}.

Pretty high, huh? Well, luckily math is on our side. One would rarely encode information in a single bit, right? Most messages are 10s and 100s of bits long. Maybe even more! Let’s see how the problem works out then.

For the encryption to fail, it must fail for every single bit.

Probability ( failure of encryption )
= Probability ( bit 1 fails ) \times \cdots \times Probability ( bit n fails )
= (\frac{3}{4})^n.

Probability ( success of encryption )
= 1 - Probability ( failure of encryption )
= 1 - (\frac{3}{4})^n.

This value approaches 1 as n approaches \infty. Even for n = 10, the probability of success of the encryption is about 0.944. See how the tables turned? What are the odds of that happening? (Well, you know the answer now.) Classic quantum mess-around. What I discussed here is also called the BB84 quantum key distribution protocol. You can read more about it here: BB84.
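
As a quick sanity check on the \frac{3}{4} per-bit figure and the (\frac{3}{4})^n message-level figure, here’s a small Monte Carlo sketch of the intercept-and-guess model above; the message length and the number of trials are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n_bits, n_trials = 10, 200_000

# Alice's random bits and basis choices (0 = standard basis, 1 = Hadamard basis),
# and Eve's independent basis guesses.
bits = rng.integers(0, 2, size=(n_trials, n_bits))
alice_basis = rng.integers(0, 2, size=(n_trials, n_bits))
eve_basis = rng.integers(0, 2, size=(n_trials, n_bits))

# Same basis: Eve reads the bit exactly. Wrong basis: her outcome is a 50/50 coin flip.
coin = rng.integers(0, 2, size=(n_trials, n_bits))
eve_guess = np.where(alice_basis == eve_basis, bits, coin)

per_bit = (eve_guess == bits).mean()
whole_message = (eve_guess == bits).all(axis=1).mean()

print(f"Per-bit success rate for Eve:       {per_bit:.4f} (theory: 0.75)")
print(f"Whole-message success rate for Eve: {whole_message:.4f} (theory: {0.75 ** n_bits:.4f})")
```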

Now this is a brilliant way to look at situations in life, in general, isn’t it? Aren’t the events in life also probabilistic? I follow this up in my next post: Why Murphy was probably right.

*Side note: There’s a disturbing lack of females in applied mathematics. I’d most naturally tend to assume that a mathematician is a guy. Here’s an article on how, even though women exist in science, they generally take up positions in biology and healthcare-related fields instead of more mathematically gruesome areas: women in science.