Why we need regularization

As a deep neural network grows more complex, it becomes prone to over-fitting, so we need techniques to counter it. One common solution is regularization. There are several regularization methods; this essay discusses the most widely used one, L2 regularization.

How to do regularization

Regularization may sound mysterious, but it is just an extra term added to the original cost function. Let's first review the cost function without regularization:

$J = -\frac{1}{m} \sum\limits_{i = 1}^{m} \large{(}\small y^{(i)}\log\left(a^{[L](i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right) \large{)} \tag{1}$

Then, let's view the cost function with regularization:

$J_{regularized} = \small \underbrace{-\frac{1}{m} \sum\limits_{i = 1}^{m} \large{(}\small y^{(i)}\log\left(a^{[L](i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right) \large{)} }_\text{cross-entropy cost} + \underbrace{\frac{1}{m} \frac{\lambda}{2} \sum\limits_l\sum\limits_k\sum\limits_j W_{k,j}^{[l]2} }_\text{L2 regularization cost} \tag{2}$

In this equation, $\lambda$ is called the regularization parameter; it is a hyper-parameter. Different values of $\lambda$ produce different models.
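As a rough illustration, equations 1 and 2 can be sketched in NumPy as follows. The function names, the `weights` list of per-layer matrices and the toy values are assumptions made for this example; they are not taken from any particular library.

```python
import numpy as np

def cross_entropy_cost(a_L, y):
    """Equation (1): mean binary cross-entropy over the m examples."""
    m = y.shape[1]
    return float(-np.sum(y * np.log(a_L) + (1 - y) * np.log(1 - a_L)) / m)

def l2_regularized_cost(a_L, y, weights, lam):
    """Equation (2): cross-entropy cost plus (lambda / 2m) * sum of squared weights."""
    m = y.shape[1]
    # Sum the squared entries of every layer's weight matrix.
    squared_sum = sum(float(np.sum(np.square(W))) for W in weights)
    return cross_entropy_cost(a_L, y) + (lam / (2 * m)) * squared_sum

# Toy values, chosen only for illustration.
a_L = np.array([[0.9, 0.2, 0.7]])   # network outputs, shape (1, m)
y = np.array([[1.0, 0.0, 1.0]])     # labels, shape (1, m)
weights = [np.array([[1.0, -2.0], [0.5, 0.0]]), np.array([[3.0, -1.0]])]

plain = cross_entropy_cost(a_L, y)
regularized = l2_regularized_cost(a_L, y, weights, lam=0.1)
```

With $\lambda = 0$ the two costs coincide; for any positive $\lambda$ the regularized cost is larger by exactly the L2 term.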

The effect on gradient descent

In deep learning, gradient descent is usually used to find good values for the parameter matrix W. Let's first review the gradient descent update for W:

$w := w - \alpha\frac{\partial J(w, b)}{\partial w} \tag{3}$

$\frac{\partial J}{\partial w} = \frac{1}{m}X(A-Y)^T \tag{4}$

If we differentiate the regularized cost function instead, the partial derivative becomes:

$\frac{\partial J}{\partial w} = \frac{1}{m}X(A-Y)^T + \frac{\lambda}{m}w \tag{5}$
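
The extra term comes from differentiating the L2 penalty of equation 2 with respect to a single weight. Written for one weight vector $w$ (an assumption that matches the notation of equations 3 and 4), the step is:

$\frac{\partial}{\partial w_j}\left(\frac{\lambda}{2m}\sum\limits_k w_k^2\right) = \frac{\lambda}{2m}\cdot 2w_j = \frac{\lambda}{m}w_j$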

Substituting equation 5 into equation 3 gives:

$w := w(1 - \alpha\frac{\lambda}{m}) - \frac{\alpha}{m}X(A-Y)^T \tag{6}$

From equation 6, since $\alpha$, $\lambda$ and $m$ are all positive, the factor $1 - \alpha\frac{\lambda}{m}$ is less than 1. Every update therefore shrinks W slightly before the usual gradient step is applied, so the final W ends up smaller than it would without regularization. The larger $\lambda$ is, the smaller the final W.
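A minimal NumPy sketch can check that equation 6 is just equation 3 with the gradient of equation 5 plugged in. It assumes a single logistic unit with X of shape (n, m), Y of shape (1, m) and w of shape (n, 1); the function names and random toy data are illustrative only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_gradient(w, b, X, Y, lam):
    """Equation (5): data gradient plus (lambda / m) * w."""
    m = X.shape[1]
    A = sigmoid(w.T @ X + b)                  # activations, shape (1, m)
    return X @ (A - Y).T / m + (lam / m) * w

def step_eq3(w, b, X, Y, alpha, lam):
    """Equation (3) applied to the regularized gradient."""
    return w - alpha * regularized_gradient(w, b, X, Y, lam)

def step_eq6(w, b, X, Y, alpha, lam):
    """Equation (6): shrink w first, then take the plain gradient step."""
    m = X.shape[1]
    A = sigmoid(w.T @ X + b)
    return w * (1 - alpha * lam / m) - (alpha / m) * (X @ (A - Y).T)

# Toy data, chosen only for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))
Y = np.array([[1.0, 0.0, 1.0, 1.0, 0.0]])
w = rng.normal(size=(3, 1))

w_from_eq3 = step_eq3(w, 0.0, X, Y, alpha=0.1, lam=0.7)
w_from_eq6 = step_eq6(w, 0.0, X, Y, alpha=0.1, lam=0.7)
```

Both functions return the same updated weights up to floating-point rounding; the second form simply makes the shrink factor explicit.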

Why Regularization can reduce over-fitting

To answer this question intuitively, we start from a basic observation: a trained machine learning model falls into one of three cases: "High bias", "Just right" and "High variance".

Figure: Machine-learning cases

Our target is "Just right", and regularization is used to reduce the third case, "High variance".

According to the derivation in the last section, as $\lambda$ gets bigger, the final W gets smaller. If $\lambda$ becomes large enough, W approaches zero. The whole network then behaves like a much simpler model, similar to logistic regression, because most of its weights are close to 0. So we can look for a middle value of $\lambda$ that gives the "Just right" case.
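The shrinking effect can be observed on a small experiment. The sketch below trains a single logistic unit (not a deep network) on synthetic data with several values of $\lambda$ and records the norm of the learned weights; the data, learning rate, step count and `train` helper are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, Y, lam, alpha=0.1, steps=2000):
    """Gradient descent with the L2-regularized gradient of equation (5)."""
    n, m = X.shape
    w = np.zeros((n, 1))
    b = 0.0
    for _ in range(steps):
        A = sigmoid(w.T @ X + b)
        dw = X @ (A - Y).T / m + (lam / m) * w   # equation (5)
        db = float(np.sum(A - Y) / m)            # the bias is not regularized
        w = w - alpha * dw
        b = b - alpha * db
    return w, b

# Synthetic, noisy two-feature data, chosen only for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(2, 50))
Y = (X[0:1] + 0.5 * X[1:2] + 0.3 * rng.normal(size=(1, 50)) > 0).astype(float)

lambdas = [0.0, 1.0, 10.0, 100.0]
norms = [float(np.linalg.norm(train(X, Y, lam)[0])) for lam in lambdas]
for lam, norm in zip(lambdas, norms):
    print(f"lambda={lam:6.1f}  ||w||={norm:.4f}")
```

The printed norms decrease as $\lambda$ grows. In practice, $\lambda$ is usually picked by comparing performance on a held-out validation set rather than by looking at the weight norm itself.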

The Regularization method to reduce over-fitting
