Neural network optimization methods (gradient descent method, Nesterov, Adagrad, RMSProp and Adadelta, Adam, Adamax)

Lecture

In the overwhelming majority of sources on neural networks, “now let's train our network” means “feed the objective function to the optimizer”, with at most a learning rate setting. It is sometimes mentioned that the weights of the network can be updated by methods other than stochastic gradient descent, but with no explanation of what is remarkable about the other algorithms or what the mysterious symbols in their parameters mean. Even teachers in machine learning courses often do not focus on this. I would like to fill the gap in Runet on the various optimizers that you may encounter in modern machine learning packages. I hope this article will be useful to people who want to deepen their understanding of machine learning or even invent something of their own.


Below the cut there are many pictures, including animated GIFs.

The article is aimed at a reader familiar with neural networks. It is assumed that you already understand the essence of backpropagation and SGD. I will not go into rigorous proofs of convergence for the algorithms presented below; on the contrary, I will try to convey their ideas in simple language and show that the formulas are open to further experiments. The article covers neither all the difficulties of machine learning nor all the ways of overcoming them.

Why do we need tricks

Let me remind you what the formulas for ordinary gradient descent look like:

$$\Delta\theta = \eta \nabla_\theta J(\theta), \qquad \theta = \theta - \Delta\theta$$

where $\theta$ are the network parameters, $J(\theta)$ is the objective function (the loss function, in the case of machine learning), and $\eta$ is the learning rate. It looks amazingly simple, but a lot of the magic is hidden in $\nabla_\theta J(\theta)$: updating the parameters of the output layer is quite simple, but to get to the parameters of the layers behind it you have to pass through the nonlinearities, whose derivatives make their own contribution. This is the familiar principle of backward propagation of errors, backpropagation.
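As a minimal sketch of the update rule above, here is plain gradient descent on a toy one-dimensional objective. The objective $J(\theta) = (\theta - 3)^2$, the function names and the learning rate are assumptions chosen purely for illustration; a real network would obtain the gradient by backpropagation.

```python
def grad_J(theta):
    """Gradient of the toy objective J(theta) = (theta - 3)**2."""
    return 2.0 * (theta - 3.0)


def gradient_descent(theta, lr=0.1, steps=100):
    """Plain gradient descent: delta = lr * gradient, theta = theta - delta."""
    for _ in range(steps):
        delta = lr * grad_J(theta)
        theta = theta - delta
    return theta


theta_final = gradient_descent(theta=0.0)
print(theta_final)  # ends up close to the minimum at theta = 3
```

With this learning rate the distance to the minimum shrinks by a constant factor on every step; a rate above 1.0 would make the same loop diverge on this objective.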

Explicitly written formulas for updating the weights somewhere in the middle of the network look ugly, because each neuron depends on all the neurons it is connected to, those depend on all the neurons they are connected to, and so on. Even “toy” neural networks can have about 10 layers, and the networks that hold the top results on modern datasets have many, many more. Each weight is a variable in $J(\theta)$. Such an incredible number of degrees of freedom allows very complex mappings to be built, but gives researchers a headache:

Getting stuck in local minima or saddle points, of which there can be very many for a function of so many variables.
The complex landscape of the objective function: plateaus alternate with regions of strong nonlinearity. The derivative on a plateau is almost zero, while a sudden cliff, on the contrary, can send us too far.
Some parameters are updated much less often than others, especially when the data contain informative but rare features, which hurts the finer points of the rule the network generalizes. On the other hand, giving too much weight to all rare features in general can lead to overfitting.
Too small a learning rate makes the algorithm converge very slowly and get stuck in local minima; too large a rate makes it overshoot narrow global minima or diverge completely.

Computational mathematics knows advanced second-order algorithms that can find a good minimum even on a complex landscape, but here the number of weights strikes again. To use an honest second-order method head-on, you have to compute the Hessian, the matrix of second derivatives with respect to every pair of parameters (already bad), and, say, for Newton's method, also its inverse. We have to invent all sorts of tricks to cope with these problems while keeping the task computationally feasible. Working second-order optimizers do exist, but for now let's concentrate on what we can achieve without considering second derivatives.
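To get a feel for why a dense Hessian is "already bad", the following back-of-the-envelope sketch counts its entries. The parameter count of one million and the 4 bytes per value are illustrative assumptions, not figures from the article.

```python
def hessian_size(num_params, bytes_per_value=4):
    """Entries and memory (in bytes) of a dense num_params x num_params Hessian."""
    entries = num_params * num_params
    return entries, entries * bytes_per_value


entries, memory_bytes = hessian_size(10**6)
print(entries, memory_bytes / 1e12, "TB")
```

Even for this modest hypothetical network the matrix alone would need terabytes of memory, before any inversion is attempted.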

Nesterov Accelerated Gradient

By itself, the idea behind momentum methods is almost embarrassingly simple: "If we have been moving in a certain direction for a while, then we should probably keep moving there for some time." To do this, you need to be able to refer to the recent history of changes of each parameter. You could store the last few copies of the update and honestly compute their average at each step, but this approach takes too much memory when the number of copies is large. Fortunately, we do not need an exact average, only an estimate, so we use the exponential moving average.

$$v_t = \gamma v_{t-1} + (1 - \gamma) x \tag{4}$$

To accumulate something, we multiply the already accumulated value by the retention factor $0 < \gamma < 1$ and add the next value multiplied by $1 - \gamma$. The closer $\gamma$ is to one, the larger the accumulation window and the stronger the smoothing: the history of $x$ starts to matter more than each new $x$. If $x = 0$ from some moment on, $v_t$ decays as a geometric progression, that is, exponentially, hence the name. We apply the exponential moving average to accumulate the gradient of our network's objective function:

$$v_t = \gamma v_{t-1} + \eta \nabla_\theta J(\theta), \qquad \theta = \theta - v_t$$

where $\gamma$ is usually taken to be around $0.9$. Note that $1 - \gamma$ is not lost but absorbed into $\eta$; sometimes you can find a variant of the formula with the explicit multiplier. The smaller $\gamma$, the more the algorithm behaves like ordinary SGD. To get the popular physical interpretation of the equations, imagine a ball rolling along a hilly surface. If at moment $t$ there was a non-zero slope under the ball ($\nabla_\theta J(\theta) \neq 0$) and then it reached a plateau, it will keep rolling along this plateau. Moreover, the ball will keep moving in the same direction for a couple of updates even if the slope has reversed. However, viscous friction acts on the ball, and every second it loses $1 - \gamma$ of its speed. Here is what the accumulated momentum looks like for different $\gamma$ (here and below, epochs are plotted along the X axis, and the gradient value and the accumulated values along the Y axis):

[Figure: accumulated momentum for different $\gamma$]

Note that the value accumulated in $v_t$ can greatly exceed each individual $\eta \nabla_\theta J(\theta)$. Simple momentum accumulation already gives a good result, but Nesterov goes further and applies an idea well known in computational mathematics: looking ahead along the update vector. Since we are going to shift by $\gamma v_{t-1}$ anyway, let's compute the gradient of the loss function not at the point $\theta$ but at $\theta - \gamma v_{t-1}$. Hence:

$$v_t = \gamma v_{t-1} + \eta \nabla_\theta J(\theta - \gamma v_{t-1}), \qquad \theta = \theta - v_t \tag{6}$$

Such a change lets us “roll” faster if the derivative increases in the direction we are heading, and more slowly if it decreases. This is especially evident on the graph with the sine.

[Figure: Nesterov accelerated gradient on the test signals]
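To make the difference between plain momentum and the Nesterov look-ahead concrete, here is a minimal sketch on a toy quadratic objective $J(\theta) = \theta^2 / 2$. The objective, the function names and the hyperparameter values are assumptions chosen for illustration; they are not the setup behind the article's graphs.

```python
def grad_J(theta):
    """Gradient of the toy objective J(theta) = 0.5 * theta**2."""
    return theta


def momentum_step(theta, v, lr, gamma):
    """Classical momentum: the gradient is evaluated at the current point."""
    v = gamma * v + lr * grad_J(theta)
    return theta - v, v


def nesterov_step(theta, v, lr, gamma):
    """Nesterov: the gradient is evaluated at the look-ahead point theta - gamma * v."""
    v = gamma * v + lr * grad_J(theta - gamma * v)
    return theta - v, v


def run(step, theta=5.0, lr=0.1, gamma=0.9, steps=200):
    """Apply the given step function repeatedly, starting with zero momentum."""
    v = 0.0
    for _ in range(steps):
        theta, v = step(theta, v, lr, gamma)
    return theta


theta_momentum = run(momentum_step)
theta_nesterov = run(nesterov_step)
print(theta_momentum, theta_nesterov)  # both end up near the minimum at 0
```

The only difference between the two step functions is the point at which the gradient is taken, which is exactly the "idea, not the implementation" that can be varied further.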

Looking ahead can play a cruel joke on us if $\gamma$ and $\eta$ are set too large: we look so far ahead that we miss the regions with the opposite gradient sign:

[Figure: Nesterov accelerated gradient with too large $\gamma$ and $\eta$]

However, sometimes this behavior may be desirable. Once again I draw your attention to the idea, looking ahead, and not to its implementation. The Nesterov method (6) is the most obvious option, but not the only one. For example, you can use another technique from computational mathematics: stabilizing the gradient by averaging it over several points along the line we are moving along. Say, like this (the four-point variant implemented as SmoothedNM in the script at the end of the article):

$$v_t = \gamma v_{t-1} + \frac{\eta}{4} \sum_{k=1}^{4} \nabla_\theta J\left(\theta - \frac{k}{4} \gamma v_{t-1}\right)$$

Or like this:

[Equation: a second averaging variant; the formula image was not recovered]

Such a technique can help in the case of noisy objective functions.

We will not manipulate the argument of the objective function in the subsequent methods (although, of course, nobody stops you from experimenting). From here on, for brevity,

$$g_t \equiv \nabla_\theta J(\theta_t)$$

Adagrad

Many people have an idea of how momentum methods work. Let us turn to more interesting optimization algorithms. Let's start with the relatively simple Adagrad, the adaptive gradient.

Some features can be extremely informative but occur rarely. An exotic high-paying profession, a fancy word in a spam database: they easily drown in the noise of all the other updates. And it is not only about rare input parameters. You may well encounter, say, rare graphic patterns that turn into a feature only after passing through several layers of a convolutional network. It would be nice to be able to update parameters with an eye to how typical a feature they capture. This is not hard to achieve: for each network parameter, let's store the sum of the squares of its updates. It will act as a proxy for typicality: if a parameter belongs to a chain of frequently activated neurons, it is constantly pulled back and forth, so the sum accumulates quickly. Let's rewrite the update formula like this:

$$G_t = G_{t-1} + g_t^2$$
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} g_t \tag{12}$$

where $G_t$ is the sum of the squares of the updates and $\epsilon$ is a smoothing parameter needed to avoid division by 0. A parameter that was updated frequently in the past has a large $G_t$ and therefore a large denominator in (12). A parameter that has changed only once or twice is updated at full strength. $\epsilon$ is taken to be small (and smaller still for a completely aggressive update), but, as the graphs show, this matters only at the beginning; towards the middle of training $G_t$ starts to outweigh it:

[Figure: Adagrad updates for different $\epsilon$]

So, the idea of Adagrad is to use something that reduces the updates for elements we already update often. Nobody forces us to use this particular formula, which is why Adagrad is sometimes called a family of algorithms. For example, we can drop the root, or accumulate the absolute values of the updates rather than their squares, or replace the multiplier altogether with some other decreasing function of the accumulated sum.

(Admittedly, this requires experiments. If you drop the root, the updates start shrinking too quickly and the algorithm gets worse.)

Another advantage of Adagrad is that there is no need to tune the learning rate precisely. It is enough to set it moderately large, to leave a good margin, but not so huge that the algorithm diverges. In effect, we automatically get learning rate decay.
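The effect of the accumulated sum can be sketched for a single scalar parameter. The function name, the gradient sequences and the hyperparameter values below are assumptions made for illustration; the update itself follows the Adagrad rule described in this section, with the smoothing term under the root as in the script at the end of the article.

```python
import math


def adagrad_updates(grads, lr=0.1, eps=1e-8):
    """Return the Adagrad update for each gradient in the sequence.

    G accumulates the sum of squared gradients; each update is
    lr / sqrt(G + eps) * g.
    """
    G = 0.0
    updates = []
    for g in grads:
        G += g * g
        updates.append(lr / math.sqrt(G + eps) * g)
    return updates


# One parameter receives a gradient on every step, the other only on the last step.
frequent = adagrad_updates([1.0] * 100)
rare = adagrad_updates([0.0] * 99 + [1.0])

print(frequent[-1], rare[-1])
```

For the same final gradient, the rarely updated parameter moves about ten times further than the frequently updated one, which is the built-in learning rate decay at work.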

RMSProp and Adadelta

The drawback of Adagrad is that $G_t$ in (12) can grow without bound, which after a while leads to updates that are too small and to paralysis of the algorithm. RMSProp and Adadelta are designed to fix this drawback.

Let's modify the Adagrad idea: we still intend to update less the weights that are updated too often, but instead of the full sum of updates we will use the squared gradient averaged over the history. Once again we use the exponentially decaying moving average (4). Let $E[g^2]_t$ be the moving average at moment $t$:

$$E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma) g_t^2$$

then instead of (12) we get

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} g_t \tag{14}$$

The denominator is the root of the mean of the squared gradients, hence RMSProp: root mean square propagation.

[Figure: RMSProp updates for different $\gamma$]

Notice how the update rate recovers on the graph with the long teeth for different $\gamma$. Also compare the square-wave graphs for Adagrad and RMSProp: in the first case the updates shrink to zero, while in the second they level off.
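The contrast between Adagrad and RMSProp can be reproduced with a constant gradient. This is a minimal sketch for one scalar parameter; the function names, the constant input and the hyperparameter values are assumptions chosen for illustration.

```python
import math


def adagrad_updates(grads, lr=0.1, eps=1e-8):
    """Adagrad: divide by the root of the full sum of squared gradients."""
    G = 0.0
    updates = []
    for g in grads:
        G += g * g
        updates.append(lr / math.sqrt(G + eps) * g)
    return updates


def rmsprop_updates(grads, gamma=0.9, lr=0.1, eps=1e-8):
    """RMSProp: divide by the root of a running average of squared gradients."""
    avg_sq = 0.0
    updates = []
    for g in grads:
        avg_sq = gamma * avg_sq + (1.0 - gamma) * g * g
        updates.append(lr / math.sqrt(avg_sq + eps) * g)
    return updates


constant_gradient = [1.0] * 500
ada = adagrad_updates(constant_gradient)
rms = rmsprop_updates(constant_gradient)

print(ada[-1], rms[-1])
```

Under a constant gradient the Adagrad update keeps shrinking, while the RMSProp update settles at the learning rate because the running average stops growing.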

That is all there is to RMSProp. Adadelta differs from it in that we add to the numerator of (14) a stabilizing term proportional to the $RMS$ of $\Delta\theta$. At step $t$ we do not yet know the value of $RMS[\Delta\theta]_t$, so the parameters are updated in three stages rather than two: first we accumulate the squared gradient, then we update $\theta$, and after that we update $E[\Delta\theta^2]$.

$$RMS[x]_t = \sqrt{E[x^2]_t + \epsilon}$$
$$\Delta\theta = \frac{RMS[\Delta\theta]_{t-1}}{RMS[g]_t} g_t, \qquad \theta_{t+1} = \theta_t - \Delta\theta$$
$$E[\Delta\theta^2]_t = \gamma E[\Delta\theta^2]_{t-1} + (1 - \gamma) \Delta\theta^2$$

This change is made on the grounds that the dimensions (units) of $\theta$ and $\Delta\theta$ must match. Note that the learning rate is dimensionless, which means that in all the previous algorithms we were adding a dimensional quantity to a dimensionless one. Physicists would be horrified at this point, but we shrug: it works.

Note that we need a non-zero initial $RMS[\Delta\theta]$ for the first step, otherwise all subsequent $\Delta\theta$, and hence $E[\Delta\theta^2]_t$, will be zero. But we solved this problem earlier by adding $\epsilon$ to $RMS$. Another matter is that without an explicitly large initial $RMS[\Delta\theta]$ we get behavior opposite to that of Adagrad and RMSProp: we will update more strongly (up to a certain limit) the weights that are used more often. After all, for $RMS[\Delta\theta]$ to become significant, a parameter now has to accumulate a large sum in the numerator of the fraction.

Here are the graphs for a zero initial $E[\Delta\theta^2]$:

[Figure: Adadelta updates with a zero initial $E[\Delta\theta^2]$]

And here for a large one:

[Figure: Adadelta updates with a large initial $E[\Delta\theta^2]$]

However, it seems that the authors of the algorithm were aiming for exactly this effect. For RMSProp and Adadelta, as for Adagrad, the learning rate does not need to be tuned very precisely: a rough value is enough. The usual advice is to start by tuning $\eta$ and to leave $\gamma$ as it is. The closer $\gamma$ is to $1$, the longer RMSProp and Adadelta with a large initial $RMS[\Delta\theta]$ will strongly update rarely used weights. If $\gamma$ is close to $1$ and the initial $RMS[\Delta\theta]$ is zero, Adadelta will be "distrustful" of rarely used weights for a long time. The latter can lead to paralysis of the algorithm, or it can produce deliberately "greedy" behavior, where the algorithm first updates the neurons that encode the best features.
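The three-stage Adadelta update and the role of the smoothing term can be sketched for one scalar parameter as follows. The function and argument names are assumptions made for illustration; the structure mirrors the Adadelta function in the script at the end of the article.

```python
import math


def adadelta_updates(grads, gamma=0.9, eps=1e-3, initial_avg_sq_update=0.0):
    """Return the Adadelta update for each gradient in the sequence."""
    avg_sq_grad = 0.0
    avg_sq_update = initial_avg_sq_update
    updates = []
    for g in grads:
        # Stage 1: accumulate the squared gradient.
        avg_sq_grad = gamma * avg_sq_grad + (1.0 - gamma) * g * g
        # Stage 2: compute the update from the previous average of squared updates.
        update = math.sqrt(avg_sq_update + eps) / math.sqrt(avg_sq_grad + eps) * g
        updates.append(update)
        # Stage 3: accumulate the squared update for the next step.
        avg_sq_update = gamma * avg_sq_update + (1.0 - gamma) * update * update
    return updates


with_smoothing = adadelta_updates([1.0] * 50)
without_smoothing = adadelta_updates([1.0] * 5, eps=0.0)

print(with_smoothing[0], with_smoothing[-1])
print(without_smoothing)  # every update is zero: nothing ever enters the numerator
```

With both the smoothing term and the initial average set to zero the numerator never leaves zero, which is the paralysis described above; a non-zero smoothing term is enough to get the updates started.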

Adam

Adam (adaptive moment estimation) is yet another optimization algorithm. It combines the idea of momentum accumulation with the idea of weaker weight updates for typical features. Recall (4) again:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t \tag{20}$$

Adam differs from Nesterov in that we accumulate not $\Delta\theta$ but the gradient values, although this is a purely cosmetic change, see (23). In addition, we want to know how often the gradient changes. For this, the authors of the algorithm proposed to also estimate the mean uncentered variance:

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \tag{21}$$

It is easy to see that this is the already familiar $E[g^2]_t$, so in essence there is no difference from RMSProp here.

An important difference lies in the initial calibration of $m_t$ and $v_t$. They suffer from the same problem as $E[g^2]_t$ in RMSProp: if initialized to zero, they take a long time to accumulate, especially with a large accumulation window ($\beta_1$ and $\beta_2$ close to $1$), while any non-zero initial values are two more hyperparameters. Nobody wants two more hyperparameters, so we artificially increase $m_t$ and $v_t$ during the first steps (roughly $0 < t < 10$ for $m_t$ and $0 < t < 1000$ for $v_t$):

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \tag{22}$$

In the end, the update rule is:

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t + \epsilon}} \hat{m}_t \tag{23}$$

Here you should look closely at how quickly the update values synchronized on the first teeth of the graphs with rectangles, and at the smoothness of the update curve on the sine graph, which we got "for free". With the recommended value of the parameter, the graph with spikes shows that sharp bursts of the gradient do not cause an instant response in the accumulated value, so a well-tuned Adam does not need gradient clipping.
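Putting the pieces together, a minimal sketch of the Adam update for one scalar parameter might look like this. The function name and the test input are assumptions; the smoothing term is kept under the root to match the script at the end of the article, whereas many implementations add it to the root instead.

```python
import math


def adam_updates(grads, beta1=0.9, beta2=0.999, lr=0.001, eps=1e-8):
    """Return the Adam update for each gradient in the sequence."""
    m = 0.0  # running average of gradients
    v = 0.0  # running average of squared gradients
    updates = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        # Calibration for the zero initial values of m and v.
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
        updates.append(lr / math.sqrt(v_hat + eps) * m_hat)
    return updates


updates = adam_updates([2.0] * 10)
print(updates[0], updates[-1])
```

For a constant gradient every update is approximately equal to the learning rate from the very first step, regardless of the gradient's magnitude, which is the effect the calibration is meant to achieve.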

The authors of the algorithm derive (22) by unrolling the recursive formulas (20) and (21). For example, for $m_t$:

$$m_t = (1 - \beta_1) \sum_{i=1}^{t} \beta_1^{t-i} g_i$$
$$E[m_t] = E[g_t] (1 - \beta_1) \sum_{i=1}^{t} \beta_1^{t-i} + \zeta = E[g_t] (1 - \beta_1^t) + \zeta$$

The term $\zeta$ is close to $0$ when the distribution of $g$ is stationary, which is not true in the cases of practical interest to us, but we move the bracket with $\beta_1$ to the left-hand side anyway. Informally, one can imagine that at the first step we have an infinite history of identical updates:

$$\hat{m}_1 = \frac{m_1}{1 - \beta_1} = g_1 = (1 - \beta_1) \sum_{i=0}^{\infty} \beta_1^i g_1$$

When we later get a value of $m_t$ that is closer to the correct one, we make the "virtual" part of the series decay faster:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t} = m_t + \beta_1^t \hat{m}_t$$
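The effect of the calibration is easy to check numerically. The sketch below compares the raw exponential average started from zero with its calibrated version on a constant input; the function name and the input are assumptions made for illustration.

```python
def ema_with_calibration(values, beta):
    """Return (raw, calibrated) exponential averages, both started from zero."""
    m = 0.0
    pairs = []
    for t, x in enumerate(values, start=1):
        m = beta * m + (1.0 - beta) * x
        pairs.append((m, m / (1.0 - beta ** t)))
    return pairs


pairs = ema_with_calibration([1.0] * 5, beta=0.9)
for raw, calibrated in pairs:
    print(round(raw, 4), round(calibrated, 4))
```

The raw average creeps up from 0.1, while the calibrated value equals the input from the first step, as if the history had always consisted of the same value.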

The authors of Adam propose $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$ as default values and claim that, thanks to the initial calibration, the algorithm performs better than, or about as well as, all the previous algorithms on a wide range of datasets. Note that, once again, equations (22) are not set in stone. We have some theoretical justification for why the decay should look exactly like this, but nobody forbids experimenting with the calibration formulas. In my opinion, looking ahead, as in the Nesterov method, simply begs to be applied here.

Adamax

Adamax is just such an experiment, proposed in the same paper. Instead of the variance in (21), we can compute the moment of the gradient distribution of arbitrary order $p$. This can make the computations unstable. However, the case of $p$ tending to infinity works surprisingly well.

$$v_t = \beta_2^p v_{t-1} + (1 - \beta_2^p) |g_t|^p \tag{27}$$

Notice that instead of $\beta_2$ the dimensionally appropriate $\beta_2^p$ is used. Also note that to use the value obtained in (27) in the Adam formulas, we need to take the root of it: $u_t = v_t^{1/p}$. Let's derive the rule that replaces (21), taking $p \to \infty$ and unrolling $v_t$ under the root with the help of (27):

$$u_t = \lim_{p \to \infty} \left( (1 - \beta_2^p) \sum_{i=1}^{t} \beta_2^{p(t-i)} |g_i|^p \right)^{1/p} = \max\left(\beta_2^{t-1} |g_1|, \beta_2^{t-2} |g_2|, \dots, |g_t|\right) = \max(\beta_2 u_{t-1}, |g_t|) \tag{28}$$

This happens because, as $p \to \infty$, the largest term dominates the sum in (28). Informally, you can see why by taking a simple sum and a large $p$: $(a^p + b^p)^{1/p} \to \max(a, b)$ for non-negative $a$ and $b$. Not scary at all.

The remaining steps of the algorithm are the same as in Adam.
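Both the limit and the resulting update can be sketched in a few lines. The function names and input values are assumptions made for illustration, and the small `eps` in the denominator is an addition of this sketch to guard against division by zero before the first non-zero gradient arrives.

```python
def p_norm(values, p):
    """(sum |x|**p) ** (1/p); approaches max |x| as p grows."""
    return sum(abs(x) ** p for x in values) ** (1.0 / p)


def adamax_updates(grads, beta1=0.9, beta2=0.999, lr=0.002, eps=1e-8):
    """Return the Adamax update for each gradient in the sequence."""
    m = 0.0  # running average of gradients, as in Adam
    u = 0.0  # exponentially decayed maximum of |g|
    updates = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1.0 - beta1) * g
        u = max(beta2 * u, abs(g))
        m_hat = m / (1.0 - beta1 ** t)
        updates.append(lr / (u + eps) * m_hat)
    return updates


print(p_norm([3.0, 4.0], 2), p_norm([3.0, 4.0], 64))
updates = adamax_updates([1.0, 0.0, 0.0])
print(updates)
```

Already at `p = 64` the norm of the two values is numerically indistinguishable from the larger one, and in the update the decayed maximum `u` changes slowly once a large gradient has been seen.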

Experiments

Now let's look at the different algorithms in action. To make it clearer, let's look at the trajectory of the algorithms in the problem of finding the minimum of a function of two variables. Let me remind you that the training of a neural network is essentially the same, but there are significantly more than two variables and instead of an explicitly specified function we have only a set of points along which we want to build this function. In our case, the loss function is the objective function along which the optimizers move. Of course, on such a simple task it is impossible to feel the full power of advanced algorithms, but it is intuitively clear.
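A trajectory of this kind can be recorded with a few lines of code. The sketch below uses plain gradient descent on a hypothetical test function $J(x, y) = x^2 + 10 y^2$; the function, the starting point and the learning rate are assumptions for illustration and are not the landscapes shown in the figures.

```python
def grad(x, y):
    """Gradient of the hypothetical test function J(x, y) = x**2 + 10 * y**2."""
    return 2.0 * x, 20.0 * y


def descend(x, y, lr=0.04, steps=200):
    """Plain gradient descent; returns the list of visited points."""
    trajectory = [(x, y)]
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - lr * gx, y - lr * gy
        trajectory.append((x, y))
    return trajectory


trajectory = descend(x=4.0, y=1.0)
print(len(trajectory), trajectory[-1])
```

Swapping the update inside the loop for any of the rules from the previous sections gives the trajectory of the corresponding optimizer on the same surface.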

First, let's look at the Nesterov accelerated gradient with different values of $\gamma$. Once you understand why it looks the way it does, it is easier to understand the behavior of all the other algorithms with momentum accumulation, including Adam and Adamax.

[Figure: trajectories of the Nesterov accelerated gradient for different $\gamma$]

All the trajectories end in the same basin, but they get there differently. With a small $\gamma$ the algorithm becomes similar to ordinary SGD: at each step the descent goes in the direction of the decreasing gradient. With a $\gamma$ that is too large, the history of changes starts to have a strong influence, and the trajectory can “wander” a lot. Sometimes this is good: the greater the accumulated momentum, the easier it is to break out of the hollows of local minima along the way.

[Figure: accumulated momentum carrying the trajectory out of a local minimum]

Sometimes it is bad: you can easily squander the momentum by slipping past the hollow of the global minimum and settle in a local one. Therefore, for large $\gamma$ you can sometimes see the loss on the training set first reach the global minimum, then rise sharply, then start falling again, but never return to the previous minimum.

[Figure: trajectory slipping past the global minimum and settling in a local one]

Now let's look at the different algorithms launched from the same point.

[Figure: trajectories of the different optimizers launched from the same point]

As you can see, they all converge quite well (with minimal tuning of the learning rate). Pay attention to the big steps that Adam and RMSProp take at the start of training. This happens because at the very beginning no parameter (no coordinate) has changed yet, and the sums in the denominators of (14) and (23) are equal to zero. Here the situation is more complicated:

[Figure: trajectories of the different optimizers on a more complicated landscape]

Everyone except Adam got stuck in a local minimum. Compare the behavior of the Nesterov method and, say, RMSProp on these graphs. The Nesterov accelerated gradient, with any $\gamma$, once it falls into a local minimum, circles around for a while, then loses momentum and dies out at some point. RMSProp draws characteristic "hedgehogs". This is also related to the sum in the denominator of (14): in the trap the squared gradients are small, and the denominator becomes small again. The size of the jumps is also affected by the learning rate $\eta$ (the larger it is, the bigger the jumps) and by $\epsilon$ (the smaller it is, the bigger the jumps). Adagrad does not show this behavior, since it sums over the entire history of gradients rather than over a window. The jumping is usually desirable, as it lets the algorithm escape from traps, but occasionally the algorithm escapes from the global minimum this way, which, again, leads to an irreparable deterioration of its performance on the training set.

Finally, note that although all these optimizers can find their way to a minimum even across a plateau with a very small slope, or escape from a local minimum if they have already gained momentum, a bad starting point leaves them no chance:

[Figure: optimizer trajectories from a bad starting point]

Conclusion

So, we have reviewed several of the most popular first-order neural network optimizers. I hope these algorithms have ceased to seem like a magical black box with a bunch of mysterious parameters, and now you can make an informed decision on which optimizer to use in your tasks.

Finally, let me clarify one important point: it is unlikely that changing the weight update algorithm will solve all your problems with a neural network in one fell swoop. Of course, the gain from switching from SGD to something else will be obvious, but most likely the training history for the algorithms described in the article will look something like this for relatively simple datasets and network structures:

[Figure: training history for the optimizers described in the article]

... not too impressive. I would suggest keeping Adam as your “golden hammer”, since it gives the best results with minimal parameter tuning. When the network is already more or less debugged, try the Nesterov method with different parameters. Sometimes it helps achieve better results, but it is relatively sensitive to changes in the network: add or remove a couple of layers and you need to look for a new optimal learning rate. Consider the remaining algorithms and their parameters as a few more knobs and toggle switches that can be flipped in some special cases.

If you want some custom graphs with gradients, use this Python script (requires Python > 3.4, numpy and matplotlib):

Code

from matplotlib import pyplot as plt
import numpy as np
from math import ceil, floor

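def linear_interpolation(X, idx):
    """Value of X at a fractional index idx, interpolated between its neighbours."""
    idx_min = floor(idx)
    idx_max = ceil(idx)
    if idx_min == idx_max or idx_max >= len(X):
        return X[idx_min]
    elif idx_min < 0:
        return X[idx_max]
    else:
        # Move from X[idx_min] towards X[idx_max] by the fractional part of idx.
        return X[idx_min] + (idx - idx_min)*(X[idx_max] - X[idx_min])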

def EDM(X, gamma, lr=0.25):
    Y = []
    v = 0
    for x in X:
        v = gamma*v + lr*x
        Y.append(v)
    return np.asarray(Y)

def NM(X, gamma, lr=0.25):
    Y = []
    v = 0
    for i in range(len(X)):
        v = gamma*v + lr*(linear_interpolation(X, i+gamma*v) if i+gamma*v < len(X) else 0)
        Y.append(v)
    return np.asarray(Y)

def SmoothedNM(X, gamma, lr=0.25):
    Y = []
    v = 0
    for i in range(len(X)):
        lookahead4 = linear_interpolation(X, i+gamma*v/4)   if i+gamma*v/4      < len(X) else 0
        lookahead3 = linear_interpolation(X, i+gamma*v/2)   if i+gamma*v/2      < len(X) else 0
        lookahead2 = linear_interpolation(X, i+gamma*v*3/4) if i+gamma*v*3/4    < len(X) else 0
        lookahead1 = linear_interpolation(X, i+gamma*v)     if i+gamma*v        < len(X) else 0
        v = gamma*v + lr*(lookahead4 + lookahead3 + lookahead2 + lookahead1)/4
        Y.append(v)
    return np.asarray(Y)

def Adagrad(X, eps, lr=2.5):
    Y = []
    G = 0
    for x in X:
        G += x*x
        v = lr/np.sqrt(G + eps)*x
        Y.append(v)
    return np.asarray(Y)

def RMSProp(X, gamma, lr=0.25, eps=0.00001):
    Y = []
    EG = 0
    for x in X:
        EG = gamma*EG + (1-gamma)*x*x
        v = lr/np.sqrt(EG + eps)*x
        Y.append(v)
    return np.asarray(Y)

def Adadelta(X, gamma, lr=50.0, eps=0.001):
    Y = []
    EG = 0
    EDTheta = lr
    for x in X:
        EG = gamma*EG + (1-gamma)*x*x
        v = np.sqrt(EDTheta + eps)/np.sqrt(EG + eps)*x
        Y.append(v)
        EDTheta = gamma*EDTheta + (1-gamma)*v*v
    return np.asarray(Y)

def AdadeltaZeroStart(X, gamma, eps=0.001):
    return Adadelta(X, gamma, lr=0.0, eps=eps)

def AdadeltaBigStart(X, gamma, eps=0.001):
    return Adadelta(X, gamma, lr=50.0, eps=eps)

def Adam(X, beta1, beta2=0.999, lr=0.25, eps=0.0000001):
    Y = []
    m = 0
    v = 0
    for i, x in enumerate(X):
        m = beta1*m + (1-beta1)*x
        v = beta2*v + (1-beta2)*x*x
        m_hat = m/(1- pow(beta1, i+1) )
        v_hat = v/(1- pow(beta2, i+1) )
        dthetha = lr/np.sqrt(v_hat + eps)*m_hat
        Y.append(dthetha)
    return np.asarray(Y)

np.random.seed(413)
X = np.arange(0, 300)

D_Thetha_spikes = np.asarray( [int(x%60 == 0) for x in X])
D_Thetha_rectangles = np.asarray( [2*int(x%40 < 20) - 1 for x in X])
D_Thetha_noisy_sin = np.asarray( [np.sin(x/20) + np.random.random() - 0.5 for x in X])
D_Thetha_very_noisy_sin = np.asarray( [np.sin(x/20)/5 + np.random.random() - 0.5 for x in X])
D_Thetha_uneven_sawtooth = np.asarray( [ x%20/(15*int(x > 80) + 5) for x in X])
D_Thetha_saturation = np.asarray( [ int(x % 80 < 40) for x in X])

for method_label, method, parameter_step in [
                ("GRAD_Simple_Momentum", EDM, [0.25, 0.9, 0.975]),
                ("GRAD_Nesterov", NM, [0.25, 0.9, 0.975]),
                ("GRAD_Smoothed_Nesterov", SmoothedNM, [0.25, 0.9, 0.975]),
                ("GRAD_Adagrad", Adagrad, [0.0000001, 0.1, 10.0]),
                ("GRAD_RMSProp", RMSProp, [0.25, 0.9, 0.975]),
                ("GRAD_AdadeltaZeroStart", AdadeltaZeroStart, [0.25, 0.9, 0.975]),
                ("GRAD_AdadeltaBigStart", AdadeltaBigStart, [0.25, 0.9, 0.975]),
                ("GRAD_Adam", Adam, [0.25, 0.9, 0.975]),
            ]:
    for label, D_Thetha in [("spikes", D_Thetha_spikes),
                            ("rectangles", D_Thetha_rectangles),
                            ("noisy sin", D_Thetha_noisy_sin),
                            ("very noisy sin", D_Thetha_very_noisy_sin),
                            ("uneven sawtooth", D_Thetha_uneven_sawtooth),
                            ("saturation", D_Thetha_saturation), ]:
        fig = plt.figure(figsize=[16.0, 9.0])
        ax = fig.add_subplot(111)

        ax.plot(X, D_Thetha, label="gradient")
        for gamma in parameter_step:
            Y = method(D_Thetha, gamma)
            ax.plot(X, Y, label="param="+str(gamma))

        ax.spines['bottom'].set_position('zero')
        full_name = method_label + "_" + label

        plt.xticks(np.arange(0, 300, 20))
        plt.grid(True)
        plt.title(full_name)
        plt.xlabel('epoch')
        plt.ylabel('value')
        plt.legend()
        # plt.show(block=True) # Uncomment this and comment out the next line if you just want to watch
        plt.savefig(full_name)
        plt.close(fig)
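
The script above has no entry for Adamax. Below is a minimal sketch in the same style, assuming the usual Adamax rule, where the second moment is replaced by an exponentially weighted infinity norm. The function name, the default values and the eps in the denominator are my assumptions and are not part of the original script.

```python
import numpy as np

def Adamax(X, beta1, beta2=0.999, lr=0.25, eps=1e-7):
    # Step sizes produced by Adamax for a 1-D sequence of gradients X
    Y = []
    m = 0.0  # exponential moving average of the gradients
    u = 0.0  # exponentially weighted infinity norm of the gradients
    for i, x in enumerate(X):
        m = beta1*m + (1 - beta1)*x
        u = max(beta2*u, abs(x))
        m_hat = m/(1 - beta1**(i + 1))  # bias correction, as in Adam
        # eps is an assumption: it only guards against division by zero
        Y.append(lr/(u + eps)*m_hat)
    return np.asarray(Y)
```

To plot it, one could add a tuple such as ("GRAD_Adamax", Adamax, [0.25, 0.9, 0.975]) to the list of methods in the loop above; as with Adam, the swept parameter would then be beta1.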

If you want to experiment with the parameters of the algorithms and with your own objective functions, use this script to create your own animation of the optimizer's trajectory (it also requires theano and lasagne):

More code

import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import matplotlib.animation as animation
import theano
import theano.tensor as T
from lasagne.updates import nesterov_momentum, rmsprop, adadelta, adagrad, adam

#For reproducibility. Comment it out for randomness
np.random.seed(413)

#Uncomment and comment out the next line if you want to try random init
# clean_random_weights = np.random.standard_normal((2, 1))
clean_random_weights = np.asarray([[-2.8], [-2.5]])
W = theano.shared(clean_random_weights)
Wprobe = T.matrix('weights')

levels = [x/4.0 for x in range(-8, 2*12, 1)] + [6.25, 6.5, 6.75, 7] + \
         list(range(8, 20, 1))
levels = np.asarray(levels)

O_simple_quad = (W**2).sum()
O_wobbly = (W**2).sum()/3 + T.abs_(W[0][0])*T.sqrt(T.abs_(W[0][0]) + 0.1) + 3*T.sin(W.sum()) + 3.0 + 8*T.exp(-2*((W[0][0] + 1)**2+(W[1][0] + 2)**2))
O_basins_and_walls = (W**2).sum()/2 + T.sin(W[0][0]*4)**2
O_ripple = (W**2).sum()/3 + (T.sin(W[0][0]*20)**2 + T.sin(W[1][0]*20)**2)/15
O_giant_plateu = 4*(1-T.exp(-((W[0][0])**2+(W[1][0])**2)))
O_hills_and_canyon = (W**2).sum()/3 + \
                     3*T.exp(-((W[0][0] + 1)**2+(W[1][0] + 2)**2)) + \
                       T.exp(-1.5*(2*(W[0][0] + 2)**2+(W[1][0] -0.5)**2)) + \
                     3*T.exp(-1.5*((W[0][0] -1)**2+2*(W[1][0] + 1.5)**2)) + \
                     1.5*T.exp(-((W[0][0] + 1.5)**2+3*(W[1][0] + 0.5)**2)) + \
                     4*(1 - T.exp(-((W[0][0] + W[1][0])**2)))

O_two_minimums = 4-0.5*T.exp(-((W[0][0] + 2.5)**2+(W[1][0] + 2.5)**2))-3*T.exp(-((W[0][0])**2+(W[1][0])**2))

nesterov_testsuit = [
                (nesterov_momentum, "nesterov momentum 0.25",    {"learning_rate": 0.01, "momentum": 0.25}),
                (nesterov_momentum, "nesterov momentum 0.9",     {"learning_rate": 0.01, "momentum": 0.9}),
                (nesterov_momentum, "nesterov momentum 0.975",   {"learning_rate": 0.01, "momentum": 0.975})
            ]

cross_method_testsuit = [
                (nesterov_momentum, "nesterov",     {"learning_rate": 0.01}),
                (rmsprop,           "rmsprop",      {"learning_rate": 0.25}),
                (adadelta,          "adadelta",     {"learning_rate": 100.0}),
                (adagrad,           "adagrad",      {"learning_rate": 1.0}),
                (adam,              "adam",         {"learning_rate": 0.25})
            ]

for O, plot_label in [
           (O_wobbly, "Wobbly"),
           (O_basins_and_walls, "Basins_and_walls"),
           (O_giant_plateu, "Giant_plateu"),
           (O_hills_and_canyon, "Hills_and_canyon"),
           (O_two_minimums, "Bad_init")
        ]:

    result_probe = theano.function([Wprobe], O, givens=[(W, Wprobe)])

    history = {}
    for method, history_mark, kwargs_to_method in cross_method_testsuit:
        W.set_value(clean_random_weights)
        history[history_mark] = [W.eval().flatten()]

        updates = method(O, [W], **kwargs_to_method)
        train_fnc = theano.function(inputs=[], outputs=O, updates=updates)

        for i in range(125):
            result_val = train_fnc()
            print("Iteration " + str(i) + " result: "+ str(result_val))
            history[history_mark].append(W.eval().flatten())

        print("-------- DONE {}-------".format(history_mark))

    delta = 0.05
    mesh = np.arange(-3.0, 3.0, delta)
    X, Y = np.meshgrid(mesh, mesh)

    Z = []
    for y in mesh:
        z = []
        for x in mesh:
            z.append(result_probe([[x], [y]]))
        Z.append(z)
    Z = np.asarray(Z)

    print("-------- BUILT MESH -------")

    fig, ax = plt.subplots(figsize=[12.0, 8.0])
    CS = ax.contour(X, Y, Z, levels=levels)
    plt.clabel(CS, inline=1, fontsize=10)
    plt.title(plot_label)

    nphistory = []
    for key in history:
        nphistory.append(
                [np.asarray([h[0] for h in history[key]]),
                 np.asarray([h[1] for h in history[key]]),
                 key]
            )

    lines = []
    for nph in nphistory:
        lines += ax.plot(nph[0], nph[1], label=nph[2])
    leg = plt.legend()

    plt.savefig(plot_label + '_final.png')

    def animate(i):
        for line, hist in zip(lines, nphistory):
            line.set_xdata(hist[0][:i])
            line.set_ydata(hist[1][:i])
        return lines

    def init():
        for line, hist in zip(lines, nphistory):
            line.set_ydata(np.ma.array(hist[1], mask=True))  # fully masked: start with an empty line
        return lines

    ani = animation.FuncAnimation(fig, animate, np.arange(1, 120), init_func=init,
                                  interval=100, repeat_delay=0, blit=True, repeat=True)

    print("-------- WRITING ANIMATION -------")

    # plt.show(block=True) # Uncomment this and comment out the next line if you just want to watch
    ani.save(plot_label + '.mp4', writer='ffmpeg_file', fps=5)

    print("-------- DONE {} -------".format(plot_label))
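
If theano and lasagne are not available, a trajectory of the same kind can be recorded with numpy alone. This is only a rough sketch under my own assumptions: the gradient is approximated by central differences instead of symbolic differentiation, Adam is written out by hand, and the names giant_plateau, numerical_grad and trace_adam are hypothetical helpers, not part of the script above.

```python
import numpy as np

def giant_plateau(w):
    # Same surface as O_giant_plateu in the script above, for a flat vector w
    return 4*(1 - np.exp(-(w[0]**2 + w[1]**2)))

def numerical_grad(f, w, h=1e-5):
    # Central-difference approximation of the gradient (assumption: no autodiff)
    g = np.zeros_like(w)
    for k in range(len(w)):
        step = np.zeros_like(w)
        step[k] = h
        g[k] = (f(w + step) - f(w - step))/(2*h)
    return g

def trace_adam(f, w0, lr=0.25, beta1=0.9, beta2=0.999, eps=1e-8, steps=125):
    # Run hand-written Adam and return every visited point, start included
    w = np.asarray(w0, dtype=float)
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    history = [w.copy()]
    for t in range(1, steps + 1):
        g = numerical_grad(f, w)
        m = beta1*m + (1 - beta1)*g
        v = beta2*v + (1 - beta2)*g*g
        m_hat = m/(1 - beta1**t)
        v_hat = v/(1 - beta2**t)
        w = w - lr*m_hat/(np.sqrt(v_hat) + eps)
        history.append(w.copy())
    return np.asarray(history)

history = trace_adam(giant_plateau, [-2.8, -2.5])
```

The columns history[:, 0] and history[:, 1] can be passed to ax.plot in the same way as the entries of nphistory in the script above.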
