Glossary entry (derived from question below)
Spanish term or phrase:
exponencialmente decreciente de los gradientes cuadrados pasados
English translation:
an exponentially decreasing average of squared past gradients
- The asker opted for community grading. The question was closed on 2020-07-26 15:54:12 based on peer agreement (or, if there were too few peer comments, asker preference).
Jul 22, 2020 16:05
Spanish term
exponencialmente decreciente de los gradientes cuadrados pasados
Spanish to English
Other
Mathematics & Statistics
Context: También almacena un promedio exponencialmente decreciente de los gradientes cuadrados pasados similar a RMSprop.
Proposed translations
(English)
4 | an exponentially decreasing average of squared past gradients | Francois Boye |
4 | exponentially decaying average of past squared gradientss | Helena Chavarria |
Change log
Jul 22, 2020 16:31: philgoddard changed "Field (write-in)" from "General knowledge" to "(none)"
Proposed translations
8 hrs
Selected
an exponentially decreasing average of squared past gradients
A decreasing average is an average that decreases over time; the decrease is exponential if it follows an exponential function.
https://en.wikipedia.org/wiki/Gradient
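As a rough illustration of the idea (not from the source; the smoothing factor beta below is an assumed illustrative value), such an average gives each past squared gradient a weight that shrinks exponentially with age:

# Minimal sketch of an exponentially decaying/decreasing running average
# of squared gradients. At each step, earlier contributions are multiplied
# by beta again, so their weight decreases exponentially over time.
def update_average(avg, grad, beta=0.9):   # beta is an assumed value
    return beta * avg + (1 - beta) * grad ** 2

avg = 0.0
for g in [1.0, 0.5, 0.2]:   # example gradient values
    avg = update_average(avg, g)
print(avg)   # most of the weight is on the most recent squared gradient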
4 KudoZ points awarded for this answer.
Comment: "Selected automatically based on peer agreement."
17 mins
exponentially decaying average of past squared gradientss
I've found this, though I have no idea what it means!
Adaptive Moment Estimation (Adam) is another method that computes adaptive learning rates for each parameter. In addition to storing an exponentially decaying average of past squared gradients (s) like Adadelta and RMSprop, Adam also keeps an exponentially decaying average of past gradients (v), similar to momentum. Whereas momentum can be seen as a ball running down a slope, Adam behaves like a heavy ball with friction, which thus prefers flat minima in the error surface.
https://towardsdatascience.com/optimisation-algorithm-adapti...
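A minimal sketch of the two averages described above; the hyperparameter values (lr, beta1, beta2, eps) are typical defaults assumed for illustration, not taken from the source:

import numpy as np

# One Adam update step: m is the decaying average of past gradients,
# v is the decaying average of past squared gradients.
def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad          # exponentially decaying average of past gradients
    v = beta2 * v + (1 - beta2) * grad ** 2     # exponentially decaying average of past squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for the zero-initialized averages
    v_hat = v / (1 - beta2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v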
--------------------------------------------------
Note added at 17 mins (2020-07-22 16:23:26 GMT)
--------------------------------------------------
Oops! 'Gradients' should only have one 's'.
--------------------------------------------------
Note added at 2 hrs (2020-07-22 18:59:14 GMT)
--------------------------------------------------
The naive way to do the windowed accumulation of squared gradients is simply by accumulating the last w squared gradients. However, storing and updating the w previous squared gradients is not efficient, especially when the number of parameters to be updated is very large, which in deep learning can reach millions. Instead, the author of Adadelta implements the accumulation as an exponentially decaying average of the squared gradients, which is denoted by 𝔼[g²]. This local accumulation at timestep 𝑡 is computed by
https://medium.com/konvergen/continuing-on-adaptive-method-a...
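The quoted passage breaks off before the formula; the exponentially decaying accumulation it refers to (with decay constant ρ, as in the Adadelta paper) has the standard form

𝔼[g²]_t = ρ · 𝔼[g²]_{t−1} + (1 − ρ) · g_t²

so older squared gradients are progressively down-weighted rather than stored individually.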
4.6 Adam
Adaptive Moment Estimation (Adam) [10] is another method that computes adaptive learning rates for each parameter. In addition to storing an exponentially decaying average of past squared gradients v_t like Adadelta and RMSprop, Adam also keeps an exponentially decaying average of past gradients m_t, similar to momentum:
https://arxiv.org/pdf/1609.04747.pdf
RMSprop
Root Mean Squared Propagation (RMSprop) is very close to Adagrad, except that it does not accumulate the sum of the squared gradients but instead an exponentially decaying average. This decaying average is realized by combining the Momentum algorithm and the Adagrad algorithm, with a new term.
https://mlfromscratch.com/optimizers-explained/#/
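A minimal sketch of that RMSprop update, assuming the commonly used decay value of 0.9 and a small eps for numerical stability (values assumed, not from the source):

import numpy as np

# RMSprop step: the squared-gradient accumulator is an exponentially
# decaying average rather than Adagrad's ever-growing sum.
def rmsprop_step(theta, grad, avg_sq, lr=0.001, decay=0.9, eps=1e-8):
    avg_sq = decay * avg_sq + (1 - decay) * grad ** 2    # decaying average of squared gradients
    theta = theta - lr * grad / (np.sqrt(avg_sq) + eps)  # per-parameter adaptive step size
    return theta, avg_sq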
Adam
Adam stands for Adaptive Moment Estimation. In addition to storing an exponentially decaying average of past squared gradients like Adadelta and RMSprop, Adam also keeps an exponentially decaying average of past gradients, similar to momentum.
https://www.kaggle.com/residentmario/keras-optimizers
Peer comment(s):
agree | philgoddard: These terms may look difficult to a non-statistician, and I'm not one, but they're fairly easy to guess and Google. (7 mins)
Reply: Cheers, Phil :-)
disagree | Francois Boye: a gradient does not decay; instead it increases or decreases. (1 hr)
Reply: As I have mentioned, I'm definitely no expert. I suggest you contact the authors of the papers/articles I've used to illustrate my answer. Thank you for your much-appreciated opinion.