1. Abstract
- Adaptive gradient methods that adapt not only to the sequence of gradients, but also to the precise update delays that occur.
- The paper first presents an impractical algorithm that exactly quantifies the effect of the delays.
- It then analyzes AdaptiveRevision, which can be implemented efficiently and achieves comparable guarantees.
- The core idea is to revise the learning rates of previous gradient steps; the per-coordinate rate being revised is sketched below.
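For reference, the per-coordinate adaptive learning rate such methods use in the serial setting (the standard AdaGrad form; the notation, with g_{s,i} the i-th coordinate of the gradient at step s, is assumed here):

```latex
\eta_{t,i} = \frac{\alpha}{\sqrt{\sum_{s \le t} g_{s,i}^2}}
```

"Revising" means retroactively replacing the rate an earlier step used, once delayed gradients reveal what the denominator should have been.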
2. Intro
2.1 Problem setting
- The Readers and Updaters may reside on different machines, perhaps located in different parts of the world, so communication between them is not instantaneous; the coefficients a Reader uses are therefore usually stale.
- While the coefficient vector may be stored and updated centrally, predictions must be available in milliseconds in any part of the world. This leads naturally to an architecture in which a large number of Readers maintain local copies of the coefficient vector, sending updates to the Updaters and periodically requesting fresh coefficients from them.
- Setting: linear models over sparse data.
- Because of the delays, the operations interact with one another, which makes a fine-grained analysis difficult.
- Assuming the coordinates are independent, one can first prove a bound for each coordinate and then sum over the coordinates to obtain the overall bound (see the decomposition after this list).
- InOrder assumption: if the Read of update s1 occurs before the Read of s2, then the Update of s1 is also applied before that of s2 (updates arrive in the same order as their Reads).
- Update s is outstanding at time t if the Read that produced s occurred before t but the corresponding Update has not yet been applied.
- Requires only moderate additional storage and network cost: we store a sum of gradients along with each coefficient, and for each Read, we remember the value of this gradient sum at the time of the Read until the corresponding Update occurs.
- The gradient sum is sent from the Updater to the Reader and back again (the Reader returns it to the Updater along with its update), ensuring it is available exactly when needed; see the sketch after this list.
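The per-coordinate decomposition referenced above follows from convexity (linearization) alone; with g_t = ∇f_t(x_t) and comparator x* (notation assumed here):

```latex
\mathrm{Regret}(T) = \sum_{t=1}^{T} \bigl( f_t(x_t) - f_t(x^*) \bigr)
\le \sum_{t=1}^{T} g_t \cdot (x_t - x^*)
= \sum_{i=1}^{d} \sum_{t=1}^{T} g_{t,i}\,(x_{t,i} - x^*_i),
```

so a bound on each inner sum (one coordinate's regret) can simply be added up.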
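A minimal single-coordinate sketch of the bookkeeping above. The class and method names are illustrative, and the exact revision rule (the cross term in the denominator and the single stored learning rate) is an assumption in the spirit of these notes, not a transcription of the paper's algorithm:

```python
import math

class Updater:
    """One coordinate; the same bookkeeping runs independently per coordinate."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # learning-rate scale
        self.x = 0.0        # coefficient
        self.gbar = 0.0     # sum of all gradients applied so far
        self.g2 = 0.0       # adaptive denominator (squares plus correction terms)
        self.eta = 0.0      # last learning rate used (0.0 until the first update)

    def read(self):
        # The Reader receives the coefficient AND the current gradient sum;
        # it echoes the gradient sum back with its update, so the Updater
        # never has to remember per-Read state itself.
        return self.x, self.gbar

    def update(self, g, gbar_at_read):
        # Gradient mass applied between this update's Read and now.
        g_bck = self.gbar - gbar_at_read
        # Revise the denominator toward what a serial run would have seen
        # (the 2*g*g_bck cross term is an assumed form of the correction).
        self.g2 = max(self.g2 + g * g + 2.0 * g * g_bck, 0.0)
        eta_new = self.alpha / math.sqrt(max(self.g2, 1e-12))
        if self.eta > 0.0:
            eta_new = min(eta_new, self.eta)  # keep rates nonincreasing
        # Apply the new step, then move the g_bck mass from the old rate to
        # the new one (this is the revision of the earlier steps).
        self.x -= eta_new * g
        if self.eta > 0.0:
            self.x += (self.eta - eta_new) * g_bck
        self.eta = eta_new
        self.gbar += g

# Usage: a Reader snapshots, another update lands, then its own update arrives.
u = Updater(alpha=0.5)
x, snap = u.read()                    # Reader takes its snapshot
u.update(g=0.2, gbar_at_read=u.gbar)  # some other Reader's update lands first
u.update(g=0.3, gbar_at_read=snap)    # our delayed update: g_bck = 0.2
```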
3. Experiments
- Compares the hypothetical (impractical) algorithms with AdaptiveRevision.
- Two medium-sized datasets: a web search advertising dataset from a large search engine (3.1 million examples, sparse features), and a malicious URL dataset.
- Model: logistic regression.
4. Conclusions and future work
- The key algorithmic technique is maintaining a sum of gradients, which allows the adjustment of the learning rates for all gradient updates that occurred between the current Update and its Read (a worked form follows below).
- The analysis method is novel, but also somewhat indirect. Ideally such an analysis would remove the technical need for the InOrder assumption, and would also cover AdaptiveRevision variants of OGD with Projection and of Dual Averaging.