adaQN function

adaQN guided optimizer

Optimizes an empirical (possibly non-convex) loss function over batches of sample data.

adaQN(x0, grad_fun, obj_fun = NULL, pred_fun = NULL, initial_step = 0.01,
  step_fun = function(iter) 1/sqrt((iter/100) + 1), callback_iter = NULL,
  args_cb = NULL, verbose = TRUE, mem_size = 10, fisher_size = 100,
  bfgs_upd_freq = 20, max_incr = 1.01, min_curvature = 1e-04, y_reg = NULL,
  scal_reg = 1e-04, rmsprop_weight = 0.9, use_grad_diff = FALSE,
  check_nan = TRUE, nthreads = -1, X_val = NULL, y_val = NULL, w_val = NULL)

Arguments

  • x0: Initial values for the variables to optimize.
  • grad_fun: Function taking as unnamed arguments x_curr (variable values), X (covariates), y (target variable), and w (weights), plus additional arguments (...), and returning the expected value of the gradient when evaluated on that data.
  • obj_fun: Function taking as unnamed arguments x_curr (variable values), X (covariates), y (target variable), and w (weights), plus additional arguments (...), and returning the expected value of the objective function when evaluated on that data. Only required when using max_incr.
  • pred_fun: Function taking an unnamed argument as data, another unnamed argument as the variable values, and optional extra arguments (...). Will be called when using predict on the object returned by this function.
  • initial_step: Initial step size.
  • step_fun: Function accepting the iteration number as an unnamed parameter, which will output the number by which initial_step will be multiplied at each iteration to get the step size for that iteration.
  • callback_iter: Callback function which will be called at the end of each iteration. Will pass three unnamed arguments: the current variable values, the current iteration number, and args_cb. Pass NULL if there is no need to call a callback function.
  • args_cb: Extra argument to pass to the callback function.
  • verbose: Whether to print information about iteration statuses when something goes wrong.
  • mem_size: Number of correction pairs to store for approximation of Hessian-vector products.
  • fisher_size: Number of gradients to store for calculation of the empirical Fisher product with gradients. If passing NULL, will force use_grad_diff to TRUE.
  • bfgs_upd_freq: Number of iterations (batches) after which to generate a BFGS correction pair.
  • max_incr: Maximum ratio of function values in the validation set under the average values of x during current epoch vs. previous epoch. If the ratio is above this threshold, the BFGS and Fisher memories will be reset, and x values reverted to their previous average. If not using a validation set, will take a longer batch for function evaluations (same as used for gradients when using use_grad_diff = TRUE). Pass NULL for no function-increase checking.
  • min_curvature: Minimum value of (s * y) / (s * s) in order to accept a correction pair. Pass NULL for no minimum.
  • y_reg: Regularizer for the 'y' vector of each correction pair (y_reg * s gets added to it). Pass NULL for no regularization.
  • scal_reg: Regularization parameter to use in the denominator for AdaGrad and RMSProp scaling.
  • rmsprop_weight: If not NULL, will use RMSProp formula instead of AdaGrad for approximated inverse-Hessian initialization.
  • use_grad_diff: Whether to create the correction pairs using differences between gradients instead of empirical Fisher matrix. These gradients are calculated on a larger batch than the regular ones (given by batch_size * bfgs_upd_freq).
  • check_nan: Whether to check for variables becoming NaN after each iteration, and revert the step if they do (this will also reset the BFGS memory).
  • nthreads: Number of parallel threads to use. If set to -1, will determine the number of available threads and use all of them. Note however that not all the computations can be parallelized, and the BLAS backend might use a different number of threads.
  • X_val: Covariates to use as validation set (only used when passing max_incr). If not passed, will use a larger batch of stored data, in the same way as for Hessian-vector products in SQN.
  • y_val: Target variable for the covariates to use as validation set (only used when passing max_incr). If not passed, will use a larger batch of stored data, in the same way as for Hessian-vector products in SQN.
  • w_val: Sample weights for the covariates to use as validation set (only used when passing max_incr). If not passed, will use a larger batch of stored data, in the same way as for Hessian-vector products in SQN.
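
As a rough illustration of how two of these arguments behave, the sketch below uses only base R. It evaluates the default step_fun to show the step size initial_step * step_fun(iter) at a few iteration numbers, and applies the min_curvature test (s * y) / (s * s) to a made-up correction pair. The helper accept_pair and the example vectors are hypothetical, and the comparison direction (>=) is an assumption rather than something taken from the package source.

```r
# Default step-size schedule, copied from the function signature
initial_step <- 0.01
step_fun <- function(iter) 1 / sqrt((iter / 100) + 1)

# Step size used at a few iteration numbers
iters <- c(0, 100, 300, 900)
step_sizes <- initial_step * step_fun(iters)
print(data.frame(iter = iters, step_size = step_sizes))

# Hypothetical helper: curvature test for a correction pair (s, y).
# Assumes a pair is accepted when (s . y) / (s . s) >= min_curvature.
accept_pair <- function(s, y, min_curvature = 1e-4) {
  curvature <- sum(s * y) / sum(s * s)
  curvature >= min_curvature
}

s <- c(0.1, -0.2, 0.05)
print(accept_pair(s, y = 2 * s))     # curvature is 2
print(accept_pair(s, y = -0.5 * s))  # curvature is -0.5
```

With the default schedule the multiplier decays slowly: it only halves after 300 iterations, so initial_step largely sets the scale of the early updates.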

Returns

An adaQN object with the user-supplied functions, which can be fit to batches of data through function partial_fit, and can produce predictions on new data through function predict.

Examples

### Example regression with randomly-generated data
library(stochQN)

### Will sample data y ~ Ax + epsilon
true_coefs <- c(1.12, 5.34, -6.123)

generate_data_batch <- function(true_coefs, n = 100) {
  X <- matrix(
    rnorm(length(true_coefs) * n),
    nrow=n, ncol=length(true_coefs))
  y <- X %*% true_coefs + rnorm(n)
  return(list(X = X, y = y))
}

### Regular regression function that minimizes RMSE
eval_fun <- function(coefs, X, y, weights=NULL, lambda=1e-5) {
  pred <- as.numeric(X %*% coefs)
  RMSE <- sqrt(mean((pred - y)^2))
  reg <- 2 * lambda * as.numeric(coefs %*% coefs)
  return(RMSE + reg)
}

### Gradient of the function above (RMSE term plus regularization term)
eval_grad <- function(coefs, X, y, weights=NULL, lambda=1e-5) {
  err <- as.numeric(X %*% coefs - y)
  RMSE <- sqrt(mean(err^2))
  grad <- colMeans(X * err) / RMSE
  grad <- grad + 4 * lambda * as.numeric(coefs)
  return(grad)
}

pred_fun <- function(X, coefs, ...) {
  return(as.numeric(X %*% coefs))
}

### Initialize optimizer from arbitrary values
x0 <- c(1, 1, 1)
optimizer <- adaQN(x0, grad_fun=eval_grad,
  pred_fun=pred_fun, obj_fun=eval_fun, initial_step=1e-0)
val_data <- generate_data_batch(true_coefs, n=100)

### Fit to 50 batches of data, 100 observations each
for (i in 1:50) {
  set.seed(i)
  new_batch <- generate_data_batch(true_coefs, n=100)
  partial_fit(
    optimizer, new_batch$X, new_batch$y,
    lambda=1e-5)
  x_curr <- get_curr_x(optimizer)
  i_curr <- get_iteration_number(optimizer)
  if ((i_curr %% 10) == 0) {
    cat(sprintf(
      "Iteration %d - E[f(x)]: %f - values of x: [%f, %f, %f]\n",
      i_curr,
      eval_fun(x_curr, val_data$X, val_data$y, lambda=1e-5),
      x_curr[1], x_curr[2], x_curr[3]))
  }
}

### Predict for new data
new_batch <- generate_data_batch(true_coefs, n=10)
yhat <- predict(optimizer, new_batch$X)
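
A function for callback_iter can be sketched in a similar way. Per the argument description, it receives three unnamed arguments: the current variable values, the current iteration number, and args_cb. The names log_progress and history below are hypothetical, and using an environment as args_cb so that the callback can accumulate state is an assumption, not something this page states. The callback is invoked by hand so that the snippet runs without fitting anything.

```r
# Hypothetical callback: records the iteration number and the norm of x
log_progress <- function(x_curr, iter, args_cb) {
  args_cb$iters <- c(args_cb$iters, iter)
  args_cb$norms <- c(args_cb$norms, sqrt(sum(x_curr^2)))
  invisible(NULL)
}

# Environment used as mutable storage (assumed to be usable as args_cb)
history <- new.env()
history$iters <- integer(0)
history$norms <- numeric(0)

# Manual calls, mimicking one call at the end of each iteration
log_progress(c(1, 1, 1), 1L, history)
log_progress(c(3, 4, 0), 2L, history)
print(history$iters)
print(history$norms)

# With stochQN loaded, it would be passed along these lines:
# optimizer <- adaQN(x0, grad_fun = eval_grad, obj_fun = eval_fun,
#                    callback_iter = log_progress, args_cb = history)
```

An environment is used here because it is modified in place; a plain list passed as args_cb would be copied inside the callback, and the recorded values would be lost.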

References

  • Keskar, N.S. and Berahas, A.S., 2016, September. "adaQN: An Adaptive Quasi-Newton Algorithm for Training RNNs." In Joint European Conference on Machine Learning and Knowledge Discovery in Databases (pp. 1-16). Springer, Cham.
  • Wright, S. and Nocedal, J., 1999. "Numerical optimization." (ch 7) Springer Science, 35(67-68), p.7.

See Also

partial_fit, predict.stochQN_guided, adaQN_free

  • Maintainer: David Cortes
  • License: BSD_2_clause + file LICENSE
  • Last published: 2021-09-26