xvalidate function

Implementing Cross Validation

This is the internal function called by mlfitppml_int to perform cross-validation when that option is enabled. It is also available as a stand-alone function in case it is needed, but most users will be better served by the wrapper mlfitppml.

xvalidate(
  y,
  x,
  fes,
  IDs,
  testID = NULL,
  tol = 1e-08,
  hdfetol = 1e-04,
  colcheck_x = TRUE,
  colcheck_x_fes = TRUE,
  init_mu = NULL,
  init_x = NULL,
  init_z = NULL,
  verbose = FALSE,
  cluster = NULL,
  penalty = "lasso",
  method = "placeholder",
  standardize = TRUE,
  penweights = rep(1, ncol(x_reg)),
  lambda = 0
)

Arguments

  • y: Dependent variable (a vector)
  • x: Regressor matrix.
  • fes: List of fixed effects.
  • IDs: A vector of fold IDs for k-fold cross validation. If left unspecified, each observation is assigned to a different fold (warning: this is likely to be very resource-intensive).
  • testID: Optional. A number indicating which ID to hold out during cross-validation. If left unspecified, the function cycles through all IDs and reports the average RMSE.
  • tol: Tolerance parameter for convergence of the IRLS algorithm.
  • hdfetol: Tolerance parameter for the within-transformation step, passed on to collapse::fhdwithin.
  • colcheck_x: Logical. If TRUE, this checks collinearity between the independent variables and drops the collinear variables.
  • colcheck_x_fes: Logical. If TRUE, this checks whether the independent variables are perfectly explained by the fixed effects and drops those that are.
  • init_mu: Optional: initial values of the conditional mean μ, to be used as weights in the first iteration of the algorithm.
  • init_x: Optional: initial values of the independent variables.
  • init_z: Optional: initial values of the transformed dependent variable, to be used in the first iteration of the algorithm.
  • verbose: Logical. If TRUE, it prints information to the screen while evaluating.
  • cluster: Optional: a vector classifying observations into clusters (to use when calculating SEs).
  • penalty: A string indicating the penalty type. Currently supported: "lasso" and "ridge".
  • method: The user can set this equal to "plugin" to perform the plugin algorithm with coefficient-specific penalty weights (see details). Otherwise, a single global penalty is used.
  • standardize: Logical. If TRUE, x variables are standardized before estimation.
  • penweights: Optional: a vector of coefficient-specific penalties to use in plugin lasso when method == "plugin".
  • lambda: Penalty parameter, to be passed on to penhdfeppml_int or penhdfeppml_cluster_int.
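
As a minimal sketch of how a fold vector for IDs might be built, the following base R code assigns folds at the group level, so that all observations sharing a group are held out together, and then expands the assignment to one label per observation. The data frame dat, its group column, and the objects nfolds, group_fold and fold_ids are hypothetical names introduced for illustration only.

```r
set.seed(123)

# Hypothetical data: 12 observations belonging to 4 groups.
dat <- data.frame(group = rep(c("a", "b", "c", "d"), times = 3),
                  value = rpois(12, lambda = 5))

nfolds <- 2

# Draw one fold label per unique group rather than per observation.
groups <- unique(dat$group)
group_fold <- sample(rep_len(seq_len(nfolds), length(groups)))

# Expand to one fold ID per row; match() preserves the row order of dat.
fold_ids <- group_fold[match(dat$group, groups)]

# Cross-tabulate to confirm that each group falls in exactly one fold.
table(dat$group, fold_ids)
```

A vector like fold_ids is the kind of object that would be supplied as IDs, and a single value such as testID = 1 would then hold out only that fold. match() is used here because it keeps the labels in the same row order as the data.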

Returns

A list with two elements:

  • rmse: root mean squared error (RMSE).
  • mu: conditional means.

Details

xvalidate carries out cross-validation with the user-provided IDs by holding out each one of them in turn, as in the k-fold procedure (unless testID is specified, in which case only that ID is used for validation). After filtering out the holdout sample, the function calls penhdfeppml_int or penhdfeppml_cluster_int to estimate the coefficients, predicts the conditional means for the held-out observations, and finally calculates the root mean squared error (RMSE).
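
The hold-out procedure described above can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: an unpenalized Poisson glm() without fixed effects stands in for penhdfeppml_int, and the simulated data, the helper name kfold_rmse and its arguments are all hypothetical.

```r
set.seed(42)

# Hypothetical data: Poisson outcome with two regressors.
n <- 200
x <- cbind(x1 = rnorm(n), x2 = rnorm(n))
y <- rpois(n, lambda = exp(0.5 + 0.3 * x[, "x1"] - 0.2 * x[, "x2"]))
ids <- sample(rep_len(1:5, n))  # one fold label per observation

# Hypothetical helper: hold out each fold in turn (or only `test_id`),
# fit on the remaining rows, predict the held-out means, compute the RMSE.
kfold_rmse <- function(y, x, ids, test_id = NULL) {
  dat <- data.frame(y = y, x)
  folds <- if (is.null(test_id)) sort(unique(ids)) else test_id
  mu <- rep(NA_real_, length(y))
  rmse <- numeric(length(folds))
  for (k in seq_along(folds)) {
    holdout <- ids == folds[k]
    # Stand-in estimator (assumption): plain Poisson regression on training rows.
    fit <- glm(y ~ ., family = poisson(), data = dat[!holdout, ])
    mu[holdout] <- predict(fit, newdata = dat[holdout, ], type = "response")
    rmse[k] <- sqrt(mean((y[holdout] - mu[holdout])^2))
  }
  # Average the per-fold RMSEs; mu stays NA for rows that were never held out.
  list(rmse = mean(rmse), mu = mu)
}

cv <- kfold_rmse(y, x, ids)
cv$rmse
```

Calling kfold_rmse(y, x, ids, test_id = 1) instead would evaluate only the first fold, mirroring the role that testID plays in xvalidate.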

References

Breinlich, H., Corradi, V., Rocha, N., Ruta, M., Santos Silva, J.M.C. and T. Zylkin (2021). "Machine Learning in International Trade Research: Evaluating the Impact of Trade Agreements", Policy Research Working Paper; No. 9629. World Bank, Washington, DC.

Correia, S., P. Guimaraes and T. Zylkin (2020). "Fast Poisson estimation with high dimensional fixed effects", Stata Journal, 20, 90-115.

Gaure, S. (2013). "OLS with multiple high dimensional category variables", Computational Statistics & Data Analysis, 66, 8-18.

Friedman, J., T. Hastie, and R. Tibshirani (2010). "Regularization paths for generalized linear models via coordinate descent", Journal of Statistical Software, 33, 1-22.

Belloni, A., V. Chernozhukov, C. Hansen and D. Kozbur (2016). "Inference in high dimensional panel models with an application to gun control", Journal of Business & Economic Statistics, 34, 590-605.

Examples

# First, we need to transform the data. Start by filtering the data set to keep only countries in
# the Americas:
americas <- countries$iso[countries$region == "Americas"]
trade <- trade[(trade$imp %in% americas) & (trade$exp %in% americas), ]

# Now generate the needed x, y and fes objects:
y <- trade$export
x <- data.matrix(trade[, -1:-6])
fes <- list(exp_time = interaction(trade$exp, trade$time),
            imp_time = interaction(trade$imp, trade$time),
            pair     = interaction(trade$exp, trade$imp))

# We also need to create the IDs. We split the data set by agreement, not observation:
id <- unique(trade[, 5])
nfolds <- 10
unique_ids <- data.frame(id = id, fold = sample(1:nfolds, size = length(id), replace = TRUE))
cross_ids <- merge(trade[, 5, drop = FALSE], unique_ids, by = "id", all.x = TRUE)

# Finally, we try xvalidate with a lasso penalty (the default) and a single lambda value:
## Not run: 
reg <- xvalidate(y = y, x = x, fes = fes, lambda = 0.001,
                 IDs = cross_ids$fold, verbose = TRUE)
## End(Not run)

  • Maintainer: Joao Cruz
  • License: MIT + file LICENSE
  • Last published: 2025-02-08