krscv function

Categorical Kernel Regression Spline Cross-Validation

krscv performs an exhaustive, cross-validation-directed search for a regression spline estimate of a one (1) dimensional dependent variable on an r-dimensional vector of continuous and nominal/ordinal (factor/ordered) predictors.

krscv(xz, y, degree.max = 10, segments.max = 10, degree.min = 0, segments.min = 1, restarts = 0, complexity = c("degree-knots","degree","knots"), knots = c("quantiles","uniform","auto"), basis = c("additive","tensor","glp","auto"), cv.func = c("cv.ls","cv.gcv","cv.aic"), degree = degree, segments = segments, tau = NULL, weights = NULL, singular.ok = FALSE)

Arguments

  • y: continuous univariate vector

  • xz: continuous and/or nominal/ordinal (factor/ordered) predictors

  • degree.max: the maximum degree of the B-spline basis for each of the continuous predictors (default degree.max=10)

  • segments.max: the maximum number of segments of the B-spline basis for each of the continuous predictors (default segments.max=10)

  • degree.min: the minimum degree of the B-spline basis for each of the continuous predictors (default degree.min=0)

  • segments.min: the minimum number of segments of the B-spline basis for each of the continuous predictors (default segments.min=1)

  • restarts: number of times to restart optim from different random initial values (default restarts=0) when searching for optimal bandwidths for the categorical predictors for each unique K combination (i.e. degree/segments)

  • complexity: a character string (default complexity="degree-knots") indicating whether model complexity is determined by the degree of the spline, by the number of segments (knots), or by both. This option allows the user to use cross-validation to select the spline degree (number of knots held fixed), the number of knots (spline degree held fixed), or both the spline degree and the number of knots

  • knots: a character string (default knots="quantiles") specifying where knots are to be placed. quantiles specifies knots placed at equally spaced quantiles (equal number of observations lie in each segment) and uniform specifies knots placed at equally spaced intervals. If knots="auto", the knot type will be automatically determined by cross-validation

  • basis: a character string (default basis="additive") indicating whether the additive or tensor product B-spline basis matrix for a multivariate polynomial spline or generalized B-spline polynomial basis should be used. Note this can be automatically determined by cross-validation if cv=TRUE and basis="auto", and is an all or none proposition (i.e. interaction terms for all predictors or for no predictors given the nature of tensor products). Note also that if there is only one predictor this defaults to basis="additive" to avoid unnecessary computation as the spline bases are equivalent in this case

  • cv.func: a character string (default cv.func="cv.ls") indicating which method to use to select smoothing parameters. cv.gcv specifies generalized cross-validation (Craven and Wahba (1979)), cv.aic specifies expected Kullback-Leibler cross-validation (Hurvich, Simonoff, and Tsai (1998)), and cv.ls specifies least-squares cross-validation

  • degree: integer/vector specifying the degree of the B-spline basis for each dimension of the continuous x

  • segments: integer/vector specifying the number of segments of the B-spline basis for each dimension of the continuous x (i.e. number of knots minus one)

  • tau: if non-null a number in (0,1) denoting the quantile for which a quantile regression spline is to be estimated rather than estimating the conditional mean (default tau=NULL)

  • weights: an optional vector of weights to be used in the fitting process. Should be NULL or a numeric vector. If non-NULL, weighted least squares is used with the supplied weights (that is, minimizing sum(w*e^2)); otherwise ordinary least squares is used.

  • singular.ok: a logical value (default singular.ok=FALSE) that, when FALSE, discards singular bases during cross-validation (a check for ill-conditioned bases is performed).
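The difference between the two knot placement rules, and the relation between segments and knots, can be illustrated with a short base-R sketch. It does not call the package: the simulated x and the names knots.quantiles and knots.uniform are illustrative assumptions, and the package's internal knot computation may differ in detail.

```r
## Sketch only: contrast quantile and uniform knot placement for one predictor.
set.seed(42)
x <- rexp(500)      # a skewed predictor makes the contrast visible
segments <- 4       # segments = number of knots minus one

## Knots at equally spaced quantiles: equal numbers of observations per segment
knots.quantiles <- quantile(x, probs = seq(0, 1, length.out = segments + 1))

## Knots at equally spaced points: equal segment widths
knots.uniform <- seq(min(x), max(x), length.out = segments + 1)

## Observations falling in each segment under the two rules
counts.quantiles <- table(cut(x, knots.quantiles, include.lowest = TRUE))
counts.uniform   <- table(cut(x, knots.uniform, include.lowest = TRUE))
print(counts.quantiles)
print(counts.uniform)
```

With skewed data the uniform rule leaves the upper segments sparsely populated, which is one reason the choice between the two can matter in practice.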

Details

krscv computes exhaustive cross-validation for a regression spline estimate of a one (1) dimensional dependent variable on an r-dimensional vector of continuous and nominal/ordinal (factor/ordered) predictors. The optimal K/lambda combination is returned along with other results (see below for return values). The method uses kernel functions appropriate for categorical (ordinal/nominal) predictors, which avoids the loss in efficiency associated with the sample-splitting procedures typically used when faced with a mix of continuous and nominal/ordinal (factor/ordered) predictors.

For the continuous predictors the regression spline model employs either the additive or tensor product B-spline basis matrix for a multivariate polynomial spline via the B-spline routines in the GNU Scientific Library (https://www.gnu.org/software/gsl/) and the tensor.prod.model.matrix function.
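As a rough illustration of how the additive and tensor product bases differ in size, the sketch below builds both for two continuous predictors. It is an assumption-laden stand-in: it uses splines::bs (shipped with R) rather than the GSL-based routines, and a hand-written row-wise Kronecker product rather than tensor.prod.model.matrix, so intercept handling and column ordering may differ from what the package constructs.

```r
## Sketch only: additive versus tensor product bases for two predictors.
library(splines)

set.seed(1)
n  <- 50
x1 <- runif(n)
x2 <- runif(n)

## Cubic B-spline bases with one segment each (3 columns per predictor);
## matrix() strips the "bs" class so these are plain numeric matrices
B1 <- matrix(bs(x1, degree = 3), nrow = n)
B2 <- matrix(bs(x2, degree = 3), nrow = n)

## Additive basis: columns are concatenated, so dimensions add (3 + 3 = 6)
additive <- cbind(B1, B2)

## Tensor product basis: all pairwise column products, row by row,
## so dimensions multiply (3 * 3 = 9)
tensor <- t(sapply(seq_len(n), function(i) kronecker(B1[i, ], B2[i, ])))

dim(additive)
dim(tensor)
```

The multiplicative growth of the tensor basis is why interaction terms come as an all or none choice and why the tensor option is more expensive to cross-validate.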

For the discrete predictors the product kernel function is of the Li-Racine type (see Li and Racine (2007) for details).
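A minimal sketch of categorical kernels of this general type is given below. The functional forms are assumptions chosen as commonly used examples (weight 1 for a match and lambda otherwise for nominal predictors, lambda raised to the absolute distance for ordinal predictors); the helper names kernel_unordered and kernel_ordered are hypothetical and the package's exact kernels may differ.

```r
## Sketch only: simple categorical kernels with bandwidth lambda in [0, 1].
kernel_unordered <- function(zi, z, lambda) ifelse(zi == z, 1, lambda)
kernel_ordered   <- function(zi, z, lambda) lambda^abs(zi - z)

z1 <- c(0, 1, 1, 0)   # nominal predictor
z2 <- c(1, 2, 3, 3)   # ordinal predictor

## Product kernel weights for evaluation point (z1 = 1, z2 = 3)
w <- kernel_unordered(z1, 1, 0.3) * kernel_ordered(z2, 3, 0.5)
print(w)

## lambda = 0 keeps only matching observations (sample splitting);
## lambda = 1 weights all observations equally (the predictor is ignored)
kernel_unordered(z1, 1, 0)
kernel_unordered(z1, 1, 1)
```

Intermediate values of lambda borrow information from neighbouring cells, which is the source of the efficiency gain over sample splitting.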

For each unique combination of degree and segment, numerical search for the bandwidth vector lambda is undertaken using optim and the box-constrained L-BFGS-B method (see optim for details). The user may restart the optim algorithm as many times as desired via the restarts argument. The approach ascends from K=0 through degree.max/segments.max and for each value of K searches for the optimal bandwidths for this value of K. After the most complex model has been searched, the optimal K/lambda combination is selected. If any element of the optimal K vector coincides with degree.max/segments.max a warning is produced and the user ought to restart their search with a larger value of degree.max/segments.max.
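The structure of this search can be sketched in a few lines of R. This is a simplified illustration under stated assumptions, not the package's implementation: it handles one continuous and one binary predictor, fixes the number of segments at one, starts at degree 1 (splines::bs does not support degree 0), uses splines::bs in place of the GSL routines, and uses a hypothetical helper cv_ls that computes leave-one-out least-squares cross-validation for a kernel-weighted spline fit.

```r
## Sketch only: for each degree K, optimise lambda with L-BFGS-B on [0, 1],
## then pick the K/lambda pair with the smallest cross-validation value.
library(splines)

set.seed(1)
n <- 200
x <- runif(n)
z <- rbinom(n, 1, 0.5)
y <- z + cos(2 * pi * x) + rnorm(n, sd = 0.1)

## Leave-one-out CV for degree K and bandwidth lambda (hypothetical helper)
cv_ls <- function(lambda, K) {
  B <- matrix(bs(x, degree = K, intercept = TRUE), nrow = n)
  loo <- numeric(n)
  for (zv in unique(z)) {
    own  <- z == zv
    w    <- ifelse(own, 1, lambda)              # categorical kernel weights
    BtWB <- crossprod(B * w, B)                 # B'WB
    beta <- solve(BtWB, crossprod(B * w, y))    # weighted least squares fit
    Bo   <- B[own, , drop = FALSE]
    h    <- rowSums((Bo %*% solve(BtWB)) * Bo)  # leverages (own weight is 1)
    loo[own] <- drop(y[own] - Bo %*% beta) / (1 - h)
  }
  mean(loo^2)
}

degree.max <- 5
results <- do.call(rbind, lapply(1:degree.max, function(K) {
  opt <- optim(par = 0.5, fn = cv_ls, K = K,
               method = "L-BFGS-B", lower = 0, upper = 1)
  c(K = K, lambda = opt$par, cv = opt$value)
}))

best <- results[which.min(results[, "cv"]), ]
print(results)
print(best)
if (best["K"] == degree.max)
  message("optimal degree equals degree.max; consider a larger degree.max")
```

Restarts would correspond to repeating the optim call from several random starting values of lambda and keeping the best result for each K.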

Returns

krscv returns a crscv object. Furthermore, the function summary supports objects of this type. The returned objects have the following components:

  • K: scalar/vector containing optimal degree(s) of spline or number of segments

  • K.mat: vector/matrix of values of K evaluated during search

  • restarts: number of restarts during search, if any

  • lambda: optimal bandwidths for categorical predictors

  • lambda.mat: vector/matrix of optimal bandwidths for each degree of spline

  • cv.func: objective function value at optimum

  • cv.func.vec: vector of objective function values at each degree of spline or number of segments in K.mat

References

Craven, P. and G. Wahba (1979), Smoothing Noisy Data With Spline Functions, Numerische Mathematik, 31, 377-403.

Hurvich, C.M. and J.S. Simonoff and C.L. Tsai (1998), Smoothing Parameter Selection in Nonparametric Regression Using an Improved Akaike Information Criterion, Journal of the Royal Statistical Society B, 60, 271-293.

Li, Q. and J.S. Racine (2007), Nonparametric Econometrics: Theory and Practice, Princeton University Press.

Ma, S. and J.S. Racine and L. Yang (2015), Spline Regression in the Presence of Categorical Predictors, Journal of Applied Econometrics, Volume 30, 705-717.

Ma, S. and J.S. Racine (2013), Additive Regression Splines with Irrelevant Categorical and Continuous Regressors, Statistica Sinica, Volume 23, 515-541.

Author(s)

Jeffrey S. Racine racinej@mcmaster.ca

See Also

loess, npregbw

Examples

library(crs)
set.seed(42)
## Simulated data
n <- 1000
x <- runif(n)
z <- round(runif(n,min=-0.5,max=1.5))
z.unique <- uniquecombs(as.matrix(z))
ind <- attr(z.unique,"index")
ind.vals <- sort(unique(ind))
dgp <- numeric(length=n)
for(i in 1:nrow(z.unique)) {
  zz <- ind == ind.vals[i]
  dgp[zz] <- z[zz]+cos(2*pi*x[zz])
}
y <- dgp + rnorm(n,sd=.1)
xdata <- data.frame(x,z=factor(z))
## Compute the optimal K and lambda, determine optimal number of knots, set
## spline degree for x to 3
cv <- krscv(xz=xdata,y=y,complexity="knots",degree=c(3))
summary(cv)
  • Maintainer: Jeffrey S. Racine
  • License: GPL (>= 3)
  • Last published: 2024-09-29