ca function

Empirical Classification Analysis (CA) and Inference

ca conducts CA estimation and inference on a user-specified object of interest: the first (weighted) moment or the (weighted) distribution. Use t to specify the variables of interest. When the object of interest is the moment, use cl to choose between reporting the averages of the two groups (most and least affected) and the difference between them.

ca(
  fm,
  data,
  method = c("ols", "logit", "probit", "QR"),
  var_type = c("binary", "continuous", "categorical"),
  var,
  compare,
  subgroup = NULL,
  samp_weight = NULL,
  taus = c(5:95)/100,
  u = 0.1,
  interest = c("moment", "dist"),
  t = c(1, 1, rep(0, dim(data)[2] - 2)),
  cl = c("both", "diff"),
  cat = NULL,
  alpha = 0.1,
  b = 500,
  parallel = FALSE,
  ncores = detectCores(),
  seed = 1,
  bc = TRUE,
  range_cb = c(1:99)/100,
  boot_type = c("nonpar", "weighted")
)

Arguments

  • fm: Regression formula

  • data: The data in use: full sample or subpopulation of interest.

  • method: Model to be used for estimating partial effects. Four options: "logit" (binary response), "probit" (binary response), "ols" (interactive linear with additive errors), "QR" (linear model with non-additive errors). Default is "ols".

  • var_type: The type of the variable of interest. Three options: "binary", "categorical", "continuous". Default is "binary".

  • var: Variable T of interest. Should be a character.

  • compare: If the variable of interest is categorical, the user needs to specify which two categories to compare. Should be a 1 by 2 character vector. For example, if the two levels to compare are 1 and 3, then compare = c("1", "3"), which calculates the partial effect of moving from 1 to 3. To use this option, users first need to specify var as a factor variable.

  • subgroup: Subgroup of interest. Default is NULL. The specification should be a logical vector. For example, suppose data contains an indicator variable for women (female if 1, male if 0). If users are interested in the SPE for women, they should specify subgroup = data[, "female"] == 1.

  • samp_weight: Sampling weights of the data. Input should be an n by 1 vector, where n denotes the sample size. Default is NULL.

  • taus: Indexes for quantile regression. Default is c(5:95)/100.

  • u: Percentile that defines the most and least affected groups. Default is 0.1.

  • interest: Object of interest in the least and most affected subpopulations. Two options: (1) "moment": weighted mean of Z in the u-least/most affected subpopulation. (2) "dist": distribution of Z in the u-least/most affected subpopulation. Default is interest = "moment".

  • t: An index of the variables of interest for the ca object. Users can either (1) specify the names of the variables of interest directly, or (2) supply a 1 by ncol(data) indicator vector in which 1 marks a variable of interest. For example, if the total number of variables is 5 and the 1st and 3rd variables are of interest, then specify t = c(1, 0, 1, 0, 0).

  • cl: If interest = "moment", cl allows the user to get the variables of interest (specified in the t option) of the most and least affected groups. The default is "both", which shows the variables of the two groups; the alternative is "diff", which shows the difference between the two groups. The user can use summary.ca to tabulate the results, which also contain the standard errors and p-values. If interest = "dist", this option has no effect and the user can leave it at the default value.

  • cat: P-values in classification analysis are adjusted for multiplicity to account for joint testing of zero coefficients for all variables within a category. Suppose we have specified 3 variables of interest: t = c("a", "b", "c"). Without loss of generality, assume "a" is not a factor, while "b" and "c" are two factors. Then users need to specify cat = c("b", "c"). Default is NULL.

  • alpha: Size for the confidence interval. Should be between 0 and 1. Default is 0.1.

  • b: Number of bootstrap draws. Default is 500.

  • parallel: Whether to use parallel computation. The default is FALSE, in which case only 1 CPU is used. If TRUE, the user can specify the number of CPUs in the ncores option.

  • ncores: Number of cores for computation. Default is detectCores(), a function from package parallel that detects the number of CPUs on the current host. For large datasets, parallel computing is highly recommended since the bootstrap is time-consuming.

  • seed: Seed for pseudo-random number generation, used for reproducibility. Default is 1.

  • bc: Whether the estimates should be bias-corrected. Default is TRUE. If FALSE, uncorrected estimates and the corresponding confidence bands are reported.

  • range_cb: When interest = "dist", the package sorts the variables of interest and keeps their unique values to estimate the weighted CDF. For large datasets, storing a very large number of observations can cause memory problems, so users can provide quantile indexes here and the package will sort and deduplicate based on the corresponding weighted quantiles instead. If users don't want this feature, set range_cb = NULL. Default is c(1:99)/100.

  • boot_type: Type of bootstrap. Default is "nonpar", and the package implements nonparametric bootstrap. The alternative is "weighted", and the package implements weighted bootstrap.
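
The argument shapes described above can be sketched in base R before calling ca. This is only an illustration: the data frame dat and its column names are hypothetical, and nothing here calls the package itself.

```r
# Toy data; the column names are hypothetical and only serve the illustration.
dat <- data.frame(
  y      = c(1, 0, 1, 0, 1, 0),
  female = c(1, 0, 1, 1, 0, 0),
  age    = c(25, 40, 31, 52, 47, 36),
  educ   = factor(c("1", "2", "3", "1", "3", "2"))
)

# t as a 1 by ncol(data) indicator vector: 1 marks a variable of interest.
vars_of_interest <- c("y", "age")
t_ind <- as.numeric(names(dat) %in% vars_of_interest)

# subgroup as a logical vector with one entry per observation.
sub_female <- dat[, "female"] == 1

# compare as a length-2 character vector naming two levels of a factor.
compare_lv <- c("1", "3")
stopifnot(all(compare_lv %in% levels(dat$educ)))

# samp_weight as a vector with one weight per observation (equal weights here).
w <- rep(1, nrow(dat))
```

These objects would then be passed as t = t_ind, subgroup = sub_female, compare = compare_lv and samp_weight = w.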

Returns

If subgroup = NULL, all outputs are for the whole sample. Otherwise, outputs are subgroup results. When interest = "moment", the output is a list showing

  • est Estimates of variables in interest.
  • bse Bootstrap standard errors.
  • joint_p P-values that are adjusted for multiplicity to account for joint testing for all variables.
  • pointwise_p P-values that do not adjust for joint testing.

If users have further specified cat (i.e., !is.null(cat)), the fourth component will be replaced with p_cat: P-values that are adjusted for multiplicity to account for joint testing for all variables within a category. Users can use summary.ca to tabulate the results.

When interest = "dist", the output is a list of two components:

  • infresults A list that stores estimates, upper and lower confidence bounds for all variables in interest for least and most affected groups.
  • sortvar A list that stores sorted and unique variables in interest.

We recommend using plot.ca to visualize the results.
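
As a rough picture of what the "dist" output describes, the following base-R sketch evaluates a weighted empirical CDF on the sorted unique values of a variable. The function weighted_cdf and its arguments are hypothetical helpers written for this illustration; they are not part of the package, and the package's own computation may differ.

```r
# Weighted empirical CDF of z evaluated on a grid of points.
# By default the grid is the sorted unique values of z.
weighted_cdf <- function(z, w = rep(1, length(z)), grid = sort(unique(z))) {
  w <- w / sum(w)  # normalize the weights to sum to one
  cdf <- vapply(grid, function(g) sum(w[z <= g]), numeric(1))
  list(grid = grid, cdf = cdf)
}

z <- c(1, 2, 2, 3)
res_cdf <- weighted_cdf(z)

# A coarser grid (here three quantiles of z) keeps fewer evaluation points,
# which is the idea behind restricting the grid for large datasets.
coarse <- weighted_cdf(z, grid = unique(quantile(z, c(0.25, 0.5, 0.75), names = FALSE)))
```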

Details

Unless bc = FALSE, all estimates are bias-corrected. All confidence bands are monotonized. The bootstrap procedures follow Algorithm 2.2 in Chernozhukov, Fernandez-Val and Luo (2018).
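
To make the "moment" object concrete, here is a minimal base-R sketch of comparing a characteristic Z between the u-most and u-least affected groups. It assumes a vector of already estimated partial effects (pe), uses unweighted quantiles for the cutoffs as a simplification, and omits the bias correction and bootstrap inference; classify_moments is a hypothetical helper, not the package's implementation.

```r
# pe: estimated partial effects (assumed given), z: a characteristic,
# u: share defining the least and most affected groups, w: sampling weights.
classify_moments <- function(pe, z, u = 0.1, w = rep(1, length(pe))) {
  lo_cut <- quantile(pe, probs = u, names = FALSE)      # cutoff for least affected
  hi_cut <- quantile(pe, probs = 1 - u, names = FALSE)  # cutoff for most affected
  least <- pe <= lo_cut
  most  <- pe >= hi_cut
  m_most  <- weighted.mean(z[most], w[most])
  m_least <- weighted.mean(z[least], w[least])
  # "both" corresponds to reporting the two means, "diff" to their difference.
  c(most = m_most, least = m_least, diff = m_most - m_least)
}

pe <- 1:10
z  <- 2 * (1:10)
res_mom <- classify_moments(pe, z, u = 0.2)
```

With u = 0.2 the two lowest and two highest partial effects define the groups, so the sketch compares the mean of z over observations 1 and 2 with the mean over observations 9 and 10.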

Examples

data("mortgage")
### Regression Specification
fm <- deny ~ black + p_irat + hse_inc + ccred + mcred + pubrec +
  ltv_med + ltv_high + denpmi + selfemp + single + hischl
### Specify characteristics of interest
t <- c("deny", "p_irat", "black", "hse_inc", "ccred", "mcred", "pubrec",
       "denpmi", "selfemp", "single", "hischl", "ltv_med", "ltv_high")
### issue ca command
CA <- ca(fm = fm, data = mortgage, var = "black", method = "logit",
         cl = "diff", t = t, b = 50, bc = TRUE)