Empirical Classification Analysis (CA) and Inference
ca conducts CA estimation and inference on user-specified objects of interest: the first (weighted) moment or the (weighted) distribution. Users can use t to specify the variables of interest. When the object of interest is a moment, use cl to specify whether to report the averages of the two groups or the difference between them.
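To make the "moment" object concrete, the following is a minimal sketch in base R. It is not the package's internal code. It assumes a hypothetical vector pe of estimated partial effects, a characteristic z, and sampling weights w, all simulated here. Units are classified into the u-least and u-most affected groups using unweighted quantiles of pe (a simplification), and the weighted means of z in the two groups are then reported either side by side or as a difference. The sign convention of the difference (most minus least) is also an assumption.

```r
set.seed(1)
n  <- 1000
pe <- rnorm(n)               # hypothetical estimated partial effects
z  <- 0.5 * pe + rnorm(n)    # hypothetical characteristic of interest
w  <- rep(1, n)              # sampling weights (equal weights here)
u  <- 0.1                    # fraction defining least/most affected groups

# Classify units by where their partial effect falls in the distribution
lo_cut <- quantile(pe, probs = u)
hi_cut <- quantile(pe, probs = 1 - u)
least  <- pe <= lo_cut       # u-least affected group
most   <- pe >= hi_cut       # u-most affected group

# Weighted mean of z within each group
mean_least <- weighted.mean(z[least], w[least])
mean_most  <- weighted.mean(z[most],  w[most])

# Analogue of reporting both group averages versus their difference
both    <- c(most = mean_most, least = mean_least)
diff_ml <- mean_most - mean_least   # assumed direction: most minus least
```

Because z is generated to increase with pe, the most affected group has the larger average of z in this toy example.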
data: The data in use: the full sample or the subpopulation of interest.
method: Model used for estimating partial effects. Four options: "logit" (binary response), "probit" (binary response), "ols" (interactive linear with additive errors), "QR" (linear model with non-additive errors). Default is "ols".
var_type: The type of the parameter of interest. Three options: "binary", "categorical", "continuous". Default is "binary".
var: Variable T of interest. Should be a character.
compare: If the parameter of interest is categorical, the user needs to specify which two categories to compare. Should be a 1 by 2 character vector. For example, if the two levels to compare are 1 and 3, then compare = c("1", "3"), which calculates the partial effect from 1 to 3. To use this option, users first need to specify var as a factor variable.
subgroup: Subgroup of interest. Default is NULL. The specification should be a logical variable. For example, suppose data contains an indicator variable for women (female if 1, male if 0). If users are interested in the SPE for women, they should specify subgroup = data[, "female"] == 1.
samp_weight: Sampling weight of data. Input should be an n by 1 vector, where n denotes the sample size. Default is NULL.
taus: Indexes for quantile regression. Default is c(5:95)/100.
u: Percentile defining the most and least affected groups. Default is 0.1.
interest: Generic objects in the least and most affected subpopulations. Two options: (1) "moment": weighted mean of Z in the u-least/most affected subpopulation. (2) "dist": distribution of Z in the u-least/most affected subpopulation. Default is interest = "moment".
t: An index for the ca object. Should be a 1 by ncol(data) indicator vector. Users can either (1) specify the names of the variables of interest directly, or (2) use 1 to indicate a variable of interest. For example, if the total number of variables is 5 and the 1st and 3rd variables are of interest, then specify t = c(1, 0, 1, 0, 0).
cl: If interest = "moment", cl allows the user to get the variables of interest (specified in the t option) of the most and least affected groups. The default is "both", which shows the variables of the two groups; the alternative is "diff", which shows the difference between the two groups. The user can use summary.ca to tabulate the results, which also contain the standard errors and p-values. If interest = "dist", this option has no bearing and the user can leave it at the default value.
cat: P-values in classification analysis are adjusted for multiplicity to account for joint testing of zero coefficients for all variables within a category. Suppose we have specified 3 variables of interest: t = c("a", "b", "c"). Without loss of generality, assume "a" is not a factor, while "b" and "c" are two factors. Then users need to specify cat = c("b", "c"). Default is NULL.
alpha: Significance level for the confidence intervals. Should be between 0 and 1. Default is 0.1.
b: Number of bootstrap draws. Default is 500.
parallel: Whether the user wants to use parallel computation. The default is FALSE and only 1 CPU will be used. The other option is TRUE, and user can specify the number of CPUs in the ncores option.
ncores: Number of cores for computation. Default is detectCores(), a function from package parallel that detects the number of CPUs on the current host. For large datasets, parallel computing is highly recommended since the bootstrap is time-consuming.
seed: Seed for pseudo-random number generation, for reproducibility. Default is 1.
bc: Whether the estimates should be bias-corrected. Default is TRUE. If FALSE, uncorrected estimates and the corresponding confidence bands are reported.
range_cb: When interest = "dist", the package sorts the variables of interest and keeps their unique values to estimate the weighted CDF. For large datasets, storing that many observations can cause memory problems, so users can provide a vector of quantile indexes, and the package will sort and keep unique values based on the weighted quantiles at those indexes. To turn this feature off, set range_cb = NULL. Default is c(1:99)/100.
boot_type: Type of bootstrap. Default is "nonpar", which implements the nonparametric bootstrap. The alternative is "weighted", which implements the weighted bootstrap.
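Several of the arguments above expect inputs in a specific shape. The sketch below shows one way to build them in base R. The data frame and its column names are hypothetical and used only for illustration; the objects are constructed but no call to ca is made.

```r
# Hypothetical toy data; column names are illustrative only
data <- data.frame(
  y      = c(1, 0, 1, 0, 1, 0),
  female = c(1, 0, 1, 1, 0, 0),
  educ   = c(1, 2, 3, 1, 3, 2),
  age    = c(34, 51, 29, 42, 38, 60),
  inc    = c(2.1, 3.4, 1.8, 2.9, 3.0, 4.2)
)

# subgroup: a logical vector with one entry per observation
subgroup <- data[, "female"] == 1

# compare: the variable must be a factor, and both levels must exist
data$educ <- factor(data$educ)
compare   <- c("1", "3")
stopifnot(all(compare %in% levels(data$educ)))

# t: either variable names or an indicator vector of length ncol(data)
t_names <- c("y", "educ")
t_ind   <- as.numeric(colnames(data) %in% t_names)

# taus: the documented default grid of quantile indexes, 0.05 to 0.95
taus <- c(5:95) / 100

# ncores: detectCores() can return NA on some platforms, so fall back to 1
ncores <- parallel::detectCores()
if (is.na(ncores)) ncores <- 1
```

Here t_ind equals c(1, 0, 1, 0, 0), which selects the 1st and 3rd columns, the same selection as t_names.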
Returns
If subgroup = NULL, all outputs are for the whole sample. Otherwise the outputs are subgroup results. When interest = "moment", the output is a list showing
est Estimates of the variables of interest.
bse Bootstrap standard errors.
joint_p P-values that are adjusted for multiplicity to account for joint testing of all variables.
pointwise_p P-values that are not adjusted for joint testing.
If users have further specified cat (i.e., !is.null(cat)), the fourth component will be replaced with p_cat: P-values that are adjusted for multiplicity to account for joint testing of all variables within a category. Users can use summary.ca to tabulate the results.
When interest = "dist", the output is a list of two components:
infresults A list that stores the estimates and the upper and lower confidence bounds for all variables of interest in the least and most affected groups.
sortvar A list that stores the sorted unique values of the variables of interest.
We recommend using the plot.ca command to visualize the results.
Details
Estimates are bias-corrected by default (see bc), and all confidence bands are monotonized. The bootstrap procedures follow Algorithm 2.2 in Chernozhukov, Fernandez-Val and Luo (2018).
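The two resampling schemes available through boot_type can be illustrated on a simple weighted mean. This is a generic sketch in base R and not the package's implementation of the cited algorithm. The toy data, the statistic, and the use of standard exponential draws to perturb the weights in the weighted bootstrap are all assumptions made for illustration.

```r
set.seed(1)
n <- 200
x <- rnorm(n, mean = 2)   # toy outcome
w <- rep(1, n)            # sampling weights (equal weights here)
b <- 500                  # number of bootstrap draws

# Nonparametric bootstrap: resample observations with replacement
boot_nonpar <- replicate(b, {
  idx <- sample.int(n, n, replace = TRUE)
  weighted.mean(x[idx], w[idx])
})

# Weighted bootstrap: keep the sample fixed and perturb the weights
# with i.i.d. standard exponential draws (an assumed choice)
boot_weighted <- replicate(b, {
  e <- rexp(n)
  weighted.mean(x, w * e)
})

# Bootstrap standard errors of the weighted mean under each scheme
se_nonpar   <- sd(boot_nonpar)
se_weighted <- sd(boot_weighted)
```

In this toy setting both standard errors should be close to 1 / sqrt(n), since x has unit variance. The weighted scheme never drops an observation from a draw, which can help when some categories are rare.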
Examples
data("mortgage")
### Regression Specification
fm <- deny ~ black + p_irat + hse_inc + ccred + mcred + pubrec +
  ltv_med + ltv_high + denpmi + selfemp + single + hischl
### Specify characteristics of interest
t <- c("deny", "p_irat", "black", "hse_inc", "ccred", "mcred", "pubrec",
  "denpmi", "selfemp", "single", "hischl", "ltv_med", "ltv_high")
### issue ca command
CA <- ca(fm = fm, data = mortgage, var = "black", method = "logit",
  cl = "diff", t = t, b = 50, bc = TRUE)