ca() R function from [SortedEffects]

Empirical Classification Analysis (CA) and Inference

ca conducts CA estimation and inference on user-specified objects of interest: the first (weighted) moment or the (weighted) distribution. Users can use t to specify the variables of interest. When the object of interest is a moment, use cl to specify whether to report the averages of the two groups or the difference between them.


ca(
  fm,
  data,
  method = c("ols", "logit", "probit", "QR"),
  var_type = c("binary", "continuous", "categorical"),
  var,
  compare,
  subgroup = NULL,
  samp_weight = NULL,
  taus = c(5:95)/100,
  u = 0.1,
  interest = c("moment", "dist"),
  t = c(1, 1, rep(0, dim(data)[2] - 2)),
  cl = c("both", "diff"),
  cat = NULL,
  alpha = 0.1,
  b = 500,
  parallel = FALSE,
  ncores = detectCores(),
  seed = 1,
  bc = TRUE,
  range_cb = c(1:99)/100,
  boot_type = c("nonpar", "weighted")
)

Arguments

fm: Regression formula
data: The data in use: full sample or subpopulation of interest
method: Model used for estimating partial effects. Four options: "logit" (binary response), "probit" (binary response), "ols" (interactive linear with additive errors), "QR" (linear model with non-additive errors). Default is "ols".
var_type: The type of the variable of interest. Three options: "binary", "categorical", "continuous". Default is "binary".
var: Variable T of interest. Should be a character string.
compare: If the variable of interest is categorical, the user needs to specify which two categories to compare. Should be a 1 by 2 character vector. For example, if the two levels to compare are 1 and 3, then compare = c("1", "3"), which will calculate the partial effect from 1 to 3. To use this option, users first need to specify var as a factor variable.
subgroup: Subgroup of interest. Default is NULL. The specification should be a logical vector. For example, suppose data contain an indicator variable for women (female if 1, male if 0). If users are interested in the SPE for women, they should specify subgroup = data[, "female"] == 1.
samp_weight: Sampling weight of data. Input should be an n by 1 vector, where n denotes the sample size. Default is NULL.
taus: Indexes for quantile regression. Default is c(5:95)/100.
u: Percentile that defines the most and least affected groups. Default is 0.1.
interest: Generic objects in the least and most affected subpopulations. Two options: (1) "moment": weighted mean of Z in the u-least/most affected subpopulation. (2) "dist": distribution of Z in the u-least/most affected subpopulation. Default is interest = "moment".
t: An index for the ca object. Users can either (1) specify the names of the variables of interest directly, or (2) supply a 1 by ncol(data) indicator vector in which 1 marks a variable of interest. For example, if the total number of variables is 5 and the 1st and 3rd variables are of interest, specify t = c(1, 0, 1, 0, 0).
cl: If interest = "moment", cl controls how the variables of interest (specified in the t option) are reported for the most and least affected groups. The default is "both", which shows the variables for the two groups; the alternative is "diff", which shows the difference between the two groups. The user can use summary.ca to tabulate the results, which also contain the standard errors and p-values. If interest = "dist", this option has no bearing and the user can leave it at the default value.
cat: P-values in classification analysis are adjusted for multiplicity to account for joint testing of zero coefficients for all variables within a category. Suppose we have specified 3 variables of interest: t = c("a", "b", "c"). Without loss of generality, assume "a" is not a factor, while "b" and "c" are two factors. Then users need to specify cat = c("b", "c"). Default is NULL.
alpha: Size for the confidence interval. Should be between 0 and 1. Default is 0.1.
b: Number of bootstrap draws. Default is 500.
parallel: Whether the user wants to use parallel computation. The default is FALSE and only 1 CPU will be used. The other option is TRUE, and user can specify the number of CPUs in the ncores option.
ncores: Number of cores for computation. Default is detectCores(), a function from the package parallel that detects the number of CPUs on the current host. For large datasets, parallel computing is highly recommended since the bootstrap is time-consuming.
seed: Seed for pseudo-random number generation, for reproducibility. Default is 1.
bc: Whether the estimate should be bias-corrected. Default is TRUE. If FALSE, the uncorrected estimate and the corresponding confidence bands will be reported.
range_cb: When interest = "dist", we sort and unique the variables of interest to estimate the weighted CDF. For large datasets there can be memory problems storing very many observations, and thus users can provide a Sort value and the package will sort and unique based on the weighted quantile of Sort. If users don't want this feature, set range_cb = NULL. Default is c(1:99)/100.
boot_type: Type of bootstrap. Default is "nonpar", and the package implements nonparametric bootstrap. The alternative is "weighted", and the package implements weighted bootstrap.
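
The input conventions for t, subgroup and compare described above can be sketched with base R alone. The data frame d and its columns below are hypothetical and are not shipped with SortedEffects; the objects built here are only the kind of inputs that would then be passed to ca.

```r
# Hypothetical data, for illustration only (not part of SortedEffects)
d <- data.frame(
  y      = c(1, 0, 0, 1, 0, 1),
  female = c(1, 0, 1, 1, 0, 0),
  educ   = factor(c("1", "2", "3", "1", "3", "2")),
  age    = c(31, 45, 28, 52, 39, 40)
)

# t as an indicator vector: one entry per column of the data,
# with 1 marking a variable of interest (here the 1st and 4th columns)
t_ind <- as.numeric(names(d) %in% c("y", "age"))

# t as variable names: the same selection expressed directly
t_names <- names(d)[t_ind == 1]

# subgroup: a logical vector with one entry per observation
sub_women <- d[, "female"] == 1

# compare: two levels of a factor variable, given as a character vector
cmp <- c("1", "3")
stopifnot(is.factor(d$educ), all(cmp %in% levels(d$educ)))
```

Checking that the compared levels actually exist in the factor before calling ca is a cheap way to catch a mistyped level early.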

Returns

If subgroup = NULL, all outputs are for the whole sample. Otherwise, the outputs are subgroup results. When interest = "moment", the output is a list showing

est Estimates of variables of interest.
bse Bootstrap standard errors.
joint_p P-values that are adjusted for multiplicity to account for joint testing for all variables.
pointwise_p P-values that are not adjusted for joint testing.

If users have further specified cat (i.e., !is.null(cat)), the fourth component will be replaced with p_cat: P-values that are adjusted for multiplicity to account for joint testing for all variables within a category. Users can use summary.ca to tabulate the results.

When interest = "dist", the output is a list of two components:

infresults A list that stores estimates, upper and lower confidence bounds for all variables in interest for least and most affected groups.
sortvar A list that stores sorted and unique variables in interest.

We recommend using plot.ca command for result visualization.

Details

By default (bc = TRUE), all estimates are bias-corrected, and all confidence bands are monotonized. The bootstrap procedures follow algorithm 2.2 as in Chernozhukov, Fernandez-Val and Luo (2018).
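
To make the idea of classification analysis concrete, the following is a minimal base R sketch of the point estimate only, on simulated data. It is a simplified illustration under stated assumptions, not the package's implementation: it has no bootstrap, no bias correction, no weights and no inference, and the variable names (x, d, y, pe) are made up for the example. The sign convention used for the difference (most minus least affected) is also an assumption here.

```r
set.seed(1)
n <- 2000
x <- rnorm(n)            # a characteristic of interest (simulated)
d <- rbinom(n, 1, 0.3)   # binary variable whose effect is studied
# The effect of d varies with x through the interaction term
p <- plogis(-1 + 0.5 * d + 0.8 * x + 0.6 * d * x)
y <- rbinom(n, 1, p)
dat <- data.frame(y = y, d = d, x = x)

# Logit model with an interaction, fitted with base R
fit <- glm(y ~ d * x, family = binomial(link = "logit"), data = dat)

# Partial effect of d for each observation:
# predicted probability with d = 1 minus predicted probability with d = 0
pe <- predict(fit, newdata = transform(dat, d = 1), type = "response") -
  predict(fit, newdata = transform(dat, d = 0), type = "response")

# Classify observations into the u-least and u-most affected groups
u <- 0.1
least <- pe <= quantile(pe, u)
most  <- pe >= quantile(pe, 1 - u)

# Analogue of cl = "both": mean of x in each group
ca_both <- c(least = mean(dat$x[least]), most = mean(dat$x[most]))
# Analogue of cl = "diff" (assumed here to be most minus least)
ca_diff <- ca_both[["most"]] - ca_both[["least"]]
```

In this sketch the groups are defined by the empirical u and 1 - u quantiles of the estimated partial effects, so each group contains roughly a share u of the sample.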

Examples


library(SortedEffects)
data("mortgage")
### Regression specification
fm <- deny ~ black + p_irat + hse_inc + ccred + mcred + pubrec +
  ltv_med + ltv_high + denpmi + selfemp + single + hischl
### Specify characteristics of interest
t <- c("deny", "p_irat", "black", "hse_inc", "ccred", "mcred", "pubrec",
       "denpmi", "selfemp", "single", "hischl", "ltv_med", "ltv_high")
### Issue the ca command (b = 50 bootstrap draws to keep the run short)
CA <- ca(fm = fm, data = mortgage, var = "black", method = "logit",
         cl = "diff", t = t, b = 50, bc = TRUE)

SortedEffects package

Maintainer: Shuowen Chen
License: MIT + file LICENSE
Last published: 2022-03-22
https://github.com/shuowencs/SortedEffects