bigergm: Exponential-family random graph models for large networks with local dependence
bigergm: Exponential-family random graph models for large networks with local dependence
The function bigergm estimates and simulates three classes of exponential-family random graph models for large networks under local dependence:
The p_1 model of Holland and Leinhardt (1981) in exponential-family form and extensions by Vu, Hunter, and Schweinberger (2013), Schweinberger, Petrescu-Prahova, and Vu (2014), Dahbura et al. (2021), and Fritz et al. (2024) to both directed and undirected random graphs with additional model terms, with and without covariates.
The stochastic block model of Snijders and Nowicki (1997) and Nowicki and Snijders (2001) in exponential-family form.
The exponential-family random graph models with local dependence of Schweinberger and Handcock (2015), with and without covariates. The exponential-family random graph models with local dependence replace the long-range dependence of conventional exponential-family random graph models by short-range dependence. Therefore, exponential-family random graph models with local dependence replace the strong dependence of conventional exponential-family random graph models by weak dependence, reducing the problem of model degeneracy (Handcock, 2003; Schweinberger, 2011) and improving goodness-of-fit (Schweinberger and Handcock, 2015). In addition, exponential-family random graph models with local dependence satisfy a weak form of self-consistency in the sense that these models are self-consistent under neighborhood sampling (Schweinberger and Handcock, 2015), which enables consistent estimation of neighborhood-dependent parameters (Schweinberger and Stewart, 2017; Schweinberger, 2017).
object: An formula object or bigergm class object. If a formula is given, the function estimates a new model specified by it. It needs to be of the form y ~ <model terms>, where y is a network object. For the details on the possible <model terms>, see ergmTerm and Morris, Handcock and Hunter (2008). All terms that induce dependence are excluded from the between block model, while the within block model includes all terms. When you pass a bigergm class object to the function, you continue from the previous MM step. Note that the block allocation (which is either provided by parameter blocks or estimated in the first step) is saved as the vertex.attribute block of the network. This attribute can also be used in the specified formula. The L-ergmTerm is supported to enable size-dependent coefficients for the within-blocks model. Note, however, that for size-dependent parameters of terms that are included in the between-blocks model, the intercept in the linear model provided to L-ergmTerm should not include the intercept. See the second example below for a demonstration.
add_intercepts: Boolean value to indicate whether adequate intercepts should be added to the provided formula so that the model in the first stage of the estimation is a nested model of the estimated model in the second stage of the estimation.
n_blocks: The number of blocks. This must be specified by the user. When you pass a bigergm class object to the function, you don't have to specify this argument.
n_cores: The number of CPU cores to use.
blocks: The pre-specified block memberships for each node. If NULL, the latent community structure is estimated, assuming that the number of communities is n_blocks.
estimate_parameters: If TRUE, both clustering and parameter estimation are implemented. If FALSE, only clustering is executed.
verbose: A logical or an integer: if this is TRUE/1, the program will print out additional information about the progress of estimation and simulation. A higher value yields lower level information.
n_MM_step_max: The maximum number of MM iterations. Currently, no early stopping criteria is introduced. Thus n_MM_step_max MM iterations are exactly implemented.
tol_MM_step: Tolerance regarding the relative change of the lower bound of the likelihood used to decide on the convergence of the clustering step
initialization: How the blocks should be initialized. If infomap (the default), igraph' or Python's infomap is implemented. If random, the initial clusters are randomly uniformally selected. If spectral, spectral clustering is conducted. If walktrap, the walktrap clustering algorithm as implemented in cluster_walktrap is conducted. If initialization
is a vector of integers of the same length as the number of nodes in the provided network (in object), then the provided vector is used as the initial cluster assignment. If initialization is a string relating to a file path, bigergm will interpret it as block allocations saved in Python's infomap .clu format under that path.
use_infomap_python: If TRUE, the cluster initialization is implemented using Pythons' infomap.
virtualenv_python: Which virtual environment should be used for the infomap algorithm?
seed_infomap: seed value (integer) for the infomap algorithm, which can be used to initialize the estimation of the blocks.
weight_for_initialization: weight value used for cluster initialization. The higher this value, the more weight is put on the initialized block allocation.
seed: seed value (integer) for the random number generator.
method_within: If "MPLE" (the default), then the maximum pseudolikelihood estimator is implemented when estimating the within-block network model. If "MLE", then an approximate maximum likelihood estimator is conducted. If "CD" (EXPERIMENTAL), the Monte-Carlo contrastive divergence estimate is returned.
control_within: A list of control parameters for the ergm function used to estimate the parameters of the within model. See control.ergm for details.
clustering_with_features: If TRUE, clustering is implemented using the discrete covariates specified in the formula.
compute_pi: If TRUE, this function keeps track of pi matrices at each MM iteration. If the network is large, we strongly recommend to set to be FALSE.
check_alpha_update: If TRUE, this function keeps track of alpha matrices at each MM iteration. If the network is large, we strongly recommend to set to be FALSE.
check_blocks: If TRUE, this function keeps track of estimated block memberships at each MM iteration.
cache: a cachem cache object used to store intermediate calculations such as eigenvector decomposition results.
return_checkpoint: If TRUE, the function returns the checkpoint list. For most applications, this should be set to TRUE but if memory space needed by the output is an issue, set to FALSE.
only_use_preprocessed: If TRUE, the function only uses the preprocessed data from a previous fit but does not continue the estimation from its final iteration, instead the estimation is started again from the provided initialization.
...: Additional arguments, to be passed to lower-level functions (mainly to the ergm function used for the estimation of within-block connections).
Returns
An object of class 'bigergm' including the results of the fitted model. These include:
call:: call of the mode
block:: vector of the found block of the nodes into cluster
initial_block:: vector of the initial block of the nodes into cluster
sbm_pi:: Connection probabilities represented as a n_blocks x n_blocks matrix from the first stage of the estimation between all clusters
MM_list_z:: list of cluster allocation for each node and each iteration
MM_list_alpha:: list of posterior distributions of cluster allocations for all nodes for each iteration
MM_change_in_alpha:: change in 'alpha' for each iteration
MM_lower_bound:: vector of the evidence lower bounds from the MM algorithm
alpha:: matrix representing the converged posterior distributions of cluster allocations for all nodes
counter_e_step:: integer number indicating the number of iterations carried out
adjacency_matrix:: sparse matrix representing the adjacency matrix used for the estimation
estimation_status:: character stating the status of the estimation
est_within:: ergm object of the model for within cluster connections
est_between:: ergm object of the model for between cluster connections
checkpoint:: list of information to continue the estimation (only returned if return_checkpoint = TRUE)
membership_before_kmeans:: vector of the found blocks of the nodes into cluster before the final check for bad clusters
estimate_parameters:: binary value if the parameters in the second step of the algorithm should be estimated or not
Examples
# Load an embedded network object.data(toyNet)# Specify the model that you would like to estimate.model_formula <- toyNet ~ edges + nodematch("x")+ nodematch("y")+ triangle
# Estimate the modelbigergm_res <- bigergm( object = model_formula,# The model you would like to estimate n_blocks =4,# The number of blocks n_MM_step_max =10,# The maximum number of MM algorithm steps estimate_parameters =TRUE,# Perform parameter estimation after the block recovery step clustering_with_features =TRUE,# Indicate that clustering must take into account nodematch on characteristics check_blocks =FALSE)# Example with N() operator## Not run:set.seed(1)# Prepare ingredients for simulating a networkN <-500K <-10list_within_params <- c(1,2,2,-0.5)list_between_params <- c(-8,0.5,-0.5)formula <- g ~ edges + nodematch("x")+ nodematch("y")+ N(~edges,~log(n)-1)memb <- sample(1:K,prob = c(0.1,0.2,0.05,0.05,0.10,0.1,0.1,0.1,0.1,0.1), size = N, replace =TRUE)vertex_id <- as.character(11:(11+ N -1))x <- sample(1:2, size = N, replace =TRUE)y <- sample(1:2, size = N, replace =TRUE)df <- tibble::tibble( id = vertex_id, memb = memb, x = x, y = y
)g <- network::network.initialize(n = N, directed =FALSE)g %v%"vertex.names"<- df$id
g %v%"block"<- df$memb
g %v%"x"<- df$x
g %v%"y"<- df$y
# Simulate a networkg_sim <- simulate_bigergm( formula = formula, coef_within = list_within_params, coef_between = list_between_params, nsim =1, control_within = control.simulate.formula(MCMC.burnin =200000))estimation <- bigergm(update(formula,new = g_sim~.), n_blocks =10, verbose = T)summary(estimation)## End(Not run)
References
Babkin, S., Stewart, J., Long, X., and M. Schweinberger (2020). Large-scale estimation of random graph models with local dependence. Computational Statistics and Data Analysis, 152, 1--19.
Dahbura, J. N. M., Komatsu, S., Nishida, T. and Mele, A. (2021), ‘A structural model of business cards exchange networks’. https://arxiv.org/abs/2105.12704
Fritz C., Georg C., Mele A., and Schweinberger M. (2024). A strategic model of software dependency networks. https://arxiv.org/abs/2402.13375
Handcock, M. S. (2003). Assessing degeneracy in statistical models of social networks. Technical report, Center for Statistics and the Social Sciences, University of Washington, Seattle.
Holland, P. W. and S. Leinhardt (1981). An exponential family of probability distributions for directed graphs. Journal of the American Statistical Association, Theory & Methods, 76, 33--65.
Morris M, Handcock MS, Hunter DR (2008). Specification of Exponential-Family Random Graph Models: Terms and Computational Aspects. Journal of Statistical Software, 24.
Nowicki, K. and T. A. B. Snijders (2001). Estimation and prediction for stochastic blockstructures. Journal of the American Statistical Association, Theory & Methods, 96, 1077--1087.
Schweinberger, M. (2011). Instability, sensitivity, and degeneracy of discrete exponential families. Journal of the American Statistical Association, Theory & Methods, 106, 1361--1370.
Schweinberger, M. (2020). Consistent structure estimation of exponential-family random graph models with block structure. Bernoulli, 26, 1205--1233.
Schweinberger, M. and M. S. Handcock (2015). Local dependence in random graph models: characterization, properties, and statistical inference. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 7, 647-676.
Schweinberger, M., Krivitsky, P. N., Butts, C.T. and J. Stewart (2020). Exponential-family models of random graphs: Inference in finite, super, and infinite population scenarios. Statistical Science, 35, 627-662.
Schweinberger, M. and P. Luna (2018). HERGM: Hierarchical exponential-family random graph models. Journal of Statistical Software, 85, 1--39.
Schweinberger, M., Petrescu-Prahova, M. and D. Q. Vu (2014). Disaster response on September 11, 2001 through the lens of statistical network analysis. Social Networks, 37, 42--55.
Schweinberger, M. and J. Stewart (2020). Concentration and consistency results for canonical and curved exponential-family random graphs. The Annals of Statistics, 48, 374--396.
Snijders, T. A. B. and K. Nowicki (1997). Estimation and prediction for stochastic blockmodels for graphs with latent block structure. Journal of Classification, 14, 75--100.
Stewart, J., Schweinberger, M., Bojanowski, M., and M. Morris (2019). Multilevel network data facilitate statistical inference for curved ERGMs with geometrically weighted terms. Social Networks, 59, 98--119.
Vu, D. Q., Hunter, D. R. and M. Schweinberger (2013). Model-based clustering of large networks. Annals of Applied Statistics, 7, 1010--1039.