pivot() is collapse's data reshaping command. It combines longer-, wider-, and recast-pivoting functionality in a single parsimonious API. Notably, it can also accommodate variable labels.
pivot(data,# Summary of Documentation: ids =NULL,# identifier cols to preserve values =NULL,# cols containing the data names =NULL,# name(s) of new col(s) | col(s) containing names labels =NULL,# name of new labels col | col(s) containing labels how ="longer",# method: "longer"/"l", "wider"/"w" or "recast"/"r" na.rm =FALSE,# remove rows missing 'values' in reshaped data factor = c("names","labels"),# create new id col(s) as factor variable(s)? check.dups =FALSE,# detect duplicate 'ids'+'names' combinations# Only apply if how = "wider" or "recast" FUN ="last",# aggregation function (internal or external) FUN.args =NULL,# list of arguments passed to aggregation function nthreads = .op[["nthreads"]],# minor gains as grouping remains serial fill =NULL,# value to insert for unbalanced data (default NA/NULL) drop =TRUE,# drop unused levels (=columns) if 'names' is factor sort =FALSE,# "ids": sort 'ids' and/or "names": alphabetic casting# Only applies if how = "wider" with multiple long columns ('values') transpose =FALSE# "columns": applies t_list() before flattening, and/or)# "names": sets names nami_colj. default: colj_nami
Arguments
data: data frame-like object (list of equal-length columns).
ids: identifier columns to keep. Specified using column names, indices, a logical vector or an identifier function e.g. is_categorical.
values: columns containing the data to be reshaped. Specified like ids.
names: names of columns to generate, or retrieve variable names from:
how
Description
"longer"
list of names for the variable and value column in the long format, respectively. If NULL , list("variable", "value") will be chosen. Alternatively, a named list length 1 or 2 can be provided using "variable"/"value" as keys e.g. list(value = "data_col") .
"wider"
column(s) containing names of the new variables. Specified using a vector of column names, indices, a logical vector or selector function e.g. is.character . Multiple columns will be combined using finteraction with "_" as separator.
"recast"
(named) list with the following elements: [[1]]/[["from"]] - column(s) containing names of the new variables, specified as in "wider" ; [[2]]/[["to"]] - name of the variable to generate containing old column names. If NULL , list("variable", "variable") will be chosen.
labels: names of columns to generate, or retrieve variable labels from:
how
Description
"longer"
A string specifying the name of the column to store labels - retrieved from the data using vlabels(values) . TRUE will create a column named "label" . Alternatively, a (named) list with two elements: [[1]]/[["name"]] - the name of the labels column; [[2]]/[["new"]] - a (named) character vector of new labels for the 'variable', 'label' and 'value' columns in the long-format frame. See Examples.
"wider"
column(s) containing labels of the new variables. Specified using a vector of column names, indices, a logical vector or selector function e.g. is.character . Multiple columns will be combined using finteraction with " - " as separator.
"recast"
(named) list with the following elements: [[1]]/[["from"]] - column(s) containing labels for the new variables, specified as in "wider" ; [[2]]/[["to"]] - name of the variable to generate containing old labels; [[3]]/[["new"]] - a (named) character vector of new labels for the generated 'variable' and 'label' columns. If [[1]]/[["from"]] is not supplied, this can also include labels for new variables. Omitting one of the elements via a named list or setting it to NULL in a list of 3 will omit the corresponding operation i.e. either not saving existing labels or not assigning new ones.
how: character. The pivoting method: one of "longer", "wider" or "recast". These can be abbreviated by the first letter i.e. "l"/"w"/"r".
na.rm: logical. TRUE will remove missing values such that in the reshaped data there is no row missing all data columns - selected through 'values'. For wide/recast pivots using internal FUN's "first"/"last"/"count", this also toggles skipping of missing values.
factor: character. Whether to generate new 'names' and/or 'labels' columns as factor variables. This is generally recommended as factors are more memory efficient than character vectors and also faster in subsequent filtering and grouping. Internally, this argument is evaluated as factor <- c("names", "labels") %in% factor, so passing anything other than "names" and/or "labels" will disable it.
check.dups: logical. TRUE checks for duplicate 'ids'+'names' combinations, and, if 'labels' are specified, also for duplicate 'names'+'labels' combinations. The default FALSE implies that the algorithm just runs through the data, leading effectively to the FUN option to be executed (default last value). See Details.
FUN: function to aggregate values. At present, only a single function is allowed. Fast Statistical Functions receive vectorized execution. For maximum efficiency, a small set of internal functions is provided: "first", "last", "count", "sum", "mean", "min", or "max". In options "first"/"last"/"count" setting na.rm = TRUE skips missing values. In options "sum"/"mean"/"min"/"max" missing values are always skipped (see Details why). The fill argument is ignored in "count"/"sum"/"mean"/"min"/"max" ("count"/"sum" force fill = 0 else NA is used).
FUN.args: (optional) list of arguments passed to FUN (if using an external function). Data-length arguments such as weight vectors are supported.
nthreads: integer. if how = "wider"|"recast": number of threads to use with OpenMP (default get_collapse("nthreads"), initialized to 1). Only the distribution of values to columns with how = "wider"|"recast" is multithreaded here. Since grouping id columns on a long data frame is expensive and serial, the gains are minor. With how = "long", multithreading does not make much sense as the most expensive operation is allocating the long results vectors. The rest is a couple of memset()'s in C to copy the values.
fill: if how = "wider"|"recast": value to insert for 'ids'-'names' combinations not present in the long format. NULL uses NA for atomic vectors and NULL for lists.
drop: logical. if how = "wider"|"recast" and 'names' is a single factor variable: TRUE will check for and drop unused levels in that factor, avoiding the generation of empty columns.
sort: if how = "wider"|"recast": specifying "ids" applies ordered grouping on the id-columns, returning data sorted by ids. Specifying "names" sorts the names before casting (unless 'names' is a factor), yielding columns cast in alphabetic order. Both options can be passed as a character vector, or, alternatively, TRUE can be used to enable both.
transpose: if how = "wider"|"recast" and multiple columns are selected through 'values': specifying "columns" applies t_list to the result before flattening, resulting in a different column order. Specifying "names" generates names of the form nami_colj, instead of colj_nami. Both options can be passed as a character vector, or, alternatively, TRUE can be used to enable both.
Details
Pivot wider essentially works as follows: compute g_rows = group(ids) and also g_cols = group(names) (using group if sort = FALSE). g_rows gives the row-numbers of the wider data frame and g_cols the column numbers.
Then, a C function generates a wide data frame and runs through each long column ('values'), assigning each value to the corresponding row and column in the wide frame. In this process FUN is always applied. The default, "last", does nothing at all, i.e., if there are duplicates, some values are overwritten. "first" works similarly just that the C-loop is executed the other way around. The other hard-coded options count, sum, average, or compare observations on the fly. Missing values are internally skipped for statistical functions as there is no way to distinguish an incoming NA from an initial NA - apart from counting occurrences using an internal structure of the same size as the result data frame which is costly and thus not implemented.
When passing an R-function to FUN, the data is grouped using g_full = group(g_rows, g_cols), aggregated by groups, and expanded again to full length using TRA before entering the reshaping algorithm. Thus, this is significantly more expensive than the optimized internal functions. With Fast Statistical Functions the aggregation is vectorized across groups, other functions are applied using BY - by far the slowest option.
If check.dups = TRUE, a check of the form fnunique(list(g_rows, g_cols)) < fnrow(data) is run, and an informative warning is issued if duplicates are found.
Recast pivoting works similarly. In long pivots FUN is ignored and the check simply amounts to fnunique(ids) < fnrow(data).
Returns
A reshaped data frame with the same class and attributes (except for 'names'/'row-names') as the input frame.
Note
Leaving either 'ids' or 'values' empty will assign all other columns (except for "variable" if how = "wider"|"recast") to the non-specified argument. It is also possible to leave both empty, e.g. for complete melting if how = "wider" or data transposition if how = "recast" (similar to data.table::transpose but supporting multiple names columns and variable labels). See Examples.
pivot currently does not support concurrently melting/pivoting longer to multiple columns. See data.table::melt or pivot_longer from tidyr or tidytable for an efficient alternative with this feature. It is also possible to achieve this with just a little bit of programming. An example is provided below.
See Also
collap, vec, rowbind, unlist2d, Data Frame Manipulation , Collapse Overview
Examples
# -------------------------------- PIVOT LONGER ---------------------------------# Simple Melting (Reshaping Long)pivot(mtcars)|> head()pivot(iris,"Species")|> head()pivot(iris, values =1:4)|> head()# Same thing# Using collapse's datasetshead(wlddev)pivot(wlddev,1:8, na.rm =TRUE)|> head()pivot(wlddev, c("iso3c","year"), c("PCGDP","LIFEEX"), na.rm =TRUE)|> head()head(GGDC10S)pivot(GGDC10S,1:5, names = list("Sectorcode","Value"), na.rm =TRUE)|> head()# Can also set by name: variable and/or value. Note that 'value' here remains lowercasepivot(GGDC10S,1:5, names = list(variable ="Sectorcode"), na.rm =TRUE)|> head()# Melting including saving labelspivot(GGDC10S,1:5, na.rm =TRUE, labels =TRUE)|> head()pivot(GGDC10S,1:5, na.rm =TRUE, labels ="description")|> head()# Also assigning new labelspivot(GGDC10S,1:5, na.rm =TRUE, labels = list("description", c("Sector Code","Sector Description","Value")))|> namlab()# Can leave out value column by providing named vector of labelspivot(GGDC10S,1:5, na.rm =TRUE, labels = list("description", c(variable ="Sector Code", description ="Sector Description")))|> namlab()# Now here is a nice example that is explicit and respects the dataset naming conventionspivot(GGDC10S, ids =1:5, na.rm =TRUE, names = list(variable ="Sectorcode", value ="Value"), labels = list(name ="Sector", new = c(Sectorcode ="GGDC10S Sector Code", Sector ="Long Sector Description", Value ="Employment or Value Added")))|> namlab(N =TRUE, Nd =TRUE, class =TRUE)# Note that pivot() currently does not support melting to multiple columns# But you can tackle the issue with a bit of programming:wide <- pivot(GGDC10S, c("Country","Year"), c("AGR","MAN","SUM"),"Variable", how ="wider", na.rm =TRUE)head(wide)library(magrittr)wide %>%{av(pivot(.,1:2, grep("_VA", names(.))), pivot(gvr(.,"_EMP")))}|> head()wide %>%{av(av(gv(.,1:2), rm_stub(gvr(.,"_VA"),"_VA", pre =FALSE))|> pivot(1:2, names = list("Sectorcode","VA"), labels ="Sector"), EMP = vec(gvr(.,"_EMP")))}|> head()rm(wide)# -------------------------------- PIVOT WIDER ---------------------------------iris_long <- pivot(iris,"Species")# Getting a long framehead(iris_long)# If 'names'/'values' not supplied, searches for 'variable' and 'value' columnspivot(iris_long, how ="wider")# But here the records are not identified by 'Species': thus aggregation with last value:pivot(iris_long, how ="wider", check =TRUE)# issues a warningrm(iris_long)# This works better, these two are inverse operationswlddev |> pivot(1:8)|> pivot(how ="w")|> head()# ...but not perfect, we loose labelsnamlab(wlddev)wlddev |> pivot(1:8)|> pivot(how ="w")|> namlab()# But pivot() supports labels: these are perfect inverse operationswlddev |> pivot(1:8, labels ="label")|> print(max =50)|># Notice the "label" column pivot(how ="w", labels ="label")|> namlab()# If the data does not have 'variable'/'value' cols: need to specify 'names'/'values'# Using a single column:pivot(GGDC10S, c("Country","Year"),"SUM","Variable", how ="w")|> head()SUM_wide <- pivot(GGDC10S, c("Country","Year"),"SUM","Variable", how ="w", na.rm =TRUE)head(SUM_wide)# na.rm = TRUE here removes all new rows completely missing datatail(SUM_wide)# But there may still be NA's, notice the NA in the final row# We could use fill to set another valuepivot(GGDC10S, c("Country","Year"),"SUM","Variable", how ="w", na.rm =TRUE, fill =-9999)|> tail()# This will keep the label of "SUM", unless we supply a column with new labelsnamlab(SUM_wide)# Such a column is not available here, but we could use "Variable" twicepivot(GGDC10S, c("Country","Year"),"SUM","Variable","Variable", how ="w", na.rm =TRUE)|> namlab()# Alternatively, can of course relabel ex-postSUM_wide |> relabel(VA ="Value Added", EMP ="Employment")|> namlab()rm(SUM_wide)# Multiple-column pivotspivot(GGDC10S, c("Country","Year"), c("AGR","MAN","SUM"),"Variable", how ="w", na.rm =TRUE)|> head()# Here we may prefer a transposed column orderpivot(GGDC10S, c("Country","Year"), c("AGR","MAN","SUM"),"Variable", how ="w", na.rm =TRUE, transpose ="columns")|> head()# Can also flip the order of names (independently of columns)pivot(GGDC10S, c("Country","Year"), c("AGR","MAN","SUM"),"Variable", how ="w", na.rm =TRUE, transpose ="names")|> head()# Can also enable both (complete transposition)pivot(GGDC10S, c("Country","Year"), c("AGR","MAN","SUM"),"Variable", how ="w", na.rm =TRUE, transpose =TRUE)|> head()# or tranpose = c("columns", "names")# Finally, here is a nice, simple way to reshape the entire dataset.pivot(GGDC10S, values =6:16, names ="Variable", na.rm =TRUE, how ="w")|> namlab(N =TRUE, Nd =TRUE, class =TRUE)# -------------------------------- PIVOT RECAST ---------------------------------# Look at the data againhead(GGDC10S)# Let's stack the sectors and instead create variable columnspivot(GGDC10S, .c(Country, Regioncode, Region, Year), names = list("Variable","Sectorcode"), how ="r")|> head()# Same thing (a bit easier)pivot(GGDC10S, values =6:16, names = list("Variable","Sectorcode"), how ="r")|> head()# Removing missing valuespivot(GGDC10S, values =6:16, names = list("Variable","Sectorcode"), how ="r", na.rm =TRUE)|> head()# Saving Labelspivot(GGDC10S, values =6:16, names = list("Variable","Sectorcode"), labels = list(to ="Sector"), how ="r", na.rm =TRUE)|> head()# Supplying new labels for generated columns: as complete as it getspivot(GGDC10S, values =6:16, names = list("Variable","Sectorcode"), labels = list(to ="Sector", new = c(Sectorcode ="GGDC10S Sector Code", Sector ="Long Sector Description", VA ="Value Added", EMP ="Employment")), how ="r", na.rm =TRUE)|> namlab(N =TRUE, Nd =TRUE, class =TRUE)# Now another (slightly unconventional) use case here is data transposition# Let's get the data for BotswanaBWA <- GGDC10S |> fsubset(Country =="BWA", Variable, Year, AGR:SUM)head(BWA)# By supplying no ids or values, we are simply requesting a transpose operationpivot(BWA, names = list(from = c("Variable","Year"), to ="Sectorcode"), how ="r")# Same with labelspivot(BWA, names = list(from = c("Variable","Year"), to ="Sectorcode"), labels = list(to ="Sector"), how ="r")# For simple cases, data.table::transpose() will be more efficient, but with multiple# columns to generate names and/or variable labels to be saved/assigned, pivot() is handyrm(BWA)