Reduce, replace or transform levels of a data.frame or factor variable (useful for preprocessing datasets).
Reduce, replace or transform levels of a data.frame or factor variable (useful for preprocessing datasets).
delevels(x, levels, label =NULL)
Arguments
x: factor with several levels or a data.frame. If a data.frame, then all factor attributes are transformed.
levels: character vector with several options:
idf -- factor is transformed into a numeric vector using IDF transform.
pcp or c("pcp",perc) -- factor is transformed using PCP transform. If perc is not provided, the default 0.1 value is used.
any other values -- all level values are merged into a single factor level according to label.
Another possibility is to define a vector list, with levels[[i]] values for each factor of the data.frame (see example).
label: the new label used for all levels examples (if NULL then "_OTHER" is assumed).
Details
The Inverse Document Frequency (IDF) uses f(x)= log(n/f_x), where n is the length of x and f_x is the frequency of x.
The Percentage Categorical Pruned (PCP) merges all least frequent levels (summing up to perc percent) into a single level.
When other values are used for levels, this function replaces all levels values with the single label value.
Returns
Returns a transformed factor or data.frame.
References
PCP transform:
L.M. Matos, P. Cortez, R. Mendes, A. Moreau.
Using Deep Learning for Mobile Marketing User Conversion Prediction. In Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN 2019), paper N-19327, Budapest, Hungary, July, 2019 (8 pages), IEEE, ISBN 978-1-7281-2009-6.
A Comparison of Data-Driven Approaches for Mobile Marketing User Conversion Prediction. In Proceedings of 9th IEEE International Conference on Intelligent Systems (IS 2018), pp. 140-146, Funchal, Madeira, Portugal, September, 2018, IEEE, ISBN 978-1-5386-7097-2.
### simples examples:f=factor(c("A","A","B","B","C","D","E"))print(table(f))# replace "A" with "a":f1=delevels(f,"A","a")print(table(f1))# merge c("C","D","E") into "CDE":f2=delevels(f,c("C","D","E"),"CDE")print(table(f2))# merge c("B","C","D","E") into _OTHER:f3=delevels(f,c("B","C","D","E"))print(table(f3))## Not run:# larger factor:x=factor(c(1,rep(2,2),rep(3,3),rep(4,4),rep(5,5),rep(10,10),rep(100,100)))print(table(x))# IDF: frequent values are close to zero and# infrequent ones are more close to each other:x1=delevels(x,"idf")print(table(x1))# PCP: infrequent values are mergedx2=delevels(x,c("pcp",0.1))# around 10print(table(x2))# example with a data.frame:y=factor(c(rep("a",100),rep("b",20),rep("c",5)))z=1:125# numericd=data.frame(x=x,y=y,z=z,x2=x)print(summary(d))# IDF:d1=delevels(d,"idf")print(summary(d1))# PCP:d2=delevels(d,"pcp")print(summary(d2))# delevels:L=vector("list",ncol(d))# one per attributeL[[1]]=c("1","2","3","4","5")L[[2]]=c("b","c")L[[4]]=c("1","2","3")# different on purposed3=delevels(d,levels=L,label="other")print(summary(d3))## End(Not run)# end dontrun