delevels function

Reduce, replace or transform levels of a data.frame or factor variable (useful for preprocessing datasets).

delevels(x, levels, label = NULL)

Arguments

  • x: factor with several levels or a data.frame. If a data.frame, then all factor attributes are transformed.

  • levels: character vector with several options:

    • idf -- factor is transformed into a numeric vector using IDF transform.
    • pcp or c("pcp",perc) -- factor is transformed using PCP transform. If perc is not provided, the default 0.1 value is used.
    • any other values -- the listed level values are merged into a single factor level named by label.

    Alternatively, levels can be a list of vectors, where levels[[i]] holds the values for the i-th attribute (column) of the data.frame (see Examples).

  • label: the new label assigned to the merged levels (if NULL, then "_OTHER" is assumed).

Details

The Inverse Document Frequency (IDF) uses f(x)= log(n/f_x), where n is the length of x and f_x is the frequency of x.
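
The formula above can be illustrated with a minimal base R sketch. The helper idf_sketch is hypothetical and is not part of the package; it only applies log(n/f_x) to each element and may differ from what delevels does internally (for instance, in how missing values are handled).

```r
# Hypothetical helper (an assumption, not the package's code):
# maps each element of a factor to log(n / f_x).
idf_sketch <- function(x) {
  n <- length(x)                 # n: number of elements in x
  freq <- table(x)               # f_x: frequency of each level
  # look up the frequency of each element by its level name
  as.numeric(log(n / freq[as.character(x)]))
}

f <- factor(c("A", "A", "B", "B", "C", "D", "E"))
idf_values <- idf_sketch(f)
print(round(idf_values, 3))
```

Frequent levels receive values close to zero, while rare levels receive larger values: here "A" maps to log(7/2) and "C" maps to log(7).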

The Percentage Categorical Pruned (PCP) merges all least frequent levels (summing up to perc percent) into a single level.
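
One possible reading of this rule is sketched below in base R. The helper pcp_sketch is hypothetical, and the cut-off rule is an assumption: levels are taken from the least frequent upward and merged while their cumulative share stays at or below perc. The package may treat the boundary differently.

```r
# Hypothetical helper (an assumption, not the package's code).
# Assumed rule: merge the rarest levels while their cumulative
# share of the data stays at or below perc.
pcp_sketch <- function(x, perc = 0.1, label = "_OTHER") {
  x <- as.factor(x)
  freq <- sort(table(x))               # levels by increasing frequency
  share <- cumsum(freq) / length(x)    # cumulative share of the rarest levels
  rare <- names(freq)[share <= perc]   # levels selected for merging
  lev <- levels(x)
  lev[lev %in% rare] <- label
  levels(x) <- lev                     # repeated names collapse into one level
  x
}

x <- factor(c(1, rep(2, 2), rep(3, 3), rep(4, 4), rep(5, 5),
              rep(10, 10), rep(100, 100)))
x_pcp <- pcp_sketch(x, perc = 0.1)
print(table(x_pcp))
```

Under this assumed rule, levels "1" to "4" (10 of 125 elements, 8 percent) are merged, because adding level "5" would raise the merged share to 12 percent.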

When other values are used for levels, this function replaces all the listed level values with the single label value.
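
The same effect can be approximated in base R by renaming levels, as in the sketch below. The helper merge_levels_sketch and its argument names are hypothetical, chosen for illustration only.

```r
# Hypothetical helper (an assumption, not the package's code):
# renames every level listed in `old` to `label`.
merge_levels_sketch <- function(x, old, label = "_OTHER") {
  x <- as.factor(x)
  lev <- levels(x)
  lev[lev %in% old] <- label   # relabel the selected levels
  levels(x) <- lev             # repeated names collapse into one level
  x
}

f <- factor(c("A", "A", "B", "B", "C", "D", "E"))
f2 <- merge_levels_sketch(f, c("C", "D", "E"), "CDE")
print(table(f2))
```

Assigning a levels vector with repeated names is standard factor behavior in R: the affected levels are combined and the remaining ones are left unchanged.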

Returns

Returns a transformed factor or data.frame.

References

  • PCP transform:

    L.M. Matos, P. Cortez, R. Mendes and A. Moreau.

    Using Deep Learning for Mobile Marketing User Conversion Prediction. In Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN 2019), paper N-19327, Budapest, Hungary, July, 2019 (8 pages), IEEE, ISBN 978-1-7281-2009-6.

    https://doi.org/10.1109/IJCNN.2019.8851888

    http://hdl.handle.net/1822/62771

  • IDF transform:

    L.M. Matos, P. Cortez, R. Mendes and A. Moreau.

    A Comparison of Data-Driven Approaches for Mobile Marketing User Conversion Prediction. In Proceedings of 9th IEEE International Conference on Intelligent Systems (IS 2018), pp. 140-146, Funchal, Madeira, Portugal, September, 2018, IEEE, ISBN 978-1-5386-7097-2.

    https://ieeexplore.ieee.org/document/8710472

    http://hdl.handle.net/1822/61586

Author(s)

Paulo Cortez http://www3.dsi.uminho.pt/pcortez/

See Also

fit and imputation.

Examples

### simple examples:
f=factor(c("A","A","B","B","C","D","E"))
print(table(f))
# replace "A" with "a":
f1=delevels(f,"A","a")
print(table(f1))
# merge c("C","D","E") into "CDE":
f2=delevels(f,c("C","D","E"),"CDE")
print(table(f2))
# merge c("B","C","D","E") into _OTHER:
f3=delevels(f,c("B","C","D","E"))
print(table(f3))

## Not run:
# larger factor:
x=factor(c(1,rep(2,2),rep(3,3),rep(4,4),rep(5,5),rep(10,10),rep(100,100)))
print(table(x))
# IDF: frequent values are close to zero and
# infrequent ones are closer to each other:
x1=delevels(x,"idf")
print(table(x1))
# PCP: infrequent values are merged
x2=delevels(x,c("pcp",0.1)) # around 10
print(table(x2))

# example with a data.frame:
y=factor(c(rep("a",100),rep("b",20),rep("c",5)))
z=1:125 # numeric
d=data.frame(x=x,y=y,z=z,x2=x)
print(summary(d))
# IDF:
d1=delevels(d,"idf")
print(summary(d1))
# PCP:
d2=delevels(d,"pcp")
print(summary(d2))
# delevels:
L=vector("list",ncol(d)) # one per attribute
L[[1]]=c("1","2","3","4","5")
L[[2]]=c("b","c")
L[[4]]=c("1","2","3") # different on purpose
d3=delevels(d,levels=L,label="other")
print(summary(d3))
## End(Not run)