Removes or flags duplicated records based on species name and coordinates, as well as user-defined additional columns. True (specimen) duplicates and repeated records of the same species at the same coordinates can make up the bulk of records in a biological collection database, but are undesirable for many analyses. Both can be flagged with this function; identifying true specimen duplicates requires enough additional information, supplied through the additions argument.
cc_dupl(x, lon = "decimalLongitude", lat = "decimalLatitude", species = "species", additions = NULL, value = "clean", verbose = TRUE)
Arguments
x: data.frame. Containing geographical coordinates and species names.
lon: character string. The column with the longitude coordinates. Default = "decimalLongitude".
lat: character string. The column with the latitude coordinates. Default = "decimalLatitude".
species: character string. The column with the species name. Default = "species".
additions: a vector of character strings. Additional columns to be included in the test for duplication, for example collector name and collector number.
value: character string. Defining the output value, either "clean" or "flagged". See the Returns section.
verbose: logical. If TRUE, reports the name of the test and the number of records flagged.
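The arguments above can be tried out on a small, made-up data set. The sketch below assumes that the CoordinateCleaner package, which provides cc_dupl, is installed; the call is skipped if it is not. The records data frame and its collector and collector_number columns are hypothetical and exist only for illustration. Column names are passed explicitly so the example does not rely on the defaults.

```r
# Hypothetical occurrence records. All rows share the same coordinates;
# rows 1 to 3 are the same species, and rows 1 and 3 also share
# collector and collector number.
records <- data.frame(
  species = c("Panthera leo", "Panthera leo", "Panthera leo", "Panthera pardus"),
  decimalLongitude = c(34.8, 34.8, 34.8, 34.8),
  decimalLatitude = c(-2.3, -2.3, -2.3, -2.3),
  collector = c("A. Smith", "B. Jones", "A. Smith", "A. Smith"),
  collector_number = c(101, 7, 101, 102),
  stringsAsFactors = FALSE
)

if (requireNamespace("CoordinateCleaner", quietly = TRUE)) {
  # Test on species name and coordinates only.
  flags_basic <- CoordinateCleaner::cc_dupl(
    records,
    lon = "decimalLongitude",
    lat = "decimalLatitude",
    species = "species",
    value = "flagged",
    verbose = FALSE
  )

  # Also require collector and collector number to match before a
  # record counts as a duplicate.
  flags_strict <- CoordinateCleaner::cc_dupl(
    records,
    lon = "decimalLongitude",
    lat = "decimalLatitude",
    species = "species",
    additions = c("collector", "collector_number"),
    value = "flagged",
    verbose = FALSE
  )

  print(flags_basic)
  print(flags_strict)
}
```

Adding columns through additions narrows the definition of a duplicate, so the stricter call should flag no more records than the basic one.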
Returns
Depending on the value argument, either a data.frame containing the records considered correct by the test ("clean") or a logical vector ("flagged"), with TRUE = test passed and FALSE = test failed/potentially problematic. Default = "clean".
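The relationship between the two output types can be illustrated with base R alone. The following is a minimal sketch, not the package's implementation: it uses duplicated() as a stand-in for the test, and it assumes that the first occurrence of each combination is the one that passes. The records data frame and the key_cols vector are hypothetical.

```r
# Hypothetical records: rows 1 and 2 share species and coordinates.
records <- data.frame(
  species = c("a", "a", "b"),
  decimalLongitude = c(10.5, 10.5, 10.5),
  decimalLatitude = c(50.1, 50.1, 50.1),
  stringsAsFactors = FALSE
)

# Columns that define a duplicate in this sketch.
key_cols <- c("species", "decimalLongitude", "decimalLatitude")

# Analogue of value = "flagged": one logical per record,
# TRUE = test passed, FALSE = repeats an earlier row (assumption:
# the first occurrence passes).
flagged <- !duplicated(records[, key_cols])

# Analogue of value = "clean": keep only the records that passed.
clean <- records[flagged, ]

print(flagged)
print(nrow(clean))
```

The logical vector is convenient when the flags should be stored alongside the original records or combined with other tests, while the cleaned data.frame is the shorter route when the flagged records are simply to be dropped.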