specleanr 1.0.0 package

Detecting Environmental Outliers in Data Analysis Pipelines

adjustboxplots

Adjust the boxplot bounding fences using the medcouple to flag suspicious...

bestmethod

Identifies the best method for outlier detection for a single species.

boots

Implements bootstrapping procedures (sampling with replacement).

broad_classify

Broad classification of outlier detection methods.

check_names

Check species names for inconsistencies

check_packages

Check for packages to install and respond to the user

check.exclude

Indicate excluded columns.

checks

Post checks for PCA and bootstrapping

classify_data

Extract final clean data using either absolute or best method generate...

cosine

Cosine similarity index based on (Gautam & Kulkarni 2014; Joy & Renumo...

datacleaner-class

Outlier detection class for multiple methods

distboxplot

Distribution boxplot

ecological_ranges

Check for environmental outliers using species optimal ranges.

eif

Computes the empirical influence function for each value in the datas...

extentvalues

Checks for a bounding box

extract_clean_data

Extract final clean data using either absolute or best method generate...

extractMethods

List of outlier detection methods implemented in this package.

extractoutliers

Extract outliers for one species

geo_ranges

Checks for geographic ranges from FishBase

getdata

Download species records from online database.

getdiff

Get dataframe from the large dataframe.

ggenvironmentalspace

Plotting to show the quality controlled data in environmental sp...

ggoutlieraccum

Identify if enough methods are selected for the outlier detection.

ggoutliers

Visualize the outliers identified by each method

hamming

Identify best outlier detection method using Hamming distance.

hampel

Flag suspicious outliers based on the Hampel filter method.

handle_true_errors

Catch errors during methods implementation.

interquartile

Computes interquartile range to flag environmental outliers

isoforest

Identify outliers using isolation forest model.

jaccard

Identifies the best outlier detection method using Jaccard coefficient...

jknife

Identifies outliers using Reverse Jackknifing method based on Chapman ...

logboxplot

Log boxplot method for outlier detection.

mahal

Flags outliers based on Mahalanobis distance matrix for all records.

match_datasets

Data harmonizing for offline data based on Darwin Core terms.

match.argc

Customized match function

medianrule

Median rule method

mixediqr

Mixed Interquartile range and semiInterquartile range `Walker et al., ...

multiabsolute

Identifies absolute outliers for multiple species.

multibestmethod

Identify best method for outlier removal for multiple species using ma...

multidetect

Ensemble multiple outlier detection methods.

ocindex

Identifies absolute outliers and their proportions for a single specie...

onesvm

Identify outliers using One Class Support Vector Machines

optimal_threshold

Optimize threshold for clean data extraction.

overlap

Identifies best outlier detection method using Overlap coefficient.

pca

Implement principal component analysis for dimension reduction

pcboot

Packages both principal component analysis and bootstrapping.

pred_extract

Preliminary data cleaning including removing duplicates, records outsi...

search_threshold

Determine the threshold using Locally estimated or weighted Scatterplo...

semiIQR

Computes semi-interquartile range to flag suspicious outliers

seqfences

Sequential fences method

show-datacleaner-method

Set method for displaying output details after outlier detection.

smc

Identify best outlier detection method using simple matching coefficie...

sorensen

Identifies best outlier detection method using Sorensen Similarity Ind...

thermal_ranges

Collates minimum, maximum, and preferable temperatures from FishBase.

xglosh

Global-Local Outlier Score from Hierarchies

xkmeans

Flags outliers using kmeans clustering method

xknn

k-nearest neighbors for outlier detection

xlof

Flags suspicious records using the local outlier factor or Density-Based Spati...

zscore

Computes z-scores to flag environmental outliers.

A framework for detecting and handling outliers in data analysis workflows. Outlier detection is a statistical concept that highlights records that are suspiciously high or low. Outlier detection in distribution models was initiated by Chapman (1991) (available at <https://www.researchgate.net/publication/332537800_Quality_control_and_validation_of_point-sourced_environmental_resource_data>), who developed the reverse jackknifing method. The concept was further developed and incorporated into different R packages, including 'flexsdm' (Velazco et al., 2022, <doi:10.1111/2041-210X.13874>) and 'biogeo' (Robertson et al., 2016, <doi:10.1111/ecog.02118>). We compiled various outlier detection methods from the literature, including those elaborated in Dastjerdy et al. (2023) <doi:10.3390/geotechnics3020022> and Liu et al. (2008) <doi:10.1109/ICDM.2008.17>. This package introduces an ensembling approach, in which multiple outlier detection methods are used together to decide whether a record is an absolute outlier. The approach can be applied in general data analysis as well as during the development of species distribution models.
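The ensembling idea in the description can be illustrated with a minimal, language-agnostic sketch. The Python code below is not the package's R API: the function names, the default thresholds, and the `min_share` voting rule are all hypothetical choices made for illustration. It applies three univariate rules (z-score, interquartile range, and a Hampel-style median/MAD rule) and marks a record as an outlier only when at least a given share of the rules agree.

```python
import statistics


def zscore_flags(values, threshold=3.0):
    """Flag values whose absolute z-score exceeds the threshold (assumed default 3)."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    if sd == 0:
        return [False] * len(values)
    return [abs(v - mean) / sd > threshold for v in values]


def iqr_flags(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed default k = 1.5)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v < lower or v > upper for v in values]


def hampel_flags(values, threshold=3.0):
    """Flag values more than `threshold` scaled MADs away from the median."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return [False] * len(values)
    scaled_mad = 1.4826 * mad  # consistency constant for normally distributed data
    return [abs(v - med) > threshold * scaled_mad for v in values]


def ensemble_flags(values, detectors, min_share=0.5):
    """Flag a record when at least `min_share` of the detectors flag it.

    `min_share` is a hypothetical voting parameter for this sketch.
    """
    votes = [0] * len(values)
    for detect in detectors:
        for i, flagged in enumerate(detect(values)):
            votes[i] += int(flagged)
    needed = min_share * len(detectors)
    return [v >= needed for v in votes]


# Toy water temperatures (degrees Celsius) with one suspicious record.
temperatures = [12.1, 12.4, 11.9, 12.8, 12.3, 12.0, 12.6, 35.0]
flags = ensemble_flags(temperatures, [zscore_flags, iqr_flags, hampel_flags])
print([i for i, flagged in enumerate(flags) if flagged])  # prints [7]
```

In this toy sample the z-score rule alone does not flag 35.0, because a single extreme value in a sample of eight inflates the standard deviation. The IQR and Hampel rules both flag it, so it passes the 50% vote. This shows why combining methods can be more robust than relying on any single one.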

  • Maintainer: Anthony Basooma
  • License: GPL (>= 3)
  • Last published: 2025-11-25