Scanner function

Scan the contents of a dataset

A Scanner iterates over a Dataset's fragments and returns data according to given row filtering and column projection. A ScannerBuilder can help create one.

Factory

Scanner$create() wraps the ScannerBuilder interface to make a Scanner. It takes the following arguments:

  • dataset: A Dataset or arrow_dplyr_query object, as returned by the dplyr methods on Dataset.
  • projection: A character vector of column names to select columns or a named list of expressions
  • filter: An Expression to filter the scanned rows by, or TRUE (default) to keep all rows.
  • use_threads: logical: should scanning use multithreading? Default TRUE
  • ...: Additional arguments, currently ignored
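
As a minimal sketch of the factory described above, the following builds a Scanner in one call instead of going through a ScannerBuilder. The temporary directory, the mtcars data, and the chosen columns are illustrative assumptions, not part of the API.

```r
library(arrow)

# Write a small partitioned dataset to a temporary directory (illustrative data)
tf <- tempfile()
dir.create(tf)
write_dataset(mtcars, tf, partitioning = "cyl")
ds <- open_dataset(tf)

# Select two columns and keep only rows where hp > 100
scanner <- Scanner$create(
  ds,
  projection = c("mpg", "hp"),
  filter = Expression$field_ref("hp") > 100,
  use_threads = TRUE
)

# Evaluate the scan and convert the resulting Table to a data.frame
result <- as.data.frame(scanner$ToTable())

# Clean up the temporary directory
unlink(tf, recursive = TRUE)
```

Here projection is given as a character vector; a named list of expressions could be passed instead when computed columns are needed.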

Methods

ScannerBuilder has the following methods:

  • $Project(cols): Indicate that the scan should only return columns given by cols, a character vector of column names or a named list of Expression.
  • $Filter(expr): Filter rows by an Expression.
  • $UseThreads(threads): logical: should the scan use multithreading? The method's default input is TRUE, but you must call the method to enable multithreading because the scanner default is FALSE.
  • $BatchSize(batch_size): integer: Maximum row count of scanned record batches, default is 32K. If scanned record batches are overflowing memory then this method can be called to reduce their size.
  • $schema: Active binding, returns the Schema of the Dataset
  • $Finish(): Returns a Scanner
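
The threading, batch-size, and schema members listed above can be sketched as follows. The dataset, the batch size of 10, and the variable names are illustrative assumptions chosen only to show the calls.

```r
library(arrow)

# Write a small dataset to a temporary directory (illustrative data)
tf <- tempfile()
dir.create(tf)
write_dataset(mtcars, tf, partitioning = "cyl")
ds <- open_dataset(tf)

builder <- ds$NewScan()

# Multithreading must be enabled explicitly on the builder
builder$UseThreads(TRUE)

# Cap scanned record batches at 10 rows each (arbitrary value for illustration)
builder$BatchSize(10L)

# Inspect the Dataset's schema through the active binding
col_names <- builder$schema$names

# Build the Scanner and evaluate it
scanner <- builder$Finish()
tab <- scanner$ToTable()

# Clean up the temporary directory
unlink(tf, recursive = TRUE)
```

A smaller batch size does not change the result; it only limits how many rows each scanned record batch holds at a time.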

Scanner provides the $ToTable() method, which evaluates the query and returns an Arrow Table. The example below also calls $ToRecordBatchReader() to get the results as a RecordBatchReader.

Examples

# Set up directory for examples
tf <- tempfile()
dir.create(tf)
on.exit(unlink(tf))

write_dataset(mtcars, tf, partitioning="cyl")
ds <- open_dataset(tf)

scan_builder <- ds$NewScan()
scan_builder$Filter(Expression$field_ref("hp") > 100)
scan_builder$Project(list(hp_times_ten = 10 * Expression$field_ref("hp")))

# Once configured, call $Finish()
scanner <- scan_builder$Finish()

# Can get results as a table
as.data.frame(scanner$ToTable())

# Or as a RecordBatchReader
scanner$ToRecordBatchReader()
  • Maintainer: Jonathan Keane
  • License: Apache License (>= 2.0)
  • Last published: 2025-02-26