Apache Arrow defines two formats for serializing data for interprocess communication (IPC): a "stream" format and a "file" format, known as Feather. RecordBatchStreamWriter and RecordBatchFileWriter are interfaces for writing record batches to those formats, respectively.
For guidance on how to use these classes, see the examples section.
Factory
The RecordBatchFileWriter$create() and RecordBatchStreamWriter$create()
factory methods instantiate the object and take the following arguments (a short sketch follows the list):
sink An OutputStream
schema A Schema for the data to be written
use_legacy_format logical: write data formatted so that Arrow library versions 0.14 and lower can read it. Default is FALSE. You can also enable this by setting the environment variable ARROW_PRE_0_15_IPC_FORMAT=1.
metadata_version A string like "V5" or the equivalent integer indicating the Arrow IPC MetadataVersion. Default (NULL) will use the latest version, unless the environment variable ARROW_PRE_1_0_METADATA_VERSION=1, in which case it will be V4.
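For illustration, here is a minimal sketch of a factory call with these arguments spelled out; the temporary file, toy schema, and data are assumptions made for the example, not part of the class definition.

tf <- tempfile()
sink <- FileOutputStream$create(tf)
# A toy schema describing the data to be written
sch <- schema(x = int32(), y = utf8())
# Defaults written out explicitly: current IPC format, latest metadata version
writer <- RecordBatchStreamWriter$create(
  sink,
  sch,
  use_legacy_format = FALSE,
  metadata_version = NULL
)
writer$write(record_batch(x = 1:3, y = c("a", "b", "c")))
writer$close()
sink$close()
unlink(tf)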
Methods
$write(x): Write a RecordBatch, Table, or data.frame, dispatching to the methods below appropriately (see the sketch after this list)
$write_batch(batch): Write a RecordBatch to stream
$write_table(table): Write a Table to stream
$close(): close stream. Note that this indicates end-of-file or end-of-stream--it does not close the connection to the sink. That needs to be closed separately.
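As a brief sketch of that dispatch, assuming an illustrative file path and toy data (Table$create() builds a Table from a data.frame):

tf <- tempfile()
sink <- FileOutputStream$create(tf)
writer <- RecordBatchFileWriter$create(sink, schema(x = float64()))

df <- data.frame(x = c(1.5, 2.5))
writer$write(record_batch(df))   # dispatches to $write_batch()
writer$write(Table$create(df))   # dispatches to $write_table()
writer$write(df)                 # a data.frame is converted, then written

writer$close()  # marks end-of-file in the IPC message
sink$close()    # the sink still has to be closed separately
unlink(tf)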
Examples
tf <- tempfile()
on.exit(unlink(tf))

batch <- record_batch(chickwts)

# This opens a connection to the file in Arrow
file_obj <- FileOutputStream$create(tf)
# Pass that to a RecordBatchWriter to write data conforming to a schema
writer <- RecordBatchFileWriter$create(file_obj, batch$schema)
writer$write(batch)

# You may write additional batches to the stream, provided that they have
# the same schema.

# Call "close" on the writer to indicate end-of-file/stream
writer$close()
# Then, close the connection--closing the IPC message does not close the file
file_obj$close()

# Now, we have a file we can read from. Same pattern: open file connection,
# then pass it to a RecordBatchReader
read_file_obj <- ReadableFile$create(tf)
reader <- RecordBatchFileReader$create(read_file_obj)
# RecordBatchFileReader knows how many batches it has (StreamReader does not)
reader$num_record_batches

# We could consume the Reader by calling $read_next_batch() until all are
# consumed, or we can call $read_table() to pull them all into a Table
tab <- reader$read_table()
# Call as.data.frame to turn that Table into an R data.frame
df <- as.data.frame(tab)
# This should be the same data we sent
all.equal(df, chickwts, check.attributes = FALSE)

# Unlike the Writers, we don't have to close RecordBatchReaders,
# but we do still need to close the file connection
read_file_obj$close()
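The example above uses the file format. A comparable sketch for the stream format, reusing batch from above and writing to a second temporary file (the file is just one possible sink), might look like this:

tf2 <- tempfile()
stream_sink <- FileOutputStream$create(tf2)
writer <- RecordBatchStreamWriter$create(stream_sink, batch$schema)
writer$write(batch)
writer$close()
stream_sink$close()

# Read it back. A stream reader does not know how many batches it holds,
# so either call $read_next_batch() until it returns NULL, or pull
# everything into a Table with $read_table()
stream_source <- ReadableFile$create(tf2)
stream_reader <- RecordBatchStreamReader$create(stream_source)
tab2 <- stream_reader$read_table()
stream_source$close()
unlink(tf2)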
See Also
write_ipc_stream() and write_feather() provide a much simpler interface for writing data to these formats and are sufficient for many use cases. write_to_raw() is a version that serializes data to a buffer.
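For example, the round trips above collapse to one call per direction with those helpers (the paths here are illustrative):

tf <- tempfile()
write_feather(chickwts, tf)        # "file" (Feather) format
df <- read_feather(tf)

write_ipc_stream(chickwts, tf)     # "stream" format
df <- read_ipc_stream(tf)

raw_vec <- write_to_raw(chickwts)  # stream format serialized to a raw vector
unlink(tf)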