A FileFormat holds information about how to read and parse the files included in a Dataset. There are subclasses corresponding to the supported file formats (ParquetFileFormat and IpcFileFormat).
Factory
FileFormat$create() takes the following arguments:
format: A string identifier of the file format. Currently supported values:
"parquet"
"ipc"/"arrow"/"feather", all aliases for each other; for Feather, note that only version 2 files are supported
"csv"/"text", aliases for the same thing (because comma is the default delimiter for text files)
"tsv", equivalent to passing format = "text", delimiter = "\t"
...: Additional format-specific options
format = "parquet":
dict_columns: Names of columns which should be read as dictionaries.
Any Parquet options from FragmentScanOptions.
format = "text": see CsvParseOptions. Note that you can specify them either with the Arrow C++ library naming ("delimiter", "quoting", etc.) or the readr-style naming used in read_csv_arrow() ("delim", "quote", etc.). Not all readr options are currently supported; please file an issue if you encounter one that arrow should support. The following options are also supported. From CsvReadOptions:
skip_rows
column_names. Note that if a Schema is specified, column_names must match those specified in the schema.
autogenerate_column_names
From CsvFragmentScanOptions (these values can be overridden at scan time):
convert_options: a CsvConvertOptions
block_size
It returns the appropriate subclass of FileFormat (e.g. ParquetFileFormat).
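As a rough illustration of the factory, the sketch below creates a few formats and inspects the class of what comes back. It assumes the arrow package is installed; the variable names are made up for the example, and the exact subclass returned for text formats is not stated above, so it is only checked against the FileFormat parent class.

```r
library(arrow)

# The `format` string selects which FileFormat subclass is returned
parquet_format <- FileFormat$create("parquet")
ipc_format <- FileFormat$create("feather") # alias of "ipc" and "arrow"

# Parse options can use the Arrow C++ naming ...
text_format <- FileFormat$create("text", delimiter = ";")
# ... or the readr-style naming used by read_csv_arrow()
readr_format <- FileFormat$create("text", delim = ";")

# Inspect the most specific class of each object
class(parquet_format)[1]
class(ipc_format)[1]
class(text_format)[1]
```

The resulting objects can be passed to open_dataset() through its format argument.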
Examples
## Semi-colon delimited files
# Set up directory for examples
tf <- tempfile()
dir.create(tf)
on.exit(unlink(tf))
write.table(mtcars, file.path(tf, "file1.txt"), sep = ";", row.names = FALSE)

# Create FileFormat object
format <- FileFormat$create(format = "text", delimiter = ";")

open_dataset(tf, format = format)
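A second sketch, under the same assumptions, illustrates the "tsv" shortcut described in the Factory section: a tab-separated file is opened once with format = "tsv" and once with the explicit format = "text", delimiter = "\t", and the two datasets should expose the same columns. The directory and file names here are hypothetical.

```r
library(arrow)

# Set up a directory holding one tab-separated file
tsv_dir <- tempfile()
dir.create(tsv_dir)
write.table(mtcars, file.path(tsv_dir, "file1.tsv"), sep = "\t", row.names = FALSE)

# The "tsv" shortcut ...
ds_tsv <- open_dataset(tsv_dir, format = FileFormat$create("tsv"))
# ... and the explicit spelling it stands for
ds_text <- open_dataset(tsv_dir, format = FileFormat$create("text", delimiter = "\t"))

# Both datasets should report the same column names
ds_tsv$schema$names
ds_text$schema$names
```

Once the datasets are no longer needed, the temporary directory can be removed with unlink(tsv_dir, recursive = TRUE).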