Define Partitioning for a Dataset
Pass a Partitioning
object to a FileSystemDatasetFactory 's $create()
method to indicate how the file's paths should be interpreted to define partitioning.
DirectoryPartitioning
describes how to interpret raw path segments, in order. For example, schema(year = int16(), month = int8())
would define partitions for file paths like "2019/01/file.parquet", "2019/02/file.parquet", etc. In this scheme NULL
values will be skipped. In the previous example: when writing a dataset if the month was NA
(or NULL
), the files would be placed in "2019/file.parquet". When reading, the rows in "2019/file.parquet" would return an NA
for the month column. An error will be raised if an outer directory is NULL
and an inner directory is not.
HivePartitioning
is for Hive-style partitioning, which embeds field names and values in path segments, such as "/year=2019/month=2/data.parquet". Because fields are named in the path segments, order does not matter. This partitioning scheme allows NULL
values. They will be replaced by a configurable null_fallback
which defaults to the string "__HIVE_DEFAULT_PARTITION__"
when writing. When reading, the null_fallback
string will be replaced with NA
s as appropriate.
PartitioningFactory
subclasses instruct the DatasetFactory
to detect partition features from the file paths.
Both DirectoryPartitioning$create()
and HivePartitioning$create()
methods take a Schema as a single input argument. The helper function hive_partition(...)
is shorthand for HivePartitioning$create(schema(...))
.
With DirectoryPartitioningFactory$create()
, you can provide just the names of the path segments (in our example, c("year", "month")
), and the DatasetFactory
will infer the data types for those partition variables. HivePartitioningFactory$create()
takes no arguments: both variable names and their types can be inferred from the file paths. hive_partition()
with no arguments returns a HivePartitioningFactory
.
Useful links