---
title: "mlr3hf Easy Task: Deep Dive into mlr3oml and Data Fetching"
author: "Cao Wei"
date: today
format:
html:
toc: true
toc-depth: 3
theme: cosmo
code-fold: show
code-tools: true
execute:
warning: false
message: false
---
# 1. Introduction & Basic Usage
This report explores the `otsk()` function in the `mlr3oml` package, analyzing its data retrieval mechanisms and format preferences (ARFF vs. Parquet).
This experiment runs on an AutoDL cloud server with the following environment and hardware/software configuration:
- Image: PyTorch 2.5.1 + Python 3.12 (Ubuntu 22.04) + CUDA 12.4
- GPU: 1 × NVIDIA RTX 3060 (12GB)
- CPU: 7 vCPU Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
- Memory: 15GB
- Disk: 30GB system disk, free 50GB SSD data disk (no paid disk), no additional disk
- Port mapping: None
- Custom service ports: 6006 (HTTP), 6008 (HTTP)
- Networking: Shared bandwidth with other instances in the same region
Based on the [mlr3oml documentation](https://mlr3oml.mlr-org.com/index.html), we can easily fetch an OpenML task. I set the cache option to `FALSE` to avoid reusing cache files from previous experiments.
```{r set_cache_false}
options(mlr3oml.cache = FALSE)
```
## 1.1 Classification Task 1: kr-vs-kp
With task ID 145953, ["Supervised Classification on kr-vs-kp"](https://www.openml.org/search?type=task&id=145953) is a supervised classification task. The corresponding dataset ["kr-vs-kp"](https://www.openml.org/d/3) has 3196 instances and 37 features, which is a small dataset.
```{r setup}
library(mlr3oml)
library(mlr3)
# Download and print the OpenML task with ID 145953
oml_task <- otsk(145953)
oml_task
```
Let's walk through the documentation to get a first taste of the package.
```{r access_data}
# Access the OpenML data object on which the task is built
oml_task$data
```
```{r convert_task}
# Convert the OpenML task to an mlr3 task and resampling
task <- as_task(oml_task)
resampling <- as_resampling(oml_task)
# Conduct a simple resample experiment
rr <- resample(task, lrn("classif.rpart"), resampling)
rr$aggregate()
```
## 1.2 Classification Task 2: Supervised Classification on higgs
With task ID 146606, ["Supervised Classification on higgs"](https://www.openml.org/search?type=task&qualities.NumberOfClasses=%3D_2&id=146606&source_data.data_id=23512) is also a supervised classification task. The corresponding dataset ["higgs"](https://www.openml.org/search?type=data&id=23512&sort=runs&status=active) has 98050 instances and 29 features, which is a larger dataset.
We also walk through the document for this task.
```{r access_data_2}
oml_task <- otsk(146606)
oml_task
```
```{r convert_task_2}
task <- as_task(oml_task)
resampling <- as_resampling(oml_task)
# Conduct a simple resample experiment
rr <- resample(task, lrn("classif.rpart"), resampling)
rr$aggregate()
```
## 1.3 Observations
In the blink of an eye, the resample experiment for the small dataset completes; for the large dataset, however, it takes noticeably longer.
While using `otsk` to access the data, we can see the following logs:
```sh
INFO [23:32:05.482] Retrieving JSON {url: `https://www.openml.org/api/v1/json/task/146606`, authenticated: `FALSE`}
INFO [23:32:06.064] Retrieving JSON {url: `https://www.openml.org/api/v1/json/data/qualities/23512`, authenticated: `FALSE`}
INFO [23:32:06.622] Retrieving JSON {url: `https://www.openml.org/api/v1/json/data/features/23512`, authenticated: `FALSE`}
INFO [23:32:07.123] Retrieving JSON {url: `https://www.openml.org/api/v1/json/data/23512`, authenticated: `FALSE`}
```
Only JSON metadata is downloaded at this point; the ARFF files are not fetched until `as_task()` and `as_resampling()` are called:
```sh
INFO [23:32:07.650] Retrieving ARFF {url: `https://openml.org/data/v1/download/2063675/higgs.arff`, authenticated: `FALSE`}
INFO [23:32:25.445] Retrieving ARFF {url: `https://openml.org/api_splits/get/146606/Task_146606_splits.arff`, authenticated: `FALSE`}
```
In fact, the ARFF data file is not downloaded until `as_task()` is called, and the splits file is not downloaded until `as_resampling()` is called. We will examine the source code later to understand this data fetching process.
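This lazy-download behavior can be sketched in plain base R (a toy illustration of the pattern, with my own hypothetical names — not mlr3oml's actual code): cheap metadata is available immediately, while the heavy payload is fetched on first access and then memoized.

```r
# Toy sketch of lazy fetching: metadata is cheap, the payload downloads once.
make_lazy_task <- function(id) {
  payload <- NULL                 # not downloaded yet
  downloads <- 0L                 # count how often the expensive fetch runs
  fetch <- function() {           # stands in for the ARFF/Parquet download
    downloads <<- downloads + 1L
    data.frame(id = id, x = 1:3)
  }
  list(
    meta = list(id = id),         # available immediately (like the JSON)
    data = function() {           # downloaded on first access, then cached
      if (is.null(payload)) payload <<- fetch()
      payload
    },
    downloads = function() downloads
  )
}

task <- make_lazy_task(146606)
task$downloads()                  # 0: nothing heavy fetched yet
invisible(task$data()); invisible(task$data())
task$downloads()                  # 1: fetched exactly once, then reused
```

The real package achieves the same effect with R6 private fields and active bindings, as section 3 shows.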
# 2. Parameter Analysis: `otsk()`
The signature for `otsk` is: `otsk(id, parquet = parquet_default(), test_server = test_server_default())`, according to the [official documentation](https://mlr3oml.mlr-org.com/reference/otsk.html).
## 2.1 The `test_server` Typing Inconsistency
The documentation specifies `test_server` as `character(1)`, yet the default fallback is `FALSE`. This is a slight typing inconsistency in the documentation. In practice, standard users should rely on the default (Public Server) to access real datasets. Here I won't explore the `test_server` parameter in detail.
## 2.2 ARFF vs. Parquet
* **ARFF**: An ARFF (Attribute-Relation File Format) file is an ASCII text file that describes a list of instances sharing a set of attributes. According to [weka-wiki](https://waikato.github.io/weka-wiki/formats_and_processing/arff_stable/), each instance is represented on a single line, with carriage returns denoting the end of the instance. So it is a row-oriented plain text format.
* **Parquet**: According to the [Apache Parquet documentation](https://parquet.apache.org/docs/), Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides high performance compression and encoding schemes to handle complex data in bulk and is supported in many programming languages and analytics tools. I remember that the startup DB company I joined in 2021 had a self-developed storage engine based on Parquet, aiming to improve the performance in time-series data analysis.
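To make the row-oriented nature of ARFF concrete, here is a minimal, hand-abbreviated sample in the style of the kr-vs-kp header (illustrative only, not the actual file):

```text
@RELATION kr-vs-kp

@ATTRIBUTE bkblk {f, t}
@ATTRIBUTE bknwy {f, t}
@ATTRIBUTE class {won, nowin}

@DATA
f,f,won
t,f,nowin
```

Each `@DATA` line is one complete instance, so a reader must parse whole rows even when only a few columns are needed — exactly the access pattern where a column-oriented format like Parquet wins.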
### 2.2.1 Parquet for the small dataset: kr-vs-kp
Let's try parquet for the small dataset: kr-vs-kp.
```{r experiment_parquet}
library(mlr3oml)
library(mlr3)
options(mlr3oml.cache = FALSE)
# Small dataset: kr-vs-kp
oml_task <- otsk(145953, parquet = TRUE)
task <- as_task(oml_task)
```
An INFO log is printed:
```sh
INFO [00:49:57.949] Retrieving parquet. {url: `https://data.openml.org/datasets/0000/0003/dataset_3.pq`, authenticated: `FALSE`}
```
This confirms that the Parquet file is downloaded.
> **Warning ⚠️**
> To use `parquet = TRUE`, you must have the following packages installed: `mlr3db`, `duckdb`, and `DBI`.
> You can install them with:
> ```r
> install.packages(c("mlr3db", "duckdb", "DBI"))
> ```
> among which the compilation of `duckdb` is time-consuming. `duckdb` is a high-performance analytical database system with native support for the open Parquet format. `DBI` is the standard database interface for R, and `mlr3db` provides database-backed data backends for `mlr3`.
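Before opting into Parquet, it can be convenient to check that this stack is actually installed. A small helper (my own sketch, not part of mlr3oml):

```r
# Return TRUE only if every package required for parquet = TRUE is installed.
has_parquet_deps <- function(pkgs = c("mlr3db", "duckdb", "DBI")) {
  all(vapply(pkgs, requireNamespace, logical(1), quietly = TRUE))
}

# Opt into Parquet only when the dependencies are present, e.g.:
# oml_task <- otsk(145953, parquet = has_parquet_deps())
```

`requireNamespace(quietly = TRUE)` returns `FALSE` instead of erroring for a missing package, so the helper degrades gracefully to the ARFF path.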
### 2.2.2 Comprehensive Test Script for ARFF vs. Parquet: Speed and Size
Here I provide an [R script](https://github.com/weicaocw/mlr3-pkgs/blob/main/mlr3hf/entry_tasks/easy/comprehensive_comparison_arff_parquet.r) to test the speed and size of ARFF vs. Parquet for the small dataset kr-vs-kp and a very large dataset, physionet_sepsis (1552210 × 44).
The results are as follows:
```sh
[1] "Task ID: 363535"
[1] "Cache directory: /root/autodl-tmp/data/mlr3oml_cache"
[1] "--- ARFF: download + read ---"
[1] "ARFF task object creation time (download path):"
user system elapsed
0.119 0.008 0.126
INFO [00:59:15.889] Retrieving JSON {url: `https://www.openml.org/api/v1/json/task/363535`, authenticated: `FALSE`}
INFO [00:59:34.850] Retrieving JSON {url: `https://www.openml.org/api/v1/json/data/46817`, authenticated: `FALSE`}
INFO [00:59:37.789] Retrieving ARFF {url: `https://openml.org/data/v1/download/22124380/physionet_sepsis.arff`, authenticated: `FALSE`}
INFO [01:00:51.760] Retrieving JSON {url: `https://www.openml.org/api/v1/json/data/features/46817`, authenticated: `FALSE`}
[1] "ARFF data materialization time (download path):"
user system elapsed
12.998 3.174 99.319
[1] "--- ARFF cache files after download path ---"
size
/root/autodl-tmp/data/mlr3oml_cache/public/data_desc/46817.qs2 1972
/root/autodl-tmp/data/mlr3oml_cache/public/data_features/46817.qs2 1294
/root/autodl-tmp/data/mlr3oml_cache/public/data/46817.qs2 21767295
/root/autodl-tmp/data/mlr3oml_cache/public/task_desc/363535.qs2 751
/root/autodl-tmp/data/mlr3oml_cache/version.json 171
[1] "--- ARFF: read from cache ---"
[1] "ARFF task object creation time (cache path):"
user system elapsed
0.071 0.003 0.075
[1] "ARFF data materialization time (cache path):"
user system elapsed
1.366 0.641 1.391
[1] "--- ARFF cache files after cache path ---"
size
/root/autodl-tmp/data/mlr3oml_cache/public/data_desc/46817.qs2 1972
/root/autodl-tmp/data/mlr3oml_cache/public/data_features/46817.qs2 1294
/root/autodl-tmp/data/mlr3oml_cache/public/data/46817.qs2 21767295
/root/autodl-tmp/data/mlr3oml_cache/public/task_desc/363535.qs2 751
/root/autodl-tmp/data/mlr3oml_cache/version.json 171
[1] "--- Parquet: download + read ---"
[1] "Parquet task object creation time (download path):"
user system elapsed
0.001 0.000 0.001
INFO [01:00:56.952] Cache directory '/root/autodl-tmp/data/mlr3oml_cache' changed since initializing this object and is now '/root/autodl-tmp/data/mlr3oml_cache'.
INFO [01:00:56.954] Retrieving JSON {url: `https://www.openml.org/api/v1/json/task/363535`, authenticated: `FALSE`}
INFO [01:01:00.887] Retrieving JSON {url: `https://www.openml.org/api/v1/json/data/46817`, authenticated: `FALSE`}
INFO [01:01:03.886] Retrieving parquet. {url: `https://data.openml.org/datasets/0004/46817/dataset_46817.pq`, authenticated: `FALSE`}
INFO [01:01:10.427] Retrieving JSON {url: `https://www.openml.org/api/v1/json/data/features/46817`, authenticated: `FALSE`}
[1] "Parquet data materialization time (download path):"
user system elapsed
3.940 0.899 19.880
[1] "--- Parquet cache files after download path ---"
size
/root/autodl-tmp/data/mlr3oml_cache/public/data_desc/46817.qs2 1972
/root/autodl-tmp/data/mlr3oml_cache/public/data_features/46817.qs2 1294
/root/autodl-tmp/data/mlr3oml_cache/public/data_parquet/46817.parquet 17000952
/root/autodl-tmp/data/mlr3oml_cache/public/task_desc/363535.qs2 751
[1] "--- Parquet: read from cache ---"
[1] "Parquet task object creation time (cache path):"
user system elapsed
0.001 0.000 0.000
[1] "Parquet data materialization time (cache path):"
user system elapsed
3.131 0.945 2.592
[1] "--- Parquet cache files after cache path ---"
size
/root/autodl-tmp/data/mlr3oml_cache/public/data_desc/46817.qs2 1972
/root/autodl-tmp/data/mlr3oml_cache/public/data_features/46817.qs2 1294
/root/autodl-tmp/data/mlr3oml_cache/public/data_parquet/46817.parquet 17000952
/root/autodl-tmp/data/mlr3oml_cache/public/task_desc/363535.qs2 751
```
Here are some key observations:
1. The speed of reading the data from Parquet is much faster than from ARFF: 19.880 seconds vs. 99.319 seconds.
2. When selecting ARFF, the actual cached format is qs2, which aims to provide reliable and fast performance for saving and loading objects in R.
3. The size of the Parquet file is smaller than the qs2 file: 17000952 bytes vs. 21767295 bytes.
4. Loading from the cache, reading Parquet is actually slower than reading ARFF: 2.592 seconds vs. 1.391 seconds.
5. Loading from the cache, both ARFF and Parquet show elapsed time lower than the sum of user and system time, which indicates multi-core work:
- ARFF: 1.391 seconds vs. 1.366 seconds + 0.641 seconds = 2.007 seconds
- Parquet: 2.592 seconds vs. 3.131 seconds + 0.945 seconds = 4.076 seconds
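The parallelism hinted at in observation 5 can be quantified as an effective CPU-over-wall-clock factor, computed here from the cache-path timings above:

```r
# Effective parallelism = (user + system CPU time) / wall-clock elapsed time.
cpu_over_wall <- function(user, system, elapsed) (user + system) / elapsed

arff_factor    <- cpu_over_wall(1.366, 0.641, 1.391)  # cache-path ARFF read
parquet_factor <- cpu_over_wall(3.131, 0.945, 2.592)  # cache-path Parquet read

round(c(arff = arff_factor, parquet = parquet_factor), 2)
#    arff parquet
#    1.44    1.57
```

A factor above 1 means more than one core's worth of CPU time was spent per wall-clock second, so both readers parallelize, Parquet slightly more so.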
Conclusion:
1. Parquet is possibly a better choice for large datasets, with less download time and less space consumption.
2. ARFF data is downloaded and converted to qs2 format for caching.
3. Loading from the cache, ARFF is faster than Parquet.
4. It is evident that loading both formats uses multiple cores.
One final remark:
I would have liked to explore whether there is any performance difference when analyzing a large dataset backed by ARFF vs. Parquet, especially when the queries are Parquet-friendly, or whether performance is format-independent once the data is loaded. That is beyond the scope of this report.
# 3. Source Code Analysis
I inspected the [`OMLData.R` source code](https://github.com/mlr-org/mlr3oml/blob/main/R/OMLData.R) from the `mlr3oml` GitHub repository to understand the data fetching process.
Why this file? Because `oml_task$data` is an `OMLData` object, and as noted in section 1.3, only JSON metadata is downloaded eagerly; the ARFF data file is not downloaded until `as_task()` is called.
The source code holds no surprises.
1. **`.get_backend()`**: The actual data download occurs only when `private$.get_backend()` is called (Line 201-230).
```r
.get_backend = function(primary_key = NULL) {
if (!is.null(private$.backend)) {
return(private$.backend)
}
backend = NULL
if (self$parquet) {
require_namespaces(c("mlr3db", "duckdb", "DBI"))
path = try({self$parquet_path}, silent = TRUE)
if (inherits(path, "try-error")) {
lg$info("Failed to download parquet, trying arff.", id = self$id)
} else {
factors = self$features[get("data_type") == "nominal", "name"][[1L]]
backend = try(as_duckdb_backend_character(path, primary_key = primary_key, factors = factors), silent = TRUE)
if (inherits(backend, "try-error")) {
msg = sprintf(
"Parquet available but failed to create backend, reverting to arff. Error message is '%s'", # nolint
backend
)
lg$info(msg, id = self$id)
}
}
}
if (is.null(backend) || inherits(path, "try-error") || inherits(backend, "try-error")) {
data = cached(download_arff, "data", self$id, desc = self$desc, cache_dir = self$cache_dir,
server = self$server, test_server = self$test_server
)
backend = as_data_backend(data, primary_key = primary_key)
}
private$.backend = backend
}
```
For ARFF, it is:
```r
data = cached(download_arff, "data", self$id, desc = self$desc, cache_dir = self$cache_dir,
server = self$server, test_server = self$test_server
)
```
where `download_arff` downloads the ARFF data file, wrapped in `cached()` so that a cached copy is reused when available.
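The behaviour of `cached()` can be approximated by a tiny file-backed memoizer (a sketch under my own hypothetical names, not mlr3oml's actual implementation):

```r
# Minimal file-backed cache: run `fetch` once, reuse the stored RDS afterwards.
cached_fetch <- function(key, fetch, cache_dir = tempdir()) {
  path <- file.path(cache_dir, paste0(key, ".rds"))
  if (file.exists(path)) return(readRDS(path))  # cache hit: skip the download
  value <- fetch()                              # cache miss: expensive call
  saveRDS(value, path)
  value
}

calls <- 0L
slow_download <- function() { calls <<- calls + 1L; mtcars }

dir <- file.path(tempdir(), "demo_cache")
dir.create(dir, showWarnings = FALSE)
a <- cached_fetch("data_42", slow_download, dir)
b <- cached_fetch("data_42", slow_download, dir)
calls  # 1: the second call was served from disk
```

This also matches the cache listings above: the ARFF path stores the materialized data as `.qs2` files keyed by data ID, so subsequent reads never touch the network.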
For Parquet, it is:
```r
path = try({self$parquet_path}, silent = TRUE)
```
where `self$parquet_path` is defined as the active binding `parquet_path` (Line 180-193):
```r
parquet_path = function() {
if (isFALSE(self$parquet)) {
messagef("Parquet is not the selected data format, returning NULL.")
return(NULL)
}
if (is.null(private$.parquet_path)) {
loadNamespace("mlr3db")
private$.parquet_path = cached(
download_parquet, "data_parquet", self$id, desc = self$desc, cache_dir = self$cache_dir,
server = self$server, test_server = self$test_server, parquet = TRUE
)
}
private$.parquet_path
}
```
We see exactly the same caching logic as in the ARFF case:
```r
private$.parquet_path = cached(
download_parquet, "data_parquet", self$id, desc = self$desc, cache_dir = self$cache_dir,
server = self$server, test_server = self$test_server, parquet = TRUE
)
```
2. **Triggering the Download**: The download is triggered through `.get_backend()` when we explicitly access `oml_task$data$data` (Line 118-128) or convert it using `as_task()` via `as_data_backend()` (Line 259, Line 235-237):
```r
# Line 118-128, called when we explicitly call `oml_task$data$data`
data = function() {
backend = private$.get_backend()
ii = !self$features$is_ignore & !self$features$is_row_identifier
cols = self$features$name[ii]
existing = setdiff(backend$colnames, backend$primary_key)
if (!test_subset(cols, existing)) {
missing = setdiff(cols, existing)
warningf("Data is missing features from feature description {%s}.\n", paste0(missing, collapse = ", "))
}
backend$data(backend$rownames, cols)
},
```
```r
# Line 259, called when we convert it using `as_task()`
task = constructor$new(x$name, as_data_backend(x), target = target)
```
```r
# Line 235-237
as_data_backend.OMLData = function(data, primary_key = NULL, ...) {
get_private(data)$.get_backend(primary_key)
}
```
3. **The Parquet Fallback Mechanism**:
* **Hard Error**: If `parquet = TRUE` but `duckdb` or another dependency is missing, the code halts at `require_namespaces()`.
```r
# Line 207
require_namespaces(c("mlr3db", "duckdb", "DBI"))
```
* **Silent Fallback**: If the Parquet file does not exist on the OpenML server, the `try()` block catches the error, and the system seamlessly falls back to downloading the ARFF version.
```r
path = try({self$parquet_path}, silent = TRUE)
if (inherits(path, "try-error")) {
lg$info("Failed to download parquet, trying arff.", id = self$id)
}
```
If creating a Parquet backend fails, the system seamlessly falls back to downloading the ARFF version as well:
```r
backend = try(as_duckdb_backend_character(path, primary_key = primary_key, factors = factors), silent = TRUE)
if (inherits(backend, "try-error")) {
msg = sprintf(
"Parquet available but failed to create backend, reverting to arff. Error message is '%s'", # nolint
backend
)
lg$info(msg, id = self$id)
}
```
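The fallback shape used throughout `.get_backend()` reduces to a simple base-R idiom: attempt the preferred path under `try()`, inspect the result for a `try-error`, and only then take the fallback. A generic sketch (my own helper name, not the package's code):

```r
# Generic "try preferred, fall back" helper mirroring the structure above.
with_fallback <- function(preferred, fallback) {
  result <- try(preferred(), silent = TRUE)
  if (inherits(result, "try-error")) {
    message("preferred path failed, reverting to fallback")
    result <- fallback()
  }
  result
}

with_fallback(function() stop("no parquet here"), function() "arff backend")
# "arff backend"
with_fallback(function() "parquet backend", function() "arff backend")
# "parquet backend"
```

The key design choice is that `try()` converts the error into a value, so the caller decides the recovery policy instead of the failure propagating up the stack.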
# 4. Conclusion
This report explored the `otsk()` function in the `mlr3oml` package, analyzing its data retrieval mechanisms and format preferences (ARFF vs. Parquet).
I inspected the [`OMLData.R` source code](https://github.com/mlr-org/mlr3oml/blob/main/R/OMLData.R) from the `mlr3oml` GitHub repository to understand the data fetching process.
I provided an [R script](https://github.com/weicaocw/mlr3-pkgs/blob/main/mlr3hf/entry_tasks/easy/comprehensive_comparison_arff_parquet.r) to test the speed and size of ARFF vs. Parquet for the small dataset kr-vs-kp and a very large dataset, physionet_sepsis (1552210 × 44).
The results showed that Parquet is possibly the better choice for large datasets, with less download time and less space consumption. ARFF data is downloaded and immediately converted to the qs2 format for caching. Loading from the cache, ARFF is faster than Parquet, and the elapsed times show that loading either format uses multiple cores.
One final remark: I would have liked to explore whether there is any performance difference when analyzing a large dataset backed by ARFF vs. Parquet, especially when the queries are Parquet-friendly, or whether performance is format-independent once the data is loaded. That is beyond the scope of this report.