---
title: "mlr3hf Easy Task: Deep Dive into mlr3oml and Data Fetching"
author: "Cao Wei"
date: today
format:
html:
toc: true
toc-depth: 3
theme: cosmo
code-fold: show
code-tools: true
execute:
warning: false
message: false
---
# 1. Introduction & Basic Usage
This report explores the `otsk()` function in the `mlr3oml` package, analyzing its data retrieval mechanisms and format preferences (ARFF vs. Parquet).
This experiment runs on an AutoDL cloud server with the following environment and hardware/software configuration:
- Image: PyTorch 2.5.1 + Python 3.12 (Ubuntu 22.04) + CUDA 12.4
- GPU: 1 × NVIDIA RTX 3060 (12GB)
- CPU: 7 vCPU Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
- Memory: 15GB
- Disk: 30GB system disk, free 50GB SSD data disk (no paid disk), no additional disk
- Port mapping: None
- Custom service ports: 6006 (HTTP), 6008 (HTTP)
- Networking: Shared bandwidth with other instances in the same region
Based on the [mlr3oml documentation](https://mlr3oml.mlr-org.com/index.html), we can easily fetch an OpenML task. I set the cache option to `FALSE` to avoid reusing cache files from previous experiments.
```{r set_cache_false}
options(mlr3oml.cache = FALSE)
```
## 1.1 Classification Task 1: kr-vs-kp
With task ID 145953, ["Supervised Classification on kr-vs-kp"](https://www.openml.org/search?type=task&id=145953) is a supervised classification task. The corresponding dataset ["kr-vs-kp"](https://www.openml.org/d/3) has 3196 instances and 37 features, which is a small dataset.
```{r setup}
library(mlr3oml)
library(mlr3)
# Download and print the OpenML task with ID 145953
oml_task <- otsk(145953)
oml_task
```
Let's walk through the documentation to get a first taste of the package.
```{r access_data}
# Access the OpenML data object on which the task is built
oml_task$data
```
```{r convert_task}
# Convert the OpenML task to an mlr3 task and resampling
task <- as_task(oml_task)
resampling <- as_resampling(oml_task)
# Conduct a simple resample experiment
rr <- resample(task, lrn("classif.rpart"), resampling)
rr$aggregate()
```
## 1.2 Classification Task 2: Supervised Classification on higgs
With task ID 146606, ["Supervised Classification on higgs"](https://www.openml.org/search?type=task&qualities.NumberOfClasses=%3D_2&id=146606&source_data.data_id=23512) is also a supervised classification task. The corresponding dataset ["higgs"](https://www.openml.org/search?type=data&id=23512&sort=runs&status=active) has 98050 instances and 29 features, which is a larger dataset.
We also walk through the document for this task.
```{r access_data_2}
oml_task <- otsk(146606)
oml_task
```
```{r convert_task_2}
task <- as_task(oml_task)
resampling <- as_resampling(oml_task)
# Conduct a simple resample experiment
rr <- resample(task, lrn("classif.rpart"), resampling)
rr$aggregate()
```
## 1.3 Observations
In the blink of an eye, the resample experiment for the small dataset completes; for the large dataset, however, it takes noticeably longer.
While using `otsk` to access the data, we can see the following logs:
```sh
INFO [23:32:05.482] Retrieving JSON {url: `https://www.openml.org/api/v1/json/task/146606`, authenticated: `FALSE`}
INFO [23:32:06.064] Retrieving JSON {url: `https://www.openml.org/api/v1/json/data/qualities/23512`, authenticated: `FALSE`}
INFO [23:32:06.622] Retrieving JSON {url: `https://www.openml.org/api/v1/json/data/features/23512`, authenticated: `FALSE`}
INFO [23:32:07.123] Retrieving JSON {url: `https://www.openml.org/api/v1/json/data/23512`, authenticated: `FALSE`}
```
Only JSON metadata is downloaded at this point; the ARFF files are not fetched until `as_task()` and `as_resampling()` are called:
```sh
INFO [23:32:07.650] Retrieving ARFF {url: `https://openml.org/data/v1/download/2063675/higgs.arff`, authenticated: `FALSE`}
INFO [23:32:25.445] Retrieving ARFF {url: `https://openml.org/api_splits/get/146606/Task_146606_splits.arff`, authenticated: `FALSE`}
```
In fact, the ARFF data file is not downloaded until `as_task()` is called, and the splits file is not downloaded until `as_resampling()` is called. We will examine the source code later to understand this data fetching process.
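This lazy-download behavior can be sketched in plain base R (a toy illustration of the pattern, with my own hypothetical names — not mlr3oml's actual code): cheap metadata is available immediately, while the heavy payload is fetched on first access and then memoized.

```r
# Toy sketch of lazy fetching: metadata is cheap, the payload downloads once.
make_lazy_task <- function(id) {
  payload <- NULL                 # not downloaded yet
  downloads <- 0L                 # count how often the expensive fetch runs
  fetch <- function() {           # stands in for the ARFF/Parquet download
    downloads <<- downloads + 1L
    data.frame(id = id, x = 1:3)
  }
  list(
    meta = list(id = id),         # available immediately (like the JSON)
    data = function() {           # downloaded on first access, then cached
      if (is.null(payload)) payload <<- fetch()
      payload
    },
    downloads = function() downloads
  )
}

task <- make_lazy_task(146606)
task$downloads()                  # 0: nothing heavy fetched yet
invisible(task$data()); invisible(task$data())
task$downloads()                  # 1: fetched exactly once, then reused
```

The real package achieves the same effect with R6 private fields and active bindings, as section 3 shows.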
# 2. Parameter Analysis: `otsk()`
The signature for `otsk` is: `otsk(id, parquet = parquet_default(), test_server = test_server_default())`, according to the [official documentation](https://mlr3oml.mlr-org.com/reference/otsk.html).
## 2.1 The `test_server` Typing Inconsistency
The documentation specifies `test_server` as `character(1)`, yet the default fallback is `FALSE`. This is a slight typing inconsistency in the documentation. In practice, standard users should rely on the default (Public Server) to access real datasets. Here I won't explore the `test_server` parameter in detail.
## 2.2 ARFF vs. Parquet
* **ARFF**: An ARFF (Attribute-Relation File Format) file is an ASCII text file that describes a list of instances sharing a set of attributes. According to [weka-wiki](https://waikato.github.io/weka-wiki/formats_and_processing/arff_stable/), each instance is represented on a single line, with carriage returns denoting the end of the instance. So it is a row-oriented plain text format.
* **Parquet**: According to the [Apache Parquet documentation](https://parquet.apache.org/docs/), Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides high performance compression and encoding schemes to handle complex data in bulk and is supported in many programming languages and analytics tools. I remember that the startup DB company I joined in 2021 had a self-developed storage engine based on Parquet, aiming to improve the performance in time-series data analysis.
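To make the row-oriented nature of ARFF concrete, here is a minimal, hand-abbreviated sample in the style of the kr-vs-kp header (illustrative only, not the actual file):

```text
@RELATION kr-vs-kp

@ATTRIBUTE bkblk {f, t}
@ATTRIBUTE bknwy {f, t}
@ATTRIBUTE class {won, nowin}

@DATA
f,f,won
t,f,nowin
```

Each `@DATA` line is one complete instance, so a reader must parse whole rows even when only a few columns are needed — exactly the access pattern where a column-oriented format like Parquet wins.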
### 2.2.1 Parquet for the small dataset: kr-vs-kp
Let's try parquet for the small dataset: kr-vs-kp.
```{r experiment_parquet}
library(mlr3oml)
library(mlr3)
options(mlr3oml.cache = FALSE)
# Small dataset: kr-vs-kp
oml_task <- otsk(145953, parquet = TRUE)
task <- as_task(oml_task)
```
An INFO log is printed:
```sh
INFO [00:49:57.949] Retrieving parquet. {url: `https://data.openml.org/datasets/0000/0003/dataset_3.pq`, authenticated: `FALSE`}
```
This confirms that the Parquet file is downloaded.
> **Warning ⚠️**
> To use `parquet = TRUE`, you must have the following packages installed: `mlr3db`, `duckdb`, and `DBI`.
> You can install them with:
> ```r
> install.packages(c("mlr3db", "duckdb", "DBI"))
> ```
> among which the compilation of `duckdb` is time-consuming. `duckdb` is a high-performance analytical database system with native support for the open Parquet format. `DBI` is the standard database interface for R, and `mlr3db` provides database-backed data backends for `mlr3`.
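Before opting into Parquet, it can be convenient to check that this stack is actually installed. A small helper (my own sketch, not part of mlr3oml):

```r
# Return TRUE only if every package required for parquet = TRUE is installed.
has_parquet_deps <- function(pkgs = c("mlr3db", "duckdb", "DBI")) {
  all(vapply(pkgs, requireNamespace, logical(1), quietly = TRUE))
}

# Opt into Parquet only when the dependencies are present, e.g.:
# oml_task <- otsk(145953, parquet = has_parquet_deps())
```

`requireNamespace(quietly = TRUE)` returns `FALSE` instead of erroring for a missing package, so the helper degrades gracefully to the ARFF path.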
### 2.2.2 Comprehensive Test Script for ARFF vs. Parquet: Speed and Size
Here I provide an [R script](https://github.com/weicaocw/mlr3-pkgs/blob/main/mlr3hf/entry_tasks/easy/comprehensive_comparison_arff_parquet.r) to test the speed and size of ARFF vs. Parquet for the small dataset kr-vs-kp and a very large dataset, physionet_sepsis (1552210 × 44).
The results are as follows:
```sh
[1] "Task ID: 363535"
[1] "Cache directory: /root/autodl-tmp/data/mlr3oml_cache"
[1] "--- ARFF: download + read ---"
[1] "ARFF task object creation time (download path):"
user system elapsed
0.119 0.008 0.126
INFO [00:59:15.889] Retrieving JSON {url: `https://www.openml.org/api/v1/json/task/363535`, authenticated: `FALSE`}
INFO [00:59:34.850] Retrieving JSON {url: `https://www.openml.org/api/v1/json/data/46817`, authenticated: `FALSE`}
INFO [00:59:37.789] Retrieving ARFF {url: `https://openml.org/data/v1/download/22124380/physionet_sepsis.arff`, authenticated: `FALSE`}
INFO [01:00:51.760] Retrieving JSON {url: `https://www.openml.org/api/v1/json/data/features/46817`, authenticated: `FALSE`}
[1] "ARFF data materialization time (download path):"
user system elapsed
12.998 3.174 99.319
[1] "--- ARFF cache files after download path ---"
size
/root/autodl-tmp/data/mlr3oml_cache/public/data_desc/46817.qs2 1972
/root/autodl-tmp/data/mlr3oml_cache/public/data_features/46817.qs2 1294
/root/autodl-tmp/data/mlr3oml_cache/public/data/46817.qs2 21767295
/root/autodl-tmp/data/mlr3oml_cache/public/task_desc/363535.qs2 751
/root/autodl-tmp/data/mlr3oml_cache/version.json 171
[1] "--- ARFF: read from cache ---"
[1] "ARFF task object creation time (cache path):"
user system elapsed
0.071 0.003 0.075
[1] "ARFF data materialization time (cache path):"
user system elapsed
1.366 0.641 1.391
[1] "--- ARFF cache files after cache path ---"
size
/root/autodl-tmp/data/mlr3oml_cache/public/data_desc/46817.qs2 1972
/root/autodl-tmp/data/mlr3oml_cache/public/data_features/46817.qs2 1294
/root/autodl-tmp/data/mlr3oml_cache/public/data/46817.qs2 21767295
/root/autodl-tmp/data/mlr3oml_cache/public/task_desc/363535.qs2 751
/root/autodl-tmp/data/mlr3oml_cache/version.json 171
[1] "--- Parquet: download + read ---"
[1] "Parquet task object creation time (download path):"
user system elapsed
0.001 0.000 0.001
INFO [01:00:56.952] Cache directory '/root/autodl-tmp/data/mlr3oml_cache' changed since initializing this object and is now '/root/autodl-tmp/data/mlr3oml_cache'.
INFO [01:00:56.954] Retrieving JSON {url: `https://www.openml.org/api/v1/json/task/363535`, authenticated: `FALSE`}
INFO [01:01:00.887] Retrieving JSON {url: `https://www.openml.org/api/v1/json/data/46817`, authenticated: `FALSE`}
INFO [01:01:03.886] Retrieving parquet. {url: `https://data.openml.org/datasets/0004/46817/dataset_46817.pq`, authenticated: `FALSE`}
INFO [01:01:10.427] Retrieving JSON {url: `https://www.openml.org/api/v1/json/data/features/46817`, authenticated: `FALSE`}
[1] "Parquet data materialization time (download path):"
user system elapsed
3.940 0.899 19.880
[1] "--- Parquet cache files after download path ---"
size
/root/autodl-tmp/data/mlr3oml_cache/public/data_desc/46817.qs2 1972
/root/autodl-tmp/data/mlr3oml_cache/public/data_features/46817.qs2 1294
/root/autodl-tmp/data/mlr3oml_cache/public/data_parquet/46817.parquet 17000952
/root/autodl-tmp/data/mlr3oml_cache/public/task_desc/363535.qs2 751
[1] "--- Parquet: read from cache ---"
[1] "Parquet task object creation time (cache path):"
user system elapsed
0.001 0.000 0.000
[1] "Parquet data materialization time (cache path):"
user system elapsed
3.131 0.945 2.592
[1] "--- Parquet cache files after cache path ---"
size
/root/autodl-tmp/data/mlr3oml_cache/public/data_desc/46817.qs2 1972
/root/autodl-tmp/data/mlr3oml_cache/public/data_features/46817.qs2 1294
/root/autodl-tmp/data/mlr3oml_cache/public/data_parquet/46817.parquet 17000952
/root/autodl-tmp/data/mlr3oml_cache/public/task_desc/363535.qs2 751
```
Here are some key observations:
1. The speed of reading the data from Parquet is much faster than from ARFF: 19.880 seconds vs. 99.319 seconds.
2. When selecting ARFF, the actual cached format is qs2, which aims to provide reliable and fast performance for saving and loading objects in R.
3. The size of the Parquet file is smaller than the qs2 file: 17000952 bytes vs. 21767295 bytes.
4. Loading from the cache, reading Parquet is actually slower than reading ARFF: 2.592 seconds vs. 1.391 seconds.
5. Loading from the cache, both ARFF and Parquet show elapsed time lower than the sum of user and system time, which indicates multi-core work:
- ARFF: 1.391 seconds vs. 1.366 seconds + 0.641 seconds = 2.007 seconds
- Parquet: 2.592 seconds vs. 3.131 seconds + 0.945 seconds = 4.076 seconds
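The parallelism hinted at in observation 5 can be quantified as an effective CPU-over-wall-clock factor, computed here from the cache-path timings above:

```r
# Effective parallelism = (user + system CPU time) / wall-clock elapsed time.
cpu_over_wall <- function(user, system, elapsed) (user + system) / elapsed

arff_factor    <- cpu_over_wall(1.366, 0.641, 1.391)  # cache-path ARFF read
parquet_factor <- cpu_over_wall(3.131, 0.945, 2.592)  # cache-path Parquet read

round(c(arff = arff_factor, parquet = parquet_factor), 2)
#    arff parquet
#    1.44    1.57
```

A factor above 1 means more than one core's worth of CPU time was spent per wall-clock second, so both readers parallelize, Parquet slightly more so.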
Conclusion:
1. Parquet is possibly a better choice for large datasets, with less download time and less space consumption.
2. ARFF data is downloaded and converted to qs2 format for caching.
3. Loading from the cache, ARFF is faster than Parquet.
4. It is evident that loading both formats uses multiple cores.
One final remark:
I would have liked to explore whether there is any performance difference when analyzing a large dataset backed by ARFF vs. Parquet, especially when the queries are Parquet-friendly, or whether performance is format-independent once the data is loaded. That is beyond the scope of this report.
# 3. Source Code Analysis
I inspected the [`OMLData.R` source code](https://github.com/mlr-org/mlr3oml/blob/main/R/OMLData.R) from the `mlr3oml` GitHub repository to understand the data fetching process.
Why this file? Because `oml_task$data` is an `OMLData` object, and as noted in section 1.3, only JSON metadata is downloaded eagerly; the ARFF data file is not downloaded until `as_task()` is called.
The source code holds no surprises.
1. **`.get_backend()`**: The actual data download occurs only when `private$.get_backend()` is called (Line 201-230).
```r
.get_backend = function(primary_key = NULL) {
if (!is.null(private$.backend)) {
return(private$.backend)
}
backend = NULL
if (self$parquet) {
require_namespaces(c("mlr3db", "duckdb", "DBI"))
path = try({self$parquet_path}, silent = TRUE)
if (inherits(path, "try-error")) {
lg$info("Failed to download parquet, trying arff.", id = self$id)
} else {
factors = self$features[get("data_type") == "nominal", "name"][[1L]]
backend = try(as_duckdb_backend_character(path, primary_key = primary_key, factors = factors), silent = TRUE)
if (inherits(backend, "try-error")) {
msg = sprintf(
"Parquet available but failed to create backend, reverting to arff. Error message is '%s'", # nolint
backend
)
lg$info(msg, id = self$id)
}
}
}
if (is.null(backend) || inherits(path, "try-error") || inherits(backend, "try-error")) {
data = cached(download_arff, "data", self$id, desc = self$desc, cache_dir = self$cache_dir,
server = self$server, test_server = self$test_server
)
backend = as_data_backend(data, primary_key = primary_key)
}
private$.backend = backend
}
```
For ARFF, it is:
```r
data = cached(download_arff, "data", self$id, desc = self$desc, cache_dir = self$cache_dir,
server = self$server, test_server = self$test_server
)
```
where `download_arff` downloads the ARFF data file, wrapped in `cached()` so that a cached copy is reused when available.
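The behaviour of `cached()` can be approximated by a tiny file-backed memoizer (a sketch under my own hypothetical names, not mlr3oml's actual implementation):

```r
# Minimal file-backed cache: run `fetch` once, reuse the stored RDS afterwards.
cached_fetch <- function(key, fetch, cache_dir = tempdir()) {
  path <- file.path(cache_dir, paste0(key, ".rds"))
  if (file.exists(path)) return(readRDS(path))  # cache hit: skip the download
  value <- fetch()                              # cache miss: expensive call
  saveRDS(value, path)
  value
}

calls <- 0L
slow_download <- function() { calls <<- calls + 1L; mtcars }

dir <- file.path(tempdir(), "demo_cache")
dir.create(dir, showWarnings = FALSE)
a <- cached_fetch("data_42", slow_download, dir)
b <- cached_fetch("data_42", slow_download, dir)
calls  # 1: the second call was served from disk
```

This also matches the cache listings above: the ARFF path stores the materialized data as `.qs2` files keyed by data ID, so subsequent reads never touch the network.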
For Parquet, it is:
```r
path = try({self$parquet_path}, silent = TRUE)
```
where `self$parquet_path` is defined as the active binding `parquet_path` (Line 180-193):
```r
parquet_path = function() {
if (isFALSE(self$parquet)) {
messagef("Parquet is not the selected data format, returning NULL.")
return(NULL)
}
if (is.null(private$.parquet_path)) {
loadNamespace("mlr3db")
private$.parquet_path = cached(
download_parquet, "data_parquet", self$id, desc = self$desc, cache_dir = self$cache_dir,
server = self$server, test_server = self$test_server, parquet = TRUE
)
}
private$.parquet_path
}
```
We see exactly the same caching logic as in the ARFF case:
```r
private$.parquet_path = cached(
download_parquet, "data_parquet", self$id, desc = self$desc, cache_dir = self$cache_dir,
server = self$server, test_server = self$test_server, parquet = TRUE
)
```
2. **Triggering the Download**: The download is triggered through `.get_backend()` when we explicitly access `oml_task$data$data` (Line 118-128) or convert it using `as_task()` via `as_data_backend()` (Line 259, Line 235-237):
```r
# Line 118-128, called when we explicitly call `oml_task$data$data`
data = function() {
backend = private$.get_backend()
ii = !self$features$is_ignore & !self$features$is_row_identifier
cols = self$features$name[ii]
existing = setdiff(backend$colnames, backend$primary_key)
if (!test_subset(cols, existing)) {
missing = setdiff(cols, existing)
warningf("Data is missing features from feature description {%s}.\n", paste0(missing, collapse = ", "))
}
backend$data(backend$rownames, cols)
},
```
```r
# Line 259, called when we convert it using `as_task()`
task = constructor$new(x$name, as_data_backend(x), target = target)
```
```r
# Line 235-237
as_data_backend.OMLData = function(data, primary_key = NULL, ...) {
get_private(data)$.get_backend(primary_key)
}
```
3. **The Parquet Fallback Mechanism**:
* **Hard Error**: If `parquet = TRUE` but `duckdb` or another dependency is missing, the code halts at `require_namespaces()`.
```r
# Line 207
require_namespaces(c("mlr3db", "duckdb", "DBI"))
```
* **Silent Fallback**: If the Parquet file does not exist on the OpenML server, the `try()` block catches the error, and the system seamlessly falls back to downloading the ARFF version.
```r
path = try({self$parquet_path}, silent = TRUE)
if (inherits(path, "try-error")) {
lg$info("Failed to download parquet, trying arff.", id = self$id)
}
```
If creating a Parquet backend fails, the system seamlessly falls back to downloading the ARFF version as well:
```r
backend = try(as_duckdb_backend_character(path, primary_key = primary_key, factors = factors), silent = TRUE)
if (inherits(backend, "try-error")) {
msg = sprintf(
"Parquet available but failed to create backend, reverting to arff. Error message is '%s'", # nolint
backend
)
lg$info(msg, id = self$id)
}
```
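The fallback shape used throughout `.get_backend()` reduces to a simple base-R idiom: attempt the preferred path under `try()`, inspect the result for a `try-error`, and only then take the fallback. A generic sketch (my own helper name, not the package's code):

```r
# Generic "try preferred, fall back" helper mirroring the structure above.
with_fallback <- function(preferred, fallback) {
  result <- try(preferred(), silent = TRUE)
  if (inherits(result, "try-error")) {
    message("preferred path failed, reverting to fallback")
    result <- fallback()
  }
  result
}

with_fallback(function() stop("no parquet here"), function() "arff backend")
# "arff backend"
with_fallback(function() "parquet backend", function() "arff backend")
# "parquet backend"
```

The key design choice is that `try()` converts the error into a value, so the caller decides the recovery policy instead of the failure propagating up the stack.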
# 4. Conclusion
This report explored the `otsk()` function in the `mlr3oml` package, analyzing its data retrieval mechanisms and format preferences (ARFF vs. Parquet).
I inspected the [`OMLData.R` source code](https://github.com/mlr-org/mlr3oml/blob/main/R/OMLData.R) from the `mlr3oml` GitHub repository to understand the data fetching process.
I provided an [R script](https://github.com/weicaocw/mlr3-pkgs/blob/main/mlr3hf/entry_tasks/easy/comprehensive_comparison_arff_parquet.r) to test the speed and size of ARFF vs. Parquet for the small dataset kr-vs-kp and a very large dataset, physionet_sepsis (1552210 × 44).
The results showed that Parquet is possibly the better choice for large datasets, with less download time and less space consumption. ARFF data is downloaded and immediately converted to the qs2 format for caching. Loading from the cache, ARFF is faster than Parquet, and the elapsed times show that loading either format uses multiple cores.
One final remark: I would have liked to explore whether there is any performance difference when analyzing a large dataset backed by ARFF vs. Parquet, especially when the queries are Parquet-friendly, or whether performance is format-independent once the data is loaded. That is beyond the scope of this report.