mlr3hf Hard Task: Downloading and Converting Hugging Face Datasets to mlr3 Tasks

Author

Cao Wei

Published

March 4, 2026

0. Local Environment Snapshot

Environment was detected locally on March 4, 2026.

Item Value
OS Kernel Darwin 25.0.0 x86_64
OS Version macOS 26.0.1 (Build 25A362)
Machine Model MacBook Pro (MacBookPro16,1)
CPU Intel Core i7, 2.6 GHz, 6 cores
Memory 32 GB
R Version R 4.5.2 (2025-10-31)
C Compiler (CC) clang -std=gnu2x
C++ Compiler (CXX) clang++ -std=gnu++17

1. Executive Summary and Task Objective

This live report documents the Minimum Viable Product (MVP) for the mlr3hf Hard Task. The goal is to download a dataset from the Hugging Face Hub and convert it into a standard mlr3 task using only native R tooling, without Python dependencies such as reticulate.

Beyond the MVP code path, this report also highlights architecture constraints, performance choices, and a practical roadmap toward a production-quality CRAN package.

2. MVP Implementation: fka/prompts.chat Case Study

This section demonstrates an end-to-end path built with httr2, duckdb, and data.table.

2.1 Fetch Metadata and Retrieve the Parquet URL

Code
library(httr2)

dataset_id <- "fka/prompts.chat"
api_url <- "https://datasets-server.huggingface.co/parquet"

req <- request(api_url) |>
    req_url_query(dataset = dataset_id) |>
    req_retry(max_tries = 3)

resp <- req_perform(req)
meta <- resp_body_json(resp)

stopifnot(length(meta$parquet_files) > 0)
parquet_url <- meta$parquet_files[[1]]$url

cat("Parquet URL retrieved:\n", parquet_url, "\n")
Parquet URL retrieved:
 https://huggingface.co/datasets/fka/prompts.chat/resolve/refs%2Fconvert%2Fparquet/default/train/0000.parquet 

2.2 Stream to Disk to Protect Memory

Code
tmp_file <- tempfile(fileext = ".parquet")

request(parquet_url) |>
    req_perform(path = tmp_file)
<httr2_response>
GET https://cas-bridge.xethub.hf.co/xet-bridge-us/63990f21cc50af73d29ecfa3/022ab4c5690d8857591c6b1efab63e5a30a10ada1005ecb3137729d412e04729?X-Amz-Algorithm=AWS4-HMAC-SHA256&... (signed URL truncated; the remaining query parameters are expiring AWS-style authentication tokens)
Status: 200 OK
Body: On disk '/var/folders/cd/8ncrly493_1101j5bf9zcx240000gn/T//Rtmpt9WyEO/file175b315062250.parquet' (1724280 bytes)
Code
cat("Parquet saved to:", tmp_file, "\n")
Parquet saved to: /var/folders/cd/8ncrly493_1101j5bf9zcx240000gn/T//Rtmpt9WyEO/file175b315062250.parquet 
Code
cat("File size (bytes):", file.info(tmp_file)$size, "\n")
File size (bytes): 1724280 

2.3 Query with DuckDB and Clean with data.table

Code
library(DBI)
library(duckdb)
library(data.table)

con <- dbConnect(duckdb())
query <- sprintf(
    "SELECT * FROM read_parquet(%s)",
    dbQuoteString(con, tmp_file)
)
data <- dbGetQuery(con, query)

setDT(data)

if ("for_devs" %in% names(data)) {
    data[, for_devs := as.factor(for_devs)]
}

if ("mlr3_row_id" %in% names(data)) {
    data[, mlr3_row_id := NULL]
}

head(data, 3)
                               act
                            <char>
1:              Ethereum Developer
2:                  Linux Terminal
3: English Translator and Improver
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               prompt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               <char>
1:                 Imagine you are an experienced Ethereum developer tasked with creating a smart contract for a blockchain messenger. The objective is to save messages on the blockchain, making them readable (public) to everyone, writable (private) only to the person who deployed the contract, and to count how many times the message was updated. Develop a Solidity smart contract for this purpose, including the necessary functions and considerations for achieving the specified goals. Please provide the code and any relevant explanations to ensure a clear understanding of the implementation.
2:                                                                                                                                                                         I want you to act as a linux terminal. I will type commands and you will reply with what the terminal should show. I want you to only reply with the terminal output inside one unique code block, and nothing else. do not write explanations. do not type commands unless I instruct you to do so. when i need to tell you something in english, i will do so by putting text inside curly brackets {like this}. my first command is pwd
3: I want you to act as an English translator, spelling corrector and improver. I will speak to you in any language and you will detect the language, translate it and answer in the corrected and improved version of my text, in English. I want you to replace my simplified A0-level words and sentences with more beautiful and elegant, upper level English words and sentences. Keep the meaning same, but make them more literary. I want you to only reply the correction, the improvements and nothing else, do not write explanations. My first sentence is "istanbulu cok seviyom burada olmak cok guzel"
   for_devs   type contributor
     <fctr> <char>      <char>
1:     TRUE   TEXT  ameya-2003
2:     TRUE   TEXT           f
3:    FALSE   TEXT           f
Code
dbDisconnect(con, shutdown = TRUE)

2.4 Construct the mlr3 Classification Task

Code
library(mlr3)

stopifnot("for_devs" %in% names(data))
task <- as_task_classif(data, target = "for_devs", id = "hf_prompts_task")

task

── <TaskClassif> (1395x5) ──────────────────────────────────────────────────────
• Target: for_devs
• Target classes: FALSE (positive class, 93%), TRUE (7%)
• Properties: twoclass
• Features (4):
  • chr (4): act, contributor, prompt, type
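As a quick sanity check that the converted task behaves like any other mlr3 task, it can be dropped into a standard resampling. This is a sketch assuming the `task` object from above is still in scope; the featureless baseline is chosen because it accepts any feature types.

```r
library(mlr3)

# A featureless baseline ignores the features entirely and
# predicts the majority class, so character columns are no obstacle.
learner <- lrn("classif.featureless")
rr <- resample(task, learner, rsmp("cv", folds = 3))
rr$aggregate(msr("classif.acc"))
```

Given the 93%/7% class imbalance reported above, the baseline accuracy should land near 0.93, which is a useful floor for any real learner.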

3. Future Roadmap and API Design

3.1 Lessons Learned from the Easy Task with otsk()

The easy task taught us several lessons that should inform the design going forward.

3.1.1 Lazy Loading

While reading the mlr3oml source code, I noticed that otsk() only downloads metadata when called; the actual data is fetched lazily, when as_task is used or when $data$data is accessed. Our design should adopt the same behavior.
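The lazy pattern can be mirrored with an R6 class whose constructor fetches only metadata, deferring the Parquet download to an active binding. This is a design sketch, not final API: `fetch_hf_parquet_meta()` and `download_and_read_parquet()` are hypothetical helpers standing in for the code from the MVP sections.

```r
library(R6)

# Sketch of a lazily-loaded Hugging Face dataset object, mirroring the
# otsk() pattern: construction is cheap, data download happens on first use.
HFDataset <- R6Class("HFDataset",
  public = list(
    dataset_id = NULL,
    meta = NULL,
    initialize = function(dataset_id) {
      self$dataset_id <- dataset_id
      # Cheap: metadata request only (hypothetical helper).
      self$meta <- fetch_hf_parquet_meta(dataset_id)
    }
  ),
  active = list(
    data = function() {
      if (is.null(private$.data)) {
        # Expensive: triggered only on first access (hypothetical helper).
        private$.data <- download_and_read_parquet(self$meta)
      }
      private$.data
    }
  ),
  private = list(.data = NULL)
)
```

With this shape, `HFDataset$new(id)` stays fast, and `x$data` pays the download cost exactly once.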

3.1.2 Caching issue: very large datasets

Very large datasets may be split across multiple Parquet files, unlike the MVP, which handles only a single file. Users will also expect downloads to be cached, so both concerns need to be handled properly.
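Both concerns can be combined in one loader: cache each Parquet file on disk, then let DuckDB union them in a single query. This is a sketch; the cache location via `tools::R_user_dir()` and the file-naming scheme are assumptions, not a finalized layout.

```r
library(DBI)
library(duckdb)

# Per-user cache directory, following the R >= 4.0 convention.
cache_dir <- tools::R_user_dir("mlr3hf", which = "cache")
dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)

read_cached_parquet <- function(urls) {
  # Download each file only if it is not cached yet.
  paths <- vapply(urls, function(u) {
    dest <- file.path(cache_dir, basename(u))
    if (!file.exists(dest)) {
      httr2::request(u) |> httr2::req_perform(path = dest)
    }
    dest
  }, character(1))

  con <- dbConnect(duckdb())
  on.exit(dbDisconnect(con, shutdown = TRUE))
  # DuckDB's read_parquet() accepts a list of files and unions them.
  sql <- sprintf("SELECT * FROM read_parquet([%s])",
                 paste(dbQuoteString(con, paths), collapse = ", "))
  dbGetQuery(con, sql)
}
```

A production version would also validate cached files (size or checksum) before trusting them.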

3.1.3 Fallback

Has the HF datasets-server actually prepared Parquet files for every dataset? If some are not ready, do we simply raise an error, or do we fall back to downloading the original format (e.g., CSV) directly? There is also a performance angle: setting up the Parquet toolchain is a heavy operation; for example, compiling duckdb from source can take more than half an hour. Users who do not want Parquet should not be forced to use it.
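One possible fallback strategy: try the datasets-server Parquet endpoint first, and if no converted files exist, fall back to the repo's raw files via the Hub API's sibling listing. This is a sketch under the assumption that these two public endpoints behave as documented; error handling is deliberately minimal.

```r
library(httr2)

get_data_urls <- function(dataset_id) {
  # First choice: the Parquet conversion, as in the MVP.
  meta <- tryCatch({
    request("https://datasets-server.huggingface.co/parquet") |>
      req_url_query(dataset = dataset_id) |>
      req_perform() |>
      resp_body_json()
  }, error = function(e) NULL)

  if (!is.null(meta) && length(meta$parquet_files) > 0) {
    return(list(format = "parquet",
                urls = vapply(meta$parquet_files, `[[`, character(1), "url")))
  }

  # Fallback: list the raw repo files and keep only CSVs.
  info <- request(paste0("https://huggingface.co/api/datasets/", dataset_id)) |>
    req_perform() |>
    resp_body_json()
  files <- vapply(info$siblings, `[[`, character(1), "rfilename")
  csvs <- files[grepl("\\.csv$", files)]
  if (length(csvs) == 0) stop("No Parquet conversion and no CSV files found.")
  list(format = "csv",
       urls = sprintf("https://huggingface.co/datasets/%s/resolve/main/%s",
                      dataset_id, csvs))
}
```

Returning the format alongside the URLs lets the caller pick a reader (DuckDB for Parquet, data.table::fread for CSV) without re-probing the server.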

3.2 Robustness

At a basic level, we need robust engineering practices, such as checking HTTP status codes on every network request. At a more advanced level, can we support resumable (resume-from-breakpoint) and multithreaded downloads?
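Resume-from-breakpoint can be approximated with an HTTP Range request: if a partial file exists, ask the server for the remaining bytes only. A sketch, assuming the server honors Range (S3-backed Hugging Face downloads generally do); a production version should verify the Content-Range header and a checksum, and stream large bodies to disk instead of buffering them.

```r
library(httr2)

download_resumable <- function(url, dest) {
  start <- if (file.exists(dest)) file.info(dest)$size else 0
  req <- request(url) |> req_retry(max_tries = 3)
  if (start > 0) {
    # Ask only for the bytes we do not have yet.
    req <- req |> req_headers(Range = sprintf("bytes=%d-", start))
  }
  resp <- req_perform(req)
  # 206 = partial content (append); 200 = server ignored Range (overwrite).
  con <- file(dest, open = if (resp_status(resp) == 206) "ab" else "wb")
  on.exit(close(con))
  writeBin(resp_body_raw(resp), con)
  invisible(dest)
}
```

httr2 already covers the "basic level" here: req_retry() handles transient failures, and req_perform() errors on non-2xx status codes by default.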

3.3 Customized Tasks: usability-oriented interfaces

There are many advanced requirements for customized tasks, such as merging fields. Can we provide a more user-friendly interface so users do not need to manipulate data.table directly, manually compose view/query statements, or build mlr3pipelines by themselves?
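One usability-oriented option is to accept column transformations declaratively at task-construction time. Everything here is a hypothetical design sketch: `hf_task()` and its `mutate` argument are not real API, just one possible shape for hiding the data.table manipulation.

```r
# Hypothetical convenience wrapper: users declare derived columns as
# quoted expressions instead of editing the data.table themselves.
hf_task <- function(data, target, id, mutate = list()) {
  dt <- data.table::as.data.table(data)
  for (new_col in names(mutate)) {
    # Evaluate each expression with the table's columns in scope.
    dt[[new_col]] <- eval(mutate[[new_col]], envir = dt)
  }
  mlr3::as_task_classif(dt, target = target, id = id)
}

# Usage: merge `act` and `prompt` into one text field, no data.table
# syntax required on the user's side.
# task <- hf_task(data, target = "for_devs", id = "hf_prompts_task",
#                 mutate = list(full_text = quote(paste(act, prompt))))
```

The same idea could later be generalized to row filters or to emitting mlr3pipelines PipeOps instead of eager transformations.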

3.4 Support conversion to multiple task types

Beyond classification and regression, there are further task types such as Supervised Data Stream Classification. mlr3oml, for instance, previously could not handle OpenML task 7317 of that type and raised an error because the type was not recognized.
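A dispatch table makes the supported types explicit and turns the unrecognized-type case into a clear, early error instead of a failure deep inside a converter. `as_hf_task()` is a hypothetical name for such a dispatcher.

```r
as_hf_task <- function(data, target, id, task_type = "classif") {
  converters <- list(
    classif = mlr3::as_task_classif,
    regr    = mlr3::as_task_regr
    # Data-stream classification would need an mlr3 extension that
    # does not exist yet; listing it here would be premature.
  )
  conv <- converters[[task_type]]
  if (is.null(conv)) {
    stop("Unsupported task type: '", task_type,
         "'. Supported: ", paste(names(converters), collapse = ", "))
  }
  conv(data, target = target, id = id)
}
```

New task types then become one-line additions to the table rather than scattered if/else branches.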

3.5 Deal with very large datasets

When we eventually handle truly large datasets (e.g., 100 GB) that cannot be loaded into a data.table at once, we may need lower-level solutions, such as modifying the DuckDB backend's dictionary mapping table (col_info).

More engineering challenges of this kind may appear later; this list only covers the issues encountered so far.
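Before reaching for custom backend surgery, the existing mlr3db package is worth evaluating: it provides DuckDB-backed data backends so the task never materializes the full table in memory. A sketch, with the caveat that the exact capabilities of as_duckdb_backend() (e.g., whether it accepts a Parquet path directly) depend on the mlr3db version installed.

```r
library(mlr3)
library(mlr3db)  # provides DuckDB-backed DataBackend classes

# tmp_file is the Parquet file downloaded in section 2.2.
# Assumption: this mlr3db version can wrap a Parquet path directly.
backend <- as_duckdb_backend(tmp_file)

# Construct the task on the out-of-memory backend; rows are pulled
# from DuckDB on demand rather than loaded up front.
task <- TaskClassif$new(id = "hf_prompts_big",
                        backend = backend,
                        target = "for_devs")
```

If mlr3db's backend proves sufficient, the col_info modification mentioned above may only be needed for cases its column-type mapping cannot express.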

3.6 Network constraints: user-configurable servers

For network-restricted hosts (e.g., AutoDL in China mainland) that cannot access the HF data server, we should not hard-code this URL. Users should be allowed to configure their own mirror endpoint.
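A small resolver function keeps the endpoint out of the request-building code entirely. The option and environment-variable names below are assumptions for illustration, resolved in a conventional priority order: explicit argument, then R option, then environment variable, then the default.

```r
hf_endpoint <- function(endpoint = NULL) {
  if (!is.null(endpoint)) return(endpoint)          # explicit argument wins
  opt <- getOption("mlr3hf.endpoint")               # hypothetical option name
  if (!is.null(opt)) return(opt)
  env <- Sys.getenv("MLR3HF_ENDPOINT")              # hypothetical env var
  if (nzchar(env)) return(env)
  "https://datasets-server.huggingface.co"          # default
}

# A mirror can then be selected without any code changes, e.g.:
# options(mlr3hf.endpoint = "https://my-mirror.example.com")
```

Every request builder would call hf_endpoint() instead of embedding the URL, so mirror support comes for free across the package.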

3.7 Other considerations

We can learn from existing solutions: how do Python-based approaches solve these problems?

We need to study their source code to identify issues we may not have considered yet.

4. Try to do it at once [WIP]

The MVP implementation is complete. With the lessons above in hand, why not attempt the full implementation in one pass?