GH-49340: [R] Preserve row order in write_dataset() #49343

Open
marberts wants to merge 5 commits into apache:main from marberts:preserve_order

@marberts marberts commented Feb 20, 2026

Rationale for this change

write_dataset(df) does not necessarily preserve the row order of df when writing across partitions. The Arrow C++ library was recently updated (since 21.0.0) so that row order can be preserved when writing across partitions. This is useful when code relies on the row order being unchanged within each partition.

df <- tibble::tibble(x = 1:1.5e6, g = rep(1:15, each = 1e5))

df |>
  dplyr::group_by(g) |>
  arrow::write_dataset("test1", preserve_order = FALSE)

df |>
  dplyr::group_by(g) |>
  arrow::write_dataset("test2", preserve_order = TRUE)

test1 <- arrow::open_dataset("test1") |>
  dplyr::collect()

test2 <- arrow::open_dataset("test2") |>
  dplyr::collect()

# Current behavior.
all.equal(test1 |> sort_by(~ g), df)
#> [1] "Component \"x\": Mean relative difference: 0.0475804"

# Preserve order.
all.equal(test2 |> sort_by(~ g), df)
#> [1] TRUE

Created on 2026-02-20 with reprex v2.1.1

What changes are included in this PR?

Added an argument preserve_order to write_dataset() that sets FileSystemDatasetWriteOptions.preserve_order to true in the call to ExecPlan_Write().

Are these changes tested?

Partially. The change is small, so I haven't written unit tests. I can revisit this if necessary.

Are there any user-facing changes?

Yes, there is a new argument in write_dataset(). The default keeps the current behavior and the argument appears after all existing arguments, so the change is backwards compatible.

@github-actions

⚠️ GitHub issue #49340 has been automatically assigned in GitHub to PR creator.

Member

@jonkeane jonkeane left a comment

Thanks for the contribution!

Would you mind please writing some tests for this behavior? Somewhere in https://github.com/apache/arrow/blob/main/r/tests/testthat/test-dataset-write.R (+ following similar patterns there) would be lovely.
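For illustration, a test along these lines might look like the sketch below. It is not the test that was added to the PR. It assumes arrow is built from this branch so that write_dataset() has the proposed preserve_order argument (the test skips otherwise), it reuses the data shape from the reprex above, and order_preserved() is a hypothetical helper rather than anything in arrow or testthat.

```r
# Hypothetical helper (not part of arrow): TRUE when `value` comes back in the
# input's order once both frames are stably sorted by the partition column.
order_preserved <- function(out, df, by = "g", value = "x") {
  # order() leaves ties in their original order, so rows within one
  # partition keep the order in which they were read back.
  identical(out[[value]][order(out[[by]])], df[[value]][order(df[[by]])])
}

# Only run the sketch when the packages it needs are installed.
have_pkgs <- all(vapply(
  c("arrow", "dplyr", "testthat"), requireNamespace, logical(1), quietly = TRUE
))

if (have_pkgs) {
  testthat::test_that("preserve_order = TRUE keeps row order within partitions", {
    # Assumption: arrow is built from this PR's branch, so write_dataset()
    # has the proposed preserve_order argument; skip otherwise.
    testthat::skip_if_not(
      "preserve_order" %in% names(formals(arrow::write_dataset))
    )

    df <- data.frame(x = 1:1.5e6, g = rep(1:15, each = 1e5))
    dst <- tempfile()

    arrow::write_dataset(df, dst, partitioning = "g", preserve_order = TRUE)
    out <- dplyr::collect(arrow::open_dataset(dst))

    testthat::expect_true(order_preserved(out, df))
    unlink(dst, recursive = TRUE)
  })
}
```

Like the reprex, the check sorts by the partition column with a stable sort, so it only tests row order within each partition and does not depend on the order in which partitions are read back.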

Member

@jonkeane jonkeane left a comment

Thanks for the tests, I have some suggestions about naming + slightly more idiomatic expectations.

It also looks like there are some C++ linting issues: https://github.com/apache/arrow/actions/runs/22290480080/job/64535896409?pr=49343#step:6:42
