[SPARK-55600][PYTHON] Fix pandas to arrow loses row count when schema has 0 columns on classic #54382

Open

Yicong-Huang wants to merge 4 commits into apache:master from Yicong-Huang:SPARK-55600/fix/pandas-arrow-zero-columns-row-count

Conversation

Yicong-Huang (Contributor) commented Feb 19, 2026

What changes were proposed in this pull request?

This PR fixes a row-count loss issue when creating a Spark DataFrame from a pandas DataFrame with 0 columns on Classic Spark.

The issue stems from a PyArrow limitation: when a RecordBatch or Table is created with 0 columns, the row count is inferred from the column arrays, so with no columns the row count information is lost.

Why are the changes needed?

Before this fix:

from pyspark.sql import SparkSession
import pandas as pd
from pyspark.sql.types import StructType

spark = SparkSession.builder.getOrCreate()

pdf = pd.DataFrame(index=range(5))  # 5 rows, 0 columns
df = spark.createDataFrame(pdf, schema=StructType([]))
df.count()  # Returns 0 (wrong!)

After this fix:

df.count()  # Returns 5 (correct!)

Does this PR introduce any user-facing change?

Yes. Creating a DataFrame from a pandas DataFrame with 0 columns now correctly preserves the row count in Classic Spark.

How was this patch tested?

Added unit test test_from_pandas_dataframe_with_zero_columns in test_creation.py that tests both Arrow-enabled and Arrow-disabled paths.

Was this patch authored or co-authored using generative AI tooling?

No

holdenk (Contributor) commented Feb 23, 2026

CC @devin-petersohn: if you've got the time, a quick review would be appreciated. I don't really know the semantics of a 0-column DataFrame.

devin-petersohn (Contributor) left a comment


@Yicong-Huang Arrow batches keep pandas metadata around, would that be a better, more efficient way of keeping track of the original length?

In [3]: pa.RecordBatch.from_pandas(pd.DataFrame(index=range(5))).schema.pandas_metadata
Out[3]: 
{'index_columns': [{'kind': 'range',
   'name': None,
   'start': 0,
   'stop': 5,
   'step': 1}],
 'column_indexes': [{'name': None,
   'field_name': None,
   'pandas_type': 'int64',
   'numpy_type': 'int64',
   'metadata': None}],
 'columns': [],
 'creator': {'library': 'pyarrow', 'version': '18.1.0'},
 'pandas_version': '2.2.3'}

Yicong-Huang (Contributor, Author) replied:

> @Yicong-Huang Arrow batches keep pandas metadata around, would that be a better, more efficient way of keeping track of the original length?

Thanks @devin-petersohn for the suggestion. I tested locally, and for this edge case using the metadata is indeed much more efficient. Changes pushed.

devin-petersohn (Contributor) left a comment


Thanks for the changes, LGTM

Yicong-Huang force-pushed the SPARK-55600/fix/pandas-arrow-zero-columns-row-count branch from 634bdda to af89044 on February 23, 2026 at 22:33.