[feat](Variant) Support NestedGroup for variant columns with search DSL NESTED clause#60847
[feat](Variant) Support NestedGroup for variant columns with search DSL NESTED clause#60847eldenmoon wants to merge 12 commits intoapache:masterfrom
Conversation
Add protobuf definitions for NestedGroup storage support: - NestedGroupInfoPB for nested group statistics - NestedOffsetsIndexPB for nested offsets index - Fields in ColumnPathInfo for nested group metadata - NESTED_OFFSETS_INDEX enum value Add thrift definition for NESTED search clause: - NESTED clause type in TSearchClause - nested_path field for NESTED clause
Introduce the NestedGroup framework for tracking nested array structures in variant columns: - nested_group_path.h: Path utilities for nested group identification - nested_group_provider.h/cpp: Provider interface with default no-op impl - nested_offsets_mapping_index.h/cpp: Offsets mapping index for nested arrays - offset_manager.h/offset_manager_impl.h: Offset tracking utilities Tests: - nested_group_path_test.cpp: Path utility tests - nested_group_path_filter_test.cpp: Path filter tests - nested_group_provider_test.cpp: Provider contract tests
Add NESTED clause to the search DSL for querying nested array structures within variant columns. FE changes: - SearchLexer/Parser.g4: Add NESTED token and grammar rules - SearchDslParser.java: Parse NESTED clause with path specification - SearchPredicate.java: Support NESTED clause type BE changes: - vsearch.cpp: Evaluate NESTED clause in search expressions - function_search.cpp/h: Implement NESTED function evaluation Tests: - SearchDslParserTest.java: FE parser tests - function_search_nested_test.cpp: BE NESTED function tests - vsearch_expr_test.cpp: Search expression tests
Improve FE rewrite rules for variant sub-path pruning and nested column pruning to better support variant column access patterns. FE changes: - AccessPathExpressionCollector.java: New collector for access paths - AccessPathPlanCollector.java: Enhanced plan collection - NestedColumnPruning.java: Refactored nested column pruning logic - VariantSubPathPruning.java: Enhanced sub-path pruning - ExpressionAnalyzer.java: Expression analysis updates - Explode/ExplodeOuter.java: NestedGroup array support BE changes: - vexpr.cpp/h: Expression node enhancements - vexpr_context.h: Context changes for pruning support Tests: - PruneNestedColumnTest.java: Nested column pruning tests - VariantPruningLogicTest.java: Variant pruning logic tests
Integrate NestedGroup support into the variant column read/write path and related utility components. Core reader/writer: - variant_column_reader.cpp/h: NestedGroup-aware read path - variant_column_writer_impl.cpp/h: NestedGroup-aware write path - variant_statistics.h: Statistics tracking for nested groups - column_reader.cpp/h: Column reader NestedGroup support - segment_iterator.cpp: Iterator integration - segment_creator.cpp: Creator integration Utilities: - path_in_data.cpp/h: Path utilities for nested structures - variant_util.cpp/h: Variant utility enhancements - json_parser.cpp: JSON parser improvements - column_variant.cpp/h: Column variant changes - config.cpp/h: variant_nested_group_max_depth config - exec_env.h: Query cache setter Tests & regression: - Modified existing tests for compatibility - 4 new regression test suites for variant pruning
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
3271773 to
f9c0e67
Compare
There was a problem hiding this comment.
Pull request overview
This PR introduces NestedGroup-related infrastructure for VARIANT (multi-level array/object nesting) and adds a NESTED(path, query) clause to the search DSL, alongside expanded access-path/sub-path pruning support and coverage tests across FE/BE and regression suites.
Changes:
- Extend FE search DSL parsing/AST + thrift explain to support a top-level
NESTED()clause with anested_path. - Add NestedGroup metadata/index plumbing (proto, iterators, mapping index) and related BE execution support for nested search evaluation.
- Expand VARIANT access-path/sub-path pruning support (including numeric array indices) and add regression + unit tests.
Reviewed changes
Copilot reviewed 73 out of 74 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| regression-test/suites/variant_p0/test_variant_mixed_object_array_pruning_debugpoint.groovy | New regression coverage for mixed object/array pruning with debugpoints |
| regression-test/suites/variant_p0/test_variant_compaction_root_empty_after_full.groovy | New regression to ensure compaction doesn’t empty VARIANT root |
| regression-test/suites/variant_p0/test_variant_access_path_pruning_debugpoint.groovy | New regression coverage for access path pruning via debugpoints |
| regression-test/suites/variant_p0/test_sub_path_pruning.groovy | Adds explode-related pruning assertions/explain checks |
| gensrc/thrift/Exprs.thrift | Adds NESTED clause support via nested_path in thrift |
| gensrc/proto/segment_v2.proto | Adds NestedGroup-related metadata and a nested offsets index proto |
| fe/fe-core/src/test/java/org/apache/doris/nereids/trees/expressions/functions/scalar/SearchDslParserTest.java | Adds FE unit tests for NESTED() parsing/validation rules |
| fe/fe-core/src/test/java/org/apache/doris/nereids/rules/rewrite/VariantPruningLogicTest.java | New FE pruning logic tests for VARIANT access paths/subcolumns |
| fe/fe-core/src/test/java/org/apache/doris/nereids/rules/rewrite/PruneNestedColumnTest.java | Extends pruning tests to cover VARIANT access path propagation |
| fe/fe-core/src/main/java/org/apache/doris/nereids/trees/expressions/functions/scalar/SearchDslParser.java | Implements NESTED() parsing, AST fields, and top-level validation |
| fe/fe-core/src/main/java/org/apache/doris/nereids/trees/expressions/functions/generator/ExplodeOuter.java | Allows EXPLODE_OUTER to accept VARIANT inputs |
| fe/fe-core/src/main/java/org/apache/doris/nereids/trees/expressions/functions/generator/Explode.java | Allows EXPLODE to accept VARIANT inputs |
| fe/fe-core/src/main/java/org/apache/doris/nereids/rules/rewrite/VariantSubPathPruning.java | Supports integer-like literal indices in VARIANT subpaths |
| fe/fe-core/src/main/java/org/apache/doris/nereids/rules/rewrite/SlotTypeReplacer.java | Includes VARIANT in slot replacement/pruning decision |
| fe/fe-core/src/main/java/org/apache/doris/nereids/rules/rewrite/RewriteSearchToSlots.java | Minor cleanup around variant-subcolumn comment/handling |
| fe/fe-core/src/main/java/org/apache/doris/nereids/rules/rewrite/NestedColumnPruning.java | Extends pruning entry conditions & access-path handling for VARIANT |
| fe/fe-core/src/main/java/org/apache/doris/nereids/rules/rewrite/AccessPathPlanCollector.java | Collects access paths for VARIANT and adjusts generate handling |
| fe/fe-core/src/main/java/org/apache/doris/nereids/rules/rewrite/AccessPathExpressionCollector.java | Builds VARIANT access paths including numeric indices |
| fe/fe-core/src/main/java/org/apache/doris/nereids/rules/analysis/ExpressionAnalyzer.java | Marks VARIANT usage as “hasNestedColumns” for pruning pipeline |
| fe/fe-core/src/main/java/org/apache/doris/analysis/SearchPredicate.java | Includes nested_path in explain and thrift conversion |
| fe/fe-core/src/main/antlr4/org/apache/doris/nereids/search/SearchParser.g4 | Adds parser rule for nestedQuery |
| fe/fe-core/src/main/antlr4/org/apache/doris/nereids/search/SearchLexer.g4 | Adds lexer mode/tokens for nested path scanning |
| be/test/vec/function/function_search_test.cpp | Updates clause categorization/mapping for NESTED; moves tests out |
| be/test/vec/function/function_search_nested_test.cpp | New BE unit tests focused on NESTED clause behavior |
| be/test/vec/exprs/vsearch_expr_test.cpp | Updates inverted index context creation; adds NESTED fallback test |
| be/test/vec/exprs/vexpr_evalute_inverted_index_test.cpp | Updates IndexExecContext construction signature |
| be/test/vec/exprs/try_cast_expr_test.cpp | Adds include needed after context/header refactor |
| be/test/olap/rowset/segment_v2/segment_iterator_no_need_read_data_test.cpp | Updates test expectation for extracted VARIANT read-data behavior |
| be/test/olap/rowset/segment_v2/nested_group_provider_test.cpp | New tests for default NestedGroup provider contracts (CE behavior) |
| be/test/olap/rowset/segment_v2/nested_group_path_test.cpp | New tests for nested group path utilities/constants |
| be/test/olap/rowset/segment_v2/nested_group_path_filter_test.cpp | New tests for NestedGroup path filtering behavior |
| be/test/olap/rowset/segment_v2/external_col_meta_util_test.cpp | Removes non-English comment note |
| be/src/vec/runtime/vparquet_transformer.cpp | Fully-qualifies parquet types/exceptions to avoid ambiguity |
| be/src/vec/olap/olap_data_convertor.cpp | Adds convertor for unsigned bigint (offset/length internal cols) |
| be/src/vec/json/path_in_data.h | Adds array-depth helper + new path utility declarations |
| be/src/vec/json/path_in_data.cpp | Implements prefix stripping + path append helpers |
| be/src/vec/json/json_parser.cpp | Adjusts nested array handling & array alignment behavior |
| be/src/vec/functions/function_search.h | Adds nested-search APIs and forward declarations |
| be/src/vec/functions/function_search.cpp | Implements NESTED evaluation flow + execution context refactor |
| be/src/vec/exprs/vsearch.cpp | Passes IndexExecContext/segment info; improves VARIANT subcolumn binding |
| be/src/vec/exprs/vexpr_context.h | Extends IndexExecContext with segment pointer + iterator options |
| be/src/vec/exprs/vexpr.h | Moves contains_blockable_function definition out-of-header |
| be/src/vec/exprs/vexpr.cpp | Implements new contains_blockable_function; SEARCH_EXPR treated as slot-acting |
| be/src/vec/exec/scan/olap_scanner.cpp | Uses parent_unique_id fallback for access-path maps |
| be/src/vec/common/variant_util.h | Tracks presence of nested groups in variant extended info |
| be/src/vec/common/variant_util.cpp | Nullable-aware array base-type detection; compaction skips extended schema when nested groups exist |
| be/src/vec/columns/column_variant.h | Adds missing include for column vector usage |
| be/src/vec/columns/column_variant.cpp | Minor include cleanup/formatting |
| be/src/runtime/exec_env.h | Adds setter for inverted index query cache |
| be/src/olap/rowset/segment_v2/variant/variant_statistics.h | Adds NestedGroup stats, pb (de)serialization updates, renames doc-value map |
| be/src/olap/rowset/segment_v2/variant/variant_external_meta_reader.cpp | Removes non-English comment note |
| be/src/olap/rowset/segment_v2/variant/variant_column_writer_impl.h | Introduces NestedGroup write provider integration and stats merging hooks |
| be/src/olap/rowset/segment_v2/variant/variant_column_writer_impl.cpp | Wires NestedGroup provider into write pipeline; stats merge + writer refactors |
| be/src/olap/rowset/segment_v2/variant/variant_column_reader.h | Adds NestedGroup reader structs and planning hooks |
| be/src/olap/rowset/segment_v2/variant/offset_manager_impl.h | New shared implementation for offset padding/reading utilities |
| be/src/olap/rowset/segment_v2/variant/offset_manager.h | New offset manager interface for NestedGroup offset handling |
| be/src/olap/rowset/segment_v2/variant/nested_offsets_mapping_index.h | New mapping index interface + reader/writer declarations |
| be/src/olap/rowset/segment_v2/variant/nested_offsets_mapping_index.cpp | Implements nested offsets mapping index build/read/map logic |
| be/src/olap/rowset/segment_v2/variant/nested_group_provider.h | Defines NestedGroup provider interfaces, filters, and factories |
| be/src/olap/rowset/segment_v2/variant/nested_group_provider.cpp | Default CE providers (no-op/disabled) + default path matching |
| be/src/olap/rowset/segment_v2/variant/nested_group_path.h | New helpers/constants for nested-group path encoding |
| be/src/olap/rowset/segment_v2/variant/hierarchical_data_iterator.cpp | Skips NestedGroup subcolumns in hierarchical merge; adds debug checks |
| be/src/olap/rowset/segment_v2/segment_iterator.cpp | Uses full path for extracted column field_name; updates inverted-index type inference; passes segment/iter opts into IndexExecContext |
| be/src/olap/rowset/segment_v2/index_writer.cpp | Uses column path when constructing extracted-column field_name |
| be/src/olap/rowset/segment_v2/index_file_writer.cpp | Include tweak for atomic usage |
| be/src/olap/rowset/segment_v2/column_writer.h | Adds NestedGroup provider/stats members for variant writers |
| be/src/olap/rowset/segment_v2/column_reader_cache.cpp | Removes stray preprocessor character |
| be/src/olap/rowset/segment_v2/column_reader.h | Exposes lazy-loading accessors for zone-map and ordinal indexes |
| be/src/olap/rowset/segment_v2/column_reader.cpp | Implements lazy-loading accessors; recognizes new index type in init |
| be/src/olap/rowset/segment_creator.cpp | Minor header/include cleanup and param renames |
| be/src/common/config.h | Adds config knob for max nested group depth |
| be/src/common/config.cpp | Defines default for nested group max depth |
Comments suppressed due to low confidence (1)
be/src/vec/json/json_parser.cpp:35
json_parser.cppreferencesconfig::variant_max_json_key_lengthbut no longer includescommon/config.h, so this should fail to compile withconfigundefined. Add back the#include "common/config.h"(keep the IWYU pragma if needed).
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
be/src/olap/rowset/segment_v2/variant/nested_offsets_mapping_index.h
Outdated
Show resolved
Hide resolved
...c/main/java/org/apache/doris/nereids/trees/expressions/functions/scalar/SearchDslParser.java
Show resolved
Hide resolved
f9c0e67 to
c61965d
Compare
caa1183 to
bfef111
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 65 out of 65 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
...c/main/java/org/apache/doris/nereids/trees/expressions/functions/scalar/SearchDslParser.java
Show resolved
Hide resolved
| uint32_t idx = | ||
| static_cast<uint32_t>(atoi(rel_str.substr(bucket_prefix.size()).c_str())); |
There was a problem hiding this comment.
The cast to uint32_t from int could be unsafe if atoi returns a negative value or the parsed integer is larger than UINT32_MAX. Consider using std::stoul with error handling instead of atoi, or add validation to ensure the parsed value is within valid range.
| if (rel_str.find(DOC_VALUE_COLUMN_PATH) != std::string::npos) { | ||
| size_t bucket = rel_str.rfind('b'); | ||
| int bucket_value = std::stoi(rel_str.substr(bucket + 1)); | ||
| uint32_t bucket_value = static_cast<uint32_t>(std::stoi(rel_str.substr(bucket + 1))); |
There was a problem hiding this comment.
Similar to line 1040, this cast from stoi result to uint32_t is potentially unsafe. The stoi function can throw exceptions for invalid input and may return negative values. Consider using std::stoul with proper error handling or add validation checks.
| CHECK(column->size() == nrows) << "column->size()=" << column->size() << ", nrows=" << nrows | ||
| << ", path=" << node.path.get_path(); |
There was a problem hiding this comment.
The CHECK macro will cause the program to abort on failure. For production code, consider using a more graceful error handling approach that returns a Status error instead of aborting the process, unless data corruption is guaranteed and aborting is the safest option.
| CHECK(column->size() == nrows) << "column->size()=" << column->size() << ", nrows=" << nrows | |
| << ", path=" << node.path.get_path(); | |
| if (column->size() != nrows) { | |
| return Status::InternalError( | |
| "Column size {} not equal to nrows {} at path {}", | |
| column->size(), nrows, node.path.get_path()); | |
| } |
| const std::unordered_map<std::string, int>& field_name_to_column_id, | ||
| std::shared_ptr<roaring::Roaring>& result_bitmap) const { | ||
| (void)field_name_to_column_id; |
There was a problem hiding this comment.
The field_name_to_column_id parameter is marked with (void) to suppress unused warnings, but it's passed through the call chain. If this parameter is intended for future use, add a TODO comment explaining its purpose. If it's truly unused, consider removing it from the function signature to reduce complexity.
| const std::unordered_map<std::string, int>& field_name_to_column_id, | |
| std::shared_ptr<roaring::Roaring>& result_bitmap) const { | |
| (void)field_name_to_column_id; | |
| [[maybe_unused]] const std::unordered_map<std::string, int>& field_name_to_column_id, | |
| std::shared_ptr<roaring::Roaring>& result_bitmap) const { | |
| // TODO: Use field_name_to_column_id to map nested field names to column IDs | |
| // when evaluating nested queries, so that nested_path and its children | |
| // can be resolved consistently with the outer query. |
bfef111 to
d71cce4
Compare
|
run buildall |
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
TPC-H: Total hot run time: 28886 ms |
TPC-DS: Total hot run time: 183590 ms |
|
run buildall |
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
TPC-H: Total hot run time: 28840 ms |
TPC-DS: Total hot run time: 183630 ms |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
5e66e1a to
922ae43
Compare
|
run buildall |
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
TPC-H: Total hot run time: 28583 ms |
TPC-DS: Total hot run time: 184918 ms |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
skip check_coverage |
1 similar comment
|
skip check_coverage |
|
PR approved by at least one committer and no changes requested. |
|
PR approved by anyone and no changes requested. |
|
run buildall |
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
FE UT Coverage ReportIncrement line coverage |
NESTED clause
Introduce NestedGroup support for tracking nested array structures in variant columns, including storage layer definitions, core data structures, read/write path integration, FE pruning rule enhancements, and NESTED clause in search DSL.
IDL definitions:
Add NestedGroupInfoPB, NestedOffsetsIndexPB in segment_v2.proto
Add ColumnPathInfo fields for nested group metadata
Add NESTED clause type and nested_path field in Exprs.thrift
Core data structures (BE):
nested_group_path.h: Path utilities for nested group identification
nested_group_provider.h/cpp: Provider interface with default no-op implementation
Offsets mapping index and offset tracking utilities for nested arrays
Variant reader/writer integration (BE):
Integrate NestedGroup-aware read path in variant_column_reader
Integrate NestedGroup-aware write path in variant_column_writer_impl
Add segment_iterator and segment_creator integration
Enhance path_in_data, variant_util, json_parser for nested structures
Add variant_nested_group_max_depth config
Update variant_statistics for nested group tracking
FE pruning and rewrite rules:
Add AccessPathExpressionCollector for access path collection
Refactor NestedColumnPruning and VariantSubPathPruning logic
Enhance ExpressionAnalyzer and expression context for pruning support
Add NestedGroup array support in Explode/ExplodeOuter
Search DSL NESTED clause:
Add NESTED token and grammar rules in SearchLexer/Parser.g4
Implement NESTED clause parsing in SearchDslParser
Implement NESTED function evaluation in function_search and vsearch
Tests:
nested_group_path_test, nested_group_path_filter_test, nested_group_provider_test
function_search_nested_test, vsearch_expr_test updates
PruneNestedColumnTest, VariantPruningLogicTest, SearchDslParserTest (FE)
4 new regression test suites for variant pruning