[fix](search) fix MATCH_ALL_DOCS losing occur attribute in multi-field expansion #60873

Merged
airborne12 merged 1 commit into apache:master from airborne12:worktree-24561 on Feb 27, 2026

@airborne12
Member

What problem does this PR solve?

Issue Number: close #DORIS-24561

Problem Summary:

In lucene mode with multi-field queries (e.g., best_fields with fields: ["title", "content"]), the query "Lauren Boebert" OR * returns results inconsistent with Elasticsearch.

Root cause: During multi-field expansion (expandNodeCrossFields, deepCopyWithField, setFieldOnLeaves), MATCH_ALL_DOCS nodes are recreated without preserving the occur attribute (e.g., SHOULD). The BE treats a null occur as MUST, which changes the query semantics:

  • Expected (ES behavior): SHOULD(phrase) OR SHOULD(match_all) = all documents
  • Actual (bug): SHOULD(phrase) AND MUST(match_all) = only phrase-matching documents
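
The two bullets above can be reproduced with a toy model. This is a minimal sketch that reads the bullets literally: SHOULD clauses are unioned, MUST clauses are intersected with that union, and a missing occur is defaulted to MUST. The `Occur` enum, `effective_occur`, `evaluate`, and the document ids are hypothetical names chosen for illustration; this is not the Doris BE or Lucene evaluation code.

```python
from enum import Enum
from typing import Iterable, List, Optional, Set, Tuple


class Occur(Enum):
    MUST = "MUST"
    SHOULD = "SHOULD"


def effective_occur(occur: Optional[Occur]) -> Occur:
    # Assumption taken from the PR description: a missing occur becomes MUST.
    return Occur.MUST if occur is None else occur


def evaluate(clauses: List[Tuple[Optional[Occur], Set[int]]],
             all_docs: Iterable[int]) -> Set[int]:
    """Toy evaluation: union the SHOULD clauses, then intersect with every MUST clause."""
    must = [docs for occur, docs in clauses if effective_occur(occur) is Occur.MUST]
    should = [docs for occur, docs in clauses if effective_occur(occur) is Occur.SHOULD]
    result = set(all_docs)
    if should:
        result &= set().union(*should)
    for docs in must:
        result &= docs
    return result


ALL_DOCS = {1, 2, 3, 4, 5}   # hypothetical index of five documents
PHRASE_DOCS = {2, 4}         # hypothetical documents matching the phrase

# occur preserved on the match-all clause: SHOULD(phrase) OR SHOULD(match_all)
expected = evaluate([(Occur.SHOULD, PHRASE_DOCS), (Occur.SHOULD, ALL_DOCS)], ALL_DOCS)

# occur lost on the match-all clause: it is defaulted to MUST
buggy = evaluate([(Occur.SHOULD, PHRASE_DOCS), (None, ALL_DOCS)], ALL_DOCS)
```

In this model `expected` covers every document, while `buggy` shrinks to the phrase-matching documents only, which is the difference the PR describes.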

ES explain for "Lauren Boebert" OR *:

(title:"lauren boebert" | content:"lauren boebert") (ConstantScore(FieldExistsQuery [field=content]) | ConstantScore(FieldExistsQuery [field=title]))

ES returns 1,000,000 docs (all of them). Doris was returning only the phrase-matching docs.

Fix: Preserve the occur attribute when creating new MATCH_ALL_DOCS nodes in all three multi-field expansion methods.
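
The shape of the fix can be sketched as follows. The actual change lives in the Doris FE methods named above; the `Node` class, its attributes, and `deep_copy_with_field` below are hypothetical stand-ins written in Python to show the idea, and the `preserve_occur` flag exists only so the old and new behavior can be compared side by side.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    # Hypothetical query-tree node; not the real Doris class.
    type: str                          # e.g. "OR", "PHRASE", "MATCH_ALL_DOCS"
    field_name: Optional[str] = None
    value: Optional[str] = None
    occur: Optional[str] = None        # e.g. "SHOULD"; None means unset
    children: List["Node"] = field(default_factory=list)


def deep_copy_with_field(node: Node, field_name: str, preserve_occur: bool) -> Node:
    """Copy a query tree, binding leaves to field_name."""
    if node.type == "MATCH_ALL_DOCS":
        # Before the fix the new node was created without occur;
        # after the fix the original occur is carried over.
        return Node(type="MATCH_ALL_DOCS",
                    occur=node.occur if preserve_occur else None)
    if node.children:
        return Node(type=node.type, occur=node.occur,
                    children=[deep_copy_with_field(c, field_name, preserve_occur)
                              for c in node.children])
    return Node(type=node.type, field_name=field_name,
                value=node.value, occur=node.occur)


query = Node(type="OR", children=[
    Node(type="PHRASE", value="Lauren Boebert", occur="SHOULD"),
    Node(type="MATCH_ALL_DOCS", occur="SHOULD"),
])

before_fix = deep_copy_with_field(query, "title", preserve_occur=False)
after_fix = deep_copy_with_field(query, "title", preserve_occur=True)
```

Here `before_fix.children[1].occur` is `None`, which a consumer would default to MUST, while `after_fix.children[1].occur` stays `"SHOULD"`.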

Release note

Fix search() function returning incorrect results for lucene mode queries containing standalone wildcard * with multi-field expansion (e.g., "phrase" OR * with best_fields).

Check List (For Author)

  • Test

    • Unit Test
    • Regression test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • Yes. Multi-field lucene mode queries with standalone * (MATCH_ALL_DOCS) now correctly preserve the occur attribute during cross-field expansion, matching ES behavior.
  • Does this need documentation?

    • No.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

…field expansion

In lucene mode with multi-field queries like '"phrase" OR *', the standalone
wildcard '*' is parsed as MATCH_ALL_DOCS with occur=SHOULD. However, during
the cross-field expansion (expandNodeCrossFields, deepCopyWithField,
setFieldOnLeaves), new MATCH_ALL_DOCS nodes were created without preserving
the occur attribute. This caused the BE to default to MUST, changing the
query semantics from "phrase OR match_all = all docs" to "phrase AND
match_all = only phrase matches", producing results inconsistent with ES.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Thearas
Contributor

Thearas commented Feb 27, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@airborne12
Member Author

run buildall

@doris-robot

TPC-H: Total hot run time: 28527 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit e707f4177bc5693eeb0b3b8aabb8b3f4d32cbecb, data reload: false

------ Round 1 ----------------------------------
============================================
q1	17612	4525	4255	4255
q2
q3	10655	784	511	511
q4	4679	361	256	256
q5	7551	1192	1039	1039
q6	177	174	145	145
q7	772	869	676	676
q8	9289	1473	1378	1378
q9	4851	4472	4712	4472
q10	6865	1892	1641	1641
q11	445	267	252	252
q12	768	571	470	470
q13	17779	4219	3403	3403
q14	230	220	211	211
q15	940	802	787	787
q16	754	724	673	673
q17	706	846	450	450
q18	6237	5398	5237	5237
q19	1132	1003	632	632
q20	513	491	393	393
q21	4539	1885	1408	1408
q22	335	285	238	238
Total cold run time: 96829 ms
Total hot run time: 28527 ms
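
The Round 1 totals can be cross-checked against the rows. The report does not label its columns, so this sketch assumes they are: query, cold run, first hot run, second hot run, and the faster of the two hot runs. The q2 row carries no timings and is left out.

```python
# Round 1 rows copied from the report above (q2 has no timings).
ROUND_1 = """\
q1 17612 4525 4255 4255
q3 10655 784 511 511
q4 4679 361 256 256
q5 7551 1192 1039 1039
q6 177 174 145 145
q7 772 869 676 676
q8 9289 1473 1378 1378
q9 4851 4472 4712 4472
q10 6865 1892 1641 1641
q11 445 267 252 252
q12 768 571 470 470
q13 17779 4219 3403 3403
q14 230 220 211 211
q15 940 802 787 787
q16 754 724 673 673
q17 706 846 450 450
q18 6237 5398 5237 5237
q19 1132 1003 632 632
q20 513 491 393 393
q21 4539 1885 1408 1408
q22 335 285 238 238
"""

rows = [line.split() for line in ROUND_1.splitlines()]

# Assumed column meaning: r[1] = cold run, r[2] and r[3] = hot runs.
cold_total = sum(int(r[1]) for r in rows)
hot_total = sum(min(int(r[2]), int(r[3])) for r in rows)

# True if the last column is the faster hot run in every row.
last_column_is_min = all(int(r[4]) == min(int(r[2]), int(r[3])) for r in rows)
```

Under that reading, `cold_total` and `hot_total` reproduce the reported 96829 ms and 28527 ms.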

----- Round 2, with runtime_filter_mode=off -----
============================================
q1	4420	4352	4349	4349
q2
q3	1762	2200	1748	1748
q4	866	1167	784	784
q5	4004	4270	4337	4270
q6	182	174	143	143
q7	1724	1590	1492	1492
q8	2429	2625	2531	2531
q9	7355	7441	7501	7441
q10	2657	2857	2399	2399
q11	494	433	414	414
q12	514	605	454	454
q13	3993	4425	3635	3635
q14	309	357	274	274
q15	878	830	835	830
q16	714	755	728	728
q17	1208	1567	1294	1294
q18	7115	6811	6615	6615
q19	972	965	964	964
q20	2133	2210	1991	1991
q21	3964	3679	3338	3338
q22	485	465	391	391
Total cold run time: 48178 ms
Total hot run time: 46085 ms

@doris-robot

TPC-DS: Total hot run time: 183734 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit e707f4177bc5693eeb0b3b8aabb8b3f4d32cbecb, data reload: false

query5	5121	643	524	524
query6	330	216	217	216
query7	4223	473	280	280
query8	338	247	231	231
query9	8738	2763	2780	2763
query10	551	370	354	354
query11	17001	17470	17062	17062
query12	196	130	125	125
query13	1292	491	371	371
query14	7301	3379	3242	3242
query14_1	3074	3020	2986	2986
query15	215	204	191	191
query16	1063	486	480	480
query17	1150	779	732	732
query18	3753	504	383	383
query19	240	231	197	197
query20	151	138	135	135
query21	225	158	137	137
query22	5648	5086	4681	4681
query23	17313	16771	16599	16599
query23_1	16708	16787	16602	16602
query24	7121	1629	1266	1266
query24_1	1243	1247	1240	1240
query25	559	503	395	395
query26	1219	256	152	152
query27	2773	464	289	289
query28	4452	1838	1861	1838
query29	809	558	472	472
query30	312	247	207	207
query31	856	726	647	647
query32	78	75	73	73
query33	516	344	286	286
query34	915	900	588	588
query35	649	675	592	592
query36	1070	1149	1025	1025
query37	133	94	88	88
query38	2951	2903	2919	2903
query39	1003	874	960	874
query39_1	814	823	839	823
query40	234	153	134	134
query41	60	65	58	58
query42	103	101	101	101
query43	379	374	346	346
query44	
query45	206	187	181	181
query46	871	990	612	612
query47	2109	2127	2022	2022
query48	307	321	226	226
query49	628	469	376	376
query50	716	280	208	208
query51	4154	4081	4063	4063
query52	101	106	97	97
query53	293	336	282	282
query54	298	263	273	263
query55	94	85	82	82
query56	315	320	318	318
query57	1379	1337	1255	1255
query58	287	276	262	262
query59	2542	2700	2460	2460
query60	326	335	317	317
query61	151	148	148	148
query62	629	585	554	554
query63	323	290	278	278
query64	4819	1266	991	991
query65	
query66	1378	456	360	360
query67	16259	16335	16221	16221
query68	
query69	396	312	283	283
query70	1019	924	951	924
query71	341	303	307	303
query72	2936	2690	2265	2265
query73	539	541	320	320
query74	10011	9883	9795	9795
query75	2863	2754	2468	2468
query76	2310	1053	678	678
query77	351	408	319	319
query78	11175	11293	10729	10729
query79	3352	788	588	588
query80	1779	625	570	570
query81	583	280	248	248
query82	985	152	114	114
query83	343	265	248	248
query84	259	126	101	101
query85	930	493	434	434
query86	491	300	302	300
query87	3142	3094	3018	3018
query88	3613	2673	2661	2661
query89	429	368	349	349
query90	1940	176	172	172
query91	160	159	131	131
query92	78	78	69	69
query93	1765	850	517	517
query94	646	320	286	286
query95	573	397	317	317
query96	637	521	229	229
query97	2485	2503	2411	2411
query98	230	224	220	220
query99	963	999	920	920
Total cold run time: 260157 ms
Total hot run time: 183734 ms

@github-actions
Contributor

PR approved by at least one committer and no changes requested.

github-actions bot added the "approved" label (Indicates a PR has been approved by one committer.) on Feb 27, 2026
@github-actions
Contributor

PR approved by anyone and no changes requested.

Member

@eldenmoon eldenmoon left a comment

LGTM

airborne12 merged commit a4091ad into apache:master on Feb 27, 2026
29 of 31 checks passed
airborne12 deleted the worktree-24561 branch on February 27, 2026, 15:05

Labels

approved (Indicates a PR has been approved by one committer.), reviewed

5 participants