Add work-in-progress implementation of a new Python parser#20856

Open
JukkaL wants to merge 194 commits into master from new-parser

Collaborator

@JukkaL JukkaL commented Feb 21, 2026

The new "native" parser (mypy.nativeparse) will eventually replace the current parser (mypy.fastparse). The native parser uses a Rust extension that wraps the Ruff parser to generate a serialized AST, and mypy will deserialize the AST directly into a mypy AST. The binary format is the same one we already use for mypy fixed-format incremental caches.

This is still work in progress and some features aren't supported. The most important missing feature is probably function type comments. Also, the Rust extension needs to be manually compiled from https://github.com/mypyc/ast_serialize. Refer to the ast_serialize repository for instructions. There is no CI support for the new parser right now -- there are tests, but they are skipped unless the ast_serialize extension is installed, and it isn't installed in CI right now.

Once the Rust extension is installed, use --native-parser to enable the new parser. The main type checker test suite can be run using the native parser via TEST_NATIVE_PARSER=1 pytest mypy/test/testcheck.py (the TEST_NATIVE_PARSER environment variable needs to be set). A number of tests are still failing.
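For readers unfamiliar with this kind of gating, here is a minimal sketch of how "skip unless the extension is installed" and "only use the native parser when the environment variable is set" could be checked. It is not the code from this PR: `native_parser_available` and `use_native_parser_in_tests` are hypothetical helper names, and treating any non-empty value of `TEST_NATIVE_PARSER` as "set" is an assumption. Only the module name `ast_serialize` and the variable name come from the description above.

```python
import importlib.util
import os
from collections.abc import Mapping
from typing import Optional


def native_parser_available() -> bool:
    """Return True if the ast_serialize extension module can be found."""
    return importlib.util.find_spec("ast_serialize") is not None


def use_native_parser_in_tests(env: Optional[Mapping[str, str]] = None) -> bool:
    """Hypothetical helper: report whether TEST_NATIVE_PARSER is set.

    Assumption: any non-empty value counts as "set".
    """
    if env is None:
        env = os.environ
    return bool(env.get("TEST_NATIVE_PARSER"))
```

A test runner could combine the two checks and skip the native parser tests when either one is false.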

Related issue with more context: #19776

Remaining work is tracked here for now: https://github.com/mypyc/ast_serialize/issues

Here are the expected benefits over the old mypy parser, adapted from the docstring of mypy/nativeparse.py:

  • No intermediate non-mypyc Python-level AST created, to improve performance
  • Parsing doesn't need GIL => can use multithreading to construct serialized ASTs in parallel
  • Produce import dependencies without having to build an AST => helps parallel type checking
  • Support all Python syntax even if mypy is running on an older Python version
  • Generate an AST even if there are syntax errors
  • Potential to support incremental parsing (quickly process modified sections in a file)
  • Stripping function bodies in third-party code can happen earlier, for extra performance
  • We have the option to easily add support for # mypy: ignore comments

Most of the code is straightforward and repetitive deserialization code. I relied heavily on a coding agent to implement deserialization and to add tests. The tests are separate from the pre-existing parser tests, but we can unify them later (or delete the old tests once we delete the old parser).

@ilevkivskyi contributed to this PR.

JukkaL and others added 27 commits February 15, 2026 14:08
…RSER is set

Example: `TEST_NATIVE_PARSER=1 pytest mypy/test/testcheck.py`.
This is the mypy counter-part of
mypyc/ast_serialize#12. Depends on that PR to
work.
This is the mypy counter-part of
mypyc/ast_serialize#13

(I am not actually using the new flag yet in `build.py`, I will do this
later when the branch is in master)
@JukkaL JukkaL requested a review from ilevkivskyi February 21, 2026 13:23
@github-actions
Contributor

According to mypy_primer, this change doesn't affect type check results on a corpus of open source code. ✅

Member

@ilevkivskyi ilevkivskyi left a comment

LG, thanks! Here are some comments, these are mostly minor. If you want to, you can address them in a follow-up PR (but then please don't forget to, because I will).

import os
from typing import Any, Final, cast

import ast_serialize # type: ignore[import-untyped, import-not-found, unused-ignore]

import-untyped should not be needed anymore, we now ship the stub in latest ast_serialize.

class State:
    def __init__(self, options: Options) -> None:
        self.options = options
        self.errors: list[dict[str, Any]] = []

I think it is better to use a TypedDict here.
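A minimal sketch of what this suggestion could look like. The name `ParseErrorDict` and its fields (`line`, `column`, `message`, `code`) are assumptions for illustration, not the actual shape of the errors in this PR, and `State` is simplified here to omit `options`.

```python
from typing import TypedDict


class ParseErrorDict(TypedDict):
    # Hypothetical fields; the real error dict may carry different keys.
    line: int
    column: int
    message: str
    code: str


class State:
    """Simplified stand-in for the State class under review."""

    def __init__(self) -> None:
        # A TypedDict lets the type checker verify keys and value types,
        # which dict[str, Any] cannot do.
        self.errors: list[ParseErrorDict] = []

    def add_error(self, line: int, column: int, message: str, code: str = "misc") -> None:
        self.errors.append(
            {"line": line, "column": column, "message": message, "code": code}
        )
```

At runtime a TypedDict is an ordinary dict, so code that consumes the errors would not need to change.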

1 -> An IfStmt if the reachability of it can't be inferred,
i.e. the truth value is unknown.
"""
infer_reachability_of_if_statement(stmt, options)

This looks like doing double-work, we already infer reachability of if-blocks in ast_serialize, right? Or am I missing something?


def native_parse(
    filename: str, options: Options, skip_function_bodies: bool = False
) -> tuple[MypyFile, list[dict[str, Any]], TypeIgnores]:

Same as above, we should return a TypedDict (or maybe even a trivial instance, like ParseError).
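As a rough illustration of the "trivial instance" alternative, a small frozen dataclass could carry the same information. The class body below is a sketch under assumptions: the field names and the `"misc"` default are hypothetical and not taken from the PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParseError:
    # Hypothetical fields for illustration only.
    line: int
    column: int
    message: str
    code: str = "misc"


def errors_from_dicts(raw_errors: list[dict]) -> list[ParseError]:
    """Convert loosely typed error dicts into ParseError instances."""
    return [ParseError(**raw) for raw in raw_errors]
```

Compared with a TypedDict, an instance gives attribute access (`err.line`) and immutability, at the cost of allocating an object per error.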


Returns:
A tuple containing:
- MypyFile: The parsed AST as a mypy AST node

Explain which attributes the caller should set manually (I see the caller in parse.py adds ignored_lines and is_stub).

code="misc",
)

# Process keyword arguments

Again, multiple pointless comments here and below.

bin_ops: Final = ["+", "-", "*", "@", "/", "%", "**", "<<", ">>", "|", "^", "&", "//"]
bool_ops: Final = ["and", "or"]
cmp_ops: Final = ["==", "!=", "<", "<=", ">", ">=", "is", "is not", "in", "not in"]
unary_ops: Final = ["~", "not", "+", "-"]

Mention that order of these must be kept in sync with ast_serialize.
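To illustrate why the order matters, here is a minimal sketch that assumes operators are serialized as integer indexes into these lists. That encoding is an assumption inferred from the comment, and `decode_bin_op` is a hypothetical helper, not a function from the PR.

```python
from typing import Final

# NOTE: under the assumed encoding, the order must match the serializer's.
bin_ops: Final = ["+", "-", "*", "@", "/", "%", "**", "<<", ">>", "|", "^", "&", "//"]


def decode_bin_op(index: int) -> str:
    """Map an operator index read from the serialized AST to its symbol."""
    return bin_ops[index]
```

If either side reordered its list without the other, an index written for one operator would silently decode as a different one, which is why a comment next to the lists is worth having.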

        read_loc(data, expr)
        expect_end_tag(data)
        return expr
    elif tag == nodes.BIG_INT_EXPR:

Why do we need both INT_EXPR and BIG_INT_EXPR? Can we simplify this?

        read_loc(data, expr)
        expect_end_tag(data)
        return expr
    elif tag == nodes.NAMED_EXPR:

This tag name is easy to confuse with NAME_EXPR, it may be better to rename it to ASSIGNMENT_EXPR.

def read_expression(state: State, data: ReadBuffer) -> Expression:
    tag = read_tag(data)
    expr: Expression
    if tag == nodes.CALL_EXPR:

It may be beneficial to manually order branches here in terms of how "hot" they are (probably also for statements and/or types), unless you already did this. I did this kind of "manual PGO" for types (by looking at how many instances we create for each during mypy self-check) to help the compiler.
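One way to gather the data for such an ordering is to record which tags are read during a representative run and rank them by frequency. The sketch below assumes the tags have been collected into a list; `rank_tags_by_frequency` is a hypothetical helper and the sample tags are illustrative, not measurements from mypy.

```python
from collections import Counter


def rank_tags_by_frequency(observed_tags: list[str]) -> list[str]:
    """Return distinct tags ordered from most to least frequently observed."""
    return [tag for tag, _count in Counter(observed_tags).most_common()]
```

The resulting order could then guide which `if`/`elif` branches are placed first in functions such as `read_expression`.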
