Add work-in-progress implementation of a new Python parser #20856
…RSER is set. Example: `TEST_NATIVE_PARSER=1 pytest mypy/test/testcheck.py`.
This is the mypy counterpart of mypyc/ast_serialize#12. It depends on that PR to work.
This is the mypy counterpart of mypyc/ast_serialize#13. (I am not actually using the new flag in `build.py` yet; I will do this later, when the branch is in master.)
This is the mypy counterpart of mypyc/ast_serialize#17
This is the mypy counterpart of mypyc/ast_serialize#18
According to mypy_primer, this change doesn't affect type check results on a corpus of open source code. ✅
ilevkivskyi left a comment
LG, thanks! Here are some comments; they are mostly minor. If you want, you can address them in a follow-up PR (but then please don't forget to, because I will).
| import os | ||
| from typing import Any, Final, cast | ||
| import ast_serialize # type: ignore[import-untyped, import-not-found, unused-ignore] |
`import-untyped` should not be needed anymore; we now ship the stub in the latest ast_serialize.
| class State: | ||
| def __init__(self, options: Options) -> None: | ||
| self.options = options | ||
| self.errors: list[dict[str, Any]] = [] |
I think it is better to use a TypedDict here.
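A minimal sketch of what the TypedDict suggestion could look like. The class name `ParseErrorDict` and all of its keys are assumptions for illustration; the keys actually produced by the parser are not shown in this review and may differ.

```python
from typing import TypedDict


class ParseErrorDict(TypedDict):
    # Hypothetical keys: the real error dictionaries may carry different fields.
    line: int
    column: int
    message: str
    code: str


def make_error(line: int, column: int, message: str, code: str = "syntax") -> ParseErrorDict:
    """Build an error record whose shape a type checker can verify."""
    return {"line": line, "column": column, "message": message, "code": code}


# The annotation documents the expected keys, unlike list[dict[str, Any]].
errors: list[ParseErrorDict] = [make_error(3, 7, "invalid syntax")]
```

With this shape, a misspelled key or a value of the wrong type is reported by the type checker instead of surfacing at runtime.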
| 1 -> An IfStmt if the reachability of it can't be inferred, | ||
| i.e. the truth value is unknown. | ||
| """ | ||
| infer_reachability_of_if_statement(stmt, options) |
This looks like double work: we already infer reachability of if-blocks in ast_serialize, right? Or am I missing something?
| def native_parse( | ||
| filename: str, options: Options, skip_function_bodies: bool = False | ||
| ) -> tuple[MypyFile, list[dict[str, Any]], TypeIgnores]: |
Same as above, we should return a TypedDict (or maybe even a trivial instance, like ParseError).
| Returns: | ||
| A tuple containing: | ||
| - MypyFile: The parsed AST as a mypy AST node |
Explain which attributes the caller should set manually (I see the caller in parse.py adds ignored_lines and is_stub).
| code="misc", | ||
| ) | ||
| # Process keyword arguments |
Again, multiple pointless comments here and below.
| bin_ops: Final = ["+", "-", "*", "@", "/", "%", "**", "<<", ">>", "|", "^", "&", "//"] | ||
| bool_ops: Final = ["and", "or"] | ||
| cmp_ops: Final = ["==", "!=", "<", "<=", ">", ">=", "is", "is not", "in", "not in"] | ||
| unary_ops: Final = ["~", "not", "+", "-"] |
Mention that the order of these must be kept in sync with ast_serialize.
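To illustrate why the order matters, here is a sketch that assumes operators are serialized as integer indices into these lists. That encoding is an assumption inferred from the comment, and `decode_bin_op` is a hypothetical helper, not a function from the PR.

```python
from typing import Final

# NOTE: the order must match the operator numbering on the serializer side
# (assumed here to be a plain index into this list).
bin_ops: Final = ["+", "-", "*", "@", "/", "%", "**", "<<", ">>", "|", "^", "&", "//"]


def decode_bin_op(index: int) -> str:
    """Map a serialized operator index back to its source text."""
    if not 0 <= index < len(bin_ops):
        raise ValueError(f"unknown binary operator index: {index}")
    return bin_ops[index]
```

Under that assumption, reordering or inserting an entry on only one side would silently decode every later operator as a different one, which is why a comment next to the lists is worthwhile.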
| read_loc(data, expr) | ||
| expect_end_tag(data) | ||
| return expr | ||
| elif tag == nodes.BIG_INT_EXPR: |
Why do we need both INT_EXPR and BIG_INT_EXPR? Can we simplify this?
| read_loc(data, expr) | ||
| expect_end_tag(data) | ||
| return expr | ||
| elif tag == nodes.NAMED_EXPR: |
This tag name is easy to confuse with NAME_EXPR; it may be better to rename it to ASSIGNMENT_EXPR.
| def read_expression(state: State, data: ReadBuffer) -> Expression: | ||
| tag = read_tag(data) | ||
| expr: Expression | ||
| if tag == nodes.CALL_EXPR: |
It may be beneficial to manually order branches here in terms of how "hot" they are (probably also for statements and/or types), unless you already did this. I did this kind of "manual PGO" for types (by looking at how many instances we create for each during mypy self-check) to help the compiler.
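A rough sketch of how such a "manual PGO" ordering could be derived: count how often each tag is seen during a representative run, then list the tags from hottest to coldest. The function `rank_tags` and the sample data are hypothetical; the real measurement would come from instrumenting a run such as mypy self-check.

```python
from collections import Counter


def rank_tags(observed: list[str]) -> list[str]:
    """Return tag names ordered from most to least frequently observed."""
    counts = Counter(observed)
    return [tag for tag, _ in counts.most_common()]


# Made-up sample, only to show the shape of the result.
sample = ["NAME_EXPR", "CALL_EXPR", "NAME_EXPR", "INT_EXPR", "NAME_EXPR", "CALL_EXPR"]
branch_order = rank_tags(sample)
```

The resulting list suggests the order in which to write the `if`/`elif` branches, so that the most common tags are tested first.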
The new "native" parser (`mypy.nativeparse`) will eventually replace the current parser (`mypy.fastparse`). The native parser uses a Rust extension that wraps the Ruff parser to generate a serialized AST, and mypy deserializes it directly into a mypy AST. The binary format is the same one we already use for mypy fixed-format incremental caches.
This is still work in progress, and some features aren't supported. The most important missing feature is probably function type comments. Also, the Rust extension needs to be manually compiled from https://github.com/mypyc/ast_serialize. Refer to the `ast_serialize` repository for instructions. There is no CI support for the new parser right now: there are tests, but they are skipped unless the `ast_serialize` extension is installed, and it isn't installed in CI.
Once the Rust extension is installed, use `--native-parser` to enable the new parser. The main type checker test suite can be run using the native parser via `TEST_NATIVE_PARSER=1 pytest mypy/test/testcheck.py` (the `TEST_NATIVE_PARSER` environment variable needs to be set). A bunch of tests are still failing.
Related issue with more context: #19776
Remaining work is tracked here for now: https://github.com/mypyc/ast_serialize/issues
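The opt-in behaviour described above (tests run only when `TEST_NATIVE_PARSER` is set and the `ast_serialize` extension is importable) could be checked roughly as follows. This is a sketch, not the PR's actual test code: `native_parser_tests_enabled` is a hypothetical helper, and treating any non-empty value of the variable as "set" is an assumption.

```python
from __future__ import annotations

import importlib.util
import os
from collections.abc import Mapping


def native_parser_tests_enabled(environ: Mapping[str, str] | None = None) -> bool:
    """Return True if native-parser tests should run in this environment."""
    env = os.environ if environ is None else environ
    # Assumption: any non-empty value of TEST_NATIVE_PARSER counts as "set".
    if not env.get("TEST_NATIVE_PARSER"):
        return False
    # find_spec returns None when the top-level module is not installed.
    return importlib.util.find_spec("ast_serialize") is not None
```

A test module could call this helper once at import time and skip its native-parser cases when it returns `False`.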
Here are the expected benefits over the old mypy parser, adapted from the docstring of `mypy/nativeparse.py`:
… `# mypy: ignore` comments
Most of the code is straightforward and repetitive deserialization code. I relied heavily on coding-agent assistance to implement deserialization and to add tests. The tests are separate from the pre-existing parser tests, but we can unify them later (or delete the old tests once we delete the old parser).
@ilevkivskyi contributed to this PR.