Description
crawl4ai version
v0.8.0 (docker)
Expected Behavior
- `from_serializable_dict` should only apply the `ALLOWED_DESERIALIZE_TYPES` allowlist to objects that are actually part of the Crawl4AI config model, i.e. the `{ "type": "<ClassName>", "params": {...} }` / `{ "type": "dict", "value": {...} }` shapes produced by `to_serializable_dict`.
- Plain nested dictionaries that happen to have a `"type"` key (e.g. extraction schemas, JSON Schema fragments, etc.) should be treated as data, not as config classes, and should not be validated against `ALLOWED_DESERIALIZE_TYPES`.

In other words, the payloads below should deserialize successfully, and `/crawl` should succeed or fail based on crawling/extraction logic, not on deserialization of innocent schema dictionaries.
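One way to express this expectation: only dicts that match the config-model envelope (a `"type"` string plus `"params"` or `"value"`, with the type on the allowlist) should be routed through class instantiation; everything else is plain data. A minimal sketch of such a guard (the helper name and the reduced allowlist are illustrative, not the library's actual code):

```python
# Illustrative subset, not the library's full ALLOWED_DESERIALIZE_TYPES.
ALLOWED_DESERIALIZE_TYPES = {
    "CrawlerRunConfig",
    "JsonCssExtractionStrategy",
    "LLMExtractionStrategy",
}

def is_config_envelope(data):
    """True only for the {"type": <ClassName>, "params": {...}} and
    {"type": "dict", "value": {...}} envelopes produced by to_serializable_dict."""
    if not isinstance(data, dict) or not isinstance(data.get("type"), str):
        return False
    if data["type"] == "dict":
        return "value" in data
    return "params" in data and data["type"] in ALLOWED_DESERIALIZE_TYPES

# A JSON-CSS field dict has a "type" key but no "params" -> plain data:
assert not is_config_envelope({"name": "price", "selector": "td", "type": "text"})
# A real strategy envelope matches:
assert is_config_envelope({"type": "JsonCssExtractionStrategy", "params": {}})
```

With a predicate like this, anything that is not an envelope would simply be recursed into as an ordinary dict, so schema fields never reach the allowlist check.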
Current Behavior
- When a `/crawl` request contains a `crawler_config` with:
  - a `JsonCssExtractionStrategy` schema whose field dicts have `"type": "text"`, `"type": "list"`, etc., or
  - an `LLMExtractionStrategy` schema that uses JSON Schema-like `"type": "string"` / `"type": "object"`,

  the docker API fails during `CrawlerRunConfig.load(crawler_config)` with:

  `ValueError: Deserialization of type 'text' is not allowed. Only allowlisted configuration and strategy types can be deserialized.`

  or:

  `ValueError: Deserialization of type 'string' is not allowed. Only allowlisted configuration and strategy types can be deserialized.`

- These errors bubble up and `/crawl` returns HTTP 500, causing tests like:
  - `tests/docker/test_rest_api_deep_crawl.py::TestDeepCrawlEndpoints::test_deep_crawl_with_css_extraction`
  - `tests/docker/test_rest_api_deep_crawl.py::TestDeepCrawlEndpoints::test_deep_crawl_with_llm_extraction`

  to fail with `httpx.HTTPStatusError: Server error '500 Internal Server Error' for url 'http://localhost:11235/crawl'`.

- The failure is not due to invalid crawl configuration; it is purely a deserialization regression in `from_serializable_dict` after the RCE fix.
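The failure mode can be reproduced outside the docker API with a stripped-down replica of the deserializer's problematic branch (illustrative code with a reduced allowlist, not the library's implementation):

```python
# Illustrative subset of the allowlist.
ALLOWED_DESERIALIZE_TYPES = {"CrawlerRunConfig", "JsonCssExtractionStrategy"}

def from_serializable_dict(data):
    # Replica of the buggy behavior: ANY dict carrying a "type" key is
    # checked against the allowlist, including plain schema field dicts.
    if isinstance(data, dict) and "type" in data:
        if data["type"] != "dict" and data["type"] not in ALLOWED_DESERIALIZE_TYPES:
            raise ValueError(
                f"Deserialization of type '{data['type']}' is not allowed."
            )
    if isinstance(data, dict):
        return {k: from_serializable_dict(v) for k, v in data.items()}
    if isinstance(data, list):
        return [from_serializable_dict(i) for i in data]
    return data

# An innocent extraction schema: no config classes anywhere, yet it raises.
schema = {"fields": [{"name": "price", "selector": "td", "type": "text"}]}
try:
    from_serializable_dict(schema)
except ValueError as e:
    print(e)  # Deserialization of type 'text' is not allowed.
```

The recursion walks into `schema["fields"][0]`, sees `"type": "text"`, and rejects it even though that dict is pure data.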
Is this reproducible?
Yes
Inputs Causing the Bug
schema = {
"name": "Cryptocurrency Prices",
"baseSelector": "table[data-testid=\"prices-table\"] tbody tr",
"fields": [
{
"name": "asset_name",
"selector": "td:nth-child(2) p.cds-headline-h4steop",
"type": "text"
},
{
"name": "asset_symbol",
"selector": "td:nth-child(2) p.cds-label2-l1sm09ec",
"type": "text"
},
{
"name": "asset_image_url",
"selector": "td:nth-child(2) img[alt=\"Asset Symbol\"]",
"type": "attribute",
"attribute": "src"
},
{
"name": "asset_url",
"selector": "td:nth-child(2) a[aria-label^=\"Asset page for\"]",
"type": "attribute",
"attribute": "href"
},
{
"name": "price",
"selector": "td:nth-child(3) div.cds-typographyResets-t6muwls.cds-body-bwup3gq",
"type": "text"
},
{
"name": "change",
"selector": "td:nth-child(7) p.cds-body-bwup3gq",
"type": "text"
}
]
}
request = {
"urls": ["https://www.coinbase.com/explore"],
"crawler_config": {
"type": "CrawlerRunConfig",
"params": {
"extraction_strategy": {
"type": "JsonCssExtractionStrategy",
"params": {"schema": schema}
}
}
}
}

Steps to Reproduce
1. **Start the docker API** from the repo root:
docker compose up
# API listens on http://0.0.0.0:11235 inside the container,
# accessible as http://localhost:11235 on host.
2. **Run the deep-crawl docker tests** from the host environment, or run `docker_example.py`:
python3 tests/docker/test_rest_api_deep_crawl.py
or specifically:
python3 tests/docker/test_rest_api_deep_crawl.py -k 'css_extraction or llm_extraction'
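Alternatively, the failing request can be sent directly from the host. A minimal sketch using only the standard library (the repo's tests use httpx; the endpoint is the one started in step 1, and the trimmed schema here is illustrative):

```python
import json
import urllib.request

# Minimal failing payload: any field dict whose "type" value is not an
# allowlisted class name ("text" here) triggers the deserialization error.
payload = {
    "urls": ["https://www.coinbase.com/explore"],
    "crawler_config": {
        "type": "CrawlerRunConfig",
        "params": {
            "extraction_strategy": {
                "type": "JsonCssExtractionStrategy",
                "params": {"schema": {
                    "name": "prices",
                    "baseSelector": "tbody tr",
                    "fields": [{"name": "price", "selector": "td", "type": "text"}],
                }},
            }
        },
    },
}

def post_crawl(url="http://localhost:11235/crawl"):
    """POST the payload; on affected versions urlopen raises HTTPError 500."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

if __name__ == "__main__":
    print(post_crawl())
```

On an affected build this fails with HTTP 500 instead of returning crawl results.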
3. Observe that:
- The health checks pass.
- Earlier deep-crawl tests (basic, with filters, with scoring) pass.
- The following tests **fail with HTTP 500**:
- `test_deep_crawl_with_css_extraction`
- `test_deep_crawl_with_llm_extraction`

Code snippets
#### Current `from_serializable_dict` logic (relevant part)
def from_serializable_dict(data: Any) -> Any:
    """
    Recursively convert a serializable dictionary back to an object instance.
    """
    if data is None:
        return None

    # Handle basic types
    if isinstance(data, (str, int, float, bool)):
        return data

    # Handle typed data
    if isinstance(data, dict) and "type" in data:
        # Handle plain dictionaries
        if data["type"] == "dict" and "value" in data:
            return {k: from_serializable_dict(v) for k, v in data["value"].items()}

        # Security: only allow known-safe types to be deserialized
        type_name = data["type"]
        if type_name not in ALLOWED_DESERIALIZE_TYPES:
            raise ValueError(
                f"Deserialization of type '{type_name}' is not allowed. "
                f"Only allowlisted configuration and strategy types can be deserialized."
            )

        cls = None
        module_paths = ["crawl4ai"]
        for module_path in module_paths:
            try:
                mod = importlib.import_module(module_path)
                if hasattr(mod, type_name):
                    cls = getattr(mod, type_name)
                    break
            except (ImportError, AttributeError):
                continue

        if cls is not None:
            # Handle Enum
            if issubclass(cls, Enum):
                return cls(data["params"])
            if "params" in data:
                # Handle class instances
                constructor_args = {
                    k: from_serializable_dict(v) for k, v in data["params"].items()
                }
                return cls(**constructor_args)

    # Handle lists
    if isinstance(data, list):
        return [from_serializable_dict(item) for item in data]

    # Handle raw dictionaries (legacy support)
    if isinstance(data, dict):
        return {k: from_serializable_dict(v) for k, v in data.items()}

    return data

OS
Linux
Python version
3.13
Browser
No response
Browser version
No response
Error logs & Screenshots (if applicable)
No response