[Bug]: Docker - from_serializable_dict rejects schema "type" fields in 0.8.x #1797

@SohamKukreti

Description

crawl4ai version

v0.8.0 (docker)

Expected Behavior

  • from_serializable_dict should only apply the ALLOWED_DESERIALIZE_TYPES allowlist to objects that are actually part of the Crawl4AI config model, i.e. the { "type": "<ClassName>", "params": {...} } / { "type": "dict", "value": {...} } shapes produced by to_serializable_dict.

  • Plain nested dictionaries that happen to have a "type" key (e.g. extraction schemas, JSON Schema fragments, etc.) should be treated as data, not as config classes, and should not be validated against ALLOWED_DESERIALIZE_TYPES.

In other words, the payloads below should deserialize successfully, and /crawl should either succeed or fail based on crawling/extraction logic, not on deserialization of innocent schema dictionaries.
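A minimal sketch of the shape check described above, not the actual patch. `is_config_envelope` and `check_allowlist` are hypothetical names, the allowlist here is a two-entry stand-in for the real `ALLOWED_DESERIALIZE_TYPES`, and it assumes `to_serializable_dict` emits envelopes with exactly the keys `type`/`params` or `type`/`value`:

```python
from typing import Any

# Hypothetical stand-in; the real allowlist is defined inside crawl4ai.
ALLOWED_DESERIALIZE_TYPES = {"CrawlerRunConfig", "JsonCssExtractionStrategy"}


def is_config_envelope(data: Any) -> bool:
    """True only for the two shapes assumed to come from to_serializable_dict."""
    if not isinstance(data, dict) or not isinstance(data.get("type"), str):
        return False
    keys = set(data)
    if data["type"] == "dict":
        return keys == {"type", "value"} and isinstance(data["value"], dict)
    return keys == {"type", "params"}


def check_allowlist(data: Any) -> None:
    """Walk a payload and apply the allowlist to config envelopes only."""
    if isinstance(data, list):
        for item in data:
            check_allowlist(item)
    elif is_config_envelope(data):
        if data["type"] == "dict":
            check_allowlist(data["value"])
            return
        if data["type"] not in ALLOWED_DESERIALIZE_TYPES:
            raise ValueError(
                f"Deserialization of type '{data['type']}' is not allowed."
            )
        check_allowlist(data["params"])
    elif isinstance(data, dict):
        # Plain data such as {"name": ..., "type": "text"}: recurse, never reject.
        for value in data.values():
            check_allowlist(value)
```

With this rule a dict that carries extra keys next to `type` is treated as data and returned as a plain dictionary, so it is never instantiated as a class; only exact envelopes reach the allowlist.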

Current Behavior

  • When a /crawl request contains a crawler_config with:

    • A JsonCssExtractionStrategy schema where field dicts have "type": "text", "type": "list", etc., or
    • An LLMExtractionStrategy schema that uses JSON Schema-like "type": "string" / "type": "object",

    the Docker API fails during CrawlerRunConfig.load(crawler_config) with:

    ValueError: Deserialization of type 'text' is not allowed. Only allowlisted configuration and strategy types can be deserialized.
    

    or:

    ValueError: Deserialization of type 'string' is not allowed. Only allowlisted configuration and strategy types can be deserialized.
    
  • These errors bubble up and /crawl returns HTTP 500, causing tests like:

    • tests/docker/test_rest_api_deep_crawl.py::TestDeepCrawlEndpoints::test_deep_crawl_with_css_extraction
    • tests/docker/test_rest_api_deep_crawl.py::TestDeepCrawlEndpoints::test_deep_crawl_with_llm_extraction

    to fail with httpx.HTTPStatusError: Server error '500 Internal Server Error' for url 'http://localhost:11235/crawl'.

  • The failure is not due to invalid crawl configuration; it is purely a deserialization regression in from_serializable_dict after the RCE fix.

Is this reproducible?

Yes

Inputs Causing the Bug

schema = {
  "name": "Cryptocurrency Prices",
  "baseSelector": "table[data-testid=\"prices-table\"] tbody tr",
  "fields": [
    {
      "name": "asset_name",
      "selector": "td:nth-child(2) p.cds-headline-h4steop",
      "type": "text"
    },
    {
      "name": "asset_symbol",
      "selector": "td:nth-child(2) p.cds-label2-l1sm09ec",
      "type": "text"
    },
    {
      "name": "asset_image_url",
      "selector": "td:nth-child(2) img[alt=\"Asset Symbol\"]",
      "type": "attribute",
      "attribute": "src"
    },
    {
      "name": "asset_url",
      "selector": "td:nth-child(2) a[aria-label^=\"Asset page for\"]",
      "type": "attribute",
      "attribute": "href"
    },
    {
      "name": "price",
      "selector": "td:nth-child(3) div.cds-typographyResets-t6muwls.cds-body-bwup3gq",
      "type": "text"
    },
    {
      "name": "change",
      "selector": "td:nth-child(7) p.cds-body-bwup3gq",
      "type": "text"
    }
  ]
}


request = {
    "urls": ["https://www.coinbase.com/explore"],
    "crawler_config": {
        "type": "CrawlerRunConfig",
        "params": {
            "extraction_strategy": {
                "type": "JsonCssExtractionStrategy",
                "params": {"schema": schema}
            }
        }
    }
}
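To see which nested dictionaries in a payload like the one above trip the allowlist, a small diagnostic can walk the request and report every dict whose `"type"` value would be rejected. This is only a sketch: `rejected_types` is a hypothetical helper, and `ALLOWED` is an assumed subset of the real `ALLOWED_DESERIALIZE_TYPES`. It mirrors the rule in the current logic (any dict with a `"type"` key, except the `{"type": "dict", "value": ...}` shape, must be allowlisted), but lists all offenders instead of raising on the first one:

```python
from typing import Any, Iterator

# Assumed subset of the real allowlist, for illustration only.
ALLOWED = {"CrawlerRunConfig", "JsonCssExtractionStrategy"}


def rejected_types(data: Any, path: str = "$") -> Iterator[tuple[str, Any]]:
    """Yield (path, type value) for every dict the 0.8.x check would reject."""
    if isinstance(data, list):
        for index, item in enumerate(data):
            yield from rejected_types(item, f"{path}[{index}]")
    elif isinstance(data, dict):
        if "type" in data:
            type_name = data["type"]
            is_dict_envelope = type_name == "dict" and "value" in data
            is_allowed = isinstance(type_name, str) and type_name in ALLOWED
            if not is_dict_envelope and not is_allowed:
                yield path, type_name
        for key, value in data.items():
            yield from rejected_types(value, f"{path}.{key}")
```

Calling `list(rejected_types(request))` on the request above should report one entry per schema field, each pointing at a path under `schema.fields`, while the `CrawlerRunConfig` and `JsonCssExtractionStrategy` envelopes themselves are not reported.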

Steps to Reproduce

1. **Start the docker API** from the repo root:

   
   docker compose up
   # API listens on http://0.0.0.0:11235 inside the container,
   # accessible as http://localhost:11235 on host.
   

2. **Run the deep-crawl docker tests** from the host environment or run docker_example.py:

   
   python3 tests/docker/test_rest_api_deep_crawl.py
   

   or specifically:

   
   python3 tests/docker/test_rest_api_deep_crawl.py -k 'css_extraction or llm_extraction'
   

3. Observe that:

   - The health checks pass.
   - Earlier deep-crawl tests (basic, with filters, with scoring) pass.
   - The following tests **fail with HTTP 500**:

     - `test_deep_crawl_with_css_extraction`
     - `test_deep_crawl_with_llm_extraction`
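
The failure can also be checked without the test suite by posting the payload directly. The following is a sketch using only the standard library; `build_crawl_request` and `post_crawl` are hypothetical helper names, and it assumes the container from step 1 is listening on `http://localhost:11235` and that `/crawl` accepts unauthenticated JSON POSTs:

```python
import json
import urllib.error
import urllib.request
from typing import Any


def build_crawl_request(schema: dict[str, Any], url: str) -> dict[str, Any]:
    """Build a /crawl body in the same shape as the request in this report."""
    return {
        "urls": [url],
        "crawler_config": {
            "type": "CrawlerRunConfig",
            "params": {
                "extraction_strategy": {
                    "type": "JsonCssExtractionStrategy",
                    "params": {"schema": schema},
                }
            },
        },
    }


def post_crawl(payload: dict[str, Any],
               base_url: str = "http://localhost:11235") -> tuple[int, str]:
    """POST the payload to /crawl and return (status code, response body)."""
    req = urllib.request.Request(
        f"{base_url}/crawl",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(req, timeout=120) as resp:
            return resp.status, resp.read().decode("utf-8")
    except urllib.error.HTTPError as exc:
        # urlopen raises on 4xx/5xx; surface the status and body instead.
        return exc.code, exc.read().decode("utf-8")
```

Usage would be `status, body = post_crawl(build_crawl_request(schema, "https://www.coinbase.com/explore"))`, where `schema` is the dictionary from "Inputs Causing the Bug"; on an affected build the status is expected to be 500.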

Code snippets

#### Current `from_serializable_dict` logic (relevant part)


def from_serializable_dict(data: Any) -> Any:
    """
    Recursively convert a serializable dictionary back to an object instance.
    """
    if data is None:
        return None

    # Handle basic types
    if isinstance(data, (str, int, float, bool)):
        return data

    # Handle typed data
    if isinstance(data, dict) and "type" in data:
        # Handle plain dictionaries
        if data["type"] == "dict" and "value" in data:
            return {k: from_serializable_dict(v) for k, v in data["value"].items()}

        # Security: only allow known-safe types to be deserialized
        type_name = data["type"]
        if type_name not in ALLOWED_DESERIALIZE_TYPES:
            raise ValueError(
                f"Deserialization of type '{type_name}' is not allowed. "
                f"Only allowlisted configuration and strategy types can be deserialized."
            )

        cls = None
        module_paths = ["crawl4ai"]
        for module_path in module_paths:
            try:
                mod = importlib.import_module(module_path)
                if hasattr(mod, type_name):
                    cls = getattr(mod, type_name)
                    break
            except (ImportError, AttributeError):
                continue

        if cls is not None:
            # Handle Enum
            if issubclass(cls, Enum):
                return cls(data["params"])

            if "params" in data:
                # Handle class instances
                constructor_args = {
                    k: from_serializable_dict(v) for k, v in data["params"].items()
                }
                return cls(**constructor_args)

    # Handle lists
    if isinstance(data, list):
        return [from_serializable_dict(item) for item in data]

    # Handle raw dictionaries (legacy support)
    if isinstance(data, dict):
        return {k: from_serializable_dict(v) for k, v in data.items()}

    return data
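
Reading the logic above, the `{"type": "dict", "value": ...}` branch returns before the allowlist check runs. A possible client-side workaround until the deserializer is fixed is therefore to wrap every plain dictionary in the schema in that shape before sending it. This is a sketch only: `wrap_plain_dicts` is a hypothetical helper and has not been verified against a running 0.8.0 container.

```python
from typing import Any


def wrap_plain_dicts(data: Any) -> Any:
    """Wrap every dict in the {"type": "dict", "value": {...}} envelope.

    Intended for pure-data subtrees such as an extraction schema, not for
    {"type": "<ClassName>", "params": {...}} config envelopes.
    """
    if isinstance(data, dict):
        return {
            "type": "dict",
            "value": {key: wrap_plain_dicts(value) for key, value in data.items()},
        }
    if isinstance(data, list):
        return [wrap_plain_dicts(item) for item in data]
    return data
```

It would be applied to the schema only, for example `"params": {"schema": wrap_plain_dicts(schema)}`, so that each field dict reaches the deserializer as a `dict` envelope whose values are plain strings.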

OS

Linux

Python version

3.13

Browser

No response

Browser version

No response

Error logs & Screenshots (if applicable)

No response

Labels: 🐞 Bug (something isn't working), 📌 Root caused (root cause of the bug identified)
