Skip to content

chore: add disaster simulation script#259

Open
jason-lynch wants to merge 3 commits intomainfrom
chore/disaster-simulation
Open

chore: add disaster simulation script#259
jason-lynch wants to merge 3 commits intomainfrom
chore/disaster-simulation

Conversation

@jason-lynch
Copy link
Member

@jason-lynch jason-lynch commented Jan 28, 2026

Summary

Adds a script to simulate losing a host. This script has three different ways of simulating that loss to enable us to develop recovery steps for Swarm and Control Plane/Etcd in parallel.

Testing

NOTE - You can get this script without checking out this whole branch by doing:

git fetch origin chore/disaster-simulation:chore/disaster-simulation
git restore --source chore/disaster-simulation hack/simulate-disaster.sh

Then to use the script:

# Simulate losing Swarm on two hosts in order to lose quorum
./hack/simulate-disaster.sh swarm host-1 host-3

# Simulate losing Control Plane/Etcd on two hosts in order to lose quorum
./hack/simulate-disaster.sh etcd host-1 host-3

# Reset the fixture back to its initial state
./hack/simulate-disaster.sh reset

# Remember to include the fixture variant if you're using a non-default one
FIXTURE_VARIANT=small ./hack/simulate-disaster.sh reset

# Print the included help text to see more examples
./hack/simulate-disaster.sh --help

Adds a script to simulate losing a host. This script has three different
ways of simulating that loss to enable us to develop recovery steps for
Swarm and Control Plane/Etcd in parallel.
@coderabbitai
Copy link

coderabbitai bot commented Jan 28, 2026

📝 Walkthrough

Walkthrough

Introduces a new Bash script hack/simulate-disaster.sh that automates disaster simulations for Lima-based test fixtures. The script provides functions to simulate Swarm node loss, etcd node loss, full host loss, and recovery scenarios, with environment setup and make target integration for deployment and teardown operations.

Changes

Cohort / File(s) Summary
Disaster Simulation Script
hack/simulate-disaster.sh
Adds Bash script with functions to simulate failure modes: simulate_swarm_node_loss, simulate_etcd_node_loss, simulate_full_loss, reset, usage, and main for automating disaster recovery workflows against Lima test fixtures. Includes targeted cleanup and reinitialization steps.

Poem

Hopping through chaos with whiskers so keen,
We test every failure our clusters have seen,
Swarms tumble down, etcd nodes disappear—
But our recovery dances bring systems back here! 🐰
Disaster, we tame it, with scripts in our paws. 🔄

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check ❓ Inconclusive The description covers Summary and Testing sections with clear usage examples, but lacks Changes section, Checklist items, and Notes for Reviewers as specified in the template. Complete the missing template sections: add a bulleted Changes section, fill out the Checklist items, and provide Notes for Reviewers highlighting key usage instructions or any risks.
✅ Passed checks (1 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and accurately describes the main change: adding a disaster simulation script to the codebase.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch chore/disaster-simulation

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Adds an option to reset the Lima E2E fixture back to its initial state
without tearing it down entirely. This can save a significant amount of
time between tests.
- Fix etcd simulation for client-mode servers
- Remove database services in etcd simulation
- Rebuild control-plane in reset
- Remove `set -x`
@jason-lynch jason-lynch marked this pull request as ready for review February 24, 2026 15:43
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@hack/simulate-disaster.sh`:
- Around line 171-173: The dispatch case labeled "full" calls a nonexistent
function simulate_full_node_loss; update that case to call the actual function
simulate_full_loss (replace simulate_full_node_loss with simulate_full_loss) so
the "full" branch invokes the defined function and won't fail at runtime.
- Around line 121-125: Update the usage text in the simulate-disaster.sh header:
fix the duplicated word "different different" and include the missing `reset`
option in the synopsis string (change "Usage: $1 <swarm|etcd|full> <host-id>
[host-id ...]" to include `reset`, e.g. "Usage: $1 <swarm|etcd|full|reset>
<host-id> [host-id ...]") and adjust the descriptive paragraph to remove the
duplicate word so it reads "three different types of disasters" (or similar).
Ensure you update both the usage line and the description near that header.
- Line 17: Fix the unquoted $@ expansions in the for-loops and correct the
misnamed function call: change the three loops that iterate host_id (the ones
using for host_id in $@) to use quoted expansion (for host_id in "$@") to
prevent word-splitting and globbing, update the calls that forward args to use
quoted slices where shown (simulate_swarm_node_loss "${@:2}",
simulate_etcd_node_loss "${@:2}"), replace the invalid simulate_full_node_loss
call with the actual function name simulate_full_loss and pass quoted args
(simulate_full_loss "${@:2}"), and ensure the script's entry call uses main "$@"
instead of unquoted arguments.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between aceb011 and 07d41dc.

📒 Files selected for processing (1)
  • hack/simulate-disaster.sh

simulate_swarm_node_loss() {
local host_id

for host_id in $@; do
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

cat -n hack/simulate-disaster.sh

Repository: pgEdge/control-plane

Length of output: 6082


🌐 Web query:

shell script quoting $@ best practices word splitting globbing

💡 Result:

Best practices for $@ (and avoiding word splitting + globbing)

  • Forward all arguments exactly:
    Use "$@" (quoted). In bash, "$@" expands to separate words: "$1" "$2" ... (preserves spaces, tabs, newlines, and literal * ? [ characters in each argument). [1][2]

  • Don’t use $* for forwarding:
    $* (especially unquoted) merges args and then undergoes splitting/globbing; "$*" turns all args into one string separated by the first char of IFS. Prefer "$@". [1][3]

  • Loop over args safely:
    Prefer either:

    for arg in "$@"; do ...; done

    or (bash) simply:

    for arg; do ...; done

    Avoid for arg in $@; ... because unquoted $@ triggers word splitting and pathname expansion (globbing). [2]

  • Quote variable expansions by default:
    Unquoted expansions are subject to field splitting and then pathname expansion (globbing). Quoting suppresses both. [2][4]

  • When mixing variables with globs, quote only the variable parts:

    for f in "${dir}"/*.txt; do ...; done

    not "${dir}/*.txt" (that would disable the glob). [5]

  • If you actually want splitting, do it explicitly (not via unquoted $var/$@):
    Use read -r -a / mapfile (bash) or controlled IFS handling, rather than relying on “sloppy” splitting + globbing. [2]

Sources: Bash Reference Manual on special parameters @/* and quoting [1], ShellCheck guidance on quoting and "$@" [2][3][5], POSIX shell quoting/expansion rules [4].


Fix quoting of $@ expansions and correct the function name on line 172.

Unquoted $@ expansions in this script cause word-splitting and globbing issues, making them fragile with inputs containing spaces or glob characters. Additionally, line 172 calls simulate_full_node_loss which doesn't exist; the function is defined as simulate_full_loss on line 70, causing a runtime error.

Apply these fixes:

  • Line 17: for host_id in "$@"; do
  • Line 39: for host_id in "$@"; do
  • Line 73: for host_id in "$@"; do
  • Line 166: simulate_swarm_node_loss "${@:2}"
  • Line 169: simulate_etcd_node_loss "${@:2}"
  • Line 172: simulate_full_loss "${@:2}" (corrected function name and quoted)
  • Line 187: main "$@"
🧰 Tools
🪛 Shellcheck (0.11.0)

[error] 17-17: Double quote array expansions to avoid re-splitting elements.

(SC2068)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hack/simulate-disaster.sh` at line 17, Fix the unquoted $@ expansions in the
for-loops and correct the misnamed function call: change the three loops that
iterate host_id (the ones using for host_id in $@) to use quoted expansion (for
host_id in "$@") to prevent word-splitting and globbing, update the calls that
forward args to use quoted slices where shown (simulate_swarm_node_loss
"${@:2}", simulate_etcd_node_loss "${@:2}"), replace the invalid
simulate_full_node_loss call with the actual function name simulate_full_loss
and pass quoted args (simulate_full_loss "${@:2}"), and ensure the script's
entry call uses main "$@" instead of unquoted arguments.

Comment on lines +121 to +125
Usage: $1 <swarm|etcd|full> <host-id> [host-id ...]

Simulates disasters against the Lima test fixtures. Supports three different
different types of disasters to enable us to develop some recovery steps in
parallel:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Update usage text (typo + include reset).

The synopsis omits reset, and the description says “different different”.

✏️ Suggested fix
-Usage: $1 <swarm|etcd|full> <host-id> [host-id ...]
+Usage: $1 <swarm|etcd|full|reset> <host-id> [host-id ...]
@@
-Simulates disasters against the Lima test fixtures. Supports three different
-different types of disasters to enable us to develop some recovery steps in
+Simulates disasters against the Lima test fixtures. Supports three different
+types of disasters to enable us to develop some recovery steps in
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
Usage: $1 <swarm|etcd|full> <host-id> [host-id ...]
Simulates disasters against the Lima test fixtures. Supports three different
different types of disasters to enable us to develop some recovery steps in
parallel:
Usage: $1 <swarm|etcd|full|reset> <host-id> [host-id ...]
Simulates disasters against the Lima test fixtures. Supports three different
types of disasters to enable us to develop some recovery steps in
parallel:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hack/simulate-disaster.sh` around lines 121 - 125, Update the usage text in
the simulate-disaster.sh header: fix the duplicated word "different different"
and include the missing `reset` option in the synopsis string (change "Usage: $1
<swarm|etcd|full> <host-id> [host-id ...]" to include `reset`, e.g. "Usage: $1
<swarm|etcd|full|reset> <host-id> [host-id ...]") and adjust the descriptive
paragraph to remove the duplicate word so it reads "three different types of
disasters" (or similar). Ensure you update both the usage line and the
description near that header.

Comment on lines +171 to +173
full)
simulate_full_node_loss ${@:2}
;;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Fix the full dispatch: undefined function call.

full currently calls simulate_full_node_loss, but the defined function is simulate_full_loss, so the full path will fail at runtime.

✅ Suggested fix
-		full)
-			simulate_full_node_loss ${@:2}
+		full)
+			simulate_full_loss ${@:2}
 			;;
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
full)
simulate_full_node_loss ${@:2}
;;
full)
simulate_full_loss ${@:2}
;;
🧰 Tools
🪛 Shellcheck (0.11.0)

[error] 172-172: Double quote array expansions to avoid re-splitting elements.

(SC2068)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hack/simulate-disaster.sh` around lines 171 - 173, The dispatch case labeled
"full" calls a nonexistent function simulate_full_node_loss; update that case to
call the actual function simulate_full_loss (replace simulate_full_node_loss
with simulate_full_loss) so the "full" branch invokes the defined function and
won't fail at runtime.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants