chore: add disaster simulation script by jason-lynch · Pull Request #259 · pgEdge/control-plane

jason-lynch · 2026-01-28T17:05:16Z

Summary

Adds a script to simulate losing a host. This script has three different ways of simulating that loss to enable us to develop recovery steps for Swarm and Control Plane/Etcd in parallel.

Testing

NOTE - You can get this script without checking out this whole branch by doing:

git fetch origin chore/disaster-simulation:chore/disaster-simulation
git restore --source chore/disaster-simulation hack/simulate-disaster.sh

Then to use the script:

# Simulate losing Swarm on two hosts in order to lose quorum
./hack/simulate-disaster.sh swarm host-1 host-3

# Simulate losing Control Plane/Etcd on two hosts in order to lose quorum
./hack/simulate-disaster.sh etcd host-1 host-3

# Reset the fixture back to its initial state
./hack/simulate-disaster.sh reset

# Remember to include the fixture variant if you're using a non-default one
FIXTURE_VARIANT=small ./hack/simulate-disaster.sh reset

# Print the included help text to see more examples
./hack/simulate-disaster.sh --help

coderabbitai · 2026-01-28T17:05:31Z

Walkthrough

Introduces a new Bash script hack/simulate-disaster.sh that automates disaster simulations for Lima-based test fixtures. The script provides functions to simulate Swarm node loss, etcd node loss, full host loss, and recovery scenarios, with environment setup and make target integration for deployment and teardown operations.

Changes

Cohort / File(s)	Summary
Disaster Simulation Script `hack/simulate-disaster.sh`	Adds Bash script with functions to simulate failure modes: `simulate_swarm_node_loss`, `simulate_etcd_node_loss`, `simulate_full_loss`, `reset`, `usage`, and `main` for automating disaster recovery workflows against Lima test fixtures. Includes targeted cleanup and reinitialization steps.

Poem

Hopping through chaos with whiskers so keen,
We test every failure our clusters have seen,
Swarms tumble down, etcd nodes disappear—
But our recovery dances bring systems back here! 🐰
Disaster, we tame it, with scripts in our paws. 🔄

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check	❓ Inconclusive	The description covers Summary and Testing sections with clear usage examples, but lacks Changes section, Checklist items, and Notes for Reviewers as specified in the template.	Complete the missing template sections: add a bulleted Changes section, fill out the Checklist items, and provide Notes for Reviewers highlighting key usage instructions or any risks.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and accurately describes the main change: adding a disaster simulation script to the codebase.

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@hack/simulate-disaster.sh`:
- Around line 171-173: The dispatch case labeled "full" calls a nonexistent
function simulate_full_node_loss; update that case to call the actual function
simulate_full_loss (replace simulate_full_node_loss with simulate_full_loss) so
the "full" branch invokes the defined function and won't fail at runtime.
- Around line 121-125: Update the usage text in the simulate-disaster.sh header:
fix the duplicated word "different different" and include the missing `reset`
option in the synopsis string (change "Usage: $1 <swarm|etcd|full> <host-id>
[host-id ...]" to include `reset`, e.g. "Usage: $1 <swarm|etcd|full|reset>
<host-id> [host-id ...]") and adjust the descriptive paragraph to remove the
duplicate word so it reads "three different types of disasters" (or similar).
Ensure you update both the usage line and the description near that header.
- Line 17: Fix the unquoted $@ expansions in the for-loops and correct the
misnamed function call: change the three loops that iterate host_id (the ones
using for host_id in $@) to use quoted expansion (for host_id in "$@") to
prevent word-splitting and globbing, update the calls that forward args to use
quoted slices where shown (simulate_swarm_node_loss "${@:2}",
simulate_etcd_node_loss "${@:2}"), replace the invalid simulate_full_node_loss
call with the actual function name simulate_full_loss and pass quoted args
(simulate_full_loss "${@:2}"), and ensure the script's entry call uses main "$@"
instead of unquoted arguments.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between aceb011 and 07d41dc.

📒 Files selected for processing (1)

hack/simulate-disaster.sh

coderabbitai · 2026-02-24T15:48:04Z

hack/simulate-disaster.sh

+simulate_swarm_node_loss() {
+	local host_id
+
+	for host_id in $@; do


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

cat -n hack/simulate-disaster.sh

Repository: pgEdge/control-plane

Length of output: 6082

🌐 Web query:

shell script quoting $@ best practices word splitting globbing

💡 Result:

Best practices for $@ (and avoiding word splitting + globbing)

Forward all arguments exactly:
Use "$@" (quoted). In bash, "$@" expands to separate words: "$1" "$2" ... (preserves spaces, tabs, newlines, and literal * ? [ characters in each argument). [1][2]

Don’t use $* for forwarding:
$* (especially unquoted) merges args and then undergoes splitting/globbing; "$*" turns all args into one string separated by the first char of IFS. Prefer "$@". [1][3]

Loop over args safely:
Prefer either:

for arg in "$@"; do ...; done

or (bash) simply:

for arg; do ...; done

Avoid for arg in $@; ... because unquoted $@ triggers word splitting and pathname expansion (globbing). [2]

Quote variable expansions by default:
Unquoted expansions are subject to field splitting and then pathname expansion (globbing). Quoting suppresses both. [2][4]

When mixing variables with globs, quote only the variable parts:

for f in "${dir}"/*.txt; do ...; done

not "${dir}/*.txt" (that would disable the glob). [5]

If you actually want splitting, do it explicitly (not via unquoted $var/$@):
Use read -r -a / mapfile (bash) or controlled IFS handling, rather than relying on “sloppy” splitting + globbing. [2]

Sources: Bash Reference Manual on special parameters @/* and quoting [1], ShellCheck guidance on quoting and "$@" [2][3][5], POSIX shell quoting/expansion rules [4].
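To make the difference concrete, here is a minimal sketch comparing the three expansions. The helper names (`count_unquoted`, `count_quoted`, `join_star`) are hypothetical and exist only for this illustration; they are not part of `hack/simulate-disaster.sh`.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: the two counters print how many words their loop saw.

count_unquoted() {
	local n=0 arg
	# Unquoted $@ is re-split on whitespace (and glob-expanded).
	# shellcheck disable=SC2068
	for arg in $@; do
		n=$((n + 1))
	done
	echo "$n"
}

count_quoted() {
	local n=0 arg
	# Quoted "$@" yields exactly one word per original argument.
	for arg in "$@"; do
		n=$((n + 1))
	done
	echo "$n"
}

join_star() {
	# "$*" joins all arguments into one string using the first IFS character.
	local IFS=,
	echo "$*"
}

count_unquoted "host 1" host-3   # prints 3
count_quoted "host 1" host-3     # prints 2
join_star host-1 host-3          # prints host-1,host-3
```

Host IDs such as `host-1` contain no whitespace or glob characters, so the unquoted loops likely behave correctly today; quoting mainly guards against surprising input.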

Fix quoting of $@ expansions and correct the function name on line 172.

Unquoted $@ expansions in this script cause word-splitting and globbing issues, making them fragile with inputs containing spaces or glob characters. Additionally, line 172 calls simulate_full_node_loss which doesn't exist; the function is defined as simulate_full_loss on line 70, causing a runtime error.

Apply these fixes:

Line 17: for host_id in "$@"; do

Line 39: for host_id in "$@"; do

Line 73: for host_id in "$@"; do

Line 166: simulate_swarm_node_loss "${@:2}"

Line 169: simulate_etcd_node_loss "${@:2}"

Line 172: simulate_full_loss "${@:2}" (corrected function name and quoted)

Line 187: main "$@"
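The forwarding pattern behind these fixes can be sketched as follows. This is an illustration only: `simulate_stub` and `dispatch` are hypothetical stand-ins, and the real script's functions do actual work against the Lima fixtures rather than echoing.

```shell
#!/usr/bin/env bash
# Stand-in for the real simulation functions; it only echoes its arguments.
simulate_stub() {
	local host_id
	for host_id in "$@"; do
		echo "would simulate loss of: ${host_id}"
	done
}

dispatch() {
	case "${1:-}" in
		swarm|etcd|full)
			# "${@:2}" forwards every argument after the subcommand, one word each.
			simulate_stub "${@:2}"
			;;
		*)
			echo "unknown command: ${1:-}" >&2
			return 1
			;;
	esac
}

dispatch swarm host-1 host-3
```

With the quoted slice, an argument such as `"host 1"` would reach the loop as a single host ID instead of two.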

🧰 Tools

🪛 Shellcheck (0.11.0)

[error] 17-17: Double quote array expansions to avoid re-splitting elements.

(SC2068)

coderabbitai · 2026-02-24T15:48:04Z

hack/simulate-disaster.sh

+Usage: $1 <swarm|etcd|full> <host-id> [host-id ...]
+
+Simulates disasters against the Lima test fixtures. Supports three different
+different types of disasters to enable us to develop some recovery steps in
+parallel:


⚠️ Potential issue | 🟡 Minor

Update usage text (typo + include reset).

The synopsis omits reset, and the description says “different different”.

✏️ Suggested fix

-Usage: $1 <swarm|etcd|full> <host-id> [host-id ...]
+Usage: $1 <swarm|etcd|full|reset> <host-id> [host-id ...]
@@
-Simulates disasters against the Lima test fixtures. Supports three different
-different types of disasters to enable us to develop some recovery steps in
+Simulates disasters against the Lima test fixtures. Supports three different
+types of disasters to enable us to develop some recovery steps in
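One caveat on the synopsis: the Testing examples in the PR description invoke `reset` with no host IDs, so folding it into the same alternation may overstate its arguments. A possible alternative is a two-line synopsis, sketched here with a hypothetical `usage` helper whose wording is illustrative rather than the script's actual help text:

```shell
#!/usr/bin/env bash
# Hypothetical usage helper; $1 is the program name to display.
usage() {
	cat <<EOF
Usage: $1 <swarm|etcd|full> <host-id> [host-id ...]
       $1 reset
EOF
}

usage ./hack/simulate-disaster.sh
```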

coderabbitai · 2026-02-24T15:48:04Z

hack/simulate-disaster.sh

+		full)
+			simulate_full_node_loss ${@:2}
+			;;


⚠️ Potential issue | 🔴 Critical

Fix the full dispatch: undefined function call.

full currently calls simulate_full_node_loss, but the defined function is simulate_full_loss, so the full path will fail at runtime.

✅ Suggested fix

-		full)
-			simulate_full_node_loss ${@:2}
+		full)
+			simulate_full_loss ${@:2}
 			;;
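Why this only surfaces at runtime: Bash looks up a command name when the line executes, not when the script is parsed, so the other subcommands keep working until someone runs `full`. A minimal reproduction, using a deliberately undefined and hypothetical name (`simulate_undefined_fn`) rather than the script's own code:

```shell
#!/usr/bin/env bash
# Hypothetical reproduction: simulate_undefined_fn is intentionally not defined.
dispatch() {
	case "$1" in
		ok)
			echo "ran ok"
			;;
		full)
			# Bash resolves this name only when the branch executes.
			simulate_undefined_fn "${@:2}"
			;;
	esac
}

dispatch ok                      # succeeds; the bad name is never looked up
status=0
dispatch full host-1 2>/dev/null || status=$?
echo "full exited with status ${status}"   # 127 is bash's "command not found"
```

Running ShellCheck or exercising each subcommand once in CI would likely catch this class of typo before merge.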

🧰 Tools

🪛 Shellcheck (0.11.0)

[error] 172-172: Double quote array expansions to avoid re-splitting elements.

(SC2068)

chore: add disaster simulation script

9125195

Adds a script to simulate losing a host. This script has three different ways of simulating that loss to enable us to develop recovery steps for Swarm and Control Plane/Etcd in parallel.

jason-lynch requested review from moizpgedge and tsivaprasad January 28, 2026 17:05

jason-lynch added 2 commits January 29, 2026 09:55

chore: add reset function to simulate-disaster.sh

2385fac

Adds an option to reset the Lima E2E fixture back to its initial state without tearing it down entirely. This can save a significant amount of time between tests.

chore: simulate-disaster improvements

07d41dc

- Fix etcd simulation for client-mode servers
- Remove database services in etcd simulation
- Rebuild control-plane in reset
- Remove `set -x`

jason-lynch marked this pull request as ready for review February 24, 2026 15:43

tsivaprasad approved these changes Feb 24, 2026

coderabbitai bot reviewed Feb 24, 2026

jason-lynch wants to merge 3 commits into main from chore/disaster-simulation
