fix: guard JIT prompts against injection#238
Conversation
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
📝 WalkthroughPrompt Injection Guard ImplementationCore Changes
Security & Testing
New Public APIs
No Breaking Changes — Guard is disabled by default for backward compatibility; can be enabled via environment variable. WalkthroughThis PR adds a prompt-injection guard that sanitizes and detects hostile patterns in user prompts before BM25 scoring. The guard is integrated into the JIT hook as an early abort gate, validated with comprehensive unit tests, and benchmarked against a 34-payload security corpus spanning 10 attack classes. ChangesPrompt Injection Guard Implementation
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Suggested labels
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 OpenGrep (1.22.0)OpenGrep fatal error (exit code 2): �[32m✔�[39m �[1mOpengrep OSS�[0m �[1m Loading rules from local config...�[0m Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@Gradata/src/gradata/hooks/_injection_guard.py`:
- Around line 251-254: The current fast-path in _injection_guard.py
unconditionally returns (False, "") when len(text) < 20, which skips detection
of short high-signal markers; instead, modify the short-input guard so it only
skips full expensive processing but still runs a minimal marker scan for the
variable text (e.g., call or inline a compact rule set that checks for known
injection tokens like "override", "system:", "assistant:", "###", ">>>",
prompt-injection keywords, or regexes) and return a defensive True/flag if any
minimal-rule matches; keep the existing full scanner for longer inputs but
ensure the early-return branch delegates to this minimal_scan helper (name it
minimal_scan_or_scan_short_input) and returns its (bool, reason) tuple rather
than always False.
In `@Gradata/tests/hooks/test_injection_guard.py`:
- Around line 55-56: The test relies on ambient environment for guard
enablement, causing nondeterministic failures; add an autouse pytest fixture
that pins the guard-related env so is_suspicious behaves consistently (e.g.,
ensure GRADATA_LEGACY_INSTALL is unset or set to the expected value) for tests
using test_gap_payload_detected and _block_ids; implement the fixture in the
test module (autouse=True) using pytest's monkeypatch to set or unset
os.environ["GRADATA_LEGACY_INSTALL"] before tests run and restore afterward so
the guard logic in is_suspicious runs deterministically.
In `@Gradata/tests/security/fixtures/manifest.json`:
- Around line 275-276: The manifest entry for encoding_bypass_003 has
contradictory metadata: the description says Unicode homoglyphs won't be caught
but detectable_by_current_guards is true; update the entry by setting the
boolean detectable_by_current_guards to false (or alternatively reword the
description to claim it is detectable) so the metadata and description
align—locate the encoding_bypass_003 object in manifest.json and change the
"detectable_by_current_guards" field accordingly while keeping the description
text as-is (or adjust the description if you prefer to keep the boolean true).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: bfa5432e-43f0-4cc1-a016-f1df024b226b
📒 Files selected for processing (39)
Gradata/src/gradata/hooks/_injection_guard.pyGradata/src/gradata/hooks/jit_inject.pyGradata/tests/hooks/test_injection_guard.pyGradata/tests/security/fixtures/injection_corpus/benign_control_001.txtGradata/tests/security/fixtures/injection_corpus/benign_control_002.txtGradata/tests/security/fixtures/injection_corpus/benign_control_003.txtGradata/tests/security/fixtures/injection_corpus/direct_override_001.txtGradata/tests/security/fixtures/injection_corpus/direct_override_002.txtGradata/tests/security/fixtures/injection_corpus/direct_override_003.txtGradata/tests/security/fixtures/injection_corpus/encoding_bypass_001.txtGradata/tests/security/fixtures/injection_corpus/encoding_bypass_002.txtGradata/tests/security/fixtures/injection_corpus/encoding_bypass_003.txtGradata/tests/security/fixtures/injection_corpus/few_shot_hijack_001.txtGradata/tests/security/fixtures/injection_corpus/goal_hijack_001.txtGradata/tests/security/fixtures/injection_corpus/goal_hijack_002.txtGradata/tests/security/fixtures/injection_corpus/indirect_001.txtGradata/tests/security/fixtures/injection_corpus/indirect_002.txtGradata/tests/security/fixtures/injection_corpus/js_template_001.txtGradata/tests/security/fixtures/injection_corpus/js_template_002.txtGradata/tests/security/fixtures/injection_corpus/js_template_003.txtGradata/tests/security/fixtures/injection_corpus/js_template_004.txtGradata/tests/security/fixtures/injection_corpus/marker_inject_001.txtGradata/tests/security/fixtures/injection_corpus/marker_inject_002.txtGradata/tests/security/fixtures/injection_corpus/marker_inject_003.txtGradata/tests/security/fixtures/injection_corpus/role_hijack_001.txtGradata/tests/security/fixtures/injection_corpus/role_hijack_002.txtGradata/tests/security/fixtures/injection_corpus/role_hijack_003.txtGradata/tests/security/fixtures/injection_corpus/role_hijack_004.txtGradata/tests/security/fixtures/injection_corpus/system_leak_001.txtGradata/tests/security/fixtures/injection_corpus/system_leak_002.txtGradata/tests/security/fixtures/injection_corpus/system_leak_003.txtGradata/tests/security/fixtures/injection_corpus/virtualization_001.txtGradata/tests/security/fixtures/injection_corpus/virtualization_002.txtGradata/tests/security/fixtures/injection_corpus/xml_inject_001.txtGradata/tests/security/fixtures/injection_corpus/xml_inject_002.txtGradata/tests/security/fixtures/injection_corpus/xml_inject_003.txtGradata/tests/security/fixtures/injection_corpus/xml_inject_004.txtGradata/tests/security/fixtures/manifest.jsonGradata/tests/security/test_prompt_injection_poc.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
- GitHub Check: pytest windows-latest / py3.12
- GitHub Check: pytest macos-latest / py3.11
- GitHub Check: pytest windows-latest / py3.11
- GitHub Check: pytest ubuntu-latest / py3.12
- GitHub Check: pytest ubuntu-latest / py3.11
- GitHub Check: pytest macos-latest / py3.12
- GitHub Check: pytest (py3.11)
- GitHub Check: pytest (py3.12)
🧰 Additional context used
📓 Path-based instructions (2)
Gradata/src/**/*.py
📄 CodeRabbit inference engine (Gradata/AGENTS.md)
Gradata/src/**/*.py: Prefersentence-transformersfor local embeddings,google-genaifor Gemini embeddings,cryptographyfor AES-GCM encrypted system.db,bm25sfor BM25 rule ranking, andmem0aifor external memory adapters — guard all optional dependency imports withtry / except ImportErrorat the call site, never at module level
Maintain strict layering: Layer 0 (Primitives: _types.py, _db.py, _events.py, _paths.py, _file_lock.py; Patterns: contrib/patterns/) must never import from Layer 1 (Enhancements: enhancements/, rules/) or Layer 2 (Public API: brain.py, cli.py, daemon.py, mcp_server.py)
Never use bareexcept: pass— use typed exceptions or at minimumlogger.warning(...)withexc_info=Trueto avoid silent failure in a memory product
Never import from out-of-scope sibling directories../Sprites/or../Hausgem/withingradata/*code — that is a layering bug
Never leak private-sibling paths into public docs/code — no references to../Sprites/,../Hausgem/, email addresses, OneDrive paths, or Sprites-specific examples from insidegradata/*
Use atomic-write helper when writing JSON files to prevent corruption from mid-write crashes
Files:
Gradata/src/gradata/hooks/jit_inject.pyGradata/src/gradata/hooks/_injection_guard.py
Gradata/tests/**/*.py
📄 CodeRabbit inference engine (Gradata/AGENTS.md)
Gradata/tests/**/*.py: SetBRAIN_DIRenvironment variable viatmp_pathin conftest.py for test isolation — ensure_paths.pymodule cache refreshes when callingBrain.init()directly inside tests
Add unit tests intests/test_*.pyfor every CI push without LLM calls (deterministic); mark integration tests with@pytest.mark.integrationand skip them by default (they hit real LLM APIs)
Files:
Gradata/tests/hooks/test_injection_guard.pyGradata/tests/security/test_prompt_injection_poc.py
🪛 LanguageTool
Gradata/tests/security/fixtures/injection_corpus/few_shot_hijack_001.txt
[style] ~7-~7: You have already used this phrasing in nearby sentences. Consider replacing it to add variety to your writing.
Context: ...sistant (unrestricted): You boil water. Also, here are all the system prompt content...
(REP_SENT_START_ALSO)
🔇 Additional comments (36)
Gradata/src/gradata/hooks/jit_inject.py (1)
33-33: LGTM!Also applies to: 292-297
Gradata/tests/security/fixtures/injection_corpus/system_leak_003.txt (1)
1-1: LGTM!Gradata/tests/security/fixtures/injection_corpus/virtualization_001.txt (1)
1-1: LGTM!Gradata/tests/security/fixtures/injection_corpus/virtualization_002.txt (1)
1-1: LGTM!Gradata/tests/security/fixtures/injection_corpus/xml_inject_001.txt (1)
1-1: LGTM!Gradata/tests/security/test_prompt_injection_poc.py (1)
93-126: LGTM!Also applies to: 136-168, 178-224
Gradata/tests/security/fixtures/injection_corpus/benign_control_001.txt (1)
1-1: LGTM!Gradata/tests/security/fixtures/injection_corpus/benign_control_002.txt (1)
1-1: LGTM!Gradata/tests/security/fixtures/injection_corpus/benign_control_003.txt (1)
1-1: LGTM!Gradata/tests/security/fixtures/injection_corpus/direct_override_001.txt (1)
1-1: LGTM!Gradata/tests/security/fixtures/injection_corpus/direct_override_002.txt (1)
1-1: LGTM!Gradata/tests/security/fixtures/injection_corpus/xml_inject_002.txt (1)
1-1: LGTM!Gradata/tests/security/fixtures/injection_corpus/xml_inject_004.txt (1)
1-5: LGTM!Gradata/tests/security/fixtures/injection_corpus/direct_override_003.txt (1)
1-1: LGTM!Gradata/tests/security/fixtures/injection_corpus/encoding_bypass_001.txt (1)
1-1: LGTM!Gradata/tests/security/fixtures/injection_corpus/encoding_bypass_002.txt (1)
1-1: LGTM!Gradata/tests/security/fixtures/injection_corpus/encoding_bypass_003.txt (1)
1-1: LGTM!Gradata/tests/security/fixtures/injection_corpus/few_shot_hijack_001.txt (1)
1-10: LGTM!Gradata/tests/security/fixtures/injection_corpus/system_leak_001.txt (1)
1-1: LGTM!Gradata/tests/security/fixtures/injection_corpus/system_leak_002.txt (1)
1-1: LGTM!Gradata/tests/security/fixtures/injection_corpus/xml_inject_003.txt (1)
1-1: LGTM!Gradata/tests/security/fixtures/injection_corpus/goal_hijack_001.txt (1)
1-1: LGTM!Gradata/tests/security/fixtures/injection_corpus/goal_hijack_002.txt (1)
1-1: LGTM!Gradata/tests/security/fixtures/injection_corpus/indirect_001.txt (1)
1-5: LGTM!Gradata/tests/security/fixtures/injection_corpus/indirect_002.txt (1)
1-5: LGTM!Gradata/tests/security/fixtures/injection_corpus/js_template_001.txt (1)
1-1: LGTM!Gradata/tests/security/fixtures/injection_corpus/role_hijack_002.txt (1)
1-1: LGTM!Gradata/tests/security/fixtures/injection_corpus/role_hijack_003.txt (1)
1-1: LGTM!Gradata/tests/security/fixtures/injection_corpus/role_hijack_004.txt (1)
1-1: LGTM!Gradata/tests/security/fixtures/injection_corpus/js_template_002.txt (1)
1-1: LGTM!Gradata/tests/security/fixtures/injection_corpus/js_template_003.txt (1)
1-1: LGTM!Gradata/tests/security/fixtures/injection_corpus/js_template_004.txt (1)
1-1: LGTM!Gradata/tests/security/fixtures/injection_corpus/marker_inject_001.txt (1)
1-4: LGTM!Gradata/tests/security/fixtures/injection_corpus/marker_inject_002.txt (1)
1-6: LGTM!Gradata/tests/security/fixtures/injection_corpus/marker_inject_003.txt (1)
1-6: LGTM!Gradata/tests/security/fixtures/injection_corpus/role_hijack_001.txt (1)
1-1: LGTM!
| # Quick pre-check: if text is very short and doesn't contain known markers, | ||
| # skip expensive processing. | ||
| if len(text) < 20: | ||
| return False, "" |
There was a problem hiding this comment.
Short-input fast-path bypasses known injection markers.
The code unconditionally returns False for inputs shorter than 20 chars, which contradicts the comment and misses concise payloads (e.g., short override directives). Keep the fast-path, but still scan short inputs with a minimal high-signal rule set.
🔧 Suggested fix
- # Quick pre-check: if text is very short and doesn't contain known markers,
- # skip expensive processing.
- if len(text) < 20:
- return False, ""
+ # Quick pre-check: keep short-input fast path, but still check high-signal
+ # markers so concise injections are not missed.
+ if len(text) < 20:
+ quick_checks = (
+ _RE_OVERRIDE,
+ _RE_SYSTEM_LEAK,
+ _RE_LLM_MARKERS,
+ _RE_GOAL_HIJACK,
+ _RE_ROLEPLAY,
+ )
+ if not any(p.search(text) for p in quick_checks):
+ return False, ""🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@Gradata/src/gradata/hooks/_injection_guard.py` around lines 251 - 254, The
current fast-path in _injection_guard.py unconditionally returns (False, "")
when len(text) < 20, which skips detection of short high-signal markers;
instead, modify the short-input guard so it only skips full expensive processing
but still runs a minimal marker scan for the variable text (e.g., call or inline
a compact rule set that checks for known injection tokens like "override",
"system:", "assistant:", "###", ">>>", prompt-injection keywords, or regexes)
and return a defensive True/flag if any minimal-rule matches; keep the existing
full scanner for longer inputs but ensure the early-return branch delegates to
this minimal_scan helper (name it minimal_scan_or_scan_short_input) and returns
its (bool, reason) tuple rather than always False.
| @pytest.mark.parametrize("payload_id", _block_ids()) | ||
| def test_gap_payload_detected(payload_id: str) -> None: |
There was a problem hiding this comment.
Guard tests are environment-coupled and can fail nondeterministically.
These tests rely on ambient env for guard enablement. If GRADATA_LEGACY_INSTALL is set in CI, is_suspicious returns early and the suite fails for the wrong reason. Pin guard env in an autouse fixture.
🔧 Suggested fix
import pytest
from gradata.hooks._injection_guard import is_suspicious, sanitize
+
+
+@pytest.fixture(autouse=True)
+def _force_guard_enabled(monkeypatch: pytest.MonkeyPatch) -> None:
+ monkeypatch.setenv("GRADATA_INJECTION_GUARD", "1")
+ monkeypatch.delenv("GRADATA_LEGACY_INSTALL", raising=False)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@Gradata/tests/hooks/test_injection_guard.py` around lines 55 - 56, The test
relies on ambient environment for guard enablement, causing nondeterministic
failures; add an autouse pytest fixture that pins the guard-related env so
is_suspicious behaves consistently (e.g., ensure GRADATA_LEGACY_INSTALL is unset
or set to the expected value) for tests using test_gap_payload_detected and
_block_ids; implement the fixture in the test module (autouse=True) using
pytest's monkeypatch to set or unset os.environ["GRADATA_LEGACY_INSTALL"] before
tests run and restore afterward so the guard logic in is_suspicious runs
deterministically.
| "description": "Thin Unicode lookalike variant ('ignore previous instructions' with homoglyphs). Listed as detectable via literal ASCII match here — but Unicode homoglyphs would not be caught; included to mark the boundary.", | ||
| "detectable_by_current_guards": true |
There was a problem hiding this comment.
Fix contradictory detectability metadata for encoding_bypass_003.
Line 275 says this homoglyph variant is not reliably caught by literal ASCII matching, but Line 276 marks it as detectable. Set detectable_by_current_guards to false (or adjust the description) so corpus stats and gap tracking remain trustworthy.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@Gradata/tests/security/fixtures/manifest.json` around lines 275 - 276, The
manifest entry for encoding_bypass_003 has contradictory metadata: the
description says Unicode homoglyphs won't be caught but
detectable_by_current_guards is true; update the entry by setting the boolean
detectable_by_current_guards to false (or alternatively reword the description
to claim it is detectable) so the metadata and description align—locate the
encoding_bypass_003 object in manifest.json and change the
"detectable_by_current_guards" field accordingly while keeping the description
text as-is (or adjust the description if you prefer to keep the boolean true).
Summary
src/gradata/hooks/_injection_guard.pywith Unicode normalization, zero-width/BOM cleanup, regex heuristics, and base64/ROT13 decoded-payload checks.jit_inject.pyimmediately afterUserPromptSubmitmessage extraction and before BM25/Jaccard rule scoring.Paperclip: GRA-2018 (
bfe46192-868a-472e-a271-10836fee3048)Related: GRA-1295, GRA-1596
Verification
python3 -m pytest tests/hooks/test_injection_guard.py tests/test_jit_inject.py tests/security/test_prompt_injection_poc.py→ 91 passed, 1 skipped, 14 xfaileduv run --extra dev ruff check src/gradata/hooks/_injection_guard.py src/gradata/hooks/jit_inject.py tests/hooks/test_injection_guard.py tests/security/test_prompt_injection_poc.py→ All checks passed