AstroML is a research-driven Python framework for building dynamic graph machine learning models on the Stellar blockchain (supported by the Stellar Development Foundation).
It treats blockchain data as a multi-asset, time-evolving graph, enabling advanced ML research on transaction networks such as fraud detection, anomaly detection, and behavioral modeling.
AstroML provides end-to-end tooling for:
- Ledger ingestion and normalization
- Dynamic transaction graph construction
- Feature engineering for blockchain accounts
- Graph Neural Networks (GNNs)
- Self-supervised node embeddings
- Anomaly detection
- Temporal modeling
- Reproducible ML experimentation
- Model registry with versioning and metrics tracking
The Model Registry provides version control for your trained models: it tracks model versions and their performance metrics, and lets you activate a specific version for production use.
Key Features:
- Register new model versions with auto-generated or custom version tags
- Track performance metrics alongside model artifacts
- Activate specific model versions for inference
- Configurable model storage location
For full documentation, see docs/model-registry.md
Blockchain networks are naturally graph-structured systems:
| Blockchain Concept | Graph Representation |
|---|---|
| Accounts | Nodes |
| Transactions | Directed edges |
| Assets | Edge types |
| Time | Dynamic dimension |
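The mapping in the table can be sketched in plain Python. This is only an illustration, not AstroML's actual data model: the `Edge` class, the `build_adjacency` helper, the toy account IDs, and the choice of seconds for `timestamp` are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """One transaction: a directed, typed, timestamped edge (hypothetical class)."""
    src: str          # sending account (node)
    dst: str          # receiving account (node)
    timestamp: int    # assumed to be seconds; this is the dynamic dimension
    asset: str        # asset code, used as the edge type
    amount: float


def build_adjacency(edges):
    """Group outgoing edges by (source account, asset) so each asset is its own edge type."""
    adjacency = defaultdict(list)
    for edge in edges:
        adjacency[(edge.src, edge.asset)].append(edge)
    return adjacency


# Toy data: two accounts can be linked by several edges of different assets.
edges = [
    Edge("GA", "GB", 100, "XLM", 5.0),
    Edge("GA", "GB", 160, "USDC", 12.5),
    Edge("GB", "GC", 220, "XLM", 1.0),
]
adjacency = build_adjacency(edges)
nodes = {e.src for e in edges} | {e.dst for e in edges}
```

Keying the adjacency on `(src, asset)` keeps parallel edges between the same pair of accounts distinct per asset, which is what makes the graph multi-asset rather than a simple directed graph.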
Most analytics tools rely on static heuristics or SQL queries.
AstroML instead enables:
- Dynamic graph learning
- Temporal GNNs
- Representation learning
- Research-grade experimentation
AstroML is designed for:
- ML researchers
- Graph ML engineers
- Fraud detection teams
- Blockchain data scientists
┌─────────────────────────────────────────────────────────────────────────┐
│                   AstroML: Ingestion → Graph → Train                    │
└─────────────────────────────────────────────────────────────────────────┘

                              ┌─────────────┐
                              │   Stellar   │
                              │   Ledgers   │
                              └──────┬──────┘
                                     │
                    ┌────────────────▼────────────────┐
                    │  1. INGESTION LAYER             │
                    │  ├─ Ledger backfill (Polars)    │
                    │  ├─ Incremental ingestion       │
                    │  ├─ State tracking (idempotent) │
                    │  └─ PostgreSQL storage          │
                    └────────────────┬────────────────┘
                                     │
                    ┌────────────────▼────────────────┐
                    │  2. NORMALIZATION LAYER         │
                    │  ├─ Raw Stellar schema          │
                    │  │  (Ledger, Transaction, Op)   │
                    │  ├─ Graph mirror layer          │
                    │  │  (GraphAccount, GraphEdge)   │
                    │  └─ Composite indexes           │
                    │     (account_id, timestamp)     │
                    └────────────────┬────────────────┘
                                     │
                    ┌────────────────▼────────────────┐
                    │  3. GRAPH BUILDING LAYER        │
                    │  ├─ Time-windowed snapshots     │
                    │  ├─ Edge construction           │
                    │  ├─ Node indexing               │
                    │  └─ Graph validation            │
                    └────────────────┬────────────────┘
                                     │
                    ┌────────────────▼────────────────┐
                    │  4. FEATURE ENGINEERING         │
                    │  ├─ Transaction frequency       │
                    │  ├─ Asset diversity             │
                    │  ├─ Structural importance       │
                    │  │  (degree, betweenness, PR)   │
                    │  ├─ Feature store & versioning  │
                    │  └─ Point-in-time queries       │
                    └────────────────┬────────────────┘
                                     │
                    ┌────────────────▼────────────────┐
                    │  5. TRAINING LAYER              │
                    │  ├─ Temporal train/test split   │
                    │  ├─ Link prediction task        │
                    │  ├─ Negative sampling           │
                    │  ├─ PyTorch Geometric models    │
                    │  │  (GCN, GraphSAGE, GAT)       │
                    │  └─ Early stopping              │
                    └────────────────┬────────────────┘
                                     │
                    ┌────────────────▼────────────────┐
                    │  6. BENCHMARKING & EVALUATION   │
                    │  ├─ Reproducible configs        │
                    │  ├─ Random seed tracking        │
                    │  ├─ Metric computation          │
                    │  │  (AUC, Precision, Recall)    │
                    │  ├─ Memory profiling            │
                    │  └─ Result persistence          │
                    └────────────────┬────────────────┘
                                     │
                              ┌──────▼──────┐
                              │  Baseline   │
                              │   Results   │
                              └─────────────┘
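The graph building layer lists time-windowed snapshots, node indexing, and graph validation. The sketch below shows one plausible shape for those three steps in plain Python. The function names (`snapshot`, `index_nodes`, `validate`), the half-open window convention, and the density definition are assumptions for illustration, not AstroML's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """Hypothetical edge record: (src, dst, timestamp, asset, amount)."""
    src: str
    dst: str
    timestamp: int
    asset: str
    amount: float


def snapshot(edges, start, end):
    """Keep edges whose timestamp falls in the half-open window [start, end)."""
    return [e for e in edges if start <= e.timestamp < end]


def index_nodes(edges):
    """Assign each account an integer id in order of first appearance."""
    node_index = {}
    for e in edges:
        for account in (e.src, e.dst):
            node_index.setdefault(account, len(node_index))
    return node_index


def validate(edges, all_accounts):
    """Report self-loops, accounts that appear in no edge, and directed density."""
    accounts = set(all_accounts)
    seen = set(index_nodes(edges))
    self_loops = sum(1 for e in edges if e.src == e.dst)
    isolated = sorted(accounts - seen)
    n = len(accounts)
    # Density here counts distinct ordered pairs, ignoring self-loops and
    # parallel edges, over the n * (n - 1) possible directed pairs.
    unique_pairs = {(e.src, e.dst) for e in edges if e.src != e.dst}
    density = len(unique_pairs) / (n * (n - 1)) if n > 1 else 0.0
    return {"self_loops": self_loops, "isolated": isolated, "density": density}


edges = [
    Edge("GA", "GB", 10, "XLM", 1.0),
    Edge("GB", "GB", 20, "XLM", 2.0),   # self-loop
    Edge("GB", "GC", 95, "USDC", 3.0),  # falls outside the window below
]
window = snapshot(edges, start=0, end=60)
node_index = index_nodes(window)
report = validate(window, all_accounts=["GA", "GB", "GC"])
```

In this toy window, `GC` takes part in no edge, so it is reported as isolated, and the single self-loop on `GB` is counted separately from the density calculation.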
Stellar Ledger Data
        ↓
[Ingestion Service]
  ├─ Fetch ledgers (1000000-1100000)
  ├─ Track state (.astroml_state/ingestion_state.json)
  └─ Store in PostgreSQL
        ↓
[Database Schema]
  ├─ Raw Layer: Ledger, Transaction, Operation, Account, Asset
  ├─ Graph Layer: GraphAccount, GraphEdge, GraphTransactionDetail
  └─ Indexes: (account_id, timestamp) composite
        ↓
[Graph Snapshot]
  ├─ Query operations by time window
  ├─ Create Edge objects (src, dst, timestamp, asset, amount)
  ├─ Build node_index mapping
  └─ Validate graph (isolated nodes, self-loops, density)
        ↓
[Feature Store]
  ├─ Compute node features (frequency, diversity, centrality)
  ├─ Compute edge features (asset type, amount, direction)
  ├─ Version features with metadata
  └─ Store in SQLite + Parquet
        ↓
[Temporal Split]
  ├─ Sort edges by timestamp
  ├─ Split at cutoff (80% train, 20% test)
  └─ Ensure no future data leaks into training
        ↓
[Link Prediction Task]
  ├─ Context window: edges before cutoff
  ├─ Future window: edges after cutoff
  ├─ Positive labels: future edges
  ├─ Negative sampling: random non-edges
  └─ Binary classification objective
        ↓
[Model Training]
  ├─ LinkPredictor(encoder + decoder)
  ├─ Adam optimizer with early stopping
  ├─ Compute AUC, Precision, Recall, F1
  └─ Track training/validation losses
        ↓
[Benchmark Results]
  ├─ config.json (full configuration + seed)
  ├─ result.json (metrics + performance)
  └─ metadata.json (run_id, timestamp, linking files)
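The temporal split and negative sampling steps in the flow above can be sketched as follows. This is a minimal, standard-library illustration under stated assumptions: edges are plain `(src, dst, timestamp)` tuples, and `temporal_split` and `sample_negatives` are hypothetical names rather than the functions in `astroml/training/temporal_split.py`. The negative sampler enumerates every candidate pair, which is fine for a toy graph but would need a rejection-sampling approach on a real ledger.

```python
import random


def temporal_split(edges, train_fraction=0.8):
    """Sort (src, dst, timestamp) edges by time and cut at train_fraction.

    Every training edge is at least as old as every test edge, so no
    future edge leaks into training.
    """
    ordered = sorted(edges, key=lambda e: e[2])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


def sample_negatives(nodes, known_pairs, k, seed=42):
    """Draw up to k distinct (src, dst) pairs that are not known edges."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    known = set(known_pairs)
    ordered_nodes = sorted(nodes)
    candidates = [
        (u, v)
        for u in ordered_nodes
        for v in ordered_nodes
        if u != v and (u, v) not in known
    ]
    return rng.sample(candidates, min(k, len(candidates)))


edges = [
    ("GA", "GB", 30),
    ("GB", "GC", 10),
    ("GC", "GA", 50),
    ("GA", "GC", 20),
    ("GB", "GA", 40),
]
train, test = temporal_split(edges)

nodes = {n for src, dst, _ in edges for n in (src, dst)}
known_pairs = {(src, dst) for src, dst, _ in edges}
negatives = sample_negatives(nodes, known_pairs, k=len(test))

# Binary classification examples: future edges are positives (label 1),
# sampled non-edges are negatives (label 0).
examples = [((s, d), 1) for s, d, _ in test] + [(pair, 0) for pair in negatives]
```

Excluding pairs from both the train and test windows when sampling negatives avoids labeling a real edge as a negative example.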
astroml/
├── ingestion/                    # Ledger ingestion & state tracking
│   ├── service.py                # IngestionService (incremental, idempotent)
│   ├── state.py                  # StateStore (tracks processed ledgers)
│   └── backfill.py               # Bulk ledger loading
├── db/                           # Database layer
│   ├── schema.py                 # SQLAlchemy ORM models
│   └── session.py                # Database connection management
├── features/                     # Feature engineering
│   ├── feature_store.py          # Enterprise feature management
│   ├── graph/
│   │   └── snapshot.py           # Time-windowed graph construction
│   ├── frequency.py              # Transaction frequency features
│   ├── asset_diversity.py
│   └── gnn/                      # Graph neural network layers
├── models/                       # ML models
│   ├── link_predictor.py
│   ├── gcn.py
│   ├── sage.py
│   └── deep_svdd.py
├── tasks/                        # Training tasks
│   └── link_prediction_task.py
├── training/                     # Training utilities
│   ├── temporal_split.py         # Prevent data leakage
│   └── train_link_prediction.py
├── benchmarking/                 # Benchmarking framework
│   ├── core.py                   # ModelBenchmark orchestrator
│   ├── config.py                 # Configuration management
│   └── metrics.py                # Metric computation
├── quick_start.py                # Quick start pipeline
└── cli.py                        # Command-line interface
# Run quick start with default settings (100 ledgers, 50 accounts, 10 epochs)
make quickstart
# Run with more data for thorough testing
make quickstart-verbose

# Run quick start with default settings
python -m astroml.quick_start
# Run with custom parameters
python -m astroml.quick_start --num-ledgers 200 --num-accounts 100 --epochs 20 --seed 42

# Run quick start command
python -m astroml quickstart --num-ledgers 100 --num-accounts 50 --epochs 10 --seed 42

The quick start pipeline:
- Generates sample data: Creates 100 synthetic ledgers with 50 accounts and realistic transactions
- Builds transaction graph: Constructs a time-windowed graph with ~2000 edges
- Validates graph: Checks for isolated nodes, self-loops, and computes statistics
- Trains baseline model: Trains a LinkPredictor model for 10 epochs
- Saves reproducible results: Stores config, results, and metadata for reproducibility
Output:
benchmark_results/quickstart/
├── config.json      # Full configuration with random seed
├── result.json      # Training metrics and performance
└── metadata.json    # Run metadata linking config and result
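To inspect a finished run, the three files can be read back with the standard library. The helper below is a small sketch; `load_run` is a hypothetical name, and it makes no assumption about the keys inside each file beyond their being valid JSON.

```python
import json
from pathlib import Path


def load_run(run_dir):
    """Read config.json, result.json, and metadata.json from one run directory."""
    run_dir = Path(run_dir)
    run = {}
    for name in ("config", "result", "metadata"):
        with open(run_dir / f"{name}.json", encoding="utf-8") as handle:
            run[name] = json.load(handle)
    return run
```

For example, `load_run("benchmark_results/quickstart")` returns a dictionary with the keys `config`, `result`, and `metadata`, each holding the parsed contents of the matching file.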
Example output:
================================================================================
AstroML Quick Start: Ingestion → Graph → Train Pipeline
================================================================================
[Step 1/5] Generating sample ledger data...
Generated 100 ledgers with 50 accounts
[Step 2/5] Building transaction graph...
Built graph with 2000 edges and 50 nodes
[Step 3/5] Creating benchmark configuration...
[Step 4/5] Training baseline model...
Epoch 0: Train Loss = 0.6931, Val Loss = 0.6892
Epoch 5: Train Loss = 0.4521, Val Loss = 0.4612
Training complete. Best metrics: {'auc': 0.92, 'precision': 0.88, 'recall': 0.85}
[Step 5/5] Saving benchmark results...
Saved config to benchmark_results/quickstart/config.json
Saved result to benchmark_results/quickstart/result.json
Saved metadata to benchmark_results/quickstart/metadata.json
✓ Quick start completed successfully!
Results saved to: benchmark_results/quickstart
================================================================================
For the quickest setup with all dependencies, use Docker:
# Clone and navigate to repository
git clone https://github.com/Traqora/astroml.git
cd astroml
# Start with Docker
cp .env.example .env
./scripts/docker-start.sh core
# Access services
curl http://localhost:8000 # API
open http://localhost:3000 # Grafana

Full Docker Setup: See DOCKER.md for comprehensive documentation including:
- Docker Quick Reference - Quick commands and common tasks
- Environment Configuration - Configuration guide
- Production Deployment - Production setup
- Troubleshooting - Common issues and solutions
git clone https://github.com/Traqora/astroml.git
cd astroml

python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Note: Three requirements files are available. See REQUIREMENTS.md for guidance on which to use based on your environment (GPU training, CPU-only, or minimal config-only).
A lightweight Docker Compose setup is provided to spin up PostgreSQL and Redis with persistent volumes. Simply run:
docker compose up -d

This starts only the database and cache, letting you run Python scripts and training natively on your machine. Alternatively, you can configure your own database and update config/database.yaml.
Backfill ledgers:
python -m astroml.ingestion.backfill \
  --start-ledger 1000000 \
  --end-ledger 1100000

Create a rolling time window graph:
python -m astroml.graph.build_snapshot --window 30d

Create benchmark datasets by injecting controlled fraud structures into a clean ledger copy:
python -m astroml.ingestion.synthetic_fraud_injector \
  --input data/clean_ledger.jsonl \
  --output data/ledger_with_fraud.jsonl \
  --summary outputs/fraud_injection_summary.json \
  --sybil-clusters 3 \
  --sybil-cluster-size 8 \
  --wash-loops 2 \
  --wash-loop-size 5

The injector appends transactions tagged with synthetic_fraud=true and fraud_pattern (sybil_cluster or wash_trading_loop) for downstream benchmarking.
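A quick way to sanity-check an injected file is to tally the tagged records. The sketch below assumes each JSONL line is one JSON object with `synthetic_fraud` as a top-level boolean and `fraud_pattern` as a top-level string; the exact record layout is not specified here, so treat those details, the `id` field in the sample, and the `count_fraud_patterns` name as assumptions.

```python
import json
from collections import Counter


def count_fraud_patterns(lines):
    """Tally fraud_pattern values over records tagged synthetic_fraud=true."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        if record.get("synthetic_fraud") is True:
            counts[record.get("fraud_pattern", "unknown")] += 1
    return counts


# Illustrative records only; real ledger rows carry many more fields.
sample = [
    '{"id": 1, "synthetic_fraud": true, "fraud_pattern": "sybil_cluster"}',
    '{"id": 2}',
    '{"id": 3, "synthetic_fraud": true, "fraud_pattern": "wash_trading_loop"}',
    '{"id": 4, "synthetic_fraud": true, "fraud_pattern": "sybil_cluster"}',
]
counts = count_fraud_patterns(sample)
```

Because the function accepts any iterable of lines, an open file handle for `data/ledger_with_fraud.jsonl` can be passed in directly instead of the sample list.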
python -m astroml.training.train_gcn

- Liquidity Monitoring for the Stellar Community Fund
- Fraud / scam detection
- Account clustering
- Transaction risk scoring
- Temporal behavior modeling
- Self-supervised embeddings
- Network anomaly detection
AstroML emphasizes:
- Reproducibility
- Modular experimentation
- Scalable ingestion
- Temporal graph learning
- Production-ready ML pipelines
- Python
- PyTorch / PyTorch Geometric
- PostgreSQL
- NetworkX / graph tooling
- Real-time streaming ingestion
- Temporal GNN models
- Contrastive learning pipelines
- Feature store
- Model benchmarking suite
- Docker deployment
Contributions are welcome!
fork → branch → commit → PR

Please open issues for bugs or feature requests.
MIT License