active_learning is a backend-agnostic library for selecting samples by ID, attaching artifacts, and emitting results through explicit sinks.
This repository is the standalone home for the active_learning Python
package. The core package can be installed without autocrane-cloud; CRID and
Sama workflows live under explicit integration modules and require the CRID
runtime to provide the interface package.
For library development:

    uv pip install -e .

For the CRID-backed app and scripts from an autocrane-cloud/apps/crid shell:

    uv pip install -e /Users/lukas/repos/active-learning[app,crid]

The app launchers expect autocrane-cloud next to this checkout. Set
AUTOCRANE_CLOUD_PATH=/path/to/autocrane-cloud if it lives elsewhere.
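The lookup order described above (environment variable first, sibling directory otherwise) could look roughly like the sketch below. The helper name `resolve_autocrane_cloud` is hypothetical and is not taken from the launchers; only the `AUTOCRANE_CLOUD_PATH` variable and the sibling-directory default come from this README.

```python
import os
from pathlib import Path


def resolve_autocrane_cloud(checkout: Path) -> Path:
    """Return the autocrane-cloud directory a launcher might use.

    Hypothetical helper: AUTOCRANE_CLOUD_PATH wins when set, otherwise
    fall back to a directory named autocrane-cloud next to the checkout.
    """
    override = os.environ.get("AUTOCRANE_CLOUD_PATH")
    if override:
        return Path(override).expanduser()
    return checkout.parent / "autocrane-cloud"
```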
The flow is:
source IDs -> image provider -> scorers -> selectors -> SelectionResult -> sinks
The core rules are:
- SampleId is the canonical sample representation.
- Scorers derive per-sample artifacts and scores, including the cached brightness stats used for early filtering.
- Selectors choose the final subset.
- Sinks consume SelectionResult and handle outputs or side effects.
- Integrations adapt CRID and Sama to the core flow.
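To make the flow concrete, here is a minimal sketch of how the stages might compose. It is not the package's actual API: `run_selection`, the string-typed `SampleId`, and the fields on this `SelectionResult` stand-in are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List

SampleId = str  # assumption: the real SampleId type may be richer than a string


@dataclass
class SelectionResult:
    """Hypothetical stand-in for the library's SelectionResult."""
    selected: List[SampleId]
    artifacts: Dict[SampleId, Dict[str, float]] = field(default_factory=dict)


def run_selection(source_ids: Iterable[SampleId],
                  load_image: Callable,
                  scorers: Dict[str, Callable],
                  selector: Callable,
                  sinks: Iterable[Callable]) -> SelectionResult:
    ids = list(source_ids)
    # Scorers derive per-sample artifacts, keyed by sample ID.
    artifacts = {}
    for sid in ids:
        image = load_image(sid)
        artifacts[sid] = {name: scorer(image) for name, scorer in scorers.items()}
    # The selector chooses the final subset from candidates plus artifacts.
    selected = selector(ids, artifacts)
    result = SelectionResult(selected, {sid: artifacts[sid] for sid in selected})
    # Sinks consume the result and handle outputs or side effects.
    for sink in sinks:
        sink(result)
    return result


# Toy usage: "images" are short lists of pixel values, the only scorer is
# the mean, and the selector keeps the two highest-scoring samples.
images = {"a.png": [0.1, 0.2], "b.png": [0.8, 0.9], "c.png": [0.4, 0.5]}
seen: List[SampleId] = []
result = run_selection(
    source_ids=images,
    load_image=images.__getitem__,
    scorers={"mean": lambda px: sum(px) / len(px)},
    selector=lambda ids, art: sorted(ids, key=lambda s: art[s]["mean"], reverse=True)[:2],
    sinks=[lambda r: seen.extend(r.selected)],
)
```

Passing each stage as a plain callable keeps the sketch backend-agnostic: swapping the image loader or a sink does not touch the scoring or selection steps.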
- core/ - shared runtime primitives: config loading, image provider, selection orchestration, and core types
- providers/ - model and inference utilities: UNet loading, batch extraction, and uncertainty scoring helpers
- scorers/ - score and artifact derivation, plus brightness-based pre-filtering, keyed by sample ID
- selectors/ - final subset selection from candidates plus artifacts
- sinks/ - SelectionResult consumers that emit outputs or side effects
- integrations/ - backend-specific adapters, currently CRID and Sama, to the core flow
- strategies/ - reusable selection recipes built from lower-level pieces
- scripts/ - thin CLI entrypoints around the library pieces
- tests/ - unit and integration coverage for the package
- core/config.py owns config parsing and validation.
- core/image_provider.py owns image materialization and caching.
- providers/ is separate from core/; it contains model/inference utilities, not image storage or CRID access.
The seed.py CLI accepts the following values for --strategy:
- coreset
  - Pure diversity selection over image features. Uses the configured feature model and any labeled seed images as the reference set.
- uncertainty_coreset
  - Computes uncertainty first, then balances uncertainty and diversity using coreset-style selection.
- uncertainty_topk
  - Pure uncertainty ranking. Selects the n most uncertain images without a diversity stage.
- uncertainty_topk_coreset
  - Two-stage uncertainty workflow: first keep the top uncertain candidates, then run coreset selection on that reduced pool.
- alges
  - Active Learning with Gradient Embeddings for Segmentation. Builds ALGES gradient embeddings from the configured segmentation model and selects with k-means++.
- alges_coreset
  - Two-stage ALGES workflow: run ALGES to form a candidate pool, then run coreset selection to diversify the final batch.
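As an illustration of the coreset and two-stage ideas, the sketch below uses greedy k-center selection, a common way to implement coreset-style diversity sampling. Whether the package uses exactly this variant is not stated here, so treat it as an assumption. The function names are hypothetical, and sizing the shortlist as `n * candidate_multiplier` is also an assumption.

```python
import math
from typing import Dict, List, Sequence


def kcenter_greedy(features: Dict[str, Sequence[float]], n: int,
                   seeds: Sequence[str] = ()) -> List[str]:
    """Greedy k-center: repeatedly pick the candidate farthest from
    everything already covered (labeled seeds plus earlier picks)."""
    seed_set = set(seeds)
    pool = [sid for sid in features if sid not in seed_set]
    # Distance from each candidate to its nearest reference point so far;
    # infinity when there are no seeds yet.
    min_dist = {
        sid: min((math.dist(features[sid], features[s]) for s in seeds),
                 default=math.inf)
        for sid in pool
    }
    picked: List[str] = []
    while pool and len(picked) < n:
        best = max(pool, key=lambda sid: min_dist[sid])
        picked.append(best)
        pool.remove(best)
        for sid in pool:
            min_dist[sid] = min(min_dist[sid],
                                math.dist(features[sid], features[best]))
    return picked


def uncertainty_topk_coreset(features: Dict[str, Sequence[float]],
                             uncertainty: Dict[str, float], n: int,
                             candidate_multiplier: int = 3) -> List[str]:
    # Stage 1: keep the most uncertain candidates (shortlist size is an
    # assumption: n * candidate_multiplier).
    shortlist = sorted(uncertainty, key=uncertainty.get,
                       reverse=True)[: n * candidate_multiplier]
    # Stage 2: diversify the shortlist with coreset selection.
    return kcenter_greedy({sid: features[sid] for sid in shortlist}, n)


# Toy usage: two tight clusters plus one far-away, low-uncertainty point.
features = {"a": (0, 0), "b": (0.1, 0), "c": (5, 5), "d": (5.1, 5), "e": (10, 0)}
uncertainty = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.6, "e": 0.1}
two_stage = uncertainty_topk_coreset(features, uncertainty, n=2, candidate_multiplier=2)
with_seed = kcenter_greedy(features, n=2, seeds=["a"])
```

In the toy data, sample `e` is the most diverse point but never reaches the coreset stage of the two-stage run, because its low uncertainty keeps it off the shortlist.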
Related flags used by some strategies:
- --provider {mc_dropout,entropy,bald}
  - Used by the uncertainty-based strategies.
- --aggregation {mean,topk_mean,max} and --topk-fraction
  - Control how per-pixel uncertainty maps are reduced to one score per image.
- --candidate-multiplier
  - Used by uncertainty_topk_coreset to size the intermediate uncertainty shortlist.
- --feature-model
  - Used by coreset, uncertainty_coreset, uncertainty_topk_coreset, and the coreset stage of alges_coreset.
- --method {image,semantic}
  - Used by alges and alges_coreset to choose the ALGES embedding variant.
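The three aggregation modes can be sketched as follows. This is an illustration of the general idea rather than the package's implementation: the function name, the flattened-list input, the default fraction, and the rounding of the top-k count are all assumptions.

```python
from typing import Sequence


def aggregate_uncertainty(pixels: Sequence[float], aggregation: str = "mean",
                          topk_fraction: float = 0.1) -> float:
    """Reduce a flattened per-pixel uncertainty map to one image score."""
    if len(pixels) == 0:
        raise ValueError("empty uncertainty map")
    if aggregation == "mean":
        return sum(pixels) / len(pixels)
    if aggregation == "max":
        return max(pixels)
    if aggregation == "topk_mean":
        if not 0.0 < topk_fraction <= 1.0:
            raise ValueError("topk_fraction must be in (0, 1]")
        # Average only the most uncertain fraction of pixels (at least one).
        k = max(1, int(len(pixels) * topk_fraction))
        top = sorted(pixels, reverse=True)[:k]
        return sum(top) / k
    raise ValueError(f"unknown aggregation: {aggregation}")
```

The mean tends to wash out small uncertain regions in a mostly confident image, while max reacts to a single noisy pixel; a top-k mean sits between the two.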
Use active-learning-local to run selection on a recursive directory of local
images without CRID or Sama:

    active-learning-local --images-dir /path/to/images --strategy coreset -n 50

The local runner scans .jpg, .jpeg, .png, .webp, and .bmp files,
uses POSIX-style relative paths as sample IDs, and writes a mosaic plus YAML
handoff next to the configured mosaic path. coreset is the recommended
starter strategy because it only needs image features; uncertainty and ALGES
strategies still require a configured UNet model.
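The scan described above can be approximated with the standard library. This is a sketch, not the runner's code: `scan_sample_ids` is a hypothetical name, and the case-insensitive suffix match and sorted output are assumptions.

```python
import tempfile
from pathlib import Path
from typing import List

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp", ".bmp"}


def scan_sample_ids(images_dir: str) -> List[str]:
    """Recursively collect image files and return POSIX-style paths
    relative to images_dir, which then serve as sample IDs."""
    root = Path(images_dir)
    return sorted(
        path.relative_to(root).as_posix()
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES
    )


# Toy usage on a temporary directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    (root / "a.jpg").touch()
    (root / "sub" / "b.PNG").touch()
    (root / "notes.txt").touch()  # ignored: not an image suffix
    ids = scan_sample_ids(tmp)
```

Using `as_posix()` keeps the IDs identical across platforms, so a selection made on Windows refers to the same samples on Linux or macOS.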
For a CRID-backed active-learning run with ALGES and Sama export:
- seed.py loads the seed config, queries CRID for candidate sample IDs, and builds an ImageProvider.
- providers/ supplies the model side of the run: UNet loading, inference, and uncertainty utilities used by ALGES and uncertainty-based strategies.
- Brightness filtering removes candidates that fail the brightness check, then scorers compute features, uncertainty, or ALGES embeddings.
- A selector chooses the final subset, which becomes the SelectionResult.
- sinks/mosaic.py can render a preview mosaic, and sinks/yaml.py writes the interactive seed handoff.
- If sama_project_id is set, the CRID export sink submits the selected samples and the Sama sink creates the batch.
In practice, this is the path for "pick a batch of images from CRID, inspect the selection, and push it to Sama for annotation."
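The brightness pre-filter step in that run can be sketched as below. The thresholds, the Rec. 601 luma weights, the function names, and the dict-based cache are all assumptions for illustration; the package's real brightness stats and cut-offs may differ.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each in 0..255


def mean_brightness(pixels: Sequence[Pixel]) -> float:
    """Mean Rec. 601 luma over RGB pixels, in the range 0..255."""
    total = sum(0.299 * r + 0.587 * g + 0.114 * b for r, g, b in pixels)
    return total / len(pixels)


def brightness_filter(images: Dict[str, Sequence[Pixel]],
                      low: float = 30.0, high: float = 225.0,
                      cache: Optional[Dict[str, float]] = None) -> List[str]:
    """Keep sample IDs whose mean brightness lies within [low, high].

    Stats are stored in `cache` by sample ID so a later run can skip
    recomputing them (hypothetical cache shape).
    """
    cache = {} if cache is None else cache
    kept = []
    for sid, pixels in images.items():
        if sid not in cache:
            cache[sid] = mean_brightness(pixels)
        if low <= cache[sid] <= high:
            kept.append(sid)
    return kept


# Toy usage: one underexposed, one usable, one blown-out image.
stats: Dict[str, float] = {}
toy_images = {
    "dark": [(0, 0, 0), (10, 10, 10)],
    "ok": [(120, 130, 110)],
    "blown": [(255, 255, 255)],
}
kept = brightness_filter(toy_images, cache=stats)
```

Running the cheap filter before any model inference means the more expensive scorers only see candidates that are worth annotating.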