Skip to content

mevdschee/github-export

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

23 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

github-export

Export all GitHub issues, pull requests, discussions, releases, labels, milestones, and Projects (v2) from a repository into a local github-data/ folder as plain markdown files. Runs incrementally on subsequent invocations.

Designed so an agent can read, grep, and reason about GitHub data without API access.

Blog post: https://www.tqdev.com/2026-github-export-open-source-maintenance-ai/

Usage

export GITHUB_TOKEN=$(gh auth token)
./github-export [flags] [owner/repo] [output-dir]

The output directory defaults to github-data/. On the first run, all data is fetched. On subsequent runs, only issues and PRs updated since the last sync are re-fetched.

Run with no arguments to sync the export in the current directory — owner/repo are read from ./repo.yml, so this is the easiest way to update an existing export:

cd path/to/github-data && github-export

Flags

Flag Example What it does
--max-age --max-age=2y, 6mo, 4w, 30d, 12h Only fetch issues/PRs/projects updated within this window. Useful for the first sync of very large repos (cli/cli, kubernetes/kubernetes).

--max-age is a floor on the effective since: on later runs the more-recent synced_at from repo.yml takes over automatically, so you can leave the flag in your sync command. Months count as 30 days and years as 365 days (approximate, not calendar-correct).

What it syncs

Data Output Sync mode
Labels github-data/labels.yml Always full sync
Milestones github-data/milestones.yml Always full sync
Issues + PRs github-data/issues/0042.md Incremental
Projects (v2) github-data/projects/0001.md Incremental
Discussions github-data/discussions/0042.md Incremental
Releases github-data/releases/v1.0.0.md Always full sync
Repo metadata github-data/repo.yml Updated each run
Events github-data/events/*.md Since last sync

Issues and pull requests

Each issue or PR is stored as a single markdown file with YAML frontmatter. Comments, reviews, review comments, and events follow as additional YAML documents separated by ---, in chronological order.

For pull requests, the program fetches branch info and merge status from the Pulls REST API, and all PR reviews in one paginated GraphQL query (one query per ~100 PRs instead of one REST call per PR). PRs with more than 100 reviews log a warning and only the first 100 are exported.

Projects (v2)

Each project linked to the repository is exported as a single markdown file under github-data/projects/, named by the project number (0001.md). The frontmatter has the project's title, owner, URL, description, and field definitions (e.g. the Status column's options). Each item (issue or PR) on the project follows as an item sub-document with its current field values.

Only open projects are written. Closed ones are removed on next sync and emit a project_closed event. Draft issues (project-only items with no real issue number) are skipped.

When an issue or PR is on a project, the issue's frontmatter gains a projects: list. This is populated on the next sync that re-fetches the issue — for issues that haven't changed since the last sync, the field will appear the first time the issue is touched on GitHub.

Projects v2 is GraphQL-only, so this section uses the GitHub GraphQL endpoint instead of REST.

Discussions

One file per discussion at github-data/discussions/<number>.md. Frontmatter includes category, state, state_reason, and — for Q&A categories where a reply has been marked as the answer — answer_id, answer_chosen_at, answer_chosen_by. Top-level comments are emitted as document: comment and nested replies as document: reply with a parent_id field linking back to the comment they reply to.

Discussions are GraphQL-only. The list is fetched newest-first (UPDATED_AT DESC) and pagination stops when an item is older than the effective since, so --max-age is fully respected. Each page pulls up to 50 discussions with up to 100 comments and 50 replies per comment; if a single discussion exceeds those limits a warning is logged and only the first N entries are exported.

Incremental sync

On second run, the program reads synced_at from repo.yml and passes it as ?since= to the GitHub Issues API. Only issues with updated_at >= synced_at are re-fetched. Each touched issue file is fully rebuilt from the timeline endpoint.

When --max-age is set on the first sync, the same ?since= filter is used with the cutoff timestamp. The PR list is fetched sorted newest-first and pagination stops once it crosses the cutoff, so very large repos are practical to sync.

Rate limiting

Automatically sleeps when X-RateLimit-Remaining drops below 100, resuming when the reset window passes. A typical incremental sync of 50 updated issues costs ~200 API requests, well within the 5,000/hour limit.

Events

Each sync detects changes (new issues, closed PRs, new comments, etc.) and exports them as individual markdown files in github-data/events/. These files are designed as a handoff point: an agent reads them, acts on them, and deletes them. No configuration is needed; events are always exported when detected.

Event types:

  • Issues and PRs: issue_created, issue_closed, issue_reopened, pr_created, pr_merged, pr_closed, pr_reopened, comment_created
  • PR review loop: pr_review_requested, pr_reviewed (with review_state), pr_ready_for_review
  • Triage / activity: assigned, unassigned, label_added, label_removed, mentioned, linked_to_pr, duplicate_marked
  • Discussions: discussion_created, discussion_closed, discussion_answered, discussion_comment_created
  • Projects: project_created, project_closed, item_added, item_removed, item_status_changed, item_field_changed
  • Releases: release_published, prerelease_promoted

Timeline-derived events (comment_created, assigned, label_added, pr_reviewed, …) only fire on incremental syncs. The first full sync emits *_created events for each entity but skips the per-timeline history that already happened — otherwise it would flood events/ with years of activity.

Example event file (events/20240916-140000-000-issue_created-42.md):

---
event: issue_created
number: 42
title: Fix crash on empty input
author: octocat
state: open
labels:
  - bug
file: github-data/issues/0042.md
repo: owner/repo
url: https://github.com/owner/repo/issues/42
exported_at: 2024-09-16T14:00:00Z
---

Whenever I ...

Event detection adds no extra API calls — it compares the data already fetched during sync against the previous on-disk state.

Output format

See docs/data-model.md for the full format specification.

Example issue file:

---
number: 42
title: Fix crash on empty input
state: closed
state_reason: completed
created_at: 2024-09-15T10:30:00Z
updated_at: 2024-09-16T14:00:00Z
closed_at: 2024-09-16T14:00:00Z
author: octocat
assignees:
  - hubot
labels:
  - bug
milestone: v2.1
---

When passing an empty string to `parse()`, the application crashes.

---
document: comment
id: 100
author: hubot
created_at: 2024-09-15T11:00:00Z
---

I can reproduce this.

---
document: event
event: closed
actor: hubot
created_at: 2024-09-16T14:00:00Z
commit_sha: abc123f
---

Testing

go test -short ./...                       # unit tests only
GITHUB_TOKEN=$(gh auth token) go test ./...   # full end-to-end suite

The end-to-end suite (e2e_test.go) runs a real sync of this repository's TEST-prefixed fixtures (issues #1–#3, PRs #4–#6, discussions #7–#8, project, releases, milestones) and asserts the exported files match. It auto-skips when GITHUB_TOKEN is unset or -short is passed.

Installation

go build .
sudo install -m 755 github-export /usr/local/bin/

Requires Go 1.22+.

Requirements

  • GITHUB_TOKEN environment variable with repo read access
  • Get one via gh auth token (GitHub CLI) or create a personal access token

About

Sync Github issues, PRs and comments to Markdown for AI usage

Topics

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages