Replace the documentation build system with an AsciidoctorJ extension by Cole-Greer · Pull Request #3455 · apache/tinkerpop

Cole-Greer · 2026-06-09T17:49:27Z

Summary

This PR replaces TinkerPop's legacy shell/AWK documentation preprocessor + postprocessor pipeline with a Maven-based AsciidoctorJ extension (tools/tinkerpop-docs). The new extension walks each AsciiDoc book's AST, executes [gremlin-groovy] code blocks against a long-lived Gremlin Console subprocess, and renders
the console output as tabbed, syntax-highlighted HTML — producing output structurally equivalent to the published 3.7.7-SNAPSHOT docs while being easier to maintain, test, and run.

Motivation

The old build was a fragile pipeline of bash + awk scripts under docs/preprocessor/ and docs/postprocessor/ that was hard to test, OS-sensitive (required GNU coreutils on macOS), silently swallowed Gremlin execution errors, and depended on a manually configured pseudo-distributed Hadoop cluster. The replacement
is a single Maven module with unit tests, fail-fast error handling, and a local-filesystem Hadoop configuration that needs no daemons.

What changed

New AsciidoctorJ extension (tools/tinkerpop-docs)

GremlinTreeprocessor — AST walk, block execution, per-graph initialization, sugar-plugin handling, and multi-line statement grouping.
GremlinConsole — manages the bin/gremlin.sh subprocess, prompt-based output capture, and error-prompt detection.
TabbedHtmlBuilder / GremlinPostprocessor — tabbed HTML output, CodeRay syntax highlighting (via JRuby), callout/conum rendering, and version substitution.
ConsoleRestartHandler / PluginDirectoryRestartHandler — per-book plugin isolation (see below).
SPI registration + a docs-specific local-filesystem Hadoop config (hadoop-conf/core-site.xml).
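
As a rough illustration of the prompt-based output capture attributed to GremlinConsole above, the sketch below reads a character stream until a prompt string appears. The class name PromptReader, the method readUntilPrompt, and the literal prompt "gremlin> " are assumptions made for this example; they are not taken from the extension's code.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

final class PromptReader {
    /** Prompt text assumed for illustration; the real console prompt may differ. */
    static final String PROMPT = "gremlin> ";

    /**
     * Reads characters until the accumulated text ends with the prompt and
     * returns everything before it. The prompt is not followed by a newline,
     * so line-based reading would block; hence the char-by-char loop.
     */
    static String readUntilPrompt(Reader in, String prompt) {
        StringBuilder buf = new StringBuilder();
        try {
            int c;
            while ((c = in.read()) != -1) {
                buf.append((char) c);
                int start = buf.length() - prompt.length();
                if (start >= 0 && buf.indexOf(prompt, start) == start) {
                    return buf.substring(0, start);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        throw new IllegalStateException("stream closed before the prompt appeared");
    }
}
```

A real implementation would also need a timeout around the read, since a hung statement never produces the next prompt.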

Orchestration — bin/process-docs.sh rewritten to validate the console/server distributions, install plugins, start a Gremlin Server and Gephi mock, and invoke Maven. Supports --dryRun (render without executing).

Per-book plugin isolation — Neo4j 3.4 (Scala 2.11) and Spark (Scala 2.12) cannot share the console's flat classpath. A :gremlin-docs-plugins-exclude: section attribute drives a console restart with the conflicting plugin directories toggled aside, so both the Neo4j and Spark examples render correctly in the
same run. Plugin dependencies are installed into ext/<plugin>/plugin/ (not the shared lib/) so they can be isolated, and the toggle is idempotent/resilient to interrupted builds.

Docs source updates

Added :gremlin-docs-plugins-exclude: attributes to the neo4j, hadoop, spark, and gremlin-variants chapters.
Scoped the Hadoop hdfs.ls() examples to the copied graph file so rendered docs avoid listing the build machine's home directory.
Fixed an undefined-variable typo (marko → vMarko) and converted the Spark-on-YARN recipe to a static example (executing it would require a live YARN cluster).
Rewrote the developer-doc "Documentation Environment" section to describe the new Maven/AsciidoctorJ build and removed the retired preprocessor references.

Removed — the entire docs/preprocessor/ and docs/postprocessor/ script trees (15 files).

Testing

92 unit tests in tools/tinkerpop-docs (console I/O, treeprocessor, tabbed HTML, postprocessor, dry-run, plugin-directory toggling), plus an integration fixture exercising gremlin blocks, manual/standalone tabs, blocks marked existing, errors, callouts, and version replacement.
A full bin/process-docs.sh build completes with BUILD SUCCESS while treating execution errors as fatal.
Output diffed against the published 3.7.7-SNAPSHOT docs across all 8 books: structural metrics (headings, listing blocks, tab sections, callouts) match within ~2%; zero stacktrace bloat; all differences attributable to intended source updates, the file:/// vs hdfs:// environment, or branch-vs-snapshot content
drift.

Tips for reviewers

I've taken the liberty of redeploying the 3.7.7-SNAPSHOT docs from this branch. I would recommend focusing the review on evaluating the built docs. There are a few notable differences worth calling out:

The CSharp tabs now have functioning syntax highlighting (as seen in the Basic Gremlin section of the reference docs)
The HDFS examples replace calls to hdfs.ls() with hdfs.ls('tinkerpop-modern.kryo'). This is a minor workaround, because the docs build uses the host machine's filesystem instead of running a local Hadoop cluster; the change avoids dumping the existing contents of the host's home directory. The old format could be restored by having the docs system internally manage a MiniDFSCluster. This is a viable fix, but I've left it out of scope for this PR to limit complexity.
The OLAP Spark YARN recipe has been converted to a static example; it is no longer executed during the docs build.

Future

The goal of this work was to replace the old docs system while keeping the docs output 1:1 equivalent. I think this new extension gives us a better platform for building future docs enhancements.

For 3.8 and above, it becomes quite trivial to link the gremlin-lang translators into all of the gremlin-groovy examples and automatically add tabs for all language variants (excluding groovy-specific examples).
There is some complexity in the system to load and unload console plugins depending on needs for each doc book (needed due to conflicting dependencies between spark and neo4j). This could be ripped out and simplified in master as neo4j and sparql plugins are no longer necessary.
I expect we can extend the new asciidoctor plugin to add new features to the docs, such as improved docs navigation and an integrated search capability.

Replace the awk-based preprocessor and shell postprocessor pipeline with a Java-based AsciidoctorJ extension (tinkerpop-docs module) that:
- Processes gremlin-groovy listing blocks via GremlinConsole subprocess
- Handles multi-line statement joining and callout stripping
- Generates tabbed HTML output for language variants
- Applies version substitution and callout fixes via postprocessor
- Auto-restarts console on timeout with block-level retry
- Falls back to dry-run output for blocks that fail after retry

The new bin/process-docs.sh orchestrates console/server setup and passes attributes to the AsciidoctorJ plugin via Maven properties.

Known issues to address:
- Echo pattern in console output needs stripping
- Second groovy tab (clean source) not yet generated
- Version x.y.z substitution not wired
- Missing CSS stylesheet in output
- Some callout markers rendered as raw text

- Strip command echo (first line) from console results so output matches published format: gremlin> stmt / ==>result
- Add second 'groovy' tab with clean source code (no prompts/output) to match the published two-tab format
- Pass tinkerpop-version attribute to all asciidoctor executions so the GremlinPostprocessor can substitute x.y.z with actual version
- Update tests for new tab count and graph.traversal() init

- Copy docs/{static,stylesheets} to staging area alongside docs/src/* to match old build behavior (provides tinkerpop.css to asciidoctor)
- Revert timeout from 120s to 30s since legitimate blocks complete quickly; infrastructure-dependent blocks will fail fast and retry

- Wrap tab content in <div class='listingblock'><div class='content'> to match published structure
- Render callout markers (<1>, <2>) as proper HTML conum elements with hide-when-copy spans instead of literal text
- Preserve callouts in console tab display (separate display vs execution statement lists)
- Process callouts per-line after HTML escaping

Use the CodeRay gem bundled with AsciidoctorJ to highlight generated code content in tabs. The JRuby runtime is accessed via JRubyRuntimeContext to call CodeRay::Duo directly, producing the same highlighted HTML spans as regular source blocks. Falls back to plain HTML escaping if CodeRay is unavailable.

CodeRay was escaping/mangling callout markers (<1>, <2>) during highlighting. Now callouts are extracted before highlighting, CodeRay processes clean source, and callout HTML is re-injected into the highlighted output at the correct line positions.
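
The extract, highlight, re-inject sequence described above can be sketched as follows. This is a minimal illustration, not the extension's code: the class CalloutHighlighter is hypothetical, the highlighter is passed in as a plain function standing in for CodeRay, the conum markup is an assumed shape, and the sketch assumes the highlighter preserves the number of lines.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CalloutHighlighter {
    // One or more callout markers such as " <1>" or " <1> <2>" at the end of a line.
    private static final Pattern TRAILING = Pattern.compile("((?:\\s*<\\d+>)+)\\s*$");
    private static final Pattern MARKER = Pattern.compile("<(\\d+)>");

    static String highlight(String source, UnaryOperator<String> highlighter) {
        String[] lines = source.split("\n", -1);
        List<String> conums = new ArrayList<>();   // callout HTML per line ("" if none)
        StringBuilder clean = new StringBuilder();
        for (int i = 0; i < lines.length; i++) {
            String code = lines[i];
            StringBuilder html = new StringBuilder();
            Matcher m = TRAILING.matcher(code);
            if (m.find()) {
                Matcher marker = MARKER.matcher(m.group(1));
                while (marker.find()) {
                    // Assumed markup; the published docs may use a different element.
                    html.append(" <b class=\"conum\">(").append(marker.group(1)).append(")</b>");
                }
                code = code.substring(0, m.start());
            }
            conums.add(html.toString());
            if (i > 0) clean.append('\n');
            clean.append(code);
        }
        // The highlighter only ever sees callout-free source.
        String[] highlighted = highlighter.apply(clean.toString()).split("\n", -1);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < highlighted.length; i++) {
            if (i > 0) out.append('\n');
            out.append(highlighted[i]);
            if (i < conums.size()) out.append(conums.get(i));
        }
        return out.toString();
    }
}
```

Keeping the callout HTML in a per-line list is what lets it be re-attached at the correct line positions after highlighting.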

- Handle console startup timeout gracefully (skip block instead of crashing the build)
- Cache CodeRay Duo encoder object in JRuby global variable
- Use heredoc syntax for source input to avoid escaping issues
- Build time reduced from 2.5 hours to under 2 minutes

Published docs show continuation lines with indentation only, not repeated gremlin> prompts. Also handle console startup failure gracefully during restart (skip block instead of crashing build).

Without -pl ., Maven runs process-resources across the entire reactor (30+ modules), adding hours of unnecessary processing. The asciidoc profile is only defined on the root pom.

evalScriptlet with embedded source code forced JRuby to parse a new Ruby script for every highlight call (~970 calls), taking 25-30s each. Now uses callMethod to invoke the cached CodeRay Duo object directly, passing source as a RubyString argument. Dry-run drops from hours to 16 seconds.

- Standalone tab groups now consume consecutive [source,lang] blocks even without the 'tab' attribute (matching published behavior)
- Callout conums use class="conum" (not "conum invisible") matching published format; use class="comment" for // spans
- Remove clear-shadow divs from tab HTML (not in published output)
- Remove postprocessor invisible/hide-when-copy transformations that were overriding the correct format

The Hadoop/Spark blocks use traversal().withEmbedded(graph) which takes ~23s for anonymous TraversalSource resolution, plus SparkGraphComputer first-execution overhead (~10s). With 30s timeout, these were right at the edge and intermittently failing. 60s provides comfortable headroom for all legitimate operations while still failing fast on genuinely broken blocks.

For multi-line statements, the console echoes all lines with continuation prompts (......N>) before the actual results. Previously only the first echo line was skipped, leaving continuation prompts in the output and making it appear the block had no results. Now skips all lines matching the continuation prompt pattern.
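
A minimal sketch of that echo-skipping step, assuming the raw console output arrives as a list of lines. The class EchoStripper and the exact regular expression are illustrative assumptions; the text only states that continuation prompts have the form ......N>.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class EchoStripper {
    // Continuation prompts look like "......1>"; the dot count is not fixed here.
    private static final Pattern CONTINUATION = Pattern.compile("^\\.+\\d+>.*");

    /** Drops the echoed statement (first line plus continuation lines), keeping results. */
    static List<String> stripEcho(List<String> rawLines) {
        List<String> results = new ArrayList<>();
        boolean inEcho = true;
        for (int i = 0; i < rawLines.size(); i++) {
            String line = rawLines.get(i);
            if (i == 0) continue;                                         // echoed first line
            if (inEcho && CONTINUATION.matcher(line).matches()) continue; // echoed continuation
            inEcho = false;                                               // first real result line
            results.add(line);
        }
        return results;
    }
}
```

Once the first non-prompt line is seen, later lines are kept even if they happen to resemble a prompt, so result text is not dropped by accident.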

Blocks using [gremlin-groovy,theCrew] were falling through to an empty TinkerGraph because only 'crew' was mapped. Added 'theCrew' as an alias for TinkerFactory.createTheCrew().
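
The alias fix amounts to a lookup with a default. The sketch below is hypothetical: the class GraphInit, the 'modern' entry, and the generated init statement are assumptions for illustration; only the 'crew'/'theCrew' mapping to TinkerFactory.createTheCrew() and the empty-TinkerGraph fallback come from the description above.

```java
import java.util.HashMap;
import java.util.Map;

final class GraphInit {
    private static final Map<String, String> FACTORIES = new HashMap<>();
    static {
        FACTORIES.put("modern", "TinkerFactory.createModern()");   // assumed entry
        FACTORIES.put("crew", "TinkerFactory.createTheCrew()");
        FACTORIES.put("theCrew", "TinkerFactory.createTheCrew()"); // alias added by this fix
    }

    /** Returns the console statement that prepares graph and g for a block. */
    static String initStatement(String graphName) {
        // Unknown names fall through to an empty in-memory graph.
        String factory = FACTORIES.getOrDefault(graphName, "TinkerGraph.open()");
        return "graph = " + factory + "; g = graph.traversal()";
    }
}
```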

Multi-line SPARQL queries use triple-quoted strings that span lines without leading whitespace (e.g., WHERE clauses). Track open/close of triple-quote pairs to keep them as single statements. Reduces skipped blocks from 13 to 3.
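
One way to picture the grouping logic is a small state machine that keeps a line attached to the current statement while a triple-quoted string or a bracket is still open, or while the line is indented. This is a sketch under those assumptions; the class StatementGrouper is hypothetical, and it ignores brackets inside ordinary quoted strings and ''' strings.

```java
import java.util.ArrayList;
import java.util.List;

final class StatementGrouper {
    static List<String> group(List<String> lines) {
        List<String> statements = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inTripleQuote = false;
        int depth = 0; // open (, [ or { outside triple-quoted strings
        for (String line : lines) {
            boolean indented = !line.isEmpty() && Character.isWhitespace(line.charAt(0));
            boolean continues = current.length() > 0 && (inTripleQuote || depth > 0 || indented);
            if (!continues && current.length() > 0) {
                statements.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) current.append('\n');
            current.append(line);
            // Update quote and bracket state from this line.
            for (int i = 0; i < line.length(); i++) {
                if (line.startsWith("\"\"\"", i)) {
                    inTripleQuote = !inTripleQuote;
                    i += 2;
                } else if (!inTripleQuote) {
                    char ch = line.charAt(i);
                    if (ch == '(' || ch == '[' || ch == '{') depth++;
                    else if (ch == ')' || ch == ']' || ch == '}') depth = Math.max(0, depth - 1);
                }
            }
        }
        if (current.length() > 0) statements.add(current.toString());
        return statements;
    }
}
```

With this state, an unindented WHERE line inside a triple-quoted query stays with its statement, and a multi-line closure stays grouped until its closing brace.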

The sparql-gremlin plugin requires:
1. plugin/ directory with main JAR for SPI discovery
2. Registration in plugins.txt for activation at startup
3. All dependency JARs on the main classpath (lib/) because the ext/ child classloader doesn't properly share Jena classes

Also register hadoop, spark, and neo4j plugins in plugins.txt.

Standalone plugins (hadoop-gremlin, spark-gremlin) need:
1. A plugin/ directory with main JAR for SPI-based plugin discovery
2. All dependency JARs on the main classpath (lib/) for proper classloading of HadoopGraph, SparkGraphComputer, etc.

This mirrors the fix already applied for non-standalone plugins (sparql-gremlin, neo4j-gremlin).

The nested g.inject(g.withComputer()...) block triggers cold Spark initialization which can take 50-60s. 90s gives comfortable headroom while still serving as a failsafe against genuine hangs.
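
The timeout-as-failsafe pattern, combined with a single restart-and-retry, could look roughly like the following. The class TimedExecutor and its parameters are assumptions for illustration; in the real build the block would be a console statement and the restart hook would bounce the Gremlin Console subprocess.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class TimedExecutor {
    /** Runs the block with a timeout; on the first timeout, restarts and retries once. */
    static <T> T runWithRetry(Callable<T> block, Runnable restart, long timeoutMillis) {
        for (int attempt = 1; ; attempt++) {
            ExecutorService pool = Executors.newSingleThreadExecutor();
            Future<T> future = pool.submit(block);
            try {
                return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                future.cancel(true);
                if (attempt >= 2) {
                    throw new IllegalStateException("block timed out after retry", e);
                }
                restart.run(); // e.g. replace the hung console before retrying
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            } catch (ExecutionException e) {
                // A genuine execution error is not retried; it propagates immediately.
                throw new IllegalStateException("block failed", e.getCause());
            } finally {
                pool.shutdownNow();
            }
        }
    }
}
```

A call such as runWithRetry(block, console::restart, 90_000) would mirror the 90s value discussed above, where console::restart is a placeholder for whatever restart hook is available.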

Two fixes for the SPARQL/remote-connect cascade failure:
1. process-docs.sh: fail fast if port 8182 is already in use (stale server from a prior run). Previously the nc readiness check would pass against a stale/incompatible server, causing WebSocket handshake failures that dumped ~500-line Netty stacktraces into every :remote connect block. Also detect early server-process exit (e.g. bind failure) instead of waiting the full 30s timeout.
2. GremlinTreeprocessor: add a 2s delay after closing a dead console before restarting, letting the OS reclaim resources (ports, memory) from Spark/Hadoop blocks so the SPARQL section that follows can recover instead of cascading into repeated timeouts.
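
The port check itself lives in process-docs.sh; purely to illustrate the idea in Java, a fail-fast check can try to bind the port rather than connect to it. The class PortCheck is hypothetical and is not part of the PR.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;

final class PortCheck {
    /** Returns true if the loopback port can be bound, i.e. nothing is listening on it. */
    static boolean isPortFree(int port) {
        try (ServerSocket probe = new ServerSocket(port, 1, InetAddress.getLoopbackAddress())) {
            return true;  // bind succeeded; the socket is released on close
        } catch (IOException e) {
            return false; // bind failed, most likely because the port is occupied
        }
    }
}
```

Binding distinguishes "some process owns this port" from "our server is ready", which a connect-style readiness probe cannot do on its own.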

The ':remote connect' docs blocks target localhost:8182. Any process occupying that port (stale Gremlin Server or an unrelated service) causes our server to fail binding while nc -z still passes, so the console connects to the wrong service and WebSocket handshakes fail. Updated the fail-fast message to not assume the cause is a stale server and to point at lsof for identifying the actual process.

Two fixes for no-result blocks:
1. initGraphIfNeeded now re-initializes graph + g for EVERY non-existing block, matching the old preprocessor behavior. Previously it skipped re-init when the graph name matched the prior block, so a block that reassigned g (e.g. 'g = traversal().withEmbedded(graph).withComputer()') leaked the OLAP/mutated source into later blocks that expected a fresh OLTP 'g' (e.g. path().by() blocks returned nothing under GraphComputer).
2. Detect sugar syntax (g.V, g.V[0..2], etc.) and call SugarLoader.load() on a fresh console for those blocks, restarting afterward so the permanent Groovy metaclass mutation doesn't leak into other blocks.
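
Sugar detection can be approximated with a regular expression that looks for g.V or g.E used without call parentheses. The class SugarDetector and the pattern are assumptions for illustration; this heuristic may miss other sugar forms, and the extension's actual detection rule is not shown in this PR description.

```java
import java.util.regex.Pattern;

final class SugarDetector {
    // g.V or g.E not followed by "(", e.g. "g.V[0..2]" or "g.V.name".
    private static final Pattern SUGAR = Pattern.compile("\\bg\\.[VE]\\b(?!\\s*\\()");

    static boolean usesSugar(String blockSource) {
        return SUGAR.matcher(blockSource).find();
    }
}
```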

Rework the docs build so a genuine Gremlin error fails the build rather than rendering as a silently-empty block. Investigation confirmed no executed gremlin-groovy block is expected to error: error output goes to the console's stderr (the "Display stack trace?" prompt), which neither the old nor new build captured, and all error examples in the docs are hand-authored [source,text] blocks.

GremlinConsole now records the stderr error prompt and surfaces it from execute() as a GremlinExecutionException; the treeprocessor propagates it (fatal) and the silent dry-run fallback is removed. The 90s timeout remains a failsafe and single restart-and-retry recovery is preserved.

Enabling this surfaced several previously-masked setup issues, also fixed:
- buildStatements tracks bracket depth so multi-line Groovy closures (e.g. "(1..10).each { ... }; []") stay grouped instead of hanging at a continuation prompt.
- initGraphIfNeeded closes the prior graph and clears /tmp/neo4j and /tmp/tinkergraph.kryo before each block, so a stale Neo4j store lock no longer hangs Neo4jGraph.open (mirrors the old preprocessor).
- SugarLoader runs on a freshly restarted console so it takes effect on a pristine Groovy metaclass.
- process-docs.sh resolves the Neo4j DB impl onto the console classpath, strips only the conflicting io.netty 4.1.24 (keeping netty-3.9.x that Neo4j needs), and registers TinkerGraph and Credential plugins.

REQUIREMENTS.md FR-4 updated. A remaining Neo4j/Spark Scala classpath conflict requiring per-book plugin isolation is tracked separately. (tinkerpop-6jq.14)

Assisted-by: Kiro:claude-opus-4.8 [Kiro CLI]

Neo4j 3.4 (Scala 2.11) and Spark (Scala 2.12) cannot share the docs console's flat classpath. Wire the previously-unwired plugin-exclusion scaffolding so the console restarts with conflicting plugins removed:
- Add PluginDirectoryRestartHandler that toggles ext/<plugin> dirs and keeps ext/plugins.txt in sync (the console drops unlisted plugins whose jars vanish on restart).
- GremlinTreeprocessor: detect :gremlin-docs-plugins-exclude: at section granularity during the AST walk (document baseline + per-section override with latching), bouncing the console when the set changes. Default to the directory handler in production; tests inject their own. Also :set max-iteration 100 on console start to match published output.
- process-docs.sh: install each plugin's deps into ext/<plugin>/plugin/ (deduped vs lib/) instead of the shared lib/, so conflicting deps are isolatable; write plugins.txt deterministically to avoid stale state.
- Add :gremlin-docs-plugins-exclude: attributes to the neo4j, hadoop, spark and gremlin-variants chapters with explanatory comments; update the stale reference index comment and developer docs.
- Fix an undefined-variable typo (marko -> vMarko) and render the olap-spark-yarn recipe (which needs a real YARN/HDFS cluster) as a non-executed block with hardcoded output.
- Set asciidoctor.gemPath under target/ so the JRuby gem extraction no longer creates a gems/ directory at the repo root.

Assisted-by: Kiro:claude-opus-4.8 [kiro-cli]
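
The "restart only when the exclusion set changes" behaviour can be sketched as a small tracker. Everything named here is hypothetical: the class PluginExclusionTracker, the comma-separated attribute format, and the convention that a null value means "section does not set the attribute, keep the current set".

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

final class PluginExclusionTracker {
    private Set<String> active = new TreeSet<>();

    /** Parses an attribute value such as "neo4j-gremlin, spark-gremlin" (format assumed). */
    static Set<String> parse(String value) {
        return Arrays.stream(value.split(","))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toCollection(TreeSet::new));
    }

    /**
     * Called once per section during the walk. Returns true when the console
     * needs a restart because the requested exclusion set differs from the
     * one currently in effect. A null value latches the previous set.
     */
    boolean onSection(String attributeValue) {
        if (attributeValue == null) return false;
        Set<String> requested = parse(attributeValue);
        if (requested.equals(active)) return false;
        active = requested;
        return true;
    }

    Set<String> active() {
        return active;
    }
}
```

Comparing sets rather than raw strings means reordering or re-spacing the attribute value does not trigger a needless console bounce.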

…rpop-6jq.11) The old shell/AWK preprocessor and postprocessor directories have been removed, but the developer documentation still described that system. Rewrite the "Documentation Environment" section to describe the Maven-based AsciidoctorJ extension: it now states the build is Maven-driven, runs OLAP examples against the local filesystem (fs.defaultFS=file:///) so no Hadoop cluster is required, notes the Spark-on-YARN recipe is rendered from pre-captured output, and adds the prerequisite distribution build and --dryRun option. Drop the obsolete pseudo-distributed Hadoop / yarn-site / mapred-site instructions and the AWK/GNU-utils requirements. Point the OLAP jar-conflict note at the new per-book plugin exclusion mechanism, and update stale "preprocessor" wording in the committer docs. Assisted-by: Kiro:claude-opus-4.8 [kiro-cli]

…erpop-6jq.7) processStandaloneTabGroup emitted tab code via TabbedHtmlBuilder.codeTab, which passes the source through verbatim, so standalone [source,<lang>,tab] groups and manual language-variant tab groups (e.g. the driver-connection examples in the "Gremlin Server" / connecting-gremlin-server section) rendered without CodeRay highlighting, unlike the published docs. Route these blocks through highlightAsSource + codeTabHighlighted, the same path the gremlin-groovy tabs use. Reference-book CodeRay span count now matches/exceeds published and the connecting-gremlin-server examples render with the expected keyword/string/comment highlighting. Assisted-by: Kiro:claude-opus-4.8 [kiro-cli]

…q.7) The docs build runs the Hadoop-Gremlin OLTP/OLAP examples against the local filesystem (fs.defaultFS=file:///). Hadoop's RawLocalFileSystem resolves a bare hdfs.ls() to getHomeDirectory(), which reads the JVM user.home, so the rendered docs listed the entire contents of the build machine's home directory instead of the clean HDFS home the published docs show. Change the bare hdfs.ls() calls in implementations-hadoop-start and implementations-hadoop-end to hdfs.ls('tinkerpop-modern.kryo') so they list only the graph file the example just copied -- deterministic output with no home-directory leakage, and no change to the shared hadoop-gryo input location. (A full MiniDFSCluster would reproduce the published HDFS output exactly; that is tracked separately under tinkerpop-6jq.12.) Assisted-by: Kiro:claude-opus-4.8 [kiro-cli]

…nkerpop-6jq.7) PluginDirectoryRestartHandler moved ext/<plugin> to ext-disabled/<plugin> with Files.move(REPLACE_EXISTING), which throws DirectoryNotEmptyException when ext-disabled/<plugin> already exists as a non-empty directory. An interrupted docs build leaves such a directory, poisoning the next run with "Failed to restart console with excluded plugins". Make the toggle idempotent and source-authoritative: clear any stale destination before moving when disabling, and when enabling drop a leftover disabled duplicate if the plugin is already present in ext/. Also clear ext-disabled/ at the start of bin/process-docs.sh so each build begins from a known state. Adds PluginDirectoryRestartHandlerTest covering the round-trip, double-exclude, and stale-state scenarios. Assisted-by: Kiro:claude-opus-4.8 [kiro-cli]
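
A minimal sketch of an idempotent, source-authoritative directory toggle along the lines described above. The class PluginToggle and its method signatures are assumptions; the sketch also leaves out the plugins.txt synchronisation and assumes both directories sit on the same filesystem so that Files.move can rename a non-empty directory.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

final class PluginToggle {
    /** Moves ext/<plugin> aside, clearing any stale copy left by an interrupted build. */
    static void disable(Path extDir, Path disabledDir, String plugin) {
        Path src = extDir.resolve(plugin);
        Path dst = disabledDir.resolve(plugin);
        try {
            if (!Files.exists(src)) return;   // already disabled: nothing to do
            deleteRecursively(dst);           // stale destination would break the move
            Files.createDirectories(disabledDir);
            Files.move(src, dst);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Restores the plugin; if ext/<plugin> already exists, it wins and the leftover is dropped. */
    static void enable(Path extDir, Path disabledDir, String plugin) {
        Path src = disabledDir.resolve(plugin);
        Path dst = extDir.resolve(plugin);
        try {
            if (Files.exists(dst)) {
                deleteRecursively(src);       // drop the disabled duplicate
                return;
            }
            if (Files.exists(src)) {
                Files.createDirectories(extDir);
                Files.move(src, dst);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static void deleteRecursively(Path root) throws IOException {
        if (!Files.exists(root)) return;
        try (Stream<Path> walk = Files.walk(root)) {
            // Reverse order deletes children before their parent directories.
            walk.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
    }
}
```

Because each call first checks the current on-disk state, running disable or enable twice in a row leaves the same result as running it once.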

Cole-Greer added 28 commits May 21, 2026 14:00

Fix multi-line statement display: only first line gets gremlin> prompt

8e5a3d4

Restrict docs Maven build to root module only (-pl .)

7909b2c

Add theCrew graph name mapping for doc blocks

d7d10d2

Handle triple-quoted strings in statement grouping

e679cfa

Increase timeout to 90s to accommodate Block 184 (shortestPath)

95ff473

cleanup

6d064e3

Cole-Greer mentioned this pull request Jun 9, 2026

Replace AWK/shell docs preprocessing with AsciidoctorJ extension #3418

Closed

Cole-Greer wants to merge 28 commits into master from docs-3.7
