Replace the documentation build system with an AsciidoctorJ extension#3455

Open
Cole-Greer wants to merge 28 commits into master from docs-3.7

@Cole-Greer

Contributor

Summary

This PR replaces TinkerPop's legacy shell/AWK documentation preprocessor + postprocessor pipeline with a Maven-based AsciidoctorJ extension (tools/tinkerpop-docs). The new extension walks each AsciiDoc book's AST, executes [gremlin-groovy] code blocks against a long-lived Gremlin Console subprocess, and renders
the console output as tabbed, syntax-highlighted HTML — producing output structurally equivalent to the published 3.7.7-SNAPSHOT docs while being easier to maintain, test, and run.

Motivation

The old build was a fragile pipeline of bash + awk scripts under docs/preprocessor/ and docs/postprocessor/ that was hard to test, OS-sensitive (required GNU coreutils on macOS), silently swallowed Gremlin execution errors, and depended on a manually configured pseudo-distributed Hadoop cluster. The replacement
is a single Maven module with unit tests, fail-fast error handling, and a local-filesystem Hadoop configuration that needs no daemons.

What changed

New AsciidoctorJ extension (tools/tinkerpop-docs)

  • GremlinTreeprocessor — AST walk, block execution, per-graph initialization, sugar-plugin handling, and multi-line statement grouping.
  • GremlinConsole — manages the bin/gremlin.sh subprocess, prompt-based output capture, and error-prompt detection.
  • TabbedHtmlBuilder / GremlinPostprocessor — tabbed HTML output, CodeRay syntax highlighting (via JRuby), callout/conum rendering, and version substitution.
  • ConsoleRestartHandler / PluginDirectoryRestartHandler — per-book plugin isolation (see below).
  • SPI registration + a docs-specific local-filesystem Hadoop config (hadoop-conf/core-site.xml).

Orchestration — bin/process-docs.sh rewritten to validate the console/server distributions, install plugins, start a Gremlin Server and Gephi mock, and invoke Maven. Supports --dryRun (render without executing).

Per-book plugin isolation — Neo4j 3.4 (Scala 2.11) and Spark (Scala 2.12) cannot share the console's flat classpath. A :gremlin-docs-plugins-exclude: section attribute drives a console restart with the conflicting plugin directories toggled aside, so both the Neo4j and Spark examples render correctly in the
same run. Plugin dependencies are installed into ext/<plugin>/plugin/ (not the shared lib/) so they can be isolated, and the toggle is idempotent and resilient to interrupted builds.

Docs source updates

  • Added :gremlin-docs-plugins-exclude: attributes to the neo4j, hadoop, spark, and gremlin-variants chapters.
  • Scoped the Hadoop hdfs.ls() examples to the copied graph file so rendered docs avoid listing the build machine's home directory.
  • Fixed an undefined-variable typo (marko → vMarko) and converted the Spark-on-YARN recipe to a static example, since executing it requires a live YARN cluster.
  • Rewrote the developer-doc "Documentation Environment" section to describe the new Maven/AsciidoctorJ build and removed the retired preprocessor references.

Removed — the entire docs/preprocessor/ and docs/postprocessor/ script trees (15 files).

Testing

  • 92 unit tests in tools/tinkerpop-docs (console I/O, treeprocessor, tabbed HTML, postprocessor, dry-run, plugin-directory toggling), plus an integration fixture exercising gremlin blocks, manual/standalone tabs, existing blocks, errors, callouts, and version replacement.
  • A full bin/process-docs.sh build completes with BUILD SUCCESS while treating execution errors as fatal.
  • Output diffed against the published 3.7.7-SNAPSHOT docs across all 8 books: structural metrics (headings, listing blocks, tab sections, callouts) match within ~2%; zero stacktrace bloat; all differences attributable to intended source updates, the file:/// vs hdfs:// environment, or branch-vs-snapshot content
    drift.

Tips for reviewers

I've taken the liberty of redeploying the 3.7.7-SNAPSHOT docs from this branch. I would recommend focusing the review on evaluating the built docs. There are a few notable differences worth calling out:

  • The CSharp tabs now have functioning syntax highlighting (as seen in the Basic Gremlin section of the reference docs)
  • The HDFS examples replace calls to hdfs.ls() with hdfs.ls('tinkerpop-modern.kryo'). This is a minor workaround: the docs build uses the host machine's filesystem instead of running a local Hadoop cluster, and the change avoids dumping the contents of the host's home directory. The old format could be restored by having the docs system internally manage a MiniDFSCluster. This is a viable fix, but I've left it out of scope for this PR to limit complexity.
  • The OLAP Spark YARN recipe has been converted to a static example; it is no longer executed during the docs build.

Future

The goal of this work was to replace the old docs system while keeping the docs output 1:1 equivalent. I think this new extension gives us a better platform for building future docs enhancements.

  • For 3.8 and above, it becomes quite trivial to link the gremlin-lang translators into all of the gremlin-groovy examples, and automatically add tabs for all language variants (excluding groovy-specific examples)
  • There is some complexity in the system to load and unload console plugins depending on needs for each doc book (needed due to conflicting dependencies between spark and neo4j). This could be ripped out and simplified in master as neo4j and sparql plugins are no longer necessary.
  • I expect we can extend the new asciidoctor plugin to add new features to the docs, such as improved docs navigation and an integrated search capability.

Cole-Greer added 28 commits May 21, 2026 14:00
Replace the awk-based preprocessor and shell postprocessor pipeline with
a Java-based AsciidoctorJ extension (tinkerpop-docs module) that:

- Processes gremlin-groovy listing blocks via GremlinConsole subprocess
- Handles multi-line statement joining and callout stripping
- Generates tabbed HTML output for language variants
- Applies version substitution and callout fixes via postprocessor
- Auto-restarts console on timeout with block-level retry
- Falls back to dry-run output for blocks that fail after retry

The new bin/process-docs.sh orchestrates console/server setup and passes
attributes to the AsciidoctorJ plugin via Maven properties.

Known issues to address:
- Echo pattern in console output needs stripping
- Second groovy tab (clean source) not yet generated
- Version x.y.z substitution not wired
- Missing CSS stylesheet in output
- Some callout markers rendered as raw text
- Strip command echo (first line) from console results so output
  matches published format: gremlin> stmt / ==>result
- Add second 'groovy' tab with clean source code (no prompts/output)
  to match the published two-tab format
- Pass tinkerpop-version attribute to all asciidoctor executions so
  the GremlinPostprocessor can substitute x.y.z with actual version
- Update tests for new tab count and graph.traversal() init
- Copy docs/{static,stylesheets} to staging area alongside docs/src/*
  to match old build behavior (provides tinkerpop.css to asciidoctor)
- Revert timeout from 120s to 30s since legitimate blocks complete
  quickly; infrastructure-dependent blocks will fail fast and retry
- Wrap tab content in <div class='listingblock'><div class='content'>
  to match published structure
- Render callout markers (<1>, <2>) as proper HTML conum elements
  with hide-when-copy spans instead of literal text
- Preserve callouts in console tab display (separate display vs
  execution statement lists)
- Process callouts per-line after HTML escaping
Use the CodeRay gem bundled with AsciidoctorJ to highlight generated
code content in tabs. The JRuby runtime is accessed via
JRubyRuntimeContext to call CodeRay::Duo directly, producing the same
highlighted HTML spans as regular source blocks.

Falls back to plain HTML escaping if CodeRay is unavailable.
CodeRay was escaping/mangling callout markers (<1>, <2>) during
highlighting. Now callouts are extracted before highlighting, CodeRay
processes clean source, and callout HTML is re-injected into the
highlighted output at the correct line positions.
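
For reviewers, a minimal sketch of the extract-then-reinject approach described above. It is not the PR's code: `CalloutSketch`, `highlightWithCallouts`, and the `highlighter` parameter are hypothetical names, the highlighter is assumed to preserve the number of lines, and only one trailing callout per line is handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; names do not come from the PR.
final class CalloutSketch {
    // A trailing callout marker such as " <1>" or " // <2>" at the end of a line.
    private static final Pattern CALLOUT =
            Pattern.compile("\\s*(?://\\s*)?<(\\d+)>\\s*$");

    static String highlightWithCallouts(String source, UnaryOperator<String> highlighter) {
        List<String> clean = new ArrayList<>();
        List<String> callouts = new ArrayList<>(); // null entry = no callout on that line
        for (String line : source.split("\n", -1)) {
            Matcher m = CALLOUT.matcher(line);
            if (m.find()) {
                clean.add(line.substring(0, m.start())); // source without the marker
                callouts.add(m.group(1));
            } else {
                clean.add(line);
                callouts.add(null);
            }
        }
        // The highlighter only ever sees marker-free source.
        String[] highlighted = highlighter.apply(String.join("\n", clean)).split("\n", -1);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < highlighted.length; i++) {
            if (i > 0) {
                out.append('\n');
            }
            out.append(highlighted[i]);
            // Re-inject the callout on the same line index it was taken from.
            if (i < callouts.size() && callouts.get(i) != null) {
                out.append(" <b class=\"conum\">(").append(callouts.get(i)).append(")</b>");
            }
        }
        return out.toString();
    }
}
```

Passing the highlighter as a function keeps the sketch independent of CodeRay; in the real extension that role is played by the JRuby call.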
- Handle console startup timeout gracefully (skip block instead of
  crashing the build)
- Cache CodeRay Duo encoder object in JRuby global variable
- Use heredoc syntax for source input to avoid escaping issues
- Build time reduced from 2.5 hours to under 2 minutes
Published docs show continuation lines with indentation only, not
repeated gremlin> prompts. Also handle console startup failure
gracefully during restart (skip block instead of crashing build).
Without -pl ., Maven runs process-resources across the entire reactor
(30+ modules), adding hours of unnecessary processing. The asciidoc
profile is only defined on the root pom.
evalScriptlet with embedded source code forced JRuby to parse a new
Ruby script for every highlight call (~970 calls), taking 25-30s each.
Now uses callMethod to invoke the cached CodeRay Duo object directly,
passing source as a RubyString argument. Dry-run drops from hours to
16 seconds.
- Standalone tab groups now consume consecutive [source,lang] blocks
  even without the 'tab' attribute (matching published behavior)
- Callout conums use class="conum" (not "conum invisible") matching
  published format; use class="comment" for // spans
- Remove clear-shadow divs from tab HTML (not in published output)
- Remove postprocessor invisible/hide-when-copy transformations that
  were overriding the correct format
The Hadoop/Spark blocks use traversal().withEmbedded(graph) which takes
~23s for anonymous TraversalSource resolution, plus SparkGraphComputer
first-execution overhead (~10s). With 30s timeout, these were right at
the edge and intermittently failing. 60s provides comfortable headroom
for all legitimate operations while still failing fast on genuinely
broken blocks.
For multi-line statements, the console echoes all lines with
continuation prompts (......N>) before the actual results. Previously
only the first echo line was skipped, leaving continuation prompts in
the output and making it appear the block had no results. Now skips
all lines matching the continuation prompt pattern.
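
A rough sketch of the echo-stripping rule described above, assuming the raw capture arrives as a list of lines with the echoed command first. `EchoStripSketch` and `stripEcho` are hypothetical names, and the regex is my reading of the `......N>` prompt shape rather than the pattern used in the PR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only; names do not come from the PR.
final class EchoStripSketch {
    // One or more dots, a digit sequence, then '>' (e.g. "......1> out()").
    private static final Pattern CONTINUATION = Pattern.compile("^\\.+\\d+>.*");

    static List<String> stripEcho(List<String> rawLines) {
        List<String> results = new ArrayList<>();
        for (int i = 0; i < rawLines.size(); i++) {
            String line = rawLines.get(i);
            if (i == 0) {
                continue; // the first line is the echoed statement itself
            }
            if (CONTINUATION.matcher(line).matches()) {
                continue; // echoed continuation line of a multi-line statement
            }
            results.add(line);
        }
        return results;
    }
}
```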
Blocks using [gremlin-groovy,theCrew] were falling through to an
empty TinkerGraph because only 'crew' was mapped. Added 'theCrew'
as an alias for TinkerFactory.createTheCrew().
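
To illustrate the alias fix, here is a small sketch of a name-to-factory lookup with an empty-TinkerGraph fallback. The class, the map contents beyond 'crew' and 'theCrew', and the shape of the emitted init statement are assumptions for illustration, not the treeprocessor's actual table.

```java
import java.util.Map;

// Hypothetical illustration only; names do not come from the PR.
final class GraphInitSketch {
    // Graph name from the block attribute -> Groovy expression that creates it.
    private static final Map<String, String> FACTORIES = Map.of(
            "modern", "TinkerFactory.createModern()",
            "crew", "TinkerFactory.createTheCrew()",
            "theCrew", "TinkerFactory.createTheCrew()"); // alias added by this fix

    static String initStatement(String graphName) {
        // Unknown or missing names fall back to an empty TinkerGraph.
        String factory = graphName == null
                ? "TinkerGraph.open()"
                : FACTORIES.getOrDefault(graphName, "TinkerGraph.open()");
        return "graph = " + factory + "; g = graph.traversal()";
    }
}
```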
Multi-line SPARQL queries use triple-quoted strings that span lines
without leading whitespace (e.g., WHERE clauses). Track open/close
of triple-quote pairs to keep them as single statements. Reduces
skipped blocks from 13 to 3.
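
A minimal sketch of the triple-quote tracking described above. It assumes the pre-existing rule is that a line with leading whitespace continues the previous statement, and it only tracks `"""` (not `'''`). `TripleQuoteGroupingSketch` is a hypothetical name, not the PR's buildStatements.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; names do not come from the PR.
final class TripleQuoteGroupingSketch {
    static int countTripleQuotes(String line) {
        int count = 0;
        int idx = 0;
        while ((idx = line.indexOf("\"\"\"", idx)) >= 0) {
            count++;
            idx += 3;
        }
        return count;
    }

    static List<String> group(List<String> lines) {
        List<String> statements = new ArrayList<>();
        StringBuilder current = null;
        boolean inTripleQuote = false;
        for (String line : lines) {
            // A line continues the current statement if we are inside an open
            // triple-quoted string or the line is indented.
            boolean continuation = current != null
                    && (inTripleQuote || line.startsWith(" ") || line.startsWith("\t"));
            if (continuation) {
                current.append('\n').append(line);
            } else {
                if (current != null) {
                    statements.add(current.toString());
                }
                current = new StringBuilder(line);
            }
            // An odd number of delimiters on a line flips the open/closed state.
            if (countTripleQuotes(line) % 2 == 1) {
                inTripleQuote = !inTripleQuote;
            }
        }
        if (current != null) {
            statements.add(current.toString());
        }
        return statements;
    }
}
```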
The sparql-gremlin plugin requires:
1. plugin/ directory with main JAR for SPI discovery
2. Registration in plugins.txt for activation at startup
3. All dependency JARs on the main classpath (lib/) because the
   ext/ child classloader doesn't properly share Jena classes

Also register hadoop, spark, and neo4j plugins in plugins.txt.
Standalone plugins (hadoop-gremlin, spark-gremlin) need:
1. A plugin/ directory with main JAR for SPI-based plugin discovery
2. All dependency JARs on the main classpath (lib/) for proper
   classloading of HadoopGraph, SparkGraphComputer, etc.

This mirrors the fix already applied for non-standalone plugins
(sparql-gremlin, neo4j-gremlin).
The nested g.inject(g.withComputer()...) block triggers cold Spark
initialization which can take 50-60s. 90s gives comfortable headroom
while still serving as a failsafe against genuine hangs.
Two fixes for the SPARQL/remote-connect cascade failure:

1. process-docs.sh: fail fast if port 8182 is already in use (stale
   server from a prior run). Previously the nc readiness check would
   pass against a stale/incompatible server, causing WebSocket
   handshake failures that dumped ~500-line Netty stacktraces into
   every :remote connect block. Also detect early server-process exit
   (e.g. bind failure) instead of waiting the full 30s timeout.

2. GremlinTreeprocessor: add a 2s delay after closing a dead console
   before restarting, letting the OS reclaim resources (ports, memory)
   from Spark/Hadoop blocks so the SPARQL section that follows can
   recover instead of cascading into repeated timeouts.
The ':remote connect' docs blocks target localhost:8182. Any process
occupying that port (stale Gremlin Server or an unrelated service)
causes our server to fail binding while nc -z still passes, so the
console connects to the wrong service and WebSocket handshakes fail.
Updated the fail-fast message to not assume the cause is a stale
server and to point at lsof for identifying the actual process.
Two fixes for no-result blocks:

1. initGraphIfNeeded now re-initializes graph + g for EVERY non-existing
   block, matching the old preprocessor behavior. Previously it skipped
   re-init when the graph name matched the prior block, so a block that
   reassigned g (e.g. 'g = traversal().withEmbedded(graph).withComputer()')
   leaked the OLAP/mutated source into later blocks that expected a fresh
   OLTP 'g' (e.g. path().by() blocks returned nothing under GraphComputer).

2. Detect sugar syntax (g.V, g.V[0..2], etc.) and call SugarLoader.load()
   on a fresh console for those blocks, restarting afterward so the
   permanent Groovy metaclass mutation doesn't leak into other blocks.
Rework the docs build so a genuine Gremlin error fails the build rather
than rendering as a silently-empty block. Investigation confirmed no
executed gremlin-groovy block is expected to error: error output goes to
the console's stderr (the "Display stack trace?" prompt), which neither
the old nor new build captured, and all error examples in the docs are
hand-authored [source,text] blocks.

GremlinConsole now records the stderr error prompt and surfaces it from
execute() as a GremlinExecutionException; the treeprocessor propagates it
(fatal) and the silent dry-run fallback is removed. The 90s timeout
remains a failsafe and single restart-and-retry recovery is preserved.
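
The fail-fast behaviour can be pictured roughly as follows, assuming stderr has already been collected as a list of lines for the statement just executed. The nested exception class here is a stand-in that merely shares its name with the PR's GremlinExecutionException; `ErrorPromptSketch` and `failOnErrorPrompt` are hypothetical.

```java
import java.util.List;

// Hypothetical illustration only; not the PR's GremlinConsole code.
final class ErrorPromptSketch {
    static final String ERROR_PROMPT = "Display stack trace?";

    // Stand-in for the exception type the PR introduces.
    static final class GremlinExecutionException extends RuntimeException {
        GremlinExecutionException(String message) {
            super(message);
        }
    }

    static void failOnErrorPrompt(String statement, List<String> stderrLines) {
        for (int i = 0; i < stderrLines.size(); i++) {
            if (stderrLines.get(i).contains(ERROR_PROMPT)) {
                // Everything printed before the prompt is treated as the error detail.
                String detail = String.join("\n", stderrLines.subList(0, i));
                throw new GremlinExecutionException(
                        "Gremlin error while executing: " + statement + "\n" + detail);
            }
        }
    }
}
```

Throwing an unchecked exception lets the caller decide whether to retry once or abort the build, which matches the restart-and-retry behaviour the commit message keeps.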

Enabling this surfaced several previously-masked setup issues, also fixed:
- buildStatements tracks bracket depth so multi-line Groovy closures
  (e.g. "(1..10).each { ... }; []") stay grouped instead of hanging at a
  continuation prompt.
- initGraphIfNeeded closes the prior graph and clears /tmp/neo4j and
  /tmp/tinkergraph.kryo before each block, so a stale Neo4j store lock no
  longer hangs Neo4jGraph.open (mirrors the old preprocessor).
- SugarLoader runs on a freshly restarted console so it takes effect on a
  pristine Groovy metaclass.
- process-docs.sh resolves the Neo4j DB impl onto the console classpath,
  strips only the conflicting io.netty 4.1.24 (keeping netty-3.9.x that
  Neo4j needs), and registers TinkerGraph and Credential plugins.
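
For the bracket-depth item in the list above, a simplified sketch of the grouping idea: lines accumulate into one statement until every opened bracket has been closed. `BracketDepthSketch` is a hypothetical name, and the sketch deliberately ignores brackets inside string literals and comments, which a real implementation would need to consider.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the PR's buildStatements.
final class BracketDepthSketch {
    static int depthChange(String line) {
        int delta = 0;
        for (char c : line.toCharArray()) {
            if (c == '(' || c == '[' || c == '{') {
                delta++;
            } else if (c == ')' || c == ']' || c == '}') {
                delta--;
            }
        }
        return delta;
    }

    static List<String> group(List<String> lines) {
        List<String> statements = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int depth = 0;
        for (String line : lines) {
            if (current.length() == 0 && line.trim().isEmpty()) {
                continue; // skip blank lines between statements
            }
            if (current.length() > 0) {
                current.append('\n');
            }
            current.append(line);
            depth += depthChange(line);
            if (depth <= 0) {
                // All brackets are balanced, so the statement is complete.
                statements.add(current.toString());
                current.setLength(0);
                depth = 0;
            }
        }
        if (current.length() > 0) {
            statements.add(current.toString()); // unbalanced tail, emitted as-is
        }
        return statements;
    }
}
```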

REQUIREMENTS.md FR-4 updated. A remaining Neo4j/Spark Scala classpath
conflict requiring per-book plugin isolation is tracked separately.

(tinkerpop-6jq.14)
Assisted-by: Kiro:claude-opus-4.8 [Kiro CLI]
Neo4j 3.4 (Scala 2.11) and Spark (Scala 2.12) cannot share the docs
console's flat classpath. Wire the previously-unwired plugin-exclusion
scaffolding so the console restarts with conflicting plugins removed:

- Add PluginDirectoryRestartHandler that toggles ext/<plugin> dirs and
  keeps ext/plugins.txt in sync (the console drops unlisted plugins whose
  jars vanish on restart).
- GremlinTreeprocessor: detect :gremlin-docs-plugins-exclude: at section
  granularity during the AST walk (document baseline + per-section
  override with latching), bouncing the console when the set changes.
  Default to the directory handler in production; tests inject their own.
  Also :set max-iteration 100 on console start to match published output.
- process-docs.sh: install each plugin's deps into ext/<plugin>/plugin/
  (deduped vs lib/) instead of the shared lib/, so conflicting deps are
  isolatable; write plugins.txt deterministically to avoid stale state.
- Add :gremlin-docs-plugins-exclude: attributes to the neo4j, hadoop,
  spark and gremlin-variants chapters with explanatory comments; update
  the stale reference index comment and developer docs.
- Fix an undefined-variable typo (marko -> vMarko) and render the
  olap-spark-yarn recipe (which needs a real YARN/HDFS cluster) as a
  non-executed block with hardcoded output.
- Set asciidoctor.gemPath under target/ so the JRuby gem extraction no
  longer creates a gems/ directory at the repo root.
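
One possible reading of the "document baseline + per-section override with latching" rule, sketched for reviewers. I am assuming that the attribute value is a comma-separated list of plugin names and that "latching" means a section without the attribute keeps whatever set is already active. `ExcludeTrackerSketch` and its methods are hypothetical.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; not the PR's GremlinTreeprocessor logic.
final class ExcludeTrackerSketch {
    private Set<String> active;

    ExcludeTrackerSketch(String documentBaseline) {
        this.active = parse(documentBaseline);
    }

    // Assumed value format: "plugin-a, plugin-b".
    static Set<String> parse(String value) {
        Set<String> names = new TreeSet<>();
        if (value == null) {
            return names;
        }
        for (String part : value.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                names.add(trimmed);
            }
        }
        return names;
    }

    /** Returns true when the console would need a restart for this section. */
    boolean enterSection(String sectionAttribute) {
        if (sectionAttribute == null) {
            return false; // latch: keep the currently active exclusions
        }
        Set<String> wanted = parse(sectionAttribute);
        if (wanted.equals(active)) {
            return false; // same set, no restart needed
        }
        active = wanted;
        return true;
    }

    Set<String> active() {
        return active;
    }
}
```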

Assisted-by: Kiro:claude-opus-4.8 [kiro-cli]
…rpop-6jq.11)

The old shell/AWK preprocessor and postprocessor directories have been
removed, but the developer documentation still described that system.
Rewrite the "Documentation Environment" section to describe the
Maven-based AsciidoctorJ extension: it now states the build is
Maven-driven, runs OLAP examples against the local filesystem
(fs.defaultFS=file:///) so no Hadoop cluster is required, notes the
Spark-on-YARN recipe is rendered from pre-captured output, and adds the
prerequisite distribution build and --dryRun option. Drop the obsolete
pseudo-distributed Hadoop / yarn-site / mapred-site instructions and the
AWK/GNU-utils requirements. Point the OLAP jar-conflict note at the new
per-book plugin exclusion mechanism, and update stale "preprocessor"
wording in the committer docs.

Assisted-by: Kiro:claude-opus-4.8 [kiro-cli]
…erpop-6jq.7)

processStandaloneTabGroup emitted tab code via TabbedHtmlBuilder.codeTab,
which passes the source through verbatim, so standalone [source,<lang>,tab]
groups and manual language-variant tab groups (e.g. the driver-connection
examples in the "Gremlin Server" / connecting-gremlin-server section)
rendered without CodeRay highlighting, unlike the published docs.

Route these blocks through highlightAsSource + codeTabHighlighted, the
same path the gremlin-groovy tabs use. Reference-book CodeRay span count
now matches/exceeds published and the connecting-gremlin-server examples
render with the expected keyword/string/comment highlighting.

Assisted-by: Kiro:claude-opus-4.8 [kiro-cli]
…q.7)

The docs build runs the Hadoop-Gremlin OLTP/OLAP examples against the
local filesystem (fs.defaultFS=file:///). Hadoop's RawLocalFileSystem
resolves a bare hdfs.ls() to getHomeDirectory(), which reads the JVM
user.home, so the rendered docs listed the entire contents of the build
machine's home directory instead of the clean HDFS home the published
docs show.

Change the bare hdfs.ls() calls in implementations-hadoop-start and
implementations-hadoop-end to hdfs.ls('tinkerpop-modern.kryo') so they
list only the graph file the example just copied -- deterministic output
with no home-directory leakage, and no change to the shared hadoop-gryo
input location. (A full MiniDFSCluster would reproduce the published
HDFS output exactly; that is tracked separately under tinkerpop-6jq.12.)

Assisted-by: Kiro:claude-opus-4.8 [kiro-cli]
…nkerpop-6jq.7)

PluginDirectoryRestartHandler moved ext/<plugin> to ext-disabled/<plugin>
with Files.move(REPLACE_EXISTING), which throws DirectoryNotEmptyException
when ext-disabled/<plugin> already exists as a non-empty directory. An
interrupted docs build leaves such a directory, poisoning the next run
with "Failed to restart console with excluded plugins".

Make the toggle idempotent and source-authoritative: clear any stale
destination before moving when disabling, and when enabling drop a
leftover disabled duplicate if the plugin is already present in ext/.
Also clear ext-disabled/ at the start of bin/process-docs.sh so each
build begins from a known state. Adds PluginDirectoryRestartHandlerTest
covering the round-trip, double-exclude, and stale-state scenarios.
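
A sketch of what an idempotent, source-authoritative toggle could look like using only java.nio.file. It follows the behaviour described above (clear a stale destination before moving, drop a leftover duplicate when enabling), but `PluginToggleSketch`, its method names, and its parameters are hypothetical and not taken from PluginDirectoryRestartHandler.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical illustration only; not the PR's handler.
final class PluginToggleSketch {
    static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(dir)) {
            // Reverse order so children are deleted before their parents.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }

    static void disable(Path ext, Path disabled, String plugin) throws IOException {
        Path src = ext.resolve(plugin);
        Path dst = disabled.resolve(plugin);
        if (!Files.exists(src)) {
            return; // already disabled or never installed: nothing to do
        }
        deleteRecursively(dst); // clear a stale copy left by an interrupted build
        Files.createDirectories(disabled);
        Files.move(src, dst);
    }

    static void enable(Path ext, Path disabled, String plugin) throws IOException {
        Path src = disabled.resolve(plugin);
        Path dst = ext.resolve(plugin);
        if (Files.exists(dst)) {
            deleteRecursively(src); // ext/ is authoritative; drop the duplicate
            return;
        }
        if (Files.exists(src)) {
            Files.move(src, dst);
        }
    }
}
```

Calling `disable` twice, or calling it after a crash left a non-empty destination behind, ends in the same state, which is the property the new test class is described as covering.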

Assisted-by: Kiro:claude-opus-4.8 [kiro-cli]