- Fix `send_observation_tree_array` flag propagation: the flag was set in config but never reached the Python callback, so CSV files were always written. Fixed: the flag now flows `config.yml` → `train.py`/`transfer.py` → `create_callback` → callback constructor, where it guards all write/append sites.
- Guard tree array computation in Java: when flag=false, SimulationStepInfo no longer calls `getInfrastructureObservation()` (it was computing the full tree and then throwing it away). RL's `infr_state` (Observation object) is still always computed, which is the correct behavior.
- Use `shadowJar` for the fat JAR: `./gradlew shadowJar`, not `jar`; `make build-gateway` now works correctly.
- Remove Py4J: the `use_grpc` config key, the Py4J gym env, and all Py4J-related code are fully removed.
- Remove batch/parallel mode: `run_mode: serial` only; `run_mode: batch` and all orchestration code removed.
- Lombok migration: `@Value`, `@Getter`, `@Setter` annotations applied to SimulationSettings, SimulationStepInfo, SimulationStepResult, SimulationResetResult, Observation, CloudletDescriptor; replaced 30+ boilerplate getter methods.
- All System.out/err replaced with SLF4J: no raw print statements remain in Java source.
- Remove `MultiSimulationEnvironment.java`: it referenced the deleted Py4JGatewayServer; file deleted.
- Fix `configureLogging` NPE: `System.getProperty()` with a null default caused an NPE on a missing property; fixed with `?.trim()?.toUpperCase() ?: 'INFO'`.
- Upgrade shadow plugin: the shadow plugin built into `com.bmuschko.docker-java-application` is incompatible with Gradle 9; switched to `com.gradleup.shadow:9.4.1`.
- JDK 25 toolchain: the foojay resolver auto-provisions JDK 25; no local JDK installation required.
- Centralized version management: `versions.gradle` is the single source of truth for managerVersion, gatewayVersion, gradleVersion.
- Docker reproducibility: `.dockerignore` excludes bytecode and build artifacts; `pip install --no-deps` prevents layer conflicts on rebuild.
- Per-experiment Java logs: `Main.java` generates `logback.xml` at runtime pointing to `logs/experiment_${EXPERIMENT_ID}/`; `misc.py` passes `experiment.id` and `log.destination` as `-D` properties to the JVM.
- Guard `getInfrastructureObservation()`: NOT DONE. The double call still exists: `step()` still calls `getInfrastructureObservation()` twice per step (once for SimulationStepInfo, once for Observation). Refactor to compute once and pass the result to both. Low priority, since the RL observation must always compute it.
- Batch gRPC calls: send multiple steps per roundtrip to reduce roundtrip frequency (16x fewer roundtrips). Requires changes to the proto schema and to both client and server.
- Async gRPC client + thread pool: use async Python gRPC with a thread pool to overlap gRPC wait times across workers. Keeps the architecture; no proto changes needed.
- SubprocVecEnv with post-fork gRPC: establish gRPC channels after `fork()`. Risky/complex due to Channel non-picklability.
- Optional proto field: make `observation_tree_array` optional in the proto so Java can skip sending it entirely when flag=false. Requires a proto recompile on both sides.
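
The roundtrip arithmetic behind the batching item can be sketched as follows. This is a counting model only: `FakeSimulationStub`, `step`, and `step_batch` are hypothetical names that replace the real gRPC stub, and the sketch assumes the actions for a whole batch are known before the call is sent, which these notes do not specify.

```python
class FakeSimulationStub:
    """Hypothetical stand-in for a gRPC stub; counts roundtrips, no network."""

    def __init__(self):
        self.roundtrips = 0
        self.state = 0

    def step(self, action):
        # One roundtrip per simulation step.
        self.roundtrips += 1
        self.state += action
        return self.state

    def step_batch(self, actions):
        # One roundtrip carries several steps; the server applies them in order.
        self.roundtrips += 1
        results = []
        for action in actions:
            self.state += action
            results.append(self.state)
        return results


def run_unbatched(stub, actions):
    return [stub.step(a) for a in actions]


def run_batched(stub, actions, batch_size):
    results = []
    for start in range(0, len(actions), batch_size):
        results.extend(stub.step_batch(actions[start:start + batch_size]))
    return results
```

With a batch size of 16, the batched path issues one sixteenth of the calls while returning the same per-step results, which is where a 16x reduction in roundtrips would come from under these assumptions.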
| Config | Start FPS | End FPS | Wall time to 4096 steps |
|---|---|---|---|
| 1 CPU | 77 | 34 | 1.53 min |
| 16 CPU | 207 | ~35 | 1.17 min |
Key finding: FPS degradation happens even with 1 CPU (77→34). This means:
- The degradation is NOT caused by DummyVecEnv queue buildup
- The 16-CPU parallelism helps overall (~24% faster) but the degradation curve is similar
- Both converge to ~34-35 FPS regardless of parallelism
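
The figures in the table can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers reported above; the function names are illustrative.

```python
TOTAL_STEPS = 4096
WALL_MINUTES = {"1 CPU": 1.53, "16 CPU": 1.17}


def mean_fps(steps, minutes):
    """Average steps per second over the whole run."""
    return steps / (minutes * 60.0)


def speedup_percent(slow_minutes, fast_minutes):
    """Relative wall-time reduction of the faster run, in percent."""
    return 100.0 * (slow_minutes - fast_minutes) / slow_minutes


if __name__ == "__main__":
    for name, minutes in WALL_MINUTES.items():
        print(name, round(mean_fps(TOTAL_STEPS, minutes), 1), "FPS on average")
    print(round(speedup_percent(1.53, 1.17), 1), "% less wall time with 16 CPU")
```

The wall-time reduction works out to about 23.5%, matching the "~24% faster" figure, and the whole-run average FPS of each configuration falls between its start and end FPS, as expected for a decaying curve.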
Current hypothesis: the per-step cost of the CloudSim event-driven simulation increases over the episode lifetime as the simulation state evolves. Alternatively, policy update time grows as the model trains.
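
One way to tell the two hypotheses apart is to time the environment step and the policy update separately and compare early samples with late ones. The sketch below is a generic timer, not code from this project; `PhaseTimer`, `phase`, and `drift` are hypothetical names, and the phase labels are whatever the caller chooses.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class PhaseTimer:
    """Collects wall-clock durations per named phase."""

    def __init__(self, clock=time.perf_counter):
        self.clock = clock
        self.samples = defaultdict(list)

    @contextmanager
    def phase(self, name):
        start = self.clock()
        try:
            yield
        finally:
            self.samples[name].append(self.clock() - start)

    def drift(self, name, window):
        """Ratio of the mean of the last `window` samples to the first `window`."""
        samples = self.samples[name]
        if len(samples) < 2 * window:
            raise ValueError("not enough samples for two windows")
        early = sum(samples[:window]) / window
        late = sum(samples[-window:]) / window
        return late / early
```

Wrapping the environment step in `with timer.phase("env_step"):` and the policy update in `with timer.phase("policy_update"):` would show which phase has a drift ratio well above 1 by the end of the run.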
- Batch gRPC calls: easiest; send N steps per roundtrip, which cuts gRPC roundtrips 16x
- Async gRPC + threading: async Python client with thread pool to overlap waits
- SubprocVecEnv post-fork: true Python parallelism
- In-process Java via JNI: no network; fastest but most invasive
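
The idea of overlapping waits can be sketched with a plain thread pool. This is a simplified variant of the second option: it uses blocking calls on worker threads rather than an async gRPC client, and `blocking_step` is a hypothetical stand-in that sleeps instead of calling a server. It assumes the real blocking call releases the GIL while waiting, as `time.sleep` does.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_step(worker_id, latency=0.05):
    """Hypothetical stand-in for one blocking step call to one server."""
    time.sleep(latency)  # simulated network + server time
    return worker_id * 10


def step_sequential(worker_ids):
    # Sequential stepping: total time is the sum of all waits.
    return [blocking_step(w) for w in worker_ids]


def step_overlapped(worker_ids, pool):
    # Thread pool: the waits overlap; results keep the input order.
    return list(pool.map(blocking_step, worker_ids))


def timed(fn, *args):
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    ids = list(range(16))
    with ThreadPoolExecutor(max_workers=len(ids)) as pool:
        _, t_seq = timed(step_sequential, ids)
        _, t_par = timed(step_overlapped, ids, pool)
    print(f"sequential {t_seq:.2f}s, overlapped {t_par:.2f}s")
```

The pool is created once and reused across steps, so thread start-up cost is not paid on every call.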
- `num_cpu: 1` gives the same final FPS as `num_cpu: 16`: confirms parallelism isn't the bottleneck
- `send_observation_tree_array: false`: tree skip helps CPU but not the fundamental bottleneck
- 16 JVMs: each spawned as a subprocess by `_create_grpc_env_for_rank()` in `misc.py`, each running `GrpcServer` on ports 50051-50066
- DummyVecEnv: `vectorize_env()` uses `DummyVecEnv` for gRPC because a gRPC Channel can't be pickled for SubprocVecEnv IPC
- No sleep() in step path: the Java CloudSim simulation itself is fast; gRPC serialization + sequential Python execution is the bottleneck
- Callback: `SaveOnBestTrainingRewardCallback._save_timestep_details()` appends to 8 lists every step, plus `load_results()` reads CSV on every episode end: minor overhead
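
The pickling constraint noted for `DummyVecEnv` is commonly worked around by creating the channel lazily and dropping it from the pickled state. The sketch below shows that pattern with a fake channel; `FakeChannel` and `LazyChannelEnv` are hypothetical names, and the fake is unpicklable only because it holds a lock, which stands in for whatever makes a real gRPC Channel unpicklable.

```python
import pickle
import threading


class FakeChannel:
    """Hypothetical unpicklable stand-in for a gRPC channel."""

    def __init__(self, target):
        self.target = target
        self._lock = threading.Lock()  # locks cannot be pickled


class LazyChannelEnv:
    """Holds only the port; opens its channel on first use."""

    def __init__(self, port):
        self.port = port
        self._channel = None

    @property
    def channel(self):
        # Created on first access, so each process opens its own channel.
        if self._channel is None:
            self._channel = FakeChannel(f"localhost:{self.port}")
        return self._channel

    def __getstate__(self):
        # Drop the live channel so the object can cross a process boundary.
        state = self.__dict__.copy()
        state["_channel"] = None
        return state


def roundtrip(env):
    """Pickle and unpickle, as IPC between processes would."""
    return pickle.loads(pickle.dumps(env))
```

After `roundtrip(env)`, the copy carries the port but no channel, and its first `channel` access opens a fresh one in whichever process it runs in.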