Skip to content

test: fix flaky integration tests across multiple packages#1417

Open
pranav-new-relic wants to merge 6 commits into
mainfrom
fix/flaky-integration-tests
Open

test: fix flaky integration tests across multiple packages#1417
pranav-new-relic wants to merge 6 commits into
mainfrom
fix/flaky-integration-tests

Conversation

@pranav-new-relic
Copy link
Copy Markdown
Member

Summary

Fixes several integration test failures observed in CI that pass locally but fail under concurrent load or after entity creation:

  • fleetcontrol: Switch to NEW_RELIC_FLEET_TEST_* credentials (isolated account for fleet resources); rewrite tests into three self-contained workflows (fleet lifecycle, configuration lifecycle, deployment/managed-entity read-only search); remove all hardcoded GUIDs; add package-level comment explaining why managed-entity and deployment write paths are excluded; fix defer cleanup registration order so resources are never leaked on mid-test assertion failures
  • keytransaction: Add 5-second sleep after create before update — KeyTransactionUpdate resolves serviceLevel.indicators on the newly created entity, which NerdGraph cannot resolve until the entity is indexed; without the sleep the call exhausts all 3 retries under CI load
  • servicelevel: GetIndicators was called with the SLI's own GUID instead of the parent application entity GUID; serviceLevel { indicators } only resolves on the owning entity, so the SLI GUID always returns nothing; also add defer-based best-effort cleanup so SLIs are deleted even on mid-test failures
  • usermanagement: UserManagementGroupCleanupForIntegrationTests propagated "Could not find the target or you are unauthorized" from UserManagementDeleteGroup, causing TestIntegrationGroupManagementWithUsers to fail when parallel tests race to delete the same leftover group
  • entityrelationship: Fix type assertion (ExternalEntityMobileApplicationEntity) and refresh stale test entity GUIDs to active Mobile Application entities
  • CI (GitHub Actions): Wire NEW_RELIC_FLEET_TEST_* secrets into the integration test job

Test plan

  • CI integration tests pass for all affected packages: fleetcontrol, keytransaction, servicelevel, usermanagement, entityrelationship
  • Fleet tests skip gracefully when NEW_RELIC_FLEET_TEST_API_KEY is not set
  • No dangling resources left after a mid-test assertion failure (defer guard pattern verified)

…nup to servicelevel

- Add NewFleetIntegrationTestConfig and GetFleetTestAccountID helpers to
  testhelpers so fleet tests use NEW_RELIC_FLEET_TEST_* credentials instead
  of the default account
- Rewrite fleetcontrol integration tests into three self-contained workflows:
  fleet lifecycle, configuration lifecycle, and read-only deployment/managed
  entity search; remove all hardcoded GUIDs; add package-level comment
  documenting why managed-entity and deployment write paths are excluded
- Add defer-based best-effort cleanup to servicelevel integration tests so
  SLIs are deleted even when a test fails mid-run (mirrors nrqldroprules pattern)
- Wire NEW_RELIC_FLEET_TEST_* secrets into the GitHub Actions integration
  test job
…esource leaks

Register the defer before the create call (with an empty-ID guard) so that
any assertion failure between creation and the original defer site cannot
leave a dangling fleet or configuration entity. Mirrors the pattern used
in the servicelevel and nrqldroprules integration tests.
…efore update

KeyTransactionUpdate resolves a deeply nested entity including
serviceLevel.indicators. On a brand-new entity under CI load, NerdGraph
fails to resolve those fields with "An error occurred resolving this field"
and exhausts all 3 retries. Adding a 5-second sleep after create (matching
the servicelevel test pattern) gives the entity time to be indexed.

Also fix three secondary issues:
- require.NoError in the defer would panic/obscure failures; replaced with
  best-effort cleanup using the standard createdGUID + deleted guard pattern
- defer was registered after several require checks on the create response,
  leaving a window where a created entity would not be cleaned up
- cleanup block logged wrong variable (err instead of deleteErr)
…leanup

servicelevel: GetIndicators must receive the owning application entity GUID,
not the SLI's own GUID; using the SLI GUID returns no results.

usermanagement: swallow "Could not find the target or you are unauthorized"
errors in UserManagementGroupCleanupForIntegrationTests so parallel tests
racing to delete the same leftover groups do not fail in CI.
… GUIDs

Switch ExternalEntity to MobileApplicationEntity to match the actual entity
type of the test GUIDs, and update the source/target GUIDs to active Mobile
Application entities so the CRUD test passes in CI.
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 22, 2026

Codecov Report

❌ Patch coverage is 0% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 32.92%. Comparing base (b289aa1) to head (d7ac985).
⚠️ Report is 4 commits behind head on main.

Files with missing lines Patch % Lines
pkg/testhelpers/config.go 0.00% 7 Missing ⚠️
pkg/testhelpers/helpers.go 0.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1417      +/-   ##
==========================================
+ Coverage   32.70%   32.92%   +0.22%     
==========================================
  Files         141      141              
  Lines        6691     6700       +9     
==========================================
+ Hits         2188     2206      +18     
+ Misses       4292     4281      -11     
- Partials      211      213       +2     
Flag Coverage Δ
integration 0.22% <0.00%> (?)
unit 32.70% <0.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

…o fix flaky CI

AccountManagementUpdateAccount fails with "Unable to look up organization
by account ID" when called too soon after create — the organization-lookup
service propagates new accounts asynchronously and 2 seconds is not enough
under CI load, causing all 3 retries to be exhausted.

Bump the post-create sleep from 2s to 10s so the backend has enough time
to index the new account before the update is attempted. Also bump the
post-cancel sleep from 2s to 5s for the same reason (isCanceled visibility
has the same propagation delay).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants