@jmc-9304 jmc-9304 commented Dec 25, 2025

Checklist:

  • Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this does not need to be in the release notes.
  • The title of the PR states what changed and the related issues number (used for the release note).
  • The title of the PR conforms to the Title of the PR
  • I've included "Closes [ISSUE #]" or "Fixes [ISSUE #]" in the description to automatically close the associated issue.
  • [] I've updated both the CLI and UI to expose my feature, or I plan to submit a second PR with them.
  • Does this PR require documentation updates?
  • I've updated documentation as required by this PR.
  • I have signed off all my commits as required by DCO
  • I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
  • My build is green (troubleshooting builds).
  • My new feature complies with the feature status guidelines.
  • I have added a brief description of why this PR is necessary and/or what this PR solves.
  • Optional. My organization is added to USERS.md.
  • Optional. For bug fixes, I've indicated what older releases this fix should be cherry-picked into (this may or may not happen depending on risk/complexity).

Problem Statement

When argocd-server starts up, it initializes informers for Application and AppProject resources to maintain an in-memory cache. During this initialization, the informer performs a list operation to populate its cache. In environments with a large number of Application resources (e.g., 3000+ applications), this initial list operation can be slow because it reads directly from etcd, taking approximately 15 seconds to complete. This delay impacts the server's startup time and readiness.

Solution

This PR optimizes informer initialization by setting resourceVersion=0 in the ListOptions for both the Application and AppProject informers. When resourceVersion=0 is specified, the Kubernetes API server serves the list from its watch cache instead of reading directly from etcd. This approach:

  • Speeds up initial list operations: The API server's watch cache is significantly faster than direct etcd reads
  • Maintains consistency: The API server keeps its watch cache in sync with etcd, so the returned data is a consistent view as of a recent resourceVersion (though it may be slightly stale)
  • Improves startup performance: Reduces informer initialization time from ~15 seconds to ~2 seconds in environments with 3000+ Application resources

Performance Impact

Based on testing in an environment with approximately 3000 Application resources:

  • Before: ~15 seconds for informer initialization
  • After: ~2 seconds for informer initialization
  • Improvement: ~87% reduction in initialization time

Technical Details

The change adds a tweakListOptions function that sets resourceVersion=0 when the ResourceVersion field is empty. This function is applied to both:

  • AppProject informer factory
  • Application informer factory

This optimization is safe because:

  1. The informer will catch up with any changes via the watch mechanism after initialization
  2. The watch cache is kept in sync with etcd by the API server
  3. This is a common optimization pattern used in Kubernetes controllers
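As a minimal sketch of the tweak described above: the function only fills in resourceVersion when the caller left it empty, so later requests that already carry a resourceVersion are left untouched. To keep the example runnable without client-go, `ListOptions` here is a local stand-in (an assumption for illustration) for `metav1.ListOptions`; in the actual change the function would presumably be passed to the generated informer factories, which is not shown here.

```go
package main

import "fmt"

// ListOptions is a local stand-in for metav1.ListOptions from
// k8s.io/apimachinery, reduced to the single field this sketch needs.
// It is an assumption for illustration, not the real type.
type ListOptions struct {
	ResourceVersion string
}

// tweakListOptions defaults an empty ResourceVersion to "0", which asks
// the API server to answer from its watch cache. Options that already
// carry a resourceVersion are not modified.
func tweakListOptions(options *ListOptions) {
	if options.ResourceVersion == "" {
		options.ResourceVersion = "0"
	}
}

func main() {
	// Initial list: no resourceVersion set, so the tweak applies.
	initial := ListOptions{}
	tweakListOptions(&initial)
	fmt.Printf("initial list: resourceVersion=%q\n", initial.ResourceVersion)

	// A request that already pins a resourceVersion is left as-is.
	pinned := ListOptions{ResourceVersion: "12345"}
	tweakListOptions(&pinned)
	fmt.Printf("pinned request: resourceVersion=%q\n", pinned.ResourceVersion)
}
```

The empty-string guard is the important design choice: it limits the change to requests that did not ask for a particular version.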

Testing

  • Verified that informers still function correctly after the change
  • Confirmed that the watch mechanism continues to work properly after initial list
  • Tested in an environment with 3000+ Application resources to measure performance improvement

@jmc-9304 jmc-9304 requested a review from a team as a code owner December 25, 2025 16:58
bunnyshell bot commented Dec 25, 2025

🔴 Preview Environment stopped on Bunnyshell

See: Environment Details | Pipeline Logs

Available commands (reply to this comment):

  • 🔵 /bns:start to start the environment
  • 🚀 /bns:deploy to redeploy the environment
  • /bns:delete to remove the environment

@jmc-9304 jmc-9304 force-pushed the perf/faster-informer-init branch from c1ca755 to bd1f151 Compare December 25, 2025 17:03
@jmc-9304 jmc-9304 changed the title from "fix(perf): use resourceVersion=0 for faster informer initialization" to "fix(perf): use resourceVersion=0 for faster argocd-server informer initialization" Dec 25, 2025
@jmc-9304 jmc-9304 force-pushed the perf/faster-informer-init branch from bd1f151 to 2655946 Compare December 25, 2025 17:44
…itialization

Set resourceVersion=0 in informer ListOptions to use API server's
watch cache instead of etcd for initial list operations. This
speeds up informer initialization when there are large numbers
of Application resources.

Signed-off-by: jmc-9304 <fdop104@gmail.com>
@jmc-9304 jmc-9304 force-pushed the perf/faster-informer-init branch from c9d0a5b to 2e7f0e4 Compare December 26, 2025 04:46
// resourceVersion=0 means use API server's watch cache (faster but may be slightly stale)
// This speeds up initial informer sync, especially with large numbers of Applications
if options.ResourceVersion == "" {
	options.ResourceVersion = "0"
Member

Author

Thank you for your comment. 👍

I don't see any code in argocd-server that issues paginated List requests to the API server.

Why do we need to take paginated List into account here?

Author

To add to that: judging from its implementation history, KEP-4988 (Snapshottable API Server Cache) seems to be designed to create snapshots for list operations at specific resourceVersions, not for resourceVersion=0, which represents the latest available version.

Our current implementation uses resourceVersion=0 for a faster initial informer sync and doesn't use pagination, so KEP-4988 does not seem to apply to this change. Additionally, unlike the controller, argocd-server is the backend server for the web UI, so focusing on the latest state to improve web response speed is more efficient.

Member

@rumstead rumstead Dec 28, 2025

I have the same concerns that are outlined here argoproj/gitops-engine#617 (comment)

Author

@jmc-9304 jmc-9304 Dec 28, 2025

Thank you for the feedback! :)

After reviewing the attached link, I understand that using resourceVersion=0 without pagination could potentially cause API server memory concerns, especially if a large number of Applications need to be loaded in a single list operation.

However, I'd like to point out an important difference between argocd-application-controller and argocd-server:

  1. Controller behavior: The argocd-application-controller performs periodic resync operations (configured via --app-resync flag, default 120 seconds) that trigger list operations at regular intervals. This means the API server would receive list requests repeatedly over time.
  2. Server behavior: The argocd-server informers are created with resyncPeriod=0 (see server/server.go:321-322), which means they only perform the initial list operation during startup. After that, they rely entirely on watch events for updates. There are no periodic resync operations in argocd-server.

(Sorry, I misunderstood and thought that the list operation was performed during the resync period.)

Therefore, while the initial list operation might be larger without pagination, it only happens once during server startup, not repeatedly like in the controller. This should significantly reduce the overall impact compared to the controller's behavior.

Additionally, as shown below, a rough estimation suggests that performing an Applications LIST operation on the API server without pagination may not result in significant memory pressure.

⚠️ To estimate the API server LIST memory overhead, let us assume there are 10,000 Application resources, with an average Application size of 10KB, and calculate the expected memory impact.

Average-based memory estimation

  • Measured average Application size: ~10 KB (in my case, 9.870 KB)
  • Total payload for 10,000 Applications:
    • 10 KB × 10,000 ≈ 100,000,000 bytes ≈ 100 MB (about 95 MiB)
  • Empirically, the transient memory peak is ≈ 2–5× the payload, resulting in
    ~200–500 MB of additional memory
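The estimate above can be reproduced with a short Go sketch. The function name `estimateListMemory` and its parameters are hypothetical, introduced only for illustration; the inputs (10,000 Applications, 10 KB each taken as 10,000 bytes, and a 2–5× transient peak factor) are the assumptions stated in this comment, not measurements of the API server.

```go
package main

import "fmt"

// estimateListMemory returns the raw payload size in bytes for a
// non-paginated LIST, plus a low and high estimate of the transient
// memory peak obtained by multiplying the payload by the given factors.
func estimateListMemory(count, avgBytes, lowFactor, highFactor int64) (payload, low, high int64) {
	payload = count * avgBytes
	return payload, payload * lowFactor, payload * highFactor
}

func main() {
	// Assumed inputs: 10,000 Applications at 10,000 bytes each,
	// with a transient peak of 2x to 5x the payload.
	payload, low, high := estimateListMemory(10_000, 10_000, 2, 5)

	const mb = 1_000_000.0 // decimal megabytes
	fmt.Printf("payload:        %.0f MB\n", float64(payload)/mb)
	fmt.Printf("transient peak: %.0f-%.0f MB\n", float64(low)/mb, float64(high)/mb)
}
```

Changing the count or average size shows how the estimate scales linearly with the number of Applications.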

Therefore, the memory overhead of the LIST operation does not appear to be significant, and since it is performed only once during initialization rather than repeatedly, it seems reasonable for argocd-server to use resourceVersion=0.

That said, if the reviewer still has concerns about potential API server OOM risks associated with a non-paginated LIST operation, I am also fine with closing this PR. 🙂

3 participants