
[Bug] managedLedgerMaxReadsInFlightSizeInMB limit doesn't work properly if managedLedgerReadEntryTimeoutSeconds isn't set #23506

Open
lhotari opened this issue Oct 23, 2024 · 2 comments
Labels: type/bug

lhotari commented Oct 23, 2024

Search before asking

  • I searched in the issues and found nothing similar.

Read release policy

  • I understand that unsupported versions don't get bug fixes. I will attempt to reproduce the issue on a supported version of Pulsar client and Pulsar broker.

Version

all released versions

Minimal reproduce step

Problem description:
When the system is loaded, an exception such as "Time-out elapsed while acquiring enough permits on the memory limiter to read from ledger [ledgerid], [topic], estimated read size [read size] bytes for [dispatcherMaxReadBatchSize] entries (check managedLedgerMaxReadsInFlightSizeInMB)" will occur when managedLedgerReadEntryTimeoutSeconds isn't set.
The current solution expects that managedLedgerReadEntryTimeoutSeconds has been set.
In addition, the solution is inefficient since retries happen in a tight loop. It is also possible that the ordering of reads gets mixed up, since whichever call happens to come next will get the permits for the next read.
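
For illustration, here is a simplified sketch of the pattern described above (hypothetical code, not the actual Pulsar implementation; class and method names are made up). The point it shows: the deadline check compares the elapsed time directly against the configured managedLedgerReadEntryTimeoutSeconds value, so with the setting unset (0) any wait fails, and failed attempts are simply re-submitted in a tight loop with no ordering between waiting reads.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class RetryingAcquireSketch {
    private final ExecutorService executor = Executors.newSingleThreadExecutor();
    private final long readEntryTimeoutMillis; // 0 when managedLedgerReadEntryTimeoutSeconds isn't set
    private long remainingPermits;

    RetryingAcquireSketch(long maxReadsInFlightBytes, long readEntryTimeoutMillis) {
        this.remainingPermits = maxReadsInFlightBytes;
        this.readEntryTimeoutMillis = readEntryTimeoutMillis;
    }

    void readWithLimits(long estimatedReadSize, long startNanos, Runnable doRead, Runnable failRead) {
        if (tryAcquire(estimatedReadSize)) {
            doRead.run();
            return;
        }
        long elapsedMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
        if (elapsedMillis > readEntryTimeoutMillis) {
            // With managedLedgerReadEntryTimeoutSeconds unset, readEntryTimeoutMillis is 0,
            // so any wait at all trips this check and the read fails with the
            // "Time-out elapsed while acquiring enough permits ..." style error.
            failRead.run();
            return;
        }
        // Otherwise the same attempt is immediately re-submitted: a tight retry loop in which
        // whichever retry or brand-new caller runs first grabs the freed permits, so the
        // ordering of waiting reads is not preserved.
        executor.execute(() -> readWithLimits(estimatedReadSize, startNanos, doRead, failRead));
    }

    private synchronized boolean tryAcquire(long size) {
        if (remainingPermits >= size) {
            remainingPermits -= size;
            return true;
        }
        return false;
    }
}
```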

What did you expect to see?

managedLedgerMaxReadsInFlightSizeInMB limit should also work without setting managedLedgerReadEntryTimeoutSeconds

What did you see instead?

Timeouts could happen when managedLedgerReadEntryTimeoutSeconds isn't set. "Fairness" is also missing for the waiting read requests, which could cause a starvation-type issue when the system is overloaded.

Anything else?

InflightReadsLimiter should be refactored to return a CompletableFuture so that the logic can be handled asynchronously and reactively. There should be a queue so that a waiting acquire call prevents subsequent calls from acquiring permits until enough permits become available for it.
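
A rough sketch of that idea (not the actual Pulsar API; names and structure are made up): acquire() returns a CompletableFuture that completes once permits are reserved, and a FIFO queue ensures that a waiting request blocks newer requests until it has been satisfied.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;

class AsyncInflightReadsLimiterSketch {
    private final long maxReadsInFlightBytes; // managedLedgerMaxReadsInFlightSizeInMB in bytes
    private long remainingPermits;
    private final Queue<PendingAcquire> queue = new ArrayDeque<>();

    private record PendingAcquire(long permits, CompletableFuture<Void> future) {}

    AsyncInflightReadsLimiterSketch(long maxReadsInFlightBytes) {
        this.maxReadsInFlightBytes = maxReadsInFlightBytes;
        this.remainingPermits = maxReadsInFlightBytes;
    }

    /** Completes when the requested permits have been reserved, in FIFO order. */
    synchronized CompletableFuture<Void> acquire(long permits) {
        CompletableFuture<Void> future = new CompletableFuture<>();
        // Newer requests must not jump ahead of requests that are already waiting.
        if (queue.isEmpty() && permits <= remainingPermits) {
            remainingPermits -= permits;
            future.complete(null);
        } else {
            queue.add(new PendingAcquire(permits, future));
        }
        return future;
    }

    /** Returns permits and wakes up queued requests in order. */
    synchronized void release(long permits) {
        remainingPermits = Math.min(maxReadsInFlightBytes, remainingPermits + permits);
        while (!queue.isEmpty() && queue.peek().permits() <= remainingPermits) {
            PendingAcquire head = queue.poll();
            remainingPermits -= head.permits();
            // A real implementation would complete the future outside the lock or on an executor.
            head.future().complete(null);
        }
    }
}
```

A caller would then chain the read onto the returned future, e.g. limiter.acquire(estimatedReadSize).thenRun(() -> issueRead(...)), and call release(estimatedReadSize) once the read completes (issueRead here is a hypothetical call site).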

Are you willing to submit a PR?

  • I'm willing to submit a PR!
lhotari commented Nov 1, 2024

I also noticed multiple other issues in InflightReadsLimiter / PendingReadsManager:

  • PendingReadsManager will acquire permits for partial reads during de-duplication, which is wrong
    • This results in acquiring duplicate permits

lhotari commented Nov 1, 2024

Possible issue with PendingReadsManager:

  • When reads are de-duplicated, the same entries are returned to all callers in case of an exact match. This could be problematic since the readerIndex is shared state of the ByteBuffer and gets mutated in some cases.
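
A minimal sketch of the hazard described above, assuming the buffers in question are Netty ByteBuf instances (which expose readerIndex). Two callers handed the same buffer instance share its readerIndex, so one caller's read moves the other's position; handing each caller a retainedDuplicate() gives independent indices over the same memory. This is only an illustration, not a claim about how PendingReadsManager should be fixed.

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.Unpooled;

public class SharedReaderIndexSketch {
    public static void main(String[] args) {
        ByteBuf entryPayload = Unpooled.wrappedBuffer(new byte[]{1, 2, 3, 4});

        // Problematic: both "callers" get the exact same instance.
        ByteBuf callerA = entryPayload;
        ByteBuf callerB = entryPayload;
        callerA.readByte();                        // advances the shared readerIndex
        System.out.println(callerB.readerIndex()); // prints 1 -- caller B sees caller A's read

        // Safer: each caller gets its own duplicate with independent reader/writer indices
        // (retainedDuplicate() also bumps the reference count so each can be released independently).
        ByteBuf callerC = entryPayload.retainedDuplicate();
        ByteBuf callerD = entryPayload.retainedDuplicate();
        callerC.readByte();
        System.out.println(callerD.readerIndex()); // still 1 -- unaffected by caller C's read
        callerC.release();
        callerD.release();
        entryPayload.release();
    }
}
```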
