
[Bug] Blue-Green Cluster Migration results in migration of healthcheck topic which causes timeout on GET /admin/v2/brokers/health #23572

Open
frankjkelly opened this issue Nov 6, 2024 · 1 comment
Labels: type/bug

Comments

@frankjkelly (Contributor) commented Nov 6, 2024

Search before asking

  • I searched in the issues and found nothing similar.

Read release policy

  • I understand that unsupported versions don't get bug fixes. I will attempt to reproduce the issue on a supported version of Pulsar client and Pulsar broker.

Version

Latest on master, i.e. 4.1.0

Minimal reproduce step

  1. Create two local Pulsar clusters (for me it was in Minikube) with the same cluster name (different services)
  2. Confirm the healthcheck is passing, e.g.
kubectl port-forward service/pulsar-broker 8080:8080
curl http://localhost:8080/admin/v2/brokers/health
  3. Enable cluster migration from one cluster to the other, e.g.
kubectl port-forward service/pulsar-broker 8080:8080
./bin/pulsar-admin clusters update-cluster-migration blue --broker-url pulsar://pulsar-green-broker:6650 --service-url http://pulsar-green-broker:8080 --migrated
  4. Monitor the healthcheck on the original cluster:
while true; do curl http://localhost:8080/admin/v2/brokers/health; date; sleep 5; done
ok
Wed Nov  6 09:37:51 EST 2024
ok
Wed Nov  6 09:37:56 EST 2024
.
.
Wed Nov  6 09:42:19 EST 2024
{"reason":"\n --- An unexpected error occurred in the server ---\n\nMessage: request timeout {'durationMs': '30000', 'reqId':'4072026808692081594', 'remote':'pulsar-broker-0.pulsar-broker.cogito.svc.cluster.local/10.244.0.12:6650', 'local':'/10.244.0.12:43376'}\n\nStacktrace:\n\norg.apache.pulsar.client.api.PulsarClientException$TimeoutException: request timeout {'durationMs': '30000', 'reqId':'4072026808692081594', 'remote':'pulsar-broker-0.pulsar-broker.cogito.svc.cluster.local/10.244.0.12:6650', 'local':'/10.244.0.12:43376'}\n\tat org.apache.pulsar.client.impl.ClientCnx.checkRequestTimeout(ClientCnx.java:1422)\n\tat org.apache.pulsar.common.util.Runnables$CatchingAndLoggingRunnable.run(Runnables.java:54)\n\tat io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)\n\tat io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:162)\n\tat io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)\n\tat io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166)\n\tat io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)\n\tat io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:408)\n\tat io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)\n\tat io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)\n\tat io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)\n\tat java.base/java.lang.Thread.run(Unknown Source)\n"}

What did you expect to see?

The original "blue" cluster should appear to be healthy after cluster migration.

What did you see instead?

"Blue" cluster reports as not healthy.
The impact in my system is that we have microservices which have readiness and liveness checks which depend on the Pulsar broker health check (based on admin client / API) and since that fails, the microservices then start to fail when the blue cluster is technically still operational.
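
For context, my understanding of why this surfaces as a timeout (simplified sketch below, not the actual BrokersBase implementation): the health endpoint publishes a probe message to the broker's own heartbeat topic and reads it back. Once that topic has been migrated, the internal producer/reader get redirected to the green cluster's URL, and when that cluster is unreachable the round trip hangs until the 30s request timeout above fires.

// Simplified sketch of the health-probe flow (assumed behaviour, not the
// actual org.apache.pulsar.broker.admin.impl.BrokersBase code): publish a
// probe to the broker's own heartbeat topic and read it back. After
// migration the heartbeat topic redirects clients to the green cluster,
// so this round trip eventually times out.
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.MessageId;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Reader;
import org.apache.pulsar.client.api.Schema;

public class HealthProbeSketch {
    public static void main(String[] args) throws Exception {
        // Illustrative v2-style heartbeat topic name; the real name is built
        // by getHeartbeatTopicName(...), quoted further down in this issue.
        String heartbeatTopic = "persistent://pulsar/broker-0:8080/healthcheck";
        try (PulsarClient client = PulsarClient.builder()
                     .serviceUrl("pulsar://localhost:6650")
                     .build();
             Producer<String> producer = client.newProducer(Schema.STRING)
                     .topic(heartbeatTopic)
                     .create();
             Reader<String> reader = client.newReader(Schema.STRING)
                     .topic(heartbeatTopic)
                     .startMessageId(MessageId.latest)
                     .create()) {
            producer.send("ping");                   // publish the probe
            Message<String> msg = reader.readNext(); // hangs if the topic was migrated
            System.out.println("ok: " + msg.getValue());
        }
    }
}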

Anything else?

I'm not sure it even makes sense logically to migrate the healthcheck topic.
Looking here, it is surprising that the healthcheck topic does not register as an internal topic:

public CompletableFuture<Void> checkClusterMigration() {
    if (ExtensibleLoadManagerImpl.isInternalTopic(topic)) {
        return CompletableFuture.completedFuture(null);
    }
    // ...

public static boolean isInternalTopic(String topic) {
    return INTERNAL_TOPICS.contains(topic)
            || topic.startsWith(TOPIC)
            || topic.startsWith(BROKER_LOAD_DATA_STORE_TOPIC)
            || topic.startsWith(TOP_BUNDLES_LOAD_DATA_STORE_TOPIC);
}

The healthcheck (heartbeat) topic name is defined here:

public static String getHeartbeatTopicName(String brokerId, ServiceConfiguration configuration, boolean isV2) {
    NamespaceName namespaceName = isV2
            ? NamespaceService.getHeartbeatNamespaceV2(brokerId, configuration)
            : NamespaceService.getHeartbeatNamespace(brokerId, configuration);
    return String.format("persistent://%s/%s", namespaceName, HEALTH_CHECK_TOPIC_SUFFIX);
}

I might be interested in submitting a PR, but I would need some clarity on whether to change the logic in PersistentTopic.java, e.g.

public CompletableFuture<Void> checkClusterMigration() {
    if (ExtensibleLoadManagerImpl.isInternalTopic(topic)) {
        return CompletableFuture.completedFuture(null);
    }
    // ...
    // ADD LOGIC HERE

or whether it makes more sense for ExtensibleLoadManagerImpl.isInternalTopic to also return true for the healthcheck topic; a sketch of that option follows below.
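
For illustration, a minimal sketch of the second option (untested; it assumes NamespaceService exposes a heartbeat-namespace check that accepts a TopicName — if the actual helper takes a different type, the heartbeat namespace patterns could be matched directly instead):

public static boolean isInternalTopic(String topic) {
    return INTERNAL_TOPICS.contains(topic)
            || topic.startsWith(TOPIC)
            || topic.startsWith(BROKER_LOAD_DATA_STORE_TOPIC)
            || topic.startsWith(TOP_BUNDLES_LOAD_DATA_STORE_TOPIC)
            // assumed addition: heartbeat namespaces (v1 and v2) host the
            // healthcheck topic and should never be migrated
            || NamespaceService.isHeartbeatNamespace(TopicName.get(topic));
}

One caveat either way: isInternalTopic is consulted by the extensible load manager for more than migration, so widening it could have side effects beyond this bug; a narrower heartbeat check inside checkClusterMigration() itself would keep the change local to the migration path.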

Are you willing to submit a PR?

  • I'm willing to submit a PR!
@frankjkelly added the type/bug label on Nov 6, 2024
@frankjkelly (Contributor, Author) commented:

cc: @rdhabalia This isn't as critical for me as the potential NonDurable subscription issue (which I am trying to get my arms around so I can help with reproduction).
Anyway, I just wanted to make sure this was captured. If the fix is straightforward (see my question above), I'd be happy to submit it.
