
Flink Operator Loses Job Manager Contact during EKS upgrade #683

Open
guruguha opened this issue May 11, 2023 · 10 comments

Comments

@guruguha

We have the Spotify Flink operator v0.4.2 deployed. Our Flink pipelines are zone-aware, meaning each Flink cluster is deployed in a single AZ, mainly to reduce inter-AZ data-transfer costs. Although this has helped reduce those costs, HA doesn't seem to work at all!

During an EKS node group upgrade, when a node hosting one or more job managers (for different clusters) goes down, the Flink operator is not even aware of it. The job manager goes down and the entire cluster is out.

Can someone help us understand this? I'm unable to provide any logs, because the operator seems to think the job manager is running as normal and no error is logged anywhere. All the job manager and task manager logs are gone too.

@live-wire (Contributor) commented May 23, 2023

Hi @guruguha
Are you sure you set all the high availability related flink properties?

Also, the serviceAccount the FlinkCluster runs with needs permissions to create and edit ConfigMaps in the corresponding namespace. You can verify that all the right properties are set and that the service account has the required roles by checking that a ConfigMap named <YOUR_FLINK_CLUSTER>-cluster-config-map exists in the same namespace as the FlinkCluster when it starts.
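One way to run that check (a sketch; these kubectl commands need a live cluster, and the cluster name, namespace, and service account below are placeholders, not names from this thread):

```sh
# Check that the HA ConfigMap was created when the FlinkCluster started
# (replace my-flink-cluster / my-namespace with your own names).
kubectl get configmap my-flink-cluster-cluster-config-map -n my-namespace

# Verify the service account the FlinkCluster runs with can create and
# edit ConfigMaps in that namespace.
kubectl auth can-i create configmaps -n my-namespace \
  --as=system:serviceaccount:my-namespace:my-flink-sa
kubectl auth can-i update configmaps -n my-namespace \
  --as=system:serviceaccount:my-namespace:my-flink-sa
```

If either `auth can-i` call prints `no`, the RBAC Role bound to the service account is missing the `configmaps` verbs and HA recovery cannot work.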

@guruguha (Author) commented May 23, 2023

@live-wire thanks for responding! Yes, we have enabled HA for all our Flink clusters. All of them are Application Clusters, requiring a job submitter to run/get deployed. And yes, we see the ConfigMaps created in the k8s namespace as well.

@regadas (Contributor) commented May 23, 2023

@guruguha would you be able to try this version https://github.com/spotify/flink-on-k8s-operator/releases/tag/v0.5.1-alpha.2? We recently addressed issues related to HA that might be impacting you.

@guruguha (Author) commented May 23, 2023

@regadas We recently upgraded our operator to the v0.5.0 release, and the issue still persists. Let me try the v0.5.1-alpha.2 tag.

@guruguha (Author)

@regadas We tried with the above release — our application-specific ConfigMap got deleted and all the pods were terminated during a node roll on EKS.

@live-wire (Contributor)

Hey @guruguha The ConfigMap is what allows the job to recover; it shouldn't be deleted.
Can you try deleting the job manager pod for a running job and see if it recovers? Also, please share your FlinkCluster yaml so we can try to reproduce this scenario for you.

@guruguha (Author) commented May 30, 2023

@live-wire Thanks for responding. We have a brief write-up on this issue here: #690
Also, it would be great if you could share the ideal/must-have HA configs for a Flink cluster so we can compare them with what we have right now.

At a high level, these are our HA configs:

flinkProperties:
    high-availability.storageDir: s3://PATH_TO_S3_DIR/
    s3.iam-role: ROLE_TO_ACCESS_S3

Other configs:

  job:
    autoSavepointSeconds: 3600
    savepointsDir: dummy_path
    restartPolicy: FromSavepointOnFailure
    maxStateAgeToRestoreSeconds: 21600
    takeSavepointOnUpdate: true

@live-wire (Contributor)

Hey @guruguha
These are the required HA flinkProperties for 1.16: link
(the exact set depends on the Flink version)

kubernetes.cluster-id: <cluster-id>
high-availability: kubernetes
high-availability.storageDir: hdfs:///flink/recovery
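Putting those three together, a minimal sketch of how they would sit in a FlinkCluster spec — the cluster ID and the S3 path below are placeholders, not values from this thread:

```yaml
flinkProperties:
  kubernetes.cluster-id: my-flink-cluster                    # placeholder; must be unique per cluster
  high-availability: kubernetes
  high-availability.storageDir: s3://my-bucket/flink/recovery/  # placeholder; any durable FS works (s3://, hdfs://, ...)
```

Note that `high-availability.storageDir` holds the actual HA metadata blobs; the ConfigMap in the cluster's namespace only stores pointers to them, which is why both the storage dir and the ConfigMap permissions need to be in place.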

@live-wire (Contributor)

Also, there should be no job-submitter pod when you use Application mode, but I noticed you mentioned you need a job submitter?
Can you share your FlinkCluster yaml?

@guruguha (Author) commented May 31, 2023

@live-wire I might have confused Application mode with per-job mode. I'll share the FlinkCluster yaml shortly.

We do have this in our HA settings:

    high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
    kubernetes.cluster-id: flink-ingestion-std-ha
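For what it's worth, the shorthand `kubernetes` and the fully qualified factory class name are equivalent ways of enabling Kubernetes HA in Flink, so either form of the property should behave the same:

```yaml
# Shorthand form:
high-availability: kubernetes

# Equivalent explicit factory class:
high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
```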
