Changelog

v1.0.1-rc.2 (2021-01-27)

Full Changelog

Merged pull requests:

v1.0.1-rc.1 (2021-01-18)

Full Changelog

Closed issues:

  • checkCRDExists func return true when k8s cluster is not connected #1206
  • How to install it without kubeflow #1195
  • Pod get re-created after it exited and get garbage collected #1186
  • Surface Pod and other Errors that Prevent TFJob from starting #1131
  • Jobs failing when a node is preempted #999

Merged pull requests:

v1.0.1-rc.0 (2020-12-22)

Full Changelog

Closed issues:

  • tf-operator panic without worker role #1192
  • TFJob completion with active services/endpoints resources #1191
  • Having trouble viewing logs using Kubernetes dashboard #1189
  • [feature] Support SuccessPolicy/FailurePolicy Based on % of Succeeded/Failed Workers #1188
  • TFJob cannot utilize GPUs in the node. #1184
  • [bug] With Python SDK, TFJob won't stop running #1183
  • [bug] [Python SDK] tfjob_client.get_logs broken #1182
  • How to create a python sdk for mxnet-operator #1181
  • [feature] python sdk should report errors in created TFJobs #1180
  • Could not introduce k8s.io/kube-openapi@master #1174
  • can tf-operator used in distribute scene, such as Multi-node #1173
  • Multi-worker training with Keras only use one GPU #1169
  • NCCL WARN Failed to open libibverbs.so[.1] #1168
  • tf-job-operator pod restarts #1167
  • swagger-codegen-cli-2.4.6.jar not found #1166
  • Cut release for tf-operator project #1163
  • Replace reconciler implementation with kubeflow/common JobController #1161
  • Error while replicating mnist_with_summaries #1159
  • Non-OK-status: GpuLaunchKernel(FillPhiloxRandomKernelLaunch<Distribution>, num_blocks, block_size, 0, d.stream(), gen, data, size, dist) status: Internal: out of memory #1158
  • TFjob pods hang without explanation #1156
  • [Proposal] Support ClusterSpec Propagation Feature in TF 1.14 #1141
  • evaluator should be set in TF_CONFIG when using Estimator distribute strategy #1139
  • Is there any case to run the different command in tfReplicaSpecs? #1138
  • should gpu resource be released when tfjob failed because of image pull problem? #1136
  • tf-job-operator CrashLoopBackOff #1135
  • How to change the log level of tf-job-operator #1132
  • Support getting the training process via Python SDK #1129
  • Podgroup is not created automatically. #1121
  • TFConfig should be demonstrated more specifically. #1115
  • [chore] Remove tfjob dashboard #1113
  • read TF_CONFIG env from configMap #1112
  • Long job names result in jobs stuck forever #1101
  • [Question] can't the base image "registry.access.redhat.com/ubi8/ubi:latest" in Dockerfile be replaced with "debian:buster" ? #1099
  • can i install tf-operator alone without kubeflow? #1096
  • c #1095
  • TFJob test is failing on master and v0.7 branch for kubeflow/kubeflow #1094
  • TFJob tests should use pytest #1093
  • Multiple Evaluator replicas gives InvalidTFJobSpec #1091
  • Java client for current version of TFjob #1090
  • [enhancement] Replace common with kubeflow/common #1087
  • Lack of documents for deployment #1086
  • Performance problem about pod informer #1079
  • [bug] Cannot initialize the training job with TF Estimator when the user uses 1 worker and 0 PS #1078
  • Separate cluster scoped and namespace scoped resources #1077
  • TFJob 1.0 #1076
  • [bug] Keep tf-job-role as deprecated label in this version #1068
  • GenLabels may select wrong Pods #1066
  • Can I create a tf-operator pod without using GO? #1065
  • tf-job-dashboard cannot work #1060
  • [discussion] Should We Add CleanPodPolicy PS? #1059
  • Refactor dockerfile #1058
  • remove v1beta1 in v0.5.3 cause incompatible issue when using go mod #1057
  • Invalid value: "v1beta1": must appear in spec.versions #1056
  • Example on EKS: Device or resource busy #1053
  • can we add PriorityClassName when we create TF-job Podgroup? #1048
  • TFjob still running while chief pod is completed #1045
  • Is there any document for how to run TFJob in AllReduce Strategy #1039
  • tf-operator version conflicts #1035
  • Add E2E test for gang-scheduling #1033
  • gang schedule annotation #1031
  • [feature] Can we use one headless service for one job? #1030
  • Will tf-operator upgrading k8s to 1.13? #1029
  • no error log for create tfjob fail #1026
  • Creating tfjob in dashboard usability issues #1024
  • Deleting tf-job through the dashboard is not working #1019
  • Create common CRD validate and mutating webhook for all operator #1016
  • error with kubeflow installation #996
  • Shall we consider upgrading k8s to 1.11.3 #985
  • TFJob Dashboard is not support pvc #980
  • ERROR handle object: patching object from cluster: merging object with existing state: unable to recognize "/var/folders/tl/zzfcr4zs53vgnpqqjq4n08sh0000gn/T/ksonnet-mergepatch020443124": no matches for kind "TFJob" in version "kubeflow.org/v1beta1" #976
  • Create CRD conversion webhook #967
  • Performance issue when there is a lot of completed jobs #965
  • Failed to marshal the object to TFJob; the spec is invalid: Failed to marshal the object to TFJob #964
  • Proposal for a Common Operator #960
  • Delete pod with unknown status in reconcilePods #956
  • Create distributed training example for TF 2.0 #953
  • Consider using KubeBuilder to reduce boilerplate code #925
  • e2e test for dashboard/backend/handler/api_handler.go #921
  • Use pod group instead of PDB for gang scheduling #916
  • shareProcessNamespace not working with TFJob #902
  • [feasibility-research] Handle machine failure #900
  • Should limit the size of logs of tf_operator container #888
  • Log message severity isn't properly reported in stackdriver #864
  • E2E test for invalid spec errors #810
  • [v1alpha2] Delete resources according to cleanuppolicy exactly once #804
  • refactor the code of TFJobController for unittest #757
  • e2e test for cleanupTFJob #756
  • [build] Replace Python with Make or Bazel #739
  • Export TF/Tensorboard/TF Summaries to prometheus #722
  • [discussion] Maintain Helm Chart #716
  • [discussion] Capacity planning #708
  • [v1alpha2] Generate CRD validation in Kubernetes 1.11 #622
  • Set labels and annotations for svc created by tf_operator #609
  • mnist test isn't part of CI #597
  • [v1alpha2] Push the example docker image to google or dockerhub registry #590
  • feat: use fake client-set and informer add controller unittest. #540
  • Run submit_release_job.sh in CI #519
  • Add environment name in ControllerConfig #450
  • [dashboard] How to handle storage? #449
  • [dashboard] GPU limits are not taken into account #448
  • [dashboard] Ability to create a TensorBoard instance #447
  • [examples] Add termination policy in examples/tf_job.yaml #438
  • add boilerplate header #430
  • [logging] Extra flag problem #427
  • [CI] Add hack/verify-codegen.sh in Travis CI #426
  • E2E workflows should ignore failures #423
  • [enhancement] Add OWNERS in subdirectories #415
  • [enhancement] Fix the warnings reported by goreportcard.com #394
  • [discussion] Separate the operator and UI dashboard #389
  • [enhancement] Separate release image and test image #385
  • [enhancement][CI] Replace Travis CI with Prow #382
  • use Python3 for all python code? #377
  • What to do about example TFJob YAML specs? #375
  • E2E test for non-default namespace #170
  • OpenAPI Client Generation for Java, Python #167
  • Prevent scheduling deadlocks #165
  • TfDebugger support #132
  • Refactor code in py into a proper python package #114
  • Update instructions and code to work with Kubernetes 1.8 #108
  • Build sample container as part of release process #81
  • Run lint (Python, Go) as a presubmit test #53
  • Optimize scheduling of TF Processes #35
  • E2E test that verifies invalid jobs are failed #30
  • E2E test(s) to verify that permanent and retryable errors are handled correctly. #29

Merged pull requests:

v1.0.0-rc.0 (2019-06-24)

Full Changelog

Closed issues:

  • Prometheus support in TF Job #988
  • TFJob 1.0 #968
  • Revisit Pdb calls during the reconciles while job is completed #824
  • RFC: adding more examples of TFJob for distributed learning tasks #436

Merged pull requests:

v0.5.3 (2019-06-03)

Full Changelog

Closed issues:

  • Podgroup is constantly created and deleted after tfjob is success or failure #1011
  • tfjob startTime should set immediately after create instead of wait pod of one replicaType are all running #1000
  • Create TFJob v1 documentation #990

Merged pull requests:

v0.5.2 (2019-05-23)

Full Changelog

Closed issues:

  • Failed to update TFJob status in version v1 #1003
  • tf-operator delete pod and service repeatedly #997
  • Update kustomize files for tf-operator v1 #991
  • Can not create tfjob using examples/v1beta1/dist-mnist/tf_job_mnist.yaml in self-created k8s cluster and tf-operator #975
  • Cannot running tfjob pod #944
  • [Test Flake] 503 accessing the test server exit handler #793

Merged pull requests:

v0.5.1 (2019-05-15)

Full Changelog

Closed issues:

  • tf-operator panic when cleanupTFJob #994
  • Create TFJob v1 API and controller from v1beta2 #989
  • MasterRole label initialization #987
  • Missing evaluator info from cluster section of TFCONFIG #972
  • [FeatureRequest] Support dynamic volume provisioning for TFJob and PyTorchJob #949
  • How to prevent tfjob from Running while there are still pods in Pending status? #948
  • tf operator ui could not list and create tf job #946
  • Consider restructuring tests under the shared control package #938
  • TF operator v1beta2 API #935
  • [v1beta2] Add ActiveDeadlineSeconds and BackoffLimit #550

Merged pull requests:

v0.5.0 (2019-03-26)

Full Changelog

Closed issues:

  • Support for multiple CRD versions #932
  • tf-job-operator RBAC #929
  • Use kube-batch as scheduler by default when gang-scheduling is enabled #920
  • Rename top level python package - py -> kubeflow-tf-job #914
  • [scalability testing] large number of replicas (100) #830
  • [scalability testing] large number of jobs (100?) running concurrently? #829
  • [doc] API Documentation #731

Merged pull requests:

v0.4.0 (2019-02-13)

Full Changelog

Closed issues:

  • Deprecate v1alpha2 controller and API #934
  • Failed to marshal the object to TFJob; the spec is invalid: Failed to marshal the object to TFJob #928
  • Use status subresource in TFJob CRD #927
  • Remove genclient:noStatus and call updateStatus() from controller #924
  • tfjob dashboard namespaced #923
  • TFJob with 1 replicas can't use gang-scheduling #922
  • [v1alpha2] Support for custom rpc_layer in TFConfig #906
  • is there any lighter way to deploy tf-operators? #904
  • When the distributed training job fails, the PS node and some worker node pod are deleted, and only worker 0 is retained. #903
  • [feasibility-research] TF AllReduce Strategy #901
  • Add validation for evaluator #894
  • Running TFJob on GPU only #887
  • There is a spelling mistake in developer_guide.md #882
  • PS failed but tfjob status is running #881
  • how can I get distributed tfjob log when set "cleanPodPolicy: All" #877
  • the information of "tfReplicaStatuses" is none when tfjob is in In termination state #889
  • Support custom defined cluster domain #875
  • TFJob doesn't properly handle PS error. #869
  • Code restructuring #866
  • Delete v1alpha1 controller and API #865
  • how to save the model on PVC #850
  • Support error handling for TF distributed strategies #844
  • TF operator UI not showing jobs #836
  • Why are lastTransitionTime's all the same #806
  • Kubernetes API review for TFOperator #742
  • [v1alpha2] Add e2e test cases for evaluator #651
  • E2E test to validate pod names #645
  • Distribution strategies #628
  • [discussion] specify total GPU count for distributed training #384
  • E2E tests should reuse clusters #214
  • Support Draft for packaging #136
  • Set termination timestamp #109

Merged pull requests:

v0.4.0-rc.1 (2018-11-28)

Full Changelog

Closed issues:

  • [v1alpha2] E2E test for replica restart policy #639

v0.4.0-rc.0 (2018-11-19)

Full Changelog

Closed issues:

  • create TFjob resource object successfully, but did not create pod #871
  • Create a script/tool to migrate users to v1beta1 API #858
  • Implement v1beta1 controller for TFjob #857
  • Add examples using TF distributed training #843
  • Add E2E tests for TensorFlow distribution strategies #842
  • run tfjob failed with self build image #840
  • Create v0.3-branch #838
  • Update kube-arbitrator to kube-batch #837
  • [docs] Add instructions about how to contribute e2e test cases #822
  • Build the tf-operator every night #747
  • Document how to use gang scheduling with TFJob #743
  • tf-operator should ensure that CRD exists #710
  • Improve our test harness to make it easy to write lots of E2E tests #373

Merged pull requests:

v0.3.0 (2018-09-22)

Full Changelog

Closed issues:

  • How to run in stand-alone mode #826
  • Event reporting pod exited with non-zero exit code is improperly formatted #818
  • invalid-tfjob test results don't show up in gubernator/test grid #816
  • Invalid TFJob spec can cause the TFJob operator pod to crash repeatedly #813
  • Should scheduleName be a TFJob field or is it sufficient to be a podTemplateField #801
  • reconcile should be triggered on update; even if no changes #800
  • Backwards compatibility support "Master" as chief #794
  • Add Pytorch V1alpha2 Implementation #785
  • [enhancement] Add SchedulerName in V1alpha2 #782
  • Ability to prefer using all gpus on a single node #781
  • test_runner.py is using wrong util module for JobTimeoutError #780
  • [Test Flake] Intermittent test failures: tensorflow.python.framework.errors_impl.UnavailableError: OS Error #778
  • Latest docker Image on wrong commit #775
  • PS still running after tfjob is complete #774
  • TF_CONFIG in tf-operator:v20180724-13863edf missing Environment: cloud #772
  • TF_CONFIG cluster spec has wrong FQDN name #770
  • Error syncing tfjob: Failed to found the port #768
  • Events don't show up in kubectl describe tfjobs #763
  • E2E test for TF estimator API #762
  • v1alpha2 doesn't work TF.estimator for TF <= 1.6 ; need to add environment:cloud to TF_CONFIG #761
  • Update and move README.md to website #760
  • Scope TFJob operator to only claim jobs in a given namespace #759
  • Surface invalid spec errors in a more user friendly way #755
  • TFJobs UI returns 500s and json parse errors displaying pod information or creating job #754
  • [v1alpha2] Job should be marked completed when worker 0 exits but other workers are still running #751
  • [testing] CleanPodPolicy needs E2E test #750
  • v1 and v2 E2E tests appear to be stomping on each other #748
  • [Test Flake] tf_job_client.py needs to handle case where conditions is none #744
  • tf-dashboard show workers of all the tfjobs when querying a specific tfjob #737
  • [build] Delete build/images/tf_operator/build_and_push.py #736
  • tf-operator synPdb failed when enable-gang-scheduler #729
  • not proper log message #727
  • Unable to check logs in TFJob ui for v1alpha2 #723
  • Pod stuck in unknown status when kubernetes node is down #720
  • [proposal] cleanup jobs after finished #718
  • [v1alpha2] Remove redundant code about status #713
  • [v1alpha2] Invalid Job Status #712
  • Model exchange #709
  • [v1alpha2] Invalid job spec not reported in TFJob status #707
  • [v1alpha2] Invalid Job spec crashes operator #706
  • [v1alpha2] Support cluster spec via command line argument #705
  • [v1alpha2] Error when host name is not svc.cluster.local #703
  • unable to create a tfjob in the UI; namespace not set #701
  • Wrong comment when setting default CleanPodPolicy #698
  • how to upgrade smoothly from v1alpha1 to v1alpha2? #697
  • file_cache is unavailable when using oauth2client >= 4.0.0 #696
  • [v1alpha2] Validate the TFJob converted from unstructured #682
  • [v1alpha2] CreatedCondition is not set #680
  • [v1alpha2] ks apply on existing job; "unable to find api field in struct Unstructured for the json field "metadata"" #674
  • Make it easier to debug/develop E2E tests #655
  • [v1alpha2][log] Use logrus instead of glog in service_control #635
  • latest.Status.StartTime is nil:invalid memory address or nil pointer dereference #608
  • tf-operator throws runtime error: invalid memory address or nil pointer dereference #596
  • [v1alpha2] Add PDB of TFReplicaSet for gang scheduling by kube-arbitrator #575
  • Get rid of the restriction that the container should be named "tensorflow" #563
  • [proposal]TFJob condition for v1alpha2 #562
  • [feature] Add Cleanup Policy to TFJob Spec #536
  • Update releaser to use Argo. #400
  • Enable kube-arbitrator as scheduler for tensorflow #349

Merged pull requests:

v0.2.0-rc1 (2018-06-21)

Full Changelog

Closed issues:

  • [v1alpha2] Make restart policy a pointer #692
  • [v1alpha2] Need conditions Succeeded and Failed indicating when job is done #673
  • [v1alpha2] add pod label with job name (without namespace) #672
  • [v1alpha2] Pods not deleted when job finishes #671
  • [v1alpha2] conditions not updated #668
  • [v1alpha2] Move control interface to separate package #665
  • [v1alpha2] Move test util to separate package #664
  • [feasibility study] Investigate strategy to stop PS after job is completed #661
  • Speedup E2E test by running build and setup cluster in parallel #659
  • In TFjob, when the workers Completed, i want the ps Completed too, how can i do? #657
  • [v1alpha2] service names are prefixed with namespace #654
  • [v1alpha2] Create a simple python server to be used for E2E tests of controller behavior #653
  • dep ensure give warning on k8s.io/apiserver #647
  • [v1alpha2] pod names don't include random salt #644
  • [v1alpha2]Unable to create pod #641
  • GPU tests failing; ks env doesn't exist #640
  • TFJob not marked as success when master exits but not workers #634
  • v1alpha2 - pod names don't include replica type #633
  • tensorflow on kubernetes how to pass in worker_host and ps_host to container if I use tf-operator #630
  • [v1alpha2] Set event for tfjob when spec is not valid #620
  • [v1alpha2] RealServiceControl does not set owner reference #616
  • tf_job_client blocks forever #606
  • [v1alpha2] Need to add the v1alpha2 binaries to our Docker image #600
  • [v1alpha2] Need ksonnet package #599
  • Support deploying v1alpha2 and v1alpha1 controllers simultaneously #598
  • [v1alpha2] Remove controller_utils.go #591
  • [v1alpha2] Add CI test #589
  • [question] dist_mnist example failed to run #588
  • [enhancement] Fix the gofmt support #586
  • can not set labels #580
  • v1alpha2 should use headless services #574
  • TFJob operator should pass through annotations to the pod #573
  • [test] Test failed because of ImagePullBackOff #567
  • [discussion] Do we need to maintain helm chart now? #564
  • TfJob operator stops working on invalid spec #561
  • Add a timeout flag in tf-operator to preserve resources after job completion for a given period #558
  • [go] Use dep instead of glide to reduce the size of vendor #556
  • [v1alpha2]tfjob restartPolicy for Never #555
  • Servable not found for request: Latest(mnist) #552
  • [v1alpha2] Enhance the logic about sync #547
  • [v1alpha2] The state of distributed model training. #544
  • [test] copy labels and annotations to pod from tfjob #543
  • [v1alpha2] Potential bugs when there is one worker succeeded #538
  • [v1alpha2] Use structured log #537
  • Unable to deploy the example TfJob in the user guide #535
  • [log] investigate zap #534
  • [v1alpha2] Try to not to always claim pods #533
  • [v1alpha2] Support customized port #532
  • [v1alpha2][test] Avoid potential data race problem #530
  • [v1alpha2] Do not set default to always for restartpolicy #524
  • [v1alpha2] start using kubeconfig #522
  • v1alpha2 integration #521
  • E2E test steps should exit with non zero exit code if test fails #514
  • TFJob operator surface queue metrics #503
  • [v1alpha2] Sync commits with v1alpha1 #490
  • [api] Remove pending pods from active pods #484
  • [enhancement] Set StartTime for TFJob status #475
  • [Feature] Support "eval" worker in tf-operator #444
  • Use OpenAPI validation for CRDs in k8s 1.9 #437
  • default install of kubeflow no longer install tf-job-dashboard #435
  • Add appropriate logging fields to the tf-operator log messages #424
  • Use DAG functionality of Argo in our E2E tests #422
  • [enhancement] Refactor docs #379
  • Post submits are failing with Argo #370
  • tf-job-operator pod hangs and doesn't restart if it can't delete one of the TfJob pods #366
  • Refactor TFJobStatus in CRD API #333
  • Deprecate the TfImage field #330
  • Deprecate TfPort and set default port for users #327
  • [enhancement] Add e2e test cases for recorder #317
  • Make the TfJob controller more event driven #314
  • Potential data race, maybe #302
  • [discussion] Differences between tensorflow/k8s and caicloud/kubeflow-controller #283
  • Does TfJob controller need to do master election? #263
  • Setup Prow PR Dashboard #255
  • API: some comments about API changes from PR #215 review #249
  • e2e test for the case that the chief is not master #235
  • Use conditions instead of phase #223
  • Submitted tfjobs cease to start running under unknown conditions #203
  • Tutorials #195
  • Don't leave pods running just to get logs #128
  • Add hyperparameter tuning? #112
  • Phase is wrong unexpected TfJob phase: Done #110
  • Copy chart to kubernetes/charts #93
  • Create a web page to list releases #70
  • tensorflow 1.4 and estimator support #61
  • Set a default value for restartPolicy #55
  • Use headless services for Training jobs #40
  • More validation of TfJob #25

Merged pull requests:

v0.1.0 (2018-03-29)

Full Changelog

Closed issues:

  • [v1alpha2] Implement condition update #502
  • E2E tests timing out; job appears to remain in running state even though job is done. #500
  • [v1alpha2] TF_CONFIG should be configurable by user #499
  • [test] All log is 404 in argo #496
  • Presubmit shows succeeded, but some test actually failed. #479
  • Waiting pods start too long #461
  • [test] Add unit test for pkg/controller #455
  • Create a suitable OWNERS file in /dashboard #443
  • Tide is misconfigured for this repository. #433
  • CI failed to setup the cluster #420
  • [docs] Add dashboard readme #411
  • Make coverall results advisory and not report as failure #406
  • Presubmits failing due to lint #404
  • [enhancement] Fix go vet errors which not caught by the compilers #395
  • User facing website for Kubeflow that details how to choose a stack #371
  • [discussion] How to set clusterspec #369
  • [enhancement] Rename the cmd/tf_operator to cmd/tf-operator #363
  • Local releaser fails due to version_tag #360
  • Helm test failure not reported to gubernator #355
  • [discussion] Whether to create CRD in helm charts #353
  • Should resourcelock be in the same namespace as controller? #352
  • Helm test tf-job does not pass validation #351
  • Move tensorflow/k8s to kubeflow/tf-operator #350
  • Get rid of TensorBoard replica #347
  • Performance Modeling and Evaluation of Distributed Deep Learning Frameworks on GPUs #346
  • Deprecate the ENV MY_POD_NAMESPACE and MY_POD_NAME #341
  • [feature] Does tfJob support setting different label/envVar for each worker(replicas >1)? #340
  • [Discussion] Time to start tagging releases for the TF operator? #339
  • [discussion] Should group name be tensorflow.org or kubeflow.io or kubeflow.org? #337
  • dashboard silent error during calling non-existent tfjob #335
  • in dashboard, silent error when nonexistent namespace is specified #334
  • Deprecate the IsDefaultPS field #329
  • [Convention] Replace Tf with TF in CRD #328
  • Standardise labels for issues and PRs #326
  • Manage Pods directly instead of using Job controllers #325
  • TfJobs dashboard not showing jobs #324
  • TfJobs dashboard doesn't work with K8s API server proxy or envoy proxy #323
  • Recreating a failed/successful job with same name doesn't work #322
  • Releaser incorrectly tags images as "dirty" #321
  • Reenable the releaser #320
  • E2E tests are not isolated #318
  • Need to mark prow job as failed if any tests fail #315
  • Remove outdated branch wbuchwalter-patch-1 #311
  • E2E test delete and recreate job with same name #310
  • TrainingJob.reconcile not called periodically #309
  • rename master to chief #306
  • Assign resource quota for TensorBoard #304
  • Jobs evicted for lack of memory, potentially add resource field to tf-job prototype #301
  • [Discussion] Operators vs. controller pattern #300
  • [bug] Add a default pod template for PS #297
  • Bunch of pylint error messages #294
  • Fix Head #293
  • Operator deployment fails post-v20180108-190394d #292
  • Promote last known good release #290
  • [bug] metadata.ownerReferences.apiVersion is not set #288
  • fail to run example job. invalid job spec: tfReplicaSpec.TfPort can't be nil #284
  • [bug] Build log 404 in https://prow.k8s.io/?repo=tensorflow%2Fk8s #282
  • [feature] Separate the CRD and controller #281
  • Gaps in test coverage #280
  • Regression in flag name: controller-config-file #279
  • [bug] glog before flag.Parse() #275
  • build new code to new image and find some problem #274
  • Fix the releaser so we can build new images #270
  • deploy.py gives gcloud api error '... Version "1.8.1-gke.1" is invalid.' #268
  • Pods terminated without waiting #267
  • Attach appropriate header (copyright) to go files #266
  • suppose i've install the tfjob in my k8s cluster #265
  • what's the folder pkg for? #264
  • Build failing because of lint issues #256
  • what's the main change between version 0.2 and version 0.3? #247
  • SetupCluster failures unexpected keyword argument 'client_configuration' #242
  • GPU test marked as succeeded but airflow step is failing #240
  • Use Kubeflow & ksonnet to install TfJob #239
  • tf_smoke.py distributed computing doesn't work on minikube #238
  • example-job can not work in private k8s cluster #233
  • Test failures aren't properly reported in Gubernator #229
  • [CRD] Request for input and output dirs in TFJobSpec #224
  • TfJob should be marked as failed if setup fails #218
  • panic: runtime error: invalid memory address or nil pointer dereference can not run in k8s 1.8.5 #212
  • Rethink the TFJob CRD #209
  • ksonnet configs for deploying the TfJob CRD & Controller #208
  • Make default TfImage configurable by users #207
  • refactor the TfJob to use Informer and Controller #206
  • Use Argo workflow engine for CI/CD or releases #205
  • Potential issue with Tensorboard / value of simple best-practices example with tboard #202
  • Investigate using buildah to build our images #201
  • E2E tests pre & postsubmits are failing #196
  • Publishing a client to pypi #193
  • Don't require a master or chief #192
  • Make cloning the repo and building the artifacts separate commands in py/release.py #189
  • Handle the case where grpcServerFilePath is the empty string #188
  • Make Airflow logs accessible #185
  • Complement docs for Python 3rd party dependencies #181
  • Helm Test fails because grpcServerFilePath is the empty string #179
  • Helm should only set --controller_config_file conditionally #175
  • Troubleshooting Guide: no matches for tensorflow.org/, Kind=TfJob #174
  • no matches for tensorflow.org/, Kind=TfJob #173
  • Failed to build TFOperator #171
  • E2E test for GPUs #164
  • TfJob doesn't work on minikube #160
  • Deleted jobs re-starting #156
  • Use coveralls.io to report and check code coverage #155
  • Clarify scope of tensorflow/k8s #150
  • After init helm, install chart failed #149
  • Helm test; insufficient permissions on RBAC clusters #135
  • Need to trim trailing slash of host string in TfJobRestClient.Watch() #130
  • results of lint test aren't reported in junit file used by gubernator #126
  • Collaborators need to be K8s members to trigger tests #122
  • Extend Test Infrastructure to run multiple E2E tests in parallel #120
  • initResource() failed; findAllTfJobs returned error: #118
  • Latest tag on gcr.io is not up to date #116
  • duplicate #115
  • postsubmit results aren't showing up in testgrid #113
  • TensorBoard replica set not deleted when job deleted. #107
  • helm permission issue on 1.8.1 #106
  • Run python unittests as part of pre/post/periodic tests #101
  • E2E tests are failing #96
  • E2E Test log should capture output from helm-test #95
  • Rename TfJob kind to remove mlkube.io #89
  • Setup travis for tensorflow/k8s #88
  • Update repo to use its new location tensorflow/k8s #86
  • mlkube.io -> tensorflow/k8s #85
  • Update prow to use repo tensorflow/k8s #84
  • periodic test is failing #83
  • runner.py needs to create build-log.txt with stdout/stderr of test #82
  • E2E tests leaking GKE clusters #80
  • No results show up if you click on mlkube-build-periodic #76
  • No results show up in prow test grid for presubmit jobs #75
  • Include TfJob name in labels #72
  • Simplify/Clarify Accelerators config #71
  • Clean up examples; don't require cloning the repo #68
  • How to create TF Jobs from the user side? #67
  • Change version from beta -> alpha #65
  • API Review #64
  • Setup release process for CRD #63
  • Post submit jobs don't correctly upload artifacts to GCS #62
  • presubmit test(bootstrap.py) doesn't properly check out PRs #59
  • E2E Test for default PS server #58
  • UI / Kubernetes Dashboard Integration #57
  • E2E test for GPUs #54
  • Integrate with Prow for Continuous Testing #46
  • Consider how we manage replicas (stateful sets, managing pods directly) #45
  • Use K8s Garbage Collection #42
  • func c.findAllTfJobs() in controller.go will never reach #41
  • Rename project #34
  • Structured (Json) logging for Tf Processes #32
  • Permanent errors don't cause job failure #28
  • If handling Add event fails, TfJob should be marked as failed with appropriate error #26
  • Structured Logging For the operator #24
  • Operator Log Spam; replicas.go:287] No container named: tensorflow found for pod; assuming POD is running #23
  • Provide a default value for TfPort, replicas, and tfReplicaType #22
  • Setup continuous build of containers #19
  • Should this be converted to a Custom Resource Definition (CRD) in anticipation of 1.7 #17
  • Run TensorFlow server for parameter servers by default #16
  • TensorBoard Integration #13
  • Dependency management #7
  • Better GPU support #6
  • TfJobRestClient.Create doesn't set kind appropriately #5
  • Add a creationTimestamp #4

Merged pull requests:

  • Fix outdated information about GPUs in README #513 (mindprince)
  • Don't leave pods running when a job completes. #512 (jlewi)
  • Fix bug with jobs not being marked as completed. #501 (jlewi)
  • release: Fix style #498 (gaocegege)
  • pkg: Fix the code changed in #486 #497 (gaocegege)
  • fixed some golint warning #486 (AK-ayush)
  • Support testing on minikube. #485 (jlewi)
  • add LabelsByIndex method to eliminate code duplication #474 (rc-zhang)
  • Use headless services for Training jobs #471 (rc-zhang)
  • Fix field selectors in controller #465 (wbuchwalter)
  • Run ks upgrade #464 (lluunn)
  • Fix owners file id #462 (lluunn)
  • Remove deprecated package retryutil #460 (ScorpioCPH)
  • Change test cluster to kubeflow-ci #459 (lluunn)
  • *: Remove APIExtension clientset #454 (gaocegege)
  • travis: Ignore generated code #453 (gaocegege)
  • Create PDB of TFReplicaSet for gang scheduling by kube-arbitrator #452 (mitake)
  • Add OWNERS file for dashboard #446 (wbuchwalter)
  • Make local release cross-platform + fix #445 (wbuchwalter)
  • Add proxying to front-end development server. #442 (wbuchwalter)
  • Fix dashboard + proxy incompatibility #441 (wbuchwalter)
  • change kubeflow.io to kubeflow.org #440 (Jimexist)
  • Remove unreachable code #434 (ScorpioCPH)
  • *: Remove type ContainerName #432 (gaocegege)
  • add boilerplate header for go file #431 (wackxu)
  • format the python files with yapf #429 (mitake)
  • clientset: Fix code which is changed manually #428 (gaocegege)
  • Delete Dockerfile to build a docker image to use for prow. #425 (jlewi)
  • Fix setup_cluster. #421 (jlewi)
  • Add ScorpioCPH as approver/reviewer #419 (ScorpioCPH)
  • Create resources (Services/Jobs) only once #418 (ScorpioCPH)
  • Dashboard: Dev Guide #417 (wbuchwalter)
  • Use logrus for structured logging #416 (ankushagarwal)
  • Create an initial OWNERS file. #414 (jlewi)
  • Docs should refer to Kubeflow user guide for deploying the TFJob operrator #412 (jlewi)
  • Run glide update to update glide.lock #410 (ankushagarwal)
  • Fix typo in Makefile #409 (ankushagarwal)
  • Add a field SchedulerName to TFJob for specifying a scheduler #408 (mitake)
  • Fix lint issues with python3 and a bug in lint script #405 (jlewi)
  • Support using our E2E workflow to build a Docker image for releases. #403 (jlewi)
  • add go 1.10 support in travis #402 (Jimexist)
  • use yapf to format python code #401 (Jimexist)
  • Fix bug with jobs not working if you recreate a job with same name as previous job #399 (jlewi)
  • Fixes go vet errors #397 (swiftdiaries)
  • Fixed-363: Rename cmd/tf_operator -> cmd/tf-operator #393 (AK-ayush)
  • README: Add community section and quick links #392 (gaocegege)
  • Remove TensorBoard related code in operator #391 (gaocegege)
  • Fix something after move to kubeflow/tf-operator #390 (sdf611097)
  • Add a prow_config.yaml file to configure our prow jobs. #388 (jlewi)
  • fix a typo in the README file. #387 (ChanYiLin)
  • *: Replace the repo name #386 (gaocegege)
  • travis: Add go build command #383 (gaocegege)
  • config.sh: Remove #381 (gaocegege)
  • Use ksonnet to easily define TFJobs to be run as tests #374 (jlewi)
  • Fix repo name env #372 (jose5918)
  • controller.go: Fix a glog typo #368 (gaocegege)
  • fix -version option: print version #367 (caogj)
  • *: Add copyright owner in go files #364 (gaocegege)
  • Fix local releaser #361 (jose5918)
  • nit: try to simplify e2e main.go #359 (Jimexist)
  • Use Argo rather than Airflow to run our E2E tests #358 (jlewi)
  • Add an option to release.py to specify the tag for the image to use. #357 (jlewi)
  • Fix helm test #356 (jose5918)
  • feat(group): Update CRD group to kubeflow.org #354 (gaocegege)
  • Deprecate the ENV MY_POD_NAME and use default namespace #348 (ScorpioCPH)
  • feat(crd): Separate CRD and controller #345 (gaocegege)
  • Create Pod instead of Job #344 (ScorpioCPH)
  • Deprecate IsDefaultPS in TFJob CRD API #343 (ScorpioCPH)
  • Update documentation #342 (jose5918)
  • feat(dashboard): Namespace handling #338 (wbuchwalter)
  • feat(dashboard): better error handling in dashboard code #336 (Jimexist)
  • Rename Tf to TF #332 (ScorpioCPH)
  • Delete binary file #331 (ScorpioCPH)
  • Take test failures into account when setting prow job status #319 (jlewi)
  • remove unused file rename.sh #316 (caogj)
  • add UpdateFunc to handle update events #313 (mqliang)
  • pkg: Add recorder support #312 (gaocegege)
  • Fix a bunch of problems in TfJob CRD that crept in while tests were broken #308 (jlewi)
  • replace TPR with CRD #307 (mqliang)
  • fix broken link #305 (caogj)
  • Fix python lint checks #303 (jlewi)
  • Fix setting defaults. #299 (jlewi)
  • Add service account name to dashboard if RBAC. #298 (ConnorDoyle)
  • The flag should be --controller-config-file. #295 (jlewi)
  • Fix the junit XML file format. #291 (jlewi)
  • *: Fix API Version #289 (gaocegege)
  • *: Implement the List interface for TfJobList #278 (gaocegege)
  • cmd: Fix the flag error caused by pflag #277 (gaocegege)
  • types.go: Fix CRDKind #276 (gaocegege)
  • Move around due to new directories layout #273 (ScorpioCPH)
  • bugfix: set faliures=true if failed deleting configmap #272 (mqliang)
  • Fix our continuous release process #271 (jlewi)
  • update initialClusterVersion to 1.7.11-gke.1 #269 (cwbeitel)
  • Misc Cleanup. #262 (jlewi)
  • Add proposed directories layout #261 (ScorpioCPH)
  • record event when tf_operator failover #260 (zjj2wry)
  • follow kubernetes flag convension #259 (zjj2wry)
  • refactor dashboard backend, use versioned tfjob clientset #258 (zjj2wry)
  • apply goimports -w to generated files #257 (Jimexist)
  • add gometaliner into travis build #254 (Jimexist)
  • fix(no-dup): reduce dup code in printVersion #253 (Jimexist)
  • Improve utilities for E2E tests. #251 (jlewi)
  • Fix leaking of clusters in E2E tests #80 #250 (jlewi)
  • feat(pipenv): Use pipenv to lock down python dependencies #248 (Jimexist)
  • fix(lint): add prop types and fix all eslint errors #246 (Jimexist)
  • refactor code and format imported package #245 (zjj2wry)
  • feat(lint): apply prettier to format frontend src/ code #244 (Jimexist)
  • feature(lint): use prettier and lint-staged for frontend javascript code #243 (Jimexist)
  • Fix issues with tf_job_gpu test #241 (jlewi)
  • Use the release/test python scripts pulled from the repo. #237 (jlewi)
  • Don't run glide install in travis builds. #236 (jlewi)
  • refactor the controller logic #234 (wackxu)
  • feat(coverage): add covealls support #232 (Jimexist)
  • use glide install --strip-vendor remove subpackage vendor #231 (zjj2wry)
  • update k8s dependency to stable version #230 (wackxu)
  • let tfJob image configurable #228 (zjj2wry)
  • remove todo, add gitSHA into version information #227 (zjj2wry)
  • controller.go: Fix a print error #226 (gaocegege)
  • replace tf-job-operator-config configmap when it already exist #225 (zjj2wry)
  • Add the vendor directory to the repository. #222 (zjj2wry)
  • allow using WORKER:0 as chief #221 (lluunn)
  • Fix issue with handling of json errors. #220 (jlewi)
  • Set state to failed if there is a problem initializing job #219 (jlewi)
  • On GKE mounting volumes should no longer be required for GPUs. #217 (jlewi)
  • update developer guide #216 (ddysher)
  • Refactor the TfJob to use K8s libraries #215 (wackxu)
  • Add a basic GPU job test as part of our E2E tests. #213 (jlewi)
  • minor spelling porxy => proxy #211 (cbockman)
  • Add terminationPolicy to TfJobSpec #204 (lluunn)
  • Split cloning the repo and building the images into two steps in our airflow pipeline #200 (jlewi)
  • Create separate commands to clone and build the repo #199 (jlewi)
  • Install yarn and nodejs inside the Airflow container. #198 (jlewi)
  • Update the Airflow deployment to use Docker images built from a clean tree #197 (jlewi)
  • Fix some cuda issues on Azure #194 (wbuchwalter)
  • Fixing front page documentation to have grpcServerFilePath #190 (hyperbolic2346)
  • Add an option to build Docker images with GCB. #187 (jlewi)
  • replace deprecated tf.initialize_all_variables #184 (DjangoPeng)
  • build_and_push.py: Support python3 #183 (gaocegege)
  • tf_job_design_doc: Fix the apiVersion #182 (gaocegege)
  • py: Add requirements.txt #180 (gaocegege)
  • resolve a merge conflict imported by commit ae8c31 #178 (DjangoPeng)
  • tf_job_design_doc.md: Fix a typo #177 (gaocegege)
  • Fix helm templates so that we don't require a configmap. #176 (jlewi)
  • replace Google and Golang repos with corresponding github repos #172 (DjangoPeng)
  • Stop hardcoding namespace for TfJob config map #169 (haitch)
  • Tooling to make it easier to run a bunch of TfJob tests. #168 (jlewi)
  • Run python lint and unittests as part of our E2E test pipeline #166 (jlewi)
  • A binary to run pylint and python unittests #163 (jlewi)
  • fix dev guide #162 (lluunn)
  • Integrate Airflow with Prow #158 (jlewi)
  • rename jlewi/mlkube.io in glide.yaml #153 (moon03432)
  • add Create(), Delete() in TfJobClient interface #152 (moon03432)
  • change jobname from task-runtimeid-index to jobname-task-runtimeid-index #151 (moon03432)
  • Create binaries to run steps in an E2E test pipeline. #148 (jlewi)
  • Fix a typo in the command line help. #147 (jlewi)
  • ignore too-many-locals. #146 (jlewi)
  • On RBAC clusters, test needs a service account with appropriate permissions #145 (jlewi)
  • Airflow pipeline to run our tests #144 (jlewi)
  • fix(*): amend the number of worker and ps in example yaml spec for a distributed job #142 (lienhua34)
  • fix a log issue #141 (moon03432)
  • rename clus to tfjob in controller.go #138 (moon03432)
  • rename InClusterConfig() to GetClusterConfig() #137 (moon03432)
  • Remove trailing slash of host #134 (ScorpioCPH)
  • Turn release.py into a binary to build the artifacts for all the different contexts #133 (jlewi)
  • Minor fix typo and redundancy #131 (ScorpioCPH)
  • Update developer_guide.md #129 (Jimexist)
  • Use K8s Garbage Collection #127 (jlewi)
  • Dashboard V1 #125 (wbuchwalter)
  • More verbose logging of resource deletion #124 (jlewi)
  • Fix rbac settings in chart. #123 (jlewi)
  • Fix issue in tpr_util.Delete() #121 (wbuchwalter)
  • Tag docker images with "latest". #119 (jlewi)
  • Update API group in the chart #117 (sozercan)
  • Helm instructions #111 (jlewi)
  • Name label #105 (jlewi)
  • Update helm install syntax in readme #104 (sozercan)
  • Change group to tensorflow.org and version to v1alpha1. #103 (jlewi)
  • [WIP] Notebook demonstrating use of TfJob on GKE #102 (jlewi)
  • Fix bugs in the release script. #100 (jlewi)
  • Fix bugs in the release script. #99 (jlewi)
  • Update release.py so we can run it continuously. #98 (jlewi)
  • Fix the E2E test by specifying cloud when deploying the helm package. #97 (jlewi)
  • Need to set environment to enable Estimators with TF <=1.3 #94 (jlewi)
  • Update README.md #92 (Jimexist)
  • Add python lint check to travis and fix python lint issues #91 (jlewi)
  • #71 Simplify accelerators config #90 (wbuchwalter)
  • Update test infrastructure to use repo tensorflow/k8s #87 (jlewi)
  • Create symbolic links in GCS to output of presubmit results. #79 (jlewi)
  • Fix periodic results (#76) #78 (jlewi)
  • Another attempt to fix periodic jobs. #77 (jlewi)
  • Fix location of the post submit results. #74 (jlewi)
  • Overhaul the documentation #73 (jlewi)
  • Release scripts #69 (jlewi)
  • Record latest green from postsubmit #66 (jlewi)
  • Fix presubmit jobs and periodic jobs #60 (jlewi)
  • Fix periodic test #56 (jlewi)
  • Updated chart with batch.jobs and extensions.deployments cluster roles #52 (sozercan)
  • Added RBAC support for tf-operator chart #51 (sozercan)
  • PR to test Prow presubmit integration. #50 (jlewi)
  • E2E test for the CRD #49 (jlewi)
  • Create configs for setting up Prow for continuous testing. #47 (jlewi)
  • Fix bug that prevents permanent errors from causing job failure. #44 (jlewi)
  • Always check for existing TfJobs and instantiate controllers for them. #43 (jlewi)
  • support multi namespaces #39 (loadwiki)
  • Use Jinja templates and a Python script to build example Docker images for examples #37 (jlewi)
  • Parameter Server: Run TF server by default #36 (wbuchwalter)
  • Set default values for Replicas, TfPort, TfReplicaType. #31 (jlewi)
  • Fix a couple bugs. #27 (jlewi)
  • [WIP] Update to CustomResourceDefinition instead of ThirdPartyResource. #20 (jlewi)
  • Update glide config. #18 (jlewi)
  • Add TensorBoard Integration #15 (wbuchwalter)
  • Changes to support CI using Travis. #14 (jlewi)
  • Add Environment Variables in Controller Config #12 (wbuchwalter)
  • Fix tests #11 (wbuchwalter)
  • Helm charts renaming #10 (wbuchwalter)
  • Simplify GPU configuration process. #9 (jlewi)
  • Fix build, add Glide for dependency management. #8 (wbuchwalter)
  • Update links in README.md #3 (wbuchwalter)
  • A more thorough E2E test. #2 (jlewi)

* This Changelog was automatically generated by github_changelog_generator