add computation state cache, config loaders, and status pages #31133
Some initial comments, didn't get through everything
sampler,
metricTrackingWindmillServer::refreshActiveWork,
executorSupplier.apply("RefreshWork"));

WorkerStatusPages workerStatusPages =
    WorkerStatusPages.create(DEFAULT_STATUS_PORT, memoryMonitor, () -> true);
can this last parameter be made optional via overload? or at least add a comment here
done. overloaded
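A minimal sketch of the overload discussed above. `StatusPagesSketch` is a hypothetical stand-in, not the real `WorkerStatusPages` class; the `BooleanSupplier` type for the last parameter is an assumption, and the `memoryMonitor` argument is left out for brevity.

```java
import java.util.function.BooleanSupplier;

// Hypothetical stand-in for WorkerStatusPages; names and types are assumptions.
final class StatusPagesSketch {
  private final int port;
  private final BooleanSupplier healthyIndicator;

  private StatusPagesSketch(int port, BooleanSupplier healthyIndicator) {
    this.port = port;
    this.healthyIndicator = healthyIndicator;
  }

  // Full factory: callers that care about health reporting pass their own check.
  static StatusPagesSketch create(int port, BooleanSupplier healthyIndicator) {
    return new StatusPagesSketch(port, healthyIndicator);
  }

  // Overload: defaults to "always healthy" so call sites need not pass () -> true.
  static StatusPagesSketch create(int port) {
    return create(port, () -> true);
  }

  int port() {
    return port;
  }

  boolean isHealthy() {
    return healthyIndicator.getAsBoolean();
  }
}
```

With the overload in place, a call site that does not care about the health check can drop the trailing `() -> true` argument.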
WorkerStatusPages.create(DEFAULT_STATUS_PORT, memoryMonitor, () -> true);
this.statusPages =
    windmillServiceEnabled
        ? StreamingWorkerStatusPages.forStreamingEngine(
would a builder pattern be better than the separate forStreamingEngine and forAppliance factory methods?
There seem to be a lot of common things.
done
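A rough sketch of the builder idea raised above, in place of separate `forStreamingEngine` and `forAppliance` factories. The class, its fields, and the default port are hypothetical and chosen only for illustration; the real `StreamingWorkerStatusPages` has different members.

```java
import java.util.Optional;

// Hypothetical, simplified stand-in for StreamingWorkerStatusPages.
final class StatusPagesConfigSketch {
  private final int port;
  private final Optional<String> streamingEngineEndpoint;

  private StatusPagesConfigSketch(Builder builder) {
    this.port = builder.port;
    this.streamingEngineEndpoint = builder.streamingEngineEndpoint;
  }

  static Builder builder() {
    return new Builder();
  }

  int port() {
    return port;
  }

  // Present only when the worker talks to Streaming Engine.
  Optional<String> streamingEngineEndpoint() {
    return streamingEngineEndpoint;
  }

  static final class Builder {
    private int port = 8081; // assumed default, for illustration only
    private Optional<String> streamingEngineEndpoint = Optional.empty();

    Builder setPort(int port) {
      this.port = port;
      return this;
    }

    Builder setStreamingEngineEndpoint(String endpoint) {
      this.streamingEngineEndpoint = Optional.of(endpoint);
      return this;
    }

    StatusPagesConfigSketch build() {
      return new StatusPagesConfigSketch(this);
    }
  }
}
```

The common fields are set once through the builder, and the Streaming Engine path only adds the setters that differ from the appliance path.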
WindmillServerStub windmillServer = createWindmillServerStub(options, windmillStreamFactory);
ComputationConfig.Fetcher configFetcher =
    options.isEnableStreamingEngine()
        ? StreamingEngineConfigFetcher.forTesting(
forTesting isn't a good name if this is in the regular path
removed; this was supposed to just use the create method
ComputationConfig.Fetcher configFetcher =
    options.isEnableStreamingEngine()
        ? StreamingEngineConfigFetcher.forTesting(
            true,
rm param if hard-coded
it is used in a different test
.setWindmillServiceEndpoints(ImmutableSet.of());
}

public static StreamingPipelineConfig forAppliance(
can we just get rid of this and have the appliance config use builder() and set these?
done
maxWorkItemCommitBytes))
: new StreamingApplianceConfigFetcher(
    windmillServer,
    config -> consumeUserStepToStateFamilyName(config, stateNameMap),
seems like the method consuming the config could be the same for both SE and appliance, where unset fields are just ignored
removed
ComputationStateCache computationStateCache =
    ComputationStateCache.create(
        configFetcher, workExecutor, windmillStateCache::forComputation);
if (windmillServer instanceof GrpcWindmillServer) {
this cast and lazy initialization is kind of gross
one idea could be to remove the windmillserver from the appliance config fetcher; it could just construct its own channel and sync stub directly.
and then have a separate start method on the configfetcher taking the function to consume responses. Then you can make sure to call that after everything is initialized.
sgtm agreed
i think the issue would be if it is using the WindmillServerBase implementation of WindmillServerStub, which uses JNI
i will try to find a way around this
i actually realized that appliance does not refresh heartbeats which is what that consumer of heartbeat responses is for.
we only call refresh active work on get data stream if streaming engine is enabled.
done
i changed the way this was created, still not super clean but much better than before.
I wonder if we can make GrpcWindmillServer not implement WindmillServerStub. Not sure it currently makes sense to group together Appliance and Engine client implementations, and it leads to wonky situations like above.
failing test is unrelated and passes on local runs
.collect(Collectors.toList()),
workExecutor,
stateCache::forComputation);
computationStateCacheRef.set(computationStateCache);
how about a VisibleForTesting annotated method to access the ComputationStateCache instead? Seems convoluted to pass in atomic ref to assign to.
done
60,
TimeUnit.SECONDS);
scheduledExecutors.add(statusPageTimer);
}
workCommitter.start();
workerStatusReporter.start();
activeWorkRefresher.start();
}

public void startStatusPages() {
can this be inlined into start() and removed?
today we don't start them for tests (same across batch and streaming harnesses)
Add a comment that this may be omitted for lighter-weight testing?
added
Collection<LatencyAttribution> getWorkStreamLatencies) ->
    computationStateCache
        .get(computation)
        .ifPresent(
what happened before on missing computation? In either case it seems like we should throw exception or log error as otherwise we're just dropping the work item silently which will be confusing.
added more logging in the ComputationStateCache
public Optional<ComputationState> get(String computationId) {
  try {
    return Optional.ofNullable(computationCache.get(computationId));
  } catch (ExecutionException | ComputationStateNotFoundException e) {
    if (e.getCause() instanceof ComputationStateNotFoundException) {
      LOG.error(
          "Trying to fetch unknown computation={}, known computations are {}.",
          computationId,
          getAllComputationIds());
    } else {
      LOG.warn("Error occurred fetching computation for computationId={}", computationId, e);
    }
  }
  // Either failure is logged above; the caller sees an empty result.
  return Optional.empty();
}
/** Fetches computation config from Streaming Appliance. */
@Internal
@ThreadSafe
public final class StreamingApplianceConfigFetcher implements ComputationConfig.Fetcher {
should this be StreamingApplianceComputationConfigFetcher? ditto with SE
done
Consumer<StreamingPipelineConfig> onPipelineConfig,
Function<MapTask, MapTask> fixMapTaskMultiOutputInfoFn) {
this.windmillServer = windmillServer;
this.onPipelineConfig = onPipelineConfig;
seems odd to have this listener hidden in the fetcher
Maybe instead the listener should be on whatever is driving the fetching. I think that would be the cache.
Additionally I think that the FIX_MULTI_OUTPUT_INFOS_ON_PAR_DO_INSTRUCTIONS could move into the cache where fetches are being performed. As is, it is injected into all the fetchers and applied to the testing generated configs. That would let you just do it once and possibly could remove it from StreamingDataflowWorker to the cache
done and moved
changed it to be a member instead of static since we need the static global id generator from StreamingDataflowWorker.
done
!computationState.getTransformUserNameToStateFamily().isEmpty()
    ? computationState.getTransformUserNameToStateFamily()
    : stateNameMap,
// !computationState.getTransformUserNameToStateFamily().isEmpty()
rm comments
removed
ConcurrentMap<String, String> stateNameMap) {
Function<MapTask, MapTask> fixMultiOutputInfosOnParDoInstructions =
    new FixMultiOutputInfosOnParDoInstructions(idGenerator);
return new ComputationStateCache(
Could this instead just use create and then poke in to add initial state name map values?
via
cache = create(...);
cache.stateNameMap.putAll(stateNameMap);
It's a fair bit of setup to duplicate and it could drift, meaning we're not testing the main code path.
Or you could further reduce duplication of the parameters of the create/forTesting methods and just expose a VisibleForTesting method for the stateNameMap, and the current caller of this method could use that directly.
done
}

/** Returns a read-only view of all computations. */
public ImmutableList<ComputationState> getAllComputations() {
should this be getAllPresentComputations?
done
"computationId is empty. Cannot fetch computation config without a computationId."); | ||

GetConfigResponse response =
    windmillServer.getConfig(
instead of taking in WindmillServer can you take in a functional interface matching this method?
That will help show that the full class isn't used and could simplify testing. It could help in the future break up windmillServer if we get rid of the jni class keeping it all tied together ATM.
done
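A minimal sketch of the suggestion above: the fetcher depends on a function matching `getConfig` rather than on the whole `WindmillServer`. Everything here is hypothetical; in particular `String` stands in for the `GetConfigRequest`/`GetConfigResponse` types, and the class name is made up.

```java
import java.util.Optional;
import java.util.function.Function;

// Hypothetical sketch; String replaces the real request and response types.
final class ApplianceConfigFetcherSketch {
  private final Function<String, String> getConfigFn;

  ApplianceConfigFetcherSketch(Function<String, String> getConfigFn) {
    this.getConfigFn = getConfigFn;
  }

  Optional<String> fetchConfig(String computationId) {
    if (computationId.isEmpty()) {
      throw new IllegalArgumentException(
          "computationId is empty. Cannot fetch computation config without a computationId.");
    }
    // A null response is treated as "no config available".
    return Optional.ofNullable(getConfigFn.apply(computationId));
  }
}
```

Production code would presumably pass a method reference such as `windmillServer::getConfig`, while a test can pass a lambda and avoid constructing a server at all.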
default void stop() {}

Optional<ComputationConfig> getConfig(String computationId);
let's call this fetchConfig, it makes it clearer that it is likely an rpc (and matches class name)
done
done
private Optional<StreamingEnginePipelineConfig> getComputationConfigInternal(
    String computationId) {
Optional<StreamingEnginePipelineConfig> streamingConfig = getConfigInternal(computationId);
streamingConfig.ifPresent(onStreamingConfig);
can we remove this listening from the computation config fetching? It seems like it should just be the global config from the background thread that runs onStreamingConfig.
done
LOG.info("Initial global configuration received, harness is now ready");
}

private Optional<StreamingEnginePipelineConfig> getComputationConfigInternal(
I think this should just return the computationconfig not full StreamingEnginePipelineConfig. I think that StreamingEnginePipelineConfig can have the computation config removed from it, as we don't care about listening to computations as they are fetched.
That will then make it clearer that we only want the listener to trigger on the periodic background fetching, not the computation fetching driven by the cache.
done
public abstract Map<String, String> userStepToStateFamilyNameMap();

public abstract Optional<StreamingComputationConfig> computationConfig();
see other comment, I think this class should just be for the global config, not related to a computation.
done
return getConfigInternal(null);
}

private Optional<StreamingEnginePipelineConfig> getConfigInternal(@Nullable String computation) {
probably need to change this to some template if we're returning different types for global or per-computation config
done
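One way the "template" suggestion could look, as a sketch only: a single internal fetch that is generic in its result type and takes a parser for either the global or the per-computation config. The class, the `fetchRaw` helper, and the use of `String` as the raw payload are all assumptions made for illustration.

```java
import java.util.Optional;
import java.util.function.Function;

// Hypothetical sketch of a generic internal fetch shared by both config kinds.
final class GenericConfigFetchSketch {
  // Stand-in for the RPC: a null computation id means "global config".
  private static String fetchRaw(String computationIdOrNull) {
    return computationIdOrNull == null ? "global" : "computation:" + computationIdOrNull;
  }

  // The caller chooses the result type by supplying the parser.
  static <T> Optional<T> getConfigInternal(
      String computationIdOrNull, Function<String, T> parseFn) {
    String raw = fetchRaw(computationIdOrNull);
    return Optional.ofNullable(raw).map(parseFn);
  }
}
```

The global path and the per-computation path then differ only in the parser they pass, so neither has to return a type that carries the other's fields.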
Assigning reviewers. If you would like to opt out of this review, comment R: @riteshghorse added as fallback since no labels match configuration Available commands:
The PR bot will only process comments in the main thread (not review comments).
Looking pretty good, mostly just some test comments
@@ -54,6 +56,7 @@ public final class ComputationStateCache implements StatusDataProvider {

private static final Logger LOG = LoggerFactory.getLogger(ComputationStateCache.class);

private final ConcurrentMap<String, String> globalUsernameToStateFamilyNameMap;
nit: global has connotations and makes it sound like static global, how about pipelineUserNameToStateFamilyNameMap
done
}
}

private static Optional<StreamingEnginePipelineConfig> createPipelineConfig(
don't think this needs to return optional
done
localRetryTimeoutMs);
}

private StreamingDataflowWorker makeWorker(
there are 6 makeWorkers; it seems some autovalue builder in this test would make it more readable, with less duplication of the defaults for various params.
done; used a builder with some test parameters
private ComputationStateCache computationStateCache;

private static Work createWork(long workToken, long cacheToken) {

nit: rm blank line
done
...c/test/java/org/apache/beam/runners/dataflow/worker/streaming/ComputationStateCacheTest.java
...am/runners/dataflow/worker/streaming/config/StreamingEngineComputationConfigFetcherTest.java
asyncStartConfigLoader.start();
numExpectedRefreshes.await();
asyncStartConfigLoader.join();
assertThat(receivedPipelineConfig)
seems safer to stop the fetcher's background thread before accessing received config (or use a concurrent set)
done
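A small sketch of the pattern suggested above, combining both options: the background thread writes into a concurrent set, and the caller joins the thread before reading. The class and method names are hypothetical and the "configs" are plain strings; this is not the test's actual fetcher.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

// Hypothetical sketch of reading results only after the background thread is done.
final class BackgroundFetchSketch {
  static Set<String> fetchAll(int numConfigs) {
    // Concurrent set: safe even if a read happened while the thread was still writing.
    Set<String> received = ConcurrentHashMap.newKeySet();
    CountDownLatch expectedRefreshes = new CountDownLatch(numConfigs);
    Thread fetcher =
        new Thread(
            () -> {
              for (int i = 0; i < numConfigs; i++) {
                received.add("config-" + i);
                expectedRefreshes.countDown();
              }
            });
    fetcher.start();
    try {
      expectedRefreshes.await();
      // Stop (here: join) the background thread before inspecting what it produced.
      fetcher.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return received;
  }
}
```

Joining first means the assertion runs against a set that can no longer change, which removes the race the comment is worried about.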
when(mockDataflowServiceClient.getGlobalStreamingConfigWorkItem())
    .thenReturn(Optional.of(firstConfig))
    .thenReturn(Optional.of(secondConfig))
    .thenReturn(Optional.of(thirdConfig));
will there be a flaky error if a 4th fetch comes in before the background thread stops? should some sort of default response be set up?
done
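To illustrate the "default response" idea without depending on a mocking library, here is a hand-rolled stub sketch: it hands out the scripted responses in order and then falls back to an empty result for any extra fetch. `ScriptedResponses` is a hypothetical helper, not something from the PR, and the actual test uses Mockito stubbing instead.

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Optional;
import java.util.Queue;
import java.util.function.Supplier;

// Hypothetical stub: scripted responses first, then a stable default (empty).
final class ScriptedResponses<T> implements Supplier<Optional<T>> {
  private final Queue<T> remaining;

  ScriptedResponses(List<T> responses) {
    this.remaining = new ArrayDeque<>(responses);
  }

  // Synchronized because a background thread may call this while the test thread does.
  @Override
  public synchronized Optional<T> get() {
    // poll() returns null once the script is exhausted, which maps to Optional.empty().
    return Optional.ofNullable(remaining.poll());
  }
}
```

With a defined fallback, a fourth or fifth fetch from the background thread returns an empty result instead of something unspecified.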
…tructor StreamingDataflowWorker instance
Just final couple nits. I'm also running some dataflow specific tests to verify
defaultWorkerParams()
    .setInstructions(instructions)
    .publishCounters()
    .setOptions(createTestingPipelineOptions())
can remove setOptions(createTestingPipelineOptions()) since it's the default
removed
@@ -3203,12 +3145,21 @@ public void testExceptionInvalidatesCache() throws Exception {
WindowingStrategy.globalDefault()),
makeSinkInstruction(StringUtf8Coder.of(), 1, GlobalWindow.Coder.INSTANCE));

defaultWorkerParams()
rm, not used
done
StreamingDataflowWorker worker = makeWorker(instructions, options, true /* publishCounters */);
StreamingDataflowWorker worker =
    makeWorker(
        defaultWorkerParams()
does passing in "--activeWorkRefreshPeriodMillis=100" let you get rid of options?
done
StreamingDataflowWorker worker = makeWorker(instructions, options, true /* publishCounters */);
StreamingDataflowWorker worker =
    makeWorker(
        defaultWorkerParams()
ditto, here and below
done
computationState.activateWork(shardedKey, work1);
computationState.activateWork(shardedKey, work2);
computationState.activateWork(shardedKey, work3);

// Activate 3 Work(s) for computationId2
Optional<ComputationState> maybeComputationState2 = computationStateCache.get(computationId);
nit: how about just activating work1 on computationState and work2, work3 on computationState2. Then we're not deviating from normal usage of work (it shouldn't be activated on multiple/different computations) in case it causes issues down the line.
done
Add computation state cache and config loaders that load the computation state cache
move status pages out of StreamingDataflowWorker file
R: @scwhittle