Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added PhiAccrualFailureDetector warning logging for slow heartbeats #4897

Conversation

Aaronontheweb
Copy link
Member

Ported from akka/akka#17389 and akka/akka#24701

Ideally, want to use this to help diagnose #4849 in busy systems that are generating large numbers of UNREACHABLE / REACHABLE warnings

Copy link
Contributor

@Arkatufus Arkatufus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Aaronontheweb Aaronontheweb enabled auto-merge (squash) April 5, 2021 16:55
@Aaronontheweb Aaronontheweb merged commit 4091508 into akkadotnet:dev Apr 5, 2021
@Aaronontheweb Aaronontheweb deleted the remote/add-phi-accrual-failure-detector-logging branch April 5, 2021 17:13
This was referenced Apr 28, 2021
Aaronontheweb added a commit that referenced this pull request Apr 28, 2021
* Added v1.4.19 placeholder

* close #4860 - use local deploy for TcpManager child actors. (#4862)

* close #4860 - use local deploy for TcpManager child actors.

* Use local deploy for TcpIncomingConnection.

* Use local deploy for Udp actors.

Co-authored-by: Erik Folstad <erikmafo@gmail.com>
Co-authored-by: Aaron Stannard <aaron@petabridge.com>

* Merge pull request #4875 from akkadotnet/dependabot/nuget/Hyperion-0.9.17

Bump Hyperion from 0.9.16 to 0.9.17

* Bump Newtonsoft.Json from 12.0.3 to 13.0.1 (#4866)

Bumps [Newtonsoft.Json](https://github.com/JamesNK/Newtonsoft.Json) from 12.0.3 to 13.0.1.
- [Release notes](https://github.com/JamesNK/Newtonsoft.Json/releases)
- [Commits](JamesNK/Newtonsoft.Json@12.0.3...13.0.1)

Signed-off-by: dependabot-preview[bot] <support@dependabot.com>

Co-authored-by: dependabot-preview[bot] <27856297+dependabot-preview[bot]@users.noreply.github.com>

* Fix ClusterMetricsExtensionSpec racy spec

* Clean up Akka.Stream file stream (#4874)

* Make sure that FileSubscriber shuts down cleanly when it dies

* Make sure that file all sink spec release the file handle if it fails

* Supress ActorSelectionMessage with DeadLetterSuppression (migrated from akka/akka#28341) (#4889)

* for example the Cluster InitJoin message is marked with DeadLetterSuppression
  but was anyway logged because sent with actorSelection
* for other WrappedMessage than ActorSelectionMessage we shouldn't unwrap and publish
  the inner in SuppressedDeadLetter because that might loose some information
* therefore those are silenced in the DeadLetterListener instead

Better deadLetter logging of wrapped messages (migrated from akka/akka#28253)

Logging of UnhandledMessage (migrated from akka/akka#28414)
* make use of the existing logging of dead letter
  also for UnhandledMessage

Add Dropped to Akka.Actor (migrated partially from akka/akka#27160)
Log Dropped from DeadLetterListener

* add CultureInfo for Turkish OS (#4880)

* add CultureInfo for Turkish OS

added English CultureInfo to fix ToUpper function causes error on Turkish OS. "warning"->"WARNİNG"

* fix LogLevel TR char error

Co-authored-by: Aaron Stannard <aaron@petabridge.com>

* Harden FileSink unit tests by using AwaitAssert to wait for file operations to complete (#4891)

* Harden FileSink unit tests by using AwaitAssert to wait for file operations to complete

* Use AwaitResult to improve readability

Co-authored-by: Aaron Stannard <aaron@petabridge.com>

* Handle CoordinatedShutdown exiting-completed when not joined (#4893)

* assertion failed: Nodes not part of cluster have marked the Gossip as seen
* trying to mark the Gossip as seen before it has joined, which may happen
  if CoordinatedShutdown is running before the node has joined

migrated from akka/akka#26835

* Persistence fixes (#4892)

* snapshot RecoveryTick ignored, part of akka/akka#20753

* lastSequenceNr should reflect the snapshot sequence and not start with 0 when journal is empty. Migrated from akka/akka#27496

* Enforce valid seqnr for deletes, migrated from akka/akka#25488

* api approval

* Added DoNotInherit annotation (#4896)

* Bump Microsoft.NET.Test.Sdk from 16.9.1 to 16.9.4 (#4894)

* Add Setup class for NewtonSoftJsonSerializer (#4890)

* Add Setup class for NewtonSoftJsonSerializer

* Use Setup as a settings modifier instead of a settings factory

* Update spec

* Update API Approval list

* Unit test can inject null ActorSystem into the serializer causing the Setup system to throw a NRE

* Add documentation

* fixed up copyright headers (#4898)

* Bump Google.Protobuf from 3.15.6 to 3.15.7 (#4900)

Bumps [Google.Protobuf](https://github.com/protocolbuffers/protobuf) from 3.15.6 to 3.15.7.
- [Release notes](https://github.com/protocolbuffers/protobuf/releases)
- [Changelog](https://github.com/protocolbuffers/protobuf/blob/master/generate_changelog.py)
- [Commits](protocolbuffers/protobuf@v3.15.6...v3.15.7)

Signed-off-by: dependabot-preview[bot] <support@dependabot.com>

Co-authored-by: dependabot-preview[bot] <27856297+dependabot-preview[bot]@users.noreply.github.com>

* Added PhiAccrualFailureDetector warning logging for slow heartbeats (#4897)

Ported from akka/akka#17389 and akka/akka#24701

* replace reflection magic in MNTR with reading of `MultiNodeConfig` properties (#4902)

* close #4901 - replace reflection magic in MNTR with reading of MultiNodeConfig properties

* fixed outdated DiscoverySpec

* fixed SBR logging error that blew up StandardOutLogger (#4909)

This format error would cause the StandardOutLogger to throw a `FormatException` internally

* added timestamp to node failures in MNTR (#4911)

* cleaned up the SpecPass / SpecFail messages (#4912)

* reduce allocations inside PhiAccrualFailureDetector (#4913)

made `HeartbeatHistory` into a `readonly struct` and cleaned up some other old LINQ calls inside the data structure

* Bump Microsoft.Data.SQLite from 5.0.4 to 5.0.5 (#4914)

* [MNTR] Add include and exclude test filter feature (#4916)

* Add -Dmultinode.include and -Dmultinode.exclude filter feature

* Add documentation

* Fix typos and makes sentences more readable

* Make the sample command line wrap instead of running off the screen

* Change include and exclude filtering by method name instead (requested)

* cleaned up RemoteWatcher (#4917)

* Fixed System.ArgumentNullException in Interspase operation on empty stream finish. (#4918)

* Rewrite the AkkaDiFixture so that it does not need to start a HostBuilder (#4920)

* Fix case where PersistenceMessageSerializer.FromBinary got a null for its type parameter (#4923)

* Bump Google.Protobuf from 3.15.7 to 3.15.8 (#4927)

Bumps [Google.Protobuf](https://github.com/protocolbuffers/protobuf) from 3.15.7 to 3.15.8.
- [Release notes](https://github.com/protocolbuffers/protobuf/releases)
- [Changelog](https://github.com/protocolbuffers/protobuf/blob/master/generate_changelog.py)
- [Commits](protocolbuffers/protobuf@v3.15.7...v3.15.8)

Signed-off-by: dependabot-preview[bot] <support@dependabot.com>

Co-authored-by: dependabot-preview[bot] <27856297+dependabot-preview[bot]@users.noreply.github.com>

* close #4096 - documented how to terminate remembered entities (#4928)

Updated the Akka.Cluster.Sharding documentation to explain how to terminate remembered-entities.

* Add CLI switches to show help and version number (#4925)

* cleaned up protobuf CLI and definitions (#4930)

* cleaned up protobuf CLI and definitions

- Remove all `optional` fields (not allowed in Protobuf3, as all fields are optional by default unless specifically defined as `required`)
- Removed `--experimental_allow_proto3_optional` call from `protoc` compiler as it's no longer supported / needed
- Doesn't have any impact on existing wire formats, especially for `ClusterMessages.proto` which is where I removed all of the `optional` commands

* fixed compilation error caused by change in generated `AppVersion` output

* Fix MNTK specs for DData: DurablePruningSpec (#4933)

* Powershell splits CLI arguments on "." before passing them into applications (#4924)

* porting Cluster heartbeat timings, hardened Akka.Cluster serialization (#4934)

* porting Cluster heartbeat timings, hardened Akka.Cluster serialization

port akka/akka#27281
port akka/akka#25183
port akka/akka#24625

* increased ClusterLogSpec join timespan

Increased the `TimeSpan` here to 10 seconds in order to prevent this spec from failing racily, since even an Akka.Cluster self-join can take more than the default 3 seconds due to some of the timings involved in node startup et al.

* Bump Hyperion from 0.9.17 to 0.10.0 (#4935)

Bumps [Hyperion](https://github.com/akkadotnet/Hyperion) from 0.9.17 to 0.10.0.
- [Release notes](https://github.com/akkadotnet/Hyperion/releases)
- [Changelog](https://github.com/akkadotnet/Hyperion/blob/dev/RELEASE_NOTES.md)
- [Commits](akkadotnet/Hyperion@0.9.17...0.10.0)

Signed-off-by: dependabot-preview[bot] <support@dependabot.com>

Co-authored-by: dependabot-preview[bot] <27856297+dependabot-preview[bot]@users.noreply.github.com>

* Add spec for handling delegates in DI (#4922)

* Add spec for handling delegates in DI

* Make sure spec exits cleanly by terminating the actor system.

* Add spec where singleton delegate is called from another actor

* Fix racy test

Co-authored-by: Aaron Stannard <aaron@petabridge.com>

* Bump FsCheck from 2.15.1 to 2.15.2 (#4939)

* ClusterStressSpec and Cluster Failure Detector Cleanup (#4940)

* implementation of Akka.Cluster.Tests.MultiNode.StressSpec

* made MuteLog overrideable in Akka.Cluster.Testkit

* if Roles is empty, then don't run the thunk on any nodes

Changed this to make it consistent with the JVM

* made it possible to actually enable Cluster.AssertInvariants via environment variable

* added assert invariants to build script

cleaned up gossip class to assert more invariants

* ReSharper'd Reachability.cs

* cleaned up immutability and CAS issues inside DefaultFailureDetectorRegistry

added bugfix from akka/akka#23595

* Bump FsCheck.Xunit from 2.15.1 to 2.15.2 (#4938)

Bumps [FsCheck.Xunit](https://github.com/fsharp/FsCheck) from 2.15.1 to 2.15.2.
- [Release notes](https://github.com/fsharp/FsCheck/releases)
- [Changelog](https://github.com/fscheck/FsCheck/blob/master/FsCheck%20Release%20Notes.md)
- [Commits](fscheck/FsCheck@2.15.1...2.15.2)

Signed-off-by: dependabot-preview[bot] <support@dependabot.com>

Co-authored-by: dependabot-preview[bot] <27856297+dependabot-preview[bot]@users.noreply.github.com>

* cleanup `AKKA_CLUSTER_ASSERT` environment variable (#4942)

Per some of the suggestions on #4940 PR review

* harden Akka.DependencyInjection.Tests (#4945)

Added some `AwaitAssert` calls to check for disposed dependencies - these calls can be racy due to background actor thread calling `Dipose` after foreground test thread checks the `Disposed` property.

* HeartbeatNodeRing performance (#4943)

* added benchmark for HeartbeatNodeRing performance

* switched to local function

No perf change

* approve Akka.Benchmarks friend assembly for Akka.Cluster

* remove HeartbeatNodeRing.NodeRing() allocation and make field immutable

* made it so Akka.Util.Internal.ArrayExtensions.From no longer allocates (much)

* added some descriptive comments on HeartbeatNodeRing.Receivers

* Replaced `Lazy<T>` with `Option<T>` and a similar lazy initialization check

 Improved throughput by ~10% on larger collections and further reduced memory allocation.

* changed return types to `IImmutableSet`

Did this in order to reduce allocations from constantly converting back and forth from `ImmutableSortedSet<T>` and `ImmutableHashSet<T>` - that way we can just use whatever the underlying collection type is.

* added ReachabilityBenchmarks

* modified PingPong / RemotePingPong benchmarks to display threadcount (#4947)

Using this to gauge the impact certain dispatcher changes have on the total number of active threads per-process

* Configure duration for applying `MemberStatus.WeaklyUp`  to joining nodes (#4946)

* Configure duration for applying `MemberStatus.WeaklyUp`  to joining nodes

port of akka/akka#29665

* fixed validation check for TimeSpan duration passed in via HOCON

* harden ClusterLogSpecs

* restored Akka.Cluster model-based FsCheck specs (#4949)

* added `VectorClock` benchmark (#4950)

* added VectorClock benchmark

* fixed broken benchmark comparisons

* Performance optimize `VectorClock` (#4952)

* Performance optimize `VectorClock`

* don't cache MD5, but dispose of it

* guarantee disposal of iterators during VectorClock.Compare

* switch to local function for `VectorClock.CompareNext`

* fixed a comparison bug in how versions where compared

* minor cleanup

* replace `KeyValuePair<TKey,TValue>` with `ValueTuple<TKey,TValue>`

Reduced allocations by 90%, decreased execution time from 100ms to ~40ms

* harden RestartFirstSeedNodeSpec (#4954)

* harden RestartFirstSeedNodeSpec

* validate that we have complete seed node list prior to test

* Turned `HeatbeatNodeRing` into `struct` (#4944)

* added benchmark for HeartbeatNodeRing performance

* switched to local function

No perf change

* approve Akka.Benchmarks friend assembly for Akka.Cluster

* remove HeartbeatNodeRing.NodeRing() allocation and make field immutable

* made it so Akka.Util.Internal.ArrayExtensions.From no longer allocates (much)

* added some descriptive comments on HeartbeatNodeRing.Receivers

* Replaced `Lazy<T>` with `Option<T>` and a similar lazy initialization check

 Improved throughput by ~10% on larger collections and further reduced memory allocation.

* changed return types to `IImmutableSet`

Did this in order to reduce allocations from constantly converting back and forth from `ImmutableSortedSet<T>` and `ImmutableHashSet<T>` - that way we can just use whatever the underlying collection type is.

* converted `HeartbeatNodeRing` into a `struct`

improved performance some, but I don't want to lump it in with other changes just in case

* Add generalized crossplatform support for Hyperion serializer. (#4878)

* Add the groundwork for generalized crossplatform support.

* Update Hyperion to 0.10.0

* Convert adapter class to lambda object

* Add HyperionSerializerSetup setup class

* Add unit test spec

* Improve specs, add comments

* Add documentation

* Add copyright header.

* Change readonly fields to readonly properties.

* cleaned up `ReceiveActor` documentation (#4958)

* removed confusing and conflicting examples in the `ReceiveActor` documentation
* Embedded reference to "how actors restart" YouTube video in supervision docs

* updated website footer to read 2021 (#4959)

* added indicator for `ClusterResultsAggregator` in `StressSpec` logs (#4960)

Did this to make it easier to search for output logs produced during each phase of the `StressSpec`

* Bump Hyperion from 0.10.0 to 0.10.1 (#4957)

* Bump Hyperion from 0.10.0 to 0.10.1

Bumps [Hyperion](https://github.com/akkadotnet/Hyperion) from 0.10.0 to 0.10.1.
- [Release notes](https://github.com/akkadotnet/Hyperion/releases)
- [Changelog](https://github.com/akkadotnet/Hyperion/blob/dev/RELEASE_NOTES.md)
- [Commits](akkadotnet/Hyperion@0.10.0...0.10.1)

Signed-off-by: dependabot-preview[bot] <support@dependabot.com>

* Fix dependabot Hyperion issue (#4961)

* Update Akka.Remote.Tests.csproj to use common.props

* Update HyperionSerializer to reflect recent hyperion changes

Co-authored-by: dependabot-preview[bot] <27856297+dependabot-preview[bot]@users.noreply.github.com>
Co-authored-by: Gregorius Soedharmo <arkatufus@yahoo.com>

* Perf optimize `ActorSelection` (#4962)

* added memory metrics to `ActorSelection` benchmarks

* added ActorSelection benchmark

* ramped up the iteration counts

* validate that double wildcard can't be used outside of leaf node

* improve allocations on create

* minor cleanup

* create emptyRef only when needed via local function

* made `Iterator` into `struct`

* approved public API changes

* `Reachability` performance optimziation (#4955)

* reduced iteration count to speed up benchmarks

* optimize some System.Collections.Immutable invocations to allocate less

* cleanup dictionary construction

* fixed multiple enumeration bug in `Reachability`

* Fix `SpawnActor` benchmark (#4966)

* cleaned up SpawnActorBenchmarks

* cleaned up SpawnActor benchmarks

* fixed N-1 error inside `Mailbox` (#4964)

This error has no impact on extremely busy actors, but for actors who have to process small bursts of messages this can make the difference between getting everything done in one dispatch vs. doing it in two.

* Clean up bad outbound ACKs in Akka.Remote (#4963)

port of akka/akka#20093

Might be responsible for some quarantines in Akka.Cluster / Akka.Remote when nodes are restarting on identical addresses.

* UnfoldResourceSource closing twice on failure (#4969)

* Added test cases where close would be called twice

* Bugfix UnfoldResource closed resource twice on failure

* Add retry pattern with delay calculation support (#4895)

Co-authored-by: Aaron Stannard <aaron@petabridge.com>

* simplified the environment variable name for StressSpec (#4972)

* Refactored `Gossip` into `MembershipState` (#4968)

* refactor Gossip class into `MembershipState`

port of akka/akka#23291

* completed `MembershipState` port

* fixed some downed observers calls

* forgot to copy gossip upon `Welcome` from Leader

* forgot to copy `MembershipState` while calling `UpdateLatestGossip`

* refactored all DOWN-ing logic to live inside `Gossip` class

* added some additional methods onto `MembershipState`

* fixed ValidNodeForGossip bug

* fixed equality check for Reachability

 should be quality by reference, not by value

Co-authored-by: Gregorius Soedharmo <arkatufus@yahoo.com>

* Fix serialization verification problem with Akka.IO messages (#4974)

* Fix serialization verification problem with Akka.IO messages

* Wrap naked SocketAsyncEventArgs in a struct that inherits INoSerializationVerificationNeeded

* Make the wrapper struct readonly

* Expand exception message with their actor types

* Update API Approver list

* ORDictionary with POCO value missing items, add MultiNode spec (#4910)

* Update DData reference HOCON config to follow JVM

* Clean up ReplicatorSettings, add sensible default values.

* Add DurableData spec that uses ORDictionary with POCO values

* Add a special case for Replicator to suppress messages during load

* Slight change in public API, behaviour is identical.

* Change Replicator class to UntypedActor

* Code cleanup - Change fields to properties

* Update API Approver list

* clean up seed node process (#4975)

* fixed racy ActorModelSpec (#4976)

fixed `A_dispatcher_must_handle_queuing_from_multiple_threads` - we were using the wrong message type the entire time, and the previous instance caused `Thread.Sleep` to be called repeatedly.

* Update PluginSpec so that it can accept ActorSystem and ActorSystemSetup in its constructor (#4978)

* Removed inaccurate warning from Cluster Singleton docs (#4980)

Cluster singletons won't create duplicates in a cluster split scenario, and it's much safer to run them _with_ a split brain resolver on than without. This documentation was just out of date.

* Introduce `ChannelExecutor` (#4882)

* added `ChannelExecutor` dispatcher - uses `FixedConcurrencyTaskScheduler` internally - have `ActorSystem.Create` call `ThreadPool.SetMinThreads(0,0)` to improve performance across the board.

* fixed documentation errors

* Upgrade to GitHub-native Dependabot (#4984)

Co-authored-by: dependabot-preview[bot] <27856297+dependabot-preview[bot]@users.noreply.github.com>

* added v1.4.19 release notes (#4985)

* added v1.4.19 release notes

Co-authored-by: Erik Følstad <32196030+ef-computas@users.noreply.github.com>
Co-authored-by: Erik Folstad <erikmafo@gmail.com>
Co-authored-by: dependabot-preview[bot] <27856297+dependabot-preview[bot]@users.noreply.github.com>
Co-authored-by: Igor <igor@fedchenko.pro>
Co-authored-by: Gregorius Soedharmo <arkatufus@yahoo.com>
Co-authored-by: zbynek001 <zbynek001@gmail.com>
Co-authored-by: Cagatay YILDIZOGLU <yildizoglu@gmail.com>
Co-authored-by: Ismael Hamed <1279846+ismaelhamed@users.noreply.github.com>
Co-authored-by: Anton V. Ilyin <iliyn.anton.v@gmail.com>
Co-authored-by: Arjen Smits <Deathraven163@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants