
Context deadline exceeded when starting go-relayer with an Osmosis node started more than 1 hr ago #1389

jununifi opened this issue Jan 30, 2024 · 14 comments

This issue seems to be specific to Osmosis nodes; we have not observed the same problem with other nodes.

Steps to reproduce:

  • Start an Osmosis node and keep it running for at least 1 hour
  • Start go-relayer, connecting to the Osmosis node
Jan 29 16:40:01 vultr rly[191642]: ts=2024-01-29T16:40:01.948613Z lvl=warn msg="Relayer start error" error="error querying channels: post failed: Post \"http://127.0.0.1:26657\": context deadline exceeded"
Jan 29 16:40:01 vultr rly[191642]: Error: error querying channels: post failed: Post "http://127.0.0.1:26657": context deadline exceeded

The only way we found to fix this was to restart the Osmosis node and then start go-relayer again.
The problem is that once go-relayer stops, it cannot restart automatically without an Osmosis node restart.
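
For illustration only (this is not the relayer's code), a minimal Go sketch of how a POST to the node's RPC endpoint, bounded by a context deadline, surfaces this kind of error when the node is slow to respond; the 5-second budget and the JSON-RPC status request are assumptions made for the example.

    package main

    import (
        "context"
        "fmt"
        "net/http"
        "strings"
        "time"
    )

    func main() {
        // Hypothetical 5s budget for the request; the relayer's real timeout differs.
        ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
        defer cancel()

        // A JSON-RPC status request against the local CometBFT RPC endpoint.
        body := strings.NewReader(`{"jsonrpc":"2.0","method":"status","id":1}`)
        req, err := http.NewRequestWithContext(ctx, http.MethodPost, "http://127.0.0.1:26657", body)
        if err != nil {
            panic(err)
        }
        req.Header.Set("Content-Type", "application/json")

        // If the node takes longer than the deadline to answer, err wraps
        // context.DeadlineExceeded and prints as:
        //   Post "http://127.0.0.1:26657": context deadline exceeded
        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            fmt.Println("query failed:", err)
            return
        }
        defer resp.Body.Close()
        fmt.Println("node responded:", resp.Status)
    }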

$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda2       600G  279G  296G  49% /

$ free -m
               total        used        free      shared  buff/cache   available
Mem:           32086       11559         586           0       19940       20060
Swap:           8191        2967        5224

$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         40 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  8
  On-line CPU(s) list:   0-7
Vendor ID:               GenuineIntel
  Model name:            Intel Core Processor (Broadwell, no TSX, IBRS)
    CPU family:          6
    Model:               61
    Thread(s) per core:  2
    Core(s) per socket:  4
    Socket(s):           1
    Stepping:            2
    BogoMIPS:            4788.90
jtieri (Member) commented Jan 30, 2024

Thanks for opening the issue @jununifi.

Can you share what version of Osmosis you are using, as well as the relayer version?
It may also be helpful if we can see how much memory the Osmosis node is using at startup when the relayer seems to work fine, as well as the memory consumption of the Osmosis node at the time you start to see this issue.

danbryan (Contributor) commented

In the past, I have had luck changing this setting in config.toml when dealing with context deadline exceeded errors:

    [rpc]
    timeout_broadcast_tx_commit = "120s"

jununifi (Author) commented Feb 1, 2024

> Thanks for opening the issue @jununifi.
>
> Can you share what version of Osmosis you are using, as well as the relayer version? It may also be helpful if we can see how much memory the Osmosis node is using at startup when the relayer seems to work fine, as well as the memory consumption of the Osmosis node at the time you start to see this issue.

At the moment we are using Osmosis v22.0.1, but this also used to happen on v20.x.x.
Could the context deadline exceeded issue be related to memory consumption?

jununifi (Author) commented Feb 1, 2024

> In the past, I have had luck changing this setting in config.toml when dealing with context deadline exceeded errors:
>
>     [rpc]
>     timeout_broadcast_tx_commit = "120s"

Thanks for the information; we will try updating the configuration. We are using go-relayer v2.3.1.

jtieri (Member) commented Feb 1, 2024

> > Thanks for opening the issue @jununifi.
> > Can you share what version of Osmosis you are using, as well as the relayer version? It may also be helpful if we can see how much memory the Osmosis node is using at startup when the relayer seems to work fine, as well as the memory consumption of the Osmosis node at the time you start to see this issue.
>
> At the moment we are using Osmosis v22.0.1, but this also used to happen on v20.x.x. Could the context deadline exceeded issue be related to memory consumption?

Thanks for the additional information! My coworker had the thought that, since the issue only occurs after the Osmosis node has been running for some time, perhaps there is a memory leak in the Osmosis binary that causes a longer-than-usual response when the relayer attempts to query the channels on startup.

dylanschultzie commented

I experienced this same issue, but with cosmoshub rather than osmosis.

One thing I'd argue is that this shouldn't close down the rly process. Stop attempting to run against that node, sure, but why kill the entire process which has other working channels?

jtieri (Member) commented Feb 1, 2024

> I experienced this same issue, but with cosmoshub rather than osmosis.
>
> One thing I'd argue is that this shouldn't close down the rly process. Stop attempting to run against that node, sure, but why kill the entire process which has other working channels?

interesting 🤔

I do agree that the current behavior is a bit extreme. I think addressing #1268 by only killing the PathProcessor related to the problematic node is a much better design, and it should be something we address in the next release. Implementing support for configuring multiple RPC endpoints would help alleviate some of this as well.
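
To make the proposed behavior concrete, here is a rough Go sketch of the per-path isolation idea; PathRunner, runPaths, and the logging are hypothetical names for this sketch, not the relayer's actual types or the eventual implementation of #1268.

    package relayer

    import (
        "context"
        "log"
        "sync"
    )

    // PathRunner is an illustrative stand-in for the relayer's PathProcessor;
    // the real type and its methods differ.
    type PathRunner interface {
        Name() string
        Run(ctx context.Context) error
    }

    // runPaths runs every path in its own goroutine. When one path fails,
    // only that path stops and the error is logged; the remaining paths
    // keep relaying and the process as a whole stays up, instead of the
    // whole rly process exiting.
    func runPaths(ctx context.Context, paths []PathRunner) {
        var wg sync.WaitGroup
        for _, p := range paths {
            p := p
            wg.Add(1)
            go func() {
                defer wg.Done()
                if err := p.Run(ctx); err != nil {
                    log.Printf("path processor %q stopped: %v", p.Name(), err)
                }
            }()
        }
        wg.Wait()
    }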

jtieri (Member) commented Feb 2, 2024

One thing we noticed is that we were using a single 60s timeout for the entire paginated query used to fetch all channels. For chains like Osmosis and Cosmos Hub, which have a ton of channels in state, this could potentially be problematic. In #1395 I've made changes to use a 10s timeout per paginated query request, so this should alleviate some of the issues.
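
A minimal Go sketch of the per-page timeout idea described above; the type and function names are assumptions for illustration, not the actual change in #1395.

    package relayer

    import (
        "context"
        "time"
    )

    // Channel and ChannelQuerier are illustrative stand-ins for the relayer's
    // IBC channel type and its query client; the real names differ.
    type Channel struct{ PortID, ChannelID string }

    type ChannelQuerier interface {
        // Channels returns one page of channels plus the pagination key for
        // the next page (empty when there are no more pages).
        Channels(ctx context.Context, pageKey []byte) ([]Channel, []byte, error)
    }

    // queryAllChannels gives each page request its own short deadline instead
    // of sharing a single 60s deadline across the whole paginated query, so
    // one slow page cannot exhaust the budget for every remaining page.
    func queryAllChannels(ctx context.Context, q ChannelQuerier) ([]Channel, error) {
        var (
            all     []Channel
            nextKey []byte
        )
        for {
            pageCtx, cancel := context.WithTimeout(ctx, 10*time.Second)
            page, key, err := q.Channels(pageCtx, nextKey)
            cancel()
            if err != nil {
                return nil, err
            }
            all = append(all, page...)
            if len(key) == 0 {
                return all, nil
            }
            nextKey = key
        }
    }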

Will also prioritize only killing PathProcessors related to a problematic node vs. killing the entire rly process.

akc2267 (Contributor) commented Feb 12, 2024

This should be resolved in 2.5.1.

akc2267 closed this as completed Feb 12, 2024
jtieri reopened this Feb 22, 2024
jtieri (Member) commented Feb 26, 2024

> @akc2267 @jtieri we still have this constantly, even on 2.5.1.
>
> Both larger-config and smaller-config instances just go into an infinite loop where they report:
>
>     2024-02-22T01:11:44.059393Z  warn  Relayer start error  {"error": "error querying channels: rpc error: code = Unknown desc = Ibc Error: Invalid query path"}
>     Error: error querying channels: rpc error: code = Unknown desc = Ibc Error: Invalid query path
>
> It reports this constantly for tens of channels and chains. It stalls after ~1-2s and restarts the process, running into the same error.
>
> Our configs are correct, and we can curl the URLs from the server, proving it has access to the nodes.

I would not expect to see these Ibc Error: Invalid query path logs for the originally described issue; this seems to be an entirely different issue altogether.

Do you know which chain is reporting these errors? I will likely need to do some local testing to suss out what the issue is with regard to that specific error.
