Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The HISTORY_RETENTION_COUNT parameter has no effect #5299

Closed
airstring opened this issue May 6, 2024 · 9 comments
Closed

The HISTORY_RETENTION_COUNT parameter has no effect #5299

airstring opened this issue May 6, 2024 · 9 comments
Labels

Comments

@airstring
Copy link

What version are you using?

root@ip-xxx:/data/xlm# bin/stellar-horizon version
2.30.0-17e1acc6ba990b62b3c364a30ad9d866e562c56e
go1.22.2

What did you do?

xlm's data was growing so fast that I wanted it to automatically prune history blocks.
I added two lines to the file /etc/default/stellar-horizon, as shown below:

HISTORY_RETENTION_COUNT=50
HISTORY_RETENTION_REAP_COUNT=10000

What did you expect to see?

1.horizon's log file should have the reaper keyword
2.Disk space should not continue to grow

What did you see instead?

1.horizon's log files don't have the reaper keyword
2.Disk space continues to grow

@airstring airstring added the bug label May 6, 2024
@sreuland
Copy link
Contributor

sreuland commented May 6, 2024

@airstring , there is a time limit interval imposed on the reaper, it will only run once every hour per this config:
https://github.com/stellar/go/blob/master/services/horizon/internal/reap/system.go#L47

I'm wondering if this observation was in the timeframe of first hour after starting horizon, if so, the reaper would not have ran yet. Can you check the logs and/or db related to this horizon instance after it has been running at least one hour with this retention config and then recheck logs for lines with reaper in the message to confirm activity?

@airstring
Copy link
Author

airstring commented May 7, 2024

@airstring , there is a time limit interval imposed on the reaper, it will only run once every hour per this config: https://github.com/stellar/go/blob/master/services/horizon/internal/reap/system.go#L47

I'm wondering if this observation was in the timeframe of first hour after starting horizon, if so, the reaper would not have ran yet. Can you check the logs and/or db related to this horizon instance after it has been running at least one hour with this retention config and then recheck logs for lines with reaper in the message to confirm activity?

I'm sure I've been watching for a week now and there are no reaper keywords in the logs.

root@ip-xxx:/data/xlm/logs# du -sh stellar-horizon.log 
592M    stellar-horizon.log
root@ip-xxx:/data/xlm/logs# grep "reaper" stellar-horizon.log
root@ip-xxx:/data/xlm/logs#

Of course, the block synchronization is normal.

root@ip-xxx:/data/xlm/logs# curl -s http://localhost:8000 | grep 'ingest_latest_ledger' 
  "ingest_latest_ledger": 51560811,

@sreuland
Copy link
Contributor

sreuland commented May 7, 2024

I ran horizon 2.30.0 locally with similar reap config to replicate on testnet:

$ HISTORY_RETENTION_COUNT=50 \
HISTORY_RETENTION_REAP_COUNT=10000 \
DATABASE_URL=xyz \
NETWORK=testnet \
STELLAR_CORE_BINARY_PATH=/my/path/to/stellar-core \
LOG_FILE=test.log ./horizon

this was on an initially empty db, the first reaper cycle appeard 5 hrs after start:

$ grep "reap" run.log 
time="2024-05-07T03:10:28.480-07:00" level=info msg="reaper: clearing" end_ledger=1490943 pid=33659 start_ledger=1487744
time="2024-05-07T03:10:31.067-07:00" level=info msg="reaper succeeded" new_elder=1490944 pid=33659
time="2024-05-07T04:26:37.077-07:00" level=info msg="reaper: clearing" end_ledger=1491815 pid=33659 start_ledger=1490944
time="2024-05-07T04:26:38.883-07:00" level=info msg="reaper succeeded" new_elder=1491816 pid=33659
time="2024-05-07T05:59:03.274-07:00" level=info msg="reaper: clearing" end_ledger=1492872 pid=33659 start_ledger=1491816
time="2024-05-07T05:59:04.952-07:00" level=info msg="reaper succeeded" new_elder=1492873 pid=33659
time="2024-05-07T07:24:11.008-07:00" level=info msg="reaper: clearing" end_ledger=1493845 pid=33659 start_ledger=1492873
time="2024-05-07T07:24:12.720-07:00" level=info msg="reaper succeeded" new_elder=1493846 pid=33659
time="2024-05-07T08:44:19.326-07:00" level=info msg="reaper: clearing" end_ledger=1494762 pid=33659 start_ledger=1493846
time="2024-05-07T08:44:20.936-07:00" level=info msg="reaper succeeded" new_elder=1494763 pid=33659
time="2024-05-07T09:44:20.934-07:00" level=info msg="reaper: clearing" end_ledger=1495451 pid=33659 start_ledger=1494763
time="2024-05-07T09:44:22.513-07:00" level=info msg="reaper succeeded" new_elder=1495452 pid=33659

can you grep the log for level=error,see if anything shows up there, and also, check the postgres db, and see if any long running transaction/sql are present? wondering if the sql delete statement that reaper is running maybe getting hung up.

@airstring
Copy link
Author

airstring commented May 8, 2024

I ran horizon 2.30.0 locally with similar reap config to replicate on testnet:

$ HISTORY_RETENTION_COUNT=50 \
HISTORY_RETENTION_REAP_COUNT=10000 \
DATABASE_URL=xyz \
NETWORK=testnet \
STELLAR_CORE_BINARY_PATH=/my/path/to/stellar-core \
LOG_FILE=test.log ./horizon

this was on an initially empty db, the first reaper cycle appeard 5 hrs after start:

$ grep "reap" run.log 
time="2024-05-07T03:10:28.480-07:00" level=info msg="reaper: clearing" end_ledger=1490943 pid=33659 start_ledger=1487744
time="2024-05-07T03:10:31.067-07:00" level=info msg="reaper succeeded" new_elder=1490944 pid=33659
time="2024-05-07T04:26:37.077-07:00" level=info msg="reaper: clearing" end_ledger=1491815 pid=33659 start_ledger=1490944
time="2024-05-07T04:26:38.883-07:00" level=info msg="reaper succeeded" new_elder=1491816 pid=33659
time="2024-05-07T05:59:03.274-07:00" level=info msg="reaper: clearing" end_ledger=1492872 pid=33659 start_ledger=1491816
time="2024-05-07T05:59:04.952-07:00" level=info msg="reaper succeeded" new_elder=1492873 pid=33659
time="2024-05-07T07:24:11.008-07:00" level=info msg="reaper: clearing" end_ledger=1493845 pid=33659 start_ledger=1492873
time="2024-05-07T07:24:12.720-07:00" level=info msg="reaper succeeded" new_elder=1493846 pid=33659
time="2024-05-07T08:44:19.326-07:00" level=info msg="reaper: clearing" end_ledger=1494762 pid=33659 start_ledger=1493846
time="2024-05-07T08:44:20.936-07:00" level=info msg="reaper succeeded" new_elder=1494763 pid=33659
time="2024-05-07T09:44:20.934-07:00" level=info msg="reaper: clearing" end_ledger=1495451 pid=33659 start_ledger=1494763
time="2024-05-07T09:44:22.513-07:00" level=info msg="reaper succeeded" new_elder=1495452 pid=33659

can you grep the log for level=error,see if anything shows up there, and also, check the postgres db, and see if any long running transaction/sql are present? wondering if the sql delete statement that reaper is running maybe getting hung up.

1.file /etc/default/stellar-horizon contents are as follows

NETWORK_PASSPHRASE="Public Global Stellar Network ; September 2015"
DATABASE_URL="dbname=horizon host=localhost user=stellar password=yyy"
STELLAR_CORE_URL="http://127.0.0.1:11626"
HISTORY_ARCHIVE_URLS="https://history.stellar.org/prd/core-live/core_live_001/"
PORT=8000
STELLAR_CAPTIVE_CORE_HTTP_PORT=0
INGEST=true
CAPTIVE_CORE_CONFIG_APPEND_PATH=/data/xlm/conf/stellar-captive-core.cfg
STELLAR_CORE_BINARY_PATH="/data/xlm/bin/stellar-core"
PER_HOUR_RATE_LIMIT=40000
HISTORY_RETENTION_COUNT=50
HISTORY_RETENTION_REAP_COUNT=10000

2.Log as follows:

root@ip-xxx:/data/xlm/logs# grep "level=error" stellar-horizon.log |grep -v "Could not download" |grep -v "Failed to download" |grep -v "failed to load the stellar-core info"
Apr 30 13:48:54 ip-xxx stellar-horizon[2466873]: time="2024-04-30T13:48:54.005+08:00" level=error msg="History: Failed to obtain archive information: lobstr_1_europe" pid=2466873 service=ingest subservice=stellar-core
Apr 30 15:46:02 ip-xxx stellar-horizon[2510673]: time="2024-04-30T15:46:02.830+08:00" level=error msg="History: Failed to obtain archive information: lobstr_1_europe" pid=2510673 service=ingest subservice=stellar-core
Apr 30 18:46:01 ip-xxx stellar-horizon[2510673]: time="2024-04-30T18:46:01.815+08:00" level=error msg="History: Failed to obtain archive information: lobstr_1_europe" pid=2510673 service=ingest subservice=stellar-core
Apr 30 21:46:01 ip-xxx stellar-horizon[2510673]: time="2024-04-30T21:46:01.618+08:00" level=error msg="History: Failed to obtain archive information: lobstr_1_europe" pid=2510673 service=ingest subservice=stellar-core
May  1 00:46:01 ip-xxx stellar-horizon[2510673]: time="2024-05-01T00:46:01.981+08:00" level=error msg="History: Failed to obtain archive information: lobstr_1_europe" pid=2510673 service=ingest subservice=stellar-core
May  1 03:02:47 ip-xxx stellar-horizon[2510673]: time="2024-05-01T03:02:47.771+08:00" level=error msg="could not start repeatable read transaction for txsub tick" error="canceling statement due to lack of response within timeout period" pid=2510673
May  1 03:46:01 ip-xxx stellar-horizon[2510673]: time="2024-05-01T03:46:01.577+08:00" level=error msg="History: Failed to obtain archive information: lobstr_1_europe" pid=2510673 service=ingest subservice=stellar-core
May  1 06:46:02 ip-xxx stellar-horizon[2510673]: time="2024-05-01T06:46:02.249+08:00" level=error msg="History: Failed to obtain archive information: lobstr_1_europe" pid=2510673 service=ingest subservice=stellar-core
May  1 09:46:02 ip-xxx stellar-horizon[2510673]: time="2024-05-01T09:46:02.442+08:00" level=error msg="History: Failed to obtain archive information: lobstr_1_europe" pid=2510673 service=ingest subservice=stellar-core
May  1 12:46:02 ip-xxx stellar-horizon[2510673]: time="2024-05-01T12:46:02.522+08:00" level=error msg="History: Failed to obtain archive information: lobstr_1_europe" pid=2510673 service=ingest subservice=stellar-core
May  1 15:46:02 ip-xxx stellar-horizon[2510673]: time="2024-05-01T15:46:02.426+08:00" level=error msg="History: Failed to obtain archive information: lobstr_1_europe" pid=2510673 service=ingest subservice=stellar-core
May  1 17:44:34 ip-xxx stellar-horizon[2510673]: time="2024-05-01T17:44:34.797+08:00" level=error msg="could not start repeatable read transaction for txsub tick" error="canceling statement due to lack of response within timeout period" pid=2510673
May  1 17:45:07 ip-xxx stellar-horizon[2510673]: time="2024-05-01T17:45:07.768+08:00" level=error msg="could not start repeatable read transaction for txsub tick" error="canceling statement due to lack of response within timeout period" pid=2510673
May  1 17:59:06 ip-xxx stellar-horizon[2510673]: time="2024-05-01T17:59:06.781+08:00" level=error msg="could not start repeatable read transaction for txsub tick" error="canceling statement due to lack of response within timeout period" pid=2510673
May  1 18:46:02 ip-xxx stellar-horizon[2510673]: time="2024-05-01T18:46:02.723+08:00" level=error msg="History: Failed to obtain archive information: lobstr_1_europe" pid=2510673 service=ingest subservice=stellar-core
May  1 21:46:06 ip-xxx stellar-horizon[2510673]: time="2024-05-01T21:46:06.385+08:00" level=error msg="History: Failed to obtain archive information: lobstr_1_europe" pid=2510673 service=ingest subservice=stellar-core
May  2 00:46:01 ip-xxx stellar-horizon[2510673]: time="2024-05-02T00:46:01.703+08:00" level=error msg="History: Failed to obtain archive information: lobstr_1_europe" pid=2510673 service=ingest subservice=stellar-core
May  2 03:46:01 ip-xxx stellar-horizon[2510673]: time="2024-05-02T03:46:01.727+08:00" level=error msg="History: Failed to obtain archive information: lobstr_1_europe" pid=2510673 service=ingest subservice=stellar-core
May  2 06:46:12 ip-xxx stellar-horizon[2510673]: time="2024-05-02T06:46:12.470+08:00" level=error msg="History: Failed to obtain archive information: lobstr_1_europe" pid=2510673 service=ingest subservice=stellar-core
May  2 09:46:02 ip-xxx stellar-horizon[2510673]: time="2024-05-02T09:46:02.012+08:00" level=error msg="History: Failed to obtain archive information: lobstr_1_europe" pid=2510673 service=ingest subservice=stellar-core
May  2 12:46:01 ip-xxx stellar-horizon[2510673]: time="2024-05-02T12:46:01.705+08:00" level=error msg="History: Failed to obtain archive information: lobstr_1_europe" pid=2510673 service=ingest subservice=stellar-core
May  2 15:46:07 ip-xxx stellar-horizon[2510673]: time="2024-05-02T15:46:07.171+08:00" level=error msg="History: Failed to obtain archive information: lobstr_1_europe" pid=2510673 service=ingest subservice=stellar-core
May  2 18:46:02 ip-xxx stellar-horizon[2510673]: time="2024-05-02T18:46:02.875+08:00" level=error msg="History: Failed to obtain archive information: lobstr_1_europe" pid=2510673 service=ingest subservice=stellar-core
May  2 21:46:04 ip-xxx stellar-horizon[2510673]: time="2024-05-02T21:46:04.701+08:00" level=error msg="History: Failed to obtain archive information: lobstr_1_europe" pid=2510673 service=ingest subservice=stellar-core
May  3 00:46:04 ip-xxx stellar-horizon[2510673]: time="2024-05-03T00:46:04.667+08:00" level=error msg="History: Failed to obtain archive information: lobstr_1_europe" pid=2510673 service=ingest subservice=stellar-core
May  3 03:46:03 ip-xxx stellar-horizon[2510673]: time="2024-05-03T03:46:03.199+08:00" level=error msg="History: Failed to obtain archive information: lobstr_1_europe" pid=2510673 service=ingest subservice=stellar-core
May  3 05:02:12 ip-xxx stellar-horizon[2510673]: time="2024-05-03T05:02:12.857+08:00" level=error msg="could not start repeatable read transaction for txsub tick" error="canceling statement due to lack of response within timeout period" pid=2510673
May  3 05:02:24 ip-xxx stellar-horizon[2510673]: time="2024-05-03T05:02:24.769+08:00" level=error msg="could not start repeatable read transaction for txsub tick" error="canceling statement due to lack of response within timeout period" pid=2510673
May  3 06:46:03 ip-xxx stellar-horizon[2510673]: time="2024-05-03T06:46:03.630+08:00" level=error msg="History: Failed to obtain archive information: lobstr_1_europe" pid=2510673 service=ingest subservice=stellar-core
May  3 09:11:53 ip-xxx stellar-horizon[2510673]: time="2024-05-03T09:11:53.804+08:00" level=error msg="could not start repeatable read transaction for txsub tick" error="canceling statement due to lack of response within timeout period" pid=2510673
May  3 09:46:03 ip-xxx stellar-horizon[2510673]: time="2024-05-03T09:46:03.070+08:00" level=error msg="History: Failed to obtain archive information: lobstr_1_europe" pid=2510673 service=ingest subservice=stellar-core
May  3 12:46:03 ip-xxx stellar-horizon[2510673]: time="2024-05-03T12:46:03.516+08:00" level=error msg="History: Failed to obtain archive information: lobstr_1_europe" pid=2510673 service=ingest subservice=stellar-core
May  3 15:46:03 ip-xxx stellar-horizon[2510673]: time="2024-05-03T15:46:03.399+08:00" level=error msg="History: Failed to obtain archive information: lobstr_1_europe" pid=2510673 service=ingest subservice=stellar-core
May  3 18:46:02 ip-xxx stellar-horizon[2510673]: time="2024-05-03T18:46:02.941+08:00" level=error msg="History: Failed to obtain archive information: lobstr_1_europe" pid=2510673 service=ingest subservice=stellar-core
May  3 21:46:02 ip-xxx stellar-horizon[2510673]: time="2024-05-03T21:46:02.401+08:00" level=error msg="History: Failed to obtain archive information: lobstr_1_europe" pid=2510673 service=ingest subservice=stellar-core
May  4 00:46:03 ip-xxx stellar-horizon[2510673]: time="2024-05-04T00:46:03.165+08:00" level=error msg="History: Failed to obtain archive information: lobstr_1_europe" pid=2510673 service=ingest subservice=stellar-core
May  4 03:46:03 ip-xxx stellar-horizon[2510673]: time="2024-05-04T03:46:03.612+08:00" level=error msg="History: Failed to obtain archive information: lobstr_1_europe" pid=2510673 service=ingest subservice=stellar-core
May  4 04:04:47 ip-xxx stellar-horizon[2510673]: time="2024-05-04T04:04:47.814+08:00" level=error msg="could not start repeatable read transaction for txsub tick" error="canceling statement due to lack of response within timeout period" pid=2510673
May  4 06:46:03 ip-xxx stellar-horizon[2510673]: time="2024-05-04T06:46:03.195+08:00" level=error msg="History: Failed to obtain archive information: lobstr_1_europe" pid=2510673 service=ingest subservice=stellar-core
May  4 09:46:03 ip-xxx stellar-horizon[2510673]: time="2024-05-04T09:46:03.715+08:00" level=error msg="History: Failed to obtain archive information: lobstr_1_europe" pid=2510673 service=ingest subservice=stellar-core
May  4 12:46:03 ip-xxx stellar-horizon[2510673]: time="2024-05-04T12:46:03.077+08:00" level=error msg="History: Failed to obtain archive information: lobstr_1_europe" pid=2510673 service=ingest subservice=stellar-core
May  4 15:46:03 ip-xxx stellar-horizon[2510673]: time="2024-05-04T15:46:03.541+08:00" level=error msg="History: Failed to obtain archive information: lobstr_1_europe" pid=2510673 service=ingest subservice=stellar-core
May  4 18:46:06 ip-xxx stellar-horizon[2510673]: time="2024-05-04T18:46:06.216+08:00" level=error msg="History: Failed to obtain archive information: lobstr_1_europe" pid=2510673 service=ingest subservice=stellar-core
May  4 21:46:04 ip-xxx stellar-horizon[2510673]: time="2024-05-04T21:46:04.086+08:00" level=error msg="History: Failed to obtain archive information: lobstr_1_europe" pid=2510673 service=ingest subservice=stellar-core
May  5 00:46:03 ip-xxx stellar-horizon[2510673]: time="2024-05-05T00:46:03.748+08:00" level=error msg="History: Failed to obtain archive information: lobstr_1_europe" pid=2510673 service=ingest subservice=stellar-core
May  5 03:46:04 ip-xxx stellar-horizon[2510673]: time="2024-05-05T03:46:04.212+08:00" level=error msg="History: Failed to obtain archive information: lobstr_1_europe" pid=2510673 service=ingest subservice=stellar-core
May  5 06:46:04 ip-xxx stellar-horizon[2510673]: time="2024-05-05T06:46:04.025+08:00" level=error msg="History: Failed to obtain archive information: lobstr_1_europe" pid=2510673 service=ingest subservice=stellar-core
May  5 09:46:04 ip-xxx stellar-horizon[2510673]: time="2024-05-05T09:46:04.158+08:00" level=error msg="History: Failed to obtain archive information: lobstr_1_europe" pid=2510673 service=ingest subservice=stellar-core
May  5 12:46:03 ip-xxx stellar-horizon[2510673]: time="2024-05-05T12:46:03.964+08:00" level=error msg="History: Failed to obtain archive information: lobstr_1_europe" pid=2510673 service=ingest subservice=stellar-core
May  5 15:46:03 ip-xxx stellar-horizon[2510673]: time="2024-05-05T15:46:03.967+08:00" level=error msg="History: Failed to obtain archive information: lobstr_1_europe" pid=2510673 service=ingest subservice=stellar-core
May  5 18:46:03 ip-xxx stellar-horizon[2510673]: time="2024-05-05T18:46:03.220+08:00" level=error msg="History: Failed to obtain archive information: lobstr_1_europe" pid=2510673 service=ingest subservice=stellar-core
May  5 21:46:03 ip-xxx stellar-horizon[2510673]: time="2024-05-05T21:46:03.282+08:00" level=error msg="History: Failed to obtain archive information: lobstr_1_europe" pid=2510673 service=ingest subservice=stellar-core
May  6 00:46:03 ip-xxx stellar-horizon[2510673]: time="2024-05-06T00:46:03.148+08:00" level=error msg="History: Failed to obtain archive information: lobstr_1_europe" pid=2510673 service=ingest subservice=stellar-core
May  6 03:46:03 ip-xxx stellar-horizon[2510673]: time="2024-05-06T03:46:03.298+08:00" level=error msg="History: Failed to obtain archive information: lobstr_1_europe" pid=2510673 service=ingest subservice=stellar-core
May  6 06:46:03 ip-xxx stellar-horizon[2510673]: time="2024-05-06T06:46:03.394+08:00" level=error msg="History: Failed to obtain archive information: lobstr_1_europe" pid=2510673 service=ingest subservice=stellar-core
May  6 08:59:56 ip-xxx stellar-horizon[2510673]: time="2024-05-06T08:59:56.832+08:00" level=error msg="could not start repeatable read transaction for txsub tick" error="canceling statement due to lack of response within timeout period" pid=2510673
May  6 09:00:29 ip-xxx stellar-horizon[2510673]: time="2024-05-06T09:00:29.768+08:00" level=error msg="could not start repeatable read transaction for txsub tick" error="canceling statement due to lack of response within timeout period" pid=2510673
May  6 09:00:47 ip-xxx stellar-horizon[2510673]: time="2024-05-06T09:00:47.769+08:00" level=error msg="could not start repeatable read transaction for txsub tick" error="canceling statement due to lack of response within timeout period" pid=2510673
May  6 09:46:03 ip-xxx stellar-horizon[2510673]: time="2024-05-06T09:46:03.148+08:00" level=error msg="History: Failed to obtain archive information: lobstr_1_europe" pid=2510673 service=ingest subservice=stellar-core
May  6 12:46:02 ip-xxx stellar-horizon[2510673]: time="2024-05-06T12:46:02.969+08:00" level=error msg="History: Failed to obtain archive information: lobstr_1_europe" pid=2510673 service=ingest subservice=stellar-core
May  6 15:46:03 ip-xxx stellar-horizon[2510673]: time="2024-05-06T15:46:03.987+08:00" level=error msg="History: Failed to obtain archive information: lobstr_1_europe" pid=2510673 service=ingest subservice=stellar-core
May  6 16:18:41 ip-xxx stellar-horizon[2510673]: time="2024-05-06T16:18:41.824+08:00" level=error msg="could not start repeatable read transaction for txsub tick" error="canceling statement due to lack of response within timeout period" pid=2510673
May  6 18:28:34 ip-xxx stellar-horizon[2510673]: time="2024-05-06T18:28:34.869+08:00" level=error msg="could not start repeatable read transaction for txsub tick" error="canceling statement due to lack of response within timeout period" pid=2510673
May  6 18:29:08 ip-xxx stellar-horizon[2510673]: time="2024-05-06T18:29:08.768+08:00" level=error msg="could not start repeatable read transaction for txsub tick" error="canceling statement due to lack of response within timeout period" pid=2510673
May  6 18:29:43 ip-xxx stellar-horizon[2510673]: time="2024-05-06T18:29:43.769+08:00" level=error msg="could not start repeatable read transaction for txsub tick" error="canceling statement due to lack of response within timeout period" pid=2510673
May  6 18:30:01 ip-xxx stellar-horizon[2510673]: time="2024-05-06T18:30:01.774+08:00" level=error msg="could not start repeatable read transaction for txsub tick" error="canceling statement due to lack of response within timeout period" pid=2510673
May  6 18:46:04 ip-xxx stellar-horizon[2510673]: time="2024-05-06T18:46:04.697+08:00" level=error msg="History: Failed to obtain archive information: lobstr_1_europe" pid=2510673 service=ingest subservice=stellar-core
May  6 19:05:03 ip-xxx stellar-horizon[2510673]: time="2024-05-06T19:05:03.848+08:00" level=error msg="could not start repeatable read transaction for txsub tick" error="canceling statement due to lack of response within timeout period" pid=2510673
May  6 19:05:14 ip-xxx stellar-horizon[2510673]: time="2024-05-06T19:05:14.767+08:00" level=error msg="could not start repeatable read transaction for txsub tick" error="canceling statement due to lack of response within timeout period" pid=2510673
May  6 19:05:59 ip-xxx stellar-horizon[2510673]: time="2024-05-06T19:05:59.768+08:00" level=error msg="could not start repeatable read transaction for txsub tick" error="canceling statement due to lack of response within timeout period" pid=2510673
May  6 21:46:04 ip-xxx stellar-horizon[2510673]: time="2024-05-06T21:46:04.200+08:00" level=error msg="History: Failed to obtain archive information: lobstr_1_europe" pid=2510673 service=ingest subservice=stellar-core
May  6 23:06:09 ip-xxx stellar-horizon[2510673]: time="2024-05-06T23:06:09.831+08:00" level=error msg="could not start repeatable read transaction for txsub tick" error="canceling statement due to lack of response within timeout period" pid=2510673
May  6 23:06:17 ip-xxx stellar-horizon[2510673]: time="2024-05-06T23:06:17.768+08:00" level=error msg="could not start repeatable read transaction for txsub tick" error="canceling statement due to lack of response within timeout period" pid=2510673
May  6 23:12:45 ip-xxx stellar-horizon[2510673]: time="2024-05-06T23:12:45.789+08:00" level=error msg="could not start repeatable read transaction for txsub tick" error="canceling statement due to lack of response within timeout period" pid=2510673
May  6 23:13:09 ip-xxx stellar-horizon[2510673]: time="2024-05-06T23:13:09.774+08:00" level=error msg="could not start repeatable read transaction for txsub tick" error="canceling statement due to lack of response within timeout period" pid=2510673
May  6 23:13:24 ip-xxx stellar-horizon[2510673]: time="2024-05-06T23:13:24.768+08:00" level=error msg="could not start repeatable read transaction for txsub tick" error="canceling statement due to lack of response within timeout period" pid=2510673
May  6 23:13:36 ip-xxx stellar-horizon[2510673]: time="2024-05-06T23:13:36.768+08:00" level=error msg="could not start repeatable read transaction for txsub tick" error="canceling statement due to lack of response within timeout period" pid=2510673
May  6 23:13:50 ip-xxx stellar-horizon[2510673]: time="2024-05-06T23:13:50.769+08:00" level=error msg="could not start repeatable read transaction for txsub tick" error="canceling statement due to lack of response within timeout period" pid=2510673
May  6 23:29:11 ip-xxx stellar-horizon[2510673]: time="2024-05-06T23:29:11.788+08:00" level=error msg="could not start repeatable read transaction for txsub tick" error="canceling statement due to lack of response within timeout period" pid=2510673
May  6 23:31:42 ip-xxx stellar-horizon[2510673]: time="2024-05-06T23:31:42.768+08:00" level=error msg="could not start repeatable read transaction for txsub tick" error="canceling statement due to lack of response within timeout period" pid=2510673
May  6 23:37:54 ip-xxx stellar-horizon[2510673]: time="2024-05-06T23:37:54.791+08:00" level=error msg="could not start repeatable read transaction for txsub tick" error="canceling statement due to lack of response within timeout period" pid=2510673
May  6 23:38:15 ip-xxx stellar-horizon[2510673]: time="2024-05-06T23:38:15.768+08:00" level=error msg="could not start repeatable read transaction for txsub tick" error="canceling statement due to lack of response within timeout period" pid=2510673
May  6 23:40:38 ip-xxx stellar-horizon[2510673]: time="2024-05-06T23:40:38.769+08:00" level=error msg="could not start repeatable read transaction for txsub tick" error="canceling statement due to lack of response within timeout period" pid=2510673
May  7 00:46:04 ip-xxx stellar-horizon[2510673]: time="2024-05-07T00:46:04.586+08:00" level=error msg="History: Failed to obtain archive information: lobstr_1_europe" pid=2510673 service=ingest subservice=stellar-core
May  7 02:38:17 ip-xxx stellar-horizon[2510673]: time="2024-05-07T02:38:17.839+08:00" level=error msg="could not start repeatable read transaction for txsub tick" error="canceling statement due to lack of response within timeout period" pid=2510673
May  7 03:46:07 ip-xxx stellar-horizon[2510673]: time="2024-05-07T03:46:07.004+08:00" level=error msg="History: Failed to obtain archive information: lobstr_1_europe" pid=2510673 service=ingest subservice=stellar-core
May  7 04:01:29 ip-xxx stellar-horizon[2510673]: time="2024-05-07T04:01:29.797+08:00" level=error msg="could not start repeatable read transaction for txsub tick" error="canceling statement due to lack of response within timeout period" pid=2510673
May  7 05:00:23 ip-xxx stellar-horizon[2510673]: time="2024-05-07T05:00:23.821+08:00" level=error msg="could not start repeatable read transaction for txsub tick" error="canceling statement due to lack of response within timeout period" pid=2510673
May  7 05:00:30 ip-xxx stellar-horizon[2510673]: time="2024-05-07T05:00:30.769+08:00" level=error msg="could not start repeatable read transaction for txsub tick" error="canceling statement due to lack of response within timeout period" pid=2510673
May  7 05:00:38 ip-xxx stellar-horizon[2510673]: time="2024-05-07T05:00:38.767+08:00" level=error msg="could not start repeatable read transaction for txsub tick" error="canceling statement due to lack of response within timeout period" pid=2510673
May  7 06:46:07 ip-xxx stellar-horizon[2510673]: time="2024-05-07T06:46:07.195+08:00" level=error msg="History: Failed to obtain archive information: lobstr_1_europe" pid=2510673 service=ingest subservice=stellar-core
May  7 09:46:06 ip-xxx stellar-horizon[2510673]: time="2024-05-07T09:46:06.469+08:00" level=error msg="History: Failed to obtain archive information: lobstr_1_europe" pid=2510673 service=ingest subservice=stellar-core
May  7 11:42:11 ip-xxx stellar-horizon[2510673]: time="2024-05-07T11:42:11.880+08:00" level=error msg="could not start repeatable read transaction for txsub tick" error="canceling statement due to lack of response within timeout period" pid=2510673
May  7 12:46:05 ip-xxx stellar-horizon[2510673]: time="2024-05-07T12:46:05.748+08:00" level=error msg="History: Failed to obtain archive information: lobstr_1_europe" pid=2510673 service=ingest subservice=stellar-core
May  7 14:40:26 ip-xxx stellar-horizon[2510673]: time="2024-05-07T14:40:26.833+08:00" level=error msg="could not start repeatable read transaction for txsub tick" error="canceling statement due to lack of response within timeout period" pid=2510673
May  7 15:46:05 ip-xxx stellar-horizon[2510673]: time="2024-05-07T15:46:05.572+08:00" level=error msg="History: Failed to obtain archive information: lobstr_1_europe" pid=2510673 service=ingest subservice=stellar-core
May  7 18:46:10 ip-xxx stellar-horizon[2510673]: time="2024-05-07T18:46:10.897+08:00" level=error msg="History: Failed to obtain archive information: lobstr_1_europe" pid=2510673 service=ingest subservice=stellar-core
May  7 19:04:56 ip-xxx stellar-horizon[2510673]: time="2024-05-07T19:04:56.866+08:00" level=error msg="could not start repeatable read transaction for txsub tick" error="canceling statement due to lack of response within timeout period" pid=2510673
May  7 21:46:13 ip-xxx stellar-horizon[2510673]: time="2024-05-07T21:46:13.585+08:00" level=error msg="History: Failed to obtain archive information: lobstr_1_europe" pid=2510673 service=ingest subservice=stellar-core
May  7 23:10:22 ip-xxx stellar-horizon[2510673]: time="2024-05-07T23:10:22.832+08:00" level=error msg="could not start repeatable read transaction for txsub tick" error="canceling statement due to lack of response within timeout period" pid=2510673
May  8 00:46:10 ip-xxx stellar-horizon[2510673]: time="2024-05-08T00:46:10.303+08:00" level=error msg="History: Failed to obtain archive information: lobstr_1_europe" pid=2510673 service=ingest subservice=stellar-core
May  8 03:46:10 ip-xxx stellar-horizon[2510673]: time="2024-05-08T03:46:10.326+08:00" level=error msg="History: Failed to obtain archive information: lobstr_1_europe" pid=2510673 service=ingest subservice=stellar-core
May  8 06:34:37 ip-xxx stellar-horizon[2510673]: time="2024-05-08T06:34:37.815+08:00" level=error msg="could not start repeatable read transaction for txsub tick" error="canceling statement due to lack of response within timeout period" pid=2510673
May  8 06:46:10 ip-xxx stellar-horizon[2510673]: time="2024-05-08T06:46:10.175+08:00" level=error msg="History: Failed to obtain archive information: lobstr_1_europe" pid=2510673 service=ingest subservice=stellar-core
May  8 09:46:10 ip-xxx stellar-horizon[2510673]: time="2024-05-08T09:46:10.679+08:00" level=error msg="History: Failed to obtain archive information: lobstr_1_europe" pid=2510673 service=ingest subservice=stellar-core
root@ip-xxx:/data/xlm/logs# 

3.The active sql in the database is as follows:

postgres=# select pid,datname,usename,query,state,backend_start from pg_stat_activity where state='active';
   pid   | datname  | usename  |                                                                                                              query                                                                                                               | state  |         backend_start         
---------+----------+----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+-------------------------------
 4167882 | postgres | postgres | select pid,datname,usename,query,state,backend_start from pg_stat_activity where state='active';                                                                                                                                 | active | 2024-05-08 10:12:53.507739+08
 2510689 | horizon  | stellar  | UPDATE exp_asset_stats SET accounts = $1, amount = $2, asset_code = $3, asset_issuer = $4, asset_type = $5, balances = $6, contract_id = $7, num_accounts = $8 WHERE asset_code = $9 AND asset_issuer = $10 AND asset_type = $11 | active | 2024-04-30 15:45:48.772808+08
 2465736 | core     | stellar  | INSERT INTO txhistory ( txid, ledgerseq, txindex,  txbody, txresult, txmeta) VALUES ($1,  $2,      $3, $4,   $5,   $6)                                                                                                           | active | 2024-04-30 10:41:34.740153+08
(3 rows)

postgres=# select pid,datname,usename,query,state,backend_start from pg_stat_activity where state='active';
   pid   | datname  | usename  |                                              query                                               | state  |         backend_start         
---------+----------+----------+--------------------------------------------------------------------------------------------------+--------+-------------------------------
 4167882 | postgres | postgres | select pid,datname,usename,query,state,backend_start from pg_stat_activity where state='active'; | active | 2024-05-08 10:12:53.507739+08

@sreuland
Copy link
Contributor

sreuland commented May 9, 2024

can you confirm how stellar-horizon is being run on the o/s as a process, is the pkg installed binary /usr/bin/stellar-horizon being run directly or using systemd with /usr/lib/systemd/system/stellar-horizon.service, or using stellar-horizon-cmd? the latter two options will load /etc/default/stellar-horizon. Also, if using one of the latter two runtime options, the horizon process will need to be restarted to force reload /etc/default/stellar-horizon.

@airstring
Copy link
Author

airstring commented May 9, 2024

vim /lib/systemd/system/stellar-horizon.service

[Unit]
Description=SDF - stellar-horizon
Before=multi-user.target

[Service]
User=stellar
Group=stellar
LimitNOFILE=8192
EnvironmentFile=/etc/default/stellar-horizon
SyslogIdentifier=stellar-horizon
ExecStart=/data/xlm/bin/stellar-horizon serve
WorkingDirectory=/data/xlm/data/stellar-horizon
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
Restart=no
RestartPreventExitStatus=255
Type=simple

[Install]
WantedBy=multi-user.target

vim /etc/default/stellar-horizon

NETWORK_PASSPHRASE="Public Global Stellar Network ; September 2015"
DATABASE_URL="dbname=horizon host=localhost user=xxx password=yyy"
STELLAR_CORE_URL="http://127.0.0.1:11626"
HISTORY_ARCHIVE_URLS="https://history.stellar.org/prd/core-live/core_live_001/"
PORT=8000
STELLAR_CAPTIVE_CORE_HTTP_PORT=0
INGEST=true
CAPTIVE_CORE_CONFIG_APPEND_PATH=/data/xlm/conf/stellar-captive-core.cfg
STELLAR_CORE_BINARY_PATH="/data/xlm/bin/stellar-core"
PER_HOUR_RATE_LIMIT=40000
HISTORY_RETENTION_COUNT=50
HISTORY_RETENTION_REAP_COUNT=10000

@sreuland The /etc/default/stell-horizon file must be in effect, otherwise stell-horizon won't connect to the postgres database.

Replenishment
I changed the start command:

/data/xlm/bin/stellar-horizon serve --history-retention-count 50 --history-retention-reap-count 10000

One hour later, reap appeared in the log

root@ip-xxx:/data/xlm/logs# grep "reap" stellar-horizon.log
May  9 18:24:31 ip-xxx stellar-horizon[261239]: 2024/05/09 18:24:31 DEPRECATED - the use of command-line flags: [--history-retention-count,--history-retention-reap-count], has been deprecated in favor of environment variables. Please consult our Configuring section in the developer documentation on how to use them - https://developers.stellar.org/docs/run-api-server/configuring
May  9 18:50:31 ip-xxx stellar-horizon[261239]: time="2024-05-09T18:50:31.614+08:00" level=warning msg="Error reaping lookup tables" err="error running query: delete from history_assets where id IN (select id from (select id, (select 1 from history_trades where base_asset_id = hcb.id limit 1) as c0, (select 1 from history_trades where counter_asset_id = hcb.id limit 1) as c1, (select 1 from history_trades_60000 where base_asset_id = hcb.id limit 1) as c2, (select 1 from history_trades_60000 where counter_asset_id = hcb.id limit 1) as c3, 1 as cx from history_assets hcb where id >= 1001 order by id limit 1000) as sub where c0 IS NULL and c1 IS NULL and c2 IS NULL and c3 IS NULL and 1=1);: canceling statement due to lack of response within timeout period" pid=261239 service=ingest
May  9 18:51:17 ip-xxx stellar-horizon[261239]: time="2024-05-09T18:51:17.842+08:00" level=warning msg="Error reaping lookup tables" err="error running query: delete from history_accounts where id IN (select id from (select id, (select 1 from history_effects where history_account_id = hcb.id limit 1) as c0, (select 1 from history_operation_participants where history_account_id = hcb.id limit 1) as c1, (select 1 from history_trades where base_account_id = hcb.id limit 1) as c2, (select 1 from history_trades where counter_account_id = hcb.id limit 1) as c3, (select 1 from history_transaction_participants where history_account_id = hcb.id limit 1) as c4, 1 as cx from history_accounts hcb where id >= 7001 order by id limit 1000) as sub where c0 IS NULL and c1 IS NULL and c2 IS NULL and c3 IS NULL and c4 IS NULL and 1=1);: canceling statement due to lack of response within timeout period" pid=261239 service=ingest
May  9 18:57:53 ip-xxx stellar-horizon[261239]: time="2024-05-09T18:57:53.096+08:00" level=warning msg="Error reaping lookup tables" err="error running query: delete from history_accounts where id IN (select id from (select id, (select 1 from history_effects where history_account_id = hcb.id limit 1) as c0, (select 1 from history_operation_participants where history_account_id = hcb.id limit 1) as c1, (select 1 from history_trades where base_account_id = hcb.id limit 1) as c2, (select 1 from history_trades where counter_account_id = hcb.id limit 1) as c3, (select 1 from history_transaction_participants where history_account_id = hcb.id limit 1) as c4, 1 as cx from history_accounts hcb where id >= 70585 order by id limit 1000) as sub where c0 IS NULL and c1 IS NULL and c2 IS NULL and c3 IS NULL and c4 IS NULL and 1=1);: canceling statement due to lack of response within timeout period" pid=261239 service=ingest
May  9 19:24:31 ip-xxx stellar-horizon[261239]: time="2024-05-09T19:24:31.649+08:00" level=info msg="reaper: clearing" end_ledger=51595328 pid=261239 start_ledger=51585328
May  9 19:24:38 ip-xxx stellar-horizon[261239]: time="2024-05-09T19:24:38.745+08:00" level=warning msg="Error reaping lookup tables" err="error running query: delete from history_accounts where id IN (select id from (select id, (select 1 from history_effects where history_account_id = hcb.id limit 1) as c0, (select 1 from history_operation_participants where history_account_id = hcb.id limit 1) as c1, (select 1 from history_trades where base_account_id = hcb.id limit 1) as c2, (select 1 from history_trades where counter_account_id = hcb.id limit 1) as c3, (select 1 from history_transaction_participants where history_account_id = hcb.id limit 1) as c4, 1 as cx from history_accounts hcb where id >= 329656 order by id limit 1000) as sub where c0 IS NULL and c1 IS NULL and c2 IS NULL and c3 IS NULL and c4 IS NULL and 1=1);: canceling statement due to lack of response within timeout period" pid=261239 service=ingest
May  9 19:25:50 ip-xxx stellar-horizon[261239]: time="2024-05-09T19:25:50.995+08:00" level=error msg="reaper failed: Error in DeleteRangeAll: Error clearing history_transactions: canceling statement due to statement timeout" pid=261239

But how do you solve this timeout problem?

@sreuland
Copy link
Contributor

sreuland commented May 9, 2024

@airstring , first, can you confirm that the environment variables for the running horizon process is getting the latest from /etc/default/stellar-horizon, you can dump the initial env variables of the pid:

cat /proc/<horizon_pid>/environ

does it have the RETENTION settings?

on the reap timeouts, the first approach could be to decrease the reap batch size, can you try --history-retention-reap-count 1000 and see if resolves?

if that doesn't work, an alternate approach for now would be to increase CONNECTION_TIMEOUT , it's expressed in seconds, try bumping that up to 30, and see if resolves. it does apply for all queries regardless of reaping.

we may need to raise a bug ticket on the reap sql timeouts, ideally those shouldn't be happening.

@sreuland
Copy link
Contributor

Hello @airstring , we have created a bug ticket to investigate the reaper performance issue similar as what you observed - #5320, if you don't mind, could you post your horizon system compute specs on #5320, such as:

  • cpu, ram of db host machine, is it shared with other processes?
  • cpu, ram for horizon host machine
  • what is the ledger range ingested db at present, i.e. what is the history_elder_ledger and history_latest_ledger

thanks!

@airstring
Copy link
Author

airstring commented May 17, 2024

@sreuland Thank you, you helped me solve the problem. Now reap is done automatically every hour.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants