test-run sometimes doesn't kill server started by test #345

Open
locker opened this issue Jul 5, 2022 · 0 comments
Labels
bug Something isn't working

Comments

Member

locker commented Jul 5, 2022

If a test starts a server, it should stop it on completion. However, if it doesn't, the server should still be stopped by test-run or luatest. Normally, this is what happens, but sometimes, the test server somehow survives.

How to reproduce:

  1. Revert the PR "test: stop server started by vinyl-luatest/update_optimize test" (tarantool#7359).
    The PR added a server stop to vinyl-luatest/update_optimize_test.lua.
  2. Run the test in a loop:
    yes vinyl-luatest/update_optimize_test.lua | head -n 30 | xargs ./test-run.py --builddir ../build/debug
    
  3. Check if there are Tarantool instances left after the test:
    ps ax | grep tarantool
    

(You'll probably need to try a few times before you catch it.)
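
Since the leak does not show up on every run, the steps above can be wrapped in a retry loop. The following is a rough sketch only: `stray_pids` and `repro_until_stray` are hypothetical helper names, and the filter assumes the process title starts with `tarantool`, as it does in the `ps` output of this report. It will also report any unrelated Tarantool instance running on the same machine.

```shell
# Print the PIDs of Tarantool instances found in `ps ax` output read from stdin.
# Matching on the first word of the command column (field 5) skips the
# `grep tarantool` line that the plain pipeline always shows.
stray_pids() {
    awk '$5 == "tarantool" { print $1 }'
}

# Run the 30-test batch up to N times (default 10) from the test directory
# and stop at the first attempt that leaves an instance behind.
repro_until_stray() {
    max_attempts=${1:-10}
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        yes vinyl-luatest/update_optimize_test.lua | head -n 30 |
            xargs ./test-run.py --builddir ../build/debug >/dev/null 2>&1
        pids=$(ps ax | stray_pids)
        if [ -n "$pids" ]; then
            echo "attempt $attempt: stray instance(s): $pids"
            return 0
        fi
        attempt=$((attempt + 1))
    done
    echo "no stray instance after $max_attempts attempts"
    return 1
}

# Usage (from tarantool/test):
#   repro_until_stray 20
```

Filtering on the command column rather than grepping the whole line avoids the false positive from the `grep` process itself.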

A stray instance is running normally: it can be connected to or killed with SIGTERM.
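
Because a stray instance responds to SIGTERM, it can be cleaned up by hand between runs. The sketch below is not how test-run or luatest stop servers; `term_and_wait` is a hypothetical helper that sends SIGTERM, polls until the process is gone, and treats an already-dead process as success.

```shell
# Send SIGTERM to a process and wait for it to exit.
# $1: PID, $2: maximum number of one-second polls (default 5).
# Returns 0 if the process is gone, 1 if it is still alive after the polls.
term_and_wait() {
    pid=$1
    tries=${2:-5}
    # If the signal cannot be delivered, the process has already exited
    # (or belongs to another user); there is nothing left to wait for.
    kill -TERM "$pid" 2>/dev/null || return 0
    while [ "$tries" -gt 0 ]; do
        # `kill -0` only checks that the process still exists.
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        tries=$((tries - 1))
    done
    return 1
}

# Usage, with the PID from the output in this report:
#   term_and_wait 241803 || echo "instance ignored SIGTERM"
```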

Output
vlad@esperanza:~/src/tarantool/tarantool/test$ ps ax | grep tarantool
 240930 pts/3    S+     0:00 grep --color=auto tarantool
vlad@esperanza:~/src/tarantool/tarantool/test$ yes vinyl-luatest/update_optimize_test.lua | head -n 30 | xargs ./test-run.py --builddir ../build/debug
Started ./test-run.py --builddir ../build/debug vinyl-luatest/update_optimize_test.lua vinyl-luatest/update_optimize_test.lua vinyl-luatest/update_optimize_test.lua vinyl-luatest/update_optimize_test.lua vinyl-luatest/update_optimize_test.lua vinyl-luatest/update_optimize_test.lua vinyl-luatest/update_optimize_test.lua vinyl-luatest/update_optimize_test.lua vinyl-luatest/update_optimize_test.lua vinyl-luatest/update_optimize_test.lua vinyl-luatest/update_optimize_test.lua vinyl-luatest/update_optimize_test.lua vinyl-luatest/update_optimize_test.lua vinyl-luatest/update_optimize_test.lua vinyl-luatest/update_optimize_test.lua vinyl-luatest/update_optimize_test.lua vinyl-luatest/update_optimize_test.lua vinyl-luatest/update_optimize_test.lua vinyl-luatest/update_optimize_test.lua vinyl-luatest/update_optimize_test.lua vinyl-luatest/update_optimize_test.lua vinyl-luatest/update_optimize_test.lua vinyl-luatest/update_optimize_test.lua vinyl-luatest/update_optimize_test.lua vinyl-luatest/update_optimize_test.lua vinyl-luatest/update_optimize_test.lua vinyl-luatest/update_optimize_test.lua vinyl-luatest/update_optimize_test.lua vinyl-luatest/update_optimize_test.lua vinyl-luatest/update_optimize_test.lua
Running in parallel with 16 workers

Timeout options:
-------------------
SERVER_START_TIMEOUT:     90
REPLICATION_SYNC_TIMEOUT: 100
TEST_TIMEOUT:             110
NO_OUTPUT_TIMEOUT:        120

Collecting tests in 'app'            (Found 0   tests): application server tests.
Collecting tests in 'app-luatest'    (Found 0   tests): application server tests on luatest.
Collecting tests in 'app-tap'        (Found 0   tests): application server tests (TAP).
Collecting tests in 'box'            (Found 0   tests): Database tests.
Collecting tests in 'box-luatest'    (Found 0   tests): Database tests.
Collecting tests in 'box-py'         (Found 0   tests): legacy python tests.
Collecting tests in 'box-tap'        (Found 0   tests): Database tests with #! using TAP.
Collecting tests in 'engine'         (Found 0   tests): tarantool multiengine tests.
Collecting tests in 'engine-luatest' (Found 0   tests): Database tests.
Collecting tests in 'engine-tap'     (Found 0   tests): tarantool multiengine tap tests.
Collecting tests in 'engine_long'    (Found 0   tests): tarantool engine stress tests.
Collecting tests in 'long_run-py'    (Found 0   tests): long running tests.
Collecting tests in 'replication'    (Found 0   tests): tarantool/box, replication.
Collecting tests in 'replication-luatest' (Found 0   tests): replication luatests.
Collecting tests in 'replication-py' (Found 0   tests): tarantool/box, replication.
Collecting tests in 'small'          (Found 0   tests): libsmall unit tests.
Collecting tests in 'sql'            (Found 0   tests): sql tests.
Collecting tests in 'sql-luatest'    (Found 0   tests): SQL tests on luatest.
Collecting tests in 'sql-tap'        (Found 0   tests): Database tests with #! using TAP.
Collecting tests in 'swim'           (Found 0   tests): SWIM tests.
Collecting tests in 'unit'           (Found 0   tests): unit tests.
Collecting tests in 'vinyl'          (Found 0   tests): vinyl integration tests.
Collecting tests in 'vinyl-luatest'  (Found 30  tests): vinyl space engine luatests.
Collecting tests in 'wal_off'        (Found 0   tests): tarantool/box, wal_mode = none.
Collecting tests in 'xlog'           (Found 0   tests): tarantool write ahead log tests.
Collecting tests in 'xlog-py'        (Found 0   tests): legacy python tests.

Tarantool server information
 | Found executable at /home/vlad/src/tarantool/tarantool/build/debug/src/tarantool
 | Found tarantoolctl at /home/vlad/src/tarantool/tarantool/build/debug/extra/dist/tarantoolctl

 | Tarantool 2.11.0-entrypoint-201-g1fa9b6648725
 | Target: Linux-x86_64-Debug
 | Build options: cmake . -DCMAKE_INSTALL_PREFIX=/usr/local -DENABLE_BACKTRACE=ON
 | Compiler: /usr/bin/cc /usr/bin/c++
 | C_FLAGS: -fexceptions -funwind-tables -fno-common -fopenmp -msse2 -std=c11 -Wall -Wextra -Wno-strict-aliasing -Wno-char-subscripts -Wno-format-truncation -Wno-gnu-alignof-expression -fno-gnu89-inline -Wno-cast-function-type -Werror
 | CXX_FLAGS: -fexceptions -funwind-tables -fno-common -fopenmp -msse2 -std=c++11 -Wall -Wextra -Wno-strict-aliasing -Wno-char-subscripts -Wno-format-truncation -Wno-invalid-offsetof -Wno-gnu-alignof-expression -Wno-cast-function-type -Werror


======================================================================================
WORKR TEST                                            PARAMS          RESULT
---------------------------------------------------------------------------------
[007] vinyl-luatest/update_optimize_test.lua                          [ pass ]
[003] vinyl-luatest/update_optimize_test.lua                          [ pass ]
[004] vinyl-luatest/update_optimize_test.lua                          [ pass ]
[010] vinyl-luatest/update_optimize_test.lua                          [ pass ]
[009] vinyl-luatest/update_optimize_test.lua                          [ pass ]
[001] vinyl-luatest/update_optimize_test.lua                          [ pass ]
[013] vinyl-luatest/update_optimize_test.lua                          [ pass ]
[008] vinyl-luatest/update_optimize_test.lua                          [ pass ]
[014] vinyl-luatest/update_optimize_test.lua                          [ pass ]
[006] vinyl-luatest/update_optimize_test.lua                          [ pass ]
[002] vinyl-luatest/update_optimize_test.lua                          [ pass ]
[011] vinyl-luatest/update_optimize_test.lua                          [ pass ]
[016] vinyl-luatest/update_optimize_test.lua                          [ pass ]
[005] vinyl-luatest/update_optimize_test.lua                          [ pass ]
[012] vinyl-luatest/update_optimize_test.lua                          [ pass ]
[015] vinyl-luatest/update_optimize_test.lua                          [ pass ]
[007] vinyl-luatest/update_optimize_test.lua                          [ pass ]
[001] vinyl-luatest/update_optimize_test.lua                          [ pass ]
[004] vinyl-luatest/update_optimize_test.lua                          [ pass ]
[010] vinyl-luatest/update_optimize_test.lua                          [ pass ]
[006] vinyl-luatest/update_optimize_test.lua                          [ pass ]
[014] vinyl-luatest/update_optimize_test.lua                          [ pass ]
[003] vinyl-luatest/update_optimize_test.lua                          [ pass ]
[005] vinyl-luatest/update_optimize_test.lua                          [ pass ]
[002] vinyl-luatest/update_optimize_test.lua                          [ pass ]
[009] vinyl-luatest/update_optimize_test.lua                          [ pass ]
[013] vinyl-luatest/update_optimize_test.lua                          [ pass ]
[016] vinyl-luatest/update_optimize_test.lua                          [ pass ]
[011] vinyl-luatest/update_optimize_test.lua                          [ pass ]
[008] vinyl-luatest/update_optimize_test.lua                          [ pass ]
---------------------------------------------------------------------------------
Top 10 tests by occupied memory (RSS, Mb):
*  401.0 vinyl-luatest/update_optimize_test.lua

(Tests quicker than 0.1 seconds may be missed.)
---------------------------------------------------------------------------------
Top 10 longest tests (seconds):
*   1.81 vinyl-luatest/update_optimize_test.lua
---------------------------------------------------------------------------------
Statistics:
* pass: 30
vlad@esperanza:~/src/tarantool/tarantool/test$ ps ax | grep tarantool
 241803 pts/3    Sl     0:00 tarantool default.lua <running>
 242301 pts/3    S+     0:00 grep --color=auto tarantool
vlad@esperanza:~/src/tarantool/tarantool/test$ kill 241803
vlad@esperanza:~/src/tarantool/tarantool/test$ ps ax | grep tarantool
 242497 pts/3    S+     0:00 grep --color=auto tarantool
locker added the bug label Jul 5, 2022
locker added a commit to locker/tarantool that referenced this issue Jul 8, 2022
The gh_6565 test doesn't stop the hot standby replica it started,
because the replica should fail to initialize and exit eventually
anyway. However, if the replica lingers until the next test due to
tarantool/test-run#345, the next test may
successfully connect to it, which is likely to lead to a failure,
because UNIX socket paths used by luatest servers are not randomized.

For example, here gh_6568 test fails after gh_6565, because it uses the
same alias for the test instance ('replica'):

NO_WRAP
[008] vinyl-luatest/gh_6565_hot_standby_unsupported_>                 [ pass ]
[008] vinyl-luatest/gh_6568_replica_initial_join_rem>                 [ fail ]
[008] Test failed! Output from reject file /tmp/t/rejects/vinyl-luatest/gh_6568_replica_initial_join_removal_of_compacted_run_files.reject:
[008] TAP version 13
[008] 1..1
[008] # Started on Fri Jul  8 15:30:47 2022
[008] # Starting group: gh-6568-replica-initial-join-removal-of-compacted-run-files
[008] not ok 1  gh-6568-replica-initial-join-removal-of-compacted-run-files.test_replication_compaction_cleanup
[008] #   builtin/fio.lua:242: fio.pathjoin(): undefined path part 1
[008] #   stack traceback:
[008] #         builtin/fio.lua:242: in function 'pathjoin'
[008] #         ...ica_initial_join_removal_of_compacted_run_files_test.lua:43: in function 'gh-6568-replica-initial-join-removal-of-compacted-run-files.test_replication_compaction_cleanup'
[008] #         ...
[008] #         [C]: in function 'xpcall'
[008] replica | 2022-07-08 15:30:48.311 [832856] main/103/default.lua F> can't initialize storage: unlink, called on fd 30, aka unix/:(socket), peer of unix/:(socket): Address already in use
[008] # Ran 1 tests in 0.722 seconds, 0 succeeded, 1 errored
NO_WRAP

Let's fix this by explicitly killing the hot standby replica. Since it
could have exited voluntarily, we need to use pcall, because server.stop
fails if the instance is already dead.

This issue is similar to the one fixed by commit 8504016 ("test:
stop server started by vinyl-luatest/update_optimize test").

NO_DOC=test
NO_CHANGELOG=test
locker added a commit to tarantool/tarantool that referenced this issue Jul 8, 2022
locker added a commit to tarantool/tarantool that referenced this issue Jul 8, 2022
(cherry picked from commit 6213907)
kyukhin added the teamX label Jul 15, 2022
mkokryashkin pushed a commit to mkokryashkin/tarantool that referenced this issue Sep 9, 2022
3 participants