Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: incorrect splitting lookup span into family spans #76289

Closed
cockroach-teamcity opened this issue Feb 9, 2022 · 12 comments · Fixed by #76563
Closed

sql: incorrect splitting lookup span into family spans #76289

cockroach-teamcity opened this issue Feb 9, 2022 · 12 comments · Fixed by #76563
Assignees
Labels
branch-master Failures on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-sql-queries SQL Queries Team
Projects

Comments

@cockroach-teamcity
Copy link
Member

roachtest.tlp failed with artifacts on master @ 432a383c1a7509762ce66a2b9ae7869d82da5f31:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /artifacts/tlp/run_1
	tlp.go:157,tlp.go:70,test_runner.go:779: expected unpartitioned and partitioned results to be equal
		(1) attached stack trace
		  -- stack trace:
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runTLPQuery.func2
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tlp.go:230
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runWithTimeout.func1
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tlp.go:255
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1581
		Wraps: (2) expected unpartitioned and partitioned results to be equal
		  |   []string{
		  |   	"-3.70146132388279487E+33,p,-218355588",
		  | - 	"7786967667765.792819,NULL,-626397434",
		  |   }
		  | sql: SELECT DISTINCT tab_21127.col3_4,tab_21127.col3_5,tab_21127.col3_1 FROM defaultdb.public.table3 AS tab_21127
		  | (SELECT DISTINCT tab_21127.col3_4,tab_21127.col3_5,tab_21127.col3_1 FROM defaultdb.public.table3 AS tab_21127 WHERE (NOT tab_21127.col3_6)) UNION (SELECT DISTINCT tab_21127.col3_4,tab_21127.col3_5,tab_21127.col3_1 FROM defaultdb.public.table3 AS tab_21127 WHERE NOT ((NOT tab_21127.col3_6))) UNION (SELECT DISTINCT tab_21127.col3_4,tab_21127.col3_5,tab_21127.col3_1 FROM defaultdb.public.table3 AS tab_21127 WHERE ((NOT tab_21127.col3_6)) IS NULL)
		  | with args: []
		Error types: (1) *withstack.withStack (2) *errutil.leafError
Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

/cc @cockroachdb/sql-queries

This test on roachdash | Improve this report!

@cockroach-teamcity cockroach-teamcity added branch-master Failures on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Feb 9, 2022
@cockroach-teamcity cockroach-teamcity added this to Triage in SQL Queries Feb 9, 2022
@blathers-crl blathers-crl bot added the T-sql-queries SQL Queries Team label Feb 9, 2022
@cockroach-teamcity
Copy link
Member Author

roachtest.tlp failed with artifacts on master @ d10188f400deafa2569b052cd3e9fd9713783ed9:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /artifacts/tlp/run_1
	cluster.go:1935,tlp.go:71,test_runner.go:779: cluster.StopE: context deadline exceeded
		(1) attached stack trace
		  -- stack trace:
		  | main.(*clusterImpl).StopE
		  | 	main/pkg/cmd/roachtest/cluster.go:1915
		  | main.(*clusterImpl).Stop
		  | 	main/pkg/cmd/roachtest/cluster.go:1934
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runTLP
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tlp.go:71
		  | main.(*testRunner).runTest.func2
		  | 	main/pkg/cmd/roachtest/test_runner.go:779
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1581
		Wraps: (2) cluster.StopE
		Wraps: (3) context deadline exceeded
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) context.deadlineExceededError
Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

This test on roachdash | Improve this report!

@cockroach-teamcity
Copy link
Member Author

roachtest.tlp failed with artifacts on master @ 84a0e500920e6d0b935d2007de256e42822e94c7:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /artifacts/tlp/run_1
	cluster.go:1935,tlp.go:71,test_runner.go:779: cluster.StopE: context deadline exceeded
		(1) attached stack trace
		  -- stack trace:
		  | main.(*clusterImpl).StopE
		  | 	main/pkg/cmd/roachtest/cluster.go:1915
		  | main.(*clusterImpl).Stop
		  | 	main/pkg/cmd/roachtest/cluster.go:1934
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runTLP
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tlp.go:71
		  | main.(*testRunner).runTest.func2
		  | 	main/pkg/cmd/roachtest/test_runner.go:779
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1581
		Wraps: (2) cluster.StopE
		Wraps: (3) context deadline exceeded
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) context.deadlineExceededError
Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

This test on roachdash | Improve this report!

@cockroach-teamcity
Copy link
Member Author

roachtest.tlp failed with artifacts on master @ 80f1c2ce09389f1d7e97376964d3f2a922405b1b:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /artifacts/tlp/run_1
	cluster.go:1935,tlp.go:71,test_runner.go:779: cluster.StopE: context deadline exceeded
		(1) attached stack trace
		  -- stack trace:
		  | main.(*clusterImpl).StopE
		  | 	main/pkg/cmd/roachtest/cluster.go:1915
		  | main.(*clusterImpl).Stop
		  | 	main/pkg/cmd/roachtest/cluster.go:1934
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runTLP
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tlp.go:71
		  | main.(*testRunner).runTest.func2
		  | 	main/pkg/cmd/roachtest/test_runner.go:779
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1581
		Wraps: (2) cluster.StopE
		Wraps: (3) context deadline exceeded
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) context.deadlineExceededError
Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

This test on roachdash | Improve this report!

@cockroach-teamcity
Copy link
Member Author

roachtest.tlp failed with artifacts on master @ 260be01361ed7264a870e530397ed3529cd014f7:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /artifacts/tlp/run_1
	cluster.go:1935,tlp.go:71,test_runner.go:779: cluster.StopE: context deadline exceeded
		(1) attached stack trace
		  -- stack trace:
		  | main.(*clusterImpl).StopE
		  | 	main/pkg/cmd/roachtest/cluster.go:1915
		  | main.(*clusterImpl).Stop
		  | 	main/pkg/cmd/roachtest/cluster.go:1934
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runTLP
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tlp.go:71
		  | main.(*testRunner).runTest.func2
		  | 	main/pkg/cmd/roachtest/test_runner.go:779
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1581
		Wraps: (2) cluster.StopE
		Wraps: (3) context deadline exceeded
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) context.deadlineExceededError
Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

This test on roachdash | Improve this report!

@cockroach-teamcity
Copy link
Member Author

roachtest.tlp failed with artifacts on master @ bbb473c8f304ac20fec51ff0a0d04e128383bcf6:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /artifacts/tlp/run_1
	cluster.go:1935,tlp.go:71,test_runner.go:779: cluster.StopE: context deadline exceeded
		(1) attached stack trace
		  -- stack trace:
		  | main.(*clusterImpl).StopE
		  | 	main/pkg/cmd/roachtest/cluster.go:1915
		  | main.(*clusterImpl).Stop
		  | 	main/pkg/cmd/roachtest/cluster.go:1934
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runTLP
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tlp.go:71
		  | main.(*testRunner).runTest.func2
		  | 	main/pkg/cmd/roachtest/test_runner.go:779
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1581
		Wraps: (2) cluster.StopE
		Wraps: (3) context deadline exceeded
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) context.deadlineExceededError
Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

This test on roachdash | Improve this report!

@yuzefovich
Copy link
Member

yuzefovich commented Feb 14, 2022

First failure is reduced to:

SET sql_safe_updates = false;

CREATE TABLE table3 (col3_0 INT8 NOT NULL, col3_2 REGNAMESPACE NOT NULL, col3_3 TIMESTAMP NOT NULL, col3_4 DECIMAL NOT NULL, col3_5 "char", col3_6 BOOL NOT NULL, PRIMARY KEY (col3_3 DESC, col3_4 DESC, col3_6, col3_0 DESC), UNIQUE (col3_6 DESC), FAMILY (col3_0), FAMILY (col3_4), FAMILY (col3_2, col3_6), FAMILY (col3_5, col3_3));

INSERT
INTO
	table3 (col3_6, col3_0, col3_3, col3_4, col3_5, col3_2)
VALUES
	(
		false,
		(-1061417075):::INT8,
		'2013-01-30 17:00:17.000512':::TIMESTAMP,
		6.384946904185073993E+28:::DECIMAL,
		'S':::STRING,
		839972907:::OID
	);

UPDATE
	table3
SET
	col3_5 = NULL, col3_2 = 2707732394:::OID;

SELECT DISTINCT col3_4, col3_5 FROM table3;

(SELECT DISTINCT col3_4, col3_5 FROM table3 WHERE (NOT col3_6)) UNION (SELECT DISTINCT col3_4, col3_5 FROM table3 WHERE NOT (NOT col3_6)) UNION (SELECT DISTINCT col3_4, col3_5 FROM table3 WHERE (NOT col3_6) IS NULL);

Here is where the problem appears:

root@127.0.0.1:26257/defaultdb> SELECT DISTINCT col3_4, col3_5 FROM table3 WHERE (NOT col3_6);
  col3_4 | col3_5
---------+---------
(0 rows)


Time: 1ms total (execution 1ms / network 0ms)

root@127.0.0.1:26257/defaultdb> SELECT DISTINCT col3_4, col3_5 FROM table3@primary WHERE (NOT col3_6);
           col3_4          | col3_5
---------------------------+---------
  6.384946904185073993E+28 | NULL
(1 row)

Somehow we're missing a row when reading of a UNIQUE secondary index, and my guess is that we have a problem with UPDATE.

@yuzefovich
Copy link
Member

The bug is present on both 21.1.13 and 21.2.5.

I no longer think the problem is with the UPDATE because querying for a single column returns the correct result (with the plan being the same). I'm currently thinking that we're incorrectly constructing Get requests for the index join (on 21.1 the same applies to Scan requests for the index join).

@yuzefovich
Copy link
Member

yuzefovich commented Feb 15, 2022

Smaller repro:

CREATE TABLE t (
  pk1 DECIMAL NOT NULL, pk2 BOOL NOT NULL, c1 INT8, c2 INT8,
  PRIMARY KEY (pk1, pk2),
  UNIQUE (pk2),
  FAMILY (c1), FAMILY (pk1, pk2, c2)
);
INSERT
INTO
	t (pk1, pk2, c1)
VALUES
	(
		1:::DECIMAL,
		false,
		0:::INT8
	);
SELECT c2 FROM t WHERE (NOT pk2);

The order of column families matters, the order of column in index matters, DECIMAL cannot be replaced with INT8, non-zero value for DECIMAL has to be inserted.

@yuzefovich yuzefovich removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Feb 15, 2022
@yuzefovich
Copy link
Member

I think I've narrowed it down.

The INSERT statements performs a single put for the first column family (familyID == 0) because the value of c2 is NULL:

  2022-02-15 01:19:43.124527+00 | 00:00:00.006515 | CPut /Table/104/1/1/0/0 -> /TUPLE/3:3:Int/0                 | [n1,client=127.0.0.1:56053,user=root]              | sql/row/writer.go:213                   | flow             |    7

However, later in the index join we're trying to read value of c2 using familyID == 1 and that lookups comes back empty:

  2022-02-15 01:19:57.784064+00 | 00:00:00.001573 | Scan /Table/104/1/1/0/1/1                   | [n1,client=127.0.0.1:56053,user=root]              | sql/row/kv_batch_fetcher.go:322         | join reader      |    5

@cockroach-teamcity
Copy link
Member Author

roachtest.tlp failed with artifacts on master @ 9514f13d0190849eda12b39e272c278b82c269f4:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /artifacts/tlp/run_1
	tlp.go:157,tlp.go:70,test_runner.go:779: expected unpartitioned and partitioned results to be equal
		(1) attached stack trace
		  -- stack trace:
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runTLPQuery.func2
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tlp.go:230
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runWithTimeout.func1
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tlp.go:255
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1581
		Wraps: (2) expected unpartitioned and partitioned results to be equal
		  |   []string{
		  |   	strings.Join({
		  |   		"e,",
		  | + 		"-",
		  |   		"0,false,1992-10-16 00:00:00 +0000 +0000,0000-01-01 00:00:00 +155",
		  |   		"9 +1559,1977-03-27 06:28:14.000752 +0000 UTC,false,false,0.77932",
		  |   		"2868568521,{na3@\n>1,false,true,false,true,false,false",
		  |   	}, ""),
		  |   	"v,-1.6114159461761979,true,1978-09-04 00:00:00 +0000 +0000,0000-"...,
		  |   }
		  | sql: SELECT *, tab_28698.col3_2, NOT (tab_28698.col3_2), (tab_28698.col3_2) IS NULL, true, false, false FROM defaultdb.public.table3 AS tab_28698
		  | (SELECT *,
		  | tab_28698.col3_2, NOT (tab_28698.col3_2), (tab_28698.col3_2) IS NULL,
		  | tab_28698.col3_2, NOT (tab_28698.col3_2), (tab_28698.col3_2) IS NULL
		  | FROM defaultdb.public.table3 AS tab_28698
		  | WHERE tab_28698.col3_2)
		  | UNION ALL (SELECT *,
		  | tab_28698.col3_2, NOT (tab_28698.col3_2), (tab_28698.col3_2) IS NULL,
		  | NOT (tab_28698.col3_2), tab_28698.col3_2, (tab_28698.col3_2) IS NULL
		  | FROM defaultdb.public.table3 AS tab_28698
		  | WHERE NOT (tab_28698.col3_2))
		  | UNION ALL (SELECT *,
		  | tab_28698.col3_2, NOT (tab_28698.col3_2), (tab_28698.col3_2) IS NULL,
		  | (tab_28698.col3_2) IS NULL, (tab_28698.col3_2) IS NOT NULL, (NOT (tab_28698.col3_2)) IS NOT NULL
		  | FROM defaultdb.public.table3 AS tab_28698
		  | WHERE (tab_28698.col3_2) IS NULL)
		  | with args: []
		Error types: (1) *withstack.withStack (2) *errutil.leafError
Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

This test on roachdash | Improve this report!

@yuzefovich yuzefovich changed the title roachtest: tlp failed sql: incorrect splitting lookup span into family spans Feb 15, 2022
@yuzefovich yuzefovich self-assigned this Feb 15, 2022
@yuzefovich yuzefovich moved this from Triage to 22.1 February Milestone in SQL Queries Feb 15, 2022
@yuzefovich
Copy link
Member

Hm, I cannot reproduce the latest failure (i.e. both queries return the same result sets):

root@:26257/defaultdb> SELECT *, tab_28698.col3_2, NOT (tab_28698.col3_2), (tab_28698.col3_2) IS NULL, true, false, false FROM defaultdb.public.table3 AS tab_28698                                                                                                                                 ;
  col3_0 |       col3_1        | col3_2 |   col3_3   |        col3_4         |            col3_5             | col3_6 | col3_7 |       col3_8        | col3_9 | col3_2 | ?column? | ?column? | bool | bool  | bool
---------+---------------------+--------+------------+-----------------------+-------------------------------+--------+--------+---------------------+--------+--------+----------+----------+------+-------+--------
  e      |                   0 | false  | 1992-10-16 | 00:00:00+15:59        | 1977-03-27 06:28:14.000752+00 | false  | false  |   0.779322868568521 | {na3@  | false  |   true   |  false   | true | false | false
         |                     |        |            |                       |                               |        |        |                     | >1     |        |          |          |      |       |
  v      | -1.6114159461761979 |  true  | 1978-09-04 | 08:09:16.083766-09:22 | 2026-09-14 13:35:00.000648+00 | true   | true   | -0.8320930776076769 | {na3@  |  true  |  false   |  false   | true | false | false
         |                     |        |            |                       |                               |        |        |                     | >1     |        |          |          |      |       |
(2 rows)
root@:26257/defaultdb> (SELECT *,                                                                                                                 tab_28698.col3_2, NOT (tab_28698.col3_2), (tab_28698.col3_2) IS NULL,                                                                             tab_28698.col3_2, NOT (tab_28698.col3_2), (tab_28698.col3_2) IS NULL                                                                              FROM defaultdb.public.table3 AS tab_28698                                                                                                         WHERE tab_28698.col3_2)                                                                                                                           UNION ALL (SELECT *,                                                                                                                              tab_28698.col3_2, NOT (tab_28698.col3_2), (tab_28698.col3_2) IS NULL,                                                                             NOT (tab_28698.col3_2), tab_28698.col3_2, (tab_28698.col3_2) IS NULL                                                                              FROM defaultdb.public.table3 AS tab_28698                                                                                                         WHERE NOT (tab_28698.col3_2))                                                                                                                     UNION ALL (SELECT *,                                                                                                                              tab_28698.col3_2, NOT (tab_28698.col3_2), (tab_28698.col3_2) IS NULL,                                                                             (tab_28698.col3_2) IS NULL, (tab_28698.col3_2) IS NOT NULL, (NOT (tab_28698.col3_2)) IS NOT NULL                                                  FROM defaultdb.public.table3 AS tab_28698                                                                                                         WHERE (tab_28698.col3_2) IS NULL);                                                                                                                 
  col3_0 |       col3_1        | col3_2 |   col3_3   |        col3_4         |            col3_5             | col3_6 | col3_7 |       col3_8        | col3_9 | col3_2 | ?column? | ?column? | col3_2 | ?column? | ?column?
---------+---------------------+--------+------------+-----------------------+-------------------------------+--------+--------+---------------------+--------+--------+----------+----------+--------+----------+-----------
  v      | -1.6114159461761979 |  true  | 1978-09-04 | 08:09:16.083766-09:22 | 2026-09-14 13:35:00.000648+00 | true   | true   | -0.8320930776076769 | {na3@  |  true  |  false   |  false   |  true  |  false   |  false
         |                     |        |            |                       |                               |        |        |                     | >1     |        |          |          |        |          |
  e      |                  -0 | false  | 1992-10-16 | 00:00:00+15:59        | 1977-03-27 06:28:14.000752+00 | false  | false  |   0.779322868568521 | {na3@  | false  |   true   |  false   |  true  |  false   |  false
         |                     |        |            |                       |                               |        |        |                     | >1     |        |          |          |        |          |
(2 rows)

@mgartner do you have an idea of why a failure would be filed in this case?

@craig craig bot closed this as completed in ffd8b81 Feb 15, 2022
SQL Queries automation moved this from 22.1 February Milestone to Done Feb 15, 2022
@mgartner
Copy link
Collaborator

@yuzefovich Reminds me of the bizarre failures I saw here: #74675 (comment). I think it's a problem with the test apparatus, but I haven't been able to confirm that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-sql-queries SQL Queries Team
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

3 participants