Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SQLite migration #150

Merged
merged 42 commits into from Jun 28, 2022
Merged

SQLite migration #150

merged 42 commits into from Jun 28, 2022

Conversation

jsign
Copy link
Contributor

@jsign jsign commented Jun 27, 2022

This PR migrates from using Postgres to SQLite.

In this PR description, I’ll list a high-level overview of the changes, for more details see the PR comments.

sqlc and migrations tools

sqlc doesn’t support sqlite so unfortunately, we can’t rely on it for automatically generating models and queries. There’s ongoing work in that library to add support but isn’t there yet. When that happens, we could think of moving again to sqlc for convenience.
What I did regarding this was:
1- Switching in the sqlc config to use the database/sql generic driver instead of the Postgres one we were using.
2- Autogenerate again the queries so generated statements and models depend on database/sql types.
3- Remove sqlc from our project.

This removed many postgres dependencies from our generated SQL code. Obviously, the generated code wasn’t working since we had queries with Postgres syntax and similar, but was a good clean-up before removing sqlc.

Regarding migrations, we still use golang-migrate for migrations. Obviously, I had to start from scratch in migrations since our history of migrations was full of Postgres-syntax changes which won’t work in SQLite. So, we started again from 001_ which is also nice to have a fresh starting point with our current schema.

Validator configuration

Regarding the validator configuration flags, two important changes.

Changes in flags:

  • All the flags related to database connection are gone since there’s no user, password, or URL.
  • The AdminAPI section is gone. The admin user or admin password wasn’t used since we removed whitelisting, so made no sense to still have this configuration which can add confusion to configuring the validator. If we ever need them again, it’s totally fine to include them again.
  • The Impl attribute was removed since this was to enable a mocked Mesa service that made sense some months ago.
  • The EventFeed.MaxBlocksFetchSize was removed since I made a change to dynamically find the right value. Coming up with a good configuration number manually/statically was complicated. This allows cold syncing to be faster to consume the history of the chain that didn’t have events as fast as possible, and later adjust the maximum response size when events starts appearing.

There’s a new --dir flag that indicates the state directory of the validator, by default ${HOME}/.tableland which:

  • Contains the config.json file, instead of forcing the validator to find it in the same path of executing the validator as it was now.
  • Will have the SQLite database.

In a nutshell, all the state of the validator will be in the --dir folder.

SQLite setup

Regarding SQLite3 drivers, after digging around with potential SQLite3 drivers, we decided to go with mattn/sqlite3 since looks, like it’s the best, maintained one and battle-tested. There are some other CGO and non-CGO options but we valued more maturity.

For the connection URI we use file://%s?_busy_timeout=5000&_foreign_keys=on&_journal_mode=WAL which I’ll explain:

  • _busy_timeout is a configuration to manage how long are we willing to wait for a write-access to the database. Recall that SQLite only allows a single-writer; so if there’s already a writer doing actions in the database, a new writer can get a database is locked error. The _busy_timeout allows to wait up to X milliseconds before returning the error. This configuration is only to avoid potential error noise; the application should always handle the error.
  • _foreign_keys=on simply enables foreign key checking since it’s disabled by default.
  • _journal_mode=WAL is a new-ish way to run the database which uses a WAL instead of journaling to handle transactions. WAL mode allows for having multiple readers and a single writer at the same time, compared with the default journal mode which a writer blocks excludes any reader. In WAL mode all readers have point-in-time isolation. To understand more about WAL mode, see the docs.

ACL representation in the DB

In the Postgres implementation, we leveraged support for array-type columns and UNNEST() to do some magic of adding or removing privileges in a single-roundtrip.

Since SQLite3 doesn’t have array types we can’t use what we have. Since I wanted to also do the addition or removal of privileges in a single roundtrip, I represented privileges as an INTEGER and used bitwise operations (which are supported in SQLite3) to add and remove privileges in one roundtrip.

See the PR comments for more details.

SQL result to JSON

This logic was greatly simplified since I noticed that the way Scan(..) worked if we gave *interface{} returned the same values used by the underlying driver, and in mattn/sqlite3 case that were exactly the things we expected. More on this in PR comments.

Nested txns

In our event processing pipeline, we use nested txns to do a crash-resistant execution of block-level events which:

  • Can fail independently, without forcing a rollback on other successful events in the same block
  • while running everything in block-level txn, so if the validator crashes at any point in time, the database is in a correct state.

The database/sql driver doesn’t have support for nested txns as the previous Postgres driver had. SQLite3 has support for nested txns, so I had to do a helper function to manually run SAVEPOINT . From the perspective of our logic, we do exactly the same but use this new method to wrap whatever logic we want to do in the nested txns.

More about this in PR comments.

Detecting query vs infrastructure errors in the processor

The mattn/sqlite3 driver has a two-level grouping of SQL errors, which is very convenient for what we wanted to do in distinguishing user-related errors vs infrastructure ones. More on this in this PR comment

Tests

Now the tests use in-memory SQLite databases. Locally I’ve noticed a 4x (13s vs ~55s) speed improvement in running tests because of that, plus some other tests parallelization work I’ve done in some places.

In CI, I still see some flakiness probably because now more tests can run in parallel which may cause some memory problems. Since I’m tired of that, I simply made the CI action run the tests up to 3 times. If there are some flaky tests, the second run will have 99% of the tests run instantly due to caching and only retry whatever tests might be failed. It’s a pretty non-invasive change, and it will save us (hopefully) many headaches. This approach sounds better than simply limiting the parallelization run of the tests in CI, which is simply accepting to run things slower.

jsign added 30 commits June 27, 2022 10:27
Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>
Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>
Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>
Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>
Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>
Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>
…ID from int64

Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>
Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>
Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>
Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>
…sted txns

Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>
Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>
Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>
Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>
Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>
Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>
Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>
Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>
Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>
Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>
Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>
Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>
Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>
Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>
Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>
.
Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>
Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>
Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>
Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>
Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>
Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>
Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>
@jsign jsign self-assigned this Jun 27, 2022
@jsign jsign changed the title Sqlite migration SQLite migration Jun 27, 2022
@jsign jsign force-pushed the jsign/sqlitemigr branch 3 times, most recently from 137e117 to df19b56 Compare June 27, 2022 15:05
Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>
Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>
Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>
Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>
Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>
@@ -14,12 +14,16 @@ linters:
- whitespace
- godot
- lll
- sqlclosecheck
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added this extra linter that sounds useful.

Comment on lines +28 to +29
skip-dirs:
- "pkg/sqlstore/impl/system/internal/db"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The .go files here are where the previous sqlc files existed, which had a top comment saying that the code was automatically generated, which by default the linters ignore. Now that we don't use sqlic, those files aren't autogenerated, so just avoid lints there as usual.

Makefile Show resolved Hide resolved
func setupConfig() *config {
fileBytes, err := os.ReadFile(configFilename)
fileStr := string(fileBytes)
func setupConfig() (*config, string) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some changes here to support the new --dir flag.
It loads the config.json from the directory folder, and also returns the parsed directory folder path so in main.go we can read the database file from there.

Comment on lines +83 to +91
flagDirPath := flag.String("dir", "${HOME}/.tableland", "Directory where the configuration and DB exist")
flag.Parse()
if flagDirPath == nil {
log.Fatal().Msg("--dir is null")
return nil, "" // Helping the linter know the next line is safe.
}
dirPath := os.ExpandEnv(*flagDirPath)

_ = os.MkdirAll(dirPath, 0755)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Read the --dir flag via the flag package, and expand env vars.
If the folder doesn't exist, create it.

`INSERT INTO system_acl ("chain_id","table_id","controller","privileges","created_at")
VALUES (?1, ?2, ?3, ?4, ?5)
ON CONFLICT (chain_id,table_id,controller)
DO UPDATE SET privileges = privileges | ?4, updated_at = ?5`,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We OR the current value in the table with the mask we generated. This will activate the desired bits thus adding the wanted privileges.

privileges,
); err != nil {
privilegesMask,
time.Now().Unix()); err != nil {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note here that we use time.Now().Unix() to be used in created_at in case of an insert, but the real intention is to use it in updated_at in the UPDATE case. For other cases of created_at in this file we don't need time.Now().Unix() since we have a default value. But for updated_at, we need an explicit one.

Comment on lines +797 to 812
// Tune the mask to have a 0 in the places we want to disable the bit.
// For example, if we want to remove tableland.PrivUpdate, the following
// code will transform 111 -> 101.
// We'll then use 101 to AND the value in the DB.
for _, privilege := range privileges {
switch privilege {
case tableland.PrivInsert:
privilegesMask &^= tableland.PrivInsert.Bitfield
case tableland.PrivUpdate:
privilegesMask &^= tableland.PrivUpdate.Bitfield
case tableland.PrivDelete:
privilegesMask &^= tableland.PrivDelete.Bitfield
default:
return fmt.Errorf("unknown privilege: %s", privilege.Abbreviation)
}
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we're in executeRevokePrivilegesTxn, so the opposite case of the described one above.
To do this we do the opposite. We start with a full mask of all bits enabled (L796), and we set to 0 the desired privileges we want to disable.

Some docs to explain what &^ is just in case:

The &^ operator is bit clear (AND NOT): in the expression z = x &^ y, each bit of z is 0 if the corresponding bit of y is 1; otherwise it equals the corresponding bit of x.

Exactly what we needed.

}
if _, err := tx.ExecContext(ctx,
`UPDATE system_acl
SET privileges = privileges & ?4, updated_at = ?5
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then we do an AND which will keep all the potentially existing privileges, but it will disable the ones that we wanted (since there's a 0 in those bit positions).

Comment on lines +836 to +868
// isErrCausedByQuery detects if the query execution failed because of possibly expected
// bad queries from users. If that's the case the call might want to accept the failure
// as an expected event in the flow.
func isErrCausedByQuery(err error) (string, bool) {
// This array contains all the sqlite errors that should be query related.
// e.g: inserting a column with the wrong type, some function call failing, etc.
// All these errors must be errors that will always happen if the query is retried.
// (e.g: a timeout error isn't the querys fault, but an infrastructure problem)
//
// Each error in sqlite3 has an "Error Code" and an "Extended error code".
// e.g: a FK violation has "Error Code" 19 (ErrConstraint) and
// "Extended error code" 787 (SQLITE_CONSTRAINT_FOREIGNKEY).
// The complete list of extended errors is found in: https://www.sqlite.org/rescode.html
// In this logic if we use "Error Code", with some few cases, we can detect a wide range of errors without
// being so exhaustive dealing with "Extended error codes".
//
// sqlite3ExecutionErrors is probably missing values, but we'll keep discovering and adding them.
sqlite3ExecutionErrors := []sqlite3.ErrNo{
sqlite3.ErrError, /* SQL error or missing database */
sqlite3.ErrConstraint, /* Abort due to constraint violation */
sqlite3.ErrTooBig, /* String or BLOB exceeds size limit */
sqlite3.ErrMismatch, /* Data type mismatch */
}
var sqlErr sqlite3.Error
if errors.As(err, &sqlErr) {
for _, ee := range sqlite3ExecutionErrors {
if sqlErr.Code == ee {
return sqlErr.Error(), true
}
}
}
return "", false
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I went pretty long in the comments so they already explain the gist.

The good part is that I think we already have all the values we need here, compared to the Postgres case.
In the sqlite driver, the error codes have two levels: an Error code which is pretty general like "error in constraints", and an Extended error code that gives more detail on which kind of constraint error (FK?, CHECK?, etc).

The good part is that the Error code level contains few values, see here. I think the ones I included here should cover all scenarios which are nice.

We'll see with time... in the worst case, we can have a long list of extended errors for some cases, which would be more similar to how the Postgres detection worked. But that style is kind of endless waiting for finding the next unknown error, so pretty bullish on this.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

awesome 👏

@jsign jsign requested a review from brunocalza June 27, 2022 19:19
@jsign jsign marked this pull request as ready for review June 27, 2022 19:19
@sanderpick
Copy link
Member

Thanks for the great summary 🙌

Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>
Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>
Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>
Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>
Copy link
Collaborator

@brunocalza brunocalza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! Very excited to see this change coming true. Makes the whole architecture much simpler. 👏👏

Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>
@jsign jsign merged commit bdfa368 into sqlite Jun 28, 2022
@jsign jsign deleted the jsign/sqlitemigr branch July 7, 2022 13:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants