New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Syntactic indexing: enqueuer and scheduler #62485
base: main
Are you sure you want to change the base?
Conversation
11d5272
to
03414ce
Compare
It initialises a new database which of course doesn't contain any of the test data
for policyID, patterns := range map[int][]string{ | ||
1100: {"github.com/*"}, | ||
} { | ||
if err := store.UpdateReposMatchingPatterns(ctx, patterns, policyID, nil); err != nil { | ||
t.Fatalf("unexpected error while updating repositories matching patterns: %s", err) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@varungandhi-src I need to use this method to make sure there's data in database created for policy matcher to pick up a repository based on pattern matching.
To use it I had to make policies/store package non-internal. It's not great for encapsulation, but I'm not aware of an idiomatic Go way to make this work.
An alternative would be to change this particular policy to one matching repository by ID explicitly, slightly weakening the tests.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Idiomatic way would likely be to only expose a public interface with the method (or handful of methods) of interest. Sometimes that can be tricky though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sometimes that can be tricky though
Yeah, I think exposing this particular method is not worth it.
I think I can expose a new service whose sole purpose is to be used in the periodic job that updates the repo <--> policy
matching cache.
So store will remain internal, and the service will delegate to it. If if works out, I'll extract it to a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks to our call, I realised it was much easier than I thought initially: 42022f2
Caution License checking failed, please read: how to deal with third parties licensing. |
Caution License checking failed, please read: how to deal with third parties licensing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some initial comments; will take a deeper look at more of the code shortly.
func InitRawDB(observationCtx *observation.Context) (*sql.DB, error) { | ||
rawDB, err := initDatabaseMemo.Init(observationCtx) | ||
if err != nil { | ||
return nil, err | ||
} | ||
|
||
return rawDB, err | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
func InitRawDB(observationCtx *observation.Context) (*sql.DB, error) { | |
rawDB, err := initDatabaseMemo.Init(observationCtx) | |
if err != nil { | |
return nil, err | |
} | |
return rawDB, err | |
} | |
func InitRawDB(observationCtx *observation.Context) (*sql.DB, error) { | |
return initDatabaseMemo.Init(observationCtx) | |
} |
I'm a bit confused though. Why does this need to be called separately/why is InitDB
not called during the normal initialization?
type EnqueueOptions struct { | ||
force bool | ||
bypassLimit bool | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add doc comments for these fields.
} | ||
|
||
type operations struct { | ||
queueIndex *observation.Operation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
queueIndex *observation.Operation | |
queueIndexes *observation.Operation |
nit: Following existing conventions, this should match the method name exactly
} | ||
|
||
return &operations{ | ||
queueIndex: op("QueueIndex"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
queueIndex: op("QueueIndex"), | |
queueIndexes: op("QueueIndexes"), |
if err != nil { | ||
return nil, errors.Wrap(err, "gitserver.ResolveRevision") | ||
} | ||
commit := string(commitID) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's delay converting this to a string
until the very end. The api.CommitID
type says it is a 40-character SHA, that's useful to know compared to just the name commit
.
offset := 0 | ||
|
||
for { | ||
|
||
options := policiesshared.GetConfigurationPoliciesOptions{ | ||
RepositoryID: p.RepositoryID, | ||
ForSyntacticIndexing: forSyntacticIndexing, | ||
ForPreciseIndexing: forPreciseIndexing, | ||
Limit: p.BatchSize, | ||
Offset: offset, | ||
} | ||
|
||
policies, totalCount, err := p.Service.GetConfigurationPolicies(ctx, options) | ||
|
||
if err != nil { | ||
return err | ||
} | ||
|
||
if len(policies) == 0 { | ||
break | ||
} | ||
|
||
handlerError := handle(policies) | ||
|
||
if handlerError != nil { | ||
return handlerError // propagate error from the handler | ||
} | ||
|
||
offset = offset + len(policies) | ||
|
||
if offset >= totalCount { | ||
break | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
offset := 0 | |
for { | |
options := policiesshared.GetConfigurationPoliciesOptions{ | |
RepositoryID: p.RepositoryID, | |
ForSyntacticIndexing: forSyntacticIndexing, | |
ForPreciseIndexing: forPreciseIndexing, | |
Limit: p.BatchSize, | |
Offset: offset, | |
} | |
policies, totalCount, err := p.Service.GetConfigurationPolicies(ctx, options) | |
if err != nil { | |
return err | |
} | |
if len(policies) == 0 { | |
break | |
} | |
handlerError := handle(policies) | |
if handlerError != nil { | |
return handlerError // propagate error from the handler | |
} | |
offset = offset + len(policies) | |
if offset >= totalCount { | |
break | |
} | |
} | |
options := policiesshared.GetConfigurationPoliciesOptions{ | |
RepositoryID: p.RepositoryID, | |
ForSyntacticIndexing: forSyntacticIndexing, | |
ForPreciseIndexing: forPreciseIndexing, | |
Limit: p.BatchSize, | |
} | |
for offset := 0; ; { | |
options.Offset = offset | |
policies, totalCount, err := p.Service.GetConfigurationPolicies(ctx, options) | |
if err != nil { | |
return err | |
} | |
if len(policies) == 0 { | |
break | |
} | |
if handlerError := handle(policies); handlerError != nil { | |
return handlerError | |
} | |
if offset = offset + len(policies); offset >= totalCount { | |
break | |
} | |
} |
nit: Please don't add whitespace after every statement.
InsertIndexes(ctx context.Context, indexes []SyntacticIndexingJob) ([]SyntacticIndexingJob, error) | ||
IsQueued(ctx context.Context, repositoryID int, commit string) (bool, error) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thoughts on pushing this logic into a single method? Otherwise, this API creates the risk of TOCTOU because something might get queued in-between the IsQueued
check and the Insert
check. AFAICT, that's impossible given that these are currently invoked from the same goroutine, and there's only going to be one such goroutine, and the worst case scenario is that there will be some redundant work, but it seems unnecessarily error-prone.
// TODO: add operations and record this operation | ||
// s.operations.indexesInserted.Add(float64(len(ids))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need fixing?
ids, err := basestore.ScanInts(tx.Query(ctx, sqlf.Sprintf(insertIndexQuery, sqlf.Join(values, ",")))) | ||
if err != nil { | ||
return err | ||
} | ||
// TODO: add operations and record this operation | ||
// s.operations.indexesInserted.Add(float64(len(ids))) | ||
|
||
authzConds, err := database.AuthzQueryConds(ctx, database.NewDBWith(s.logger, s.db)) | ||
if err != nil { | ||
return err | ||
} | ||
|
||
queries := make([]*sqlf.Query, 0, len(ids)) | ||
for _, id := range ids { | ||
queries = append(queries, sqlf.Sprintf("%d", id)) | ||
} | ||
|
||
indexes, err = scanIndexes(tx.Query(ctx, sqlf.Sprintf(getIndexesByIDsQuery, sqlf.Join(queries, ", "), authzConds))) | ||
return err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this code. It looks like we're performing insertions into syntactic_scip_indexing_jobs
and returning the ids for the just-inserted rows. Then we're reading more data from those rows in a separate query (through a view). Can we expand the RETURNING
clause to return more columns during insertion? I don't understand what role the extra authzConds
is playing here.
This PR introduces three main components:
Refactoring:
TODO:
[ ] Tests for policy iterator- we're currently investigating whether policy iterator is needed at all, as pagination of policies is in question.Test plan