Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPIKE] Decouple DSA from Solr #4578

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from
Draft

Conversation

mjgiarlo
Copy link
Member

@mjgiarlo mjgiarlo commented Aug 30, 2023

Why was this change made?

Fixes #4459

This is a spike commit towards an SDR Evolution the team has been batting around for a while now, namely severing DSA's dependency on Solr. The spike largely replaces Solr queries with direct DB queries, and for most use cases this works just fine. The key word here is "most..."

  • The Solr queries have been replaced with DB queries that reach into JSONB columns which results in table scans. I tested all of these queries in stage with large-ish, but not prod-huge, data sets (~25K records) and most of them perform fine, and I used SQL select statements to only retrieve the fields we need which made a big difference. That said, we might want to test this with prod-like data and do some benchmarking to determine if we want to index more of the JSONB data.
  • A notable performance outlier is MemberService.for which needs to make a single Workflow API call for each member of a virtual object. These are impressively slow for a virtual object with a few thousand members, taking over a minute to complete.

Another question we'd need to answer to take this work forward is what to do about bin/generate-druid-list, which allows a user to issue Solr queries directly, and lib/tasks/missing_druids.rake, which compares what's in the DSA DB and what's in Solr to determine if any objects need (re-)indexing. Are these still useful? If so, could they live elsewhere or could we solve these problems in a different way? If the answer is no, we may not want to proceed with this decoupling.

NOTE: Since this is a spike meant to generate discussion, I have not yet bothered with deal with changing the tests (or caring about linting). That will naturally come later if we decide the idea and implementation has merit.

How was this change tested?

N/A, see above.

Fixes #4459

This is a spike commit towards an SDR Evolution the team has been batting around for a while now, namely severing DSA's dependency on Solr. The spike largely replaces Solr queries with direct DB queries, and for most use cases this works just fine. The key word here is "most..."

* The Solr queries have been replaced with DB queries that reach into JSONB columns which results in table scans. I tested all of these queries in stage with large-ish, but not prod-huge, data sets (~25K records) and most of them perform fine. That said, we might want to test this with prod-like data and do some benchmarking to determine if we want to index more of the JSONB data.
* A notable performance outlier is `MemberService.for` which needs to make a single Workflow API call for *each* member of a virtual object. These are impressively slow for a virtual object with a few thousand members, taking over a minute to complete.

Another question we'd need to answer to take this work forward is what to do about `bin/generate-druid-list`, which allows a user to issue Solr queries directly, and `lib/tasks/missing_druids.rake`, which compares what's in the DSA DB and what's in Solr to determine if any objects need (re-)indexing. Are these still useful? If so, could they live elsewhere or could we solve these problems in a different way? If the answer is no, we may not want to proceed with this decoupling.

**NOTE:** Since this is a spike meant to generate discussion, I have not yet bothered with deal with changing the tests (or caring about linting). That will naturally come later if we decide the idea and implementation has merit.
@mjgiarlo mjgiarlo marked this pull request as draft August 30, 2023 22:58
@@ -2,7 +2,7 @@

require 'csv'

# Find items that are goverened by the provided APO and then return all catkeys and refresh status.
# Find items that are governed by the provided APO and then return all catkeys and refresh status.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo.

Comment on lines +8 to +11
scope :has_admin_policy, ->(admin_policy_druid) { where("administrative ->> 'hasAdminPolicy' = '#{admin_policy_druid}'").select(:external_identifier).order(:external_identifier) }
scope :in_virtual_objects, ->(member_druid) { where("structural #> '{hasMemberOrders,0}' -> 'members' ? :druid", druid: member_druid) }
scope :members_of_collection, ->(collection_druid) { where("structural -> 'isMemberOf' ? :druid", druid: collection_druid).select(:external_identifier, :version, :content_type) }
scope :embargoed_and_releaseable, -> { where("(access -> 'embargo' ->> 'releaseDate')::timestamp <= ?", Time.zone.now).select(:external_identifier) }
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If nothing else, I learned a bit more about how to do fancy JSONB queries with our crazy nested JSON.

@@ -4,7 +4,6 @@
class DeleteService
# Tries to remove any existence of the object in our systems
# Does the following:
# - Removes item from Fedora/Solr
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cruft.

Comment on lines +12 to +35
.then { |members| reject_opened_members(members, exclude_opened) }
.then { |members| select_published_members(members, only_published) }
.map do |member|
{
'id' => member.external_identifier,
'objectType' => member.content_type == Cocina::Models::ObjectType.agreement ? 'agreement' : 'item'
}
end
end

def self.reject_opened_members(members, exclude_opened)
return members unless exclude_opened

members.reject do |member|
WorkflowClientFactory.build.status(druid: member.external_identifier, version: member.version).display_simplified == 'Opened'
end
end

def self.select_published_members(members, only_published)
return members unless only_published

members.select do |member|
WorkflowClientFactory.build.lifecycle(druid: member.external_identifier, milestone_name: 'published', version: member.version).present?
end
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the sluggish part.

@@ -4,6 +4,8 @@
require_relative '../config/environment'
require 'optparse'

# TODO: Figure out if we still want this or not, given how tightly coupled this functionality is to Solr
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment.

@@ -1,5 +1,7 @@
# frozen_string_literal: true

# TODO: Figure out if we still want this or not, given how tightly coupled this functionality is to Solr
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Decouple from Solr
1 participant