[SPIKE] Decouple DSA from Solr #4578

mjgiarlo · 2023-08-30T22:58:10Z

Why was this change made?

This is a spike commit towards an SDR Evolution the team has been batting around for a while now, namely severing DSA's dependency on Solr. The spike largely replaces Solr queries with direct DB queries, and for most use cases this works just fine. The key word here is "most..."

The Solr queries have been replaced with DB queries that reach into JSONB columns which results in table scans. I tested all of these queries in stage with large-ish, but not prod-huge, data sets (~25K records) and most of them perform fine, and I used SQL select statements to only retrieve the fields we need which made a big difference. That said, we might want to test this with prod-like data and do some benchmarking to determine if we want to index more of the JSONB data.
A notable performance outlier is MemberService.for which needs to make a single Workflow API call for each member of a virtual object. These are impressively slow for a virtual object with a few thousand members, taking over a minute to complete.

Another question we'd need to answer to take this work forward is what to do about bin/generate-druid-list, which allows a user to issue Solr queries directly, and lib/tasks/missing_druids.rake, which compares what's in the DSA DB and what's in Solr to determine if any objects need (re-)indexing. Are these still useful? If so, could they live elsewhere or could we solve these problems in a different way? If the answer is no, we may not want to proceed with this decoupling.

NOTE: Since this is a spike meant to generate discussion, I have not yet bothered with deal with changing the tests (or caring about linting). That will naturally come later if we decide the idea and implementation has merit.

How was this change tested?

N/A, see above.

Fixes #4459 This is a spike commit towards an SDR Evolution the team has been batting around for a while now, namely severing DSA's dependency on Solr. The spike largely replaces Solr queries with direct DB queries, and for most use cases this works just fine. The key word here is "most..." * The Solr queries have been replaced with DB queries that reach into JSONB columns which results in table scans. I tested all of these queries in stage with large-ish, but not prod-huge, data sets (~25K records) and most of them perform fine. That said, we might want to test this with prod-like data and do some benchmarking to determine if we want to index more of the JSONB data. * A notable performance outlier is `MemberService.for` which needs to make a single Workflow API call for *each* member of a virtual object. These are impressively slow for a virtual object with a few thousand members, taking over a minute to complete. Another question we'd need to answer to take this work forward is what to do about `bin/generate-druid-list`, which allows a user to issue Solr queries directly, and `lib/tasks/missing_druids.rake`, which compares what's in the DSA DB and what's in Solr to determine if any objects need (re-)indexing. Are these still useful? If so, could they live elsewhere or could we solve these problems in a different way? If the answer is no, we may not want to proceed with this decoupling. **NOTE:** Since this is a spike meant to generate discussion, I have not yet bothered with deal with changing the tests (or caring about linting). That will naturally come later if we decide the idea and implementation has merit.

mjgiarlo · 2023-08-30T23:02:54Z

app/reports/apo_catkey.rb

@@ -2,7 +2,7 @@

 require 'csv'

-# Find items that are goverened by the provided APO and then return all catkeys and refresh status.
+# Find items that are governed by the provided APO and then return all catkeys and refresh status.


mjgiarlo · 2023-08-30T23:03:22Z

app/models/dro.rb

+  scope :has_admin_policy, ->(admin_policy_druid) { where("administrative ->> 'hasAdminPolicy' = '#{admin_policy_druid}'").select(:external_identifier).order(:external_identifier) }
+  scope :in_virtual_objects, ->(member_druid) { where("structural #> '{hasMemberOrders,0}' -> 'members' ? :druid", druid: member_druid) }
+  scope :members_of_collection, ->(collection_druid) { where("structural -> 'isMemberOf' ? :druid", druid: collection_druid).select(:external_identifier, :version, :content_type) }
+  scope :embargoed_and_releaseable, -> { where("(access -> 'embargo' ->> 'releaseDate')::timestamp <= ?", Time.zone.now).select(:external_identifier) }


If nothing else, I learned a bit more about how to do fancy JSONB queries with our crazy nested JSON.

mjgiarlo · 2023-08-30T23:05:11Z

app/services/delete_service.rb

@@ -4,7 +4,6 @@
 class DeleteService
  # Tries to remove any existence of the object in our systems
  #   Does the following:
-  #   - Removes item from Fedora/Solr


mjgiarlo · 2023-08-30T23:05:33Z

app/services/member_service.rb

+      .then { |members| reject_opened_members(members, exclude_opened) }
+      .then { |members| select_published_members(members, only_published) }
+      .map do |member|
+      {
+        'id' => member.external_identifier,
+        'objectType' => member.content_type == Cocina::Models::ObjectType.agreement ? 'agreement' : 'item'
+      }
+    end
+  end
+
+  def self.reject_opened_members(members, exclude_opened)
+    return members unless exclude_opened
+
+    members.reject do |member|
+      WorkflowClientFactory.build.status(druid: member.external_identifier, version: member.version).display_simplified == 'Opened'
+    end
+  end
+
+  def self.select_published_members(members, only_published)
+    return members unless only_published
+
+    members.select do |member|
+      WorkflowClientFactory.build.lifecycle(druid: member.external_identifier, milestone_name: 'published', version: member.version).present?
+    end


This is the sluggish part.

mjgiarlo · 2023-08-30T23:05:52Z

bin/generate-druid-list

@@ -4,6 +4,8 @@
 require_relative '../config/environment'
 require 'optparse'

+# TODO: Figure out if we still want this or not, given how tightly coupled this functionality is to Solr


See comment.

mjgiarlo · 2023-08-30T23:05:58Z

lib/tasks/missing_druids.rake

@@ -1,5 +1,7 @@
 # frozen_string_literal: true

+# TODO: Figure out if we still want this or not, given how tightly coupled this functionality is to Solr


See comment.

mjgiarlo marked this pull request as draft August 30, 2023 22:58

mjgiarlo commented Aug 30, 2023

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPIKE] Decouple DSA from Solr #4578

[SPIKE] Decouple DSA from Solr #4578

mjgiarlo commented Aug 30, 2023 •

edited

mjgiarlo Aug 30, 2023

mjgiarlo Aug 30, 2023

mjgiarlo Aug 30, 2023

mjgiarlo Aug 30, 2023

mjgiarlo Aug 30, 2023

mjgiarlo Aug 30, 2023

		@@ -1,5 +1,7 @@
		# frozen_string_literal: true

		# TODO: Figure out if we still want this or not, given how tightly coupled this functionality is to Solr

[SPIKE] Decouple DSA from Solr #4578

Are you sure you want to change the base?

[SPIKE] Decouple DSA from Solr #4578

Conversation

mjgiarlo commented Aug 30, 2023 • edited

Why was this change made?

How was this change tested?

mjgiarlo Aug 30, 2023

Choose a reason for hiding this comment

mjgiarlo Aug 30, 2023

Choose a reason for hiding this comment

mjgiarlo Aug 30, 2023

Choose a reason for hiding this comment

mjgiarlo Aug 30, 2023

Choose a reason for hiding this comment

mjgiarlo Aug 30, 2023

Choose a reason for hiding this comment

mjgiarlo Aug 30, 2023

Choose a reason for hiding this comment

mjgiarlo commented Aug 30, 2023 •

edited