Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DSEGOG-305 Config & Documentation for Data Re-ingestion #119

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

MRichards99
Copy link
Collaborator

This is a small PR that makes a couple of minor changes for the re-ingestion of data on the new database that Database Services host. I've made a couple of updates to the documentation (mainly an update to the dev server docs to explain the current situation with the data on the machine). For the ingestion script, I've moved from the configuration using a hostname and port for the database to a connection string. I've made this change because the new connection string from Database Services is quite complex and therefore the string is a better fit than the hostname and port.

…script

- This has been done because the move to MongoDB v7 with Database Services has resulted in a complex connection string so using that becomes easier than conerting the string to indivdual arugments, some of which don't seem to easily convert into CLI options
@codecov-commenter
Copy link

codecov-commenter commented Apr 29, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 94.99%. Comparing base (0063ced) to head (bc138ea).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #119   +/-   ##
=======================================
  Coverage   94.99%   94.99%           
=======================================
  Files          50       50           
  Lines        2837     2837           
  Branches      297      297           
=======================================
  Hits         2695     2695           
  Misses        105      105           
  Partials       37       37           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@@ -7,8 +7,7 @@ ssh:
enabled: false
ssh_connection_url: 127.0.0.1
database:
hostname: localhost
port: 27017
connection_uri: mongodb://localhost:27017/opsgateway
name: opsgateway
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I can see, we're only still using this in one place:

db = client[Config.config.database.name]

(and two places for print statements, which I'm less concerned about)

f" {Config.config.database.name}",

In most places since we're using mongoimport as a subprocess I gather the fact that connection_uri ends /opsgateway means we don't need to pass it (the DB name) directly. So I'm sort of wondering if we need to still have Config.config.database.name at all or if for that one case we still need it we could split it from the last / in the uri (as apparently pymongo.errors.InvalidName: database names cannot contain the character '/')?

Alternatively, given that in the main config we have:

mongodb:
mongodb_url: mongodb://localhost:27017
database_name: opsgateway

We could make this more consistent and keep the name and don't have it trailing the port.

The motivation of this change was around the connection string becoming to complex, so maybe the current approach is the most appropriate (without knowing the real string it's hard to judge). In reality this is all minor, but I suppose my concern would be that you change the name but not the connection_uri and now have one command that tries to use a different DB to the other (or vice versa). Ideally I think we'd have a single source of truth for opsgateway, regardless of whether it's via connection_uri or name - but ideally not both?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, one source of truth would be ideal. I think I kept Config.config.database.name around because I couldn't (at the time) get mongoimport to work with the connection URI - I'm not sure if that's the reality, I just couldn't get it working while working on this issue. Regardless, everywhere that Config.config.database.name is used, I'm pretty sure I can refactor to use pymongo instead, thus negating the need for that config option and fully moving to the URI.

Since developing this script (and classes in ingest/ to work alongside it), I've had the chance to use the script in anger. I haven't used the SSH functionality at all and to be honest, it's a bit of a pain to maintain - the code is similar across LocalCommandRunner and SSHHandler too. The original intention was to provide the ability to connect to the API of the dev server and be able to import the experiments etc., all in a single script. In reality, I have a separate cloud VM where I configure the script to point at the database and S3 bucket used by the dev server; I've taken this approach to benefit from the speed of a beefy cloud VM - the ingestion would take

I think I'm going to branch off this branch and fully switch to the connection URI and refactor/remove the separate LocalCommandRunner and SSHHandler classes. Not sure when I'll do this, hopefully in the next week or so. I'll fully digest your comments on #120 first (I've quickly scanned through them) to ensure the changes are done in a sensible way.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants