
Drop firebase in favor of always alive postgres instance + some sort of middleware for querying #20

Open
evamaxfield opened this issue Oct 15, 2021 · 6 comments
Labels
infrastructure (Build, upgrade, or remove a piece of infrastructure or core utility), proposal (A detailed proposal / spec for a CDP feature)

Comments

@evamaxfield (Member)

Idea / Feature

Replace Firebase Cloud Firestore with a Postgres instance (and Firebase Storage with a plain Google Cloud Storage bucket).

Use Case / User Story

  1. Firebase has given us a ton of problems in automated deployment / cookiecutter deploy; replacing it with something we can entirely control and manage would make things easier.
  2. Document storage generally has worse query performance than relational storage, and there is no join functionality. This affects our query performance drastically.
  3. Directly from a data user: document storage is just not what I think about when I think of a dataset.

Solution

In general, switch to a relational database (Postgres preferred simply because of how tested and trusted it is). File storage can be done using a basic Google Cloud Storage bucket.

I have seen promise with Supabase: https://github.com/supabase/supabase
It is entirely open source, it has a beautiful API for accessing data, and it bills itself as the relational database equivalent of Firebase, but, ya know, everything in its tech stack is open.

It is even open enough that we can just use their Docker images and self-host: https://supabase.io/docs/guides/hosting/overview

Alternatives

Much like the problems with Firebase, I am not sure how customizable Supabase is, or if we should even use it.

An alternative is to just run everything ourselves, but it seems like all of the Docker images and tools that Supabase has decided to use make a lot of sense.

Stakeholders

@isaacna @tohuynh

Major Components

These are rough steps:

  • experiment with different frameworks, libraries, etc.
  • decide on frameworks, libraries, etc. as a team
  • show an example working API in Python + JS and the storage components (with setup in infra-as-code)
    • make build should set up a container somewhere on GCP with the database, ports exposed, and public read enabled, for example
    • make add-test-data should simply add some fake values to the newly created infra
    • show in Python + JS that we can easily fetch data
  • come up with a rollout and migration strategy
  • migrate
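As a rough illustration of the middle experiment steps (nothing here reflects a decided stack; sqlite3 stands in for the GCP-hosted Postgres instance, and the table and column names are hypothetical, not the real CDP schema), "add test data, then show we can easily fetch it" might look like:

```python
import sqlite3

# In-memory SQLite stands in for the Postgres instance on GCP;
# the "event" table is a hypothetical placeholder schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE event (id INTEGER PRIMARY KEY, body TEXT, event_datetime TEXT)"
)

# "make add-test-data": insert a few fake values into the fresh infra
conn.executemany(
    "INSERT INTO event (body, event_datetime) VALUES (?, ?)",
    [("Full Council", "2021-10-15"), ("Transportation Committee", "2021-10-18")],
)
conn.commit()

# "show that we can easily fetch data"
rows = conn.execute(
    "SELECT body FROM event ORDER BY event_datetime"
).fetchall()
print(rows)  # [('Full Council',), ('Transportation Committee',)]
```

The JS half of the step would hit the same database through whatever client is chosen; the point is only that both sides read the one shared relational store.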

Dependencies

Other Notes

@evamaxfield evamaxfield added the proposal A detailed proposal / spec for a CDP feature label Oct 15, 2021
@tohuynh commented Oct 15, 2021

Document storage generally has worse query performance than relational storage, and there is no join functionality. This affects our query performance drastically.

True, but only if our database models are relational (which CDP database models are) and we don't denormalize them. Our error was trying to use a non-relational database to store relational data.
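For concreteness, this is the kind of server-side join Firestore cannot do, which otherwise forces N client-side lookups (hypothetical body/event tables; sqlite3 stands in for Postgres):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical relational models: an event references the body that held it
conn.executescript("""
CREATE TABLE body (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE event (
    id INTEGER PRIMARY KEY,
    body_id INTEGER REFERENCES body(id),
    event_datetime TEXT
);
INSERT INTO body VALUES (1, 'Full Council');
INSERT INTO event VALUES (1, 1, '2021-10-15');
""")

# One round trip resolves the reference server-side
rows = conn.execute("""
    SELECT event.event_datetime, body.name
    FROM event
    JOIN body ON event.body_id = body.id
""").fetchall()
print(rows)  # [('2021-10-15', 'Full Council')]
```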

Overall, I support the switch since it could improve the speed of fetching data by a lot.

@evamaxfield (Member Author)

This is also related to CouncilDataProject/cdp-backend#167. Most relational database systems have model handling built in, and we may not need these models. Further, most of our issues are with the event gather pipeline and uploading data in chunks. I think with relational we can actually use a proper transaction? Maybe?

Finally, if we use a relational database, we can use Alembic for schema versioning (thank god).

@isaacna commented Oct 16, 2021

I agree with To in that it doesn't really make much sense to use a non-relational database when our schema is relational.

I think with relational we can actually use a proper transaction? Maybe?

Postgres supports transactions so I think this would be a huge plus. The way we handle bulk db uploads with store_event_processing_results in the pipeline currently is pretty cool, but I don't think we should have to do that in the first place.
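To sketch what a proper transaction buys us over hand-rolled chunked uploads (sqlite3 standing in for Postgres; the event/session tables are hypothetical stand-ins for what store_event_processing_results writes), the whole multi-table upload either lands or rolls back as one unit:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical slice of the pipeline's output: a session belongs to an event
conn.executescript("""
CREATE TABLE event (id INTEGER PRIMARY KEY, body TEXT);
CREATE TABLE session (
    id INTEGER PRIMARY KEY,
    event_id INTEGER NOT NULL REFERENCES event(id)
);
""")

try:
    with conn:  # commits on success, rolls back on any exception
        cur = conn.execute("INSERT INTO event (body) VALUES ('Full Council')")
        conn.execute(
            "INSERT INTO session (event_id) VALUES (?)", (cur.lastrowid,)
        )
        raise RuntimeError("simulated failure mid-upload")
except RuntimeError:
    pass

# Nothing was persisted: no half-uploaded event to clean up by hand
print(conn.execute("SELECT COUNT(*) FROM event").fetchone()[0])  # 0
```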

Also after just glancing briefly at the docs, Supabase seems to be really nice and also have much more active dev support and general usage than FireO.

I will say tho that the Python client they have seems to be community-made, and not actually from the Supabase devs themselves. Additionally, they don't have a native ORM generator for Python (at least not yet): https://supabase.io/docs/reference/javascript/generating-types

If an ORM is really desired, then we could use something like SQLAlchemy along with Supabase.

What customization use cases did you have in mind that CDP has that Supabase might not support?

@evamaxfield (Member Author)

I will say tho that the Python client they have seems to be community-made, and not actually from the Supabase devs themselves. Additionally, they don't have a native ORM generator for Python (at least not yet): https://supabase.io/docs/reference/javascript/generating-types

Yea, I think this is something we will simply have to discuss as a team. I personally preferred FireO and the ORM style because we weren't using a relational system with SQL. Like, FireO's querying is easier than Firebase's Python querying, imo. That was a large reason why I personally supported it, and if supabase-js / any sort of Python API client can handle it, I think we should just experiment a bit.

What customization use cases did you have in mind that CDP have that Supabase might not support?

It's basically always public read and related concerns: "How easy can we set this up with infrastructure-as-code?"

@evamaxfield evamaxfield added the infrastructure Build, upgrade, or remove, a piece of infrastructure or core utility label Oct 21, 2021
@tohuynh commented Nov 3, 2021

If we want to change the database, the frontend and backend code that interacts with the database would need to change as well.

We should add an intermediary data layer for both the frontend and backend. To interact with the database, the frontend and backend will go through the data layer. The advantage is if we decide to change the database again, we would only change the data layer. So long as the contract between the data layer and the frontend and backend doesn't change, we won't need to modify the frontend or backend at all.
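The data-layer idea above can be sketched with a small interface (everything here is hypothetical naming, not CDP code; in Python this could be a typing.Protocol, with a mirror interface in TypeScript for the frontend):

```python
from typing import Protocol


class EventStore(Protocol):
    """The contract: frontend/backend depend only on this interface,
    never on a Firestore or Postgres client directly."""

    def get_event(self, event_id: str) -> dict: ...


class InMemoryEventStore:
    """Stand-in backend; a PostgresEventStore or SupabaseEventStore
    would satisfy the same contract."""

    def __init__(self, events: dict):
        self._events = events

    def get_event(self, event_id: str) -> dict:
        return self._events[event_id]


def render_event_title(store: EventStore, event_id: str) -> str:
    # Caller code is unchanged if the storage backend is swapped out
    return store.get_event(event_id)["body"]


store = InMemoryEventStore({"abc": {"body": "Full Council"}})
print(render_event_title(store, "abc"))  # Full Council
```

Swapping databases again would then only mean writing one new class that satisfies EventStore; render_event_title and everything above it never changes.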

@evamaxfield (Member Author)

Ahhhhh I see. Good point @tohuynh. So this would be the "middleware" component.
