Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DB integration #781

Merged
merged 24 commits into from
Apr 2, 2024
Merged

DB integration #781

merged 24 commits into from
Apr 2, 2024

Conversation

aurangzaib048
Copy link
Member

@aurangzaib048 aurangzaib048 commented Mar 5, 2024

Resolves: #762

@aurangzaib048 aurangzaib048 changed the title DB initial migration DB integration Mar 5, 2024
porteron
porteron previously approved these changes Mar 6, 2024
Copy link
Member

@porteron porteron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall it looks clean. I've never used alembic before but it looks pretty straightforward. The migrations setup looks like it will develop nicely over time. I don't see any queries or migrations to create the database user(s). Maybe we should include that.

porteron
porteron previously approved these changes Mar 11, 2024
Copy link
Member

@porteron porteron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good

end;
$$ language plpgsql;
""" # nosec
)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reported by reviewdog 🐶
[semgrep] sqlalchemy.text passes the constructed SQL statement to the database mostly unchanged. This means that the usual SQL injection protections are not applied and this function is vulnerable to SQL injection if user input can reach here. Use normal SQLAlchemy operators (such as or_, and_, etc.) to construct SQL.

Source: https://semgrep.dev/r/python.sqlalchemy.security.audit.avoid-sqlalchemy-text.avoid-sqlalchemy-text


Cc @thypon @bcaller

porteron
porteron previously approved these changes Mar 28, 2024
Copy link
Member

@porteron porteron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Not sure if it exists somewhere already but it would be great to add to the README some generic information about the schema design, how and why we are using alembic, how migrations are managed.

db/README.md Outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you 🙏

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super helpful

porteron
porteron previously approved these changes Mar 30, 2024
Copy link

github-actions bot commented Apr 1, 2024

[puLL-Merge] - brave/news-aggregator@781

Here is my review of the PR:

Description

This PR makes several major changes to integrate a PostgreSQL database using SQLAlchemy ORM to store publishers, feeds, articles, and related data. It adds database migrations using Alembic. The aggregator logic is updated to check the database for existing processed articles before re-processing them. Articles are inserted into the database after aggregation.

Changes

Changes

  • README.md - Updated documentation on project structure and added a link to detailed DB documentation
  • alembic.ini - Added configuration for Alembic database migration tool
  • bin/start.sh - Shell script updated to run DB migrations and insert publishers on startup
  • config.py - Added database connection config and helper to create a DB session
  • db/ - New directory containing:
    • README.md - Detailed documentation on DB schema design, Alembic usage and migration management
    • alembic/ - Contains Alembic migration scripts
    • setup/init.sql - SQL script to initialize news schema before running migrations
    • tables/ - Contains SQLAlchemy ORM definitions for tables
  • requirements.txt - Added new dependencies: alembic, psycopg2, sqlalchemy
  • src/aggregator/aggregate.py - Updated to return processed article data from DB
  • src/aggregator/processor.py - Checks database for existing processed articles
  • src/csv_to_global_json.py - Minor logging statement added
  • src/db_crud.py - New file containing database CRUD operations
  • src/main.py - Inserts aggregated articles into the database
  • tests/unit/aggregator/test_processor.py - Updated unit test

Security Hotspots

  1. High - db/alembic/env.py contains raw SQL execution to initialize schema. This should be reviewed to prevent SQL injection if any dynamic values are used. Consider using parameterization.

  2. Medium - src/db_crud.py has several database queries constructed using string formatting. While the current inputs look safe, using SQL parameters would be more secure to prevent injection if any of these become dynamic in future.

  3. Low - The database connection URL is read from config. Ensure this config value is not logged or exposed in error messages, as it may contain sensitive credentials.

  4. Low - Adding a database increases the attack surface. Follow security best practices like using a least-privileged DB user, encrypting data at rest if needed, and keeping dependencies up-to-date.

Let me know if you would like me to elaborate on any part of the review!

@aurangzaib048
Copy link
Member Author

aurangzaib048 commented Apr 1, 2024

Infra PR: https://github.com/brave-intl/today-ops/pull/154, only enabled for US for now.

@aurangzaib048 aurangzaib048 merged commit b4ca7f0 into master Apr 2, 2024
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Database integration in News infra
4 participants