feat(taps): Support declaring arbitrary SQLAlchemy type instances in SQL tap schemas #2102

Open
edgarrmondragon opened this issue Dec 12, 2023 · 1 comment
Labels
kind/Feature New feature or request valuestream/SDK

Comments

@edgarrmondragon
Collaborator

edgarrmondragon commented Dec 12, 2023

Feature scope

Tap/target metadata.

Description

Related to the original Singer spec's sql-datatype (see #1323, #1903), this new metadata object would work analogously to the way Python's logging configuration (logging.config.dictConfig) imports and instantiates arbitrary callables and classes through the special "()" key.
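For readers unfamiliar with that logging feature, here is a minimal sketch of how `logging.config.dictConfig` treats the `"()"` key: the value names a factory to import, and the remaining keys are passed to it as keyword arguments. The logger name `sqlalchemy_type_demo` is made up for this illustration.

```python
import logging
import logging.config

LOGGING_CONFIG = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "plain": {
            # "()" names the factory; the other keys become keyword arguments,
            # i.e. logging.Formatter(fmt="%(levelname)s:%(message)s").
            "()": "logging.Formatter",
            "fmt": "%(levelname)s:%(message)s",
        },
    },
    "handlers": {
        "console": {"class": "logging.StreamHandler", "formatter": "plain"},
    },
    "loggers": {
        # Hypothetical logger name, used only for this demo.
        "sqlalchemy_type_demo": {"handlers": ["console"], "level": "INFO"},
    },
}

logging.config.dictConfig(LOGGING_CONFIG)

# Check that the formatter was built from the "()" entry.
formatter = logging.getLogger("sqlalchemy_type_demo").handlers[0].formatter
record = logging.LogRecord(
    "sqlalchemy_type_demo", logging.INFO, "demo.py", 0, "hello", None, None
)
formatted = formatter.format(record)
print(formatted)  # INFO:hello
```

The proposal applies the same convention to stream metadata, with a SQLAlchemy type taking the place of the formatter.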

The story, specifically for vector/embedding fields for LLMs, would go something like:

  1. As a user, I know that one or more fields in a stream represent a vector. For example, {"id": 1, "my_vector": [1, 2, 3]}.

  2. I would like to use MeltanoLabs/target-postgres alongside pgvector/pgvector-python to declare the SQLAlchemy type of my_vector: pgvector.sqlalchemy.Vector

  3. I would like to declare not only the type but also arbitrary parameters for it, such as length, dimension, etc.:

    # catalog metadata in Meltano syntax
    schema:
      my_stream:
        my_vector:
          sqlalchemy_type:
            (): "pgvector.sqlalchemy.Vector"
            dim: 3

    This would trigger the SDK to import pgvector.sqlalchemy.Vector and instantiate it as Vector(dim=3).

  4. The other requirement is that the user installs pgvector-python in the same virtual environment as target-postgres, which could be achieved with package extras, e.g. target-postgres[vector] or by documenting known postgres SQLAlchemy extensions in the target's readme.
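The import-and-instantiate step from item 3 could be sketched roughly as follows. This is not existing SDK code: the helper name `instantiate_from_config` is hypothetical, and a standard-library class stands in for `pgvector.sqlalchemy.Vector` so the snippet runs without extra packages.

```python
import importlib
from typing import Any, Mapping

FACTORY_KEY = "()"  # same convention as logging.config.dictConfig


def instantiate_from_config(config: Mapping[str, Any]) -> Any:
    """Import the callable named under "()" and call it with the remaining keys.

    Hypothetical helper, not part of the SDK.
    """
    params = dict(config)  # copy so the caller's metadata is not mutated
    dotted_path = params.pop(FACTORY_KEY)
    module_name, _, attr_name = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"Expected a dotted import path, got {dotted_path!r}")
    factory = getattr(importlib.import_module(module_name), attr_name)
    return factory(**params)


# With pgvector-python installed, the metadata from the example would be:
#   instantiate_from_config({"()": "pgvector.sqlalchemy.Vector", "dim": 3})
# Standard-library stand-in so the sketch runs anywhere:
delta = instantiate_from_config({"()": "datetime.timedelta", "days": 3})
print(delta)  # 3 days, 0:00:00
```

If the named package is missing from the virtual environment, `importlib.import_module` raises `ModuleNotFoundError`, which is where the requirement in item 4 would surface.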

Potential issues

The biggest problem I can think of is figuring out the priority with which this type override is considered. For example, in target-postgres:

  1. a few JSON schema types are mapped first to postgres-specific column types, e.g. `int -> BIGINT`
  2. then the SDK defaults are used

But by introducing this feature, we'd expect targets to resolve in the following order:

  1. sqlalchemy_type overrides
  2. custom target implementation overrides
  3. SDK defaults
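
The proposed resolution order can be sketched as a simple fallback chain. Everything here is an assumption made for illustration: the function names are hypothetical, plain strings stand in for SQLAlchemy type instances so the snippet has no dependencies, and the placeholder mappings do not reflect the real target-postgres or SDK tables.

```python
from typing import Any, Callable, Mapping, Optional

Schema = Mapping[str, Any]


def resolve_column_type(
    property_schema: Schema,
    property_metadata: Mapping[str, Any],
    target_override: Callable[[Schema], Optional[str]],
    sdk_default: Callable[[Schema], str],
) -> str:
    """Return a description of the column type, using the proposed priority."""
    # 1. sqlalchemy_type overrides declared in the catalog metadata
    declared = property_metadata.get("sqlalchemy_type")
    if declared is not None:
        args = ", ".join(f"{k}={v!r}" for k, v in declared.items() if k != "()")
        return f"{declared['()']}({args})"
    # 2. custom target implementation overrides
    overridden = target_override(property_schema)
    if overridden is not None:
        return overridden
    # 3. SDK defaults
    return sdk_default(property_schema)


def postgres_override(schema: Schema) -> Optional[str]:
    # Placeholder for a target-specific mapping such as integer -> BIGINT.
    return "BIGINT" if schema.get("type") == "integer" else None


def sdk_default(schema: Schema) -> str:
    # Placeholder defaults, not the SDK's actual mapping.
    return {"integer": "INTEGER", "number": "FLOAT"}.get(schema.get("type"), "VARCHAR")


vector_type = resolve_column_type(
    {"type": "array"},
    {"sqlalchemy_type": {"()": "pgvector.sqlalchemy.Vector", "dim": 3}},
    postgres_override,
    sdk_default,
)
id_type = resolve_column_type({"type": "integer"}, {}, postgres_override, sdk_default)
name_type = resolve_column_type({"type": "string"}, {}, postgres_override, sdk_default)
print(vector_type, id_type, name_type)
```

A real implementation would return instantiated SQLAlchemy types rather than strings; the strings only make the order of precedence easy to see.
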
@amotl

amotl commented Dec 12, 2023

Dear Edgar,

thank you for converging this from MeltanoLabs/target-pinecone#20 so quickly. Is GH-1872 actually already setting the stage for your proposal, or is it something different?

With kind regards,
Andreas.
