Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal to add file similarity retriever to haystack #5629

Open
wants to merge 14 commits into
base: main
Choose a base branch
from

Conversation

elundaeva
Copy link
Contributor

@elundaeva elundaeva commented Aug 25, 2023

Related Issues

Link to code PR - #5666

Proposed Changes:

Proposal to add file similarity retriever to haystack

@elundaeva elundaeva force-pushed the file-similarity-retriever-proposal branch from 2b3761c to 0acb0b6 Compare August 29, 2023 13:37
@elundaeva elundaeva force-pushed the file-similarity-retriever-proposal branch from 0acb0b6 to 3958905 Compare August 29, 2023 13:38
@elundaeva elundaeva marked this pull request as ready for review August 29, 2023 13:39
@elundaeva elundaeva requested review from a team as code owners August 29, 2023 13:39
@elundaeva elundaeva requested review from dfokina and MichelBartels and removed request for a team August 29, 2023 13:39
@elundaeva elundaeva changed the title [DRAFT] Proposal to add file similarity retriever to haystack Proposal to add file similarity retriever to haystack Aug 29, 2023
@elundaeva elundaeva added the proposal:active Proposal is in "Active" state label Aug 29, 2023
@MichelBartels
Copy link
Contributor

This node uses composition with another node which is generally not something we would like for v2. Instead, we would probably prefer if the node was split in two separate nodes (e.g. a DocumentFetcher and a DocumentAggregator) that could provide the input and accept the output of a retriever in a pipeline. However, I can also see that having one node is probably easier for the user so it might make sense for now in v1.
Still, I don't think there are any nodes currently that take another node as a parameter (document stores are not nodes so they don't count). @ZanSara do you perhaps have an opinion on this?

@mathislucka
Copy link
Member

This node uses composition with another node which is generally not something we would like for v2. Instead, we would probably prefer if the node was split in two separate nodes (e.g. a DocumentFetcher and a DocumentAggregator) that could provide the input and accept the output of a retriever in a pipeline. However, I can also see that having one node is probably easier for the user so it might make sense for now in v1.

@MichelBartels I've commented on this in many places and it was always said that v2 would also provide something like higher level components to make certain things easier. I think that in the case of this component it would be even more necessary to have that abstraction layer above. If you don't have it, then users need to figure to whole concept of file similarity retrieval out for themselves and piece everything together. Maybe a good time to start thinking about how that usable abstraction could look like in v2?

@MichelBartels
Copy link
Contributor

MichelBartels commented Sep 1, 2023

@mathislucka I agree with you and we have decided at our co.work last week that without "subpipelines" or a similar abstraction the v2 concept of pipelines is not really usable.
That's also why I was asking. It seems like it's a tradeoff between ease of porting it to v2 and usability while we're still using v1 pipelines.

@ZanSara
Copy link
Contributor

ZanSara commented Sep 4, 2023

I confirm what Michel said about v2 "subpipelines": we have discussed the topic during our co:work and agreed that this abstraction is going to be present in v2 in several situations. So when implementing components we shouldn't worry about splitting them into smaller units, because later they can be grouped back into larger units that contain "subpipelines" made of such smaller components.

@elundaeva
Copy link
Contributor Author

elundaeva commented Sep 4, 2023

Thank you for the input! 💯 In that case, @ZanSara @MichelBartels I can try to implement it as two separate sub-components:

  • DocumentFetcher that gets all docs corresponding to a file based on a provided aggregation meta key and returns a list of docs to be used by one or two retrievers as input for batch retrieval.
    &
  • DocumentAggregator placed after the retriever (or the JoinNode) that aggregates the document list(s) into a single ranked list of files (which are put together based on file_ids of retrieved docs) using RRF.

How does it sound?

@MichelBartels
Copy link
Contributor

@elundaeva That sounds good to me.
@ZanSara Just to confirm, would we then want some sort of abstraction for v1? Otherwise, it's not very usable until we release v2.

- file_aggregation_key: default would be "file_id", but can be changed to e.g. "name"
- output: default "top_document", but can also be "file_aggregation_key"
- keep_original_score
- top_k = 5
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sjrl @MichelBartels I've updated the Detailed design section & the Basic example sections above - please take a look and let me know if sth can be improved 👍

Also, for the implementation, I'm thinking what these sub-components should inherit from - one option could be DocumentFetcher based on BaseRetriever (bc in practice it will work similarly to FilterRetriever for example) and DocumentAggregator can be based on BaseRanker since it sort of ranks the retrieved documents by assessing full file similarity but not sure... WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one option could be DocumentFetcher based on BaseRetriever

Yeah, that makes sense to me. If it is going to basically be the FilterRetriever I do wonder if we should just use that node instead of creating a new one. What differences would there be between the FilterRetriever and the DocumentFetcher?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and DocumentAggregator can be based on BaseRanker since it sort of ranks the retrieved documents by assessing full file similarity but not sure

I think that makes rough sense since it does eventually rank the output by the score of each file. I would try and inherit from the BaseRanker class, but if it causes too much trouble I would also directly inherit from BaseComponent.

Copy link
Contributor Author

@elundaeva elundaeva Sep 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one option could be DocumentFetcher based on BaseRetriever

Yeah, that makes sense to me. If it is going to basically be the FilterRetriever I do wonder if we should just use that node instead of creating a new one. What differences would there be between the FilterRetriever and the DocumentFetcher?

One small difference would be in how the metadata for filtering is provided - in the "filters" of the query (in FilterRetriever) or the meta key for aggregation is given as a param to DocumentFetcher and the value is received from the query itself.
But this input part isn't as important as the output - DocumentFetcher needs to return a list of strings of fetched documents' contents (while FilterRetriever outputs haystack document objects), which can then be provided to the next retriever(s) as "queries" for batch retrieval at the next stage. Though this can be solved by adding an "output" param for FilterRetriever that enables an option to get a list of strings instead of whole documents returned, what do you think @sjrl ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@elundaeva Sorry, I could only take a look today because I was on PTO last week. I would agree with Sebastian that it would be nicer if there wasn't a need for a separate DocumentFetcher node. However, I can see that it would have to be ugly either way because of the output type.

Have you also thought about batch file retrieval? It was originally supported by your draft PR, but I can't see how it's possible the way it's described in the current proposal. I also didn't consider batching when I suggested to split the components. I would guess if we were to enable batching somehow, it would be even harder to use the FilterRetriever because we would need to flatten the result at some point.

output = "top_document", # This new param is explained in Detailed design section below
keep_original_score = True
)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little confused by the full pipeline flow here. Could you provide code examples for the Retriever(s) and how they interact with the DocumentAggregator? Will these components be passed to a wrapper node like FileSimilarityRetriever?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think how we package it is still an open question (related to this comment) - whether it makes sense to make a wrapper node already now or if things change drastically with Haystack 2.0 it could be done later to avoid double work 🤔

@github-actions
Copy link
Contributor

This proposal is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 10 days.

@github-actions github-actions bot added the stale label Oct 21, 2023
@github-actions github-actions bot closed this Oct 31, 2023
@bglearning
Copy link
Contributor

Hi,
Looking to reopen this proposal (or create a new one if necessary) of adding the FileSimilarityRetrieval now that v2 is in beta.

Summarizing the discussion till now

On Design Principle

  • We'd like to avoid composition of the components and ideally have many smaller components than one or few big ones. As such, the suggestion to create two components (DocumentFetcher, DocumentAggregator) instead of one (FileSimilarityRetriever).
  • The joining of such smaller components may be handled with subpipelines (details to be discussed)

On behavior

Compared to the custom FileSimilarityRetriever already available internally.

  • Return value: could be list of agg_key values instead of one doc per agg group
  • Would be good to pass list of retrievers instead of explicitly passing primary, secondary retrievers.

On links with other components

  • The FilterRetriever in v1 is very close to the DocumentFetcher being proposed. Differences:

    • The way the input is specified (direct meta value as query vs ignored query + full filter syntax)
    • The output format (list of docs vs possibly list of strings i.e. doc contents)
  • The DocumentJoiner could combine the docs from the different retrievers. It doesn't provide aggregation which would need to be another component. Also two undesired behavior:

    • It would overwrite rrf score for a document with its last appearance instead of adding them up.
    • It would overwrite the original score

Questions and way forward

I tried creating the pipeline with v2 in this Colab Notebook.
There seems to be two main options.

For both we'd need a DocumentFetcher component. We could also name it FilterRetriever. But would need to settle on exact behavior (perhaps converging on/collapsing the differences above). Maybe could be a separate proposal?

Assuming we have this fetcher:

Route 1

A bigger scoped MetaDocumentAggregator that wraps the retrievers (so has component composition) and also performs the aggregation.

file_retriever_01

The notebook has a basic version of this.

Route 2

A smaller scoped MetaDocumentAggregator that only does aggregation and receives documents from the retrievers.

file_retriever_02

I couldn't get this to work. The retrievers expect str whilst we produce List[Document] or List[str] (with a shaper-like node). Also depending on the retriever we might want to pass content or embedding. As such this seems harder to implement?

Asides

  • Seems like this could be a more general (something like MetaDocSimilarityRetriever) rather than FileSimilarityRetriever as the grouping can be things apart from file too.
  • List of retrievers seem fine for Route 1 (at least in code) (?)

Next Step Questions

  • First propose and implement a "FilterRetriever"?
  • Go with Route 1 or figure out a way to make Route 2 work?
  • Are we okay to have variable output type that depends on init arguments? (List[Document] i.e. one from each meta-key grouping vs List[str] i.e. the meta-key values themselves)

@sjrl sjrl reopened this Jan 17, 2024
@github-actions github-actions bot added the 2.x Related to Haystack v2.0 label Jan 17, 2024
@sjrl
Copy link
Contributor

sjrl commented Jan 17, 2024

Hey @bglearning thanks for your work on this and for exploring the two different options you proposed.

Let's start with your step 1 so

First propose and implement a "FilterRetriever"?

but I think you can go ahead and implement it based off of the v1 version. No need for a full proposal since it's a known and understood feature of Haystack v1. In the PR we can revisit something like the name, but my feeling is that we should leave it as FilterRetriever and we may need to implement different versions based on the doc store. For example, we have now have an InMemoryEmbeddingRetriever and then a separate OpenSearchEmbeddingRetriever implemented in our haystack-core-integrations repo.

So we would want something like a InMemoryFilterRetriever and a OpenSearchFilterRetriever in the respective packages.

@silvanocerza
Copy link
Contributor

I do agree with @sjrl here, no need for a new proposal to add a FilterRetriever.

Also I think it can be generic enough to work with every Document Store. The filter_documents() method is part of the common Document Store interface. So there's no need to have a FilterRetriever for each Document Store.

I couldn't get this to work. The retrievers expect str whilst we produce List[Document] or List[str] (with a shaper-like node). Also depending on the retriever we might want to pass content or embedding. As such this seems harder to implement?

@bglearning we're aware of this connection limitation and we're investigating possible solutions. The fastest solution would be changing the input types to accept both but that complicates a bit the logic. We're trying to come up with something better. 🤔

@bglearning
Copy link
Contributor

Hi all, getting back to this with the FilterRetriever (#6836) now available, plus with latest developments around v2.

I updated the Colab Notebook above based on latest developments.

Now, for Route 2, with OutputAdapter (#6936) available there is a way to connect the results from the first (filter) retrieval to the subsequent retrievers. However, given the retrievers take one query, I think there isn't a way to run them on all resulting docs from the first retrieval(?). I only took the first of the documents to make it work here. Aside: it's cool that we can simply connect multiple outputs to a component with Variadic input.

As such, leaning towards the first approach (Route1) mentioned above i.e. with a MetaDocumentAggregator that has a list of retrievers. Please flag any issue that might cause (i.e. having list[<another_component>] inside a component. I think there won't be an issue with (de)-serialization. But can't think of other problems except of course that it goes somewhat against the multiple small components approach.

Next: I'll update the proposal to reflect this.

Route 1

Route 2

@sjrl
Copy link
Contributor

sjrl commented Feb 19, 2024

As such, leaning towards the first approach (Route1) mentioned above i.e. with a MetaDocumentAggregator that has a list of retrievers.

Yeah I think I am also leaning towards this approach. I think using this it would be easier to perform checks on the incoming documents as well from the FilterRetriever in this case. For example, we will want to enforce that all incoming documents to the MetaDocumentAggregator all share the same value in the provided meta key (agg_key in your colab).

@bglearning
Copy link
Contributor

bglearning commented May 10, 2024

Revisiting this as part of v2 migrations. Considering a third route (also again added to the Colab Notebook):

route3

To summarize:

We want to do two things based on resultant documents from the filter retriever:

  1. Run multiple retrievers for each document to get a set of similar documents.
  2. Group/aggregate the documents from 1 based on an agg_key

Route 1 above does both things inside a single component.
Route 2 creates one component for 2 and tries to run 1 without a wrapper component for the retrievers but this doesn't seem to be possible.
Route 3 creates one component for 1 (SimilarDocumentsRetriever) and one for 2 (DocumentMetaAggregator).

Leaning towards Route3 now as it results in smaller and possibly independently reusable components. E.g. "SimilarDocuments to a select group of documents from previous result" or "Aggregate/group the result from a query based on metadata using DocumentMetaAggregator"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2.x Related to Haystack v2.0 proposal:active Proposal is in "Active" state proposal
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants