
Can longer sequences be encoded? Are the encodings good? #2596

Open
DavidNemeskey opened this issue Apr 16, 2024 · 5 comments

Comments

@DavidNemeskey

I was wondering how good the published models are at encoding longer texts. First of all, the name "sentence-transformers" suggests sentence-level training, and the papers seem to support this. So my question is: if I want to encode a longer text, can I do it? Will the encodings be meaningful / good? Or am I better off encoding the sentences separately and aggregating the embeddings somehow (e.g. by averaging them)?

Thanks!

@ir2718
Contributor

ir2718 commented Apr 16, 2024

The simplest way to achieve this would be to use a model meant for longer text sequences, e.g. BigBird.

If you're working with a language that doesn't have a model of this sort, you could go about this by artificially making the context size bigger. You can first tokenize all the examples and determine a new context size that you're OK with, e.g. 2048. Then, you can split each example into chunks of 512 tokens, encode the chunks separately, concatenate these partial embeddings into a single embedding, and train using this embedding.
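Roughly, that could look like the sketch below. It's a minimal, hedged example: the model name is just a placeholder, and the 512/2048 sizes follow the numbers above.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Example model; any SentenceTransformer model with a 512-token transformer underneath works here.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
tokenizer = model.tokenizer
CHUNK, CONTEXT = 512, 2048           # tokens per chunk / the new "artificial" context size
model.max_seq_length = CHUNK         # make sure each chunk is encoded in full

def encode_long(text):
    # Tokenize once and truncate to the artificial context size.
    ids = tokenizer.encode(text, add_special_tokens=False)[:CONTEXT]
    # Split the token ids into windows of CHUNK tokens and turn them back into text.
    chunks = [tokenizer.decode(ids[i:i + CHUNK]) for i in range(0, len(ids), CHUNK)] or [text]
    embs = model.encode(chunks)                              # shape: (n_chunks, dim)
    # Pad with zero vectors so every document ends up with the same concatenated size.
    pad = np.zeros((CONTEXT // CHUNK - len(chunks), embs.shape[1]))
    return np.concatenate([embs, pad]).reshape(-1)           # shape: (4 * dim,)
```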

Another option would be to split each document into manageable chunks such as paragraphs. Then, a sentence transformer out of the box might work.
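For the paragraph-chunking option, something like this minimal sketch could work (again, the model name is only an example, and mean-pooling is just one simple way to aggregate the paragraph embeddings):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Example model; swap in whichever multilingual model you end up using.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def encode_by_paragraph(document: str) -> np.ndarray:
    # Split on blank lines; any paragraph or section splitter works here.
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    embeddings = model.encode(paragraphs, normalize_embeddings=True)
    # Mean-pool the per-paragraph vectors into a single document embedding.
    return embeddings.mean(axis=0)
```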

These are just some ideas off the top of my head, but there is also some literature on document embeddings like this. I also remember an interesting paper on this exact subject published at ACL in the past few years, but I can't seem to find it at the moment.

On a side note, I've done some training using out-of-the-box sentence-transformers for semantic similarity in news articles (usually longer than the context size) and it worked like a charm.

@DavidNemeskey
Author

@ir2718 Thanks for the reply. What I am wondering about is whether the per-paragraph (or other chunk-level) application of the model will be any good, given that the published models were trained on sentences. Your side note gives me hope that the performance will be OK.

The paper looks interesting as well, will definitely read it.

@ir2718
Contributor

ir2718 commented Apr 17, 2024

@DavidNemeskey

If you plan on using an already fine-tuned model, it depends on the data and the task it was trained on. Some models are trained on paragraph data, and those might be suitable (check here). Without knowing the exact task you're working on, it's hard for me to say whether the embeddings will be any good.

@DavidNemeskey
Author

@ir2718 Sorry, I forgot to mention that I would like to work with Hungarian text, meaning my choices are restricted to the multilingual models. All of them except one have this as the first sentence on their model cards: "This is a sentence-transformers model: It maps sentences & paragraphs to a 512 dimensional dense vector space and can be used for tasks like clustering or semantic search."

Now my problem is that this seems to be a generic sentence describing the capabilities of sbert in general rather than the model in question. So I don't know whether I can assume that all of these models were trained on both sentences and paragraphs.

@ir2718
Contributor

ir2718 commented Apr 19, 2024

@DavidNemeskey

> Now my problem is that this seems to be a generic sentence describing the capabilities of sbert in general rather than the model in question

Yes, I'm pretty sure that was intended.

Your best bet would be to check each multilingual model and see whether its training data contains any longer texts. For example, this model has a list of the datasets it was trained on.

That said, I'm not sure whether the multilingual models are any good for Hungarian. An alternative path is to take a pretrained model for Hungarian and then label data for the task you're working on. Unfortunately, the labeling part is a hassle due to the text length, but it's worth it if you're looking to build a good model for such a specific language. I used the same idea for Croatian and it worked well with around 1000 labeled examples (meaning only around 700 for training). Besides, if you're aware of a similar dataset for English and a good English-to-Hungarian translation model, you can significantly speed up the process by translating the data and keeping the labels.
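The translate-and-keep-labels idea could look roughly like this sketch. The Helsinki-NLP/opus-mt-en-hu model name is an assumption on my part; substitute whatever English-to-Hungarian model you trust.

```python
from transformers import pipeline

# The translation model name is an assumption; any decent en->hu system will do.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-hu")

def translate_pairs(english_examples):
    """english_examples: iterable of (text_a, text_b, similarity_label) tuples."""
    for text_a, text_b, label in english_examples:
        hu_a = translator(text_a, max_length=512)[0]["translation_text"]
        hu_b = translator(text_b, max_length=512)[0]["translation_text"]
        # The similarity label carries over unchanged to the translated pair.
        yield hu_a, hu_b, label
```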
