
Adding more tools to the benchmark? #3

Open
adbar opened this issue Jun 26, 2020 · 7 comments
adbar commented Jun 26, 2020

Hi,

Thanks for your contribution, it's really useful to see evaluations on real-world data! There are other extraction tools for Python that this repository doesn't feature yet, some of which could be more efficient than the ones you're mentioning. You might have a look at

  • goose3
  • jusText (especially with a custom configuration)
  • inscriptis (html-to-txt conversion)
  • trafilatura (disclaimer: I'm the author).

Or is there a reason why you didn't use them in the first place? I'd be curious to hear about it.

For more details, please refer to the evaluation I've performed; the code, including baselines, is available here.

lopuhin commented Jun 29, 2020

Hi @adbar, thanks for the pointers to the tools and the evaluation. Another tool referenced elsewhere by @saippuakauppias is https://github.com/go-shiori/go-readability. It would be great to add them; we only need a script which outputs results in JSON. PRs are welcome, and I hope to find time to add more tools soon as well. It would be great to have more tools evaluated.
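A minimal sketch of such a per-tool runner (the `extract_article` call and the output schema are placeholders here, not this repository's exact format):

```python
# Sketch: read HTML files from a directory, extract the article body with
# the tool under evaluation, and dump the results as JSON keyed by file name.
import json
import pathlib

def extract_article(html: str) -> str:
    # Placeholder: call the tool being benchmarked here,
    # e.g. trafilatura.extract(html) or a readability port.
    return html

def run(input_dir: str, output_path: str) -> None:
    results = {}
    for path in sorted(pathlib.Path(input_dir).glob("*.html")):
        html = path.read_text(encoding="utf-8")
        results[path.stem] = {"articleBody": extract_article(html)}
    pathlib.Path(output_path).write_text(
        json.dumps(results, ensure_ascii=False, indent=2), encoding="utf-8"
    )
```

One such script per tool keeps each dependency isolated; the evaluation then only needs to compare the JSON outputs against the ground truth.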

adbar commented Jul 7, 2020

Thanks for your answer. I've added JSON output to trafilatura and will check whether I can write a straightforward PR.

adbar commented Sep 14, 2021

Hi @lopuhin, here is another tool that could be added: Mercury Parser.
(source: adbar/trafilatura#114)

adbar commented Jan 5, 2022

Hi @lopuhin, just a quick follow-up: the benchmark could also be updated using the latest versions of the tools, see for instance the issue adbar/trafilatura#156.

Seirdy commented Apr 1, 2022

Another tool to consider is Azure Immersive Reader, used in Microsoft Edge.

BradKML commented Apr 2, 2023

Seconded, but I would also like to see:

  1. which tools perform better (F1/precision/recall/accuracy) relative to their speed, in the same vein as the Squash Benchmark or Matt Mahoney's rankings for compression algorithms (since there will always be a trade-off between quality and speed)
  2. bigger datasets for re-evaluating the benchmark, since a larger diversity of articles from blogs may make for a stronger use case

@BradKML
Copy link

BradKML commented May 7, 2024

With the current advances in RAG with LLMs, I think these benchmarks would be paramount for gathering information, and the comparison is really due for an update.
P.S. DragNet now has a maintained fork: https://github.com/currentslab/extractnet
