Releases: spider-rs/spider

v1.97.0

04 Jun 18:44

What's Changed

  • add scoped website semaphore
  • add [cowboy] flag to remove semaphore limiting 🤠
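
A minimal sketch of a crawl with the new cowboy feature enabled, which removes the scoped semaphore cap entirely; the version pin and target URL below are illustrative:

extern crate spider;

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    // Cargo.toml sketch: spider = { version = "1.97.0", features = ["cowboy"] }
    // With cowboy enabled, the scoped semaphore no longer limits concurrency. 🤠
    let mut website: Website = Website::new("https://example.com");
    website.crawl().await;
    println!("pages crawled: {}", website.get_links().len());
}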

Full Changelog: v1.95.25...v1.97.0

v1.96.0

04 Jun 14:25

What's Changed

Fixed chrome stealth mode so it handles the configured user-agent correctly; a sketch follows the list below.

  1. chore(website): fix chrome stealth handling agent
  2. chore(website): add safe semaphore handling
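
A minimal sketch of the path the fix covers, stealth mode combined with a custom user-agent (requires the chrome feature; the agent string and URL are illustrative):

extern crate spider;

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://example.com")
        // Stealth should now apply the configured agent instead of clobbering it.
        .with_user_agent(Some("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"))
        .with_stealth(true)
        .build()
        .unwrap();
    website.crawl().await;
}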

Full Changelog: v1.95.25...v1.96.0

v1.95.28

01 Jun 12:09

What's Changed

The website crawl status now returns the proper state without resetting.

  1. chore(website): fix crawl status persisting
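
A short sketch of what the fix means in practice; this assumes a get_status accessor on Website exposing the crawl state:

extern crate spider;

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://example.com");
    website.crawl().await;
    // get_status is assumed here; the status should now reflect the
    // finished crawl instead of resetting between reads.
    println!("crawl status: {:?}", website.get_status());
}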

Full Changelog: v1.95.25...v1.95.28

v1.95.27

28 May 15:33

What's Changed

This release provides a major fix for crawls being delayed by robots.txt rules and crawl delays. If you set a limit or budget for the crawl and a robots.txt declared a delay of 10s, that delay became a bottleneck for the entire crawl once the limit applied, since every queued link had to wait out the delay before the crawl could exit. The robots delay is now capped at 60s for efficiency; see the sketch after the list below.

  • chore(cli): fix limit respecting
  • chore(robots): fix respect robots [#184]
  • bump chromiumoxide@0.6.0
  • bump tiktoken-rs@0.5.9
  • bump hashbrown@0.14.5
  • add zstd support reqwest
  • unpin smallvec
  • chore(website): fix crawl limit immediate exit
  • chore(robots): add max delay respect
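
A sketch of the bounded, robots-respecting setup the fix targets (URL and limit illustrative):

extern crate spider;

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://example.com")
        // A robots.txt crawl-delay (e.g. 10s) is now capped at 60s, so limited
        // crawls exit promptly instead of waiting out every queued link.
        .with_respect_robots_txt(true)
        .with_limit(50)
        .build()
        .unwrap();
    website.crawl().await;
    println!("pages crawled: {}", website.get_links().len());
}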

Full Changelog: v1.95.6...v1.95.27

v1.95.9

18 May 19:05

What's Changed

  1. chore(openai): fix smart mode passing target url
  2. chore(js): remove alpha js feature flag - jsdom crate
  3. chore(chrome): remove unnecessary page activation
  4. chore(openai): compress base prompt
  5. chore(openai): remove hidden content from request

Full Changelog: v1.94.4...v1.95.9

v1.94.4

09 May 16:44

What's Changed

A hybrid cache between chrome CDP requests and HTTP requests can be enabled using the cache_chrome_hybrid feature flag.
You can simulate browser HTTP headers to increase the chance of a plain HTTP request succeeding using the real_browser flag. A sketch follows the list below.

  1. feat(cache): add chrome caching between http
  2. feat(real_browser): add http simulation headers
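
Both entries are Cargo feature flags rather than builder methods; a minimal sketch with them enabled (version pin and URL illustrative):

extern crate spider;

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    // Cargo.toml sketch:
    // spider = { version = "1.94.4", features = ["cache_chrome_hybrid", "real_browser"] }
    // cache_chrome_hybrid shares cached responses between chrome CDP and HTTP requests;
    // real_browser adds simulated browser headers to plain HTTP requests.
    let mut website: Website = Website::new("https://example.com");
    website.crawl().await;
}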

Full Changelog: v1.93.43...v1.94.4

v1.93.43

03 May 18:35

What's Changed

Generating random real user-agents can now be done using ua_generator@0.4.1.
Spoofing HTTP headers can now be done with the spoof flag.

Use ua_generator::ua::UserAgents if you need a dynamic User-Agent randomizer paired with website.with_user_agent, as shown in the sketch after the list below.

  • feat(spoof): add referrer spoofing
  • feat(spoof): add real user-agent spoofing
  • feat(chrome): add dynamic chrome connections
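
A minimal sketch pairing ua_generator's spoof_ua helper with with_user_agent (the target URL is illustrative):

extern crate spider;
extern crate ua_generator;

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    // spoof_ua returns a random real user-agent string.
    let ua = ua_generator::ua::spoof_ua();

    let mut website: Website = Website::new("https://example.com")
        .with_user_agent(Some(ua))
        .build()
        .unwrap();
    website.crawl().await;
}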

Full Changelog: v1.93.23...v1.93.43

v1.93.13

23 Apr 11:08

What's Changed

Updated crate compatibility with reqwest@0.12.4 and fixed the headers compile for the worker.
The http3 feature flag was removed - follow the unstable instructions if needed.

The function website.get_domain was renamed to website.get_url.
The function website.get_domain_parsed was renamed to website.get_url_parsed. A short migration sketch follows the list below.

  • chore(worker): fix headers flag compile
  • chore(crates): update async-openai@0.20.0
  • chore(openai): trim start messages content output text
  • chore(website): fix url getter function name
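
A short migration sketch for the renames (URL illustrative):

extern crate spider;

use spider::website::Website;

fn main() {
    let website: Website = Website::new("https://example.com");

    // Before: website.get_domain() and website.get_domain_parsed()
    // After:
    let url = website.get_url();
    println!("target: {:?}", url);
}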

Full Changelog: v1.93.3...v1.93.13

v1.93.3

14 Apr 21:20

What's Changed

You can now take screenshots per step when using OpenAI to manipulate the page.
Connecting to a proxy on chrome headless remote is now fixed.

  1. feat(openai): add screenshot js execution after effects
  2. feat(openai): add deserialization error determination
  3. chore(chrome): fix proxy server headless connecting

Example

use spider::configuration::GPTConfigs;

let mut gpt_config: GPTConfigs = GPTConfigs::new_multi(
    "gpt-4-1106-preview",
    vec!["Search for Movies", "Extract the hrefs found."],
    3000,
);

gpt_config.screenshot = true;
gpt_config.set_extra(true);

Full Changelog: v1.92.0...v1.93.3

v1.92.0

13 Apr 17:30

What's Changed

Caching OpenAI responses can now be done using the 'cache_openai' flag and a builder method.

  • docs: fix broken glob url link by @emilsivervik in #179
  • feat(openai): add response caching

Example

extern crate spider;

use spider::configuration::{GPTConfigs, WaitForIdleNetwork};
use spider::moka::future::Cache;
use spider::tokio;
use spider::website::Website;
use std::time::Duration;

#[tokio::main]
async fn main() {
    let cache = Cache::builder()
        .time_to_live(Duration::from_secs(30 * 60))
        .time_to_idle(Duration::from_secs(5 * 60))
        .max_capacity(10_000)
        .build();

    let mut gpt_config: GPTConfigs = GPTConfigs::new_multi_cache(
        "gpt-4-1106-preview",
        vec![
            "Search for Movies",
            "Click on the first result movie result",
        ],
        500,
        Some(cache),
    );
    gpt_config.set_extra(true);

    let mut website: Website = Website::new("https://www.google.com")
        .with_chrome_intercept(true, true)
        .with_wait_for_idle_network(Some(WaitForIdleNetwork::new(Some(Duration::from_secs(30)))))
        .with_limit(1)
        .with_openai(Some(gpt_config))
        .build()
        .unwrap();
    let mut rx2 = website.subscribe(16).unwrap();

    let handle = tokio::spawn(async move {
        while let Ok(page) = rx2.recv().await {
            println!("---\n{}\n{:?}\n{:?}\n---", page.get_url(), page.openai_credits_used, page.extra_ai_data);
        }
    });

    let start = tokio::time::Instant::now();
    website.crawl().await;
    let duration = start.elapsed();
    let links = website.get_links();

    println!(
        "(0) Time elapsed in website.crawl() is: {:?} for total pages: {:?}",
        duration,
        links.len()
    );

    // crawl the page again to see if cache is re-used.
    let start = tokio::time::Instant::now();
    website.crawl().await;
    let duration = start.elapsed();

    website.unsubscribe();

    let _ = handle.await;

    println!(
        "(1) Time elapsed in website.crawl() is: {:?} for total pages: {:?}",
        duration,
        links.len()
    );
}

New Contributors

  • @emilsivervik made their first contribution in #179

Full Changelog: v1.91.1...v1.92.0