Releases: spider-rs/spider

v1.97.0

04 Jun 18:44

What's Changed

  • add scoped website semaphore
  • add [cowboy] flag to remove semaphore limiting 🤠
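
A minimal sketch of a crawl with the new cowboy feature enabled, which removes the scoped semaphore cap entirely; the version pin and target URL below are illustrative:

extern crate spider;

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    // Cargo.toml sketch: spider = { version = "1.97.0", features = ["cowboy"] }
    // With cowboy enabled, the scoped semaphore no longer limits concurrency. 🤠
    let mut website: Website = Website::new("https://example.com");
    website.crawl().await;
    println!("pages crawled: {}", website.get_links().len());
}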

Full Changelog: v1.95.25...v1.97.0

v1.96.0

04 Jun 14:25

What's Changed

Fixed chrome stealth mode so it handles the configured user-agent correctly; a sketch follows the list below.

  1. chore(website): fix chrome stealth handling agent
  2. chore(website): add safe semaphore handling
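
A minimal sketch of the path the fix covers, stealth mode combined with a custom user-agent (requires the chrome feature; the agent string and URL are illustrative):

extern crate spider;

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://example.com")
        // Stealth should now apply the configured agent instead of clobbering it.
        .with_user_agent(Some("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"))
        .with_stealth(true)
        .build()
        .unwrap();
    website.crawl().await;
}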

Full Changelog: v1.95.25...v1.96.0

v1.95.28

01 Jun 12:09

What's Changed

The website crawl status now returns the proper state without resetting.

  1. chore(website): fix crawl status persisting
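
A short sketch of what the fix means in practice; this assumes a get_status accessor on Website exposing the crawl state:

extern crate spider;

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://example.com");
    website.crawl().await;
    // get_status is assumed here; the status should now reflect the
    // finished crawl instead of resetting between reads.
    println!("crawl status: {:?}", website.get_status());
}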

Full Changelog: v1.95.25...v1.95.28

v1.95.27

28 May 15:33

What's Changed

This release provides a major fix for crawls being delayed by robots.txt rules and crawl delays. If you set a limit or budget for the crawl and a robots.txt declared a delay of 10s, that delay became a bottleneck for the entire crawl once the limit applied, since every queued link had to wait out the delay before the crawl could exit. The robots delay is now capped at 60s for efficiency; see the sketch after the list below.

  • chore(cli): fix limit respecting
  • chore(robots): fix respect robots [#184]
  • bump chromiumoxide@0.6.0
  • bump tiktoken-rs@0.5.9
  • bump hashbrown@0.14.5
  • add zstd support reqwest
  • unpin smallvec
  • chore(website): fix crawl limit immediate exit
  • chore(robots): add max delay respect
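
A sketch of the bounded, robots-respecting setup the fix targets (URL and limit illustrative):

extern crate spider;

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://example.com")
        // A robots.txt crawl-delay (e.g. 10s) is now capped at 60s, so limited
        // crawls exit promptly instead of waiting out every queued link.
        .with_respect_robots_txt(true)
        .with_limit(50)
        .build()
        .unwrap();
    website.crawl().await;
    println!("pages crawled: {}", website.get_links().len());
}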

Full Changelog: v1.95.6...v1.95.27

v1.95.9

18 May 19:05

What's Changed

  1. chore(openai): fix smart mode passing target url
  2. chore(js): remove alpha js feature flag - jsdom crate
  3. chore(chrome): remove unnecessary page activation
  4. chore(openai): compress base prompt
  5. chore(openai): remove hidden content from request

Full Changelog: v1.94.4...v1.95.9

v1.94.4

09 May 16:44

What's Changed

A hybrid cache between chrome CDP requests and HTTP requests can be enabled using the cache_chrome_hybrid feature flag.
You can simulate browser HTTP headers to increase the chance of a plain HTTP request succeeding using the real_browser flag. A sketch follows the list below.

  1. feat(cache): add chrome caching between http
  2. feat(real_browser): add http simulation headers
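
Both entries are Cargo feature flags rather than builder methods; a minimal sketch with them enabled (version pin and URL illustrative):

extern crate spider;

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    // Cargo.toml sketch:
    // spider = { version = "1.94.4", features = ["cache_chrome_hybrid", "real_browser"] }
    // cache_chrome_hybrid shares cached responses between chrome CDP and HTTP requests;
    // real_browser adds simulated browser headers to plain HTTP requests.
    let mut website: Website = Website::new("https://example.com");
    website.crawl().await;
}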

Full Changelog: v1.93.43...v1.94.4

v1.93.43

03 May 18:35

What's Changed

Generating random real user-agents can now be done using ua_generator@0.4.1.
Spoofing HTTP headers can now be done with the spoof flag.

Use ua_generator::ua::UserAgents if you need a dynamic User-Agent randomizer paired with website.with_user_agent, as shown in the sketch after the list below.

  • feat(spoof): add referrer spoofing
  • feat(spoof): add real user-agent spoofing
  • feat(chrome): add dynamic chrome connections
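
A minimal sketch pairing ua_generator's spoof_ua helper with with_user_agent (the target URL is illustrative):

extern crate spider;
extern crate ua_generator;

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    // spoof_ua returns a random real user-agent string.
    let ua = ua_generator::ua::spoof_ua();

    let mut website: Website = Website::new("https://example.com")
        .with_user_agent(Some(ua))
        .build()
        .unwrap();
    website.crawl().await;
}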

Full Changelog: v1.93.23...v1.93.43

v1.93.13

23 Apr 11:08

What's Changed

Updated crate compatibility with reqwest@0.12.4 and fixed the headers compile for the worker.
The http3 feature flag was removed - follow the unstable instructions if needed.

The function website.get_domain was renamed to website.get_url.
The function website.get_domain_parsed was renamed to website.get_url_parsed. A short migration sketch follows the list below.

  • chore(worker): fix headers flag compile
  • chore(crates): update async-openai@0.20.0
  • chore(openai): trim start messages content output text
  • chore(website): fix url getter function name
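
A short migration sketch for the renames (URL illustrative):

extern crate spider;

use spider::website::Website;

fn main() {
    let website: Website = Website::new("https://example.com");

    // Before: website.get_domain() and website.get_domain_parsed()
    // After:
    let url = website.get_url();
    println!("target: {:?}", url);
}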

Full Changelog: v1.93.3...v1.93.13

v1.93.3

14 Apr 21:20

What's Changed

You can now take screenshots per step when using OpenAI to manipulate the page.
Connecting to a proxy on chrome headless remote is now fixed.

  1. feat(openai): add screenshot js execution after effects
  2. feat(openai): add deserialization error determination
  3. chore(chrome): fix proxy server headless connecting

Example

use spider::configuration::GPTConfigs;

let mut gpt_config: GPTConfigs = GPTConfigs::new_multi(
    "gpt-4-1106-preview",
    vec!["Search for Movies", "Extract the hrefs found."],
    3000,
);

gpt_config.screenshot = true;
gpt_config.set_extra(true);

Full Changelog: v1.92.0...v1.93.3

v1.92.0

13 Apr 17:30

What's Changed

Caching OpenAI responses can now be done using the 'cache_openai' flag and a builder method.

  • docs: fix broken glob url link by @emilsivervik in #179
  • feat(openai): add response caching

Example

extern crate spider;

use spider::configuration::{GPTConfigs, WaitForIdleNetwork};
use spider::moka::future::Cache;
use spider::tokio;
use spider::website::Website;
use std::time::Duration;

#[tokio::main]
async fn main() {
    let cache = Cache::builder()
        .time_to_live(Duration::from_secs(30 * 60))
        .time_to_idle(Duration::from_secs(5 * 60))
        .max_capacity(10_000)
        .build();

    let mut gpt_config: GPTConfigs = GPTConfigs::new_multi_cache(
        "gpt-4-1106-preview",
        vec![
            "Search for Movies",
            "Click on the first result movie result",
        ],
        500,
        Some(cache),
    );
    gpt_config.set_extra(true);

    let mut website: Website = Website::new("https://www.google.com")
        .with_chrome_intercept(true, true)
        .with_wait_for_idle_network(Some(WaitForIdleNetwork::new(Some(Duration::from_secs(30)))))
        .with_limit(1)
        .with_openai(Some(gpt_config))
        .build()
        .unwrap();
    let mut rx2 = website.subscribe(16).unwrap();

    let handle = tokio::spawn(async move {
        while let Ok(page) = rx2.recv().await {
            println!("---\n{}\n{:?}\n{:?}\n---", page.get_url(), page.openai_credits_used, page.extra_ai_data);
        }
    });

    let start = tokio::time::Instant::now();
    website.crawl().await;
    let duration = start.elapsed();
    let links = website.get_links();

    println!(
        "(0) Time elapsed in website.crawl() is: {:?} for total pages: {:?}",
        duration,
        links.len()
    );

    // crawl the page again to see if cache is re-used.
    let start = tokio::time::Instant::now();
    website.crawl().await;
    let duration = start.elapsed();

    website.unsubscribe();

    let _ = handle.await;

    println!(
        "(1) Time elapsed in website.crawl() is: {:?} for total pages: {:?}",
        duration,
        links.len()
    );
}

New Contributors

  • @emilsivervik made their first contribution in #179

Full Changelog: v1.91.1...v1.92.0