Releases: spider-rs/spider
v1.97.0
v1.96.0
What's Changed
Fixed Chrome stealth handling of the user-agent.
- chore(website): fix chrome stealth handling agent
- chore(website): add safe semaphore handling
Full Changelog: v1.95.25...v1.96.0
v1.95.28
What's Changed
The website crawl status now returns the proper state without resetting.
- chore(website): fix crawl status persisting
Full Changelog: v1.95.25...v1.95.28
v1.95.27
What's Changed
This release provides a major fix for crawls being delayed when respecting robots.txt rules or crawl delays. If you set a limit or budget for the crawl and a robots.txt contains a delay of 10s, that delay became a bottleneck for the entire crawl once limits applied, since each link had to wait out the delay before the crawl could exit. The robots delay is now capped at 60s for efficiency.
- chore(cli): fix limit respecting
- chore(robots): fix respect robots [#184]
- bump chromiumoxide@0.6.0
- bump tiktoken-rs@0.5.9
- bump hashbrown@0.14.5
- add zstd support reqwest
- unpin smallvec
- chore(website): fix crawl limit immediate exit
- chore(robots): add max delay respect
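The capping behavior described above can be sketched as a simple clamp. This is a minimal illustration, not the crate's internal code; `effective_delay` and `MAX_ROBOTS_DELAY` are hypothetical names:

```rust
use std::time::Duration;

// Maximum crawl delay honored from robots.txt (v1.95.27 caps this at 60s).
const MAX_ROBOTS_DELAY: Duration = Duration::from_secs(60);

/// Clamp a robots.txt crawl-delay so one slow host cannot stall the whole crawl.
fn effective_delay(robots_delay: Duration) -> Duration {
    robots_delay.min(MAX_ROBOTS_DELAY)
}

fn main() {
    // A modest 10s delay is respected as-is.
    assert_eq!(effective_delay(Duration::from_secs(10)), Duration::from_secs(10));
    // An extreme 300s delay is capped at 60s.
    assert_eq!(effective_delay(Duration::from_secs(300)), Duration::from_secs(60));
}
```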
Full Changelog: v1.95.6...v1.95.27
v1.95.9
What's Changed
- chore(openai): fix smart mode passing target url
- chore(js): remove alpha js feature flag - jsdom crate
- chore(chrome): remove unnecessary page activation
- chore(openai): compress base prompt
- chore(openai): remove hidden content from request
Full Changelog: v1.94.4...v1.95.9
v1.94.4
What's Changed
A hybrid cache between Chrome CDP requests and HTTP requests can be enabled with the cache_chrome_hybrid feature flag.
You can simulate real browser HTTP headers to increase the chance of a plain HTTP request succeeding using the real_browser flag.
- feat(cache): add chrome caching between http
- feat(real_browser): add http simulation headers
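Both names above are ordinary Cargo features; a sketch of enabling them in Cargo.toml (the version number here is illustrative, pin whichever release you target):

```toml
[dependencies]
spider = { version = "1.94", features = ["cache_chrome_hybrid", "real_browser"] }
```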
Full Changelog: v1.93.43...v1.94.4
v1.93.43
What's Changed
Generating random real user-agents can now be done using ua_generator@0.4.1.
Spoofing HTTP headers can now be done with the spoof flag.
Use ua_generator::ua::UserAgents if you need a dynamic User-Agent randomizer, followed by website.with_user_agent.
- feat(spoof): add referrer spoofing
- feat(spoof): add real user-agent spoofing
- feat(chrome): add dynamic chrome connections
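A minimal sketch of pairing the two crates, assuming ua_generator exposes a spoof_ua helper returning a static string and that with_user_agent accepts an Option<&str> (check the current API docs before relying on either signature):

```rust
extern crate spider;
use spider::website::Website;
use ua_generator::ua::spoof_ua;

fn main() {
    // Pick a random real-world user-agent string for this crawl.
    let ua: &'static str = spoof_ua();
    let website: Website = Website::new("https://example.com")
        .with_user_agent(Some(ua))
        .build()
        .unwrap();
    // The crawl now presents a realistic User-Agent header.
    let _ = website;
}
```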
Full Changelog: v1.93.23...v1.93.43
v1.93.13
What's Changed
Updated crate compatibility with reqwest@0.12.4 and fixed the headers feature compiling for the worker.
Removed the http3 feature flag - follow the unstable instructions if needed.
The function website.get_domain was renamed to website.get_url.
The function website.get_domain_parsed was renamed to website.get_url_parsed.
- chore(worker): fix headers flag compile
- chore(crates): update async-openai@0.20.0
- chore(openai): trim start messages content output text
- chore(website): fix url getter function name
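Migrating is a mechanical rename; a small sketch (example.com is a placeholder target):

```rust
extern crate spider;
use spider::website::Website;

fn main() {
    let website: Website = Website::new("https://example.com");
    // Before v1.93.13: website.get_domain() / website.get_domain_parsed()
    // After:           website.get_url()    / website.get_url_parsed()
    let _ = website.get_url();
}
```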
Full Changelog: v1.93.3...v1.93.13
v1.93.3
What's Changed
You can now take screenshots per step when using OpenAI to manipulate the page.
Connecting to a proxy with headless remote Chrome is now fixed.
- feat(openai): add screenshot js execution after effects
- feat(openai): add deserialization error determination
- chore(chrome): fix proxy server headless connecting
use spider::configuration::GPTConfigs;

let mut gpt_config: GPTConfigs = GPTConfigs::new_multi(
    "gpt-4-1106-preview",
    vec!["Search for Movies", "Extract the hrefs found."],
    3000,
);

gpt_config.screenshot = true;
gpt_config.set_extra(true);
Full Changelog: v1.92.0...v1.93.3
v1.92.0
What's Changed
Caching OpenAI responses can now be done using the cache_openai flag and a builder method.
- docs: fix broken glob url link by @emilsivervik in #179
- feat(openai): add response caching
Example
extern crate spider;

use spider::configuration::{GPTConfigs, WaitForIdleNetwork};
use spider::moka::future::Cache;
use spider::tokio;
use spider::website::Website;
use std::time::Duration;

#[tokio::main]
async fn main() {
    let cache = Cache::builder()
        .time_to_live(Duration::from_secs(30 * 60))
        .time_to_idle(Duration::from_secs(5 * 60))
        .max_capacity(10_000)
        .build();

    let mut gpt_config: GPTConfigs = GPTConfigs::new_multi_cache(
        "gpt-4-1106-preview",
        vec![
            "Search for Movies",
            "Click on the first result movie result",
        ],
        500,
        Some(cache),
    );
    gpt_config.set_extra(true);

    let mut website: Website = Website::new("https://www.google.com")
        .with_chrome_intercept(true, true)
        .with_wait_for_idle_network(Some(WaitForIdleNetwork::new(Some(Duration::from_secs(30)))))
        .with_limit(1)
        .with_openai(Some(gpt_config))
        .build()
        .unwrap();

    let mut rx2 = website.subscribe(16).unwrap();

    let handle = tokio::spawn(async move {
        while let Ok(page) = rx2.recv().await {
            println!(
                "---\n{}\n{:?}\n{:?}\n---",
                page.get_url(),
                page.openai_credits_used,
                page.extra_ai_data
            );
        }
    });

    let start = crate::tokio::time::Instant::now();
    website.crawl().await;
    let duration = start.elapsed();
    let links = website.get_links();

    println!(
        "(0) Time elapsed in website.crawl() is: {:?} for total pages: {:?}",
        duration,
        links.len()
    );

    // Crawl the page again to see if the cache is re-used.
    let start = crate::tokio::time::Instant::now();
    website.crawl().await;
    let duration = start.elapsed();

    website.unsubscribe();
    let _ = handle.await;

    println!(
        "(1) Time elapsed in website.crawl() is: {:?} for total pages: {:?}",
        duration,
        links.len()
    );
}
New Contributors
- @emilsivervik made their first contribution in #179
Full Changelog: v1.91.1...v1.92.0