eogns47/Sublink-Crawler

🔎Dynamic link crawler

🚀Find all URLs reachable from a base URL!🚀
A dynamic web crawler that drives a browser through Puppeteer to fetch every link on a page and on its child pages.
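
The idea of following links on a page and then on its children can be sketched as a depth-limited, breadth-first traversal. This is only an illustration, not the repository's actual code: the names `crawl` and `fetchLinksWithPuppeteer` are hypothetical, and it assumes that depth 1 means "the links found on the start page". The Puppeteer calls used (`browser.newPage`, `page.goto`, `page.$$eval`, `page.close`) are part of Puppeteer's public API; the `browser` object is expected to come from `puppeteer.launch()`.

```javascript
// Hypothetical helper: collect every anchor href from one rendered page.
// `browser` is assumed to be a Puppeteer Browser instance.
async function fetchLinksWithPuppeteer(browser, url) {
  const page = await browser.newPage();
  try {
    // Wait for network activity to settle so script-generated links exist.
    await page.goto(url, { waitUntil: "networkidle2" });
    return await page.$$eval("a[href]", (anchors) => anchors.map((a) => a.href));
  } finally {
    await page.close();
  }
}

// Depth-limited, breadth-first link discovery.
// `fetchLinks` is any async function (url) => string[], so the traversal
// can be exercised without a browser.
async function crawl(startUrl, maxDepth, fetchLinks) {
  const seen = new Set([startUrl]);
  let frontier = [startUrl];

  for (let depth = 0; depth < maxDepth && frontier.length > 0; depth++) {
    const next = [];
    for (const url of frontier) {
      let links = [];
      try {
        links = await fetchLinks(url);
      } catch (err) {
        // A page that fails to load contributes no links.
        continue;
      }
      for (const link of links) {
        if (!seen.has(link)) {
          seen.add(link);
          next.push(link);
        }
      }
    }
    frontier = next;
  }
  return [...seen];
}
```

With a real browser, `fetchLinks` could be `(url) => fetchLinksWithPuppeteer(browser, url)`. Keeping the page fetcher as a parameter makes the traversal easy to test with a stub.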

💡How to use (plain):

  1. Clone the repo and run npm install puppeteer yargs
  2. Create {root}/inputs/targets.txt listing the targets to scrape
  3. Create {root}/inputs/blacklist.txt listing the URLs to exclude from scraping
  4. Run node index.js -t targets.txt -r results.txt -b blacklist.txt -d 1
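
The README does not spell out the format of targets.txt and blacklist.txt. The sketch below assumes one URL per line, with blank lines ignored. The function names `parseUrlList` and `readUrlList` are hypothetical and are not taken from the repository.

```javascript
const fs = require("fs");

// Split file contents into a list of URLs.
// Assumption: one entry per line; surrounding whitespace and blank lines are dropped.
function parseUrlList(text) {
  return text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
}

// Read a list file from disk; a missing file is treated as an empty list.
function readUrlList(filePath) {
  if (!fs.existsSync(filePath)) return [];
  return parseUrlList(fs.readFileSync(filePath, "utf8"));
}
```

Under that assumption, a targets.txt holding two lines such as `https://example.com` and `https://example.org/docs` would yield a two-element array.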

💡How to use with ExecBot:

  1. Clone the repo and run npm install puppeteer yargs
  2. Run pip install -r requirements.txt
  3. Run python3 exec.py results.txt 1 (results.txt = results file name, 1 = depth)

💡How to use with Docker (recommended):

Works on amd64 and arm64.

  1. Pull the image from Docker Hub ➡️link
    - Plain version: tag name ServerCrawler
    - ExecBot version: tag name servercrawler_v2
  2. Start a container, adding any options you need: docker run -d eogns47/linkcrawler:{Your tag}
  3. Open a shell in the container: docker exec -it {container id} /bin/sh
  4. Run python3 exec.py results.txt 1 (results.txt = results file name, 1 = depth)

💡How to test:

  1. Install the unit test tool: npm install --save-dev jest
  2. Run npx jest

More Options:
--version Show version number [boolean]
-t Input file path [required]
-u Targets array list
-r Output file path
-d Crawling depth
-b Blacklist file path; URLs listed in it are never crawled (hard match)
--full Use full url for crawling instead of its base
-v Verbosity level [boolean]
--base Crawl only links that include the base URL, which excludes external links [boolean]

-h, --help Show help [boolean]
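
As a rough illustration of how the -b and --base options might narrow the set of links, the sketch below filters a list of URLs. It rests on two assumptions that the README does not confirm: "hard match" is read as exact string equality, and "include the base URL" is read as "shares the same origin as the target". The names `baseOf`, `sameBase`, and `filterLinks` are hypothetical.

```javascript
// Base (origin) of a URL: scheme, host, and port without path or query.
function baseOf(url) {
  return new URL(url).origin;
}

// True when `link` has the same origin as `base`; unparsable links are rejected.
function sameBase(link, base) {
  try {
    return new URL(link).origin === base;
  } catch {
    return false;
  }
}

// Drop blacklisted links and, when `base` is given, links outside that origin.
function filterLinks(links, { blacklist = [], base = null } = {}) {
  const blocked = new Set(blacklist);
  return links.filter((link) => {
    if (blocked.has(link)) return false; // assumed meaning of "hard match"
    if (base !== null && !sameBase(link, base)) return false; // external link
    return true;
  });
}
```

Comparing origins rather than string prefixes avoids accepting look-alike hosts that merely begin with the base URL's text.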

🛠️Tech Stack:

[Figure: link crawler architecture diagram]

🌲File Structure

link-crawler
├─ .dockerignore
│
├─ .gitignore
├─ Dockerfile
├─ ExecBot
│  └─ exec.py
├─ Logger
│  └─ logger.js
├─ README.md
├─ babel.config.js
├─ inputs
│  └─ .gitkeep
├─ results
│  └─ .gitkeep
├─ jest.config.js
├─ logs
│  └─ .gitkeep
├─ node_modules
│
├─ package-lock.json
├─ package.json
├─ src
│  ├─ Config
│  │  └─ Extensions.js
│  ├─ IOView.js
│  ├─ LinkCrawler.js
│  ├─ LinkPreprocessor.js
│  ├─ Validator.js
│  ├─ index.js
│  ├─ messageHandler.js
│  └─ tests
│     ├─ File.test.js
│     ├─ Link.test.js
│     └─ Validate.test.js
└─ yarn.lock
