website-scrap-engine

Configurable website scraper written in TypeScript.

Features

  • Resource types
  • Configurable process pipeline
  • Options
  • Logger
  • Concurrent downloader
  • Multi-thread processing (with native worker_threads)
  • Process CSS
  • Process HTML
  • Process sitemaps (without rewriting the paths inside them)
  • Configurable logging

Multi-thread processing

Note: use multi-thread processing only if your processing is CPU-intensive.

  • Main thread
    • resource downloading in queue
    • process after download
    • save binary resources to disk
    • send other resources to worker thread
    • enqueue non-duplicated resource from worker thread
  • Worker thread
    • receive downloaded resource from main thread
    • process after download
      • parse html, css, etc.
    • collect referenced resources
    • process and filter referenced resources before download
    • send referenced resources to main thread
    • save resources to disk
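
The main-thread side of the split above can be sketched roughly as follows. This is a minimal illustration, not the library's actual API: `DownloadedKind`, `routeDownloaded`, `ReferencedResource`, and `DedupQueue` are hypothetical names, and ignoring the URL fragment when checking for duplicates is an assumption made for the example.

```typescript
// Hypothetical names throughout; this is not the library's real API.
type DownloadedKind = 'binary' | 'html' | 'css' | 'siteMap';
type Destination = 'disk' | 'worker';

// Main thread: binary resources are saved directly,
// everything else is handed to a worker thread for parsing.
function routeDownloaded(kind: DownloadedKind): Destination {
  return kind === 'binary' ? 'disk' : 'worker';
}

// Hypothetical shape of a referenced resource reported by a worker.
interface ReferencedResource {
  url: string;
  depth: number;
}

// Main thread: only enqueue resources that have not been seen before.
class DedupQueue {
  private readonly seen = new Set<string>();
  private readonly pending: ReferencedResource[] = [];

  // Returns true when the resource was new and has been queued.
  enqueue(res: ReferencedResource): boolean {
    // Assumption: URLs differing only by fragment are the same resource.
    const key = res.url.replace(/#.*$/, '');
    if (this.seen.has(key)) {
      return false;
    }
    this.seen.add(key);
    this.pending.push(res);
    return true;
  }

  // Takes the next resource to download, or undefined when empty.
  next(): ReferencedResource | undefined {
    return this.pending.shift();
  }

  pendingCount(): number {
    return this.pending.length;
  }
}

// Usage: three resources reported by a worker, one of them a duplicate.
const queue = new DedupQueue();
const fromWorker: ReferencedResource[] = [
  { url: 'https://example.com/a.css', depth: 1 },
  { url: 'https://example.com/a.css#section', depth: 1 },
  { url: 'https://example.com/b.html', depth: 1 },
];
const accepted = fromWorker.filter(res => queue.enqueue(res)).length;
```

Keeping the duplicate check on the main thread means there is a single set of seen URLs, so workers never need to coordinate with each other.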

Pipeline life cycle

  • skip or redirect link
  • detect resource type
  • create
  • process before download
  • download
  • process after download
  • save to disk
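
The life cycle above can be read as a chain of stages, each of which may drop the resource. The sketch below is only an illustration under stated assumptions: `Stages`, `runLifeCycle`, and the demo stage implementations are hypothetical and do not mirror the library's real interfaces. The download stage is synchronous and returns a fake body to keep the example self-contained; a real downloader would be asynchronous.

```typescript
// Hypothetical types for illustration; not the library's real API.
type ResourceType = 'html' | 'css' | 'siteMap' | 'binary';

interface Resource {
  url: string;
  type: ResourceType;
  savePath: string;
  body?: string;
}

// One function per life cycle step. Returning undefined drops the resource.
interface Stages {
  skipOrRedirect: (url: string) => string | undefined;
  detectType: (url: string) => ResourceType;
  create: (url: string, type: ResourceType) => Resource;
  beforeDownload: (res: Resource) => Resource | undefined;
  download: (res: Resource) => Resource | undefined;
  afterDownload: (res: Resource) => Resource | undefined;
  save: (res: Resource) => void;
}

// Runs the stages in the order listed in the life cycle.
function runLifeCycle(url: string, s: Stages): Resource | undefined {
  const link = s.skipOrRedirect(url);
  if (link === undefined) return undefined;
  const type = s.detectType(link);
  let res: Resource | undefined = s.create(link, type);
  res = s.beforeDownload(res);
  if (!res) return undefined;
  res = s.download(res);
  if (!res) return undefined;
  res = s.afterDownload(res);
  if (!res) return undefined;
  s.save(res);
  return res;
}

// Demo stages: an in-memory "disk" and a fake downloader.
const saved: Record<string, string> = {};
const demoStages: Stages = {
  // Skip mailto links, redirect http to https.
  skipOrRedirect: url =>
    /^mailto:/.test(url) ? undefined : url.replace(/^http:/, 'https:'),
  detectType: url =>
    /\.css$/.test(url) ? 'css' : /\.png$/.test(url) ? 'binary' : 'html',
  // Save path is the URL without scheme and host.
  create: (url, type) => ({
    url,
    type,
    savePath: url.replace(/^https?:\/\/[^/]+\//, ''),
  }),
  beforeDownload: res => res,
  download: res => ({ ...res, body: 'fake body for ' + res.url }),
  afterDownload: res => res,
  save: res => {
    saved[res.savePath] = res.body || '';
  },
};

const result = runLifeCycle('http://example.com/docs/a.html', demoStages);
const skipped = runLifeCycle('mailto:someone@example.com', demoStages);
```

Here `result` is the saved resource for `docs/a.html`, while `skipped` is `undefined` because the first stage dropped the link before any download took place.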