
Krawler: Asynchronous Kotlin Crawler 🚀

Overview

Krawler is a fully configurable, asynchronous HTML crawler written in Kotlin (JVM). Powered by Coroutines, Kotlin Serialization (JSON), Ktor Client, Exposed, SQLite, and the SQLite JDBC driver, it makes scraping HTML webpages straightforward.

Features

  • Asynchronous Processing: Built on Kotlin coroutines, Krawler is designed for high-performance, concurrent web crawling; a rough sketch of the pattern follows this list.

  • Configurability: Krawler is highly customizable through the krawler_config.json file, placed at the project path.

  • Extensive Logging: Verbose logs can be enabled via the configuration file.

  • Error Persistence: Errors encountered during crawling are stored in the CrawlErrors table (with the necessary metadata) and printed to the standard output.
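
As referenced in the first feature above, here is a minimal sketch of the concurrency pattern: a bounded fan-out in which coroutines fetch pages through a shared Ktor client while a semaphore caps the number of simultaneous connections. It is illustrative only and not Krawler's actual implementation.

// Illustrative sketch only, not Krawler's code: fetch a batch of URLs
// concurrently, capped at a fixed number of simultaneous connections.
import io.ktor.client.*
import io.ktor.client.engine.cio.*
import io.ktor.client.request.*
import io.ktor.client.statement.*
import kotlinx.coroutines.*
import kotlinx.coroutines.sync.Semaphore
import kotlinx.coroutines.sync.withPermit

suspend fun fetchAll(urls: List<String>, concurrentConnections: Int = 16): List<String> =
  HttpClient(CIO).use { client ->
    coroutineScope {
      val limit = Semaphore(concurrentConnections) // caps simultaneous requests
      urls.map { url ->
        async { limit.withPermit { client.get(url).bodyAsText() } }
      }.awaitAll()
    }
  }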

Database Schema

Krawler uses the following tables to persist data:

import org.jetbrains.exposed.dao.id.IntIdTable

object CrawlActivities : IntIdTable() {
  val sessionId = varchar("sessionId", 100)
  val atEpochSeconds = long("atEpochSeconds")
  val type = varchar("type", 50)
}

object CrawlErrors : IntIdTable() {
  val sessionId = varchar("sessionId", 100)
  val atEpochSeconds = long("atEpochSeconds")
  val url = text("url")
  val error = text("error")
}

object CrawlingStates : IntIdTable() {
  val sessionId = varchar("sessionId", 100)
  val url = text("url")
  val depth = integer("depth")
  val priority = long("priority")
}

object Webpages : IntIdTable() {
  val sessionId = varchar("sessionId", 100)
  val atEpochSeconds = long("atEpochSeconds")
  val url = text("url")
  val html = text("html")
}
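
Since the tables are ordinary Exposed definitions backed by SQLite, the persisted data is easy to inspect. The snippet below is a minimal sketch rather than part of Krawler; the file name krawler.db is an assumption, so point it at whichever SQLite file Krawler actually writes.

// Minimal sketch (not Krawler code): read persisted pages back out of SQLite.
// "krawler.db" is an assumed file name; substitute the actual database file.
import org.jetbrains.exposed.sql.Database
import org.jetbrains.exposed.sql.selectAll
import org.jetbrains.exposed.sql.transactions.transaction

fun main() {
  Database.connect("jdbc:sqlite:krawler.db", driver = "org.sqlite.JDBC")
  transaction {
    Webpages.selectAll().limit(10).forEach { row ->
      println("${row[Webpages.url]} -> ${row[Webpages.html].length} chars of HTML")
    }
  }
}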

Configuration

Krawler is highly customizable through the krawler_config.json file, placed at the project path. Below is a sample configuration containing all settings:

{
  "seeds": [
    "https://en.wikipedia.org/wiki/NASA"
  ],
  "filter": {
    "#": "dev.yekta.krawler.model.CrawlingFilter.Whitelist",
    "allowPatterns": [
      "https://en\\.wikipedia\\.org/wiki/.*"
    ]
  },
  "depth": 8,
  "maxPages": 100,
  "maxPageSizeKb": null,
  "concurrentConnections": 16,
  "verbose": true,
  "shouldFollowRedirects": true,
  "userAgent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
  "connectTimeoutMs": 6000,
  "readTimeoutMs": 6000,
  "retriesOnServerError": 0,
  "customHeaders": null
}

  • seeds: Starting URLs for crawling.
  • filter: Crawling filter configuration, either Whitelist or Blacklist.
  • depth: Maximum depth of crawling.
  • maxPages: Maximum number of pages to crawl.
  • maxPageSizeKb: Maximum page size in kilobytes.
  • concurrentConnections: Number of concurrent connections for crawling.
  • verbose: Enable verbose logging.
  • shouldFollowRedirects: Specify if redirects should be followed.
  • userAgent: User agent string for HTTP requests.
  • connectTimeoutMs: Connection timeout in milliseconds.
  • readTimeoutMs: Read timeout in milliseconds.
  • retriesOnServerError: Number of retries on server errors (5xx).
  • customHeaders: Additional custom headers for HTTP requests.
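
To show how these settings line up with the libraries Krawler is built on, the sketch below decodes the JSON into a hypothetical CrawlerConfig class with kotlinx.serialization and wires a few of the values into a Ktor HttpClient. All names here are assumptions for illustration; Krawler's real configuration model (under dev.yekta.krawler.model) may be shaped differently.

// Sketch only: a hypothetical config class mirroring krawler_config.json and
// a Ktor (2.x+) client configured from it. Names are assumptions, not Krawler's API.
import io.ktor.client.*
import io.ktor.client.engine.cio.*
import io.ktor.client.plugins.*
import kotlinx.serialization.Serializable
import kotlinx.serialization.decodeFromString
import kotlinx.serialization.json.Json
import java.io.File

@Serializable
data class CrawlerConfig(
  val seeds: List<String>,
  val depth: Int,
  val maxPages: Int,
  val maxPageSizeKb: Int? = null,
  val concurrentConnections: Int,
  val verbose: Boolean,
  val shouldFollowRedirects: Boolean,
  val userAgent: String,
  val connectTimeoutMs: Long,
  val readTimeoutMs: Long,
  val retriesOnServerError: Int,
  val customHeaders: Map<String, String>? = null,
  // "filter" is omitted here: it is polymorphic (Whitelist/Blacklist) and would
  // need a sealed class hierarchy plus a class discriminator to deserialize.
)

fun loadConfig(path: String = "krawler_config.json"): CrawlerConfig =
  Json { ignoreUnknownKeys = true }.decodeFromString(File(path).readText())

fun httpClientFor(config: CrawlerConfig): HttpClient = HttpClient(CIO) {
  followRedirects = config.shouldFollowRedirects
  install(UserAgent) { agent = config.userAgent }
  install(HttpTimeout) {
    connectTimeoutMillis = config.connectTimeoutMs
    socketTimeoutMillis = config.readTimeoutMs
  }
}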

Good Next Steps

Things that would benefit Krawler the most:

  • Implementing Pause/Resume
    • Hint: The UrlPool is the only state that isn't currently persisted; persisting it is required before paused sessions can be restored (a hypothetical table sketch follows this list).
  • Config: respectRobotsTxt: Boolean
  • Config: consecutiveErrorsToPause: Int?
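
For the first item, here is a purely hypothetical sketch of how the UrlPool could be persisted, following the conventions of the existing tables; the object name and columns are assumptions, not a committed design.

// Hypothetical only: a possible shape for persisting the UrlPool, mirroring
// the style of the existing tables. Not part of Krawler's current schema.
object UrlPoolEntries : IntIdTable() {
  val sessionId = varchar("sessionId", 100)
  val url = text("url")
  val visited = bool("visited")
}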

Disclaimer

Krawler was conceived and brought to life over a weekend, starting as a pet project. It was initially planned as a component of the coursework for the Web & Search Engines course at Yazd University, but then grew exponentially due to a sudden desire to make a "good thing" out of it! It's important to note that no explicit guarantees are extended regarding its correctness, functionality, support, or any other aspect. With that in mind, happy Krawling!

License

Please refer to LICENSE to view the project's license.
