Releases: apache/incubator-stormcrawler
Apache StormCrawler 3.0 (Incubating)
Disclaimer
Apache StormCrawler is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the Apache Incubator. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF.
Release Summary
This is our first release after joining the ASF Incubator as a podling. It is a breaking release: the Maven group IDs have been renamed and the Elasticsearch module has been removed.
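For projects upgrading from a 2.x release, the package rename mostly comes down to a prefix change in imports and in class names referenced from configuration files. The sketch below illustrates that idea only. `PackageRenameSketch` is a hypothetical helper, not part of StormCrawler, and both prefixes are assumptions: `com.digitalpebble.stormcrawler` for pre-ASF releases and `org.apache.stormcrawler` for 3.0.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical helper (not part of StormCrawler): rewrites pre-3.0 package prefixes. */
final class PackageRenameSketch {
    // Assumption: pre-ASF releases used this package prefix.
    static final String OLD_PREFIX = "com.digitalpebble.stormcrawler";
    // Assumption: 3.0 uses this prefix after the move to org.apache.
    static final String NEW_PREFIX = "org.apache.stormcrawler";

    /** Replaces every occurrence of the old prefix in a single line of text. */
    static String migrateLine(String line) {
        return line.replace(OLD_PREFIX, NEW_PREFIX);
    }

    /** Applies the prefix rewrite to each line; lines without the old prefix are unchanged. */
    static List<String> migrate(List<String> lines) {
        return lines.stream()
                .map(PackageRenameSketch::migrateLine)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> source = List.of(
                "import com.digitalpebble.stormcrawler.Metadata;",
                "import java.util.Map;");
        migrate(source).forEach(System.out::println);
    }
}
```

A plain textual replace like this is only a starting point. Dependencies on the removed Elasticsearch module need a manual migration, for example to the OpenSearch module.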
What's Changed
- Handling of DateTimeParseException in WARCSpout by @michaeldinzinger in #1140
- Generate THIRD-PARTY.txt file, fixes #1145 by @jnioche in #1146
- Remove coveralls maven plugin, fixes #1148 by @jnioche in #1149
- OpenSearch - better handling of mappings by @jnioche in #1155
- Delete CODE_OF_CONDUCT.md by @pjfanning in #1158
- Create DISCLAIMER by @pjfanning in #1159
- Update NOTICE by @pjfanning in #1160
- Changed package names to org.apache by @jnioche in #1165
- Create .asf.yaml by @pjfanning in #1161
- Fix #1174 - Exclude optional artifact from storm-hdfs by @rzo1 in #1175
- Fix #1164 - Change license headers by @rzo1 in #1173
- Removed devs section from pom.xml by @jnioche in #1181
- Fix #1167 - Remove Elasticsearch module by @rzo1 in #1182
- Remove hyphens in storm-crawler by @jnioche in #1177
- Fixes #1178 "Set version to 3.0-SNAPSHOT" by @rzo1 in #1183
- Fixes #1169 - Use Apache Parent POM & Enable RAT by @rzo1 in #1180
- Removed ref to Discord in README by @jnioche in #1184
- Fix #1168 - Add a modified version of CONTRIBUTING.md by @rzo1 in #1186
- Fix #1163 - Change the GitHub templates for PRs to be more ASF specific by @rzo1 in #1185
- Upgrade to Storm 2.6.2, fix #1188 by @jnioche in #1189
- link to ASF web site .asf.yaml by @pjfanning in #1192
- Update README.md by @jnioche in #1195
- 1200 - Fix license headers by @jnioche in #1201
- #1197 - Allow to disable SSL/TLS verification in OpenSearchConnection by @rzo1 in #1199
- Fix #1202 - Add release documentation and comply with source package naming requirements by @rzo1 in #1203
- #1207 -- add forbidden-apis by @tballison in #1208
- #1209 fix for emulation error in tests run on silicon by @joshfischer1108 in #1210
- Resolves #1211 "Fix License Header" by @rzo1 in #1212
- #1205 update archetype in README by @joshfischer1108 in #1206
- Introduce "skip.format.code" to skip code formatting by default by @rzo1 in #1213
New Contributors
- @pjfanning made their first contribution in #1158
- @tballison made their first contribution in #1208
- @joshfischer1108 made their first contribution in #1210
Full Changelog: 2.11...stormcrawler-3.0
StormCrawler 2.11
Disclaimer
This is a Pre-ASF release and did not undergo a formal review by the PMC.
What's Changed
- Upgrade to OpenSearch 2.11 #1113 by @jnioche in #1114
- Use mock server for selenium tests, fix #1116 by @jnioche in #1119
- Issue #728: Adding asterisk for metadata transfer by @michaeldinzinger in #1117
- WARCSpout loads inputs using HDFS by @jnioche in #1122
- Fix wrong most recent date was set by @chhsiao90 in #1126
- Glob field mapping for indexer.md.mapping by @jnioche in #1130
- Add committer statement by @michaeldinzinger in #1134
- Implement configurable getDocumentID in DeletionBolt by @chhsiao90 in #1135
- Add two tests for SiteMapParserBolt by @michaeldinzinger in #1138
- dependency upgrades by @jnioche in #1139
New Contributors
- @chhsiao90 made their first contribution in #1126
Full Changelog: 2.10...2.11
What's new in StormCrawler 2.10
Disclaimer
This is a Pre-ASF release and did not undergo a formal review by the PMC.
What's Changed
- Selenium test by @jnioche in #1093
- refactoring timeouts Selenium by @jnioche in #1102
- Improvements and fixes to HttpRobotRulesParser when following redirects by @sebastian-nagel in #1103
and a lot more!
Full Changelog: 2.9...2.10
See https://digitalpebble.blogspot.com/2023/10/focus-on-protocol-improvements-in.html for more details on the protocol improvements
What's new in StormCrawler 2.9
Disclaimer
This is a Pre-ASF release and did not undergo a formal review by the PMC.
What's Changed
- Change HttpProtocol to defer to configured values for retryOnConnectionFailure and followRedirects by @ndtreviv in #1056
- Cache redirected robots.txt for target host only if path is /robots.txt and query is empty by @sebastian-nagel in #1057
- Issue #1043: Fixing problems after restart of Frontier service by @michaeldinzinger in #1054
- #1049 Replace "Collapse and Expand Results" Solr query with "Result Grouping" query. by @syefimov in #1053
- OpenSearch 2.7.0 + renamed OpenSearchConnection by @jnioche in #1064
- BasicURLNormalizer .unmangleQueryString() returns invalid results if "&" symbol in a parents path #1059 by @syefimov in #1062
- Dependency upgrades. fixes #1066 by @jnioche in #1067
- Automatic creation of index definitions should use the bolt type by @jnioche in #1069
- mechanism to retrieve more generic value of configuration by @jnioche in #1071
- Create DeletionBolt.java for Solr. #1050 by @syefimov in #1073
- Increase the number of redirects to 5 for Robots.txt fetching by @michaeldinzinger in #1074
- Issue #1042: Adapt parsing of robots.txt files by @michaeldinzinger in #1055
- Test URL Filtering from the command line by @jnioche in #1081
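The robots.txt caching rule from #1057 in the list above (cache a redirected robots.txt for the target host only if the path is /robots.txt and the query is empty) can be written as a small predicate. This is a minimal sketch of the rule as the changelog entry describes it, not the code in `HttpRobotRulesParser`. The class and method names are hypothetical.

```java
import java.net.URI;

/** Hypothetical sketch of the caching rule described in #1057; not StormCrawler code. */
final class RobotsRedirectCacheSketch {

    /**
     * Returns true when the rules fetched via a redirect may also be cached
     * for the redirect target's host: the target path must be exactly
     * "/robots.txt" and the target must carry no query string.
     */
    static boolean cacheForTargetHost(URI redirectTarget) {
        String path = redirectTarget.getPath();
        String query = redirectTarget.getQuery();
        boolean isRobotsPath = "/robots.txt".equals(path);
        boolean hasNoQuery = query == null || query.isEmpty();
        return isRobotsPath && hasNoQuery;
    }

    public static void main(String[] args) {
        System.out.println(cacheForTargetHost(URI.create("https://www.example.com/robots.txt")));
        System.out.println(cacheForTargetHost(URI.create("https://www.example.com/robots.txt?lang=en")));
        System.out.println(cacheForTargetHost(URI.create("https://www.example.com/")));
    }
}
```

The point of the check is that a redirect to some other page, such as a home page or an error page, says nothing reliable about the target host's own robots.txt, so its content should not be cached under that host.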
New Contributors
- @michaeldinzinger made their first contribution in #1054
- @syefimov made their first contribution in #1053
Full Changelog: 2.8...2.9
What's new in StormCrawler 2.8
Disclaimer
This is a Pre-ASF release and did not undergo a formal review by the PMC.
What's Changed
- Enforce Java 11 in archetypes by @msghasan in #1029
- Fix #1027: Ensure SC can be built with Java 17 by @rzo1 in #1030
- Indexer ES document id by @Mikwiss in #1028
- JsoupFilter as Interface by @Mikwiss in #1026
- Create method to add SearchHit info to metadata by @Mikwiss in #1034
- Status ES document id by @Mikwiss in #1036
- Limit the amount of text to be returned by the text extraction, #1038 by @jnioche in #1039
- Allow override on HttpProtocol's method addHeadersToRequest by @Mikwiss in #1041
- Fixes #1045. Remove range syntax from snakeyaml by @rzo1 in #1046
- Fix #1032: Catch the exception inside the loop to avoid breaking if one remote instance is misbehaving by @rzo1 in #1047
Full Changelog: 2.7...2.8
What's new in StormCrawler 2.7
Disclaimer
This is a Pre-ASF release and did not undergo a formal review by the PMC.
What's Changed
- Dependency upgrades #1016
- OpenSearch module in #1011
- Maven archetype for OpenSearch
- [WARC] Backward compatible storage of HTTP/2 headers by @sebastian-nagel in #1010
- Ignore empty fields indexer in #1019
- Handle single quotes in value of http-equiv="refresh" #1020
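To illustrate the last item, a `<meta http-equiv="refresh">` tag may quote its target inside the `content` value, for example `0; url='https://example.com/next'`. The sketch below shows one way to extract the target and strip a matching pair of single or double quotes. It assumes an integer delay followed by `url=`. The class name and regular expression are illustrative and not taken from StormCrawler.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of meta-refresh parsing; not the StormCrawler implementation. */
final class MetaRefreshSketch {
    // Matches "<delay>; url=<target>" (or a comma separator), case-insensitively.
    private static final Pattern REFRESH = Pattern.compile(
            "^\\s*\\d+\\s*[;,]\\s*url\\s*=\\s*(.+?)\\s*$",
            Pattern.CASE_INSENSITIVE);

    /** Extracts the redirect target from a refresh content value, unquoting it if needed. */
    static Optional<String> refreshTarget(String content) {
        if (content == null) {
            return Optional.empty();
        }
        Matcher m = REFRESH.matcher(content);
        if (!m.matches()) {
            return Optional.empty();
        }
        String target = m.group(1);
        // Strip one pair of matching single or double quotes around the target.
        if (target.length() >= 2) {
            char first = target.charAt(0);
            char last = target.charAt(target.length() - 1);
            if ((first == '\'' || first == '"') && first == last) {
                target = target.substring(1, target.length() - 1);
            }
        }
        return target.isEmpty() ? Optional.empty() : Optional.of(target);
    }

    public static void main(String[] args) {
        System.out.println(refreshTarget("0; url='https://example.com/next'"));
        System.out.println(refreshTarget("5;URL=https://example.com/"));
        System.out.println(refreshTarget("30"));
    }
}
```

A value with a delay but no URL, such as `30`, yields an empty result, because it only reloads the current page.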
Full Changelog: 2.6...2.7
What's new in StormCrawler 2.6
Disclaimer
This is a Pre-ASF release and did not undergo a formal review by the PMC.
Highlights
- Using URLFrontier in archetype
- URLFilter becomes an abstract class
- Fixed deactivation of maxDepthFilter
- JSoupParserBolt improve performance of link extraction
- Multiple dependency upgrades
Full Changelog: storm-crawler-2.5...2.6
What's new in StormCrawler 2.5
Disclaimer
This is a Pre-ASF release and did not undergo a formal review by the PMC.
In a nutshell
- various dependency upgrades (JSoup, CrawlerCommons, Tika, Elasticsearch)
- Java 11
- bugfix: AggregationSpout sometimes does not release the IsInQuery boolean
- various improvements to URLFrontier module
In more detail
- FEATURE-964: custom crawl delay per page by @juli-alvarez in #967
- Issue 970 HttpProtocol doesn't consider http.content.limit in test for filesize by @wowasa in #972
- Add ChannelManager for local channel management and constants to Spout.java by @FelixEngl in #982
- Fix error when spaces in path to test-resources of StatusBoltTest in ElasticSearch-Module by @FelixEngl in #985
- Add unit test basics for URLFrontier. by @FelixEngl in #984
- Fix starvation and busy waiting of StatusUpdaterBolt.java, add Constants. by @FelixEngl in #983
- Fix starvation and busy waiting of ES StatusUpdaterBolt (Fixes #986) by @FelixEngl in #988
- Fix starvation and busy waiting of ES IndexerBolt by @FelixEngl in #989
- HttpProtocol use the md protocol.set-headers to add custom header by url by @Mikwiss in #993
Full Changelog: 2.4...storm-crawler-2.5
StormCrawler 2.4
Disclaimer
This is a Pre-ASF release and did not undergo a formal review by the PMC.
- Upgrade to Apache Storm 2.4
- Upgrade to Elasticsearch 7.17.2
- Bugfix: setting "maxDepth": 0 in urlfilter.json prevents ES seed injection #959
- Allow compatibility.mode for rest client to connect to ES8+ #962
Full Changelog: 2.3...2.4
StormCrawler 2.3
Disclaimer
This is a Pre-ASF release and did not undergo a formal review by the PMC.
See https://digitalpebble.blogspot.com/2022/03/whats-new-in-stormcrawler-23.html for more details
What's Changed
- Bump xercesImpl from 2.12.1 to 2.12.2 in /core by @dependabot in #942
- General Code Refactoring and Good Practices by @FelixEngl in #937
- Add unified way of initializing classes via string and configuring them. by @FelixEngl in #943
- Rewrote LinkParseFilter + added XPathFilter + tests for JSOUPFilters by @jnioche in #953
- ISSUE-954: Issue with the order of emit and emitOutlink for redirections in FetcherBolt by @juli-alvarez in #955
New Contributors
- @FelixEngl made their first contribution in #937
Full Changelog: 2.2...2.3