nlnwa/solrwayback-adaption

SolrWayback container images

This repository builds and publishes container images from SolrWayback releases.

Images

SolrWayback

docker run -p 8080:8080 ghcr.io/nlnwa/solrwayback

This starts SolrWayback, which can be accessed at http://localhost:8080/solrwayback. The image is configured to access the Solr collection at http://localhost:8983/solr/netarchivebuilder, so when it is run in isolation (with no Solr reachable at that address) it will show an error message.
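To avoid that error, Solr has to be reachable at the configured address before SolrWayback starts. The following is a minimal sketch, not a tested recipe. It assumes a Linux host with Solr already running on the host's port 8983, that the collection exposes Solr's implicit `/admin/ping` handler, and that `curl` is installed. The function names are hypothetical helpers, not part of the image.

```shell
# Build the ping URL for a Solr collection URL (a trailing slash is tolerated).
solr_ping_url() {
  printf '%s/admin/ping\n' "${1%/}"
}

# Succeed only if Solr answers the ping request with a success status.
solr_is_up() {
  curl --fail --silent --output /dev/null "$(solr_ping_url "$1")"
}

# Start SolrWayback only when the configured Solr collection responds.
# Assumption: --network host (Linux) makes the container share the host's
# network, so "localhost:8983" inside the container is the host's Solr.
start_solrwayback() {
  solr_url="http://localhost:8983/solr/netarchivebuilder"
  if solr_is_up "$solr_url"; then
    docker run --network host ghcr.io/nlnwa/solrwayback
  else
    echo "Solr is not reachable at $solr_url" >&2
    return 1
  fi
}
```

Calling `start_solrwayback` then either starts the container or reports that Solr is missing. With host networking the `-p 8080:8080` mapping is not needed, because the container listens directly on the host's port 8080.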

WARC indexer

$ docker run ghcr.io/nlnwa/warc-indexer -h

warc-indexer.sh

Parallel processing of WARC files using webarchive-discovery from UKWA:
https://github.com/ukwa/webarchive-discovery

The scripts keeps track of already processed WARCs by keeping the output
logs from processing of each WARC. These are stored in the folder
/opt/warc-indexer/status


Usage: ./warc-indexer.sh [warc|warc-folder]*


Index 2 WARC files:

  ./warc-indexer.sh mywarcfile1.warc.gz mywarcfile2.warc.gz

Index all WARC files in "folder_with_warc_files" (recursive descend) using
20 threads (this will take 20GB of memory):

  THREADS=20 ./warc-indexer.sh folder_with_warc_files

Index all WARC files in "folder_with_warc_files" (recursive descend) using
20 threads and with an alternative Solr as receiver:

  THREADS=20 SOLR_URL="http://ourcloud.internal:8123/solr/netarchive" ./warc-indexer.sh folder_with_warc_files

Note:
Each thread starts its own Java process with -Xmx1024M.
Make sure that there is enough memory on the machine.

Tweaks:
  SOLR_URL:       The receiving Solr end point, including collection
                  Value: http://localhost:8983/solr/netarchivebuilder

  SOLR_CHECK:     Check whether Solr is available before processing
                  Value: true

  SOLR_COMMIT:    Whether a Solr commit should be issued after indexing to
                  flush the buffers and make the changes immediately visible
                  Value: true

  THREADS:        The number of concurrent processes to use for indexing
                  Value: 2

  STATUS_ROOT:    Where to store log files from processing. The log files are
                  also used to track which WARCs has been processed
                  Value: /opt/warc-indexer/status

  TMP_ROOT:       Where to store temporary files during processing
                  Value: /opt/warc-indexer/status/tmp

  INDEXER_JAR:    The location of the warc-indexer Java tool
                  Value: /opt/warc-indexer/warc-indexer-3.3.1-jar-with-dependencies.jar

  INDEXER_MEM:    Memory allocation for each builder job
                  Value: 1024M

  INDEXER_CONFIG: Configuration for the warc-indexer Java tool
                  Value: /opt/warc-indexer/config3.conf

  INDEXER_CUSTOM: Custom command line options for the warc-indexer tool
                  Value: ""
                  Sample: "--collection yearly2020"
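The help text above describes running the script directly. Running it from the container image might look like the sketch below. It assumes the image passes its arguments to `warc-indexer.sh` (as the `-h` call above suggests) and that the tweaks are read from environment variables. The `/warcs` mount point and the helper names are hypothetical. Note that `localhost` inside the container is the container itself, so the Solr URL has to be an address the container can reach.

```shell
# Rough heap estimate: one Java process per thread, each with its own heap.
# Arguments: thread count, per-job heap in MB. Prints the total heap in MB.
estimate_heap_mb() {
  echo $(( $1 * $2 ))
}

# Index a host folder of WARC files with the container image.
# Arguments: host WARC folder, host status folder, Solr collection URL.
index_warcs() {
  warc_dir=$1
  status_dir=$2
  solr_url=$3
  # /warcs is an illustrative mount point. Mounting the status folder keeps
  # the processing logs on the host, so already processed WARCs can be
  # recognised on later runs.
  docker run --rm \
    -e THREADS="${THREADS:-2}" \
    -e SOLR_URL="$solr_url" \
    -v "$warc_dir:/warcs:ro" \
    -v "$status_dir:/opt/warc-indexer/status" \
    ghcr.io/nlnwa/warc-indexer /warcs
}
```

For example, `estimate_heap_mb 20 1024` prints 20480, which matches the 20GB figure in the help text. This counts Java heap only, so actual memory use per process will be somewhat higher. A run could then be started with `THREADS=4 index_warcs "$PWD/warcs" "$PWD/status" "http://ourcloud.internal:8123/solr/netarchive"`, reusing the sample Solr URL from the help text.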

Solr

Use the official Solr image.

To run Solr in Kubernetes, see the official Kubernetes operator for Apache Solr.

Previously, this repository contained a Dockerfile that built a Solr container image (based on the official image) with the "netarchivebuilder" configset from the SolrWayback bundle, version 4.4.2.

Docker Compose

See docker-compose.

TODO

  • Kubernetes deployment files and examples.