catalogue-pipeline

The catalogue pipeline creates the search index for our unified collections search. It populates an Elasticsearch index with data which can then be read by our catalogue API. This allows users to search data from all our catalogues in one place, rather than searching multiple systems which each have different views of the data.

Requirements

The catalogue pipeline is designed to:

- Create a single search index for records from all our source systems (including image collections, library catalogue, and archive records)
- Stay up-to-date with updates and changes in those source systems
- Transform those records into a common model
- Combine records from different systems that refer to the same object

High-level design

We have a series of "adapters" that fetch records from our source systems. The adapters are responsible for staying up-to-date with changes in the source systems.

The adapters feed a transformation pipeline, which transforms source records into a common model, adds a pipeline identifier, and combines records from different systems. The structure and logic of the transformation pipeline evolve over time, as we find new and better ways to transform the data.

Once the transformation pipeline has finished processing the records, it stores them in a search index, which can be read by the catalogue API.
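
The flow described above (adapters, then transformation and merging, then indexing) can be sketched in a few lines. This is only an illustrative sketch, not the pipeline's actual code: the names `SourceRecord`, `Work`, `same_as`, `make_pipeline_id` and `run_pipeline` are all hypothetical, the hash-based identifier is an assumption, and a plain dict stands in for the Elasticsearch index.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceRecord:
    source: str          # hypothetical source name, e.g. "library" or "archive"
    source_id: str
    title: str
    same_as: tuple = ()  # keys of records in other systems describing the same object


@dataclass
class Work:
    """Hypothetical common model shared by every source system."""
    pipeline_id: str
    title: str
    source_keys: list


def source_key(record):
    return f"{record.source}/{record.source_id}"


def make_pipeline_id(key):
    # Assumption: a short deterministic ID derived from the source key.
    return hashlib.sha256(key.encode()).hexdigest()[:8]


def transform(record):
    key = source_key(record)
    return Work(make_pipeline_id(key), record.title, [key])


def merge(records):
    """Group records linked through same_as, then emit one Work per group."""
    parent = {source_key(r): source_key(r) for r in records}

    def find(key):
        # Union-find lookup with path halving.
        while parent[key] != key:
            parent[key] = parent[parent[key]]
            key = parent[key]
        return key

    for record in records:
        for other in record.same_as:
            if other in parent:
                parent[find(source_key(record))] = find(other)

    groups = {}
    for record in records:
        groups.setdefault(find(source_key(record)), []).append(record)

    works = []
    for members in groups.values():
        members.sort(key=source_key)
        work = transform(members[0])
        work.source_keys = [source_key(m) for m in members]
        works.append(work)
    return works


def run_pipeline(adapters, index):
    """Fetch from every adapter, merge, and store the results in the index."""
    records = [record for adapter in adapters for record in adapter()]
    for work in merge(records):
        index[work.pipeline_id] = work
    return index


# Usage: two hypothetical adapters, one of which links to the other's record.
library = lambda: [SourceRecord("library", "b1", "Anatomy notes")]
archive = lambda: [
    SourceRecord("archive", "a9", "Anatomy notes", same_as=("library/b1",)),
    SourceRecord("archive", "a10", "Letters"),
]
index = run_pipeline([library, archive], {})
```

In this sketch the three source records end up as two entries in the index, because the linked library and archive records are combined into one.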

The catalogue pipeline runs entirely in AWS, with no on-premises infrastructure required.

Usage

We always have at least one pipeline which is populating the currently-live search index, but we may have more than one pipeline running at a time.

Running multiple pipelines means we can try experiments or breaking changes in a new pipeline, and keep them isolated from the live search index (and the public API). Over time, newer pipelines replace older pipelines, and the older pipelines are deleted.
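
One way to picture this arrangement is a registry in which each pipeline writes to its own index and a single pointer marks which one is live. The sketch below is a toy model under that assumption; `PipelineRegistry` and its methods are hypothetical names, and the real switch-over mechanism is not described here.

```python
class PipelineRegistry:
    """Toy model: one index per pipeline, with a pointer to the live one."""

    def __init__(self):
        self.indexes = {}  # pipeline name -> index contents
        self.live = None

    def create(self, name):
        if name in self.indexes:
            raise ValueError(f"pipeline {name!r} already exists")
        self.indexes[name] = {}
        if self.live is None:
            self.live = name  # the first pipeline becomes the live one
        return self.indexes[name]

    def promote(self, name):
        if name not in self.indexes:
            raise KeyError(name)
        self.live = name

    def delete(self, name):
        # There must always be a pipeline behind the live index.
        if name == self.live:
            raise ValueError("refusing to delete the live pipeline")
        del self.indexes[name]

    def search(self, work_id):
        """What the public API would see: only the live index."""
        return self.indexes[self.live].get(work_id)


registry = PipelineRegistry()
registry.create("pipeline-a")["w1"] = "old transformation"
registry.create("pipeline-b")["w1"] = "experimental transformation"

before = registry.search("w1")   # the experiment is invisible to readers
registry.promote("pipeline-b")
after = registry.search("w1")    # the newer pipeline is now live
registry.delete("pipeline-a")    # the older pipeline can be removed
```

Because readers only ever go through the live pointer, a new pipeline can be built and checked in isolation before it replaces the old one.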

We publish our source code so that other people can learn from it, but it's very unlikely anybody would want to run it themselves. It contains a lot of Wellcome-specific logic, and would need extensive modification to be useful elsewhere.

Development

See docs/developers.md.

License

MIT.
