
JSoupParserBolt

Paul Armstrong edited this page Mar 17, 2021 · 2 revisions

The JSoupParserBolt parses HTML documents and extracts the outlinks, text and metadata they contain. To parse non-HTML documents, use the Tika-based ParserBolt from the external modules.

This parser calls the URLFilters and ParseFilters defined in the configuration. Please note that it calls MetadataTransfer before calling the ParseFilters. If you create new Outlinks in your ParseFilters, you will need to use MetadataTransfer there yourself so that they inherit the Metadata from the parent document.
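To illustrate why outlinks created inside a ParseFilter need their own metadata transfer, here is a minimal sketch that uses only the Java standard library. The class `OutlinkMetadataSketch`, the method `inheritForOutlink` and the metadata keys are hypothetical stand-ins chosen for illustration; they are not StormCrawler's API and only model the idea of copying selected parent metadata onto a newly created outlink.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model, NOT StormCrawler classes: metadata is represented
// as a plain map from key to list of values.
final class OutlinkMetadataSketch {

    // Builds the metadata for a new outlink by copying only the
    // configured keys from the parent document's metadata.
    static Map<String, List<String>> inheritForOutlink(
            Map<String, List<String>> parent, Set<String> keysToTransfer) {
        Map<String, List<String>> child = new HashMap<>();
        for (String key : keysToTransfer) {
            List<String> values = parent.get(key);
            if (values != null) {
                child.put(key, List.copyOf(values));
            }
        }
        return child;
    }

    public static void main(String[] args) {
        Map<String, List<String>> parent = new HashMap<>();
        parent.put("depth", List.of("1"));          // assumed key, to be inherited
        parent.put("parse.title", List.of("Home")); // assumed key, not inherited

        // An outlink created late (for example in a custom filter) starts
        // with no metadata unless the transfer step is applied explicitly.
        Map<String, List<String>> outlinkMd =
                inheritForOutlink(parent, Set.of("depth"));
        System.out.println(outlinkMd);
    }
}
```

In this model, an outlink built without calling `inheritForOutlink` would carry an empty map, which mirrors the situation described above for Outlinks created after the parser has already run its transfer step.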

The JSoupParserBolt automatically identifies the charset of the documents. It uses the status stream both to report parsing errors and to emit the outlinks it extracts from a page. These would typically be consumed by an extension of AbstractStatusUpdaterBolt and persisted in some form of storage.
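The dual use of the status stream described above can be modelled with a short, self-contained sketch. Everything in it is an assumption made for illustration: `StatusStreamSketch`, `StatusTuple`, `toStatusTuples` and the two status values are hypothetical names and do not reproduce the bolt's actual code or tuple layout.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of what a parser could send on a status stream.
final class StatusStreamSketch {

    // Assumed status values for this sketch only.
    enum Status { DISCOVERED, ERROR }

    // One entry on the modelled status stream: a URL and its status.
    static final class StatusTuple {
        final String url;
        final Status status;

        StatusTuple(String url, Status status) {
            this.url = url;
            this.status = status;
        }
    }

    // A failed parse reports the page itself as an error; a successful
    // parse reports each extracted outlink as newly discovered.
    static List<StatusTuple> toStatusTuples(
            String pageUrl, List<String> outlinks, boolean parseFailed) {
        List<StatusTuple> tuples = new ArrayList<>();
        if (parseFailed) {
            tuples.add(new StatusTuple(pageUrl, Status.ERROR));
            return tuples;
        }
        for (String link : outlinks) {
            tuples.add(new StatusTuple(link, Status.DISCOVERED));
        }
        return tuples;
    }

    public static void main(String[] args) {
        List<StatusTuple> tuples = toStatusTuples(
                "https://example.com/",
                List.of("https://example.com/a", "https://example.com/b"),
                false);
        for (StatusTuple t : tuples) {
            System.out.println(t.status + " " + t.url);
        }
    }
}
```

A downstream component playing the role of a status updater would then read these entries and store the URL together with its status.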