wish: WARCHdfsBolt with CDX index #567

@dportabella

StormCrawler can filter web pages and archive them as WARC files, as follows:

// Bolt that writes the pages it receives to WARC files
WARCHdfsBolt warcbolt = (WARCHdfsBolt) new WARCHdfsBolt().withFileNameFormat(fileNameFormat);

TopologyBuilder builder = new TopologyBuilder();

// Feed the WARC bolt from both the "parse" and "tika" bolts
builder.setBolt("warc", warcbolt, numWorkers)
  .localOrShuffleGrouping("parse", WarcStreamName)
  .localOrShuffleGrouping("tika",  WarcStreamName);

Would it be possible to create a CDX index (or JCDX index) for the WARC archives at the same time?
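
To make the request concrete, here is a rough sketch of the kind of line such an index could hold for each archived record, using the common 11-field CDX layout (`CDX N b a m s k r M S V g`). It is an illustration only: the class `CdxLineSketch`, its methods, and all sample values are hypothetical, the URL key is a simplified SURT-style form rather than a full canonicalization, and nothing here uses StormCrawler's API. The record offset and compressed length would presumably have to come from whatever writes the record into the WARC file.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of StormCrawler.
public class CdxLineSketch {

    // Header naming the 11 fields: url key, timestamp, original URL, MIME type,
    // status, digest, redirect, meta tags, compressed size, offset, file name.
    static final String HEADER = " CDX N b a m s k r M S V g";

    /** Simplified SURT-style key: reversed host labels, then path and query. */
    static String urlKey(String url) {
        URI uri = URI.create(url);
        List<String> labels =
                Arrays.asList(uri.getHost().toLowerCase(Locale.ROOT).split("\\."));
        Collections.reverse(labels);
        String path = uri.getRawPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }
        String query = uri.getRawQuery() == null ? "" : "?" + uri.getRawQuery();
        return String.join(",", labels) + ")" + path + query;
    }

    /** Builds one space-separated CDX line; redirect and meta tags are left as "-". */
    static String cdxLine(String url, String timestamp14, String mime, int status,
                          String digest, long compressedLength, long offset,
                          String warcFile) {
        return String.join(" ",
                urlKey(url), timestamp14, url, mime, Integer.toString(status),
                digest, "-", "-",
                Long.toString(compressedLength), Long.toString(offset), warcFile);
    }

    public static void main(String[] args) {
        // All values below are made up for illustration.
        System.out.println(HEADER);
        System.out.println(cdxLine("http://www.example.com/a?b=1", "20180101120000",
                "text/html", 200, "EXAMPLEDIGEST", 1234L, 5678L,
                "crawl-00001.warc.gz"));
    }
}
```

With an index like this, a replay tool could look up a URL, read the file name and offset, and seek directly to the record instead of scanning the whole archive.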
