Bean Reference¶
Note
This reference is a work in progress and does not yet cover all available beans. For a more complete list of Heritrix beans please refer to the javadoc.
Core Beans¶
ActionDirectory¶
Directory watched for new files. Depending on their extension, each new file is processed with regard to the current crawl and then renamed with a datestamp into the ‘done’ directory. Currently supports:
- .seeds(.gz)
add each URI found in file as a new seed (to be crawled
if not already; to affect scope if appropriate).
- (.s).recover(.gz)
treat as traditional recovery log: consider all ‘Fs’-tagged lines
included, then try-rescheduling all ‘F+’-tagged lines. (If “.s.”
present, try scoping URIs before including/scheduling.)
- (.s).include(.gz)
add each URI found in a recover-log like file (regardless of its
tagging) to the frontier’s alreadyIncluded filter, preventing them
from being recrawled. (‘.s.’ indicates to apply scoping.)
- (.s).schedule(.gz)
add each URI found in a recover-log like file (regardless of its
tagging) to the frontier’s queues. (‘.s.’ indicates to apply
scoping.)
Future support planned:
- .robots: invalidate robots ASAP
- (?) .block: block-all on named site(s)
- .overlay: add new overlay settings
- .js .rb .bsh etc - execute arbitrary script (a la ScriptedProcessor)
<bean id="actionDirectory" class="org.archive.crawler.framework.ActionDirectory">
<!-- <property name="initialDelaySeconds" value="10" /> -->
<!-- <property name="delaySeconds" value="30" /> -->
<!-- <property name="actionDir" value="" /> -->
<!-- <property name="doneDir" value="" /> -->
<!-- <property name="applicationContext" value="" /> -->
<!-- <property name="seeds" value="" /> -->
<!-- <property name="frontier" value="" /> -->
</bean>
- initialDelaySeconds
- (int) how long after crawl start to first scan action directory
- delaySeconds
- (int) delay between scans of actionDirectory for new files
- actionDir
- (ConfigPath)
- doneDir
- (ConfigPath)
- applicationContext
- (ApplicationContext)
- seeds
- (SeedModule)
- frontier
- (Frontier) autowired frontier for actions
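For example, a job might enable the action directory with explicit directories and scan timing; the directory names below are illustrative, not defaults:
<bean id="actionDirectory" class="org.archive.crawler.framework.ActionDirectory">
  <property name="actionDir" value="action" />
  <property name="doneDir" value="actions-done" />
  <property name="initialDelaySeconds" value="10" />
  <property name="delaySeconds" value="30" />
</bean>
Dropping a file such as extra-urls.seeds into the action directory would then add each URI in that file as a new seed on the next scan.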
BdbCookieStore¶
Cookie store using bdb for storage. Cookies are stored in a SortedMap keyed by #sortableKey(Cookie), so they are grouped together by domain. #cookieStoreFor(String) returns a facade whose CookieStore#getCookies() returns a list of cookies limited to the supplied host and parent domains, if applicable.
<bean id="bdbCookieStore" class="org.archive.modules.fetcher.BdbCookieStore">
<!-- <property name="bdbModule" value="" /> -->
<!-- <property name="recoveryCheckpoint" value="" /> -->
</bean>
- bdbModule
- (BdbModule)
- recoveryCheckpoint
- (Checkpoint)
BdbFrontier¶
A Frontier using several BerkeleyDB JE Databases to hold its record of known hosts (queues), and pending URIs.
<bean id="bdbFrontier" class="org.archive.crawler.frontier.BdbFrontier">
<!-- <property name="bdbModule" value="" /> -->
<!-- <property name="beanName" value="" /> -->
<!-- <property name="dumpPendingAtClose" value="false" /> -->
<!-- <property name="recoveryCheckpoint" value="" /> -->
</bean>
- bdbModule
- (BdbModule)
- beanName
- (String)
- dumpPendingAtClose
- (boolean)
- recoveryCheckpoint
- (Checkpoint)
BdbModule¶
Utility module for managing a shared BerkeleyDB-JE environment
<bean id="bdbModule" class="org.archive.bdb.BdbModule">
<!-- <property name="dir" value="" /> -->
<!-- <property name="cachePercent" value="1" /> -->
<!-- <property name="cacheSize" value="1" /> -->
<!-- <property name="useSharedCache" value="true" /> -->
<!-- <property name="maxLogFileSize" value="10000000" /> -->
<!-- <property name="expectedConcurrency" value="64" /> -->
<!-- <property name="cleanerThreads" value="" /> -->
<!-- <property name="evictorCoreThreads" value="1" /> -->
<!-- <property name="evictorMaxThreads" value="1" /> -->
<!-- <property name="useHardLinkCheckpoints" value="true" /> -->
<!-- <property name="recoveryCheckpoint" value="" /> -->
</bean>
- dir
- (ConfigPath)
- cachePercent
- (int)
- cacheSize
- (int)
- useSharedCache
- (boolean)
- maxLogFileSize
- (long)
- expectedConcurrency
- (int) Expected number of concurrent threads; used to tune nLockTables according to JE FAQ http://www.oracle.com/technology/products/berkeley-db/faq/je_faq.html#33
- cleanerThreads
- (int)
- evictorCoreThreads
- (int) Configure the number of evictor threads (-1 means use the default) https://docs.oracle.com/cd/E17277_02/html/java/com/sleepycat/je/EnvironmentConfig.html#EVICTOR_CORE_THREADS
- evictorMaxThreads
- (int) Configure the maximum number of evictor threads (-1 means use the default) https://docs.oracle.com/cd/E17277_02/html/java/com/sleepycat/je/EnvironmentConfig.html#EVICTOR_MAX_THREADS
- useHardLinkCheckpoints
- (boolean) Whether to use hard-links to log files to collect/retain the BDB log files needed for a checkpoint. Default is true. May not work on Windows (especially on pre-NTFS filesystems). If false, the BDB ‘je.cleaner.expunge’ value will be set to ‘false’ as well, meaning BDB will *not* delete obsolete JDB files, but only rename them with a ‘.DEL’ extension. They will have to be manually deleted to free disk space, but .DEL files referenced in any checkpoint’s ‘jdbfiles.manifest’ should be retained to keep the checkpoint valid.
- recoveryCheckpoint
- (Checkpoint)
BdbServerCache¶
ServerCache backed by BDB big maps; the usual choice for crawls.
<bean id="bdbServerCache" class="org.archive.modules.net.BdbServerCache">
<!-- <property name="bdbModule" value="" /> -->
<!-- <property name="recoveryCheckpoint" value="" /> -->
</bean>
- bdbModule
- (BdbModule)
- recoveryCheckpoint
- (Checkpoint)
BdbUriUniqFilter¶
A BDB implementation of an AlreadySeen list. This implementation performs adequately without blowing out
the heap. See
AlreadySeen. Makes keys that have URIs from the same server close to each other. Mercator
and section 2.3.5 ‘Eliminating Already-Visited URLs’ in ‘Mining the Web’ by Soumen
Chakrabarti talk of a two-level key with the first 24 bits a hash of the
host plus port and with the last 40 as a hash of the path. Testing
showed adoption of such a scheme halving lookup times (this implementation
actually concatenates scheme + host in the first 24 bits and path + query in
the trailing 40 bits).
<bean id="bdbUriUniqFilter" class="org.archive.crawler.util.BdbUriUniqFilter">
<!-- <property name="bdbModule" value="" /> -->
<!-- <property name="beanName" value="" /> -->
<!-- <property name="recoveryCheckpoint" value="" /> -->
</bean>
- bdbModule
- (BdbModule)
- beanName
- (String)
- recoveryCheckpoint
- (Checkpoint)
CheckpointService¶
Executes checkpoints, and offers convenience methods for enumerating
available Checkpoints and injecting a recovery-Checkpoint after
build and before launch (setRecoveryCheckpointByName). Offers optional automatic checkpointing at a configurable interval
in minutes.
<bean id="checkpointService" class="org.archive.crawler.framework.CheckpointService">
<!-- <property name="checkpointsDir" value="" /> -->
<!-- <property name="checkpointIntervalMinutes" value="1" /> -->
<!-- <property name="forgetAllButLatest" value="false" /> -->
<!-- <property name="recoveryCheckpoint" value="" /> -->
<!-- <property name="crawlController" value="" /> -->
<!-- <property name="applicationContext" value="" /> -->
<!-- <property name="recoveryCheckpointByName" value="" /> -->
</bean>
- checkpointsDir
- (ConfigPath) Checkpoints directory
- checkpointIntervalMinutes
- (long) Period at which to create automatic checkpoints; -1 means no auto checkpointing.
- forgetAllButLatest
- (boolean) True to save only the latest checkpoint, false to save all of them. Default is false.
- recoveryCheckpoint
- (Checkpoint)
- crawlController
- (CrawlController)
- applicationContext
- (ApplicationContext)
- recoveryCheckpointByName
- (String) Given the name of a valid checkpoint subdirectory in the checkpoints directory, create a Checkpoint instance, and insert it into all Checkpointable beans.
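For example, to checkpoint automatically once an hour and retain only the most recent checkpoint (the interval value is illustrative):
<bean id="checkpointService" class="org.archive.crawler.framework.CheckpointService">
  <property name="checkpointIntervalMinutes" value="60" />
  <property name="forgetAllButLatest" value="true" />
</bean>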
CrawlController¶
CrawlController collects all the classes which cooperate to
perform a crawl and provides a high-level interface to the
running crawl. As the “global context” for a crawl, subcomponents will
often reach each other through the CrawlController.
<bean id="crawlController" class="org.archive.crawler.framework.CrawlController">
<!-- <property name="applicationContext" value="" /> -->
<!-- <property name="metadata" value="" /> -->
<!-- <property name="serverCache" value="" /> -->
<!-- <property name="frontier" value="" /> -->
<!-- <property name="scratchDir" value="" /> -->
<!-- <property name="statisticsTracker" value="" /> -->
<!-- <property name="seeds" value="" /> -->
<!-- <property name="fetchChain" value="" /> -->
<!-- <property name="dispositionChain" value="" /> -->
<!-- <property name="candidateChain" value="" /> -->
<!-- <property name="maxToeThreads" value="" /> -->
<!-- <property name="runWhileEmpty" value="false" /> -->
<!-- <property name="pauseAtStart" value="true" /> -->
<!-- <property name="recorderOutBufferBytes" value="" /> -->
<!-- <property name="recorderInBufferBytes" value="" /> -->
<!-- <property name="loggerModule" value="" /> -->
<!-- <property name="recoveryCheckpoint" value="" /> -->
</bean>
- applicationContext
- (ApplicationContext)
- metadata
- (CrawlMetadata)
- serverCache
- (ServerCache)
- frontier
- (Frontier) The frontier to use for the crawl.
- scratchDir
- (ConfigPath) Scratch directory for temporary overflow-to-disk
- statisticsTracker
- (StatisticsTracker) Statistics tracking modules. Any number of specialized statistics trackers that monitor a crawl and write logs, reports and/or provide information to the user interface.
- seeds
- (SeedModule)
- fetchChain
- (FetchChain) Fetch chain
- dispositionChain
- (DispositionChain) Disposition chain
- candidateChain
- (CandidateChain) Candidate chain
- maxToeThreads
- (int) Maximum number of threads processing URIs at the same time.
- runWhileEmpty
- (boolean) whether to keep running (without pause or finish) when frontier is empty
- pauseAtStart
- (boolean) whether to pause at crawl start
- recorderOutBufferBytes
- (int) Size in bytes of in-memory buffer to record outbound traffic. One such buffer is reserved for every ToeThread.
- recorderInBufferBytes
- (int) Size in bytes of in-memory buffer to record inbound traffic. One such buffer is reserved for every ToeThread.
- loggerModule
- (CrawlerLoggerModule)
- recoveryCheckpoint
- (Checkpoint)
CrawlerLoggerModule¶
Module providing all expected whole-crawl logging facilities
<bean id="crawlerLoggerModule" class="org.archive.crawler.reporting.CrawlerLoggerModule">
<!-- <property name="path" value="" /> -->
<!-- <property name="logExtraInfo" value="false" /> -->
<!-- <property name="crawlLogPath" value="" /> -->
<!-- <property name="alertsLogPath" value="" /> -->
<!-- <property name="progressLogPath" value="" /> -->
<!-- <property name="uriErrorsLogPath" value="" /> -->
<!-- <property name="runtimeErrorsLogPath" value="" /> -->
<!-- <property name="nonfatalErrorsLogPath" value="" /> -->
<!-- <property name="upSimpleLog" value="" /> -->
<!-- <property name="recoveryCheckpoint" value="" /> -->
</bean>
- path
- (ConfigPath)
- logExtraInfo
- (boolean) Whether to include the “extra info” field for each entry in crawl.log. “Extra info” is arbitrary JSON. It is the last field of the log line.
- crawlLogPath
- (ConfigPath)
- alertsLogPath
- (ConfigPath)
- progressLogPath
- (ConfigPath)
- uriErrorsLogPath
- (ConfigPath)
- runtimeErrorsLogPath
- (ConfigPath)
- nonfatalErrorsLogPath
- (ConfigPath)
- upSimpleLog
- (String)
- recoveryCheckpoint
- (Checkpoint)
CrawlLimitEnforcer¶
Bean to enforce limits on the size of a crawl in URI count, byte count, or elapsed time. Listens for StatSnapshotEvents, so limits are only checked at the interval (configured in StatisticsTracker) of those events.
<bean id="crawlLimitEnforcer" class="org.archive.crawler.framework.CrawlLimitEnforcer">
<!-- <property name="maxBytesDownload" value="0" /> -->
<!-- <property name="maxNovelBytes" value="0" /> -->
<!-- <property name="maxNovelUrls" value="0" /> -->
<!-- <property name="maxWarcNovelUrls" value="0" /> -->
<!-- <property name="maxWarcNovelBytes" value="0" /> -->
<!-- <property name="maxDocumentsDownload" value="0" /> -->
<!-- <property name="maxTimeSeconds" value="0" /> -->
<!-- <property name="crawlController" value="" /> -->
</bean>
- maxBytesDownload
- (long) Maximum number of bytes to download. Once this number is exceeded the crawler will stop. A value of zero means no upper limit.
- maxNovelBytes
- (long) Maximum number of uncompressed payload bytes to write to WARC response or resource records. Once this number is exceeded the crawler will stop. A value of zero means no upper limit.
- maxNovelUrls
- (long) Maximum number of novel (not deduplicated) urls to download. Once this number is exceeded the crawler will stop. A value of zero means no upper limit.
- maxWarcNovelUrls
- (long) Maximum number of urls to write to WARC response or resource records. Once this number is exceeded the crawler will stop. A value of zero means no upper limit.
- maxWarcNovelBytes
- (long) Maximum number of novel (not deduplicated) bytes to write to WARC response or resource records. Once this number is exceeded the crawler will stop. A value of zero means no upper limit.
- maxDocumentsDownload
- (long) Maximum number of documents to download. Once this number is exceeded the crawler will stop. A value of zero means no upper limit.
- maxTimeSeconds
- (long) Maximum amount of time to crawl (in seconds). Once this much time has elapsed the crawler will stop. A value of zero means no upper limit.
- crawlController
- (CrawlController)
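For example, to stop a crawl after roughly 100 GB downloaded or seven days elapsed, whichever comes first (the limit values are illustrative; zero leaves a limit disabled):
<bean id="crawlLimitEnforcer" class="org.archive.crawler.framework.CrawlLimitEnforcer">
  <property name="maxBytesDownload" value="100000000000" />
  <property name="maxTimeSeconds" value="604800" />
</bean>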
CrawlMetadata¶
Basic crawl metadata, as consulted by functional modules and recorded in ARCs/WARCs.
<bean id="crawlMetadata" class="org.archive.modules.CrawlMetadata">
<!-- <property name="robotsPolicyName" value="obey" /> -->
<!-- <property name="availableRobotsPolicies" value="" /> -->
<!-- <property name="operator" value="" /> -->
<!-- <property name="description" value="" /> -->
<!-- <property name="userAgentTemplate" value="Mozilla/5.0 (compatible; heritrix/@VERSION@ +@OPERATOR_CONTACT_URL@)" /> -->
<!-- <property name="operatorFrom" value="" /> -->
<!-- <property name="operatorContactUrl" value="" /> -->
<!-- <property name="audience" value="" /> -->
<!-- <property name="organization" value="" /> -->
<!-- <property name="jobName" value="" /> -->
</bean>
- robotsPolicyName
- (String) Robots policy name
- availableRobotsPolicies
- (Map) Map of all available RobotsPolicies, by name, to choose from, assembled from declared instances in the configuration plus the standard ‘obey’ (aka ‘classic’) and ‘ignore’ policies.
- operator
- (String)
- description
- (String)
- userAgentTemplate
- (String)
- operatorFrom
- (String)
- operatorContactUrl
- (String)
- audience
- (String)
- organization
- (String)
- jobName
- (String)
CredentialStore¶
Front door to the credential store. Come here to get at credentials. See Credential
Store Design.
<bean id="credentialStore" class="org.archive.modules.credential.CredentialStore">
<!-- <property name="credentials" value="" /> -->
</bean>
- credentials
- (Map) Credentials used by Heritrix when authenticating. See http://crawler.archive.org/proposals/auth/ for background.
DecideRuleSequence¶
<bean id="decideRuleSequence" class="org.archive.modules.deciderules.DecideRuleSequence">
<!-- <property name="logToFile" value="false" /> -->
<!-- <property name="logExtraInfo" value="false" /> -->
<!-- <property name="loggerModule" value="" /> -->
<!-- <property name="rules" value="" /> -->
<!-- <property name="serverCache" value="" /> -->
<!-- <property name="beanName" value="" /> -->
</bean>
- logToFile
- (boolean) If enabled, log decisions to file named logs/{spring-bean-id}.log. Format
is: [timestamp] [decisive-rule-num] [decisive-rule-class] [decision]
[uri] [extraInfo]
Relies on Spring Lifecycle to initialize the log. Only top-level beans get the Lifecycle treatment from Spring, so bean must be top-level for logToFile to work. (This is true of other modules that support logToFile, and anything else that uses Lifecycle, as well.)
- logExtraInfo
- (boolean) Whether to include the “extra info” field for each entry in crawl.log. “Extra info” is a json object with entries “host”, “via”, “source” and “hopPath”.
- loggerModule
- (SimpleFileLoggerProvider)
- rules
- (List)
- serverCache
- (ServerCache)
- beanName
- (String)
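A DecideRuleSequence is typically used as the crawl scope: the rules are evaluated in order and the last rule to return a non-NONE decision is decisive. A minimal sketch in the spirit of the default profile (the exact rule list is a per-job choice):
<bean id="scope" class="org.archive.modules.deciderules.DecideRuleSequence">
  <property name="rules">
    <list>
      <bean class="org.archive.modules.deciderules.RejectDecideRule" />
      <bean class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule" />
      <bean class="org.archive.modules.deciderules.TooManyHopsDecideRule" />
      <bean class="org.archive.modules.deciderules.TransclusionDecideRule" />
      <bean class="org.archive.modules.deciderules.PathologicalPathDecideRule" />
      <bean class="org.archive.modules.deciderules.PrerequisiteAcceptDecideRule" />
    </list>
  </property>
</bean>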
DiskSpaceMonitor¶
Monitors the available space on the paths configured. If the available space
drops below a specified threshold a crawl pause is requested.
Monitoring is done via the java.io.File.getUsableSpace() method. This method
will sometimes fail on network attached storage, returning 0 bytes available
even if that is not actually the case.
Paths that do not resolve to actual filesystem folders or files will not be
evaluated (i.e. if java.io.File.exists() returns false, no further processing
is carried out on that File).
Paths are checked for available space whenever a StatSnapshotEvent occurs.
<bean id="diskSpaceMonitor" class="org.archive.crawler.monitor.DiskSpaceMonitor">
<!-- <property name="monitorPaths" value="" /> -->
<!-- <property name="pauseThresholdMiB" value="8192" /> -->
<!-- <property name="monitorConfigPaths" value="true" /> -->
<!-- <property name="crawlController" value="" /> -->
<!-- <property name="configPathConfigurer" value="" /> -->
</bean>
- monitorPaths
- (List)
- pauseThresholdMiB
- (long) Set the minimum amount of space that must be available on all monitored paths. If the amount falls below this pause threshold on any path the crawl will be paused.
- monitorConfigPaths
- (boolean) If enabled, all the paths returned by ConfigPathConfigurer#getAllConfigPaths() will be monitored in addition to any paths explicitly specified via #setMonitorPaths(List). Enabled (true) by default. Note: this is not guaranteed to contain all paths that Heritrix writes to. It is the responsibility of modules that write to disk to register their activity with the ConfigPathConfigurer and some may not do so.
- crawlController
- (CrawlController) Autowire access to CrawlController *
- configPathConfigurer
- (ConfigPathConfigurer) Autowire access to ConfigPathConfigurer *
RulesCanonicalizationPolicy¶
URI Canonicalization Policy
<bean id="rulesCanonicalizationPolicy" class="org.archive.modules.canonicalize.RulesCanonicalizationPolicy">
<!-- <property name="rules" value="" /> -->
</bean>
- rules
- (List)
SheetOverlaysManager¶
Manager which marks-up CrawlURIs with the names of all applicable Sheets, and returns overlay maps by name.
<bean id="sheetOverlaysManager" class="org.archive.crawler.spring.SheetOverlaysManager">
<!-- <property name="beanFactory" value="" /> -->
<!-- <property name="sheetsByName" value="" /> -->
</bean>
- beanFactory
- (BeanFactory)
- sheetsByName
- (Map) Collect all Sheets, by beanName.
StatisticsTracker¶
This is an implementation of the AbstractTracker. It is designed to function
with the WUI as well as performing various logging activity.
At the end of each snapshot a line is written to the
‘progress-statistics.log’ file.
The header of that file is as follows:
[timestamp] [discovered] [queued] [downloaded] [doc/s(avg)] [KB/s(avg)] [dl-failures] [busy-thread] [mem-use-KB]
discovered, queued, downloaded and dl-failures
are (respectively) the discovered URI count, pending URI count, successfully
fetched count and failed fetch count from the frontier at the time of the
snapshot.
KB/s(avg) is the bandwidth usage. We use the total bytes downloaded
to calculate average bandwidth usage (KB/sec). Since we also note the value
each time a snapshot is made we can calculate the average bandwidth usage
during the last snapshot period to gain a “current” rate. The first number is
the current and the average is in parentheses.
doc/s(avg) works the same way as KB/s(avg) except it shows the number of
documents (URIs) rather than KB downloaded.
busy-threads is the total number of ToeThreads that are not available
(and thus presumably busy processing a URI). This information is extracted
from the crawl controller.
Finally, mem-use-KB is extracted from the runtime environment
(Runtime.getRuntime().totalMemory()).
In addition to the data collected for the above logs, various other data
is gathered and stored by this tracker:
- Successfully downloaded documents per fetch status code
- Successfully downloaded documents per document mime type
- Amount of data per mime type
- Successfully downloaded documents per host
- Amount of data per host
- Disposition of all seeds (this is written to ‘reports.log’ at end of crawl)
- Successfully downloaded documents per host per source
<bean id="statisticsTracker" class="org.archive.crawler.reporting.StatisticsTracker">
<!-- <property name="seeds" value="" /> -->
<!-- <property name="bdbModule" value="" /> -->
<!-- <property name="reportsDir" value="" /> -->
<!-- <property name="serverCache" value="" /> -->
<!-- <property name="liveHostReportSize" value="20" /> -->
<!-- <property name="applicationContext" value="" /> -->
<!-- <property name="trackSeeds" value="true" /> -->
<!-- <property name="trackSources" value="true" /> -->
<!-- <property name="intervalSeconds" value="20" /> -->
<!-- <property name="keepSnapshotsCount" value="5" /> -->
<!-- <property name="crawlController" value="" /> -->
<!-- <property name="reports" value="" /> -->
<!-- <property name="beanName" value="" /> -->
<!-- <property name="recoveryCheckpoint" value="" /> -->
</bean>
- seeds
- (SeedModule)
- bdbModule
- (BdbModule)
- reportsDir
- (ConfigPath)
- serverCache
- (ServerCache)
- liveHostReportSize
- (int)
- applicationContext
- (ApplicationContext)
- trackSeeds
- (boolean) Whether to maintain seed disposition records (expensive in crawls with millions of seeds)
- trackSources
- (boolean) Whether to maintain hosts-per-source-tag records for; very expensive in crawls with large numbers of source-tags (seeds) or large crawls over many hosts
- intervalSeconds
- (int) The interval between writing progress information to log.
- keepSnapshotsCount
- (int) Number of crawl-stat sample snapshots to keep for calculation purposes.
- crawlController
- (CrawlController)
- reports
- (List)
- beanName
- (String)
- recoveryCheckpoint
- (Checkpoint)
TextSeedModule¶
Module that announces a list of seeds from a text source (such as a ConfigFile or ConfigString), and provides a mechanism for adding seeds after a crawl has begun.
<bean id="textSeedModule" class="org.archive.modules.seeds.TextSeedModule">
<!-- <property name="textSource" value="null" /> -->
<!-- <property name="blockAwaitingSeedLines" value="1" /> -->
</bean>
- textSource
- (ReadSource) Text from which to extract seeds
- blockAwaitingSeedLines
- (int) Number of lines of seeds-source to read on initial load before proceeding with crawl. Default is -1, meaning all. Any other value will cause that number of lines to be loaded before fetching begins, while all extra lines continue to be processed in the background. Generally, this should only be changed when working with very large seed lists, and scopes that do *not* depend on reading all seeds.
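For example, reading seeds from a seeds.txt file in the job directory, as in the default profile:
<bean id="seeds" class="org.archive.modules.seeds.TextSeedModule">
  <property name="textSource">
    <bean class="org.archive.spring.ConfigFile">
      <property name="path" value="seeds.txt" />
    </bean>
  </property>
</bean>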
Decide Rules¶
AcceptDecideRule¶
<bean id="acceptDecideRule" class="org.archive.modules.deciderules.AcceptDecideRule">
</bean>
ClassKeyMatchesRegexDecideRule¶
Rule applies the configured decision to any CrawlURI whose class key – i.e. CrawlURI#getClassKey() – matches the supplied regex.
<bean id="classKeyMatchesRegexDecideRule" class="org.archive.crawler.deciderules.ClassKeyMatchesRegexDecideRule">
<!-- <property name="crawlController" value="" /> -->
</bean>
- crawlController
- (CrawlController)
ContentLengthDecideRule¶
<bean id="contentLengthDecideRule" class="org.archive.modules.deciderules.ContentLengthDecideRule">
<!-- <property name="contentLengthThreshold" value="" /> -->
</bean>
- contentLengthThreshold
- (long) Content-length threshold. The rule returns ACCEPT if the content-length is less than this threshold, or REJECT otherwise. The default is 2^63, meaning any document will be accepted.
ContentTypeMatchesRegexDecideRule¶
DecideRule whose decision is applied if the URI’s content-type is present and matches the supplied regular expression.
<bean id="contentTypeMatchesRegexDecideRule" class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
</bean>
ContentTypeNotMatchesRegexDecideRule¶
DecideRule whose decision is applied if the URI’s content-type is present and does not match the supplied regular expression.
<bean id="contentTypeNotMatchesRegexDecideRule" class="org.archive.modules.deciderules.ContentTypeNotMatchesRegexDecideRule">
</bean>
ExpressionDecideRule (contrib)¶
Example usage:
<bean class=”org.archive.modules.deciderules.ExpressionDecideRule”>
<property name=”groovyExpression” value=’curi.via == null && curi ==~ “^https?://(?:www\.)?(facebook|vimeo|flickr)\.com/.*”’/>
</bean>
<bean id="expressionDecideRule" class="org.archive.modules.deciderules.ExpressionDecideRule">
<!-- <property name="groovyExpression" value="" /> -->
</bean>
- groovyExpression
- (String)
ExternalGeoLocationDecideRule¶
A rule that can be configured to take alternate implementations
of the ExternalGeoLocationInterface. If no implementation is specified, or
none is found, it returns the configured decision. If the host in the URI has
been resolved, it checks the CrawlHost for the country code determination. If
the country code is not present, it does a country lookup and saves the
country code to the CrawlHost for future consultation. If the country code is
present in the CrawlHost, it compares it against the configured code.
Note that if a host’s IP address changes during the crawl, we still consider
the associated hostname to be in the country of its original IP address.
<bean id="externalGeoLocationDecideRule" class="org.archive.modules.deciderules.ExternalGeoLocationDecideRule">
<!-- <property name="lookup" value="null" /> -->
<!-- <property name="countryCodes" value="" /> -->
<!-- <property name="serverCache" value="" /> -->
</bean>
- lookup
- (ExternalGeoLookupInterface)
- countryCodes
- (List) Country code name.
- serverCache
- (ServerCache)
FetchStatusDecideRule¶
Rule applies the configured decision for any URI which has a fetch status equal to the ‘target-status’ setting.
<bean id="fetchStatusDecideRule" class="org.archive.modules.deciderules.FetchStatusDecideRule">
<!-- <property name="statusCodes" value="" /> -->
</bean>
- statusCodes
- (List)
FetchStatusMatchesRegexDecideRule¶
<bean id="fetchStatusMatchesRegexDecideRule" class="org.archive.modules.deciderules.FetchStatusMatchesRegexDecideRule">
</bean>
FetchStatusNotMatchesRegexDecideRule¶
<bean id="fetchStatusNotMatchesRegexDecideRule" class="org.archive.modules.deciderules.FetchStatusNotMatchesRegexDecideRule">
</bean>
HasViaDecideRule¶
Rule applies the configured decision for any URI which has a ‘via’ (essentially, any URI that was not a seed or some kinds of mid-crawl adds).
<bean id="hasViaDecideRule" class="org.archive.modules.deciderules.HasViaDecideRule">
</bean>
HopCrossesAssignmentLevelDomainDecideRule¶
Applies its decision if the current URI differs from its referring URI in that portion of its hostname/domain that is assigned/sold by registrars, its ‘assignment-level-domain’ (ALD) (AKA ‘public suffix’ or, in previous Heritrix versions, ‘topmost assigned SURT’).
<bean id="hopCrossesAssignmentLevelDomainDecideRule" class="org.archive.modules.deciderules.HopCrossesAssignmentLevelDomainDecideRule">
</bean>
HopsPathMatchesRegexDecideRule¶
Rule applies configured decision to any CrawlURIs whose ‘hops-path’ (string like “LLXE” etc.) matches the supplied regex.
<bean id="hopsPathMatchesRegexDecideRule" class="org.archive.modules.deciderules.HopsPathMatchesRegexDecideRule">
</bean>
IdenticalDigestDecideRule¶
Rule applies configured decision to any CrawlURIs whose revisit profile is set with a profile matching WARCConstants#PROFILE_REVISIT_IDENTICAL_DIGEST
<bean id="identicalDigestDecideRule" class="org.archive.modules.deciderules.recrawl.IdenticalDigestDecideRule">
</bean>
IpAddressSetDecideRule¶
IpAddressSetDecideRule must be used with
org.archive.crawler.prefetch.Preselector#setRecheckScope(boolean) set
to true because it relies on Heritrix’ dns lookup to establish the ip address
for a URI before it can run.
<bean class=”org.archive.modules.deciderules.IpAddressSetDecideRule”>
<property name=”ipAddresses”>
<set>
<value>127.0.0.1</value>
<value>69.89.27.209</value>
</set>
</property>
<property name=’decision’ value=’REJECT’ />
</bean>
<bean id="ipAddressSetDecideRule" class="org.archive.modules.deciderules.IpAddressSetDecideRule">
<!-- <property name="ipAddresses" value="" /> -->
<!-- <property name="serverCache" value="" /> -->
</bean>
- ipAddresses
- (Set)
- serverCache
- (ServerCache)
MatchesFilePatternDecideRule¶
Compares suffix of a passed CrawlURI, UURI, or String against a regular
expression pattern, applying its configured decision to all matches. Several predefined patterns are available for convenience. Choosing
‘custom’ makes this the same as a regular MatchesRegexDecideRule.
<bean id="matchesFilePatternDecideRule" class="org.archive.modules.deciderules.MatchesFilePatternDecideRule">
<!-- <property name="usePreset" value="" /> -->
</bean>
- usePreset
- (Preset)
MatchesListRegexDecideRule¶
Rule applies configured decision to any CrawlURIs whose String URI
matches the supplied regexes.
The list of regular expressions can be considered logically AND or OR.
<bean id="matchesListRegexDecideRule" class="org.archive.modules.deciderules.MatchesListRegexDecideRule">
<!-- <property name="timeoutPerRegexSeconds" value="0" /> -->
<!-- <property name="regexList" value="" /> -->
<!-- <property name="listLogicalOr" value="true" /> -->
</bean>
- timeoutPerRegexSeconds
- (long) The timeout for regular expression matching, in seconds. If set to 0 or negative then no timeout is specified and there is no upper limit to how long the matching may take. See the corresponding test class MatchesListRegexDecideRuleTest for a pathological example.
- regexList
- (List) The list of regular expressions to evaluate against the URI.
- listLogicalOr
- (boolean) True if the list of regular expressions should be considered as logically OR when matching; false if the list should be considered as logically AND.
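For example, to REJECT any URI matching at least one of several patterns (the patterns are illustrative):
<bean class="org.archive.modules.deciderules.MatchesListRegexDecideRule">
  <property name="decision" value="REJECT" />
  <property name="listLogicalOr" value="true" />
  <property name="regexList">
    <list>
      <value>.*\.exe$</value>
      <value>.*\.iso$</value>
    </list>
  </property>
</bean>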
MatchesRegexDecideRule¶
Rule applies configured decision to any CrawlURIs whose String URI matches the supplied regex.
<bean id="matchesRegexDecideRule" class="org.archive.modules.deciderules.MatchesRegexDecideRule">
<!-- <property name="regex" value="" /> -->
</bean>
- regex
- (Pattern)
MatchesStatusCodeDecideRule¶
Provides a rule that returns “true” for any CrawlURIs which have a fetch status code that falls within the provided inclusive range. For instance, to select only URIs with a “success” status code you must provide the range 200 to 299.
<bean id="matchesStatusCodeDecideRule" class="org.archive.modules.deciderules.MatchesStatusCodeDecideRule">
<!-- <property name="lowerBound" value="" /> -->
<!-- <property name="upperBound" value="" /> -->
</bean>
- lowerBound
- (Integer) Sets the lower bound on the range of acceptable status codes.
- upperBound
- (Integer) Sets the upper bound on the range of acceptable status codes.
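Following the range given in the description, a rule matching only ‘success’ status codes would look like:
<bean class="org.archive.modules.deciderules.MatchesStatusCodeDecideRule">
  <property name="lowerBound" value="200" />
  <property name="upperBound" value="299" />
</bean>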
NotMatchesFilePatternDecideRule¶
Rule applies configured decision to any URIs which do *not* match the supplied (file-pattern) regex.
<bean id="notMatchesFilePatternDecideRule" class="org.archive.modules.deciderules.NotMatchesFilePatternDecideRule">
</bean>
NotMatchesListRegexDecideRule¶
Rule applies configured decision to any URIs which do *not* match the supplied regex.
<bean id="notMatchesListRegexDecideRule" class="org.archive.modules.deciderules.NotMatchesListRegexDecideRule">
</bean>
NotMatchesRegexDecideRule¶
Rule applies configured decision to any URIs which do *not* match the supplied regex.
<bean id="notMatchesRegexDecideRule" class="org.archive.modules.deciderules.NotMatchesRegexDecideRule">
</bean>
NotMatchesStatusCodeDecideRule¶
Provides a rule that returns “true” for any CrawlURIs which have a fetch status code that does not fall within the provided inclusive range. For instance, to reject any URIs with a “client error” status code you must provide the range 400 to 499.
<bean id="notMatchesStatusCodeDecideRule" class="org.archive.modules.deciderules.NotMatchesStatusCodeDecideRule">
<!-- <property name="upperBound" value="" /> -->
</bean>
- upperBound
- (Integer) Sets the upper bound on the range of acceptable status codes.
NotOnDomainsDecideRule¶
Rule applies configured decision to any URIs that are *not* in one of the domains in the configured set of domains, filled from the seed set.
<bean id="notOnDomainsDecideRule" class="org.archive.modules.deciderules.surt.NotOnDomainsDecideRule">
</bean>
NotOnHostsDecideRule¶
Rule applies configured decision to any URIs that are *not* on one of the hosts in the configured set of hosts, filled from the seed set.
<bean id="notOnHostsDecideRule" class="org.archive.modules.deciderules.surt.NotOnHostsDecideRule">
</bean>
NotSurtPrefixedDecideRule¶
Rule applies configured decision to any URIs that, when
expressed in SURT form, do *not* begin with one of the prefixes
in the configured set. The set can be filled with SURT prefixes implied or
listed in the seeds file, or another external file.
<bean id="notSurtPrefixedDecideRule" class="org.archive.modules.deciderules.surt.NotSurtPrefixedDecideRule">
</bean>
OnDomainsDecideRule¶
Rule applies configured decision to any URIs that are on one of the domains in the configured set of domains, filled from the seed set.
<bean id="onDomainsDecideRule" class="org.archive.modules.deciderules.surt.OnDomainsDecideRule">
</bean>
OnHostsDecideRule¶
Rule applies configured decision to any URIs that are on one of the hosts in the configured set of hosts, filled from the seed set.
<bean id="onHostsDecideRule" class="org.archive.modules.deciderules.surt.OnHostsDecideRule">
</bean>
PathologicalPathDecideRule¶
Rule REJECTs any URI which contains an excessive number of identical, consecutive path-segments (e.g. http://example.com/a/a/a/boo.html has 3 consecutive ‘/a’ segments)
<bean id="pathologicalPathDecideRule" class="org.archive.modules.deciderules.PathologicalPathDecideRule">
<!-- <property name="maxRepetitions" value="2" /> -->
</bean>
- maxRepetitions
- (int) Number of times the pattern should be allowed to occur. This rule returns its decision (usually REJECT) if a path-segment is repeated more than this number of times.
PredicatedDecideRule¶
Rule which applies the configured decision only if a test evaluates to true. Subclasses override evaluate() to establish the test.
<bean id="predicatedDecideRule" class="org.archive.modules.deciderules.PredicatedDecideRule">
<!-- <property name="decision" value="" /> -->
</bean>
- decision
- (DecideResult)
PrerequisiteAcceptDecideRule¶
Rule which ACCEPTs all ‘prerequisite’ URIs (those with a ‘P’ in the last hopsPath position). Good in a late position to ensure other scope settings don’t lock out necessary prerequisites.
<bean id="prerequisiteAcceptDecideRule" class="org.archive.modules.deciderules.PrerequisiteAcceptDecideRule">
</bean>
RejectDecideRule¶
<bean id="rejectDecideRule" class="org.archive.modules.deciderules.RejectDecideRule">
</bean>
ResourceLongerThanDecideRule¶
Applies configured decision for URIs with content length greater than
a given threshold length value. Examines either HTTP header Content-Length
or actual downloaded content length (based on the useHeaderLength property),
and has no effect on resources shorter than or equal to the given threshold
value. Note that because neither the Content-Length header nor the actual size are
available at URI-scoping time, this rule is unusable in crawl scopes.
Instead, the earliest it can be used is as a mid-fetch rule (in FetchHTTP),
when the headers are available but not yet the body. It can also be used
to affect processing after the URI is fully fetched.
<bean id="resourceLongerThanDecideRule" class="org.archive.modules.deciderules.ResourceLongerThanDecideRule">
</bean>
ResourceNoLongerThanDecideRule¶
Applies configured decision for URIs with content length less than or equal
to a given threshold length value. Examines either HTTP header Content-Length
or actual downloaded content length (based on the useHeaderLength property),
and has no effect on resources longer than the given threshold value. Note that because neither the Content-Length header nor the actual size are
available at URI-scoping time, this rule is unusable in crawl scopes.
Instead, the earliest it can be used is as a mid-fetch rule (in FetchHTTP),
when the headers are available but not yet the body. It can also be used
to affect processing after the URI is fully fetched.
<bean id="resourceNoLongerThanDecideRule" class="org.archive.modules.deciderules.ResourceNoLongerThanDecideRule">
<!-- <property name="useHeaderLength" value="true" /> -->
<!-- <property name="contentLengthThreshold" value="1" /> -->
</bean>
- useHeaderLength
- (boolean) Shall this rule be used as a midfetch rule? If true, this rule will determine content length based on HTTP header information, otherwise the size of the already downloaded content will be used.
- contentLengthThreshold
- (long) Max content-length this filter will allow to pass through. If -1, then no limit.
ResponseContentLengthDecideRule¶
Decide rule that will ACCEPT or REJECT a URI, depending on the “decision” property, after it is fetched, if the content body is within a specified size range (in bytes).
<bean id="responseContentLengthDecideRule" class="org.archive.modules.deciderules.ResponseContentLengthDecideRule">
<!-- <property name="lowerBound" value="0" /> -->
<!-- <property name="upperBound" value="" /> -->
</bean>
- lowerBound
- (long) The rule will apply if the url has been fetched and content body length is greater than or equal to this number of bytes. Default is 0, meaning everything will match.
- upperBound
- (long) The rule will apply if the url has been fetched and content body length is less than or equal to this number of bytes. Default is Long.MAX_VALUE, meaning everything will match.
SchemeNotInSetDecideRule¶
Rule applies the configured decision (default REJECT) for any URI which has a URI-scheme NOT contained in the configured Set.
<bean id="schemeNotInSetDecideRule" class="org.archive.modules.deciderules.SchemeNotInSetDecideRule">
<!-- <property name="schemes" value="" /> -->
</bean>
- schemes
- (Set) set of schemes to test URI scheme
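For example, to apply the default REJECT decision to any URI whose scheme is neither http nor https (the scheme set is illustrative):
<bean class="org.archive.modules.deciderules.SchemeNotInSetDecideRule">
  <property name="schemes">
    <set>
      <value>http</value>
      <value>https</value>
    </set>
  </property>
</bean>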
ScriptedDecideRule¶
Rule which runs a JSR-223 script to make its decision. Script source may be provided via a file local to the crawler or
an inline configuration string. The source must include a one-argument function “decisionFor” which
returns the appropriate DecideResult. Variables available to the script include ‘object’ (the object to be
evaluated, typically a CrawlURI), ‘self’ (this ScriptedDecideRule
instance), and ‘context’ (the crawl’s ApplicationContext, from
which all named crawl beans are easily reachable).
<bean id="scriptedDecideRule" class="org.archive.modules.deciderules.ScriptedDecideRule">
<!-- <property name="engineName" value="beanshell" /> -->
<!-- <property name="scriptSource" value="null" /> -->
<!-- <property name="isolateThreads" value="true" /> -->
<!-- <property name="applicationContext" value="" /> -->
</bean>
- engineName
- (String) engine name; default “beanshell”
- scriptSource
- (ReadSource)
- isolateThreads
- (boolean) Whether each ToeThread should get its own independent script engine, or they should share synchronized access to one engine. Default is true, meaning each thread gets its own isolated engine.
- applicationContext
- (ApplicationContext)
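A minimal sketch using an inline script supplied via a ConfigString (assumes a Groovy JSR-223 engine is available; the rejection test itself is arbitrary):
<bean id="scriptedDecideRule" class="org.archive.modules.deciderules.ScriptedDecideRule">
  <property name="engineName" value="groovy" />
  <property name="scriptSource">
    <bean class="org.archive.spring.ConfigString">
      <property name="value">
        <value>
          import org.archive.modules.deciderules.DecideResult
          // REJECT any URI whose string form contains "calendar"; otherwise make no decision
          def decisionFor(object) {
            return object.toString().contains("calendar") ? DecideResult.REJECT : DecideResult.NONE
          }
        </value>
      </property>
    </bean>
  </property>
</bean>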
SeedAcceptDecideRule¶
Rule which ACCEPTs all ‘seed’ URIs (those for which isSeed is true). Good in a late position to ensure other scope settings don’t lock out explicitly added seeds.
<bean id="seedAcceptDecideRule" class="org.archive.modules.deciderules.SeedAcceptDecideRule">
</bean>
SourceSeedDecideRule¶
Rule applies the configured decision for any URI discovered from one of
the seeds in sourceSeeds. SeedModule#getSourceTagSeeds() must be enabled or
the rule will never apply.
<bean id="sourceSeedDecideRule" class="org.archive.modules.deciderules.SourceSeedDecideRule">
<!-- <property name="sourceSeeds" value="" /> -->
</bean>
- sourceSeeds
- (Set)
SurtPrefixedDecideRule¶
Rule applies configured decision to any URIs that, when
expressed in SURT form, begin with one of the prefixes
in the configured set. The set can be filled with SURT prefixes implied or
listed in the seeds file, or another external file. The “also-check-via” option to implement “one hop off”
scoping derives from a contribution by Shifra Raffel
of the California Digital Library.
<bean id="surtPrefixedDecideRule" class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule">
<!-- <property name="surtsSourceFile" value="" /> -->
<!-- <property name="surtsSource" value="null" /> -->
<!-- <property name="seedsAsSurtPrefixes" value="true" /> -->
<!-- <property name="surtsDumpFile" value="" /> -->
<!-- <property name="alsoCheckVia" value="false" /> -->
<!-- <property name="seeds" value="" /> -->
<!-- <property name="beanName" value="" /> -->
<!-- <property name="recoveryCheckpoint" value="" /> -->
</bean>
- surtsSourceFile
- (ConfigFile)
- surtsSource
- (ReadSource) Text from which to infer SURT prefixes. Any URLs will be converted to the implied SURT prefix, and literal SURT prefixes may be listed on lines beginning with a ‘+’ character.
- seedsAsSurtPrefixes
- (boolean) Should seeds also be interpreted as SURT prefixes.
- surtsDumpFile
- (ConfigFile) Dump file to save SURT prefixes actually used: Useful debugging SURTs.
- alsoCheckVia
- (boolean) Whether to also make the configured decision if a URI’s ‘via’ URI (the URI from which it was discovered) in SURT form begins with any of the established prefixes. For example, can be used to ACCEPT URIs that are ‘one hop off’ URIs fitting the SURT prefixes. Default is false.
- seeds
- (SeedModule)
- beanName
- (String)
- recoveryCheckpoint
- (Checkpoint)
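For example, prefixes can be supplied inline through surtsSource; a plain URL implies its SURT prefix, while a line beginning with ‘+’ is taken as a literal SURT prefix (the values are illustrative):
<bean class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule">
  <property name="surtsSource">
    <bean class="org.archive.spring.ConfigString">
      <property name="value">
        <value>
          http://example.org/section/
          +http://(org,example,
        </value>
      </property>
    </bean>
  </property>
</bean>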
TooManyHopsDecideRule¶
Rule REJECTs any CrawlURIs whose total number of hops (length of the hopsPath string, traversed links of any type) is over a threshold. Otherwise returns PASS.
<bean id="tooManyHopsDecideRule" class="org.archive.modules.deciderules.TooManyHopsDecideRule">
<!-- <property name="maxHops" value="20" /> -->
</bean>
- maxHops
- (int) Max path depth for which this filter will match.
TooManyPathSegmentsDecideRule¶
Rule REJECTs any CrawlURIs whose total number of path-segments (as indicated by the count of ‘/’ characters not including the first ‘//’) is over a given threshold.
<bean id="tooManyPathSegmentsDecideRule" class="org.archive.modules.deciderules.TooManyPathSegmentsDecideRule">
<!-- <property name="maxPathDepth" value="20" /> -->
</bean>
- maxPathDepth
- (int) Number of path segments beyond which this rule will reject URIs.
TransclusionDecideRule¶
Rule ACCEPTs any CrawlURIs whose path-from-seed (‘hopsPath’ – see
CrawlURI#getPathFromSeed()) ends
with at least one, but not more than, the given number of
non-navlink (‘L’) hops. Otherwise, if the path-from-seed is empty or if a navlink (‘L’) occurs
within max-trans-hops of the tail of the path-from-seed, this rule
returns PASS.
Thus, it allows things like embedded resources (frames/images/media)
and redirects to be transitively included (‘transcluded’) in a crawl,
even if they otherwise would not, for some reasonable number of hops
(usually 1-5).
<bean id="transclusionDecideRule" class="org.archive.modules.deciderules.TransclusionDecideRule">
<!-- <property name="maxTransHops" value="2" /> -->
<!-- <property name="maxSpeculativeHops" value="1" /> -->
</bean>
- maxTransHops
- (int) Maximum number of non-navlink (non-‘L’) hops to ACCEPT.
- maxSpeculativeHops
- (int) Maximum number of speculative (‘X’) hops to ACCEPT.
ViaSurtPrefixedDecideRule¶
Rule applies the configured decision for any URI which has a ‘via’ whose SURT form matches any SURT specified in the surtPrefixes list.
<bean id="viaSurtPrefixedDecideRule" class="org.archive.modules.deciderules.ViaSurtPrefixedDecideRule">
<!-- <property name="surtPrefixes" value="" /> -->
</bean>
- surtPrefixes
- (List)
Candidate Processors¶
CandidateScoper¶
Simple single-URI scoper, considers passed-in URI as candidate; sets fetchstatus negative and skips to end of processing if out-of-scope.
<bean id="candidateScoper" class="org.archive.crawler.prefetch.CandidateScoper">
</bean>
FrontierPreparer¶
Processor to preload URI with as much precalculated policy-based
info as possible before it reaches frontier critical sections. Frontiers also maintain a direct reference to this class, in case
they need to perform remedial preparation for URIs that do not
pass through this processor on the CandidateChain.
<bean id="frontierPreparer" class="org.archive.crawler.prefetch.FrontierPreparer">
<!-- <property name="preferenceDepthHops" value="1" /> -->
<!-- <property name="preferenceEmbedHops" value="1" /> -->
<!-- <property name="canonicalizationPolicy" value="" /> -->
<!-- <property name="queueAssignmentPolicy" value="" /> -->
<!-- <property name="uriPrecedencePolicy" value="" /> -->
<!-- <property name="costAssignmentPolicy" value="" /> -->
</bean>
- preferenceDepthHops
- (int) Number of hops (of any sort) from a seed up to which a URI has higher priority scheduling than any remaining seed. For example, if set to 1 items one hop (link, embed, redirect, etc.) away from a seed will be scheduled with HIGH priority. If set to -1, no preferencing will occur, and a breadth-first search with seeds processed before discovered links will proceed. If set to zero, a purely depth-first search will proceed, with all discovered links processed before remaining seeds. Seed redirects are treated as one hop from a seed.
- preferenceEmbedHops
- (int) number of hops of embeds (ERX) to bump to front of host queue
- canonicalizationPolicy
- (UriCanonicalizationPolicy) Ordered list of url canonicalization rules. Rules are applied in the order listed from top to bottom.
- queueAssignmentPolicy
- (QueueAssignmentPolicy) Defines how to assign URIs to queues. Can assign by host, by ip, by SURT-ordered authority, by SURT-ordered authority truncated to a topmost-assignable domain, and into one of a fixed set of buckets (1k).
- uriPrecedencePolicy
- (UriPrecedencePolicy) URI precedence assignment policy to use.
- costAssignmentPolicy
- (CostAssignmentPolicy) cost assignment policy to use.
Pre-Fetch Processors¶
PreconditionEnforcer¶
Ensures the preconditions for a fetch – such as DNS lookup or acquiring and respecting a robots.txt policy – are satisfied before a URI is passed to subsequent stages.
<bean id="preconditionEnforcer" class="org.archive.crawler.prefetch.PreconditionEnforcer">
<!-- <property name="ipValidityDurationSeconds" value="" /> -->
<!-- <property name="robotsValidityDurationSeconds" value="" /> -->
<!-- <property name="calculateRobotsOnly" value="false" /> -->
<!-- <property name="metadata" value="" /> -->
<!-- <property name="credentialStore" value="" /> -->
<!-- <property name="serverCache" value="" /> -->
<!-- <property name="loggerModule" value="" /> -->
</bean>
- ipValidityDurationSeconds
- (int) The minimum interval for which a dns-record will be considered valid (in seconds). If the record’s DNS TTL is larger, that will be used instead.
- robotsValidityDurationSeconds
- (int) The time in seconds that fetched robots.txt information is considered to be valid. If the value is set to ‘0’, then the robots.txt information will never expire.
- calculateRobotsOnly
- (boolean) Whether to only calculate the robots status of an URI, without actually applying any exclusions found. If true, excluded URIs will only be annotated in the crawl.log, but still fetched. Default is false.
- metadata
- (CrawlMetadata) Auto-discovered module providing configured (or overridden) User-Agent value and RobotsHonoringPolicy
- credentialStore
- (CredentialStore)
- serverCache
- (ServerCache)
- loggerModule
- (CrawlerLoggerModule)
Preselector¶
If set to recheck the crawl’s scope, gives a yes/no on whether a CrawlURI should be processed at all. If not, its status will be marked OUT_OF_SCOPE and the URI will skip directly to the first “postprocessor”.
<bean id="preselector" class="org.archive.crawler.prefetch.Preselector">
<!-- <property name="recheckScope" value="false" /> -->
<!-- <property name="blockAll" value="false" /> -->
<!-- <property name="blockByRegex" value="" /> -->
<!-- <property name="allowByRegex" value="" /> -->
</bean>
- recheckScope
- (boolean) Recheck if uri is in scope. This is meaningful if the scope is altered during a crawl. URIs are checked against the scope when they are added to queues. Setting this value to true forces the URI to be checked against the scope when it is coming out of the queue, possibly after the scope is altered.
- blockAll
- (boolean) Block all URIs from being processed. This is most likely to be used in overrides to easily reject certain hosts from being processed.
- blockByRegex
- (String) Block all URIs matching the regular expression from being processed.
- allowByRegex
- (String) Allow only URIs matching the regular expression to be processed.
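For example, a job (or an override sheet) might recheck scope and block one area of a site by regex (the pattern is illustrative):
<bean id="preselector" class="org.archive.crawler.prefetch.Preselector">
  <property name="recheckScope" value="true" />
  <property name="blockByRegex" value=".*/private/.*" />
</bean>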
Fetch Processors¶
FetchDNS¶
Processor to resolve ‘dns:’ URIs.
<bean id="fetchDNS" class="org.archive.modules.fetcher.FetchDNS">
<!-- <property name="acceptNonDnsResolves" value="false" /> -->
<!-- <property name="disableJavaDnsResolves" value="false" /> -->
<!-- <property name="dnsOverHttpServer" value="" /> -->
<!-- <property name="serverCache" value="" /> -->
<!-- <property name="digestContent" value="true" /> -->
<!-- <property name="digestAlgorithm" value="sha1" /> -->
</bean>
- acceptNonDnsResolves
- (boolean) If a DNS lookup fails, whether or not to fall back to InetAddress resolution, which may use local ‘hosts’ files or other mechanisms.
- disableJavaDnsResolves
- (boolean) Optionally, only allow InetAddress resolution, precisely because it
may use local ‘hosts’ files or other mechanisms.
This should not generally be used in production as it will prevent DNS lookups from being recorded properly.
- dnsOverHttpServer
- (String) URL of the DNS-over-HTTP(S) server. If this is not set or is set to an empty string, DNS-over-HTTP(S) will not be used; otherwise it should contain the URL of the DNS-over-HTTPS server.
- serverCache
- (ServerCache) Used to do DNS lookups.
- digestContent
- (boolean) Whether or not to perform an on-the-fly digest hash of retrieved content-bodies.
- digestAlgorithm
- (String) Which algorithm (for example MD5 or SHA-1) to use to perform an on-the-fly digest hash of retrieved content-bodies.
FetchFTP¶
Fetches documents and directory listings using FTP. This class will also try to extract FTP “links” from directory listings. For this class to archive a directory listing, the remote FTP server must support the NLST command. Most modern FTP servers should.
<bean id="fetchFTP" class="org.archive.modules.fetcher.FetchFTP">
<!-- <property name="username" value="anonymous" /> -->
<!-- <property name="password" value="password" /> -->
<!-- <property name="extractFromDirs" value="true" /> -->
<!-- <property name="extractParent" value="true" /> -->
<!-- <property name="digestContent" value="true" /> -->
<!-- <property name="digestAlgorithm" value="sha1" /> -->
<!-- <property name="maxLengthBytes" value="0" /> -->
<!-- <property name="maxFetchKBSec" value="0" /> -->
<!-- <property name="timeoutSeconds" value="" /> -->
<!-- <property name="soTimeoutMs" value="" /> -->
</bean>
- username
- (String) The username to send to FTP servers. By convention, the default value of “anonymous” is used for publicly available FTP sites.
- password
- (String) The password to send to FTP servers. By convention, anonymous users send their email address in this field.
- extractFromDirs
- (boolean) Set to true to extract further URIs from FTP directories. Default is true.
- extractParent
- (boolean) Set to true to extract the parent URI from all FTP URIs. Default is true.
- digestContent
- (boolean) Whether or not to perform an on-the-fly digest hash of retrieved content-bodies.
- digestAlgorithm
- (String) Which algorithm (for example MD5 or SHA-1) to use to perform an on-the-fly digest hash of retrieved content-bodies.
- maxLengthBytes
- (long) Maximum length in bytes to fetch. Fetch is truncated at this length. A value of 0 means no limit.
- maxFetchKBSec
- (int) The maximum KB/sec to use when fetching data from a server. The default of 0 means no maximum.
- timeoutSeconds
- (int) If the fetch is not completed in this number of seconds, give up (and retry later).
- soTimeoutMs
- (int) If the socket is unresponsive for this number of milliseconds, give up. Set to zero for no timeout (not recommended: could hang a thread on an unresponsive server). This timeout is used for timing out socket opens and for timing out each socket read. Make sure this value is < #TIMEOUT_SECONDS for optimal configuration: this ensures at least one retry read.
FetchHTTP¶
HTTP fetcher that uses Apache HttpComponents.
<bean id="fetchHTTP" class="org.archive.modules.fetcher.FetchHTTP">
<!-- <property name="serverCache" value="" /> -->
<!-- <property name="digestContent" value="true" /> -->
<!-- <property name="digestAlgorithm" value="sha1" /> -->
<!-- <property name="userAgentProvider" value="" /> -->
<!-- <property name="sendConnectionClose" value="true" /> -->
<!-- <property name="defaultEncoding" value="ISO-8859-1" /> -->
<!-- <property name="useHTTP11" value="false" /> -->
<!-- <property name="ignoreCookies" value="false" /> -->
<!-- <property name="sendReferer" value="true" /> -->
<!-- <property name="acceptCompression" value="false" /> -->
<!-- <property name="acceptHeaders" value="" /> -->
<!-- <property name="cookieStore" value="" /> -->
<!-- <property name="credentialStore" value="" /> -->
<!-- <property name="httpBindAddress" value="" /> -->
<!-- <property name="httpProxyHost" value="" /> -->
<!-- <property name="httpProxyPort" value="" /> -->
<!-- <property name="httpProxyUser" value="" /> -->
<!-- <property name="httpProxyPassword" value="" /> -->
<!-- <property name="maxFetchKBSec" value="0" /> -->
<!-- <property name="timeoutSeconds" value="" /> -->
<!-- <property name="soTimeoutMs" value="" /> -->
<!-- <property name="maxLengthBytes" value="0" /> -->
<!-- <property name="sendRange" value="false" /> -->
<!-- <property name="sendIfModifiedSince" value="true" /> -->
<!-- <property name="sendIfNoneMatch" value="true" /> -->
<!-- <property name="shouldFetchBodyRule" value="" /> -->
<!-- <property name="sslTrustLevel" value="" /> -->
<!-- <property name="socksProxyHost" value="" /> -->
<!-- <property name="socksProxyPort" value="" /> -->
</bean>
- serverCache
- (ServerCache) Used to do DNS lookups.
- digestContent
- (boolean) Whether or not to perform an on-the-fly digest hash of retrieved content-bodies.
- digestAlgorithm
- (String) Which algorithm (for example MD5 or SHA-1) to use to perform an on-the-fly digest hash of retrieved content-bodies.
- userAgentProvider
- (UserAgentProvider)
- sendConnectionClose
- (boolean) Send ‘Connection: close’ header with every request.
- defaultEncoding
- (String) The character encoding to use for files that do not have one specified in the HTTP response headers. Default: ISO-8859-1.
- useHTTP11
- (boolean) Use HTTP/1.1. Note: even when offering an HTTP/1.1 request, Heritrix may not properly handle persistent/keep-alive connections, so the sendConnectionClose parameter should remain ‘true’.
- ignoreCookies
- (boolean) Disable cookie handling.
- sendReferer
- (boolean) Send ‘Referer’ header with every request.
The ‘Referer’ header contains the location the crawler came from, the page the current URI was discovered in. The ‘Referer’ is usually logged on the remote server and can be of assistance to webmasters trying to figure out how a crawler got to a particular area on a site.
- acceptCompression
- (boolean) Set headers to accept compressed responses.
- acceptHeaders
- (List) Accept Headers to include in each request. Each must be the complete header, e.g., ‘Accept-Language: en’. (Thus, this can also be used to send other headers not beginning with ‘Accept-’.) By default Heritrix sends an Accept header similar to what a typical browser would send (the value comes from Firefox 4.0).
- cookieStore
- (AbstractCookieStore)
- credentialStore
- (CredentialStore) Used to store credentials.
- httpBindAddress
- (String) Local IP address or hostname to use when making connections (binding sockets). When not specified, uses default local address(es).
- httpProxyHost
- (String) Proxy host IP (set only if needed).
- httpProxyPort
- (Integer) Proxy port (set only if needed).
- httpProxyUser
- (String) Proxy user (set only if needed).
- httpProxyPassword
- (String) Proxy password (set only if needed).
- maxFetchKBSec
- (int) The maximum KB/sec to use when fetching data from a server. The default of 0 means no maximum.
- timeoutSeconds
- (int) If the fetch is not completed in this number of seconds, give up (and retry later).
- soTimeoutMs
- (int) If the socket is unresponsive for this number of milliseconds, give up. Set to zero for no timeout (not recommended: it could hang a thread on an unresponsive server). This timeout is used for timing out socket opens and each socket read. Make sure this value is < #getTimeoutSeconds() for optimal configuration: this ensures at least one retry read.
- maxLengthBytes
- (long) Maximum length in bytes to fetch. Fetch is truncated at this length. A value of 0 means no limit.
- sendRange
- (boolean) Send ‘Range’ header when a limit (#MAX_LENGTH_BYTES) on document size is in effect.
Be polite to the HTTP servers and send the ‘Range’ header, stating that you are only interested in the first n bytes. Only pertinent if #MAX_LENGTH_BYTES > 0. Sending the ‘Range’ header results in a ‘206 Partial Content’ status response, which is better than just cutting the response mid-download. On rare occasions, sending ‘Range’ will generate a ‘416 Request Range Not Satisfiable’ response.
- sendIfModifiedSince
- (boolean) Send ‘If-Modified-Since’ header, if previous ‘Last-Modified’ fetch history information is available in URI history.
- sendIfNoneMatch
- (boolean) Send ‘If-None-Match’ header, if previous ‘Etag’ fetch history information is available in URI history.
- shouldFetchBodyRule
- (DecideRule) DecideRules applied after receipt of HTTP response headers but before we start to download the body. If any filter returns FALSE, the fetch is aborted. Prerequisites such as robots.txt by-pass filtering (i.e. they cannot be midfetch aborted).
- sslTrustLevel
- (TrustLevel) SSL certificate trust level. Range is from the default ‘open’ (trust all certs including expired, self-signed, and those for which we do not have a CA) through ‘loose’ (trust all valid certificates including self-signed), ‘normal’ (all valid certificates not including self-signed) to ‘strict’ (cert is valid and the DN must match the server name).
- socksProxyHost
- (String) Sets a SOCKS5 proxy host to use. This will override any set HTTP proxy.
- socksProxyPort
- (Integer) Sets a SOCKS5 proxy port to use.
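As a concrete illustration, here is a minimal sketch of overriding a few of these properties; the proxy host and port are placeholder values, not defaults:
<bean id="fetchHttp" class="org.archive.modules.fetcher.FetchHTTP">
 <!-- route fetches through a local proxy (placeholder values) -->
 <property name="httpProxyHost" value="proxy.example.org" />
 <property name="httpProxyPort" value="3128" />
 <!-- truncate any document larger than 100 MiB -->
 <property name="maxLengthBytes" value="104857600" />
 <!-- give up on any fetch not completed within 20 minutes -->
 <property name="timeoutSeconds" value="1200" />
</bean>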
FetchSFTP¶
<bean id="fetchSFTP" class="org.archive.modules.fetcher.FetchSFTP">
<!-- <property name="username" value="anonymous" /> -->
<!-- <property name="password" value="password" /> -->
<!-- <property name="extractFromDirs" value="true" /> -->
<!-- <property name="extractParent" value="true" /> -->
<!-- <property name="digestContent" value="true" /> -->
<!-- <property name="digestAlgorithm" value="sha1" /> -->
<!-- <property name="maxLengthBytes" value="0" /> -->
<!-- <property name="maxFetchKBSec" value="0" /> -->
<!-- <property name="timeoutSeconds" value="" /> -->
<!-- <property name="soTimeoutMs" value="" /> -->
</bean>
- username
- (String) The username to send to SFTP servers. By convention, the default value of “anonymous” is used for publicly available SFTP sites.
- password
- (String) The password to send to SFTP servers. By convention, anonymous users send their email address in this field.
- extractFromDirs
- (boolean) Set to true to extract further URIs from SFTP directories. Default is true.
- extractParent
- (boolean) Set to true to extract the parent URI from all SFTP URIs. Default is true.
- digestContent
- (boolean) Whether or not to perform an on-the-fly digest hash of retrieved content-bodies.
- digestAlgorithm
- (String) Which algorithm (for example MD5 or SHA-1) to use to perform an on-the-fly digest hash of retrieved content-bodies.
- maxLengthBytes
- (long) Maximum length in bytes to fetch. Fetch is truncated at this length. A value of 0 means no limit.
- maxFetchKBSec
- (int) The maximum KB/sec to use when fetching data from a server. The default of 0 means no maximum.
- timeoutSeconds
- (int) If the fetch is not completed in this number of seconds, give up (and retry later).
- soTimeoutMs
- (int) If the socket is unresponsive for this number of milliseconds, give up. Set to zero for no timeout (not recommended: it could hang a thread on an unresponsive server). This timeout is used for timing out socket opens and each socket read. Make sure this value is < #TIMEOUT_SECONDS for optimal configuration: this ensures at least one retry read.
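For instance, a hedged sketch of a non-anonymous SFTP configuration (the credentials are placeholders):
<bean id="fetchSFTP" class="org.archive.modules.fetcher.FetchSFTP">
 <!-- placeholder credentials for a private SFTP host -->
 <property name="username" value="crawler" />
 <property name="password" value="example-password" />
 <!-- extract listings from directories, but do not queue parent links -->
 <property name="extractFromDirs" value="true" />
 <property name="extractParent" value="false" />
</bean>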
FetchWhois¶
WHOIS Fetcher (RFC 3912). If this fetcher is enabled, Heritrix will attempt
WHOIS lookups on the topmost assigned domain and the IP address of each URL.
WHOIS URIs
There is no pre-existing, canonical specification for WHOIS URIs. What
follows is the format that Heritrix uses, which we propose for general use.
Syntax in ABNF as used in RFC 3986 Uniform Resource Identifier (URI):
Generic Syntax:
whoisurl = “whois:” [ “//” host [ “:” port ] “/” ] whoisquery
whoisquery is a url-encoded string. In ABNF,
whoisquery = 1*pchar
where pchar is defined in RFC 3986. host and port are also as defined in RFC 3986.
To resolve a WHOIS URI which specifies host[:port], open a TCP connection to
the host at the specified port (default 43), send the query (whoisquery,
url-decoded) followed by CRLF, and read the response until the server closes
the connection. For more details see RFC 3912.
Resolution of a “serverless” WHOIS URI, which does not specify host[:port],
is implementation-dependent.
Serverless WHOIS URIs in Heritrix
For each non-WHOIS URI processed which has an authority, FetchWhois adds 1 or
2 serverless WHOIS URIs to the CrawlURI’s outlinks. These are
“whois:{ipAddress}” and, if the authority includes a hostname,
“whois:{topLevelDomain}”. See #addWhoisLinks(CrawlURI).
Heritrix resolves serverless WHOIS URIs by first querying an initial server,
then following referrals to other servers. In pseudocode:
if query is an IPv4 address
    resolve whois://#DEFAULT_IP_WHOIS_SERVER/whoisquery
else
    let domainSuffix = part of query after the last ‘.’ (or the whole query if no ‘.’), url-encoded
    resolve whois://#ULTRA_SUFFIX_WHOIS_SERVER/domainSuffix
while last response refers to another server, i.e. matches regex #WHOIS_SERVER_REGEX
    if we have a special query formatting rule for this whois server, apply it - see #specialQueryTemplates
    resolve whois://referralServer/whoisquery
See #deferOrFinishGeneric(CrawlURI, String).
<bean id="fetchWhois" class="org.archive.modules.fetcher.FetchWhois">
<!-- <property name="bdbModule" value="" /> -->
<!-- <property name="specialQueryTemplates" value="" /> -->
<!-- <property name="soTimeoutMs" value="" /> -->
<!-- <property name="serverCache" value="" /> -->
</bean>
- bdbModule
- (BdbModule)
- specialQueryTemplates
- (Map)
- soTimeoutMs
- (int) If the socket is unresponsive for this number of milliseconds, give up. Set to zero for no timeout (not recommended: it could hang a thread on an unresponsive server). This timeout is used for timing out socket opens and each socket read. Make sure this value is < #TIMEOUT_SECONDS for optimal configuration: this ensures at least one retry read.
- serverCache
- (ServerCache)
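The specialQueryTemplates map keys are WHOIS server hostnames and the values are query-formatting templates applied before the query is sent. A sketch of what such a map can look like (the server name and template shown are illustrative, not a definitive list):
<bean id="fetchWhois" class="org.archive.modules.fetcher.FetchWhois">
 <property name="specialQueryTemplates">
  <map>
   <!-- illustrative: ask this server specifically for the domain record -->
   <entry key="whois.verisign-grs.com" value="domain %s" />
  </map>
 </property>
</bean>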
Link Extractors¶
ExtractorChrome (contrib)¶
Extracts links using a web browser via the Chrome Devtools Protocol.
To use, first define this as a top-level bean (as shown below), then add
<ref bean="extractorChrome"/>
to the fetch chain before extractorHTML.
By default an instance of the browser will be run as a subprocess for the
duration of the crawl. Alternatively, set devtoolsUrl (e.g.
ws://127.0.0.1:1234/devtools/browser/2bc831e8-6c02-4c9b-affd-14c93b8579d7) to
connect to an existing instance of the browser (run with
--headless --remote-debugging-port=1234).
<bean id="extractorChrome" class="org.archive.modules.extractor.ExtractorChrome">
<!-- <property name="executable" value="null" /> -->
<!-- <property name="maxOpenWindows" value="16" /> -->
<!-- <property name="devtoolsUrl" value="null" /> -->
<!-- <property name="windowWidth" value="1366" /> -->
<!-- <property name="windowHeight" value="768" /> -->
<!-- <property name="loadTimeoutSeconds" value="30" /> -->
<!-- <property name="commandLineOptions" value="" /> -->
</bean>
- captureRequests
- (boolean)
- executable
- (String) The name or path to the browser executable. If null, common locations will be searched. Not used if devtoolsUrl is set.
- maxOpenWindows
- (int) The maximum number of browser windows that are allowed to be opened simultaneously. Feel free to increase this if you have lots of RAM available.
- devtoolsUrl
- (String) URL of the devtools server to connect. If null a new browser process will be launched.
- windowWidth
- (int) Width of the browser window.
- windowHeight
- (int) Height of the browser window.
- loadTimeoutSeconds
- (int) Number of seconds to wait for the page to load.
- commandLineOptions
- (List) Extra command-line options passed to the browser process. Not used if devtoolsUrl is set.
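A sketch of the fetch-chain placement described above; the ellipsis comments stand in for the usual chain contents, and only the relative order of the two extractor refs matters here:
<bean id="fetchProcessors" class="org.archive.modules.FetchChain">
 <property name="processors">
  <list>
   <!-- ...preprocessors, fetchers, and earlier extractors here... -->
   <ref bean="extractorChrome" />
   <ref bean="extractorHTML" />
   <!-- ...remaining extractors here... -->
  </list>
 </property>
</bean>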
ExtractorCSS¶
This extractor parses URIs from CSS-type files.
The format of a CSS URL value is ‘url(’ followed by optional white space
followed by an optional single quote (’) or double quote (”) character
followed by the URL itself followed by an optional single quote (’) or
double quote (”) character followed by optional white space followed by ‘)’.
Parentheses, commas, white space characters, single quotes (’) and double
quotes (”) appearing in a URL must be escaped with a backslash:
‘\(’, ‘\)’, ‘\,’. Partial URLs are interpreted relative to the source of
the style sheet, not relative to the document.
Source: www.w3.org
<bean id="extractorCSS" class="org.archive.modules.extractor.ExtractorCSS">
</bean>
ExtractorDOC¶
This class allows the caller to extract href-style links from Word97-format Word documents.
<bean id="extractorDOC" class="org.archive.modules.extractor.ExtractorDOC">
</bean>
ExtractorHTML¶
Basic link-extraction, from an HTML content-body,
using regular expressions.
<bean id="extractorHTML" class="org.archive.modules.extractor.ExtractorHTML">
<!-- <property name="maxElementLength" value="64" /> -->
<!-- <property name="maxAttributeNameLength" value="64" /> -->
<!-- <property name="maxAttributeValLength" value="2048" /> -->
<!-- <property name="treatFramesAsEmbedLinks" value="true" /> -->
<!-- <property name="ignoreFormActionUrls" value="false" /> -->
<!-- <property name="extractOnlyFormGets" value="true" /> -->
<!-- <property name="extractJavascript" value="true" /> -->
<!-- <property name="extractValueAttributes" value="true" /> -->
<!-- <property name="ignoreUnexpectedHtml" value="true" /> -->
<!-- <property name="metadata" value="" /> -->
<!-- <property name="extractorJS" value="" /> -->
</bean>
- maxElementLength
- (int)
- maxAttributeNameLength
- (int)
- maxAttributeValLength
- (int)
- treatFramesAsEmbedLinks
- (boolean) If true, FRAME/IFRAME SRC-links are treated as embedded resources (like IMG, ‘E’ hop-type), otherwise they are treated as navigational links. Default is true.
- ignoreFormActionUrls
- (boolean) If true, URIs appearing as the ACTION attribute in HTML FORMs are ignored. Default is false.
- extractOnlyFormGets
- (boolean) If true, only ACTION URIs with a METHOD of GET (explicit or implied) are extracted. Default is true.
- extractJavascript
- (boolean) If true, in-page Javascript is scanned for strings that appear likely to be URIs. This typically finds both valid and invalid URIs, and attempts to fetch the invalid URIs sometimes generate webmaster concerns over odd crawler behavior. Default is true.
- extractValueAttributes
- (boolean) If true, strings that look like URIs found in unusual places (such as form VALUE attributes) will be extracted. This typically finds both valid and invalid URIs, and attempts to fetch the invalid URIs sometimes generate webmaster concerns over odd crawler behavior. Default is true.
- ignoreUnexpectedHtml
- (boolean) If true, URIs which end in typical non-HTML extensions (such as .gif) will not be scanned as if they were HTML. Default is true.
- metadata
- (CrawlMetadata) CrawlMetadata provides the robots honoring policy to use when considering a robots META tag.
- extractorJS
- (ExtractorJS) Javascript extractor to use to process inline javascript. Autowired if available. If null, links will not be extracted from inline javascript.
AggressiveExtractorHTML¶
Extended version of ExtractorHTML with more aggressive javascript link extraction, where javascript code is parsed first with the general HTML tags regex, and then by the javascript speculative link regex.
<bean id="aggressiveExtractorHTML" class="org.archive.modules.extractor.AggressiveExtractorHTML">
</bean>
JerichoExtractorHTML¶
Improved link-extraction from an HTML content-body using the jericho-html parser. This extractor extends ExtractorHTML and mimics its workflow, but has some substantial differences when it comes to internal implementation. Instead of relying heavily upon Java regular expressions it uses a real HTML parser library, namely Jericho HTML Parser (http://jerichohtml.sourceforge.net). Using this parser it can better handle broken html (i.e. missing quotes) and also offers improved extraction of HTML form URLs (not only extracting the action of a form, but also its default values). Unfortunately this parser also has one major drawback: it has to read the whole document into memory for parsing, and thus has an inherent OOME risk. This OOME risk can be reduced/eliminated by limiting the size of documents to be parsed (i.e. using NotExceedsDocumentLengthTresholdDecideRule). Also note that this extractor seems to have a lower overall memory consumption compared to ExtractorHTML. (Still to be confirmed on a larger-scale crawl.)
<bean id="jerichoExtractorHTML" class="org.archive.modules.extractor.JerichoExtractorHTML">
</bean>
ExtractorHTMLForms¶
Extracts extra information about FORMs in HTML, loading this
into the CrawlURI (for potential later use by FormLoginProcessor)
and adding a small annotation to the crawl.log. Must come after ExtractorHTML, as it relies on information left
in the CrawlURI’s A_FORM_OFFSETS data key. By default (with ‘extractAllForms’ equal false), only
saves-to-CrawlURI and annotates forms that appear to be login
forms, by the test HTMLForm.seemsLoginForm().
Typical CXML configuration would be, first, as top-level named beans:
<bean id="extractorHTMLForms" class="org.archive.modules.forms.ExtractorHTMLForms">
 <property name="extractAllForms" value="false" />
</bean>
<bean id="formLoginProcessor" class="org.archive.modules.forms.FormLoginProcessor">
 <!-- generally these are overlaid with sheets rather than set directly -->
 <property name="applicableSurtPrefix" value="" />
 <property name="loginUsername" value="" />
 <property name="loginPassword" value="" />
</bean>
Then, inside the fetch chain, after all other extractors:
<!-- ...ALL USUAL PREPROCESSORS/FETCHERS/EXTRACTORS HERE, THEN... -->
<ref bean="extractorHTMLForms"/>
<bean id="extractorHTMLForms" class="org.archive.modules.forms.ExtractorHTMLForms">
<!-- <property name="extractAllForms" value="false" /> -->
</bean>
- extractAllForms
- (boolean) If true, report all FORMs. If false, report only those that appear to be a login-enabling FORM. Default is false.
ExtractorHTTP¶
Extracts URIs from HTTP response headers.
<bean id="extractorHTTP" class="org.archive.modules.extractor.ExtractorHTTP">
<!-- <property name="inferRootPage" value="false" /> -->
</bean>
- inferRootPage
- (boolean) should all HTTP URIs be used to infer a link to the site’s root?
ExtractorImpliedURI¶
An extractor for finding ‘implied’ URIs inside other URIs. If the
‘trigger’ regex is matched, a new URI will be constructed from the
‘build’ replacement pattern. Unlike most other extractors, this works on URIs discovered by
previous extractors. Thus it should appear near the end of any
set of extractors. Initially, only finds absolute HTTP(S) URIs in query-string or its
parameters.
<bean id="extractorImpliedURI" class="org.archive.modules.extractor.ExtractorImpliedURI">
<!-- <property name="regex" value="" /> -->
<!-- <property name="format" value="" /> -->
<!-- <property name="removeTriggerUris" value="false" /> -->
</bean>
- regex
- (Pattern) Triggering regular expression. When a discovered URI matches this pattern, the ‘implied’ URI will be built. The capturing groups of this expression are available for the build replacement pattern.
- format
- (String) Replacement pattern to build ‘implied’ URI, using captured groups of trigger expression.
- removeTriggerUris
- (boolean) If true, all URIs that match trigger regular expression are removed from the list of extracted URIs. Default is false.
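As a hypothetical example (the site and URI layout are invented for illustration, and Java-style $n group references in the build pattern are assumed), the following builds a direct image URI whenever a matching viewer URI is discovered:
<bean id="extractorImpliedURI" class="org.archive.modules.extractor.ExtractorImpliedURI">
 <!-- hypothetical: http://example.com/viewer?img=1234 implies the image below -->
 <property name="regex" value="^https?://example\.com/viewer\?img=(\d+)$" />
 <property name="format" value="http://example.com/images/$1.jpg" />
</bean>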
ExtractorJS¶
Processes Javascript files for strings that are likely to be
crawlable URIs.
<bean id="extractorJS" class="org.archive.modules.extractor.ExtractorJS">
</bean>
KnowledgableExtractorJS (contrib)¶
A subclass of ExtractorJS that has some customized behavior for specific kinds of web pages. As of April 2015, the one special behavior it has is for drupal generated pages. See https://webarchive.jira.com/browse/ARI-4190
<bean id="knowledgableExtractorJS" class="org.archive.modules.extractor.KnowledgableExtractorJS">
</bean>
ExtractorMultipleRegex¶
An extractor that uses regular expressions to find strings in the fetched
content of a URI, and constructs outlink URIs from those strings.
The crawl operator configures these parameters:
- uriRegex: a regular expression to match against the url
- contentRegexes: a map of named regular expressions { name => regex } to run against the content
- template: the template for constructing the outlinks
The URI is checked against uriRegex. The match is done using
Matcher#matches(), so the full URI string must match, not just a
substring. If it does match, then the matching groups are available to the
URI-building template as ${uriRegex[n]}. If it does not match,
processing of the URI is finished and no outlinks are extracted.
Then the extractor looks for matches for each of the contentRegexes
in the fetched content. If any of the regular
expressions produce no matches, processing of the URI is finished and no
outlinks are extracted. If at least one match is found for each regular
expression, then an outlink is constructed, using the URI-building template,
for every combination of matches. The matching groups are available to the
template as ${name[n]}.
Outlinks are constructed using the URI-building template.
Variable interpolation using the familiar ${…} syntax is supported. The
template is evaluated for each combination of regular expression matches
found, and the matching groups are available to the template as
${regexName[n]}. An example template might look like:
http://example.org/${uriRegex[1]}/foo?bar=${myContentRegex[0]}
The template is evaluated as a Groovy Template, so further capabilities
beyond simple variable interpolation are available.
<bean id="extractorMultipleRegex" class="org.archive.modules.extractor.ExtractorMultipleRegex">
<!-- <property name="uriRegex" value="" /> -->
<!-- <property name="contentRegexes" value="" /> -->
<!-- <property name="template" value="" /> -->
</bean>
- uriRegex
- (String) Regular expression against which to match the URI. If the URI matches, then the matching groups are available to the URI-building template as ${uriRegex[n]}. If it does not match, processing of this URI is finished and no outlinks are extracted.
- contentRegexes
- (Map) A map of { name => regex }. The extractor looks for matches for each regular expression in the content of the URI being processed. If any of the regular expressions produce no matches, processing of the URI is finished and no outlinks are extracted. If at least one match is found for each regular expression, then an outlink is constructed for every combination of matches. The matching groups are available to the URI-building template as ${name[n]}.
- template
- (String) URI-building template. Provides variable interpolation using the familiar ${…} syntax. The template is evaluated for each combination of regular expression matches found, and the matching groups are available to the template as ${regexName[n]}. An example template might look like: http://example.org/${uriRegex[1]}/foo?bar=${myContentRegex[0]}. The template is evaluated as a Groovy Template, so further capabilities beyond simple variable interpolation are available.
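A hypothetical configuration tying the three parameters together (the site, regexes, and the photoId name are invented for illustration): for a page like http://example.com/gallery/42 whose body contains photo_id=7, the template below would construct the outlink http://example.com/gallery/42/photo/7, and one outlink per distinct photo_id match:
<bean id="extractorMultipleRegex" class="org.archive.modules.extractor.ExtractorMultipleRegex">
 <!-- hypothetical: applies only to gallery pages -->
 <property name="uriRegex" value="^https?://example\.com/gallery/(\d+)$" />
 <property name="contentRegexes">
  <map>
   <!-- find photo ids in the page body -->
   <entry key="photoId" value="photo_id=(\d+)" />
  </map>
 </property>
 <!-- group 1 of uriRegex and group 1 of photoId feed the template -->
 <property name="template" value="http://example.com/gallery/${uriRegex[1]}/photo/${photoId[1]}" />
</bean>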
ExtractorPDF¶
Allows the caller to process a CrawlURI representing a PDF for the purpose of extracting URIs
<bean id="extractorPDF" class="org.archive.modules.extractor.ExtractorPDF">
<!-- <property name="maxSizeToParse" value="" /> -->
</bean>
- maxSizeToParse
- (long) The maximum size of PDF files to consider. PDFs larger than this maximum will not be searched for links.
ExtractorPDFContent (contrib)¶
PDF Content Extractor. This will parse the text content of a PDF and apply a
regex to search for links within the body of the text. Requires itextpdf jar: http://repo1.maven.org/maven2/com/itextpdf/itextpdf/5.5.0/itextpdf-5.5.0.jar
<bean id="extractorPDFContent" class="org.archive.modules.extractor.ExtractorPDFContent">
<!-- <property name="maxSizeToParse" value="" /> -->
</bean>
- maxSizeToParse
- (long) The maximum size of PDF files to consider. PDFs larger than this maximum will not be searched for links.
ExtractorRobotsTxt¶
<bean id="extractorRobotsTxt" class="org.archive.modules.extractor.ExtractorRobotsTxt">
</bean>
ExtractorSitemap¶
<bean id="extractorSitemap" class="org.archive.modules.extractor.ExtractorSitemap">
<!-- <property name="urlPattern" value="null" /> -->
<!-- <property name="enableLenientExtraction" value="false" /> -->
</bean>
- urlPattern
- (String) If urlPattern is not null then any url marked as a sitemap and matching the pattern is assumed to be a sitemap. Otherwise the mime-type is checked (must be “text/xml” or “application/xml”) and the file is “sniffed” for the expected start of a sitemap file.
- enableLenientExtraction
- (boolean) If true, all urls in the sitemap file are extracted, regardless of whether or not they obey the scoping rules specified in the sitemap protocol (https://www.sitemaps.org/protocol.html).
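For example, a sketch that accepts URLs marked as sitemaps whenever they match a pattern, skipping the mime-type check and content sniffing (the pattern is illustrative):
<bean id="extractorSitemap" class="org.archive.modules.extractor.ExtractorSitemap">
 <!-- illustrative pattern; null (the default) falls back to mime-type checks -->
 <property name="urlPattern" value=".*sitemap.*\.xml$" />
</bean>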
ExtractorSWF¶
Extracts URIs from SWF (flash/shockwave) files. To test, here is a link to an swf that has links
embedded inside of it: http://www.hitspring.com/index.swf.
<bean id="extractorSWF" class="org.archive.modules.extractor.ExtractorSWF">
<!-- <property name="extractorJS" value="" /> -->
</bean>
- extractorJS
- (ExtractorJS) Javascript extractor to use to process inline javascript. Autowired if available. If null, links will not be extracted from inline javascript.
ExtractorUniversal¶
A last ditch extractor that will look at the raw byte code and try to extract
anything that looks like a link. If used, it should always be specified as the last link extractor in the
order file.
To accomplish this it will scan through the byte code and try to build up
strings of consecutive bytes that all represent characters that are valid
in a URL (see #isURLableChar(int) for details).
Once it hits the end of such a string (i.e. finds a character that
should not be in a URL) it will try to determine if it has found a URL.
This is done by seeing if the string is an IP address prefixed with
http(s):// or contains a dot followed by a Top Level Domain and end of
string or a slash.
<bean id="extractorUniversal" class="org.archive.modules.extractor.ExtractorUniversal">
<!-- <property name="maxSizeToParse" value="" /> -->
</bean>
- maxSizeToParse
- (long) How deep to look into files for URI strings, in bytes.
ExtractorURI¶
An extractor for finding URIs inside other URIs. Unlike most other
extractors, this works on URIs discovered by previous extractors. Thus
it should appear near the end of any set of extractors. Initially, only finds absolute HTTP(S) URIs in query-string or its
parameters.
<bean id="extractorURI" class="org.archive.modules.extractor.ExtractorURI">
</bean>
ExtractorXML¶
A simple extractor which finds HTTP URIs inside XML/RSS files,
inside attribute values and simple elements (those with only
whitespace + HTTP URI + whitespace as contents).
<bean id="extractorXML" class="org.archive.modules.extractor.ExtractorXML">
</bean>
ExtractorYoutubeDL (contrib)¶
Extracts links to media by running youtube-dl in a subprocess. Runs only on
html.
Also implements WARCRecordBuilder to write youtube-dl json to the
warc.
To use ExtractorYoutubeDL, add this top-level bean:
<bean id="extractorYoutubeDL" class="org.archive.modules.extractor.ExtractorYoutubeDL"/>
Then add <ref bean="extractorYoutubeDL"/> to the end of the
fetch chain, and to the end of the warc writer chain.
Keeps a log of containing pages and media captured as a result of youtube-dl
extraction. The format of the log is as follows:
[timestamp] [media-http-status] [media-length] [media-mimetype] [media-digest] [media-timestamp] [media-url] [annotation] [containing-page-digest] [containing-page-timestamp] [containing-page-url] [seed-url]
For containing pages, all of the media-* fields have the value “-”, and the
annotation field looks like “youtube-dl:3”, meaning that ExtractorYoutubeDL
extracted 3 media links from the page.
For media, the annotation field looks like “youtube-dl:1/3”, meaning
this is the first of three media links extracted from the containing page.
The intention is to use this for playback. The rest of the fields included in
the log were also chosen to support creation of an index of media by
containing page, to be used for playback.
<bean id="extractorYoutubeDL" class="org.archive.modules.extractor.ExtractorYoutubeDL">
<!-- <property name="crawlerLoggerModule" value="" /> -->
</bean>
- crawlerLoggerModule
- (CrawlerLoggerModule)
ExtractorYoutubeFormatStream (contrib)¶
Youtube stream URI extractor. This will check the content of the youtube
watch page looking for the url_encoded_fmt_stream_map json value. The json
object is decoded and the stream URIs are constructed and queued.
Enable via sheet for http://(com,youtube,
Itag reference:
- 38 MP4 3072p (Discontinued)
- 37 MP4 1080p (Discontinued)
- 22 MP4 720p
- 18 MP4 270p/360p
- 35 FLV 480p (Discontinued)
- 34 FLV 360p (Discontinued)
- 36 3GP 240p
- 5 FLV 240p
- 17 3GP 144p
<bean id="extractorYoutubeFormatStream" class="org.archive.modules.extractor.ExtractorYoutubeFormatStream">
<!-- <property name="extractLimit" value="1" /> -->
<!-- <property name="itagPriority" value="" /> -->
</bean>
- extractLimit
- (Integer) Maximum number of video urls to extract. A value of 0 means extract all discovered video urls. Default is 1.
- itagPriority
- (List) Itag priority list. Youtube itag parameter specifies the video and audio format and quality. The default is an empty list, which tells the extractor to extract up to extractLimit video urls. When the list is not empty, only video urls with itag values in the list are extracted.
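Tying extractLimit and itagPriority together: with the sketch below (itag values taken from the reference list above), the extractor would queue at most one stream URL, preferring 720p MP4 (itag 22) and falling back to 270p/360p MP4 (itag 18):
<bean id="extractorYoutubeFormatStream" class="org.archive.modules.extractor.ExtractorYoutubeFormatStream">
 <property name="extractLimit" value="1" />
 <property name="itagPriority">
  <list>
   <value>22</value>
   <value>18</value>
  </list>
 </property>
</bean>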
ExtractorYoutubeChannelFormatStream (contrib)¶
<bean id="extractorYoutubeChannelFormatStream" class="org.archive.modules.extractor.ExtractorYoutubeChannelFormatStream">
</bean>
TrapSuppressExtractor¶
Pseudo-extractor that suppresses link-extraction of likely trap pages, by noticing when content’s digest is identical to that of its ‘via’.
<bean id="trapSuppressExtractor" class="org.archive.modules.extractor.TrapSuppressExtractor">
</bean>
Post-Processors¶
CandidatesProcessor¶
Processor which sends all candidate outlinks through the CandidateChain, scheduling those with non-negative status codes to the frontier. Also performs special handling for ‘discovered seeds’ – URIs, as with redirects from seeds, that may deserve special treatment to expand the scope.
<bean id="candidatesProcessor" class="org.archive.crawler.postprocessor.CandidatesProcessor">
<!-- <property name="candidateChain" value="" /> -->
<!-- <property name="frontier" value="" /> -->
<!-- <property name="loggerModule" value="" /> -->
<!-- <property name="seedsRedirectNewSeeds" value="true" /> -->
<!-- <property name="seedsRedirectNewSeedsAllowTLDs" value="true" /> -->
<!-- <property name="processErrorOutlinks" value="false" /> -->
<!-- <property name="seeds" value="" /> -->
<!-- <property name="sheetOverlaysManager" value="" /> -->
</bean>
- candidateChain
- (CandidateChain) Candidate chain
- frontier
- (Frontier) The frontier to use.
- loggerModule
- (CrawlerLoggerModule)
- seedsRedirectNewSeeds
- (boolean) If enabled, any URL found because a seed redirected to it (original seed returned 301 or 302), will also be treated as a seed, as long as the hop count is less than {@value #SEEDS_REDIRECT_NEW_SEEDS_MAX_HOPS}.
- seedsRedirectNewSeedsAllowTLDs
- (boolean) If enabled, any URL found because a seed redirected to it (original seed returned 301 or 302), will also be treated as a seed, as long as the hop count is less than {@value #SEEDS_REDIRECT_NEW_SEEDS_MAX_HOPS}.
- processErrorOutlinks
- (boolean) If true, outlinks from status codes <200 and >=400 will be sent through candidates processing. Default is false.
- seeds
- (SeedModule)
- sheetOverlaysManager
- (SheetOverlaysManager)
DispositionProcessor¶
A step, late in the processing of a CrawlURI, for marking-up the
CrawlURI with values to affect frontier disposition, and updating
information that may have been affected by the fetch. This includes
robots info and other stats. (Formerly called CrawlStateUpdater, when it did less.)
<bean id="dispositionProcessor" class="org.archive.crawler.postprocessor.DispositionProcessor">
<!-- <property name="serverCache" value="" /> -->
<!-- <property name="delayFactor" value="5.0f" /> -->
<!-- <property name="minDelayMs" value="3000" /> -->
<!-- <property name="respectCrawlDelayUpToSeconds" value="300" /> -->
<!-- <property name="maxDelayMs" value="30000" /> -->
<!-- <property name="maxPerHostBandwidthUsageKbSec" value="0" /> -->
<!-- <property name="forceRetire" value="false" /> -->
<!-- <property name="metadata" value="" /> -->
</bean>
- serverCache
- (ServerCache)
- delayFactor
- (float) How many multiples of last fetch elapsed time to wait before recontacting same server.
- minDelayMs
- (int) always wait this long after one completion before recontacting same server, regardless of multiple
- respectCrawlDelayUpToSeconds
- (int) Whether to respect a ‘Crawl-Delay’ (in seconds) given in a site’s robots.txt
- maxDelayMs
- (int) never wait more than this long, regardless of multiple
- maxPerHostBandwidthUsageKbSec
- (int) maximum per-host bandwidth usage
- forceRetire
- (boolean) Whether to set a CrawlURI’s force-retired directive, retiring its queue when it finishes. Mainly intended for URI-specific overlay settings; setting true globally will just retire all queues after they offer one URI, rapidly ending a crawl.
- metadata
- (CrawlMetadata) Auto-discovered module providing configured (or overridden) User-Agent value and RobotsHonoringPolicy
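To make the delay arithmetic concrete: with the defaults shown above, if the last fetch from a host took 800 ms, the next request to that host waits delayFactor × 800 ms = 4000 ms, clamped between minDelayMs (3000 ms) and maxDelayMs (30000 ms); a robots.txt ‘Crawl-Delay’ is respected up to respectCrawlDelayUpToSeconds. A sketch of a more conservative politeness configuration:
<bean id="disposition" class="org.archive.crawler.postprocessor.DispositionProcessor">
 <!-- wait 10x the last fetch duration, but between 5 s and 60 s -->
 <property name="delayFactor" value="10.0" />
 <property name="minDelayMs" value="5000" />
 <property name="maxDelayMs" value="60000" />
</bean>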
ReschedulingProcessor¶
The simplest forced-rescheduling step possible: use a local setting (perhaps overlaid to vary based on the URI) to set an exact future reschedule time, as a delay from now. Unless the rescheduleDelaySeconds value is changed from its default, URIs are not rescheduled.
<bean id="reschedulingProcessor" class="org.archive.crawler.postprocessor.ReschedulingProcessor">
<!-- <property name="rescheduleDelaySeconds" value="1" /> -->
</bean>
- rescheduleDelaySeconds
- (long) amount of time to wait before forcing a URI to be rescheduled; the default of -1 means “don’t reschedule”
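For example, a sketch that forces every URI it processes to be re-fetched roughly once a day; in practice this would usually be overlaid via a sheet for selected URIs rather than set globally:
<bean id="reschedulingProcessor" class="org.archive.crawler.postprocessor.ReschedulingProcessor">
 <!-- 86400 seconds = 24 hours between forced revisits -->
 <property name="rescheduleDelaySeconds" value="86400" />
</bean>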
WARCWriterChainProcessor¶
WARC writer processor. The types of records to be written can be
configured by including or excluding WARCRecordBuilder
implementations (see #setChain(List)). This is the default chain:
<property name="chain">
 <list>
  <bean class="org.archive.modules.warc.DnsResponseRecordBuilder"/>
  <bean class="org.archive.modules.warc.HttpResponseRecordBuilder"/>
  <bean class="org.archive.modules.warc.WhoisResponseRecordBuilder"/>
  <bean class="org.archive.modules.warc.FtpControlConversationRecordBuilder"/>
  <bean class="org.archive.modules.warc.FtpResponseRecordBuilder"/>
  <bean class="org.archive.modules.warc.RevisitRecordBuilder"/>
  <bean class="org.archive.modules.warc.HttpRequestRecordBuilder"/>
  <bean class="org.archive.modules.warc.MetadataRecordBuilder"/>
 </list>
</property>
Replaces WARCWriterProcessor.
<bean id="wARCWriterChainProcessor" class="org.archive.modules.writer.WARCWriterChainProcessor">
<!-- <property name="chain" value="" /> -->
</bean>
- chain
- (List)
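For example, a sketch of a trimmed chain that writes only DNS records, HTTP responses, and revisit records, omitting the request, metadata, FTP, and WHOIS builders from the default chain above:
<bean id="wARCWriterChainProcessor" class="org.archive.modules.writer.WARCWriterChainProcessor">
 <property name="chain">
  <list>
   <bean class="org.archive.modules.warc.DnsResponseRecordBuilder" />
   <bean class="org.archive.modules.warc.HttpResponseRecordBuilder" />
   <bean class="org.archive.modules.warc.RevisitRecordBuilder" />
  </list>
 </property>
</bean>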