Bean Reference

Note

This reference is a work in progress and does not yet cover all available beans. For a more complete list of Heritrix beans please refer to the javadoc.

Core Beans

ActionDirectory

Directory watched for new files. Depending on their extension, new files are processed with regard to the current crawl and then renamed with a datestamp into the ‘done’ directory.

Currently supports:

- .seeds(.gz): add each URI found in the file as a new seed (to be crawled if not already; to affect scope if appropriate).
- (.s).recover(.gz): treat as a traditional recovery log: consider all ‘Fs’-tagged lines included, then try rescheduling all ‘F+’-tagged lines. (If ‘.s.’ is present, apply scoping to URIs before including/scheduling.)
- (.s).include(.gz): add each URI found in a recover-log-like file (regardless of its tagging) to the frontier’s alreadyIncluded filter, preventing them from being recrawled. (‘.s.’ indicates to apply scoping.)
- (.s).schedule(.gz): add each URI found in a recover-log-like file (regardless of its tagging) to the frontier’s queues. (‘.s.’ indicates to apply scoping.)

Future support planned:

- .robots: invalidate robots ASAP
- (?) .block: block-all on named site(s)
- .overlay: add new overlay settings
- .js, .rb, .bsh, etc.: execute arbitrary script (a la ScriptedProcessor)

<bean id="actionDirectory" class="org.archive.crawler.framework.ActionDirectory">
  <!-- <property name="initialDelaySeconds" value="10" /> -->
  <!-- <property name="delaySeconds" value="30" /> -->
  <!-- <property name="actionDir" value="" /> -->
  <!-- <property name="doneDir" value="" /> -->
  <!-- <property name="applicationContext" value="" /> -->
  <!-- <property name="seeds" value="" /> -->
  <!-- <property name="frontier" value="" /> -->
</bean>
initialDelaySeconds
(int) how long after crawl start to first scan action directory
delaySeconds
(int) delay between scans of actionDirectory for new files
actionDir
(ConfigPath)
doneDir
(ConfigPath)
applicationContext
(ApplicationContext)
seeds
(SeedModule)
frontier
(Frontier) autowired frontier for actions

BdbCookieStore

Cookie store using bdb for storage. Cookies are stored in a SortedMap keyed by #sortableKey(Cookie), so they are grouped together by domain. #cookieStoreFor(String) returns a facade whose CookieStore#getCookies() returns a list of cookies limited to the supplied host and parent domains, if applicable.

<bean id="bdbCookieStore" class="org.archive.modules.fetcher.BdbCookieStore">
  <!-- <property name="bdbModule" value="" /> -->
  <!-- <property name="recoveryCheckpoint" value="" /> -->
</bean>
bdbModule
(BdbModule)
recoveryCheckpoint
(Checkpoint)

BdbFrontier

A Frontier using several BerkeleyDB JE Databases to hold its record of known hosts (queues), and pending URIs.

<bean id="bdbFrontier" class="org.archive.crawler.frontier.BdbFrontier">
  <!-- <property name="bdbModule" value="" /> -->
  <!-- <property name="beanName" value="" /> -->
  <!-- <property name="dumpPendingAtClose" value="false" /> -->
  <!-- <property name="recoveryCheckpoint" value="" /> -->
</bean>
bdbModule
(BdbModule)
beanName
(String)
dumpPendingAtClose
(boolean)
recoveryCheckpoint
(Checkpoint)

BdbModule

Utility module for managing a shared BerkeleyDB-JE environment

<bean id="bdbModule" class="org.archive.bdb.BdbModule">
  <!-- <property name="dir" value="" /> -->
  <!-- <property name="cachePercent" value="1" /> -->
  <!-- <property name="cacheSize" value="1" /> -->
  <!-- <property name="useSharedCache" value="true" /> -->
  <!-- <property name="maxLogFileSize" value="10000000" /> -->
  <!-- <property name="expectedConcurrency" value="64" /> -->
  <!-- <property name="cleanerThreads" value="" /> -->
  <!-- <property name="evictorCoreThreads" value="1" /> -->
  <!-- <property name="evictorMaxThreads" value="1" /> -->
  <!-- <property name="useHardLinkCheckpoints" value="true" /> -->
  <!-- <property name="recoveryCheckpoint" value="" /> -->
</bean>
dir
(ConfigPath)
cachePercent
(int)
cacheSize
(int)
useSharedCache
(boolean)
maxLogFileSize
(long)
expectedConcurrency
(int) Expected number of concurrent threads; used to tune nLockTables according to JE FAQ http://www.oracle.com/technology/products/berkeley-db/faq/je_faq.html#33
cleanerThreads
(int)
evictorCoreThreads
(int) Configure the number of evictor threads (-1 means use the default) https://docs.oracle.com/cd/E17277_02/html/java/com/sleepycat/je/EnvironmentConfig.html#EVICTOR_CORE_THREADS
evictorMaxThreads
(int) Configure the maximum number of evictor threads (-1 means use the default) https://docs.oracle.com/cd/E17277_02/html/java/com/sleepycat/je/EnvironmentConfig.html#EVICTOR_MAX_THREADS
useHardLinkCheckpoints
(boolean) Whether to use hard-links to log files to collect/retain the BDB log files needed for a checkpoint. Default is true. May not work on Windows (especially on pre-NTFS filesystems). If false, the BDB ‘je.cleaner.expunge’ value will be set to ‘false’ as well, meaning BDB will *not* delete obsolete JDB files, but only rename them to ‘.DEL’. They will have to be manually deleted to free disk space, but .DEL files referenced in any checkpoint’s ‘jdbfiles.manifest’ should be retained to keep the checkpoint valid.
recoveryCheckpoint
(Checkpoint)

BdbServerCache

ServerCache backed by BDB big maps; the usual choice for crawls.

<bean id="bdbServerCache" class="org.archive.modules.net.BdbServerCache">
  <!-- <property name="bdbModule" value="" /> -->
  <!-- <property name="recoveryCheckpoint" value="" /> -->
</bean>
bdbModule
(BdbModule)
recoveryCheckpoint
(Checkpoint)

BdbUriUniqFilter

A BDB implementation of an AlreadySeen list.

This implementation performs adequately without blowing out the heap. See AlreadySeen.

Makes keys that have URIs from the same server close to each other. Mercator, and section 2.3.5 ‘Eliminating Already-Visited URLs’ in ‘Mining the Web’ by Soumen Chakrabarti, describe a two-level key with the first 24 bits a hash of the host plus port and the last 40 a hash of the path. Testing showed adoption of such a scheme halving lookup times (this implementation actually concatenates scheme + host in the first 24 bits and path + query in the trailing 40 bits).

<bean id="bdbUriUniqFilter" class="org.archive.crawler.util.BdbUriUniqFilter">
  <!-- <property name="bdbModule" value="" /> -->
  <!-- <property name="beanName" value="" /> -->
  <!-- <property name="recoveryCheckpoint" value="" /> -->
</bean>
bdbModule
(BdbModule)
beanName
(String)
recoveryCheckpoint
(Checkpoint)

CheckpointService

Executes checkpoints, and offers convenience methods for enumerating available Checkpoints and injecting a recovery-Checkpoint after build and before launch (setRecoveryCheckpointByName).

Offers optional automatic checkpointing at a configurable interval in minutes.

<bean id="checkpointService" class="org.archive.crawler.framework.CheckpointService">
  <!-- <property name="checkpointsDir" value="" /> -->
  <!-- <property name="checkpointIntervalMinutes" value="1" /> -->
  <!-- <property name="forgetAllButLatest" value="false" /> -->
  <!-- <property name="recoveryCheckpoint" value="" /> -->
  <!-- <property name="crawlController" value="" /> -->
  <!-- <property name="applicationContext" value="" /> -->
  <!-- <property name="recoveryCheckpointByName" value="" /> -->
</bean>
checkpointsDir
(ConfigPath) Checkpoints directory
checkpointIntervalMinutes
(long) Period at which to create automatic checkpoints; -1 means no auto checkpointing.
forgetAllButLatest
(boolean) True to save only the latest checkpoint, false to save all of them. Default is false.
recoveryCheckpoint
(Checkpoint)
crawlController
(CrawlController)
applicationContext
(ApplicationContext)
recoveryCheckpointByName
(String) Given the name of a valid checkpoint subdirectory in the checkpoints directory, create a Checkpoint instance, and insert it into all Checkpointable beans.

CrawlController

CrawlController collects all the classes which cooperate to perform a crawl and provides a high-level interface to the running crawl.

As the “global context” for a crawl, subcomponents will often reach each other through the CrawlController.

<bean id="crawlController" class="org.archive.crawler.framework.CrawlController">
  <!-- <property name="applicationContext" value="" /> -->
  <!-- <property name="metadata" value="" /> -->
  <!-- <property name="serverCache" value="" /> -->
  <!-- <property name="frontier" value="" /> -->
  <!-- <property name="scratchDir" value="" /> -->
  <!-- <property name="statisticsTracker" value="" /> -->
  <!-- <property name="seeds" value="" /> -->
  <!-- <property name="fetchChain" value="" /> -->
  <!-- <property name="dispositionChain" value="" /> -->
  <!-- <property name="candidateChain" value="" /> -->
  <!-- <property name="maxToeThreads" value="" /> -->
  <!-- <property name="runWhileEmpty" value="false" /> -->
  <!-- <property name="pauseAtStart" value="true" /> -->
  <!-- <property name="recorderOutBufferBytes" value="" /> -->
  <!-- <property name="recorderInBufferBytes" value="" /> -->
  <!-- <property name="loggerModule" value="" /> -->
  <!-- <property name="recoveryCheckpoint" value="" /> -->
</bean>
applicationContext
(ApplicationContext)
metadata
(CrawlMetadata)
serverCache
(ServerCache)
frontier
(Frontier) The frontier to use for the crawl.
scratchDir
(ConfigPath) Scratch directory for temporary overflow-to-disk
statisticsTracker
(StatisticsTracker) Statistics tracking modules. Any number of specialized statistics trackers that monitor a crawl and write logs, reports and/or provide information to the user interface.
seeds
(SeedModule)
fetchChain
(FetchChain) Fetch chain
dispositionChain
(DispositionChain) Disposition chain
candidateChain
(CandidateChain) Candidate chain
maxToeThreads
(int) Maximum number of threads processing URIs at the same time.
runWhileEmpty
(boolean) whether to keep running (without pause or finish) when frontier is empty
pauseAtStart
(boolean) whether to pause at crawl start
recorderOutBufferBytes
(int) Size in bytes of in-memory buffer to record outbound traffic. One such buffer is reserved for every ToeThread.
recorderInBufferBytes
(int) Size in bytes of in-memory buffer to record inbound traffic. One such buffer is reserved for every ToeThread.
loggerModule
(CrawlerLoggerModule)
recoveryCheckpoint
(Checkpoint)

CrawlerLoggerModule

Module providing all expected whole-crawl logging facilities

<bean id="crawlerLoggerModule" class="org.archive.crawler.reporting.CrawlerLoggerModule">
  <!-- <property name="path" value="" /> -->
  <!-- <property name="logExtraInfo" value="false" /> -->
  <!-- <property name="crawlLogPath" value="" /> -->
  <!-- <property name="alertsLogPath" value="" /> -->
  <!-- <property name="progressLogPath" value="" /> -->
  <!-- <property name="uriErrorsLogPath" value="" /> -->
  <!-- <property name="runtimeErrorsLogPath" value="" /> -->
  <!-- <property name="nonfatalErrorsLogPath" value="" /> -->
  <!-- <property name="upSimpleLog" value="" /> -->
  <!-- <property name="recoveryCheckpoint" value="" /> -->
</bean>
path
(ConfigPath)
logExtraInfo
(boolean) Whether to include the “extra info” field for each entry in crawl.log. “Extra info” is arbitrary JSON. It is the last field of the log line.
crawlLogPath
(ConfigPath)
alertsLogPath
(ConfigPath)
progressLogPath
(ConfigPath)
uriErrorsLogPath
(ConfigPath)
runtimeErrorsLogPath
(ConfigPath)
nonfatalErrorsLogPath
(ConfigPath)
upSimpleLog
(String)
recoveryCheckpoint
(Checkpoint)

CrawlLimitEnforcer

Bean to enforce limits on the size of a crawl in URI count, byte count, or elapsed time. Relies on the StatSnapshotEvent, so checks only occur at the interval (configured in StatisticsTracker) of those events.

<bean id="crawlLimitEnforcer" class="org.archive.crawler.framework.CrawlLimitEnforcer">
  <!-- <property name="maxBytesDownload" value="0" /> -->
  <!-- <property name="maxNovelBytes" value="0" /> -->
  <!-- <property name="maxNovelUrls" value="0" /> -->
  <!-- <property name="maxWarcNovelUrls" value="0" /> -->
  <!-- <property name="maxWarcNovelBytes" value="0" /> -->
  <!-- <property name="maxDocumentsDownload" value="0" /> -->
  <!-- <property name="maxTimeSeconds" value="0" /> -->
  <!-- <property name="crawlController" value="" /> -->
</bean>
maxBytesDownload
(long) Maximum number of bytes to download. Once this number is exceeded the crawler will stop. A value of zero means no upper limit.
maxNovelBytes
(long) Maximum number of uncompressed payload bytes to write to WARC response or resource records. Once this number is exceeded the crawler will stop. A value of zero means no upper limit.
maxNovelUrls
(long) Maximum number of novel (not deduplicated) urls to download. Once this number is exceeded the crawler will stop. A value of zero means no upper limit.
maxWarcNovelUrls
(long) Maximum number of urls to write to WARC response or resource records. Once this number is exceeded the crawler will stop. A value of zero means no upper limit.
maxWarcNovelBytes
(long) Maximum number of novel (not deduplicated) bytes to write to WARC response or resource records. Once this number is exceeded the crawler will stop. A value of zero means no upper limit.
maxDocumentsDownload
(long) Maximum number of documents to download. Once this number is exceeded the crawler will stop. A value of zero means no upper limit.
maxTimeSeconds
(long) Maximum amount of time to crawl (in seconds). Once this much time has elapsed the crawler will stop. A value of zero means no upper limit.
crawlController
(CrawlController)

CrawlMetadata

Basic crawl metadata, as consulted by functional modules and recorded in ARCs/WARCs.

<bean id="crawlMetadata" class="org.archive.modules.CrawlMetadata">
  <!-- <property name="robotsPolicyName" value="obey" /> -->
  <!-- <property name="availableRobotsPolicies" value="" /> -->
  <!-- <property name="operator" value="" /> -->
  <!-- <property name="description" value="" /> -->
  <!-- <property name="userAgentTemplate" value="Mozilla/5.0 (compatible; heritrix/@VERSION@ +@OPERATOR_CONTACT_URL@)" /> -->
  <!-- <property name="operatorFrom" value="" /> -->
  <!-- <property name="operatorContactUrl" value="" /> -->
  <!-- <property name="audience" value="" /> -->
  <!-- <property name="organization" value="" /> -->
  <!-- <property name="jobName" value="" /> -->
</bean>
robotsPolicyName
(String) Robots policy name
availableRobotsPolicies
(Map) Map of all available RobotsPolicies, by name, to choose from; assembled from declared instances in configuration plus the standard ‘obey’ (aka ‘classic’) and ‘ignore’ policies.
operator
(String)
description
(String)
userAgentTemplate
(String)
operatorFrom
(String)
operatorContactUrl
(String)
audience
(String)
organization
(String)
jobName
(String)
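
As a minimal sketch of a typical configuration (all values below are illustrative placeholders, not defaults), the simple string properties can be set directly:

<bean id="crawlMetadata" class="org.archive.modules.CrawlMetadata">
  <property name="operatorContactUrl" value="http://www.example.org/crawl-info" />
  <property name="jobName" value="example-crawl" />
  <property name="description" value="Example crawl of example.org" />
  <property name="robotsPolicyName" value="obey" />
</bean>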

CredentialStore

Front door to the credential store.

Come here to get at credentials.

See Credential Store Design.

<bean id="credentialStore" class="org.archive.modules.credential.CredentialStore">
  <!-- <property name="credentials" value="" /> -->
</bean>
credentials
(Map) Credentials used by Heritrix when authenticating. See http://crawler.archive.org/proposals/auth/ for background.
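
The credentials map is keyed by an arbitrary name, with Credential beans as values. Below is a hedged sketch using an HTTP-auth credential; the class and property names (org.archive.modules.credential.HttpAuthenticationCredential with domain, realm, login, password) are assumptions to verify against the javadoc of your Heritrix version:

<bean id="credentialStore" class="org.archive.modules.credential.CredentialStore">
  <property name="credentials">
    <map>
      <entry key="exampleHttpAuth">
        <!-- assumed credential class and property names; verify before use -->
        <bean class="org.archive.modules.credential.HttpAuthenticationCredential">
          <property name="domain" value="www.example.com" />
          <property name="realm" value="exampleRealm" />
          <property name="login" value="user" />
          <property name="password" value="secret" />
        </bean>
      </entry>
    </map>
  </property>
</bean>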

DecideRuleSequence

<bean id="decideRuleSequence" class="org.archive.modules.deciderules.DecideRuleSequence">
  <!-- <property name="logToFile" value="false" /> -->
  <!-- <property name="logExtraInfo" value="false" /> -->
  <!-- <property name="loggerModule" value="" /> -->
  <!-- <property name="rules" value="" /> -->
  <!-- <property name="serverCache" value="" /> -->
  <!-- <property name="beanName" value="" /> -->
</bean>
logToFile
(boolean) If enabled, log decisions to file named logs/{spring-bean-id}.log. Format is: [timestamp] [decisive-rule-num] [decisive-rule-class] [decision] [uri] [extraInfo]

Relies on Spring Lifecycle to initialize the log. Only top-level beans get the Lifecycle treatment from Spring, so bean must be top-level for logToFile to work. (This is true of other modules that support logToFile, and anything else that uses Lifecycle, as well.)

logExtraInfo
(boolean) Whether to include the “extra info” field for each entry in crawl.log. “Extra info” is a json object with entries “host”, “via”, “source” and “hopPath”.
loggerModule
(SimpleFileLoggerProvider)
rules
(List)
serverCache
(ServerCache)
beanName
(String)
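
The rules property is a list of DecideRule beans evaluated in order. An illustrative sketch (the particular rules and their ordering are only an example, not a recommended scope):

<bean id="decideRuleSequence" class="org.archive.modules.deciderules.DecideRuleSequence">
  <property name="rules">
    <list>
      <!-- start by rejecting everything, then selectively accept -->
      <bean class="org.archive.modules.deciderules.RejectDecideRule" />
      <bean class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule" />
      <bean class="org.archive.modules.deciderules.TooManyHopsDecideRule">
        <property name="maxHops" value="20" />
      </bean>
      <bean class="org.archive.modules.deciderules.PrerequisiteAcceptDecideRule" />
    </list>
  </property>
</bean>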

DiskSpaceMonitor

Monitors the available space on the paths configured. If the available space drops below a specified threshold a crawl pause is requested.

Monitoring is done via the java.io.File.getUsableSpace() method. This method will sometimes fail on network attached storage, returning 0 bytes available even if that is not actually the case.

Paths that do not resolve to actual filesystem folders or files will not be evaluated (i.e. if java.io.File.exists() returns false no further processing is carried out on that File).

Available space on the paths is checked whenever a StatSnapshotEvent occurs.

<bean id="diskSpaceMonitor" class="org.archive.crawler.monitor.DiskSpaceMonitor">
  <!-- <property name="monitorPaths" value="" /> -->
  <!-- <property name="pauseThresholdMiB" value="8192" /> -->
  <!-- <property name="monitorConfigPaths" value="true" /> -->
  <!-- <property name="crawlController" value="" /> -->
  <!-- <property name="configPathConfigurer" value="" /> -->
</bean>
monitorPaths
(List)
pauseThresholdMiB
(long) Set the minimum amount of space that must be available on all monitored paths. If the amount falls below this pause threshold on any path the crawl will be paused.
monitorConfigPaths
(boolean) If enabled, all the paths returned by ConfigPathConfigurer#getAllConfigPaths() will be monitored in addition to any paths explicitly specified via #setMonitorPaths(List).

true by default.

Note: This is not guaranteed to contain all paths that Heritrix writes to. It is the responsibility of modules that write to disk to register their activity with the ConfigPathConfigurer and some may not do so.

crawlController
(CrawlController) Autowire access to CrawlController
configPathConfigurer
(ConfigPathConfigurer) Autowire access to ConfigPathConfigurer
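
For example, to also watch a path outside the job directory (assuming monitorPaths takes plain path strings; the path shown is illustrative):

<bean id="diskSpaceMonitor" class="org.archive.crawler.monitor.DiskSpaceMonitor">
  <property name="pauseThresholdMiB" value="16384" />
  <property name="monitorConfigPaths" value="true" />
  <property name="monitorPaths">
    <list>
      <value>/var/spool/heritrix</value>
    </list>
  </property>
</bean>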

RulesCanonicalizationPolicy

URI Canonicalization Policy

<bean id="rulesCanonicalizationPolicy" class="org.archive.modules.canonicalize.RulesCanonicalizationPolicy">
  <!-- <property name="rules" value="" /> -->
</bean>
rules
(List)

SheetOverlaysManager

Manager which marks-up CrawlURIs with the names of all applicable Sheets, and returns overlay maps by name.

<bean id="sheetOverlaysManager" class="org.archive.crawler.spring.SheetOverlaysManager">
  <!-- <property name="beanFactory" value="" /> -->
  <!-- <property name="sheetsByName" value="" /> -->
</bean>
beanFactory
(BeanFactory)
sheetsByName
(Map) Collect all Sheets, by beanName.

StatisticsTracker

This is an implementation of the AbstractTracker. It is designed to work with the WUI as well as to perform various logging activities.

At the end of each snapshot a line is written to the ‘progress-statistics.log’ file.

The header of that file is as follows:

 [timestamp] [discovered]    [queued] [downloaded] [doc/s(avg)]  [KB/s(avg)] [dl-failures] [busy-thread] [mem-use-KB]
First there is a timestamp, accurate down to 1 second.

discovered, queued, downloaded and dl-failures are (respectively) the discovered URI count, pending URI count, successfully fetched count and failed fetch count from the frontier at the time of the snapshot.

KB/s(avg) is the bandwidth usage. We use the total bytes downloaded to calculate average bandwidth usage (KB/sec). Since we also note the value each time a snapshot is made, we can calculate the average bandwidth usage during the last snapshot period to gain a “current” rate. The first number is the current rate and the average is in parentheses.

doc/s(avg) works the same way, except it shows the number of documents (URIs) rather than KB downloaded.

busy-threads is the total number of ToeThreads that are not available (and thus presumably busy processing a URI). This information is extracted from the crawl controller.

Finally mem-use-KB is extracted from the run time environment (Runtime.getRuntime().totalMemory()).

In addition to the data collected for the above logs, various other data is gathered and stored by this tracker:

- Successfully downloaded documents per fetch status code
- Successfully downloaded documents per document mime type
- Amount of data per mime type
- Successfully downloaded documents per host
- Amount of data per host
- Disposition of all seeds (this is written to ‘reports.log’ at end of crawl)
- Successfully downloaded documents per host per source

<bean id="statisticsTracker" class="org.archive.crawler.reporting.StatisticsTracker">
  <!-- <property name="seeds" value="" /> -->
  <!-- <property name="bdbModule" value="" /> -->
  <!-- <property name="reportsDir" value="" /> -->
  <!-- <property name="serverCache" value="" /> -->
  <!-- <property name="liveHostReportSize" value="20" /> -->
  <!-- <property name="applicationContext" value="" /> -->
  <!-- <property name="trackSeeds" value="true" /> -->
  <!-- <property name="trackSources" value="true" /> -->
  <!-- <property name="intervalSeconds" value="20" /> -->
  <!-- <property name="keepSnapshotsCount" value="5" /> -->
  <!-- <property name="crawlController" value="" /> -->
  <!-- <property name="reports" value="" /> -->
  <!-- <property name="beanName" value="" /> -->
  <!-- <property name="recoveryCheckpoint" value="" /> -->
</bean>
seeds
(SeedModule)
bdbModule
(BdbModule)
reportsDir
(ConfigPath)
serverCache
(ServerCache)
liveHostReportSize
(int)
applicationContext
(ApplicationContext)
trackSeeds
(boolean) Whether to maintain seed disposition records (expensive in crawls with millions of seeds)
trackSources
(boolean) Whether to maintain hosts-per-source-tag records; very expensive in crawls with large numbers of source-tags (seeds) or large crawls over many hosts
intervalSeconds
(int) The interval between writing progress information to log.
keepSnapshotsCount
(int) Number of crawl-stat sample snapshots to keep for calculation purposes.
crawlController
(CrawlController)
reports
(List)
beanName
(String)
recoveryCheckpoint
(Checkpoint)
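
The reports property, when set explicitly, is a list of Report beans. The class names below are assumptions drawn from the org.archive.crawler.reporting package and should be checked against the javadoc before use:

<bean id="statisticsTracker" class="org.archive.crawler.reporting.StatisticsTracker">
  <property name="intervalSeconds" value="20" />
  <property name="reports">
    <list>
      <!-- assumed report classes; see org.archive.crawler.reporting for the full set -->
      <bean class="org.archive.crawler.reporting.CrawlSummaryReport" />
      <bean class="org.archive.crawler.reporting.SeedsReport" />
      <bean class="org.archive.crawler.reporting.HostsReport" />
    </list>
  </property>
</bean>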

TextSeedModule

Module that announces a list of seeds from a text source (such as a ConfigFile or ConfigString), and provides a mechanism for adding seeds after a crawl has begun.

<bean id="textSeedModule" class="org.archive.modules.seeds.TextSeedModule">
  <!-- <property name="textSource" value="null" /> -->
  <!-- <property name="blockAwaitingSeedLines" value="1" /> -->
</bean>
textSource
(ReadSource) Text from which to extract seeds
blockAwaitingSeedLines
(int) Number of lines of seeds-source to read on initial load before proceeding with crawl. Default is -1, meaning all. Any other value will cause that number of lines to be loaded before fetching begins, while all extra lines continue to be processed in the background. Generally, this should only be changed when working with very large seed lists, and scopes that do *not* depend on reading all seeds.
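
textSource is a ReadSource, typically an org.archive.spring.ConfigFile pointing at the job’s seeds.txt (as in the default profile); an org.archive.spring.ConfigString can be used instead for a short inline seed list. A sketch of the file-backed form:

<bean id="textSeedModule" class="org.archive.modules.seeds.TextSeedModule">
  <property name="textSource">
    <bean class="org.archive.spring.ConfigFile">
      <property name="path" value="seeds.txt" />
    </bean>
  </property>
</bean>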

Decide Rules

AcceptDecideRule

<bean id="acceptDecideRule" class="org.archive.modules.deciderules.AcceptDecideRule">
</bean>

ClassKeyMatchesRegexDecideRule

Rule applies the configured decision to any CrawlURI whose class key – i.e. CrawlURI#getClassKey() – matches the supplied regex.

<bean id="classKeyMatchesRegexDecideRule" class="org.archive.crawler.deciderules.ClassKeyMatchesRegexDecideRule">
  <!-- <property name="crawlController" value="" /> -->
</bean>
crawlController
(CrawlController)

ContentLengthDecideRule

<bean id="contentLengthDecideRule" class="org.archive.modules.deciderules.ContentLengthDecideRule">
  <!-- <property name="contentLengthThreshold" value="" /> -->
</bean>
contentLengthThreshold
(long) Content-length threshold. The rule returns ACCEPT if the content-length is less than this threshold, or REJECT otherwise. The default is 2^63, meaning any document will be accepted.

ContentTypeMatchesRegexDecideRule

DecideRule whose decision is applied if the URI’s content-type is present and matches the supplied regular expression.

<bean id="contentTypeMatchesRegexDecideRule" class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
</bean>

ContentTypeNotMatchesRegexDecideRule

DecideRule whose decision is applied if the URI’s content-type is present and does not match the supplied regular expression.

<bean id="contentTypeNotMatchesRegexDecideRule" class="org.archive.modules.deciderules.ContentTypeNotMatchesRegexDecideRule">
</bean>

ExpressionDecideRule (contrib)

Example usage:

<bean class="org.archive.modules.deciderules.ExpressionDecideRule">
   <property name="groovyExpression" value='curi.via == null &amp;&amp; curi ==~ "^https?://(?:www\.)?(facebook|vimeo|flickr)\.com/.*"'/>
</bean>

<bean id="expressionDecideRule" class="org.archive.modules.deciderules.ExpressionDecideRule">
  <!-- <property name="groovyExpression" value="" /> -->
</bean>
groovyExpression
(String)

ExternalGeoLocationDecideRule

A rule that can be configured to take alternate implementations of the ExternalGeoLocationInterface. If no implementation specified, or none found, returns configured decision. If host in URI has been resolved checks CrawlHost for the country code determination. If country code is not present, does country lookup, and saves the country code to CrawlHost for future consultation. If country code is present in CrawlHost, compares it against the configured code. Note that if a host’s IP address changes during the crawl, we still consider the associated hostname to be in the country of its original IP address.

<bean id="externalGeoLocationDecideRule" class="org.archive.modules.deciderules.ExternalGeoLocationDecideRule">
  <!-- <property name="lookup" value="null" /> -->
  <!-- <property name="countryCodes" value="" /> -->
  <!-- <property name="serverCache" value="" /> -->
</bean>
lookup
(ExternalGeoLookupInterface)
countryCodes
(List) Country code name.
serverCache
(ServerCache)
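
A sketch of the countryCodes list (the two-letter codes shown are an assumption; the expected format depends on the configured lookup implementation, which is omitted here):

<bean id="externalGeoLocationDecideRule" class="org.archive.modules.deciderules.ExternalGeoLocationDecideRule">
  <!-- 'lookup' must reference an ExternalGeoLookupInterface implementation (not shown) -->
  <property name="countryCodes">
    <list>
      <value>US</value>
      <value>CA</value>
    </list>
  </property>
</bean>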

FetchStatusDecideRule

Rule applies the configured decision for any URI which has a fetch status equal to the ‘target-status’ setting.

<bean id="fetchStatusDecideRule" class="org.archive.modules.deciderules.FetchStatusDecideRule">
  <!-- <property name="statusCodes" value="" /> -->
</bean>
statusCodes
(List)

FetchStatusMatchesRegexDecideRule

<bean id="fetchStatusMatchesRegexDecideRule" class="org.archive.modules.deciderules.FetchStatusMatchesRegexDecideRule">
</bean>

FetchStatusNotMatchesRegexDecideRule

<bean id="fetchStatusNotMatchesRegexDecideRule" class="org.archive.modules.deciderules.FetchStatusNotMatchesRegexDecideRule">
</bean>

HasViaDecideRule

Rule applies the configured decision for any URI which has a ‘via’ (essentially, any URI that was not a seed or some kinds of mid-crawl adds).

<bean id="hasViaDecideRule" class="org.archive.modules.deciderules.HasViaDecideRule">
</bean>

HopCrossesAssignmentLevelDomainDecideRule

Applies its decision if the current URI differs from its ‘via’ URI in that portion of its hostname/domain that is assigned/sold by registrars, its ‘assignment-level-domain’ (ALD) (AKA ‘public suffix’ or, in previous Heritrix versions, ‘topmost assigned SURT’).

<bean id="hopCrossesAssignmentLevelDomainDecideRule" class="org.archive.modules.deciderules.HopCrossesAssignmentLevelDomainDecideRule">
</bean>

HopsPathMatchesRegexDecideRule

Rule applies configured decision to any CrawlURIs whose ‘hops-path’ (string like “LLXE” etc.) matches the supplied regex.

<bean id="hopsPathMatchesRegexDecideRule" class="org.archive.modules.deciderules.HopsPathMatchesRegexDecideRule">
</bean>

IdenticalDigestDecideRule

Rule applies configured decision to any CrawlURIs whose revisit profile is set with a profile matching WARCConstants#PROFILE_REVISIT_IDENTICAL_DIGEST

<bean id="identicalDigestDecideRule" class="org.archive.modules.deciderules.recrawl.IdenticalDigestDecideRule">
</bean>

IpAddressSetDecideRule

IpAddressSetDecideRule must be used with org.archive.crawler.prefetch.Preselector#setRecheckScope(boolean) set to true because it relies on Heritrix’ dns lookup to establish the ip address for a URI before it can run.


<bean class="org.archive.modules.deciderules.IpAddressSetDecideRule">
 <property name="ipAddresses">
  <set>
   <value>127.0.0.1</value>
   <value>69.89.27.209</value>
  </set>
 </property>
 <property name="decision" value="REJECT" />
</bean>

<bean id="ipAddressSetDecideRule" class="org.archive.modules.deciderules.IpAddressSetDecideRule">
  <!-- <property name="ipAddresses" value="" /> -->
  <!-- <property name="serverCache" value="" /> -->
</bean>
ipAddresses
(Set)
serverCache
(ServerCache)

MatchesFilePatternDecideRule

Compares suffix of a passed CrawlURI, UURI, or String against a regular expression pattern, applying its configured decision to all matches.

Several predefined patterns are available for convenience. Choosing ‘custom’ makes this the same as a regular MatchesRegexDecideRule.

<bean id="matchesFilePatternDecideRule" class="org.archive.modules.deciderules.MatchesFilePatternDecideRule">
  <!-- <property name="usePreset" value="" /> -->
</bean>
usePreset
(Preset)

MatchesListRegexDecideRule

Rule applies configured decision to any CrawlURIs whose String URI matches the supplied regexes.

The list of regular expressions can be considered logically AND or OR.

<bean id="matchesListRegexDecideRule" class="org.archive.modules.deciderules.MatchesListRegexDecideRule">
  <!-- <property name="timeoutPerRegexSeconds" value="0" /> -->
  <!-- <property name="regexList" value="" /> -->
  <!-- <property name="listLogicalOr" value="true" /> -->
</bean>
timeoutPerRegexSeconds
(long) The timeout for regular expression matching, in seconds. If set to 0 or negative then no timeout is specified and there is no upper limit to how long the matching may take. See the corresponding test class MatchesListRegexDecideRuleTest for a pathological example.
regexList
(List) The list of regular expressions to evaluate against the URI.
listLogicalOr
(boolean) True if the list of regular expressions should be combined with logical OR when matching (the decision applies if any regex matches). False if the list should be combined with logical AND (all regexes must match).
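
A sketch with two regexes combined with logical OR (this assumes Spring’s default string-to-Pattern conversion, as with the other regex-valued properties in this reference):

<bean id="matchesListRegexDecideRule" class="org.archive.modules.deciderules.MatchesListRegexDecideRule">
  <property name="listLogicalOr" value="true" />
  <property name="regexList">
    <list>
      <value>.*\.pdf$</value>
      <value>.*\.docx?$</value>
    </list>
  </property>
</bean>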

MatchesRegexDecideRule

Rule applies configured decision to any CrawlURIs whose String URI matches the supplied regex.

<bean id="matchesRegexDecideRule" class="org.archive.modules.deciderules.MatchesRegexDecideRule">
  <!-- <property name="regex" value="" /> -->
</bean>
regex
(Pattern)

MatchesStatusCodeDecideRule

Provides a rule that returns “true” for any CrawlURIs which have a fetch status code that falls within the provided inclusive range. For instance, to select only URIs with a “success” status code you must provide the range 200 to 299.

<bean id="matchesStatusCodeDecideRule" class="org.archive.modules.deciderules.MatchesStatusCodeDecideRule">
  <!-- <property name="lowerBound" value="" /> -->
  <!-- <property name="upperBound" value="" /> -->
</bean>
lowerBound
(Integer) Sets the lower bound on the range of acceptable status codes.
upperBound
(Integer) Sets the upper bound on the range of acceptable status codes.

NotMatchesFilePatternDecideRule

Rule applies configured decision to any URIs which do *not* match the supplied (file-pattern) regex.

<bean id="notMatchesFilePatternDecideRule" class="org.archive.modules.deciderules.NotMatchesFilePatternDecideRule">
</bean>

NotMatchesListRegexDecideRule

Rule applies configured decision to any URIs which do *not* match the supplied regex.

<bean id="notMatchesListRegexDecideRule" class="org.archive.modules.deciderules.NotMatchesListRegexDecideRule">
</bean>

NotMatchesRegexDecideRule

Rule applies configured decision to any URIs which do *not* match the supplied regex.

<bean id="notMatchesRegexDecideRule" class="org.archive.modules.deciderules.NotMatchesRegexDecideRule">
</bean>

NotMatchesStatusCodeDecideRule

Provides a rule that returns “true” for any CrawlURIs which have a fetch status code that does not fall within the provided inclusive range. For instance, to reject any URIs with a “client error” status code you must provide the range 400 to 499.

<bean id="notMatchesStatusCodeDecideRule" class="org.archive.modules.deciderules.NotMatchesStatusCodeDecideRule">
  <!-- <property name="upperBound" value="" /> -->
</bean>
upperBound
(Integer) Sets the upper bound on the range of acceptable status codes.

NotOnDomainsDecideRule

Rule applies configured decision to any URIs that are *not* in one of the domains in the configured set of domains, filled from the seed set.

<bean id="notOnDomainsDecideRule" class="org.archive.modules.deciderules.surt.NotOnDomainsDecideRule">
</bean>

NotOnHostsDecideRule

Rule applies configured decision to any URIs that are *not* on one of the hosts in the configured set of hosts, filled from the seed set.

<bean id="notOnHostsDecideRule" class="org.archive.modules.deciderules.surt.NotOnHostsDecideRule">
</bean>

NotSurtPrefixedDecideRule

Rule applies configured decision to any URIs that, when expressed in SURT form, do *not* begin with one of the prefixes in the configured set.

The set can be filled with SURT prefixes implied or listed in the seeds file, or another external file.

<bean id="notSurtPrefixedDecideRule" class="org.archive.modules.deciderules.surt.NotSurtPrefixedDecideRule">
</bean>

OnDomainsDecideRule

Rule applies configured decision to any URIs that are on one of the domains in the configured set of domains, filled from the seed set.

<bean id="onDomainsDecideRule" class="org.archive.modules.deciderules.surt.OnDomainsDecideRule">
</bean>

OnHostsDecideRule

Rule applies configured decision to any URIs that are on one of the hosts in the configured set of hosts, filled from the seed set.

<bean id="onHostsDecideRule" class="org.archive.modules.deciderules.surt.OnHostsDecideRule">
</bean>

PathologicalPathDecideRule

Rule REJECTs any URI which contains an excessive number of identical, consecutive path-segments (eg http://example.com/a/a/a/boo.html == 3 ‘/a’ segments)

<bean id="pathologicalPathDecideRule" class="org.archive.modules.deciderules.PathologicalPathDecideRule">
  <!-- <property name="maxRepetitions" value="2" /> -->
</bean>
maxRepetitions
(int) Number of times the pattern should be allowed to occur. This rule returns its decision (usually REJECT) if a path-segment is repeated more than this number of times.

PredicatedDecideRule

Rule which applies the configured decision only if a test evaluates to true. Subclasses override evaluate() to establish the test.

<bean id="predicatedDecideRule" class="org.archive.modules.deciderules.PredicatedDecideRule">
  <!-- <property name="decision" value="" /> -->
</bean>
decision
(DecideResult)

PrerequisiteAcceptDecideRule

Rule which ACCEPTs all ‘prerequisite’ URIs (those with a ‘P’ in the last hopsPath position). Good in a late position to ensure other scope settings don’t lock out necessary prerequisites.

<bean id="prerequisiteAcceptDecideRule" class="org.archive.modules.deciderules.PrerequisiteAcceptDecideRule">
</bean>

RejectDecideRule

<bean id="rejectDecideRule" class="org.archive.modules.deciderules.RejectDecideRule">
</bean>

ResourceLongerThanDecideRule

Applies configured decision for URIs with content length greater than a given threshold length value. Examines either HTTP header Content-Length or actual downloaded content length (based on the useHeaderLength property), and has no effect on resources shorter than or equal to the given threshold value.

Note that because neither the Content-Length header nor the actual size are available at URI-scoping time, this rule is unusable in crawl scopes. Instead, the earliest it can be used is as a mid-fetch rule (in FetchHTTP), when the headers are available but not yet the body. It can also be used to affect processing after the URI is fully fetched.

<bean id="resourceLongerThanDecideRule" class="org.archive.modules.deciderules.ResourceLongerThanDecideRule">
</bean>

ResourceNoLongerThanDecideRule

Applies configured decision for URIs with content length less than or equal to a given threshold length value. Examines either HTTP header Content-Length or actual downloaded content length (based on the useHeaderLength property), and has no effect on resources longer than the given threshold value.

Note that because neither the Content-Length header nor the actual size are available at URI-scoping time, this rule is unusable in crawl scopes. Instead, the earliest it can be used is as a mid-fetch rule (in FetchHTTP), when the headers are available but not yet the body. It can also be used to affect processing after the URI is fully fetched.

<bean id="resourceNoLongerThanDecideRule" class="org.archive.modules.deciderules.ResourceNoLongerThanDecideRule">
  <!-- <property name="useHeaderLength" value="true" /> -->
  <!-- <property name="contentLengthThreshold" value="1" /> -->
</bean>
useHeaderLength
(boolean) Shall this rule be used as a midfetch rule? If true, this rule will determine content length based on HTTP header information, otherwise the size of the already downloaded content will be used.
contentLengthThreshold
(long) Max content-length this filter will allow to pass through. If -1, then no limit.

ResponseContentLengthDecideRule

Decide rule that will ACCEPT or REJECT a uri, depending on the “decision” property, after it’s fetched, if the content body is within a specified size range, specified in bytes.

<bean id="responseContentLengthDecideRule" class="org.archive.modules.deciderules.ResponseContentLengthDecideRule">
  <!-- <property name="lowerBound" value="0" /> -->
  <!-- <property name="upperBound" value="" /> -->
</bean>
lowerBound
(long) The rule will apply if the url has been fetched and content body length is greater than or equal to this number of bytes. Default is 0, meaning everything will match.
upperBound
(long) The rule will apply if the url has been fetched and content body length is less than or equal to this number of bytes. Default is Long.MAX_VALUE, meaning everything will match.

SchemeNotInSetDecideRule

Rule applies the configured decision (default REJECT) for any URI which has a URI-scheme NOT contained in the configured Set.

<bean id="schemeNotInSetDecideRule" class="org.archive.modules.deciderules.SchemeNotInSetDecideRule">
  <!-- <property name="schemes" value="" /> -->
</bean>
schemes
(Set) Set of schemes against which the URI scheme is tested
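
For example, to REJECT any URI whose scheme is not http or https (an illustrative set, not the default):

<bean id="schemeNotInSetDecideRule" class="org.archive.modules.deciderules.SchemeNotInSetDecideRule">
  <property name="schemes">
    <set>
      <value>http</value>
      <value>https</value>
    </set>
  </property>
</bean>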

ScriptedDecideRule

Rule which runs a JSR-223 script to make its decision.

Script source may be provided via a file local to the crawler or an inline configuration string.

The source must include a one-argument function “decisionFor” which returns the appropriate DecideResult.

Variables available to the script include ‘object’ (the object to be evaluated, typically a CrawlURI), ‘self’ (this ScriptedDecideRule instance), and ‘context’ (the crawl’s ApplicationContext, from which all named crawl beans are easily reachable).

<bean id="scriptedDecideRule" class="org.archive.modules.deciderules.ScriptedDecideRule">
  <!-- <property name="engineName" value="beanshell" /> -->
  <!-- <property name="scriptSource" value="null" /> -->
  <!-- <property name="isolateThreads" value="true" /> -->
  <!-- <property name="applicationContext" value="" /> -->
</bean>
engineName
(String) engine name; default “beanshell”
scriptSource
(ReadSource)
isolateThreads
(boolean) Whether each ToeThread should get its own independent script engine, or they should share synchronized access to one engine. Default is true, meaning each thread gets its own isolated engine.
applicationContext
(ApplicationContext)
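
A sketch using an inline org.archive.spring.ConfigString as the scriptSource; the BeanShell body is purely illustrative and only assumes the ‘decisionFor’ contract described above:

<bean id="scriptedDecideRule" class="org.archive.modules.deciderules.ScriptedDecideRule">
  <property name="engineName" value="beanshell" />
  <property name="scriptSource">
    <bean class="org.archive.spring.ConfigString">
      <property name="value">
        <value>
          import org.archive.modules.deciderules.DecideResult;
          decisionFor(uri) {
            // illustrative test only: reject URIs mentioning 'calendar'
            if (uri.toString().contains("calendar")) {
              return DecideResult.REJECT;
            }
            return DecideResult.NONE; // make no decision
          }
        </value>
      </property>
    </bean>
  </property>
</bean>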

SeedAcceptDecideRule

Rule which ACCEPTs all ‘seed’ URIs (those for which isSeed is true). Good in a late position to ensure other scope settings don’t lock out explicitly added seeds.

<bean id="seedAcceptDecideRule" class="org.archive.modules.deciderules.SeedAcceptDecideRule">
</bean>

SourceSeedDecideRule

Rule applies the configured decision for any URI discovered from one of the seeds in sourceSeeds.

SeedModule#getSourceTagSeeds() must be enabled or the rule will never apply.

<bean id="sourceSeedDecideRule" class="org.archive.modules.deciderules.SourceSeedDecideRule">
  <!-- <property name="sourceSeeds" value="" /> -->
</bean>
sourceSeeds
(Set)
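
A sketch of the sourceSeeds set, assuming its elements are the seed URIs as strings (and that source-tag tracking is enabled on the seed module, per the note above):

<bean id="sourceSeedDecideRule" class="org.archive.modules.deciderules.SourceSeedDecideRule">
  <property name="sourceSeeds">
    <set>
      <value>http://example.com/</value>
      <value>http://example.org/</value>
    </set>
  </property>
</bean>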

SurtPrefixedDecideRule

Rule applies configured decision to any URIs that, when expressed in SURT form, begin with one of the prefixes in the configured set.

The set can be filled with SURT prefixes implied or listed in the seeds file, or another external file.

The “also-check-via” option to implement “one hop off” scoping derives from a contribution by Shifra Raffel of the California Digital Library.

<bean id="surtPrefixedDecideRule" class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule">
  <!-- <property name="surtsSourceFile" value="" /> -->
  <!-- <property name="surtsSource" value="null" /> -->
  <!-- <property name="seedsAsSurtPrefixes" value="true" /> -->
  <!-- <property name="surtsDumpFile" value="" /> -->
  <!-- <property name="alsoCheckVia" value="false" /> -->
  <!-- <property name="seeds" value="" /> -->
  <!-- <property name="beanName" value="" /> -->
  <!-- <property name="recoveryCheckpoint" value="" /> -->
</bean>
surtsSourceFile
(ConfigFile)
surtsSource
(ReadSource) Text from which to infer SURT prefixes. Any URLs will be converted to the implied SURT prefix, and literal SURT prefixes may be listed on lines beginning with a ‘+’ character.
seedsAsSurtPrefixes
(boolean) Should seeds also be interpreted as SURT prefixes.
surtsDumpFile
(ConfigFile) Dump file to save SURT prefixes actually used: useful for debugging SURTs.
alsoCheckVia
(boolean) Whether to also make the configured decision if a URI’s ‘via’ URI (the URI from which it was discovered) in SURT form begins with any of the established prefixes. For example, can be used to ACCEPT URIs that are ‘one hop off’ URIs fitting the SURT prefixes. Default is false.
seeds
(SeedModule)
beanName
(String)
recoveryCheckpoint
(Checkpoint)
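
A sketch of an inline surtsSource mixing a plain URL (converted to its implied SURT prefix) and a literal ‘+’-prefixed SURT, as described for that property above:

<bean id="surtPrefixedDecideRule" class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule">
  <property name="seedsAsSurtPrefixes" value="true" />
  <property name="surtsSource">
    <bean class="org.archive.spring.ConfigString">
      <property name="value">
        <value>
          http://www.example.org/section/
          +http://(org,example,
        </value>
      </property>
    </bean>
  </property>
</bean>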

TooManyHopsDecideRule

Rule REJECTs any CrawlURIs whose total number of hops (length of the hopsPath string, traversed links of any type) is over a threshold. Otherwise returns PASS.

<bean id="tooManyHopsDecideRule" class="org.archive.modules.deciderules.TooManyHopsDecideRule">
  <!-- <property name="maxHops" value="20" /> -->
</bean>
maxHops
(int) Max path depth for which this filter will match.

TooManyPathSegmentsDecideRule

Rule REJECTs any CrawlURIs whose total number of path-segments (as indicated by the count of ‘/’ characters not including the first ‘//’) is over a given threshold.

<bean id="tooManyPathSegmentsDecideRule" class="org.archive.modules.deciderules.TooManyPathSegmentsDecideRule">
  <!-- <property name="maxPathDepth" value="20" /> -->
</bean>
maxPathDepth
(int) Number of path segments beyond which this rule will reject URIs.

TransclusionDecideRule

Rule ACCEPTs any CrawlURIs whose path-from-seed (‘hopsPath’ – see CrawlURI#getPathFromSeed()) ends with at least one, but not more than, the given number of non-navlink (‘L’) hops.

Otherwise, if the path-from-seed is empty or if a navlink (‘L’) occurs within max-trans-hops of the tail of the path-from-seed, this rule returns PASS.

Thus, it allows things like embedded resources (frames/images/media) and redirects to be transitively included (‘transcluded’) in a crawl, even if they otherwise would not, for some reasonable number of hops (usually 1-5).

<bean id="transclusionDecideRule" class="org.archive.modules.deciderules.TransclusionDecideRule">
  <!-- <property name="maxTransHops" value="2" /> -->
  <!-- <property name="maxSpeculativeHops" value="1" /> -->
</bean>
maxTransHops
(int) Maximum number of non-navlink (non-‘L’) hops to ACCEPT.
maxSpeculativeHops
(int) Maximum number of speculative (‘X’) hops to ACCEPT.

ViaSurtPrefixedDecideRule

Rule applies the configured decision for any URI which has a ‘via’ whose SURT form matches any SURT prefix specified in the surtPrefixes list.

<bean id="viaSurtPrefixedDecideRule" class="org.archive.modules.deciderules.ViaSurtPrefixedDecideRule">
  <!-- <property name="surtPrefixes" value="" /> -->
</bean>
surtPrefixes
(List)

Candidate Processors

CandidateScoper

Simple single-URI scoper, considers passed-in URI as candidate; sets fetchstatus negative and skips to end of processing if out-of-scope.

<bean id="candidateScoper" class="org.archive.crawler.prefetch.CandidateScoper">
</bean>

FrontierPreparer

Processor to preload URI with as much precalculated policy-based info as possible before it reaches frontier critical sections.

Frontiers also maintain a direct reference to this class, in case they need to perform remedial preparation for URIs that do not pass through this processor on the CandidateChain.

<bean id="frontierPreparer" class="org.archive.crawler.prefetch.FrontierPreparer">
  <!-- <property name="preferenceDepthHops" value="1" /> -->
  <!-- <property name="preferenceEmbedHops" value="1" /> -->
  <!-- <property name="canonicalizationPolicy" value="" /> -->
  <!-- <property name="queueAssignmentPolicy" value="" /> -->
  <!-- <property name="uriPrecedencePolicy" value="" /> -->
  <!-- <property name="costAssignmentPolicy" value="" /> -->
</bean>
preferenceDepthHops
(int) Number of hops (of any sort) from a seed up to which a URI has higher priority scheduling than any remaining seed. For example, if set to 1 items one hop (link, embed, redirect, etc.) away from a seed will be scheduled with HIGH priority. If set to -1, no preferencing will occur, and a breadth-first search with seeds processed before discovered links will proceed. If set to zero, a purely depth-first search will proceed, with all discovered links processed before remaining seeds. Seed redirects are treated as one hop from a seed.
preferenceEmbedHops
(int) number of hops of embeds (ERX) to bump to front of host queue
canonicalizationPolicy
(UriCanonicalizationPolicy) Ordered list of url canonicalization rules. Rules are applied in the order listed from top to bottom.
queueAssignmentPolicy
(QueueAssignmentPolicy) Defines how to assign URIs to queues. Can assign by host, by ip, by SURT-ordered authority, by SURT-ordered authority truncated to a topmost-assignable domain, and into one of a fixed set of buckets (1k).
uriPrecedencePolicy
(UriPrecedencePolicy) URI precedence assignment policy to use.
costAssignmentPolicy
(CostAssignmentPolicy) cost assignment policy to use.
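
As a sketch, the queue assignment policy can be switched by nesting a policy bean; the class name below is an assumption (policy implementations live in org.archive.crawler.frontier) and should be verified against the javadoc:

<bean id="frontierPreparer" class="org.archive.crawler.prefetch.FrontierPreparer">
  <property name="preferenceEmbedHops" value="1" />
  <property name="queueAssignmentPolicy">
    <!-- assumed policy class; verify before use -->
    <bean class="org.archive.crawler.frontier.SurtAuthorityQueueAssignmentPolicy" />
  </property>
</bean>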

Pre-Fetch Processors

PreconditionEnforcer

Ensures the preconditions for a fetch – such as DNS lookup or acquiring and respecting a robots.txt policy – are satisfied before a URI is passed to subsequent stages.

<bean id="preconditionEnforcer" class="org.archive.crawler.prefetch.PreconditionEnforcer">
  <!-- <property name="ipValidityDurationSeconds" value="" /> -->
  <!-- <property name="robotsValidityDurationSeconds" value="" /> -->
  <!-- <property name="calculateRobotsOnly" value="false" /> -->
  <!-- <property name="metadata" value="" /> -->
  <!-- <property name="credentialStore" value="" /> -->
  <!-- <property name="serverCache" value="" /> -->
  <!-- <property name="loggerModule" value="" /> -->
</bean>
ipValidityDurationSeconds
(int) The minimum interval for which a dns-record will be considered valid (in seconds). If the record’s DNS TTL is larger, that will be used instead.
robotsValidityDurationSeconds
(int) The time in seconds that fetched robots.txt information is considered to be valid. If the value is set to ‘0’, then the robots.txt information will never expire.
calculateRobotsOnly
(boolean) Whether to only calculate the robots status of a URI, without actually applying any exclusions found. If true, excluded URIs will only be annotated in the crawl.log, but still fetched. Default is false.
metadata
(CrawlMetadata) Auto-discovered module providing configured (or overridden) User-Agent value and RobotsHonoringPolicy
credentialStore
(CredentialStore)
serverCache
(ServerCache)
loggerModule
(CrawlerLoggerModule)

Preselector

If set to recheck the crawl’s scope, gives a yes/no on whether a CrawlURI should be processed at all. If not, its status will be marked OUT_OF_SCOPE and the URI will skip directly to the first “postprocessor”.

<bean id="preselector" class="org.archive.crawler.prefetch.Preselector">
  <!-- <property name="recheckScope" value="false" /> -->
  <!-- <property name="blockAll" value="false" /> -->
  <!-- <property name="blockByRegex" value="" /> -->
  <!-- <property name="allowByRegex" value="" /> -->
</bean>
recheckScope
(boolean) Recheck if uri is in scope. This is meaningful if the scope is altered during a crawl. URIs are checked against the scope when they are added to queues. Setting this value to true forces the URI to be checked against the scope when it is coming out of the queue, possibly after the scope is altered.
blockAll
(boolean) Block all URIs from being processed. This is most likely to be used in overrides to easily reject certain hosts from being processed.
blockByRegex
(String) Block all URIs matching the regular expression from being processed.
allowByRegex
(String) Allow only URIs matching the regular expression to be processed.
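
For example, to recheck scope on dequeue and block an entire host by regex (the pattern is illustrative):

<bean id="preselector" class="org.archive.crawler.prefetch.Preselector">
  <property name="recheckScope" value="true" />
  <property name="blockByRegex" value=".*\.example\.com.*" />
</bean>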

Fetch Processors

FetchDNS

Processor to resolve ‘dns:’ URIs.

<bean id="fetchDNS" class="org.archive.modules.fetcher.FetchDNS">
  <!-- <property name="acceptNonDnsResolves" value="false" /> -->
  <!-- <property name="disableJavaDnsResolves" value="false" /> -->
  <!-- <property name="dnsOverHttpServer" value="" /> -->
  <!-- <property name="serverCache" value="" /> -->
  <!-- <property name="digestContent" value="true" /> -->
  <!-- <property name="digestAlgorithm" value="sha1" /> -->
</bean>
acceptNonDnsResolves
(boolean) If a DNS lookup fails, whether or not to fall back to InetAddress resolution, which may use local ‘hosts’ files or other mechanisms.
disableJavaDnsResolves
(boolean) Optionally, only allow InetAddress resolution, precisely because it may use local ‘hosts’ files or other mechanisms.

This should not generally be used in production as it will prevent DNS lookups from being recorded properly.

dnsOverHttpServer
(String) URL of the DNS-over-HTTP(S) server. If this is not set or is set to an empty string, DNS-over-HTTP(S) will not be used; otherwise it should contain the URL of the DNS-over-HTTPS server.
serverCache
(ServerCache) Used to do DNS lookups.
digestContent
(boolean) Whether or not to perform an on-the-fly digest hash of retrieved content-bodies.
digestAlgorithm
(String) Which algorithm (for example MD5 or SHA-1) to use to perform an on-the-fly digest hash of retrieved content-bodies.

FetchFTP

Fetches documents and directory listings using FTP. This class will also try to extract FTP “links” from directory listings. For this class to archive a directory listing, the remote FTP server must support the NLST command. Most modern FTP servers should.

<bean id="fetchFTP" class="org.archive.modules.fetcher.FetchFTP">
  <!-- <property name="username" value="anonymous" /> -->
  <!-- <property name="password" value="password" /> -->
  <!-- <property name="extractFromDirs" value="true" /> -->
  <!-- <property name="extractParent" value="true" /> -->
  <!-- <property name="digestContent" value="true" /> -->
  <!-- <property name="digestAlgorithm" value="sha1" /> -->
  <!-- <property name="maxLengthBytes" value="0" /> -->
  <!-- <property name="maxFetchKBSec" value="0" /> -->
  <!-- <property name="timeoutSeconds" value="" /> -->
  <!-- <property name="soTimeoutMs" value="" /> -->
</bean>
username
(String) The username to send to FTP servers. By convention, the default value of “anonymous” is used for publicly available FTP sites.
password
(String) The password to send to FTP servers. By convention, anonymous users send their email address in this field.
extractFromDirs
(boolean) Set to true to extract further URIs from FTP directories. Default is true.
extractParent
(boolean) Set to true to extract the parent URI from all FTP URIs. Default is true.
digestContent
(boolean) Whether or not to perform an on-the-fly digest hash of retrieved content-bodies.
digestAlgorithm
(String) Which algorithm (for example MD5 or SHA-1) to use to perform an on-the-fly digest hash of retrieved content-bodies.
maxLengthBytes
(long) Maximum length in bytes to fetch. Fetch is truncated at this length. A value of 0 means no limit.
maxFetchKBSec
(int) The maximum KB/sec to use when fetching data from a server. The default of 0 means no maximum.
timeoutSeconds
(int) If the fetch is not completed in this number of seconds, give up (and retry later).
soTimeoutMs
(int) If the socket is unresponsive for this number of milliseconds, give up. Set to zero for no timeout (not recommended: it could hang a thread on an unresponsive server). This timeout is used for timing out socket opens and for timing out each socket read. Make sure this value is < #TIMEOUT_SECONDS for optimal configuration: this ensures at least one retry read.

FetchHTTP

HTTP fetcher that uses Apache HttpComponents.

<bean id="fetchHTTP" class="org.archive.modules.fetcher.FetchHTTP">
  <!-- <property name="serverCache" value="" /> -->
  <!-- <property name="digestContent" value="true" /> -->
  <!-- <property name="digestAlgorithm" value="sha1" /> -->
  <!-- <property name="userAgentProvider" value="" /> -->
  <!-- <property name="sendConnectionClose" value="true" /> -->
  <!-- <property name="defaultEncoding" value="ISO-8859-1" /> -->
  <!-- <property name="useHTTP11" value="false" /> -->
  <!-- <property name="ignoreCookies" value="false" /> -->
  <!-- <property name="sendReferer" value="true" /> -->
  <!-- <property name="acceptCompression" value="false" /> -->
  <!-- <property name="acceptHeaders" value="" /> -->
  <!-- <property name="cookieStore" value="" /> -->
  <!-- <property name="credentialStore" value="" /> -->
  <!-- <property name="httpBindAddress" value="" /> -->
  <!-- <property name="httpProxyHost" value="" /> -->
  <!-- <property name="httpProxyPort" value="" /> -->
  <!-- <property name="httpProxyUser" value="" /> -->
  <!-- <property name="httpProxyPassword" value="" /> -->
  <!-- <property name="maxFetchKBSec" value="0" /> -->
  <!-- <property name="timeoutSeconds" value="" /> -->
  <!-- <property name="soTimeoutMs" value="" /> -->
  <!-- <property name="maxLengthBytes" value="0" /> -->
  <!-- <property name="sendRange" value="false" /> -->
  <!-- <property name="sendIfModifiedSince" value="true" /> -->
  <!-- <property name="sendIfNoneMatch" value="true" /> -->
  <!-- <property name="shouldFetchBodyRule" value="" /> -->
  <!-- <property name="sslTrustLevel" value="" /> -->
  <!-- <property name="socksProxyHost" value="" /> -->
  <!-- <property name="socksProxyPort" value="" /> -->
</bean>
serverCache
(ServerCache) Used to do DNS lookups.
digestContent
(boolean) Whether or not to perform an on-the-fly digest hash of retrieved content-bodies.
digestAlgorithm
(String) Which algorithm (for example MD5 or SHA-1) to use to perform an on-the-fly digest hash of retrieved content-bodies.
userAgentProvider
(UserAgentProvider)
sendConnectionClose
(boolean) Send ‘Connection: close’ header with every request.
defaultEncoding
(String) The character encoding to use for files that do not have one specified in the HTTP response headers. Default: ISO-8859-1.
useHTTP11
(boolean) Use HTTP/1.1. Note: even when offering an HTTP/1.1 request, Heritrix may not properly handle persistent/keep-alive connections, so the sendConnectionClose parameter should remain ‘true’.
ignoreCookies
(boolean) Disable cookie handling.
sendReferer
(boolean) Send ‘Referer’ header with every request.

The ‘Referer’ header contains the location the crawler came from, i.e. the page in which the current URI was discovered. The ‘Referer’ is usually logged on the remote server and can be of assistance to webmasters trying to figure out how a crawler got to a particular area on a site.

acceptCompression
(boolean) Set headers to accept compressed responses.
acceptHeaders
(List) Accept headers to include in each request. Each must be the complete header, e.g., ‘Accept-Language: en’. (Thus, this can also be used to send other headers not beginning with ‘Accept-’.) By default Heritrix sends an Accept header similar to what a typical browser would send (the value comes from Firefox 4.0).
cookieStore
(AbstractCookieStore)
credentialStore
(CredentialStore) Used to store credentials.
httpBindAddress
(String) Local IP address or hostname to use when making connections (binding sockets). When not specified, uses default local address(es).
httpProxyHost
(String) Proxy host IP (set only if needed).
httpProxyPort
(Integer) Proxy port (set only if needed).
httpProxyUser
(String) Proxy user (set only if needed).
httpProxyPassword
(String) Proxy password (set only if needed).
maxFetchKBSec
(int) The maximum KB/sec to use when fetching data from a server. The default of 0 means no maximum.
timeoutSeconds
(int) If the fetch is not completed in this number of seconds, give up (and retry later).
soTimeoutMs
(int) If the socket is unresponsive for this number of milliseconds, give up. Set to zero for no timeout (not recommended: it could hang a thread on an unresponsive server). This timeout is used for timing out socket opens and for timing out each socket read. Make sure this value is < #getTimeoutSeconds() for optimal configuration: this ensures at least one retry read.
maxLengthBytes
(long) Maximum length in bytes to fetch. Fetch is truncated at this length. A value of 0 means no limit.
sendRange
(boolean) Send the ‘Range’ header when a limit (#MAX_LENGTH_BYTES) on document size is in effect.

Be polite to HTTP servers and send the ‘Range’ header, stating that you are only interested in the first n bytes. Only pertinent if #MAX_LENGTH_BYTES > 0. Sending the ‘Range’ header results in a ‘206 Partial Content’ status response, which is better than just cutting the response mid-download. On rare occasions, sending ‘Range’ will generate a ‘416 Requested Range Not Satisfiable’ response.

sendIfModifiedSince
(boolean) Send ‘If-Modified-Since’ header, if previous ‘Last-Modified’ fetch history information is available in URI history.
sendIfNoneMatch
(boolean) Send ‘If-None-Match’ header, if previous ‘Etag’ fetch history information is available in URI history.
shouldFetchBodyRule
(DecideRule) DecideRules applied after receipt of the HTTP response headers but before we start to download the body. If any rule returns FALSE, the fetch is aborted. Prerequisites such as robots.txt bypass this filtering (i.e. they cannot be midfetch aborted).
sslTrustLevel
(TrustLevel) SSL certificate trust level. Range is from the default ‘open’ (trust all certs, including expired, self-signed, and those for which we do not have a CA) through ‘loose’ (trust all valid certificates, including self-signed) and ‘normal’ (all valid certificates, not including self-signed) to ‘strict’ (the certificate must be valid and the DN must match the server name).
socksProxyHost
(String) Sets a SOCKS5 proxy host to use. This will override any set HTTP proxy.
socksProxyPort
(Integer) Sets a SOCKS5 proxy port to use.
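
As a hedged sketch (all values below are illustrative, not recommended defaults), a concrete fetchHttp bean overriding a few of the properties documented above might look like:

<bean id="fetchHttp" class="org.archive.modules.fetcher.FetchHTTP">
  <property name="timeoutSeconds" value="1200" />
  <property name="soTimeoutMs" value="20000" />
  <property name="maxLengthBytes" value="0" />
  <property name="acceptCompression" value="true" />
  <property name="maxFetchKBSec" value="800" />
</bean>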

FetchSFTP

<bean id="fetchSFTP" class="org.archive.modules.fetcher.FetchSFTP">
  <!-- <property name="username" value="anonymous" /> -->
  <!-- <property name="password" value="password" /> -->
  <!-- <property name="extractFromDirs" value="true" /> -->
  <!-- <property name="extractParent" value="true" /> -->
  <!-- <property name="digestContent" value="true" /> -->
  <!-- <property name="digestAlgorithm" value="sha1" /> -->
  <!-- <property name="maxLengthBytes" value="0" /> -->
  <!-- <property name="maxFetchKBSec" value="0" /> -->
  <!-- <property name="timeoutSeconds" value="" /> -->
  <!-- <property name="soTimeoutMs" value="" /> -->
</bean>
username
(String) The username to send to SFTP servers. By convention, the default value of “anonymous” is used for publicly available SFTP sites.
password
(String) The password to send to SFTP servers. By convention, anonymous users send their email address in this field.
extractFromDirs
(boolean) Set to true to extract further URIs from SFTP directories. Default is true.
extractParent
(boolean) Set to true to extract the parent URI from all SFTP URIs. Default is true.
digestContent
(boolean) Whether or not to perform an on-the-fly digest hash of retrieved content-bodies.
digestAlgorithm
(String) Which algorithm (for example MD5 or SHA-1) to use to perform an on-the-fly digest hash of retrieved content-bodies.
maxLengthBytes
(long) Maximum length in bytes to fetch. Fetch is truncated at this length. A value of 0 means no limit.
maxFetchKBSec
(int) The maximum KB/sec to use when fetching data from a server. The default of 0 means no maximum.
timeoutSeconds
(int) If the fetch is not completed in this number of seconds, give up (and retry later).
soTimeoutMs
(int) If the socket is unresponsive for this number of milliseconds, give up. Set to zero for no timeout (not recommended: it could hang a thread on an unresponsive server). This timeout is used for timing out socket opens and for timing out each socket read. Make sure this value is < #TIMEOUT_SECONDS for optimal configuration: this ensures at least one retry read.
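
A similar sketch for FetchSFTP, with illustrative values only (the email-style password follows the anonymous-access convention noted above):

<bean id="fetchSFTP" class="org.archive.modules.fetcher.FetchSFTP">
  <property name="username" value="anonymous" />
  <property name="password" value="crawler@example.org" />
  <property name="extractFromDirs" value="true" />
  <property name="maxLengthBytes" value="1073741824" />
  <property name="timeoutSeconds" value="1200" />
</bean>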

FetchWhois

WHOIS Fetcher (RFC 3912). If this fetcher is enabled, Heritrix will attempt WHOIS lookups on the topmost assigned domain and the IP address of each URL.

WHOIS URIs

There is no pre-existing, canonical specification for WHOIS URIs. What follows is the format that Heritrix uses, which we propose for general use.

Syntax in ABNF as used in RFC 3986 Uniform Resource Identifier (URI): Generic Syntax:

whoisurl = "whois:" [ "//" host [ ":" port ] "/" ] whoisquery

whoisquery is a url-encoded string. In ABNF, whoisquery = 1*pchar, where pchar is as defined in RFC 3986; host and port are also as defined in RFC 3986.

To resolve a WHOIS URI which specifies host[:port], open a TCP connection to the host at the specified port (default 43), send the query (whoisquery, url-decoded) followed by CRLF, and read the response until the server closes the connection. For more details see RFC 3912.

Resolution of a “serverless” WHOIS URI, which does not specify host[:port], is implementation-dependent.
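
For illustration, here are three syntactically valid WHOIS URIs under this grammar: one with a host and the default port, one with an explicit port, and one serverless (the hosts, query, and IP address are made-up documentation examples):

whois://whois.example.net/example.org
whois://whois.example.net:4343/example.org
whois:192.0.2.1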

Serverless WHOIS URIs in Heritrix

For each non-WHOIS URI processed which has an authority, FetchWhois adds 1 or 2 serverless WHOIS URIs to the CrawlURI’s outlinks. These are “whois:{ipAddress}” and, if the authority includes a hostname, “whois:{topLevelDomain}”. See #addWhoisLinks(CrawlURI).

Heritrix resolves serverless WHOIS URIs by first querying an initial server, then following referrals to other servers. In pseudocode:


if query is an IPv4 address
    resolve whois://#DEFAULT_IP_WHOIS_SERVER/whoisquery
else
    let domainSuffix = part of query after the last ‘.’ (or the whole query if no ‘.’), url-encoded
    resolve whois://#ULTRA_SUFFIX_WHOIS_SERVER/domainSuffix

while last response refers to another server, i.e. matches regex #WHOIS_SERVER_REGEX
    if we have a special query formatting rule for this whois server, apply it - see #specialQueryTemplates
    resolve whois://referralServer/whoisquery

See #deferOrFinishGeneric(CrawlURI, String)

<bean id="fetchWhois" class="org.archive.modules.fetcher.FetchWhois">
  <!-- <property name="bdbModule" value="" /> -->
  <!-- <property name="specialQueryTemplates" value="" /> -->
  <!-- <property name="soTimeoutMs" value="" /> -->
  <!-- <property name="serverCache" value="" /> -->
</bean>
bdbModule
(BdbModule)
specialQueryTemplates
(Map)
soTimeoutMs
(int) If the socket is unresponsive for this number of milliseconds, give up. Set to zero for no timeout (not recommended: it could hang a thread on an unresponsive server). This timeout is used for timing out socket opens and for timing out each socket read. Make sure this value is < #TIMEOUT_SECONDS for optimal configuration: this ensures at least one retry read.
serverCache
(ServerCache)
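
As a sketch of how specialQueryTemplates can be populated, a map entry tells FetchWhois how to format the query sent to a particular WHOIS server. The server name and query template below are assumptions for illustration, not verified defaults:

<bean id="fetchWhois" class="org.archive.modules.fetcher.FetchWhois">
  <property name="specialQueryTemplates">
    <map>
      <!-- assumed example: format queries to this server as "domain <query>" -->
      <entry key="whois.verisign-grs.com" value="domain %s" />
    </map>
  </property>
</bean>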

Post-Processors

CandidatesProcessor

Processor which sends all candidate outlinks through the CandidateChain, scheduling those with non-negative status codes to the frontier. Also performs special handling for ‘discovered seeds’: URIs, such as redirects from seeds, that may deserve special treatment to expand the scope.

<bean id="candidatesProcessor" class="org.archive.crawler.postprocessor.CandidatesProcessor">
  <!-- <property name="candidateChain" value="" /> -->
  <!-- <property name="frontier" value="" /> -->
  <!-- <property name="loggerModule" value="" /> -->
  <!-- <property name="seedsRedirectNewSeeds" value="true" /> -->
  <!-- <property name="seedsRedirectNewSeedsAllowTLDs" value="true" /> -->
  <!-- <property name="processErrorOutlinks" value="false" /> -->
  <!-- <property name="seeds" value="" /> -->
  <!-- <property name="sheetOverlaysManager" value="" /> -->
</bean>
candidateChain
(CandidateChain) Candidate chain
frontier
(Frontier) The frontier to use.
loggerModule
(CrawlerLoggerModule)
seedsRedirectNewSeeds
(boolean) If enabled, any URL found because a seed redirected to it (original seed returned 301 or 302), will also be treated as a seed, as long as the hop count is less than {@value #SEEDS_REDIRECT_NEW_SEEDS_MAX_HOPS}.
seedsRedirectNewSeedsAllowTLDs
(boolean) If enabled, any URL found because a seed redirected to it (original seed returned 301 or 302), will also be treated as a seed, as long as the hop count is less than {@value #SEEDS_REDIRECT_NEW_SEEDS_MAX_HOPS}.
processErrorOutlinks
(boolean) If true, outlinks from status codes <200 and >=400 will be sent through candidates processing. Default is false.
seeds
(SeedModule)
sheetOverlaysManager
(SheetOverlaysManager)
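
A minimal sketch overriding only the boolean settings documented above; the object-valued properties such as frontier and seeds are typically left to autowiring (values illustrative):

<bean id="candidatesProcessor" class="org.archive.crawler.postprocessor.CandidatesProcessor">
  <property name="seedsRedirectNewSeeds" value="true" />
  <property name="processErrorOutlinks" value="true" />
</bean>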

DispositionProcessor

A step, late in the processing of a CrawlURI, for marking-up the CrawlURI with values to affect frontier disposition, and updating information that may have been affected by the fetch. This includes robots info and other stats.

(Formerly called CrawlStateUpdater, when it did less.)

<bean id="dispositionProcessor" class="org.archive.crawler.postprocessor.DispositionProcessor">
  <!-- <property name="serverCache" value="" /> -->
  <!-- <property name="delayFactor" value="5.0f" /> -->
  <!-- <property name="minDelayMs" value="3000" /> -->
  <!-- <property name="respectCrawlDelayUpToSeconds" value="300" /> -->
  <!-- <property name="maxDelayMs" value="30000" /> -->
  <!-- <property name="maxPerHostBandwidthUsageKbSec" value="0" /> -->
  <!-- <property name="forceRetire" value="false" /> -->
  <!-- <property name="metadata" value="" /> -->
</bean>
serverCache
(ServerCache)
delayFactor
(float) How many multiples of last fetch elapsed time to wait before recontacting same server.
minDelayMs
(int) Always wait at least this long (in milliseconds) after one completion before recontacting the same server, regardless of the delay-factor multiple.
respectCrawlDelayUpToSeconds
(int) Respect a ‘Crawl-Delay’ (in seconds) given in a site’s robots.txt, up to this maximum number of seconds.
maxDelayMs
(int) Never wait more than this long (in milliseconds), regardless of the delay-factor multiple.
maxPerHostBandwidthUsageKbSec
(int) maximum per-host bandwidth usage
forceRetire
(boolean) Whether to set a CrawlURI’s force-retired directive, retiring its queue when it finishes. Mainly intended for URI-specific overlay settings; setting true globally will just retire all queues after they offer one URI, rapidly ending a crawl.
metadata
(CrawlMetadata) Auto-discovered module providing configured (or overridden) User-Agent value and RobotsHonoringPolicy
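
As an illustrative sketch (not tuned recommendations), a more conservative politeness setup than the commented defaults above could look like:

<bean id="dispositionProcessor" class="org.archive.crawler.postprocessor.DispositionProcessor">
  <property name="delayFactor" value="10.0" />
  <property name="minDelayMs" value="5000" />
  <property name="maxDelayMs" value="60000" />
  <property name="respectCrawlDelayUpToSeconds" value="300" />
</bean>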

ReschedulingProcessor

The simplest forced-rescheduling step possible: use a local setting (perhaps overlaid to vary based on the URI) to set an exact future reschedule time, as a delay from now. Unless the rescheduleDelaySeconds value is changed from its default, URIs are not rescheduled.

<bean id="reschedulingProcessor" class="org.archive.crawler.postprocessor.ReschedulingProcessor">
  <!-- <property name="rescheduleDelaySeconds" value="1" /> -->
</bean>
rescheduleDelaySeconds
(long) Amount of time to wait before forcing a URI to be rescheduled; the default of -1 means “don’t reschedule”.
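
For example, a sketch that forces URIs to be reconsidered roughly once a day (86400 seconds; the interval is purely illustrative):

<bean id="reschedulingProcessor" class="org.archive.crawler.postprocessor.ReschedulingProcessor">
  <property name="rescheduleDelaySeconds" value="86400" />
</bean>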

WARCWriterChainProcessor

WARC writer processor. The types of records to be written can be configured by including or excluding WARCRecordBuilder implementations (see #setChain(List)).

This is the default chain:


  <property name="chain">
   <list>
    <bean class="org.archive.modules.warc.DnsResponseRecordBuilder"/>
    <bean class="org.archive.modules.warc.HttpResponseRecordBuilder"/>
    <bean class="org.archive.modules.warc.WhoisResponseRecordBuilder"/>
    <bean class="org.archive.modules.warc.FtpControlConversationRecordBuilder"/>
    <bean class="org.archive.modules.warc.FtpResponseRecordBuilder"/>
    <bean class="org.archive.modules.warc.RevisitRecordBuilder"/>
    <bean class="org.archive.modules.warc.HttpRequestRecordBuilder"/>
    <bean class="org.archive.modules.warc.MetadataRecordBuilder"/>
   </list>
  </property>

Replaces WARCWriterProcessor.

<bean id="wARCWriterChainProcessor" class="org.archive.modules.writer.WARCWriterChainProcessor">
  <!-- <property name="chain" value="" /> -->
</bean>
chain
(List)
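
As a sketch, a trimmed chain that writes only DNS responses, HTTP responses, revisits, and metadata records would list just those builders (all class names are taken from the default chain above):

<bean id="wARCWriterChainProcessor" class="org.archive.modules.writer.WARCWriterChainProcessor">
  <property name="chain">
    <list>
      <bean class="org.archive.modules.warc.DnsResponseRecordBuilder"/>
      <bean class="org.archive.modules.warc.HttpResponseRecordBuilder"/>
      <bean class="org.archive.modules.warc.RevisitRecordBuilder"/>
      <bean class="org.archive.modules.warc.MetadataRecordBuilder"/>
    </list>
  </property>
</bean>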