Bean Reference¶
Note
This reference is a work in progress and does not yet cover all available beans. For a more complete list of Heritrix beans please refer to the javadoc.
Core Beans¶
ActionDirectory¶
Directory watched for new files. Depending on their extension, each new file is processed with regard to the current crawl and then renamed with a datestamp into the ‘done’ directory. Currently supports:
- .seeds(.gz)
add each URI found in file as a new seed (to be crawled
if not already; to affect scope if appropriate).
- (.s).recover(.gz)
treat as traditional recovery log: consider all ‘Fs’-tagged lines
included, then try-rescheduling all ‘F+’-tagged lines. (If “.s.”
present, try scoping URIs before including/scheduling.)
- (.s).include(.gz)
add each URI found in a recover-log like file (regardless of its
tagging) to the frontier’s alreadyIncluded filter, preventing them
from being recrawled. (‘.s.’ indicates to apply scoping.)
- (.s).schedule(.gz)
add each URI found in a recover-log like file (regardless of its
tagging) to the frontier’s queues. (‘.s.’ indicates to apply
scoping.)
Future support planned:
- .robots: invalidate robots ASAP
- (?) .block: block-all on named site(s)
- .overlay: add new overlay settings
- .js .rb .bsh etc - execute arbitrary script (a la ScriptedProcessor)
<bean id="actionDirectory" class="org.archive.crawler.framework.ActionDirectory">
<!-- <property name="initialDelaySeconds" value="10" /> -->
<!-- <property name="delaySeconds" value="30" /> -->
<!-- <property name="actionDir" value="" /> -->
<!-- <property name="doneDir" value="" /> -->
<!-- <property name="applicationContext" value="" /> -->
<!-- <property name="seeds" value="" /> -->
<!-- <property name="frontier" value="" /> -->
</bean>
- initialDelaySeconds
- (int) how long after crawl start to first scan action directory
- delaySeconds
- (int) delay between scans of actionDirectory for new files
- actionDir
- (ConfigPath)
- doneDir
- (ConfigPath)
- applicationContext
- (ApplicationContext)
- seeds
- (SeedModule)
- frontier
- (Frontier) autowired frontier for actions
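For example, a job might enable the action directory with explicit directories and scan timing; the directory names below are illustrative, not defaults:
<bean id="actionDirectory" class="org.archive.crawler.framework.ActionDirectory">
  <property name="actionDir" value="action" />
  <property name="doneDir" value="actions-done" />
  <property name="initialDelaySeconds" value="10" />
  <property name="delaySeconds" value="30" />
</bean>
Dropping a file such as extra-urls.seeds into the action directory would then add each URI in that file as a new seed on the next scan.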
BdbCookieStore¶
Cookie store using bdb for storage. Cookies are stored in a SortedMap keyed by #sortableKey(Cookie), so they are grouped together by domain. #cookieStoreFor(String) returns a facade whose CookieStore#getCookies() returns a list of cookies limited to the supplied host and parent domains, if applicable.
<bean id="bdbCookieStore" class="org.archive.modules.fetcher.BdbCookieStore">
<!-- <property name="bdbModule" value="" /> -->
<!-- <property name="recoveryCheckpoint" value="" /> -->
</bean>
- bdbModule
- (BdbModule)
- recoveryCheckpoint
- (Checkpoint)
BdbFrontier¶
A Frontier using several BerkeleyDB JE Databases to hold its record of known hosts (queues), and pending URIs.
<bean id="bdbFrontier" class="org.archive.crawler.frontier.BdbFrontier">
<!-- <property name="bdbModule" value="" /> -->
<!-- <property name="beanName" value="" /> -->
<!-- <property name="dumpPendingAtClose" value="false" /> -->
<!-- <property name="recoveryCheckpoint" value="" /> -->
</bean>
- bdbModule
- (BdbModule)
- beanName
- (String)
- dumpPendingAtClose
- (boolean)
- recoveryCheckpoint
- (Checkpoint)
BdbModule¶
Utility module for managing a shared BerkeleyDB-JE environment
<bean id="bdbModule" class="org.archive.bdb.BdbModule">
<!-- <property name="dir" value="" /> -->
<!-- <property name="cachePercent" value="1" /> -->
<!-- <property name="cacheSize" value="1" /> -->
<!-- <property name="useSharedCache" value="true" /> -->
<!-- <property name="maxLogFileSize" value="10000000" /> -->
<!-- <property name="expectedConcurrency" value="64" /> -->
<!-- <property name="cleanerThreads" value="" /> -->
<!-- <property name="evictorCoreThreads" value="1" /> -->
<!-- <property name="evictorMaxThreads" value="1" /> -->
<!-- <property name="useHardLinkCheckpoints" value="true" /> -->
<!-- <property name="recoveryCheckpoint" value="" /> -->
</bean>
- dir
- (ConfigPath)
- cachePercent
- (int)
- cacheSize
- (int)
- useSharedCache
- (boolean)
- maxLogFileSize
- (long)
- expectedConcurrency
- (int) Expected number of concurrent threads; used to tune nLockTables according to JE FAQ http://www.oracle.com/technology/products/berkeley-db/faq/je_faq.html#33
- cleanerThreads
- (int)
- evictorCoreThreads
- (int) Configure the number of evictor threads (-1 means use the default) https://docs.oracle.com/cd/E17277_02/html/java/com/sleepycat/je/EnvironmentConfig.html#EVICTOR_CORE_THREADS
- evictorMaxThreads
- (int) Configure the maximum number of evictor threads (-1 means use the default) https://docs.oracle.com/cd/E17277_02/html/java/com/sleepycat/je/EnvironmentConfig.html#EVICTOR_MAX_THREADS
- useHardLinkCheckpoints
- (boolean) Whether to use hard-links to log files to collect/retain the BDB log files needed for a checkpoint. Default is true. May not work on Windows (especially on pre-NTFS filesystems). If false, the BDB ‘je.cleaner.expunge’ value will be set to ‘false’ as well, meaning BDB will *not* delete obsolete JDB files, but only rename them with a ‘.DEL’ extension. They will have to be manually deleted to free disk space, but .DEL files referenced in any checkpoint’s ‘jdbfiles.manifest’ should be retained to keep the checkpoint valid.
- recoveryCheckpoint
- (Checkpoint)
BdbServerCache¶
ServerCache backed by BDB big maps; the usual choice for crawls.
<bean id="bdbServerCache" class="org.archive.modules.net.BdbServerCache">
<!-- <property name="bdbModule" value="" /> -->
<!-- <property name="recoveryCheckpoint" value="" /> -->
</bean>
- bdbModule
- (BdbModule)
- recoveryCheckpoint
- (Checkpoint)
BdbUriUniqFilter¶
A BDB implementation of an AlreadySeen list. This implementation performs adequately without blowing out
the heap. See
AlreadySeen. Makes keys that have URIs from the same server close to each other. Mercator
and section 2.3.5 ‘Eliminating Already-Visited URLs’ in ‘Mining the Web’ by Soumen
Chakrabarti talk of a two-level key with the first 24 bits a hash of the
host plus port and with the last 40 as a hash of the path. Testing
showed adoption of such a scheme halving lookup times (this implementation
actually concatenates scheme + host in the first 24 bits and path + query in
the trailing 40 bits).
<bean id="bdbUriUniqFilter" class="org.archive.crawler.util.BdbUriUniqFilter">
<!-- <property name="bdbModule" value="" /> -->
<!-- <property name="beanName" value="" /> -->
<!-- <property name="recoveryCheckpoint" value="" /> -->
</bean>
- bdbModule
- (BdbModule)
- beanName
- (String)
- recoveryCheckpoint
- (Checkpoint)
CheckpointService¶
Executes checkpoints, and offers convenience methods for enumerating
available Checkpoints and injecting a recovery-Checkpoint after
build and before launch (setRecoveryCheckpointByName). Offers optional automatic checkpointing at a configurable interval
in minutes.
<bean id="checkpointService" class="org.archive.crawler.framework.CheckpointService">
<!-- <property name="checkpointsDir" value="" /> -->
<!-- <property name="checkpointIntervalMinutes" value="1" /> -->
<!-- <property name="forgetAllButLatest" value="false" /> -->
<!-- <property name="recoveryCheckpoint" value="" /> -->
<!-- <property name="crawlController" value="" /> -->
<!-- <property name="applicationContext" value="" /> -->
<!-- <property name="recoveryCheckpointByName" value="" /> -->
</bean>
- checkpointsDir
- (ConfigPath) Checkpoints directory
- checkpointIntervalMinutes
- (long) Period at which to create automatic checkpoints; -1 means no auto checkpointing.
- forgetAllButLatest
- (boolean) True to save only the latest checkpoint, false to save all of them. Default is false.
- recoveryCheckpoint
- (Checkpoint)
- crawlController
- (CrawlController)
- applicationContext
- (ApplicationContext)
- recoveryCheckpointByName
- (String) Given the name of a valid checkpoint subdirectory in the checkpoints directory, create a Checkpoint instance, and insert it into all Checkpointable beans.
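For example, to checkpoint automatically once an hour and retain only the most recent checkpoint (the interval value is illustrative):
<bean id="checkpointService" class="org.archive.crawler.framework.CheckpointService">
  <property name="checkpointIntervalMinutes" value="60" />
  <property name="forgetAllButLatest" value="true" />
</bean>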
CrawlController¶
CrawlController collects all the classes which cooperate to
perform a crawl and provides a high-level interface to the
running crawl. As the “global context” for a crawl, subcomponents will
often reach each other through the CrawlController.
<bean id="crawlController" class="org.archive.crawler.framework.CrawlController">
<!-- <property name="applicationContext" value="" /> -->
<!-- <property name="metadata" value="" /> -->
<!-- <property name="serverCache" value="" /> -->
<!-- <property name="frontier" value="" /> -->
<!-- <property name="scratchDir" value="" /> -->
<!-- <property name="statisticsTracker" value="" /> -->
<!-- <property name="seeds" value="" /> -->
<!-- <property name="fetchChain" value="" /> -->
<!-- <property name="dispositionChain" value="" /> -->
<!-- <property name="candidateChain" value="" /> -->
<!-- <property name="maxToeThreads" value="" /> -->
<!-- <property name="runWhileEmpty" value="false" /> -->
<!-- <property name="pauseAtStart" value="true" /> -->
<!-- <property name="recorderOutBufferBytes" value="" /> -->
<!-- <property name="recorderInBufferBytes" value="" /> -->
<!-- <property name="loggerModule" value="" /> -->
<!-- <property name="recoveryCheckpoint" value="" /> -->
</bean>
- applicationContext
- (ApplicationContext)
- metadata
- (CrawlMetadata)
- serverCache
- (ServerCache)
- frontier
- (Frontier) The frontier to use for the crawl.
- scratchDir
- (ConfigPath) Scratch directory for temporary overflow-to-disk
- statisticsTracker
- (StatisticsTracker) Statistics tracking modules. Any number of specialized statistics trackers that monitor a crawl and write logs, reports and/or provide information to the user interface.
- seeds
- (SeedModule)
- fetchChain
- (FetchChain) Fetch chain
- dispositionChain
- (DispositionChain) Disposition chain
- candidateChain
- (CandidateChain) Candidate chain
- maxToeThreads
- (int) Maximum number of threads processing URIs at the same time.
- runWhileEmpty
- (boolean) whether to keep running (without pause or finish) when frontier is empty
- pauseAtStart
- (boolean) whether to pause at crawl start
- recorderOutBufferBytes
- (int) Size in bytes of in-memory buffer to record outbound traffic. One such buffer is reserved for every ToeThread.
- recorderInBufferBytes
- (int) Size in bytes of in-memory buffer to record inbound traffic. One such buffer is reserved for every ToeThread.
- loggerModule
- (CrawlerLoggerModule)
- recoveryCheckpoint
- (Checkpoint)
CrawlerLoggerModule¶
Module providing all expected whole-crawl logging facilities
<bean id="crawlerLoggerModule" class="org.archive.crawler.reporting.CrawlerLoggerModule">
<!-- <property name="path" value="" /> -->
<!-- <property name="logExtraInfo" value="false" /> -->
<!-- <property name="crawlLogPath" value="" /> -->
<!-- <property name="alertsLogPath" value="" /> -->
<!-- <property name="progressLogPath" value="" /> -->
<!-- <property name="uriErrorsLogPath" value="" /> -->
<!-- <property name="runtimeErrorsLogPath" value="" /> -->
<!-- <property name="nonfatalErrorsLogPath" value="" /> -->
<!-- <property name="upSimpleLog" value="" /> -->
<!-- <property name="recoveryCheckpoint" value="" /> -->
</bean>
- path
- (ConfigPath)
- logExtraInfo
- (boolean) Whether to include the “extra info” field for each entry in crawl.log. “Extra info” is arbitrary JSON. It is the last field of the log line.
- crawlLogPath
- (ConfigPath)
- alertsLogPath
- (ConfigPath)
- progressLogPath
- (ConfigPath)
- uriErrorsLogPath
- (ConfigPath)
- runtimeErrorsLogPath
- (ConfigPath)
- nonfatalErrorsLogPath
- (ConfigPath)
- upSimpleLog
- (String)
- recoveryCheckpoint
- (Checkpoint)
CrawlLimitEnforcer¶
Bean to enforce limits on the size of a crawl in URI count, byte count, or elapsed time. Listens for StatSnapshotEvents, so limits are only checked at the interval (configured in StatisticsTracker) of those events.
<bean id="crawlLimitEnforcer" class="org.archive.crawler.framework.CrawlLimitEnforcer">
<!-- <property name="maxBytesDownload" value="0" /> -->
<!-- <property name="maxNovelBytes" value="0" /> -->
<!-- <property name="maxNovelUrls" value="0" /> -->
<!-- <property name="maxWarcNovelUrls" value="0" /> -->
<!-- <property name="maxWarcNovelBytes" value="0" /> -->
<!-- <property name="maxDocumentsDownload" value="0" /> -->
<!-- <property name="maxTimeSeconds" value="0" /> -->
<!-- <property name="crawlController" value="" /> -->
</bean>
- maxBytesDownload
- (long) Maximum number of bytes to download. Once this number is exceeded the crawler will stop. A value of zero means no upper limit.
- maxNovelBytes
- (long) Maximum number of uncompressed payload bytes to write to WARC response or resource records. Once this number is exceeded the crawler will stop. A value of zero means no upper limit.
- maxNovelUrls
- (long) Maximum number of novel (not deduplicated) urls to download. Once this number is exceeded the crawler will stop. A value of zero means no upper limit.
- maxWarcNovelUrls
- (long) Maximum number of urls to write to WARC response or resource records. Once this number is exceeded the crawler will stop. A value of zero means no upper limit.
- maxWarcNovelBytes
- (long) Maximum number of novel (not deduplicated) bytes to write to WARC response or resource records. Once this number is exceeded the crawler will stop. A value of zero means no upper limit.
- maxDocumentsDownload
- (long) Maximum number of documents to download. Once this number is exceeded the crawler will stop. A value of zero means no upper limit.
- maxTimeSeconds
- (long) Maximum amount of time to crawl (in seconds). Once this much time has elapsed the crawler will stop. A value of zero means no upper limit.
- crawlController
- (CrawlController)
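For example, to stop a crawl after roughly 100 GB downloaded or seven days elapsed, whichever comes first (the limit values are illustrative; zero leaves a limit disabled):
<bean id="crawlLimitEnforcer" class="org.archive.crawler.framework.CrawlLimitEnforcer">
  <property name="maxBytesDownload" value="100000000000" />
  <property name="maxTimeSeconds" value="604800" />
</bean>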
CrawlMetadata¶
Basic crawl metadata, as consulted by functional modules and recorded in ARCs/WARCs.
<bean id="crawlMetadata" class="org.archive.modules.CrawlMetadata">
<!-- <property name="robotsPolicyName" value="obey" /> -->
<!-- <property name="availableRobotsPolicies" value="" /> -->
<!-- <property name="operator" value="" /> -->
<!-- <property name="description" value="" /> -->
<!-- <property name="userAgentTemplate" value="Mozilla/5.0 (compatible; heritrix/@VERSION@ +@OPERATOR_CONTACT_URL@)" /> -->
<!-- <property name="operatorFrom" value="" /> -->
<!-- <property name="operatorContactUrl" value="" /> -->
<!-- <property name="audience" value="" /> -->
<!-- <property name="organization" value="" /> -->
<!-- <property name="jobName" value="" /> -->
</bean>
- robotsPolicyName
- (String) Robots policy name
- availableRobotsPolicies
- (Map) Map of all available RobotsPolicies, by name, to choose from, assembled from declared instances in the configuration plus the standard ‘obey’ (aka ‘classic’) and ‘ignore’ policies.
- operator
- (String)
- description
- (String)
- userAgentTemplate
- (String)
- operatorFrom
- (String)
- operatorContactUrl
- (String)
- audience
- (String)
- organization
- (String)
- jobName
- (String)
CredentialStore¶
Front door to the credential store. Come here to get at credentials. See Credential
Store Design.
<bean id="credentialStore" class="org.archive.modules.credential.CredentialStore">
<!-- <property name="credentials" value="" /> -->
</bean>
- credentials
- (Map) Credentials used by Heritrix when authenticating. See http://crawler.archive.org/proposals/auth/ for background.
DecideRuleSequence¶
<bean id="decideRuleSequence" class="org.archive.modules.deciderules.DecideRuleSequence">
<!-- <property name="logToFile" value="false" /> -->
<!-- <property name="logExtraInfo" value="false" /> -->
<!-- <property name="loggerModule" value="" /> -->
<!-- <property name="rules" value="" /> -->
<!-- <property name="serverCache" value="" /> -->
<!-- <property name="beanName" value="" /> -->
</bean>
- logToFile
- (boolean) If enabled, log decisions to file named logs/{spring-bean-id}.log. Format
is: [timestamp] [decisive-rule-num] [decisive-rule-class] [decision]
[uri] [extraInfo]
Relies on Spring Lifecycle to initialize the log. Only top-level beans get the Lifecycle treatment from Spring, so bean must be top-level for logToFile to work. (This is true of other modules that support logToFile, and anything else that uses Lifecycle, as well.)
- logExtraInfo
- (boolean) Whether to include the “extra info” field for each entry in crawl.log. “Extra info” is a json object with entries “host”, “via”, “source” and “hopPath”.
- loggerModule
- (SimpleFileLoggerProvider)
- rules
- (List)
- serverCache
- (ServerCache)
- beanName
- (String)
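A DecideRuleSequence is typically used as the crawl scope: the rules are evaluated in order and the last rule to return a non-NONE decision is decisive. A minimal sketch in the spirit of the default profile (the exact rule list is a per-job choice):
<bean id="scope" class="org.archive.modules.deciderules.DecideRuleSequence">
  <property name="rules">
    <list>
      <bean class="org.archive.modules.deciderules.RejectDecideRule" />
      <bean class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule" />
      <bean class="org.archive.modules.deciderules.TooManyHopsDecideRule" />
      <bean class="org.archive.modules.deciderules.TransclusionDecideRule" />
      <bean class="org.archive.modules.deciderules.PathologicalPathDecideRule" />
      <bean class="org.archive.modules.deciderules.PrerequisiteAcceptDecideRule" />
    </list>
  </property>
</bean>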
DiskSpaceMonitor¶
Monitors the available space on the paths configured. If the available space
drops below a specified threshold a crawl pause is requested.
Monitoring is done via the java.io.File.getUsableSpace() method. This method
will sometimes fail on network attached storage, returning 0 bytes available
even if that is not actually the case.
Paths that do not resolve to actual filesystem folders or files will not be
evaluated (i.e. if java.io.File.exists() returns false, no further processing
is carried out on that File).
Paths are checked for available space whenever a StatSnapshotEvent occurs.
<bean id="diskSpaceMonitor" class="org.archive.crawler.monitor.DiskSpaceMonitor">
<!-- <property name="monitorPaths" value="" /> -->
<!-- <property name="pauseThresholdMiB" value="8192" /> -->
<!-- <property name="monitorConfigPaths" value="true" /> -->
<!-- <property name="crawlController" value="" /> -->
<!-- <property name="configPathConfigurer" value="" /> -->
</bean>
- monitorPaths
- (List)
- pauseThresholdMiB
- (long) Set the minimum amount of space that must be available on all monitored paths. If the amount falls below this pause threshold on any path the crawl will be paused.
- monitorConfigPaths
- (boolean) If enabled, all the paths returned by ConfigPathConfigurer#getAllConfigPaths() will be monitored in addition to any paths explicitly specified via #setMonitorPaths(List). Enabled (true) by default. Note: this is not guaranteed to contain all paths that Heritrix writes to. It is the responsibility of modules that write to disk to register their activity with the ConfigPathConfigurer and some may not do so.
- crawlController
- (CrawlController) Autowire access to CrawlController *
- configPathConfigurer
- (ConfigPathConfigurer) Autowire access to ConfigPathConfigurer *
RulesCanonicalizationPolicy¶
URI Canonicalization Policy
<bean id="rulesCanonicalizationPolicy" class="org.archive.modules.canonicalize.RulesCanonicalizationPolicy">
<!-- <property name="rules" value="" /> -->
</bean>
- rules
- (List)
SheetOverlaysManager¶
Manager which marks-up CrawlURIs with the names of all applicable Sheets, and returns overlay maps by name.
<bean id="sheetOverlaysManager" class="org.archive.crawler.spring.SheetOverlaysManager">
<!-- <property name="beanFactory" value="" /> -->
<!-- <property name="sheetsByName" value="" /> -->
</bean>
- beanFactory
- (BeanFactory)
- sheetsByName
- (Map) Collect all Sheets, by beanName.
StatisticsTracker¶
This is an implementation of the AbstractTracker. It is designed to function
with the WUI as well as performing various logging activity.
At the end of each snapshot a line is written to the
‘progress-statistics.log’ file.
The header of that file is as follows:
[timestamp] [discovered] [queued] [downloaded] [doc/s(avg)] [KB/s(avg)] [dl-failures] [busy-thread] [mem-use-KB]
discovered, queued, downloaded and dl-failures
are (respectively) the discovered URI count, pending URI count, successfully
fetched count and failed fetch count from the frontier at the time of the
snapshot.
KB/s(avg) is the bandwidth usage. We use the total bytes downloaded
to calculate average bandwidth usage (KB/sec). Since we also note the value
each time a snapshot is made we can calculate the average bandwidth usage
during the last snapshot period to gain a “current” rate. The first number is
the current and the average is in parentheses.
doc/s(avg) works the same way as KB/s(avg) except it shows the number of
documents (URIs) rather than KB downloaded.
busy-threads is the total number of ToeThreads that are not available
(and thus presumably busy processing a URI). This information is extracted
from the crawl controller.
Finally, mem-use-KB is extracted from the runtime environment
(Runtime.getRuntime().totalMemory()).
In addition to the data collected for the above logs, various other data
is gathered and stored by this tracker:
- Successfully downloaded documents per fetch status code
- Successfully downloaded documents per document mime type
- Amount of data per mime type
- Successfully downloaded documents per host
- Amount of data per host
- Disposition of all seeds (this is written to ‘reports.log’ at end of crawl)
- Successfully downloaded documents per host per source
<bean id="statisticsTracker" class="org.archive.crawler.reporting.StatisticsTracker">
<!-- <property name="seeds" value="" /> -->
<!-- <property name="bdbModule" value="" /> -->
<!-- <property name="reportsDir" value="" /> -->
<!-- <property name="serverCache" value="" /> -->
<!-- <property name="liveHostReportSize" value="20" /> -->
<!-- <property name="applicationContext" value="" /> -->
<!-- <property name="trackSeeds" value="true" /> -->
<!-- <property name="trackSources" value="true" /> -->
<!-- <property name="intervalSeconds" value="20" /> -->
<!-- <property name="keepSnapshotsCount" value="5" /> -->
<!-- <property name="crawlController" value="" /> -->
<!-- <property name="reports" value="" /> -->
<!-- <property name="beanName" value="" /> -->
<!-- <property name="recoveryCheckpoint" value="" /> -->
</bean>
- seeds
- (SeedModule)
- bdbModule
- (BdbModule)
- reportsDir
- (ConfigPath)
- serverCache
- (ServerCache)
- liveHostReportSize
- (int)
- applicationContext
- (ApplicationContext)
- trackSeeds
- (boolean) Whether to maintain seed disposition records (expensive in crawls with millions of seeds)
- trackSources
- (boolean) Whether to maintain hosts-per-source-tag records for; very expensive in crawls with large numbers of source-tags (seeds) or large crawls over many hosts
- intervalSeconds
- (int) The interval between writing progress information to log.
- keepSnapshotsCount
- (int) Number of crawl-stat sample snapshots to keep for calculation purposes.
- crawlController
- (CrawlController)
- reports
- (List)
- beanName
- (String)
- recoveryCheckpoint
- (Checkpoint)
TextSeedModule¶
Module that announces a list of seeds from a text source (such as a ConfigFile or ConfigString), and provides a mechanism for adding seeds after a crawl has begun.
<bean id="textSeedModule" class="org.archive.modules.seeds.TextSeedModule">
<!-- <property name="textSource" value="null" /> -->
<!-- <property name="blockAwaitingSeedLines" value="1" /> -->
</bean>
- textSource
- (ReadSource) Text from which to extract seeds
- blockAwaitingSeedLines
- (int) Number of lines of seeds-source to read on initial load before proceeding with crawl. Default is -1, meaning all. Any other value will cause that number of lines to be loaded before fetching begins, while all extra lines continue to be processed in the background. Generally, this should only be changed when working with very large seed lists, and scopes that do *not* depend on reading all seeds.
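For example, reading seeds from a seeds.txt file in the job directory, as in the default profile:
<bean id="seeds" class="org.archive.modules.seeds.TextSeedModule">
  <property name="textSource">
    <bean class="org.archive.spring.ConfigFile">
      <property name="path" value="seeds.txt" />
    </bean>
  </property>
</bean>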
Decide Rules¶
AcceptDecideRule¶
<bean id="acceptDecideRule" class="org.archive.modules.deciderules.AcceptDecideRule">
</bean>
ClassKeyMatchesRegexDecideRule¶
Rule applies the configured decision to any CrawlURI whose class key – i.e. CrawlURI#getClassKey() – matches the supplied regex.
<bean id="classKeyMatchesRegexDecideRule" class="org.archive.crawler.deciderules.ClassKeyMatchesRegexDecideRule">
<!-- <property name="crawlController" value="" /> -->
</bean>
- crawlController
- (CrawlController)
ContentLengthDecideRule¶
<bean id="contentLengthDecideRule" class="org.archive.modules.deciderules.ContentLengthDecideRule">
<!-- <property name="contentLengthThreshold" value="" /> -->
</bean>
- contentLengthThreshold
- (long) Content-length threshold. The rule returns ACCEPT if the content-length is less than this threshold, or REJECT otherwise. The default is 2^63, meaning any document will be accepted.
ContentTypeMatchesRegexDecideRule¶
DecideRule whose decision is applied if the URI’s content-type is present and matches the supplied regular expression.
<bean id="contentTypeMatchesRegexDecideRule" class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
</bean>
ContentTypeNotMatchesRegexDecideRule¶
DecideRule whose decision is applied if the URI’s content-type is present and does not match the supplied regular expression.
<bean id="contentTypeNotMatchesRegexDecideRule" class="org.archive.modules.deciderules.ContentTypeNotMatchesRegexDecideRule">
</bean>
ExpressionDecideRule (contrib)¶
Example usage:
<bean class=”org.archive.modules.deciderules.ExpressionDecideRule”>
<property name=”groovyExpression” value=’curi.via == null && curi ==~ “^https?://(?:www\.)?(facebook|vimeo|flickr)\.com/.*”’/>
</bean>
<bean id="expressionDecideRule" class="org.archive.modules.deciderules.ExpressionDecideRule">
<!-- <property name="groovyExpression" value="" /> -->
</bean>
- groovyExpression
- (String)
ExternalGeoLocationDecideRule¶
A rule that can be configured to take alternate implementations
of the ExternalGeoLocationInterface. If no implementation is specified, or
none is found, it returns the configured decision. If the host in the URI has
been resolved, it checks the CrawlHost for the country code determination. If
the country code is not present, it does a country lookup and saves the
country code to the CrawlHost for future consultation. If the country code is
present in the CrawlHost, it compares it against the configured code.
Note that if a host’s IP address changes during the crawl, we still consider
the associated hostname to be in the country of its original IP address.
<bean id="externalGeoLocationDecideRule" class="org.archive.modules.deciderules.ExternalGeoLocationDecideRule">
<!-- <property name="lookup" value="null" /> -->
<!-- <property name="countryCodes" value="" /> -->
<!-- <property name="serverCache" value="" /> -->
</bean>
- lookup
- (ExternalGeoLookupInterface)
- countryCodes
- (List) Country code name.
- serverCache
- (ServerCache)
FetchStatusDecideRule¶
Rule applies the configured decision for any URI which has a fetch status equal to the ‘target-status’ setting.
<bean id="fetchStatusDecideRule" class="org.archive.modules.deciderules.FetchStatusDecideRule">
<!-- <property name="statusCodes" value="" /> -->
</bean>
- statusCodes
- (List)
FetchStatusMatchesRegexDecideRule¶
<bean id="fetchStatusMatchesRegexDecideRule" class="org.archive.modules.deciderules.FetchStatusMatchesRegexDecideRule">
</bean>
FetchStatusNotMatchesRegexDecideRule¶
<bean id="fetchStatusNotMatchesRegexDecideRule" class="org.archive.modules.deciderules.FetchStatusNotMatchesRegexDecideRule">
</bean>
HasViaDecideRule¶
Rule applies the configured decision for any URI which has a ‘via’ (essentially, any URI that was not a seed or some kinds of mid-crawl adds).
<bean id="hasViaDecideRule" class="org.archive.modules.deciderules.HasViaDecideRule">
</bean>
HopCrossesAssignmentLevelDomainDecideRule¶
Applies its decision if the current URI differs from its referring URI in that portion of its hostname/domain that is assigned/sold by registrars, its ‘assignment-level-domain’ (ALD) (AKA ‘public suffix’ or, in previous Heritrix versions, ‘topmost assigned SURT’).
<bean id="hopCrossesAssignmentLevelDomainDecideRule" class="org.archive.modules.deciderules.HopCrossesAssignmentLevelDomainDecideRule">
</bean>
HopsPathMatchesRegexDecideRule¶
Rule applies configured decision to any CrawlURIs whose ‘hops-path’ (string like “LLXE” etc.) matches the supplied regex.
<bean id="hopsPathMatchesRegexDecideRule" class="org.archive.modules.deciderules.HopsPathMatchesRegexDecideRule">
</bean>
IdenticalDigestDecideRule¶
Rule applies configured decision to any CrawlURIs whose revisit profile is set with a profile matching WARCConstants#PROFILE_REVISIT_IDENTICAL_DIGEST
<bean id="identicalDigestDecideRule" class="org.archive.modules.deciderules.recrawl.IdenticalDigestDecideRule">
</bean>
IpAddressSetDecideRule¶
IpAddressSetDecideRule must be used with
org.archive.crawler.prefetch.Preselector#setRecheckScope(boolean) set
to true because it relies on Heritrix’ dns lookup to establish the ip address
for a URI before it can run.
<bean class=”org.archive.modules.deciderules.IpAddressSetDecideRule”>
<property name=”ipAddresses”>
<set>
<value>127.0.0.1</value>
<value>69.89.27.209</value>
</set>
</property>
<property name=’decision’ value=’REJECT’ />
</bean>
<bean id="ipAddressSetDecideRule" class="org.archive.modules.deciderules.IpAddressSetDecideRule">
<!-- <property name="ipAddresses" value="" /> -->
<!-- <property name="serverCache" value="" /> -->
</bean>
- ipAddresses
- (Set)
- serverCache
- (ServerCache)
MatchesFilePatternDecideRule¶
Compares suffix of a passed CrawlURI, UURI, or String against a regular
expression pattern, applying its configured decision to all matches. Several predefined patterns are available for convenience. Choosing
‘custom’ makes this the same as a regular MatchesRegexDecideRule.
<bean id="matchesFilePatternDecideRule" class="org.archive.modules.deciderules.MatchesFilePatternDecideRule">
<!-- <property name="usePreset" value="" /> -->
</bean>
- usePreset
- (Preset)
MatchesListRegexDecideRule¶
Rule applies configured decision to any CrawlURIs whose String URI
matches the supplied regexes.
The list of regular expressions can be considered logically AND or OR.
<bean id="matchesListRegexDecideRule" class="org.archive.modules.deciderules.MatchesListRegexDecideRule">
<!-- <property name="timeoutPerRegexSeconds" value="0" /> -->
<!-- <property name="regexList" value="" /> -->
<!-- <property name="listLogicalOr" value="true" /> -->
</bean>
- timeoutPerRegexSeconds
- (long) The timeout for regular expression matching, in seconds. If set to 0 or negative then no timeout is specified and there is no upper limit to how long the matching may take. See the corresponding test class MatchesListRegexDecideRuleTest for a pathological example.
- regexList
- (List) The list of regular expressions to evaluate against the URI.
- listLogicalOr
- (boolean) True if the list of regular expressions should be considered as logically OR when matching; false if the list should be considered as logically AND.
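For example, to REJECT any URI matching at least one of several patterns (the patterns are illustrative):
<bean class="org.archive.modules.deciderules.MatchesListRegexDecideRule">
  <property name="decision" value="REJECT" />
  <property name="listLogicalOr" value="true" />
  <property name="regexList">
    <list>
      <value>.*\.exe$</value>
      <value>.*\.iso$</value>
    </list>
  </property>
</bean>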
MatchesRegexDecideRule¶
Rule applies configured decision to any CrawlURIs whose String URI matches the supplied regex.
<bean id="matchesRegexDecideRule" class="org.archive.modules.deciderules.MatchesRegexDecideRule">
<!-- <property name="regex" value="" /> -->
</bean>
- regex
- (Pattern)
MatchesStatusCodeDecideRule¶
Provides a rule that returns “true” for any CrawlURIs which have a fetch status code that falls within the provided inclusive range. For instance, to select only URIs with a “success” status code you must provide the range 200 to 299.
<bean id="matchesStatusCodeDecideRule" class="org.archive.modules.deciderules.MatchesStatusCodeDecideRule">
<!-- <property name="lowerBound" value="" /> -->
<!-- <property name="upperBound" value="" /> -->
</bean>
- lowerBound
- (Integer) Sets the lower bound on the range of acceptable status codes.
- upperBound
- (Integer) Sets the upper bound on the range of acceptable status codes.
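Following the range given in the description, a rule matching only ‘success’ status codes would look like:
<bean class="org.archive.modules.deciderules.MatchesStatusCodeDecideRule">
  <property name="lowerBound" value="200" />
  <property name="upperBound" value="299" />
</bean>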
NotMatchesFilePatternDecideRule¶
Rule applies configured decision to any URIs which do *not* match the supplied (file-pattern) regex.
<bean id="notMatchesFilePatternDecideRule" class="org.archive.modules.deciderules.NotMatchesFilePatternDecideRule">
</bean>
NotMatchesListRegexDecideRule¶
Rule applies configured decision to any URIs which do *not* match the supplied regex.
<bean id="notMatchesListRegexDecideRule" class="org.archive.modules.deciderules.NotMatchesListRegexDecideRule">
</bean>
NotMatchesRegexDecideRule¶
Rule applies configured decision to any URIs which do *not* match the supplied regex.
<bean id="notMatchesRegexDecideRule" class="org.archive.modules.deciderules.NotMatchesRegexDecideRule">
</bean>
NotMatchesStatusCodeDecideRule¶
Provides a rule that returns “true” for any CrawlURIs which have a fetch status code that does not fall within the provided inclusive range. For instance, to reject any URIs with a “client error” status code you must provide the range 400 to 499.
<bean id="notMatchesStatusCodeDecideRule" class="org.archive.modules.deciderules.NotMatchesStatusCodeDecideRule">
<!-- <property name="upperBound" value="" /> -->
</bean>
- upperBound
- (Integer) Sets the upper bound on the range of acceptable status codes.
NotOnDomainsDecideRule¶
Rule applies configured decision to any URIs that are *not* in one of the domains in the configured set of domains, filled from the seed set.
<bean id="notOnDomainsDecideRule" class="org.archive.modules.deciderules.surt.NotOnDomainsDecideRule">
</bean>
NotOnHostsDecideRule¶
Rule applies configured decision to any URIs that are *not* on one of the hosts in the configured set of hosts, filled from the seed set.
<bean id="notOnHostsDecideRule" class="org.archive.modules.deciderules.surt.NotOnHostsDecideRule">
</bean>
NotSurtPrefixedDecideRule¶
Rule applies configured decision to any URIs that, when
expressed in SURT form, do *not* begin with one of the prefixes
in the configured set. The set can be filled with SURT prefixes implied or
listed in the seeds file, or another external file.
<bean id="notSurtPrefixedDecideRule" class="org.archive.modules.deciderules.surt.NotSurtPrefixedDecideRule">
</bean>
OnDomainsDecideRule¶
Rule applies configured decision to any URIs that are on one of the domains in the configured set of domains, filled from the seed set.
<bean id="onDomainsDecideRule" class="org.archive.modules.deciderules.surt.OnDomainsDecideRule">
</bean>
OnHostsDecideRule¶
Rule applies configured decision to any URIs that are on one of the hosts in the configured set of hosts, filled from the seed set.
<bean id="onHostsDecideRule" class="org.archive.modules.deciderules.surt.OnHostsDecideRule">
</bean>
PathologicalPathDecideRule¶
Rule REJECTs any URI which contains an excessive number of identical, consecutive path-segments (e.g. http://example.com/a/a/a/boo.html has 3 consecutive ‘/a’ segments)
<bean id="pathologicalPathDecideRule" class="org.archive.modules.deciderules.PathologicalPathDecideRule">
<!-- <property name="maxRepetitions" value="2" /> -->
</bean>
- maxRepetitions
- (int) Number of times the pattern should be allowed to occur. This rule returns its decision (usually REJECT) if a path-segment is repeated more than this number of times.
PredicatedDecideRule¶
Rule which applies the configured decision only if a test evaluates to true. Subclasses override evaluate() to establish the test.
<bean id="predicatedDecideRule" class="org.archive.modules.deciderules.PredicatedDecideRule">
<!-- <property name="decision" value="" /> -->
</bean>
- decision
- (DecideResult)
PrerequisiteAcceptDecideRule¶
Rule which ACCEPTs all ‘prerequisite’ URIs (those with a ‘P’ in the last hopsPath position). Good in a late position to ensure other scope settings don’t lock out necessary prerequisites.
<bean id="prerequisiteAcceptDecideRule" class="org.archive.modules.deciderules.PrerequisiteAcceptDecideRule">
</bean>
RejectDecideRule¶
<bean id="rejectDecideRule" class="org.archive.modules.deciderules.RejectDecideRule">
</bean>
ResourceLongerThanDecideRule¶
Applies configured decision for URIs with content length greater than
a given threshold length value. Examines either HTTP header Content-Length
or actual downloaded content length (based on the useHeaderLength property),
and has no effect on resources shorter than or equal to the given threshold
value. Note that because neither the Content-Length header nor the actual size are
available at URI-scoping time, this rule is unusable in crawl scopes.
Instead, the earliest it can be used is as a mid-fetch rule (in FetchHTTP),
when the headers are available but not yet the body. It can also be used
to affect processing after the URI is fully fetched.
<bean id="resourceLongerThanDecideRule" class="org.archive.modules.deciderules.ResourceLongerThanDecideRule">
</bean>
ResourceNoLongerThanDecideRule¶
Applies configured decision for URIs with content length less than or equal
to a given threshold length value. Examines either HTTP header Content-Length
or actual downloaded content length (based on the useHeaderLength property),
and has no effect on resources longer than the given threshold value. Note that because neither the Content-Length header nor the actual size are
available at URI-scoping time, this rule is unusable in crawl scopes.
Instead, the earliest it can be used is as a mid-fetch rule (in FetchHTTP),
when the headers are available but not yet the body. It can also be used
to affect processing after the URI is fully fetched.
<bean id="resourceNoLongerThanDecideRule" class="org.archive.modules.deciderules.ResourceNoLongerThanDecideRule">
<!-- <property name="useHeaderLength" value="true" /> -->
<!-- <property name="contentLengthThreshold" value="1" /> -->
</bean>
- useHeaderLength
- (boolean) Shall this rule be used as a midfetch rule? If true, this rule will determine content length based on HTTP header information, otherwise the size of the already downloaded content will be used.
- contentLengthThreshold
- (long) Max content-length this filter will allow to pass through. If -1, then no limit.
ResponseContentLengthDecideRule¶
Decide rule that will ACCEPT or REJECT a URI, depending on the “decision” property, after it is fetched, if the content body is within a specified size range (in bytes).
<bean id="responseContentLengthDecideRule" class="org.archive.modules.deciderules.ResponseContentLengthDecideRule">
<!-- <property name="lowerBound" value="0" /> -->
<!-- <property name="upperBound" value="" /> -->
</bean>
- lowerBound
- (long) The rule will apply if the url has been fetched and content body length is greater than or equal to this number of bytes. Default is 0, meaning everything will match.
- upperBound
- (long) The rule will apply if the url has been fetched and content body length is less than or equal to this number of bytes. Default is Long.MAX_VALUE, meaning everything will match.
SchemeNotInSetDecideRule¶
Rule applies the configured decision (default REJECT) for any URI which has a URI-scheme NOT contained in the configured Set.
<bean id="schemeNotInSetDecideRule" class="org.archive.modules.deciderules.SchemeNotInSetDecideRule">
<!-- <property name="schemes" value="" /> -->
</bean>
- schemes
- (Set) set of schemes to test URI scheme
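For example, to apply the default REJECT decision to any URI whose scheme is neither http nor https (the scheme set is illustrative):
<bean class="org.archive.modules.deciderules.SchemeNotInSetDecideRule">
  <property name="schemes">
    <set>
      <value>http</value>
      <value>https</value>
    </set>
  </property>
</bean>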
ScriptedDecideRule¶
Rule which runs a JSR-223 script to make its decision. Script source may be provided via a file local to the crawler or
an inline configuration string. The source must include a one-argument function “decisionFor” which
returns the appropriate DecideResult. Variables available to the script include ‘object’ (the object to be
evaluated, typically a CrawlURI), ‘self’ (this ScriptedDecideRule
instance), and ‘context’ (the crawl’s ApplicationContext, from
which all named crawl beans are easily reachable).
<bean id="scriptedDecideRule" class="org.archive.modules.deciderules.ScriptedDecideRule">
<!-- <property name="engineName" value="beanshell" /> -->
<!-- <property name="scriptSource" value="null" /> -->
<!-- <property name="isolateThreads" value="true" /> -->
<!-- <property name="applicationContext" value="" /> -->
</bean>
- engineName
- (String) engine name; default “beanshell”
- scriptSource
- (ReadSource)
- isolateThreads
- (boolean) Whether each ToeThread should get its own independent script engine, or they should share synchronized access to one engine. Default is true, meaning each thread gets its own isolated engine.
- applicationContext
- (ApplicationContext)
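A minimal sketch using an inline script supplied via a ConfigString (assumes a Groovy JSR-223 engine is available; the rejection test itself is arbitrary):
<bean id="scriptedDecideRule" class="org.archive.modules.deciderules.ScriptedDecideRule">
  <property name="engineName" value="groovy" />
  <property name="scriptSource">
    <bean class="org.archive.spring.ConfigString">
      <property name="value">
        <value>
          import org.archive.modules.deciderules.DecideResult
          // REJECT any URI whose string form contains "calendar"; otherwise make no decision
          def decisionFor(object) {
            return object.toString().contains("calendar") ? DecideResult.REJECT : DecideResult.NONE
          }
        </value>
      </property>
    </bean>
  </property>
</bean>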
SeedAcceptDecideRule¶
Rule which ACCEPTs all ‘seed’ URIs (those for which isSeed is true). Good in a late position to ensure other scope settings don’t lock out explicitly added seeds.
<bean id="seedAcceptDecideRule" class="org.archive.modules.deciderules.SeedAcceptDecideRule">
</bean>
SourceSeedDecideRule¶
Rule applies the configured decision for any URI discovered from one of
the seeds in sourceSeeds. SeedModule#getSourceTagSeeds() must be enabled or
the rule will never apply.
<bean id="sourceSeedDecideRule" class="org.archive.modules.deciderules.SourceSeedDecideRule">
<!-- <property name="sourceSeeds" value="" /> -->
</bean>
- sourceSeeds
- (Set)
SurtPrefixedDecideRule¶
Rule applies configured decision to any URIs that, when
expressed in SURT form, begin with one of the prefixes
in the configured set. The set can be filled with SURT prefixes implied or
listed in the seeds file, or another external file. The “also-check-via” option to implement “one hop off”
scoping derives from a contribution by Shifra Raffel
of the California Digital Library.
<bean id="surtPrefixedDecideRule" class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule">
<!-- <property name="surtsSourceFile" value="" /> -->
<!-- <property name="surtsSource" value="null" /> -->
<!-- <property name="seedsAsSurtPrefixes" value="true" /> -->
<!-- <property name="surtsDumpFile" value="" /> -->
<!-- <property name="alsoCheckVia" value="false" /> -->
<!-- <property name="seeds" value="" /> -->
<!-- <property name="beanName" value="" /> -->
<!-- <property name="recoveryCheckpoint" value="" /> -->
</bean>
- surtsSourceFile
- (ConfigFile)
- surtsSource
- (ReadSource) Text from which to infer SURT prefixes. Any URLs will be converted to the implied SURT prefix, and literal SURT prefixes may be listed on lines beginning with a ‘+’ character.
- seedsAsSurtPrefixes
- (boolean) Should seeds also be interpreted as SURT prefixes.
- surtsDumpFile
- (ConfigFile) Dump file to save SURT prefixes actually used: Useful debugging SURTs.
- alsoCheckVia
- (boolean) Whether to also make the configured decision if a URI’s ‘via’ URI (the URI from which it was discovered) in SURT form begins with any of the established prefixes. For example, can be used to ACCEPT URIs that are ‘one hop off’ URIs fitting the SURT prefixes. Default is false.
- seeds
- (SeedModule)
- beanName
- (String)
- recoveryCheckpoint
- (Checkpoint)
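For example, prefixes can be supplied inline through surtsSource; a plain URL implies its SURT prefix, while a line beginning with ‘+’ is taken as a literal SURT prefix (the values are illustrative):
<bean class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule">
  <property name="surtsSource">
    <bean class="org.archive.spring.ConfigString">
      <property name="value">
        <value>
          http://example.org/section/
          +http://(org,example,
        </value>
      </property>
    </bean>
  </property>
</bean>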
TooManyHopsDecideRule¶
Rule REJECTs any CrawlURIs whose total number of hops (length of the hopsPath string, traversed links of any type) is over a threshold. Otherwise returns PASS.
<bean id="tooManyHopsDecideRule" class="org.archive.modules.deciderules.TooManyHopsDecideRule">
<!-- <property name="maxHops" value="20" /> -->
</bean>
- maxHops
- (int) Max path depth for which this filter will match.
TooManyPathSegmentsDecideRule¶
Rule REJECTs any CrawlURIs whose total number of path-segments (as indicated by the count of ‘/’ characters not including the first ‘//’) is over a given threshold.
<bean id="tooManyPathSegmentsDecideRule" class="org.archive.modules.deciderules.TooManyPathSegmentsDecideRule">
<!-- <property name="maxPathDepth" value="20" /> -->
</bean>
- maxPathDepth
- (int) Number of path segments beyond which this rule will reject URIs.
TransclusionDecideRule¶
Rule ACCEPTs any CrawlURIs whose path-from-seed (‘hopsPath’ – see
CrawlURI#getPathFromSeed()) ends
with at least one, but not more than, the given number of
non-navlink (‘L’) hops. Otherwise, if the path-from-seed is empty or if a navlink (‘L’) occurs
within max-trans-hops of the tail of the path-from-seed, this rule
returns PASS.
Thus, it allows things like embedded resources (frames/images/media)
and redirects to be transitively included (‘transcluded’) in a crawl,
even if they otherwise would not, for some reasonable number of hops
(usually 1-5).
<bean id="transclusionDecideRule" class="org.archive.modules.deciderules.TransclusionDecideRule">
<!-- <property name="maxTransHops" value="2" /> -->
<!-- <property name="maxSpeculativeHops" value="1" /> -->
</bean>
- maxTransHops
- (int) Maximum number of non-navlink (non-‘L’) hops to ACCEPT.
- maxSpeculativeHops
- (int) Maximum number of speculative (‘X’) hops to ACCEPT.
ViaSurtPrefixedDecideRule¶
Rule applies the configured decision for any URI which has a ‘via’ whose SURT form matches any SURT specified in the surtPrefixes list.
<bean id="viaSurtPrefixedDecideRule" class="org.archive.modules.deciderules.ViaSurtPrefixedDecideRule">
<!-- <property name="surtPrefixes" value="" /> -->
</bean>
- surtPrefixes
- (List)
Candidate Processors¶
CandidateScoper¶
Simple single-URI scoper, considers passed-in URI as candidate; sets fetchstatus negative and skips to end of processing if out-of-scope.
<bean id="candidateScoper" class="org.archive.crawler.prefetch.CandidateScoper">
</bean>
FrontierPreparer¶
Processor to preload URI with as much precalculated policy-based
info as possible before it reaches frontier critical sections. Frontiers also maintain a direct reference to this class, in case
they need to perform remedial preparation for URIs that do not
pass through this processor on the CandidateChain.
<bean id="frontierPreparer" class="org.archive.crawler.prefetch.FrontierPreparer">
<!-- <property name="preferenceDepthHops" value="1" /> -->
<!-- <property name="preferenceEmbedHops" value="1" /> -->
<!-- <property name="canonicalizationPolicy" value="" /> -->
<!-- <property name="queueAssignmentPolicy" value="" /> -->
<!-- <property name="uriPrecedencePolicy" value="" /> -->
<!-- <property name="costAssignmentPolicy" value="" /> -->
</bean>
- preferenceDepthHops
- (int) Number of hops (of any sort) from a seed up to which a URI has higher priority scheduling than any remaining seed. For example, if set to 1 items one hop (link, embed, redirect, etc.) away from a seed will be scheduled with HIGH priority. If set to -1, no preferencing will occur, and a breadth-first search with seeds processed before discovered links will proceed. If set to zero, a purely depth-first search will proceed, with all discovered links processed before remaining seeds. Seed redirects are treated as one hop from a seed.
- preferenceEmbedHops
- (int) number of hops of embeds (ERX) to bump to front of host queue
- canonicalizationPolicy
- (UriCanonicalizationPolicy) Ordered list of url canonicalization rules. Rules are applied in the order listed from top to bottom.
- queueAssignmentPolicy
- (QueueAssignmentPolicy) Defines how to assign URIs to queues. Can assign by host, by ip, by SURT-ordered authority, by SURT-ordered authority truncated to a topmost-assignable domain, and into one of a fixed set of buckets (1k).
- uriPrecedencePolicy
- (UriPrecedencePolicy) URI precedence assignment policy to use.
- costAssignmentPolicy
- (CostAssignmentPolicy) cost assignment policy to use.
Pre-Fetch Processors¶
PreconditionEnforcer¶
Ensures the preconditions for a fetch – such as DNS lookup or acquiring and respecting a robots.txt policy – are satisfied before a URI is passed to subsequent stages.
<bean id="preconditionEnforcer" class="org.archive.crawler.prefetch.PreconditionEnforcer">
<!-- <property name="ipValidityDurationSeconds" value="" /> -->
<!-- <property name="robotsValidityDurationSeconds" value="" /> -->
<!-- <property name="calculateRobotsOnly" value="false" /> -->
<!-- <property name="metadata" value="" /> -->
<!-- <property name="credentialStore" value="" /> -->
<!-- <property name="serverCache" value="" /> -->
<!-- <property name="loggerModule" value="" /> -->
</bean>
- ipValidityDurationSeconds
- (int) The minimum interval for which a dns-record will be considered valid (in seconds). If the record’s DNS TTL is larger, that will be used instead.
- robotsValidityDurationSeconds
- (int) The time in seconds that fetched robots.txt information is considered to be valid. If the value is set to ‘0’, then the robots.txt information will never expire.
- calculateRobotsOnly
- (boolean) Whether to only calculate the robots status of an URI, without actually applying any exclusions found. If true, excluded URIs will only be annotated in the crawl.log, but still fetched. Default is false.
- metadata
- (CrawlMetadata) Auto-discovered module providing configured (or overridden) User-Agent value and RobotsHonoringPolicy
- credentialStore
- (CredentialStore)
- serverCache
- (ServerCache)
- loggerModule
- (CrawlerLoggerModule)
Preselector¶
If set to recheck the crawl’s scope, gives a yes/no on whether a CrawlURI should be processed at all. If not, its status will be marked OUT_OF_SCOPE and the URI will skip directly to the first “postprocessor”.
<bean id="preselector" class="org.archive.crawler.prefetch.Preselector">
<!-- <property name="recheckScope" value="false" /> -->
<!-- <property name="blockAll" value="false" /> -->
<!-- <property name="blockByRegex" value="" /> -->
<!-- <property name="allowByRegex" value="" /> -->
</bean>
- recheckScope
- (boolean) Recheck if uri is in scope. This is meaningful if the scope is altered during a crawl. URIs are checked against the scope when they are added to queues. Setting this value to true forces the URI to be checked against the scope when it is coming out of the queue, possibly after the scope is altered.
- blockAll
- (boolean) Block all URIs from being processed. This is most likely to be used in overrides to easily reject certain hosts from being processed.
- blockByRegex
- (String) Block all URIs matching the regular expression from being processed.
- allowByRegex
- (String) Allow only URIs matching the regular expression to be processed.
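For example, a job (or an override sheet) might recheck scope and block one area of a site by regex (the pattern is illustrative):
<bean id="preselector" class="org.archive.crawler.prefetch.Preselector">
  <property name="recheckScope" value="true" />
  <property name="blockByRegex" value=".*/private/.*" />
</bean>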
Fetch Processors¶
FetchDNS¶
Processor to resolve ‘dns:’ URIs.
<bean id="fetchDNS" class="org.archive.modules.fetcher.FetchDNS">
<!-- <property name="acceptNonDnsResolves" value="false" /> -->
<!-- <property name="disableJavaDnsResolves" value="false" /> -->
<!-- <property name="dnsOverHttpServer" value="" /> -->
<!-- <property name="serverCache" value="" /> -->
<!-- <property name="digestContent" value="true" /> -->
<!-- <property name="digestAlgorithm" value="sha1" /> -->
</bean>
- acceptNonDnsResolves
- (boolean) If a DNS lookup fails, whether or not to fall back to InetAddress resolution, which may use local ‘hosts’ files or other mechanisms.
- disableJavaDnsResolves
- (boolean) Optionally, only allow InetAddress resolution, precisely because it
may use local ‘hosts’ files or other mechanisms.
This should not generally be used in production as it will prevent DNS lookups from being recorded properly.
- dnsOverHttpServer
- (String) URL of the DNS-over-HTTP(S) server. If this is not set or is set to an empty string, DNS-over-HTTP(S) will not be used; otherwise it should contain the URL of the DNS-over-HTTPS server.
- serverCache
- (ServerCache) Used to do DNS lookups.
- digestContent
- (boolean) Whether or not to perform an on-the-fly digest hash of retrieved content-bodies.
- digestAlgorithm
- (String) Which algorithm (for example MD5 or SHA-1) to use to perform an on-the-fly digest hash of retrieved content-bodies.
FetchFTP¶
Fetches documents and directory listings using FTP. This class will also try to extract FTP “links” from directory listings. For this class to archive a directory listing, the remote FTP server must support the NLST command. Most modern FTP servers should.
<bean id="fetchFTP" class="org.archive.modules.fetcher.FetchFTP">
<!-- <property name="username" value="anonymous" /> -->
<!-- <property name="password" value="password" /> -->
<!-- <property name="extractFromDirs" value="true" /> -->
<!-- <property name="extractParent" value="true" /> -->
<!-- <property name="digestContent" value="true" /> -->
<!-- <property name="digestAlgorithm" value="sha1" /> -->
<!-- <property name="maxLengthBytes" value="0" /> -->
<!-- <property name="maxFetchKBSec" value="0" /> -->
<!-- <property name="timeoutSeconds" value="" /> -->
<!-- <property name="soTimeoutMs" value="" /> -->
</bean>
- username
- (String) The username to send to FTP servers. By convention, the default value of “anonymous” is used for publicly available FTP sites.
- password
- (String) The password to send to FTP servers. By convention, anonymous users send their email address in this field.
- extractFromDirs
- (boolean) Set to true to extract further URIs from FTP directories. Default is true.
- extractParent
- (boolean) Set to true to extract the parent URI from all FTP URIs. Default is true.
- digestContent
- (boolean) Whether or not to perform an on-the-fly digest hash of retrieved content-bodies.
- digestAlgorithm
- (String) Which algorithm (for example MD5 or SHA-1) to use to perform an on-the-fly digest hash of retrieved content-bodies.
- maxLengthBytes
- (long) Maximum length in bytes to fetch. Fetch is truncated at this length. A value of 0 means no limit.
- maxFetchKBSec
- (int) The maximum KB/sec to use when fetching data from a server. The default of 0 means no maximum.
- timeoutSeconds
- (int) If the fetch is not completed in this number of seconds, give up (and retry later).
- soTimeoutMs
- (int) If the socket is unresponsive for this number of milliseconds, give up. Set to zero for no timeout (not recommended: could hang a thread on an unresponsive server). This timeout is used for timing out socket opens and for timing out each socket read. Make sure this value is < #TIMEOUT_SECONDS for optimal configuration: this ensures at least one retry read.
FetchHTTP¶
HTTP fetcher that uses Apache HttpComponents.
<bean id="fetchHTTP" class="org.archive.modules.fetcher.FetchHTTP">
<!-- <property name="serverCache" value="" /> -->
<!-- <property name="digestContent" value="true" /> -->
<!-- <property name="digestAlgorithm" value="sha1" /> -->
<!-- <property name="userAgentProvider" value="" /> -->
<!-- <property name="sendConnectionClose" value="true" /> -->
<!-- <property name="defaultEncoding" value="ISO-8859-1" /> -->
<!-- <property name="useHTTP11" value="false" /> -->
<!-- <property name="ignoreCookies" value="false" /> -->
<!-- <property name="sendReferer" value="true" /> -->
<!-- <property name="acceptCompression" value="false" /> -->
<!-- <property name="acceptHeaders" value="" /> -->
<!-- <property name="cookieStore" value="" /> -->
<!-- <property name="credentialStore" value="" /> -->
<!-- <property name="httpBindAddress" value="" /> -->
<!-- <property name="httpProxyHost" value="" /> -->
<!-- <property name="httpProxyPort" value="" /> -->
<!-- <property name="httpProxyUser" value="" /> -->
<!-- <property name="httpProxyPassword" value="" /> -->
<!-- <property name="maxFetchKBSec" value="0" /> -->
<!-- <property name="timeoutSeconds" value="" /> -->
<!-- <property name="soTimeoutMs" value="" /> -->
<!-- <property name="maxLengthBytes" value="0" /> -->
<!-- <property name="sendRange" value="false" /> -->
<!-- <property name="sendIfModifiedSince" value="true" /> -->
<!-- <property name="sendIfNoneMatch" value="true" /> -->
<!-- <property name="shouldFetchBodyRule" value="" /> -->
<!-- <property name="sslTrustLevel" value="" /> -->
<!-- <property name="socksProxyHost" value="" /> -->
<!-- <property name="socksProxyPort" value="" /> -->
</bean>
- serverCache
- (ServerCache) Used to do DNS lookups.
- digestContent
- (boolean) Whether or not to perform an on-the-fly digest hash of retrieved content-bodies.
- digestAlgorithm
- (String) Which algorithm (for example MD5 or SHA-1) to use to perform an on-the-fly digest hash of retrieved content-bodies.
- userAgentProvider
- (UserAgentProvider)
- sendConnectionClose
- (boolean) Send ‘Connection: close’ header with every request.
- defaultEncoding
- (String) The character encoding to use for files that do not have one specified in the HTTP response headers. Default: ISO-8859-1.
- useHTTP11
- (boolean) Use HTTP/1.1. Note: even when offering an HTTP/1.1 request, Heritrix may not properly handle persistent/keep-alive connections, so the sendConnectionClose parameter should remain ‘true’.
- ignoreCookies
- (boolean) Disable cookie handling.
- sendReferer
- (boolean) Send ‘Referer’ header with every request.
The ‘Referer’ header contains the location the crawler came from, the page the current URI was discovered in. The ‘Referer’ is usually logged on the remote server and can be of assistance to webmasters trying to figure out how a crawler got to a particular area on a site.
- acceptCompression
- (boolean) Set headers to accept compressed responses.
- acceptHeaders
- (List) Accept Headers to include in each request. Each must be the complete header, e.g., ‘Accept-Language: en’. (Thus, this can also be used to send other headers not beginning with ‘Accept-’.) By default Heritrix sends an Accept header similar to what a typical browser would send (the value comes from Firefox 4.0).
- cookieStore
- (AbstractCookieStore)
- credentialStore
- (CredentialStore) Used to store credentials.
- httpBindAddress
- (String) Local IP address or hostname to use when making connections (binding sockets). When not specified, uses default local address(es).
- httpProxyHost
- (String) Proxy host IP (set only if needed).
- httpProxyPort
- (Integer) Proxy port (set only if needed).
- httpProxyUser
- (String) Proxy user (set only if needed).
- httpProxyPassword
- (String) Proxy password (set only if needed).
- maxFetchKBSec
- (int) The maximum KB/sec to use when fetching data from a server. The default of 0 means no maximum.
- timeoutSeconds
- (int) If the fetch is not completed in this number of seconds, give up (and retry later).
- soTimeoutMs
- (int) If the socket is unresponsive for this number of milliseconds, give up. Set to zero for no timeout (not recommended: it could hang a thread on an unresponsive server). This timeout is used for timing out socket opens and each socket read. Make sure this value is < #getTimeoutSeconds() for optimal configuration: this ensures at least one retry read.
- maxLengthBytes
- (long) Maximum length in bytes to fetch. Fetch is truncated at this length. A value of 0 means no limit.
- sendRange
- (boolean) Send ‘Range’ header when a limit (#MAX_LENGTH_BYTES) on document size is in effect.
Be polite to the HTTP servers and send the ‘Range’ header, stating that you are only interested in the first n bytes. Only pertinent if #MAX_LENGTH_BYTES > 0. Sending the ‘Range’ header results in a ‘206 Partial Content’ status response, which is better than just cutting the response mid-download. On rare occasions, sending ‘Range’ will generate a ‘416 Request Range Not Satisfiable’ response.
- sendIfModifiedSince
- (boolean) Send ‘If-Modified-Since’ header, if previous ‘Last-Modified’ fetch history information is available in URI history.
- sendIfNoneMatch
- (boolean) Send ‘If-None-Match’ header, if previous ‘Etag’ fetch history information is available in URI history.
- shouldFetchBodyRule
- (DecideRule) DecideRules applied after receipt of HTTP response headers but before we start to download the body. If any filter returns FALSE, the fetch is aborted. Prerequisites such as robots.txt by-pass filtering (i.e. they cannot be midfetch aborted).
- sslTrustLevel
- (TrustLevel) SSL certificate trust level. Range is from the default ‘open’ (trust all certs including expired, self-signed, and those for which we do not have a CA) through ‘loose’ (trust all valid certificates including self-signed), ‘normal’ (all valid certificates not including self-signed) to ‘strict’ (cert is valid and the DN must match the server name).
- socksProxyHost
- (String) Sets a SOCKS5 proxy host to use. This will override any set HTTP proxy.
- socksProxyPort
- (Integer) Sets a SOCKS5 proxy port to use.
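As a concrete illustration, here is a minimal sketch of overriding a few of these properties; the proxy host and port are placeholder values, not defaults:
<bean id="fetchHttp" class="org.archive.modules.fetcher.FetchHTTP">
 <!-- route fetches through a local proxy (placeholder values) -->
 <property name="httpProxyHost" value="proxy.example.org" />
 <property name="httpProxyPort" value="3128" />
 <!-- truncate any document larger than 100 MiB -->
 <property name="maxLengthBytes" value="104857600" />
 <!-- give up on any fetch not completed within 20 minutes -->
 <property name="timeoutSeconds" value="1200" />
</bean>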
FetchSFTP¶
<bean id="fetchSFTP" class="org.archive.modules.fetcher.FetchSFTP">
<!-- <property name="username" value="anonymous" /> -->
<!-- <property name="password" value="password" /> -->
<!-- <property name="extractFromDirs" value="true" /> -->
<!-- <property name="extractParent" value="true" /> -->
<!-- <property name="digestContent" value="true" /> -->
<!-- <property name="digestAlgorithm" value="sha1" /> -->
<!-- <property name="maxLengthBytes" value="0" /> -->
<!-- <property name="maxFetchKBSec" value="0" /> -->
<!-- <property name="timeoutSeconds" value="" /> -->
<!-- <property name="soTimeoutMs" value="" /> -->
</bean>
- username
- (String) The username to send to SFTP servers. By convention, the default value of “anonymous” is used for publicly available SFTP sites.
- password
- (String) The password to send to SFTP servers. By convention, anonymous users send their email address in this field.
- extractFromDirs
- (boolean) Set to true to extract further URIs from SFTP directories. Default is true.
- extractParent
- (boolean) Set to true to extract the parent URI from all SFTP URIs. Default is true.
- digestContent
- (boolean) Whether or not to perform an on-the-fly digest hash of retrieved content-bodies.
- digestAlgorithm
- (String) Which algorithm (for example MD5 or SHA-1) to use to perform an on-the-fly digest hash of retrieved content-bodies.
- maxLengthBytes
- (long) Maximum length in bytes to fetch. Fetch is truncated at this length. A value of 0 means no limit.
- maxFetchKBSec
- (int) The maximum KB/sec to use when fetching data from a server. The default of 0 means no maximum.
- timeoutSeconds
- (int) If the fetch is not completed in this number of seconds, give up (and retry later).
- soTimeoutMs
- (int) If the socket is unresponsive for this number of milliseconds, give up. Set to zero for no timeout (not recommended: it could hang a thread on an unresponsive server). This timeout is used for timing out socket opens and each socket read. Make sure this value is < #TIMEOUT_SECONDS for optimal configuration: this ensures at least one retry read.
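For instance, a hedged sketch of a non-anonymous SFTP configuration (the credentials are placeholders):
<bean id="fetchSFTP" class="org.archive.modules.fetcher.FetchSFTP">
 <!-- placeholder credentials for a private SFTP host -->
 <property name="username" value="crawler" />
 <property name="password" value="example-password" />
 <!-- extract listings from directories, but do not queue parent links -->
 <property name="extractFromDirs" value="true" />
 <property name="extractParent" value="false" />
</bean>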
FetchWhois¶
WHOIS Fetcher (RFC 3912). If this fetcher is enabled, Heritrix will attempt
WHOIS lookups on the topmost assigned domain and the IP address of each URL.
WHOIS URIs
There is no pre-existing, canonical specification for WHOIS URIs. What
follows is the format that Heritrix uses, which we propose for general use.
Syntax in ABNF as used in RFC 3986 Uniform Resource Identifier (URI):
Generic Syntax:
whoisurl = “whois:” [ “//” host [ “:” port ] “/” ] whoisquery
whoisquery is a url-encoded string. In ABNF,
whoisquery = 1*pchar
where pchar is defined in RFC 3986. host and port are also as defined in RFC 3986.
To resolve a WHOIS URI which specifies host[:port], open a TCP connection to
the host at the specified port (default 43), send the query (whoisquery,
url-decoded) followed by CRLF, and read the response until the server closes
the connection. For more details see RFC 3912.
Resolution of a “serverless” WHOIS URI, which does not specify host[:port],
is implementation-dependent.
Serverless WHOIS URIs in Heritrix
For each non-WHOIS URI processed which has an authority, FetchWhois adds 1 or
2 serverless WHOIS URIs to the CrawlURI’s outlinks. These are
“whois:{ipAddress}” and, if the authority includes a hostname,
“whois:{topLevelDomain}”. See #addWhoisLinks(CrawlURI).
Heritrix resolves serverless WHOIS URIs by first querying an initial server,
then following referrals to other servers. In pseudocode:
if query is an IPv4 address
    resolve whois://#DEFAULT_IP_WHOIS_SERVER/whoisquery
else
    let domainSuffix = part of query after the last ‘.’ (or the whole query if no ‘.’), url-encoded
    resolve whois://#ULTRA_SUFFIX_WHOIS_SERVER/domainSuffix
while last response refers to another server, i.e. matches regex #WHOIS_SERVER_REGEX
    if we have a special query formatting rule for this whois server, apply it - see #specialQueryTemplates
    resolve whois://referralServer/whoisquery
See #deferOrFinishGeneric(CrawlURI, String).
<bean id="fetchWhois" class="org.archive.modules.fetcher.FetchWhois">
<!-- <property name="bdbModule" value="" /> -->
<!-- <property name="specialQueryTemplates" value="" /> -->
<!-- <property name="soTimeoutMs" value="" /> -->
<!-- <property name="serverCache" value="" /> -->
</bean>
- bdbModule
- (BdbModule)
- specialQueryTemplates
- (Map)
- soTimeoutMs
- (int) If the socket is unresponsive for this number of milliseconds, give up. Set to zero for no timeout (not recommended: it could hang a thread on an unresponsive server). This timeout is used for timing out socket opens and each socket read. Make sure this value is < #TIMEOUT_SECONDS for optimal configuration: this ensures at least one retry read.
- serverCache
- (ServerCache)
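The specialQueryTemplates map keys are WHOIS server hostnames and the values are query-formatting templates applied before the query is sent. A sketch of what such a map can look like (the server name and template shown are illustrative, not a definitive list):
<bean id="fetchWhois" class="org.archive.modules.fetcher.FetchWhois">
 <property name="specialQueryTemplates">
  <map>
   <!-- illustrative: ask this server specifically for the domain record -->
   <entry key="whois.verisign-grs.com" value="domain %s" />
  </map>
 </property>
</bean>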
Link Extractors¶
ExtractorChrome (contrib)¶
Extracts links using a web browser via the Chrome Devtools Protocol.
To use, first define this as a top-level bean (as shown below), then add
<ref bean="extractorChrome"/>
to the fetch chain before extractorHTML.
By default an instance of the browser will be run as a subprocess for the
duration of the crawl. Alternatively, set devtoolsUrl (e.g.
ws://127.0.0.1:1234/devtools/browser/2bc831e8-6c02-4c9b-affd-14c93b8579d7) to
connect to an existing instance of the browser (run with
--headless --remote-debugging-port=1234).
<bean id="extractorChrome" class="org.archive.modules.extractor.ExtractorChrome">
<!-- <property name="executable" value="null" /> -->
<!-- <property name="maxOpenWindows" value="16" /> -->
<!-- <property name="devtoolsUrl" value="null" /> -->
<!-- <property name="windowWidth" value="1366" /> -->
<!-- <property name="windowHeight" value="768" /> -->
<!-- <property name="loadTimeoutSeconds" value="30" /> -->
<!-- <property name="commandLineOptions" value="" /> -->
</bean>
- captureRequests
- (boolean)
- executable
- (String) The name or path to the browser executable. If null, common locations will be searched. Not used if devtoolsUrl is set.
- maxOpenWindows
- (int) The maximum number of browser windows that are allowed to be opened simultaneously. Feel free to increase this if you have lots of RAM available.
- devtoolsUrl
- (String) URL of the devtools server to connect. If null a new browser process will be launched.
- windowWidth
- (int) Width of the browser window.
- windowHeight
- (int) Height of the browser window.
- loadTimeoutSeconds
- (int) Number of seconds to wait for the page to load.
- commandLineOptions
- (List) Extra command-line options passed to the browser process. Not used if devtoolsUrl is set.
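A sketch of the fetch-chain placement described above; the ellipsis comments stand in for the usual chain contents, and only the relative order of the two extractor refs matters here:
<bean id="fetchProcessors" class="org.archive.modules.FetchChain">
 <property name="processors">
  <list>
   <!-- ...preprocessors, fetchers, and earlier extractors here... -->
   <ref bean="extractorChrome" />
   <ref bean="extractorHTML" />
   <!-- ...remaining extractors here... -->
  </list>
 </property>
</bean>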
ExtractorCSS¶
This extractor parses URIs from CSS-type files.
The format of a CSS URL value is ‘url(’ followed by optional white space
followed by an optional single quote (’) or double quote (”) character
followed by the URL itself followed by an optional single quote (’) or
double quote (”) character followed by optional white space followed by ‘)’.
Parentheses, commas, white space characters, single quotes (’) and double
quotes (”) appearing in a URL must be escaped with a backslash:
‘\(’, ‘\)’, ‘\,’. Partial URLs are interpreted relative to the source of
the style sheet, not relative to the document.
Source: www.w3.org
<bean id="extractorCSS" class="org.archive.modules.extractor.ExtractorCSS">
</bean>
ExtractorDOC¶
This class allows the caller to extract href-style links from Word97-format Word documents.
<bean id="extractorDOC" class="org.archive.modules.extractor.ExtractorDOC">
</bean>
ExtractorHTML¶
Basic link-extraction, from an HTML content-body,
using regular expressions.
<bean id="extractorHTML" class="org.archive.modules.extractor.ExtractorHTML">
<!-- <property name="maxElementLength" value="64" /> -->
<!-- <property name="maxAttributeNameLength" value="64" /> -->
<!-- <property name="maxAttributeValLength" value="2048" /> -->
<!-- <property name="treatFramesAsEmbedLinks" value="true" /> -->
<!-- <property name="ignoreFormActionUrls" value="false" /> -->
<!-- <property name="extractOnlyFormGets" value="true" /> -->
<!-- <property name="extractJavascript" value="true" /> -->
<!-- <property name="extractValueAttributes" value="true" /> -->
<!-- <property name="ignoreUnexpectedHtml" value="true" /> -->
<!-- <property name="metadata" value="" /> -->
<!-- <property name="extractorJS" value="" /> -->
</bean>
- maxElementLength
- (int)
- maxAttributeNameLength
- (int)
- maxAttributeValLength
- (int)
- treatFramesAsEmbedLinks
- (boolean) If true, FRAME/IFRAME SRC-links are treated as embedded resources (like IMG, ‘E’ hop-type), otherwise they are treated as navigational links. Default is true.
- ignoreFormActionUrls
- (boolean) If true, URIs appearing as the ACTION attribute in HTML FORMs are ignored. Default is false.
- extractOnlyFormGets
- (boolean) If true, only ACTION URIs with a METHOD of GET (explicit or implied) are extracted. Default is true.
- extractJavascript
- (boolean) If true, in-page Javascript is scanned for strings that appear likely to be URIs. This typically finds both valid and invalid URIs, and attempts to fetch the invalid URIs sometimes generate webmaster concerns over odd crawler behavior. Default is true.
- extractValueAttributes
- (boolean) If true, strings that look like URIs found in unusual places (such as form VALUE attributes) will be extracted. This typically finds both valid and invalid URIs, and attempts to fetch the invalid URIs sometimes generate webmaster concerns over odd crawler behavior. Default is true.
- ignoreUnexpectedHtml
- (boolean) If true, URIs which end in typical non-HTML extensions (such as .gif) will not be scanned as if they were HTML. Default is true.
- metadata
- (CrawlMetadata) CrawlMetadata provides the robots honoring policy to use when considering a robots META tag.
- extractorJS
- (ExtractorJS) Javascript extractor to use to process inline javascript. Autowired if available. If null, links will not be extracted from inline javascript.
AggressiveExtractorHTML¶
Extended version of ExtractorHTML with more aggressive javascript link extraction, where javascript code is parsed first with the general HTML tags regex, and then by the javascript speculative link regex.
<bean id="aggressiveExtractorHTML" class="org.archive.modules.extractor.AggressiveExtractorHTML">
</bean>
JerichoExtractorHTML¶
Improved link-extraction from an HTML content-body using the jericho-html parser. This extractor extends ExtractorHTML and mimics its workflow, but has some substantial differences when it comes to internal implementation. Instead of relying heavily upon Java regular expressions it uses a real HTML parser library, namely Jericho HTML Parser (http://jerichohtml.sourceforge.net). Using this parser it can better handle broken html (i.e. missing quotes) and also offers improved extraction of HTML form URLs (not only extracting the action of a form, but also its default values). Unfortunately this parser also has one major drawback: it has to read the whole document into memory for parsing, and thus has an inherent OOME risk. This OOME risk can be reduced/eliminated by limiting the size of documents to be parsed (i.e. using NotExceedsDocumentLengthTresholdDecideRule). Also note that this extractor seems to have a lower overall memory consumption compared to ExtractorHTML. (Still to be confirmed on a larger-scale crawl.)
<bean id="jerichoExtractorHTML" class="org.archive.modules.extractor.JerichoExtractorHTML">
</bean>
ExtractorHTMLForms¶
Extracts extra information about FORMs in HTML, loading this
into the CrawlURI (for potential later use by FormLoginProcessor)
and adding a small annotation to the crawl.log. Must come after ExtractorHTML, as it relies on information left
in the CrawlURI’s A_FORM_OFFSETS data key. By default (with ‘extractAllForms’ equal false), only
saves-to-CrawlURI and annotates forms that appear to be login
forms, by the test HTMLForm.seemsLoginForm().
Typical CXML configuration would be, first, as top-level named beans:
<bean id="extractorHTMLForms" class="org.archive.modules.forms.ExtractorHTMLForms">
 <property name="extractAllForms" value="false" />
</bean>
<bean id="formLoginProcessor" class="org.archive.modules.forms.FormLoginProcessor">
 <!-- generally these are overlaid with sheets rather than set directly -->
 <property name="applicableSurtPrefix" value="" />
 <property name="loginUsername" value="" />
 <property name="loginPassword" value="" />
</bean>
Then, inside the fetch chain, after all other extractors:
<!-- ...ALL USUAL PREPROCESSORS/FETCHERS/EXTRACTORS HERE, THEN... -->
<ref bean="extractorHTMLForms"/>
<bean id="extractorHTMLForms" class="org.archive.modules.forms.ExtractorHTMLForms">
<!-- <property name="extractAllForms" value="false" /> -->
</bean>
- extractAllForms
- (boolean) If true, report all FORMs. If false, report only those that appear to be a login-enabling FORM. Default is false.
ExtractorHTTP¶
Extracts URIs from HTTP response headers.
<bean id="extractorHTTP" class="org.archive.modules.extractor.ExtractorHTTP">
<!-- <property name="inferRootPage" value="false" /> -->
</bean>
- inferRootPage
- (boolean) should all HTTP URIs be used to infer a link to the site’s root?
ExtractorImpliedURI¶
An extractor for finding ‘implied’ URIs inside other URIs. If the
‘trigger’ regex is matched, a new URI will be constructed from the
‘build’ replacement pattern. Unlike most other extractors, this works on URIs discovered by
previous extractors. Thus it should appear near the end of any
set of extractors. Initially, only finds absolute HTTP(S) URIs in query-string or its
parameters.
<bean id="extractorImpliedURI" class="org.archive.modules.extractor.ExtractorImpliedURI">
<!-- <property name="regex" value="" /> -->
<!-- <property name="format" value="" /> -->
<!-- <property name="removeTriggerUris" value="false" /> -->
</bean>
- regex
- (Pattern) Triggering regular expression. When a discovered URI matches this pattern, the ‘implied’ URI will be built. The capturing groups of this expression are available for the build replacement pattern.
- format
- (String) Replacement pattern to build ‘implied’ URI, using captured groups of trigger expression.
- removeTriggerUris
- (boolean) If true, all URIs that match trigger regular expression are removed from the list of extracted URIs. Default is false.
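As a hypothetical example (the site and URI layout are invented for illustration, and Java-style $n group references in the build pattern are assumed), the following builds a direct image URI whenever a matching viewer URI is discovered:
<bean id="extractorImpliedURI" class="org.archive.modules.extractor.ExtractorImpliedURI">
 <!-- hypothetical: http://example.com/viewer?img=1234 implies the image below -->
 <property name="regex" value="^https?://example\.com/viewer\?img=(\d+)$" />
 <property name="format" value="http://example.com/images/$1.jpg" />
</bean>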
ExtractorJS¶
Processes Javascript files for strings that are likely to be
crawlable URIs.
<bean id="extractorJS" class="org.archive.modules.extractor.ExtractorJS">
</bean>
KnowledgableExtractorJS (contrib)¶
A subclass of ExtractorJS that has some customized behavior for specific kinds of web pages. As of April 2015, the one special behavior it has is for drupal generated pages. See https://webarchive.jira.com/browse/ARI-4190
<bean id="knowledgableExtractorJS" class="org.archive.modules.extractor.KnowledgableExtractorJS">
</bean>
ExtractorMultipleRegex¶
An extractor that uses regular expressions to find strings in the fetched
content of a URI, and constructs outlink URIs from those strings.
The crawl operator configures these parameters:
- uriRegex: a regular expression to match against the url
- contentRegexes: a map of named regular expressions { name => regex } to run against the content
- template: the template for constructing the outlinks
The URI is checked against uriRegex. The match is done using
Matcher#matches(), so the full URI string must match, not just a
substring. If it does match, then the matching groups are available to the
URI-building template as ${uriRegex[n]}. If it does not match,
processing of the URI is finished and no outlinks are extracted.
Then the extractor looks for matches for each of the contentRegexes
in the fetched content. If any of the regular
expressions produce no matches, processing of the URI is finished and no
outlinks are extracted. If at least one match is found for each regular
expression, then an outlink is constructed, using the URI-building template,
for every combination of matches. The matching groups are available to the
template as ${name[n]}.
Outlinks are constructed using the URI-building template.
Variable interpolation using the familiar ${…} syntax is supported. The
template is evaluated for each combination of regular expression matches
found, and the matching groups are available to the template as
${regexName[n]}. An example template might look like:
http://example.org/${uriRegex[1]}/foo?bar=${myContentRegex[0]}
The template is evaluated as a Groovy Template, so further capabilities
beyond simple variable interpolation are available.
<bean id="extractorMultipleRegex" class="org.archive.modules.extractor.ExtractorMultipleRegex">
<!-- <property name="uriRegex" value="" /> -->
<!-- <property name="contentRegexes" value="" /> -->
<!-- <property name="template" value="" /> -->
</bean>
- uriRegex
- (String) Regular expression against which to match the URI. If the URI matches, then the matching groups are available to the URI-building template as ${uriRegex[n]}. If it does not match, processing of this URI is finished and no outlinks are extracted.
- contentRegexes
- (Map) A map of { name => regex }. The extractor looks for matches for each regular expression in the content of the URI being processed. If any of the regular expressions produce no matches, processing of the URI is finished and no outlinks are extracted. If at least one match is found for each regular expression, then an outlink is constructed for every combination of matches. The matching groups are available to the URI-building template as ${name[n]}.
- template
- (String) URI-building template. Provides variable interpolation using the familiar ${…} syntax. The template is evaluated for each combination of regular expression matches found, and the matching groups are available to the template as ${regexName[n]}. An example template might look like: http://example.org/${uriRegex[1]}/foo?bar=${myContentRegex[0]}. The template is evaluated as a Groovy Template, so further capabilities beyond simple variable interpolation are available.
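A hypothetical configuration tying the three parameters together (the site, regexes, and the photoId name are invented for illustration): for a page like http://example.com/gallery/42 whose body contains photo_id=7, the template below would construct the outlink http://example.com/gallery/42/photo/7, and one outlink per distinct photo_id match:
<bean id="extractorMultipleRegex" class="org.archive.modules.extractor.ExtractorMultipleRegex">
 <!-- hypothetical: applies only to gallery pages -->
 <property name="uriRegex" value="^https?://example\.com/gallery/(\d+)$" />
 <property name="contentRegexes">
  <map>
   <!-- find photo ids in the page body -->
   <entry key="photoId" value="photo_id=(\d+)" />
  </map>
 </property>
 <!-- group 1 of uriRegex and group 1 of photoId feed the template -->
 <property name="template" value="http://example.com/gallery/${uriRegex[1]}/photo/${photoId[1]}" />
</bean>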
ExtractorPDF¶
Allows the caller to process a CrawlURI representing a PDF for the purpose of extracting URIs
<bean id="extractorPDF" class="org.archive.modules.extractor.ExtractorPDF">
<!-- <property name="maxSizeToParse" value="" /> -->
</bean>
- maxSizeToParse
- (long) The maximum size of PDF files to consider. PDFs larger than this maximum will not be searched for links.
ExtractorPDFContent (contrib)¶
PDF Content Extractor. This will parse the text content of a PDF and apply a
regex to search for links within the body of the text. Requires itextpdf jar: http://repo1.maven.org/maven2/com/itextpdf/itextpdf/5.5.0/itextpdf-5.5.0.jar
<bean id="extractorPDFContent" class="org.archive.modules.extractor.ExtractorPDFContent">
<!-- <property name="maxSizeToParse" value="" /> -->
</bean>
- maxSizeToParse
- (long) The maximum size of PDF files to consider. PDFs larger than this maximum will not be searched for links.
ExtractorRobotsTxt¶
<bean id="extractorRobotsTxt" class="org.archive.modules.extractor.ExtractorRobotsTxt">
</bean>
ExtractorSitemap¶
<bean id="extractorSitemap" class="org.archive.modules.extractor.ExtractorSitemap">
<!-- <property name="urlPattern" value="null" /> -->
<!-- <property name="enableLenientExtraction" value="false" /> -->
</bean>
- urlPattern
- (String) If urlPattern is not null then any url marked as a sitemap and matching the pattern is assumed to be a sitemap. Otherwise the mime-type is checked (must be “text/xml” or “application/xml”) and the file is “sniffed” for the expected start of a sitemap file.
- enableLenientExtraction
- (boolean) If true, all urls in the sitemap file are extracted, regardless of whether or not they obey the scoping rules specified in the sitemap protocol (https://www.sitemaps.org/protocol.html).
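For example, a sketch that accepts URLs marked as sitemaps whenever they match a pattern, skipping the mime-type check and content sniffing (the pattern is illustrative):
<bean id="extractorSitemap" class="org.archive.modules.extractor.ExtractorSitemap">
 <!-- illustrative pattern; null (the default) falls back to mime-type checks -->
 <property name="urlPattern" value=".*sitemap.*\.xml$" />
</bean>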
ExtractorSWF¶
Extracts URIs from SWF (flash/shockwave) files. To test, here is a link to an swf that has links
embedded inside of it: http://www.hitspring.com/index.swf.
<bean id="extractorSWF" class="org.archive.modules.extractor.ExtractorSWF">
<!-- <property name="extractorJS" value="" /> -->
</bean>
- extractorJS
- (ExtractorJS) Javascript extractor to use to process inline javascript. Autowired if available. If null, links will not be extracted from inline javascript.
ExtractorUniversal¶
A last ditch extractor that will look at the raw byte code and try to extract
anything that looks like a link. If used, it should always be specified as the last link extractor in the
order file.
To accomplish this it will scan through the byte code and try to build up
strings of consecutive bytes that all represent characters that are valid
in a URL (see #isURLableChar(int) for details).
Once it hits the end of such a string (i.e. finds a character that
should not be in a URL) it will try to determine if it has found a URL.
This is done by seeing if the string is an IP address prefixed with
http(s):// or contains a dot followed by a Top Level Domain and end of
string or a slash.
<bean id="extractorUniversal" class="org.archive.modules.extractor.ExtractorUniversal">
<!-- <property name="maxSizeToParse" value="" /> -->
</bean>
- maxSizeToParse
- (long) How deep to look into files for URI strings, in bytes.
ExtractorURI¶
An extractor for finding URIs inside other URIs. Unlike most other
extractors, this works on URIs discovered by previous extractors. Thus
it should appear near the end of any set of extractors. Initially, only finds absolute HTTP(S) URIs in query-string or its
parameters.
<bean id="extractorURI" class="org.archive.modules.extractor.ExtractorURI">
</bean>
ExtractorXML¶
A simple extractor which finds HTTP URIs inside XML/RSS files,
inside attribute values and simple elements (those with only
whitespace + HTTP URI + whitespace as contents).
<bean id="extractorXML" class="org.archive.modules.extractor.ExtractorXML">
</bean>
ExtractorYoutubeDL (contrib)¶
Extracts links to media by running youtube-dl in a subprocess. Runs only on
html.
Also implements WARCRecordBuilder to write youtube-dl json to the
warc.
To use ExtractorYoutubeDL, add this top-level bean:
<bean id="extractorYoutubeDL" class="org.archive.modules.extractor.ExtractorYoutubeDL"/>
Then add <ref bean="extractorYoutubeDL"/> to the end of the
fetch chain, and to the end of the warc writer chain.
Keeps a log of containing pages and media captured as a result of youtube-dl
extraction. The format of the log is as follows:
[timestamp] [media-http-status] [media-length] [media-mimetype] [media-digest] [media-timestamp] [media-url] [annotation] [containing-page-digest] [containing-page-timestamp] [containing-page-url] [seed-url]
For containing pages, all of the media-* fields have the value “-”, and the
annotation field looks like “youtube-dl:3”, meaning that ExtractorYoutubeDL
extracted 3 media links from the page.
For media, the annotation field looks like “youtube-dl:1/3”, meaning
this is the first of three media links extracted from the containing page.
The intention is to use this for playback. The rest of the fields included in
the log were also chosen to support creation of an index of media by
containing page, to be used for playback.
<bean id="extractorYoutubeDL" class="org.archive.modules.extractor.ExtractorYoutubeDL">
<!-- <property name="crawlerLoggerModule" value="" /> -->
</bean>
- crawlerLoggerModule
- (CrawlerLoggerModule)
ExtractorYoutubeFormatStream (contrib)¶
Youtube stream URI extractor. This will check the content of the youtube
watch page looking for the url_encoded_fmt_stream_map json value. The json
object is decoded and the stream URIs are constructed and queued.
Enable via sheet for http://(com,youtube,
Itag reference:
- 38 MP4 3072p (Discontinued)
- 37 MP4 1080p (Discontinued)
- 22 MP4 720p
- 18 MP4 270p/360p
- 35 FLV 480p (Discontinued)
- 34 FLV 360p (Discontinued)
- 36 3GP 240p
- 5 FLV 240p
- 17 3GP 144p
<bean id="extractorYoutubeFormatStream" class="org.archive.modules.extractor.ExtractorYoutubeFormatStream">
<!-- <property name="extractLimit" value="1" /> -->
<!-- <property name="itagPriority" value="" /> -->
</bean>
- extractLimit
- (Integer) Maximum number of video urls to extract. A value of 0 means extract all discovered video urls. Default is 1.
- itagPriority
- (List) Itag priority list. Youtube itag parameter specifies the video and audio format and quality. The default is an empty list, which tells the extractor to extract up to extractLimit video urls. When the list is not empty, only video urls with itag values in the list are extracted.
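Tying extractLimit and itagPriority together: with the sketch below (itag values taken from the reference list above), the extractor would queue at most one stream URL, preferring 720p MP4 (itag 22) and falling back to 270p/360p MP4 (itag 18):
<bean id="extractorYoutubeFormatStream" class="org.archive.modules.extractor.ExtractorYoutubeFormatStream">
 <property name="extractLimit" value="1" />
 <property name="itagPriority">
  <list>
   <value>22</value>
   <value>18</value>
  </list>
 </property>
</bean>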
ExtractorYoutubeChannelFormatStream (contrib)¶
<bean id="extractorYoutubeChannelFormatStream" class="org.archive.modules.extractor.ExtractorYoutubeChannelFormatStream">
</bean>
TrapSuppressExtractor¶
Pseudo-extractor that suppresses link-extraction of likely trap pages, by noticing when content’s digest is identical to that of its ‘via’.
<bean id="trapSuppressExtractor" class="org.archive.modules.extractor.TrapSuppressExtractor">
</bean>
Post-Processors¶
CandidatesProcessor¶
Processor which sends all candidate outlinks through the CandidateChain, scheduling those with non-negative status codes to the frontier. Also performs special handling for ‘discovered seeds’ – URIs, as with redirects from seeds, that may deserve special treatment to expand the scope.
<bean id="candidatesProcessor" class="org.archive.crawler.postprocessor.CandidatesProcessor">
<!-- <property name="candidateChain" value="" /> -->
<!-- <property name="frontier" value="" /> -->
<!-- <property name="loggerModule" value="" /> -->
<!-- <property name="seedsRedirectNewSeeds" value="true" /> -->
<!-- <property name="seedsRedirectNewSeedsAllowTLDs" value="true" /> -->
<!-- <property name="processErrorOutlinks" value="false" /> -->
<!-- <property name="seeds" value="" /> -->
<!-- <property name="sheetOverlaysManager" value="" /> -->
</bean>
- candidateChain
- (CandidateChain) Candidate chain
- frontier
- (Frontier) The frontier to use.
- loggerModule
- (CrawlerLoggerModule)
- seedsRedirectNewSeeds
- (boolean) If enabled, any URL found because a seed redirected to it (original seed returned 301 or 302), will also be treated as a seed, as long as the hop count is less than {@value #SEEDS_REDIRECT_NEW_SEEDS_MAX_HOPS}.
- seedsRedirectNewSeedsAllowTLDs
- (boolean) If enabled, any URL found because a seed redirected to it (original seed returned 301 or 302), will also be treated as a seed, as long as the hop count is less than {@value #SEEDS_REDIRECT_NEW_SEEDS_MAX_HOPS}.
- processErrorOutlinks
- (boolean) If true, outlinks from status codes <200 and >=400 will be sent through candidates processing. Default is false.
- seeds
- (SeedModule)
- sheetOverlaysManager
- (SheetOverlaysManager)
DispositionProcessor¶
A step, late in the processing of a CrawlURI, for marking-up the
CrawlURI with values to affect frontier disposition, and updating
information that may have been affected by the fetch. This includes
robots info and other stats. (Formerly called CrawlStateUpdater, when it did less.)
<bean id="dispositionProcessor" class="org.archive.crawler.postprocessor.DispositionProcessor">
<!-- <property name="serverCache" value="" /> -->
<!-- <property name="delayFactor" value="5.0f" /> -->
<!-- <property name="minDelayMs" value="3000" /> -->
<!-- <property name="respectCrawlDelayUpToSeconds" value="300" /> -->
<!-- <property name="maxDelayMs" value="30000" /> -->
<!-- <property name="maxPerHostBandwidthUsageKbSec" value="0" /> -->
<!-- <property name="forceRetire" value="false" /> -->
<!-- <property name="metadata" value="" /> -->
</bean>
- serverCache
- (ServerCache)
- delayFactor
- (float) How many multiples of last fetch elapsed time to wait before recontacting same server.
- minDelayMs
- (int) always wait this long after one completion before recontacting same server, regardless of multiple
- respectCrawlDelayUpToSeconds
- (int) Whether to respect a ‘Crawl-Delay’ (in seconds) given in a site’s robots.txt
- maxDelayMs
- (int) never wait more than this long, regardless of multiple
- maxPerHostBandwidthUsageKbSec
- (int) maximum per-host bandwidth usage
- forceRetire
- (boolean) Whether to set a CrawlURI’s force-retired directive, retiring its queue when it finishes. Mainly intended for URI-specific overlay settings; setting true globally will just retire all queues after they offer one URI, rapidly ending a crawl.
- metadata
- (CrawlMetadata) Auto-discovered module providing configured (or overridden) User-Agent value and RobotsHonoringPolicy
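To make the delay arithmetic concrete: with the defaults shown above, if the last fetch from a host took 800 ms, the next request to that host waits delayFactor × 800 ms = 4000 ms, clamped between minDelayMs (3000 ms) and maxDelayMs (30000 ms); a robots.txt ‘Crawl-Delay’ is respected up to respectCrawlDelayUpToSeconds. A sketch of a more conservative politeness configuration:
<bean id="disposition" class="org.archive.crawler.postprocessor.DispositionProcessor">
 <!-- wait 10x the last fetch duration, but between 5 s and 60 s -->
 <property name="delayFactor" value="10.0" />
 <property name="minDelayMs" value="5000" />
 <property name="maxDelayMs" value="60000" />
</bean>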
ReschedulingProcessor¶
The simplest forced-rescheduling step possible: use a local setting (perhaps overlaid to vary based on the URI) to set an exact future reschedule time, as a delay from now. Unless the rescheduleDelaySeconds value is changed from its default, URIs are not rescheduled.
<bean id="reschedulingProcessor" class="org.archive.crawler.postprocessor.ReschedulingProcessor">
<!-- <property name="rescheduleDelaySeconds" value="1" /> -->
</bean>
- rescheduleDelaySeconds
- (long) amount of time to wait before forcing a URI to be rescheduled; the default of -1 means “don’t reschedule”
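For example, a sketch that forces every URI it processes to be re-fetched roughly once a day; in practice this would usually be overlaid via a sheet for selected URIs rather than set globally:
<bean id="reschedulingProcessor" class="org.archive.crawler.postprocessor.ReschedulingProcessor">
 <!-- 86400 seconds = 24 hours between forced revisits -->
 <property name="rescheduleDelaySeconds" value="86400" />
</bean>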
WARCWriterChainProcessor¶
WARC writer processor. The types of records to be written can be
configured by including or excluding WARCRecordBuilder
implementations (see #setChain(List)). This is the default chain:
<property name="chain">
 <list>
  <bean class="org.archive.modules.warc.DnsResponseRecordBuilder"/>
  <bean class="org.archive.modules.warc.HttpResponseRecordBuilder"/>
  <bean class="org.archive.modules.warc.WhoisResponseRecordBuilder"/>
  <bean class="org.archive.modules.warc.FtpControlConversationRecordBuilder"/>
  <bean class="org.archive.modules.warc.FtpResponseRecordBuilder"/>
  <bean class="org.archive.modules.warc.RevisitRecordBuilder"/>
  <bean class="org.archive.modules.warc.HttpRequestRecordBuilder"/>
  <bean class="org.archive.modules.warc.MetadataRecordBuilder"/>
 </list>
</property>
Replaces WARCWriterProcessor.
<bean id="wARCWriterChainProcessor" class="org.archive.modules.writer.WARCWriterChainProcessor">
<!-- <property name="chain" value="" /> -->
</bean>
- chain
- (List)
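For example, a sketch of a trimmed chain that writes only DNS records, HTTP responses, and revisit records, omitting the request, metadata, FTP, and WHOIS builders from the default chain above:
<bean id="wARCWriterChainProcessor" class="org.archive.modules.writer.WARCWriterChainProcessor">
 <property name="chain">
  <list>
   <bean class="org.archive.modules.warc.DnsResponseRecordBuilder" />
   <bean class="org.archive.modules.warc.HttpResponseRecordBuilder" />
   <bean class="org.archive.modules.warc.RevisitRecordBuilder" />
  </list>
 </property>
</bean>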