Configuring Crawl Jobs
======================
Basic Job Settings
------------------
Crawl settings are configured by editing the job's primary configuration file, usually ``crawler-beans.cxml`` or
``crawler-beans.groovy``.
Profiles
~~~~~~~~
A profile is a non-launchable job template. To create one from the web interface, open an existing job, choose
``Copy Job``, enter the new profile name, check ``as profile``, and submit the form. Heritrix treats jobs
whose primary configuration filename starts with ``profile-`` as profiles. They can be built for validation, but not
launched directly.
Profiles appear in the create job profile selector by their job name. Choosing one creates a new launchable job
by copying the profile and removing the ``profile-`` prefix from the primary configuration filename.
The built-in defaults can be overridden by creating a profile with the same name as the built-in profile.
Crawl Limits
~~~~~~~~~~~~
In addition to limits imposed on the scope of the crawl it is possible to enforce arbitrary limits on the duration
and extent of the crawl with the following settings:
maxBytesDownload
Stop the crawl after a fixed number of bytes have been downloaded. Zero means unlimited.
maxDocumentDownload
Stop the crawl after downloading a fixed number of documents. Zero means unlimited.
maxTimeSeconds
Stop the crawl after a certain number of seconds have elapsed. Zero means unlimited. For reference there are 3600
seconds in an hour and 86400 seconds in a day.
To set these values modify the CrawlLimitEnforcer bean.
.. tab:: XML
.. code-block:: xml
.. tab:: Groovy
.. code-block:: groovy
crawlLimitEnforcer(CrawlLimitEnforcer) {
maxBytesDownload = 100000000
maxDocumentsDownload = 100
maxTimeSeconds = 10000
}
.. note::
These are not hard limits. Once one of these limits is hit it will trigger a graceful termination of the crawl job.
URIs already being crawled will be completed. As a result the set limit will be exceeded by some amount.
maxToeThreads
~~~~~~~~~~~~~
The maximum number of toe threads to run.
If running a domain crawl smaller than 100 hosts, a value approximately twice the number of hosts should be enough.
Values larger then 150-200 are rarely worthwhile unless running on machines with exceptional resources.
.. tab:: XML
.. code-block:: xml
.. tab:: Groovy
.. code-block:: groovy
crawlController(CrawlController) {
maxToeThreads = 50
}
metadata.operatorContactUrl
~~~~~~~~~~~~~~~~~~~~~~~~~~~
The URI of the crawl initiator. This setting gives the administrator of a crawled host a URI to refer to in case of
problems.
.. tab:: XML
.. code-block:: xml
metadata.operatorContactUrl=http://www.archive.org
metadata.jobName=basic
metadata.description=Basic crawl starting with useful defaults
.. tab:: Groovy
.. code-block:: groovy
simpleOverrides(PropertyOverrideConfigurer) {
properties = '''
metadata.operatorContactUrl=http://www.archive.org
metadata.jobName=basic
metadata.description=Basic crawl starting with useful defaults
'''
}
Robots.txt Honoring Policy
~~~~~~~~~~~~~~~~~~~~~~~~~~
The valid values of "robotsPolicyName" are:
obey
Obey robots.txt directives and nofollow robots meta tags
classic
Same as "obey"
robotsTxtOnly
Obey robots.txt directives but ignore robots meta tags
ignore
Ignore robots.txt directives and robots meta tags
.. tab:: XML
.. code-block:: xml
...
...
.. tab:: Groovy
.. code-block:: groovy
metadata(CrawlMetadata) { bean ->
bean.autowire = 'byName'
// ...
robotsPolicyName = 'obey'
// ...
}
.. note::
Heritrix supports RFC 9309 path wildcards (*, $) in robots.txt rules.
The only supported value for robots meta tags is "nofollow" which will cause the HTML extractor to stop processing
and ignore all links (including embeds like images and stylesheets).
.. code-block:: html
Obeying "rel=nofollow" on individual links is configured separately as ``obeyRelNoFollow`` on ``ExtractorHTML``.
Crawl Scope
-----------
The crawl scope defines the set of possible URIs that can be captured by a crawl. These URIs are determined by
DecideRules, which work in combination to limit or expand the set of crawled URIs. Each DecideRule, when presented
with an object (most often a URI of some form) responds with one of three decisions:
ACCEPT
the object is ruled in
REJECT
the object is ruled out
PASS
the rule has no opinion; retain the previous decision
A URI under consideration begins with no assumed status. Each rule is applied in turn to the candidate URI. If the
rule decides ACCEPT or REJECT, the URI's status is set accordingly. After all rules have been applied, the URI is
determined to be "in scope" if its status is ACCEPT. If its status is REJECT it is discarded.
We suggest starting with the rules in our recommended default configurations and performing small test crawls with
those rules. Understand why certain URIs are ruled in or ruled out under those rules. Then make small individual
changes to the scope to achieve non-default desired effects. Creating a new ruleset from scratch can be difficult and
can easily result in crawls that can't make the usual minimal progress that other parts of the crawler expect.
Similarly, making many changes at once can obscure the importance of the interplay and ordering of the rules.
Decide Rules
~~~~~~~~~~~~
:deciderule:`AcceptDecideRule`
This DecideRule accepts any URI.
:deciderule:`ContentLengthDecideRule`
This DecideRule accepts a URI if the content-length is less than the threshold. The default threshold is 2^63,
meaning any document will be accepted.
:deciderule:`PathologicalPathDecideRule`
This DecideRule rejects any URI that contains an excessive number of identical, consecutive path-segments. For
example, ``http://example.com/a/a/a/a/a/foo.html``.
:deciderule:`PredicatedDecideRule`
This DecideRule applies a configured decision only if a test evaluates to true.
:deciderule:`ExternalGeoLocationDecideRule`
This DecideRule accepts a URI if it is located in a particular country.
:deciderule:`FetchStatusDecideRule`
This DecideRule applies the configured decision to any URI that has a fetch staus equal to the "target-status" setting.
:deciderule:`HasViaDecideRule`
This DecideRule applies the configured decision to any URI that has a "via." A via is any URI that is a seed or some kind of mid-crawl addition.
:deciderule:`HopCrossesAssignmentLevelDomainDecideRule`
This DecideRule applies the configured decision to any URI that differs in the portion of its hostname/domain that is assigned/sold by registrars. The portion is referred to as the "assignment-level-domain" (ALD).
:deciderule:`IdenticalDigestDecideRule`
This DecideRule applies the configured decision to any URI whose prior-history content-digest matches the latest fetch.
:deciderule:`MatchesListRegexDecideRule`
This DecideRule applies the configured decision to any URI that matches the supplied regular expressions.
:deciderule:`NotMatchesListRegexDecideRule`
This DecideRule applies the configured decision to any URI that does not match the supplied regular expressions.
:deciderule:`MatchesRegexDecideRule`
This DecideRule applies the configured decision to any URI that matches the supplied regular expression.
:deciderule:`ClassKeyMatchesRegexDecideRule`
This DecideRule applies the configured decision to any URI class key that matches the supplied regular expression. A URI class key is a string that specifies the name of the Frontier queue into which a URI should be placed.
:deciderule:`ContentTypeMatchesRegexDecideRule`
This DecideRule applies the configured decision to any URI whose content-type is present and matches the supplied regular expression. The regular expression must match the full content-type sequence. Ex.: ``s/application/javascript;charset=UTF-8/^application\/javascript.*$/g``; ``s/text/html/^.*\/html.*$/g``
:deciderule:`ContentTypeNotMatchesRegexDecideRule`
This DecideRule applies the configured decision to any URI whose content-type does not match the supplied regular expression.
:deciderule:`FetchStatusMatchesRegexDecideRule`
This DecideRule applies the configured decision to any URI that has a fetch status that matches the supplied regular expression.
:deciderule:`FetchStatusNotMatchesRegexDecideRule`
This DecideRule applies the configured decision to any URI that has a fetch status that does not match the suppllied regular expression.
:deciderule:`HopsPathMatchesRegexDecideRule`
This DecideRule applies the configured decision to any URI whose "hops-path" matches the supplied regular expression. The hops-path is a string that consists of characters representing the path that was taken to access the URI. An example of a hops-path is "LLXE".
:deciderule:`MatchesFilePatternDecideRule`
This DecideRule applies the configured decision to any URI whose suffix matches the supplied regular expression.
:deciderule:`NotMatchesFilePatternDecideRule`
This DecideRule applies the configured decision to any URI whose suffix does not match the supplied regular expression.
:deciderule:`NotMatchesRegexDecideRule`
This DecideRule applies the configured decision to any URI that does not match the supplied regular expression.
:deciderule:`NotExceedsDocumentLengthThresholdDecideRule`
This DecideRule applies the configured decision to any URI whose content-length does not exceed the configured threshold. The content-length comes from either the HTTP header or the actual downloaded content length of the URI. As of Heritrix 3.1, this rule has been renamed to ResourceNoLongerThanDecideRule.
:deciderule:`ExceedsDocumentLengthThresholdDecideRule`
This DecideRule applies the configured decision to any URI whose content length exceeds the configured threshold. The content-length comes from either the HTTP header or the actual downloaded content length of the URI. As of Heritrix 3.1, this rule has been renamed to ResourceLongerThanDecideRule.
:deciderule:`SurtPrefixedDecideRule`
This DecideRule applies the configured decision to any URI (expressed in SURT form) that begins with one of the
prefixes in the configured set. This DecideRule returns true when the prefix of a given URI matches any of the
listed SURTs. The list of SURTs may be configured in different ways: the surtsSourceFile parameter specifies a file
to read the SURTs list from. If seedsAsSurtPrefixes parameter is set to true, SurtPrefixedDecideRule adds all seeds
to the SURTs list. If alsoCheckVia property is set to true (default false), SurtPrefixedDecideRule will also
consider Via URIs in the match.
As of Heritrix 3.1, the "surtsSource" parameter may be any ReadSource, such as a ConfigFile or a ConfigString.
This gives the SurtPrefixedDecideRule the flexibility of the TextSeedModule bean's "textSource" property.
:deciderule:`NotSurtPrefixedDecideRule`
This DecideRule applies the configured decision to any URI (expressed in SURT form) that does not begin with one of the prefixes in the configured set.
:deciderule:`OnDomainsDecideRule`
This DecideRule applies the configured decision to any URI that is in one of the domains of the configured set.
:deciderule:`NotOnDomainsDecideRule`
This DecideRule applies the configured decision to any URI that is not in one of the domains of the configured set.
:deciderule:`OnHostsDecideRule`
This DecideRule applies the configured decision to any URI that is in one of the hosts of the configured set.
:deciderule:`NotOnHostsDecideRule`
This DecideRule applies the configured decision to any URI that is not in one of the hosts of the configured set.
:deciderule:`ScopePlusOneDecideRule`
This DecideRule applies the configured decision to any URI that is one level beyond the configured scope.
:deciderule:`TooManyHopsDecideRule`
This DecideRule rejects any URI whose total number of hops is over the configured threshold.
:deciderule:`TooManyPathSegmentsDecideRule`
This DecideRule rejects any URI whose total number of path-segments is over the configured threshold. A
path-segment is a string in the URI separated by a "/" character, not including the first "//".
:deciderule:`TransclusionDecideRule`
This DecideRule accepts any URI whose path-from-seed ends in at least one non-navlink hop. A navlink hop is
represented by an "L". Also, the number of non-navlink hops in the path-from-seed cannot exceed the configured
value.
:deciderule:`PrerequisiteAcceptDecideRule`
This DecideRule accepts all "prerequisite" URIs. Prerequisite URIs are those whose hops-path has a "P" in the
last position.
:deciderule:`RejectDecideRule`
This DecideRule rejects any URI.
:deciderule:`ScriptedDecideRule`
This DecideRule applies the configured decision to any URI that passes the rules test of a JSR-223 script. The
script source must be a one-argument function called decisionFor." The function returns the appropriate
DecideResult. Variables available to the script include object (the object to be evaluated, such as a URI),
"self" (the ScriptDecideRule instance), and context (the crawl's ApplicationContext, from which all named crawl
beans are reachable).
:deciderule:`SeedAcceptDecideRule`
This DecideRule accepts all "seed" URIs (those for which isSeed is true).
DecideRuleSequence Logging
~~~~~~~~~~~~~~~~~~~~~~~~~~
Enable ``FINEST`` logging on the class ``org.archive.crawler.deciderules.DecideRuleSequence`` to watch each
DecideRule's evaluation of the processed URI. This can be done in the ``logging.properties`` file:
.. code-block:: bash
org.archive.modules.deciderules.DecideRuleSequence.level = FINEST
in conjunction with the ``-Dsysprop`` VM argument,
.. code-block::
-Djava.util.logging.config.file=/path/to/heritrix3/dist/src/main/conf/logging.properties
Frontier
--------
Politeness
~~~~~~~~~~
A combination of several settings control the politeness of the Frontier. It is important to note that at any given
time only one URI from any given host is processed. The following politeness rules impose additional wait time
between the end of processing one URI and the start of the next one.
delayFactor
This setting imposes a delay between the fetching of URIs from the same host. The delay is a multiple of the
amount of time it took to fetch the last URI downloaded from the host. For example, if it took 800 milliseconds to
fetch the last URI from a host and the ``delayFactor`` is 5 (a very high value), then the Frontier will wait 4000
milliseconds (4 seconds) before allowing another URI from that host to be processed.
maxDelayMs
This setting imposes a maximum upper limit on the wait time created by the ``delayFactor``. If set to 1000
milliseconds, then the maximum delay between URI fetches from the same host will never exceed this value.
minDelayMs
This setting imposes a minimum limit on politeness. It takes precedence over the value calculated by the
``delayFactor``. For example, the value of ``minDelayMs`` can be set to 100 milliseconds. If the ``delayFactor`` only
generates a 20 millisecond wait, the value of ``minDelayMs`` will override it and the URI fetch will be delayed for
100 milliseconds.
.. tab:: XML
.. code-block:: xml
.. tab:: Groovy
.. code-block:: groovy
disposition(DispositionProcessor) {
delayFactor = 5.0
maxDelayMs = 30000
minDelayMs = 3000
}
Retry Policy
~~~~~~~~~~~~
The Frontier can be used to limit the number of fetch retries for a URI. Heritrix will retry fetching a URI because
the initial fetch error may be a transitory condition.
maxRetries
This setting limits the number of fetch retries attempted on a URI due to transient errors.
retryDelaySeconds
This setting determines how long the wait period is between retries.
.. tab:: XML
.. code-block:: xml
.. tab:: Groovy
.. code-block:: groovy
frontier(BdbFrontier) {
retryDelaySeconds = 900
maxRetries = 30
}
Bandwidth Limits
~~~~~~~~~~~~~~~~
The Frontier allows the user to limit bandwidth usage. This is done by holding back URIs when bandwidth usage has
exceeded certain limits. Because bandwidth usage limitations are calculated over a period of time, there can still be
spikes in usage that greatly exceed the limits.
maxPerHostBandwidthUsageKbSec
This setting limits the maximum bandwidth to use for any host. This setting limits the load placed by Heritrix on the
host. It is therefore a politeness setting.
.. tab:: XML
.. code-block:: xml
.. tab:: Groovy
.. code-block:: groovy
disposition(DispositionProcessor) {
maxPerHostBandwidthUsageKbSec = 500
}
Extractor Parameters
~~~~~~~~~~~~~~~~~~~~
The Frontier's behavior with regard to link extraction can be controlled by the following parameters.
extract404s
This setting allows the operator to avoid extracting links from 404 (Not Found) pages. The default is true, which
maintains the pre-3.1 behavior of extracting links from 404 pages.
.. tab:: XML
.. code-block:: xml
.. tab:: Groovy
.. code-block:: groovy
frontier(BdbFrontier) {
extract404s = true
}
extractIndependently
This setting encourages extractor processors to always perform their best-effort extraction, even if a previous
extractor has marked a URI as already-handled. Set the value to true for best-effort extraction. The default is
false, which maintains the pre-3.1 behavior.
.. tab:: XML
.. code-block:: xml
.. tab:: Groovy
.. code-block:: groovy
frontier(BdbFrontier) {
extractIndependently = false
}
Sheets (Site-specific Settings)
-------------------------------
Sheets provide the ability to replace default settings on a per domain basis. Sheets are collections of overrides.
They contain alternative values for object properties that should apply in certain contexts. The target is specified
as an arbitrarily-long property-path, which is a string describing how to access the property starting from a
beanName in a BeanFactory.
Sheets allow settings to be overlaid with new values that apply by top level domains (com, net, org, etc), by
second-level domains (yahoo.com, archive.org, etc.), by subdomains (crawler.archive.org, tech.groups.yahoo.com, etc.)
, and leading URI paths (directory.google.com/Top/Computers/, etc.). There is no limit for how long the domain/path
prefix which specifies overlays can go; the `SURT Prefix `_
syntax is used.
Creating a new sheet involves configuring the ``crawler-beans.cxml`` file, which contains the Spring configuration of
a job.
For example, if you have explicit permission to crawl certain domains without the usual polite rate-limiting, then a
Sheet can be used to create a less polite crawling policy that is associated with a few such target domains. The
configuration of such a Sheet for the domains example.com and example1.com are shown below. This example allows up to
5 parallel outstanding requests at a time (rather than the default 1), and eliminates any usual pauses between
requests.
.. warning::
Unless a target site has given you explicit permission to crawl extra-aggressively, the typical Heritrix defaults,
which limit the crawler to no more than one outstanding request at a time, with multiple-second waits between
requests, and longer waits when the site is responding more slowly, are the safest course. Less-polite crawling
can result in your crawler being blocked entirely by webmasters.
Finally, even with permission, be sure your crawler's User-Agent string includes a link to valid crawl-operator
contact information so you can be alerted to, and correct, any unintended side-effects.
.. tab:: XML
.. code-block:: xml
http://(com,example,www,)/
http://(com,example1,www,)/
lessPolite
.. tab:: Groovy
.. code-block:: groovy
sheetOverlaysManager(SheetOverlaysManager) { bean ->
bean.autowire = 'byType'
}
lessPoliteAssociation(SurtPrefixesSheetAssociation) {
surtPrefixes = [
'http://(com,example,www,)/',
'http://(com,example1,www,)/',
]
targetSheetNames = [
'lessPolite',
]
}
lessPolite(Sheet) {
map = [
'disposition.delayFactor': '0.0',
'disposition.maxDelayMs': '0',
'disposition.minDelayMs': '0',
'queueAssignmentPolicy.parallelQueues': '5',
]
}
Authentication and Cookies
--------------------------
Heritrix can crawl sites behind login by using HTTP authentication, submitting a form or by loading cookies from a file.
Credential Store
~~~~~~~~~~~~~~~~
Credentials can be added so that Heritrix can gain access to areas of web sites requiring authentication. Credentials
need to listed in a CredentialStore.
.. tab:: XML
.. code-block:: xml
.. tab:: Groovy
.. code-block:: groovy
credentialStore(CredentialStore) {
credentials = [
exampleHttpCredential: ref('exampleHttpCredential'),
exampleFormCredential: ref('exampleFormCredential'),
]
}
To enable text console logging of authentication interactions, set the FetchHTTP and PreconditionEnforcer log levels to
fine in ``conf/logging.properties``:
.. code-block::
org.archive.crawler.fetcher.FetchHTTP.level = FINE
org.archive.crawler.prefetch.PreconditionEnforcer.level = FINE
HTTP Basic and Digest Authentication
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In response to a 401 Unauthorized response code Heritrix will do a lookup of a key based on the domain and
authentication realm in its CredentialStore. If a match is found, then the credential is loaded into the CrawlURI and
the CrawlURI is marked for immediate retry.
When the CrawlURI is retried, the found credentials are added to the request. If the request succeeds with a 200
response code, the credentials are promoted to the CrawlServer and all subsequent requests made against the CrawlServer
will preemptively volunteer the credential. If the credential fails with a 401 response code, the URI is no longer
retried.
The configured domain should be of the form "hostname:port" unless the port is 80 in which case it must be omitted. For
HTTPS URLs without an explicit port use port 443.
.. tab:: XML
.. code-block:: xml
.. tab:: Groovy
.. code-block:: groovy
exampleHttpCredential(HttpAuthenticationCredential) {
domain = 'www.example.org:443'
realm = 'myrealm'
login = 'user'
password = 'secret'
}
HTML Form Authentication
~~~~~~~~~~~~~~~~~~~~~~~~
Heritrix can be configured to submit credentials to a HTML form using a GET or POST request.
.. tab:: XML
.. code-block:: xml
.. tab:: Groovy
.. code-block:: groovy
exampleFormCredential(HtmlFormCredential) {
domain = 'example.com'
loginUri = 'http://example.com/login'
formItems = [
login: 'user',
password: 'secret',
submit: 'submit',
]
}
domain
The domain should be of the form "hostname:port" unless the port is 80 in which case it must be omitted. For
HTTPS URLs without an explicit port use port 443.
login-uri
A relative or absolute URI to which the HTML Form submits. It is not necessarily the page that contains the HTML
Form; rather it is the ACTION URI the to which the form submits.
form-items
Form-items are a listing of HTML Form key/value pairs. The submit button usually must be included in the form-items.
.. note::
There is currently no support for successfully submitting forms with dynamic fields whose required name or value
changes for each visitor (such as CSRF tokens).
For a site with an HTML Form credential, a login is performed against all listed HTML Form credential login-uris
after the DNS and robots.txt preconditions are fulfilled. The crawler will only view sites that have HTML Form
credentials from a logged-in perspective. There is no current way for a single Heritrix job to crawl a site in an
unauthenticated state and then re-crawl the site in an authenticated state. (You would have to do this in two
separately-configured job launches.)
The form login is only run once. Heritrix continues crawling regardless of whether the login succeeds. There is no
way of telling Heritrix to retry authentication if the first attempt is not successful. Neither is there a means for
the crawler to report success or failed authentications. The crawl operator should examine the logs to determine
whether authentication succeeded.
Loading Cookies
~~~~~~~~~~~~~~~
Heritrix can be configured to load a set of cookies from a file. This can be used for example to crawl a website behind
a login form by manually logging in through the browser and then copying the session cookie.
To enable loading of cookies set the cookiesLoadFile property of the BdbCookieStore bean to a ConfigFile:
.. tab:: XML
.. code-block:: xml
.. tab:: Groovy
.. code-block:: groovy
cookieStore(BdbCookieStore) {
cookiesLoadFile = new ConfigFile(path: 'cookies.txt')
}
The cookies.txt should be in the 7-field tab-separated Netscape cookie file format. An entry might look like::
www.example.org FALSE / FALSE 1311699995 sessionid 60ddb868550a
.. list-table:: Cookie file tab-separated fields
* - 1
- DOMAIN
- The domain that created and has access to the cookie.
* - 2
- FLAG
- A TRUE or FALSE value indicating if subdomains within the given domain can access the cookie.
* - 3
- PATH
- The path within the domain that the cookie is valid for.
* - 4
- SECURE
- A TRUE or FALSE value indicating if the cookie should be sent over HTTPS only.
* - 5
- EXPIRATION
- Expiration time in seconds since 1970-01-01T00:00:00Z, or -1 for no expiration
* - 6
- NAME
- The name of the cookie.
* - 7
- VALUE
- The value of the cookie.
Other Protocols
---------------
In addition to HTTP/1.0 Heritrix can be configured to fetch resources using several other internet protocols.
FTP
~~~
Heritrix supports crawling `FTP `_ sites. Seeds should be added
in the following format: ```ftp://sftp.example.org/directory``.
The FetchFTP bean needs to be defined:
.. bean-example:: org.archive.modules.fetcher.FetchFTP
and added to the FetchChain:
.. tab:: XML
.. code-block:: xml
...
...
.. tab:: Groovy
.. code-block:: groovy
fetchProcessors(FetchChain) {
processors = [
// ...
ref('fetchFTP'),
// ...
]
}
HTTP/2
~~~~~~
To use HTTP/2 the ``FetchHTTP`` bean should replaced with ``FetchHTTP2``:
.. bean-example:: org.archive.modules.fetcher.FetchHTTP2
``FetchHTTP2`` will use HTTP/1.1 for non-https URLs and for servers that do not support HTTP/2. Requests that used HTTP/2
will be annotated with ``h2`` in the crawl log and ``WARC-Protocol`` header.
Note that ``FetchHTTP2`` currently only supports a limited subset of the ``FetchHTTP`` options.
.. note::
The WARC standard (as of version 1.1) does not specify how to record HTTP/2 or 3 messages.
FetchHTTP2 does not record the original on-the-wire HTTP messages but instead a simplified HTTP/1.1
representation without transfer encoding.
If you want to stay within the bounds of the base WARC standard without extensions, or want to ensure the exact
bytes of the HTTP network message are recorded, you may prefer to use FetchHTTP.
HTTP/3
~~~~~~
HTTP/3 support is experimental and is not enabled by default. First replace ``FetchHTTP`` with ``FetchHTTP2`` as described
above and then enable the ``useHTTP3`` property:
.. tab:: XML
.. code-block:: xml
.. tab:: Groovy
.. code-block:: groovy
fetchHttp2(FetchHTTP2) {
useHTTP3 = true
}
An appropriate version of the jetty-quiche-native jar also needs to be placed in Heritrix's ``lib`` directory. To find out
which version you need, build a job with ``useHTTP3`` enabled and a warning will be logged to the job log with a download
link.
HTTP/3 requests will only be sent after a server first responds with an ``Alt-Svc`` header. Requests that used HTTP/3 will
be annotated with ``h3`` in the crawl log and ``WARC-Protocol`` header.
SFTP
~~~~
An optional fetcher for `SFTP `_ is provided. Seeds should
be added in the following format:``sftp://sftp.example.org/directory``.
The FetchSFTP bean needs to be defined:
.. bean-example:: org.archive.modules.fetcher.FetchSFTP
and added to the FetchChain:
.. tab:: XML
.. code-block:: xml
...
...
.. tab:: Groovy
.. code-block:: groovy
fetchProcessors(FetchChain) {
processors = [
// ...
ref('fetchSFTP'),
// ...
]
}
WHOIS
~~~~~
An optional fetcher for domain `WHOIS `_ data is provided. A small set of
well-established WHOIS servers are preconfigured. The fetcher uses an ad-hoc/intuitive interpretation of a 'whois:'
scheme URI.
Define the fetchWhois bean:
.. tab:: XML
.. code-block:: xml
.. tab:: Groovy
.. code-block:: groovy
fetchWhois(FetchWhois) {
specialQueryTemplates = [
'whois.verisign-grs.com': 'domain %s',
'whois.arin.net': 'z + %s',
'whois.denic.de': '-T dn %s',
]
}
and add it to the FetchChain:
.. tab:: XML
.. code-block:: xml
...
...
.. tab:: Groovy
.. code-block:: groovy
fetchProcessors(FetchChain) {
processors = [
// ...
ref('fetchWhois'),
// ...
]
}
To configure a whois seed, enter the seed in the following format: ``whois://hostname/path``. For example,
``whois://archive.org``. The whois fetcher will attempt to resolve each host that the crawl encounters using the
topmost assigned domain and the ip address of the url crawled. So if you crawl ``http://www.archive.org/details/texts``,
the whois fetcher will attempt to resolve ``whois:archive.org`` and ``whois:207.241.224.2``.
At this time, whois functionality is experimental. The fetchWhois bean is commented out in the default profile.
Modifying a Running Job
-----------------------
While changing a job's XML configuration normally requires relaunching it, some settings can be modified while the crawl
is running. This is done through the `Browse Beans`_ or the `Scripting Console`_ link on the job page. The Bean Browser
allows you to edit runtime properties of beans. You can also use the scripting console to programmatically edit
a running job.
If changing a non-atomic value, it is a good practice to pause the crawl prior to making the change, as some
modifications to composite configuration entities may not occur in a thread-safe manner. An example of a non-atomic
change is adding a new Sheet.
Browse Beans
~~~~~~~~~~~~
The WUI provides a way to view and edit the Spring beans that make up a crawl configuration. It is important to note
that changing the crawl configuration using the Bean Browser will not update the ``crawler-beans.cxml`` file. Thus,
changing settings with the Bean Browser is not permanent. The Bean Browser should only by used to change the settings
of a running crawl. To access the Bean Browser click on the Browse Beans link from the jobs page. The hierarchy of
Spring beans will be displayed.
.. image:: https://raw.githubusercontent.com/wiki/internetarchive/heritrix3/attachments/5735725/5865655.png
You can drill down on individual beans by clicking on them. The example below shows the display after clicking on the
seeds bean.
.. image:: https://raw.githubusercontent.com/wiki/internetarchive/heritrix3/attachments/5735725/5865656.png
Scripting Console
~~~~~~~~~~~~~~~~~
[This section to be written. For now see the
`Heritrix3 Useful Scripts `_ wiki page.]
Configuring HTTP Proxies
~~~~~~~~~~~~~~~~~~~~~~~~
There are two options to specify a proxy for crawling.
The command line options ``--proxy-host`` and ``--proxy-port`` can be used to define a proxy for all jobs.
If only the ``--proxy-host`` option is given, a default value of 8000 is used for the proxy port.
These proxy settings are also used when connecting to a "DNS-over-HTTP" server
(see the `section on DNS-over-HTTP <#configuring-dns-over-http-doh>`_ below).
Alternatively one can define a per-job proxy via a the ``httpProxyHost`` and ``httpProxyPort`` properties of the
``fetchHttp`` bean. These settings, if both defined, will overwrite the global options. These setting also allow for
a user and password in the ``httpProxyUser`` and ``httpProxyPassword`` properties, which the global options do not
support, due to incompatibilities of the different supported Java versions.
Also the optional "SOCKS5" proxy documented in the next section is used on a per-job basis; there are currently no
global options to define it.
Configuring SOCKS5 Proxy
~~~~~~~~~~~~~~~~~~~~~~~~
An optional configuration value to route Heritrix crawler traffic through a SOCKS5 proxy. This will override any set
HTTP proxy configuration. It is facilitated by extending the `org.archive.modules.fetcher.FetchHTTP` bean with
`socksProxyHost` and `socksProxyPort` values, as in the example below:
.. tab:: XML
.. code-block:: xml
.. tab:: Groovy
.. code-block:: groovy
fetchHttp(FetchHTTP) {
// ...
socksProxyHost = '127.0.0.1'
socksProxyPort = 24000
}
Configuring DNS over HTTP (DoH)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If the local DNS on the server running Heritrix is not able to resolve the DNS names of the crawled sites, e.g. because
the server is sitting behind an enterprise firewall and can only resolve names of the local network, then using
DNS-over-HTTP (DoH) might be an alternative to fetch DNS information.
To activate this, one needs to set the ``dnsOverHttpServer`` setting of the ``fetchDns`` bean to the URL of an DoH server.
If one has configured a global proxy via the ``--proxy-host`` and ``--proxy-port`` command line options,
these proxy settings will be used to contact the DoH server as well. However due to limitation of the library in use,
username and password information for the proxy are not supported. Also any per-job defined proxy settings in the
``fetchHttp`` bean are not used when contacting the DoH server.
As the implementation relies on the corresponding client in the "dnsjava" library, which is currently labeled as
experimental, this option comes with some limitations:
* If you use Java 11 then due to a `well known bug `_ it will not
close connections to the DoH server unless Heritrix shuts down.
As the DoH server might not accept new connections after some limits while these connections are still open, it is
not recommended to use this feature when running Heritrix with Java 11.
* For other Java versions, the connection to the DoH server will be closed when the garbage collector runs.
Depending on the garbage collector used this will cause a delay of anything between a few seconds and several
minutes before closing the connection. Also note that if the garbage collector is explicitely triggered via the
Heritrix UI one needs to add the ``-XX:-DisableExplicitGC`` option in the ``JAVA_OPTS`` for Java versions 13 and up
as otherwise this action has no effect.
Without making a recommendation the following DoH servers have been tested with the DoH feature:
* https://dns.google/dns-query
* https://cloudflare-dns.com/dns-query
However servers implementing the official `RFC 8484 `_ specification
unfortunately do not work with the current implementation. This includes e.g. the following server:
* https://dns.digitale-gesellschaft.ch/dns-query
This limitation might be overcome by a newer version of the "dnsjava" library.