Configuring Crawl Jobs

Basic Job Settings

Crawl settings are configured by editing the job’s primary configuration file, usually crawler-beans.cxml or crawler-beans.groovy.

Profiles

A profile is a non-launchable job template. To create one from the web interface, open an existing job, choose Copy Job, enter the new profile name, check as profile, and submit the form. Heritrix treats jobs whose primary configuration filename starts with profile- as profiles. They can be built for validation, but not launched directly.

Profiles appear in the create job profile selector by their job name. Choosing one creates a new launchable job by copying the profile and removing the profile- prefix from the primary configuration filename.

The built-in defaults can be overridden by creating a profile with the same name as the built-in profile.

Crawl Limits

In addition to limits imposed on the scope of the crawl it is possible to enforce arbitrary limits on the duration and extent of the crawl with the following settings:

maxBytesDownload: Stop the crawl after a fixed number of bytes have been downloaded. Zero means unlimited.
maxDocumentsDownload: Stop the crawl after downloading a fixed number of documents. Zero means unlimited.
maxTimeSeconds: Stop the crawl after a certain number of seconds have elapsed. Zero means unlimited. For reference there are 3600 seconds in an hour and 86400 seconds in a day.

To set these values modify the CrawlLimitEnforcer bean.

XML

<bean id="crawlLimitEnforcer" class="org.archive.crawler.framework.CrawlLimitEnforcer">
  <property name="maxBytesDownload" value="100000000" />
  <property name="maxDocumentsDownload" value="100" />
  <property name="maxTimeSeconds" value="10000" />
</bean>

Groovy

crawlLimitEnforcer(CrawlLimitEnforcer) {
    maxBytesDownload = 100000000
    maxDocumentsDownload = 100
    maxTimeSeconds = 10000
}

Note

These are not hard limits. Once one of these limits is hit it will trigger a graceful termination of the crawl job. URIs already being crawled will be completed. As a result the set limit will be exceeded by some amount.
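As a worked example of the unit arithmetic, the sketch below caps a crawl at two days or 50 GB, whichever is reached first, and leaves the document count unlimited. The figures are illustrative assumptions, not recommended values.

```xml
<bean id="crawlLimitEnforcer" class="org.archive.crawler.framework.CrawlLimitEnforcer">
  <!-- 50 GB expressed in bytes: 50 * 1,000,000,000 -->
  <property name="maxBytesDownload" value="50000000000" />
  <!-- zero leaves the number of documents unlimited -->
  <property name="maxDocumentsDownload" value="0" />
  <!-- two days expressed in seconds: 2 * 86400 -->
  <property name="maxTimeSeconds" value="172800" />
</bean>
```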

maxToeThreads

The maximum number of toe threads to run.

If running a domain crawl smaller than 100 hosts, a value approximately twice the number of hosts should be enough. Values larger than 150-200 are rarely worthwhile unless running on machines with exceptional resources.

XML

<bean id="crawlController" class="org.archive.crawler.framework.CrawlController">
  <property name="maxToeThreads" value="50" />
</bean>

Groovy

crawlController(CrawlController) {
    maxToeThreads = 50
}

metadata.operatorContactUrl

The URI of the crawl initiator. This setting gives the administrator of a crawled host a URI to refer to in case of problems.

XML

<bean id="simpleOverrides" class="org.springframework.beans.factory.config.PropertyOverrideConfigurer">
  <property name="properties">
    <value>
      metadata.operatorContactUrl=http://www.archive.org
      metadata.jobName=basic
      metadata.description=Basic crawl starting with useful defaults
    </value>
  </property>
</bean>

Groovy

simpleOverrides(PropertyOverrideConfigurer) {
    properties = '''
metadata.operatorContactUrl=http://www.archive.org
metadata.jobName=basic
metadata.description=Basic crawl starting with useful defaults
'''
}

Robots.txt Honoring Policy

The valid values of “robotsPolicyName” are:

obey: Obey robots.txt directives and nofollow robots meta tags
classic: Same as “obey”
robotsTxtOnly: Obey robots.txt directives but ignore robots meta tags
ignore: Ignore robots.txt directives and robots meta tags

XML

<bean id="metadata" class="org.archive.modules.CrawlMetadata" autowire="byName">
...
    <property name="robotsPolicyName" value="obey"/>
...
</bean>

Groovy

metadata(CrawlMetadata) { bean ->
    bean.autowire = 'byName'
    // ...
    robotsPolicyName = 'obey'
    // ...
}

Note

Heritrix supports RFC 9309 path wildcards (*, $) in robots.txt rules.

The only supported value for robots meta tags is “nofollow” which will cause the HTML extractor to stop processing and ignore all links (including embeds like images and stylesheets).

<meta name="robots" content="nofollow"/>

Obeying “rel=nofollow” on individual links is configured separately as obeyRelNoFollow on ExtractorHTML.
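A minimal sketch of that separate setting, assuming the job defines its HTML extractor under the bean id extractorHtml and that obeyRelNoFollow takes a boolean value. In an existing job the property would be added to the extractor bean that is already defined rather than declaring a second one.

```xml
<bean id="extractorHtml" class="org.archive.modules.extractor.ExtractorHTML">
  <!-- assumption: true makes the extractor skip links marked rel="nofollow" -->
  <property name="obeyRelNoFollow" value="true" />
</bean>
```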

Crawl Scope

The crawl scope defines the set of possible URIs that can be captured by a crawl. These URIs are determined by DecideRules, which work in combination to limit or expand the set of crawled URIs. Each DecideRule, when presented with an object (most often a URI of some form) responds with one of three decisions:

ACCEPT: the object is ruled in
REJECT: the object is ruled out
PASS: the rule has no opinion; retain the previous decision

A URI under consideration begins with no assumed status. Each rule is applied in turn to the candidate URI. If the rule decides ACCEPT or REJECT, the URI’s status is set accordingly. After all rules have been applied, the URI is determined to be “in scope” if its status is ACCEPT. If its status is REJECT it is discarded.

We suggest starting with the rules in our recommended default configurations and performing small test crawls with those rules. Understand why certain URIs are ruled in or ruled out under those rules. Then make small individual changes to the scope to achieve non-default desired effects. Creating a new ruleset from scratch can be difficult and can easily result in crawls that can’t make the usual minimal progress that other parts of the crawler expect. Similarly, making many changes at once can obscure the importance of the interplay and ordering of the rules.
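The effect of rule ordering can be illustrated with a shortened scope bean. This is a sketch modelled on the general shape of the default configurations, not a complete ruleset; the maxHops value is an illustrative assumption.

```xml
<bean id="scope" class="org.archive.modules.deciderules.DecideRuleSequence">
  <property name="rules">
    <list>
      <!-- start by rejecting everything -->
      <bean class="org.archive.modules.deciderules.RejectDecideRule" />
      <!-- then accept URIs under the SURT prefixes implied by the seeds -->
      <bean class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule">
        <property name="seedsAsSurtPrefixes" value="true" />
      </bean>
      <!-- but reject URIs that are too many hops from a seed -->
      <bean class="org.archive.modules.deciderules.TooManyHopsDecideRule">
        <property name="maxHops" value="20" />
      </bean>
      <!-- but always accept prerequisites such as DNS and robots.txt fetches -->
      <bean class="org.archive.modules.deciderules.PrerequisiteAcceptDecideRule" />
    </list>
  </property>
</bean>
```

Under this sketch, a URI that sits below a seed but is 25 hops away is first rejected, then accepted by the SURT rule, then rejected again by the hops rule. The prerequisite rule passes, so the final status is REJECT. Moving the hops rule above the SURT rule would change the outcome to ACCEPT, which is why the order of the list matters.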

Decide Rules

DecideRule: AcceptDecideRule: This DecideRule accepts any URI.
DecideRule: ContentLengthDecideRule: This DecideRule accepts a URI if the content-length is less than the threshold. The default threshold is 2^63, meaning any document will be accepted.
DecideRule: PathologicalPathDecideRule: This DecideRule rejects any URI that contains an excessive number of identical, consecutive path-segments. For example, http://example.com/a/a/a/a/a/foo.html.
DecideRule: PredicatedDecideRule: This DecideRule applies a configured decision only if a test evaluates to true.
DecideRule: ExternalGeoLocationDecideRule: This DecideRule accepts a URI if it is located in a particular country.
DecideRule: FetchStatusDecideRule: This DecideRule applies the configured decision to any URI that has a fetch status equal to the “target-status” setting.
DecideRule: HasViaDecideRule: This DecideRule applies the configured decision to any URI that has a “via.” A via is any URI that is a seed or some kind of mid-crawl addition.
DecideRule: HopCrossesAssignmentLevelDomainDecideRule: This DecideRule applies the configured decision to any URI that differs in the portion of its hostname/domain that is assigned/sold by registrars. The portion is referred to as the “assignment-level-domain” (ALD).
DecideRule: IdenticalDigestDecideRule: This DecideRule applies the configured decision to any URI whose prior-history content-digest matches the latest fetch.
DecideRule: MatchesListRegexDecideRule: This DecideRule applies the configured decision to any URI that matches the supplied regular expressions.
DecideRule: NotMatchesListRegexDecideRule: This DecideRule applies the configured decision to any URI that does not match the supplied regular expressions.
DecideRule: MatchesRegexDecideRule: This DecideRule applies the configured decision to any URI that matches the supplied regular expression.
DecideRule: ClassKeyMatchesRegexDecideRule: This DecideRule applies the configured decision to any URI class key that matches the supplied regular expression. A URI class key is a string that specifies the name of the Frontier queue into which a URI should be placed.
DecideRule: ContentTypeMatchesRegexDecideRule: This DecideRule applies the configured decision to any URI whose content-type is present and matches the supplied regular expression. The regular expression must match the full content-type sequence. For example, to match the content-type application/javascript;charset=UTF-8 use ^application\/javascript.*$, and to match text/html use ^.*\/html.*$.
DecideRule: ContentTypeNotMatchesRegexDecideRule: This DecideRule applies the configured decision to any URI whose content-type does not match the supplied regular expression.
DecideRule: FetchStatusMatchesRegexDecideRule: This DecideRule applies the configured decision to any URI that has a fetch status that matches the supplied regular expression.
DecideRule: FetchStatusNotMatchesRegexDecideRule: This DecideRule applies the configured decision to any URI that has a fetch status that does not match the supplied regular expression.
DecideRule: HopsPathMatchesRegexDecideRule: This DecideRule applies the configured decision to any URI whose “hops-path” matches the supplied regular expression. The hops-path is a string that consists of characters representing the path that was taken to access the URI. An example of a hops-path is “LLXE”.
DecideRule: MatchesFilePatternDecideRule: This DecideRule applies the configured decision to any URI whose suffix matches the supplied regular expression.
DecideRule: NotMatchesFilePatternDecideRule: This DecideRule applies the configured decision to any URI whose suffix does not match the supplied regular expression.
DecideRule: NotMatchesRegexDecideRule: This DecideRule applies the configured decision to any URI that does not match the supplied regular expression.
DecideRule: NotExceedsDocumentLengthThresholdDecideRule: This DecideRule applies the configured decision to any URI whose content-length does not exceed the configured threshold. The content-length comes from either the HTTP header or the actual downloaded content length of the URI. As of Heritrix 3.1, this rule has been renamed to ResourceNoLongerThanDecideRule.
DecideRule: ExceedsDocumentLengthThresholdDecideRule: This DecideRule applies the configured decision to any URI whose content length exceeds the configured threshold. The content-length comes from either the HTTP header or the actual downloaded content length of the URI. As of Heritrix 3.1, this rule has been renamed to ResourceLongerThanDecideRule.
DecideRule: SurtPrefixedDecideRule: This DecideRule applies the configured decision to any URI (expressed in SURT form) that begins with one of the prefixes in the configured set. This DecideRule returns true when the prefix of a given URI matches any of the listed SURTs. The list of SURTs may be configured in different ways: the surtsSourceFile parameter specifies a file to read the SURTs list from. If seedsAsSurtPrefixes parameter is set to true, SurtPrefixedDecideRule adds all seeds to the SURTs list. If alsoCheckVia property is set to true (default false), SurtPrefixedDecideRule will also consider Via URIs in the match. As of Heritrix 3.1, the “surtsSource” parameter may be any ReadSource, such as a ConfigFile or a ConfigString. This gives the SurtPrefixedDecideRule the flexibility of the TextSeedModule bean’s “textSource” property.
DecideRule: NotSurtPrefixedDecideRule: This DecideRule applies the configured decision to any URI (expressed in SURT form) that does not begin with one of the prefixes in the configured set.
DecideRule: OnDomainsDecideRule: This DecideRule applies the configured decision to any URI that is in one of the domains of the configured set.
DecideRule: NotOnDomainsDecideRule: This DecideRule applies the configured decision to any URI that is not in one of the domains of the configured set.
DecideRule: OnHostsDecideRule: This DecideRule applies the configured decision to any URI that is in one of the hosts of the configured set.
DecideRule: NotOnHostsDecideRule: This DecideRule applies the configured decision to any URI that is not in one of the hosts of the configured set.
DecideRule: ScopePlusOneDecideRule: This DecideRule applies the configured decision to any URI that is one level beyond the configured scope.
DecideRule: TooManyHopsDecideRule: This DecideRule rejects any URI whose total number of hops is over the configured threshold.
DecideRule: TooManyPathSegmentsDecideRule: This DecideRule rejects any URI whose total number of path-segments is over the configured threshold. A path-segment is a string in the URI separated by a “/” character, not including the first “//”.
DecideRule: TransclusionDecideRule: This DecideRule accepts any URI whose path-from-seed ends in at least one non-navlink hop. A navlink hop is represented by an “L”. Also, the number of non-navlink hops in the path-from-seed cannot exceed the configured value.
DecideRule: PrerequisiteAcceptDecideRule: This DecideRule accepts all “prerequisite” URIs. Prerequisite URIs are those whose hops-path has a “P” in the last position.
DecideRule: RejectDecideRule: This DecideRule rejects any URI.
DecideRule: ScriptedDecideRule: This DecideRule applies the configured decision to any URI that passes the rules test of a JSR-223 script. The script source must be a one-argument function called “decisionFor”. The function returns the appropriate DecideResult. Variables available to the script include “object” (the object to be evaluated, such as a URI), “self” (the ScriptedDecideRule instance), and “context” (the crawl’s ApplicationContext, from which all named crawl beans are reachable).
DecideRule: SeedAcceptDecideRule: This DecideRule accepts all “seed” URIs (those for which isSeed is true).
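
Many of the rules above apply a “configured decision”. A sketch of what that looks like in practice, assuming the rule exposes decision and regexList properties and that the bean is placed as an entry inside the rules list of the scope bean. The two patterns are hypothetical and are written on the assumption that a pattern has to match the whole URI.

```xml
<bean class="org.archive.modules.deciderules.MatchesListRegexDecideRule">
  <!-- the decision applied when one of the patterns matches -->
  <property name="decision" value="REJECT" />
  <property name="regexList">
    <list>
      <!-- hypothetical: skip calendar pages -->
      <value>.*/calendar/.*</value>
      <!-- hypothetical: skip URIs carrying a session id in the query string -->
      <value>.*\?.*sessionid=.*</value>
    </list>
  </property>
</bean>
```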

DecideRuleSequence Logging

Enable FINEST logging on the class org.archive.modules.deciderules.DecideRuleSequence to watch each DecideRule’s evaluation of the processed URI. This can be done in the logging.properties file:

org.archive.modules.deciderules.DecideRuleSequence.level = FINEST

in conjunction with a -D system property VM argument that points at that file:

-Djava.util.logging.config.file=/path/to/heritrix3/dist/src/main/conf/logging.properties

Sheets (Site-specific Settings)

Sheets provide the ability to replace default settings on a per domain basis. Sheets are collections of overrides. They contain alternative values for object properties that should apply in certain contexts. The target is specified as an arbitrarily-long property-path, which is a string describing how to access the property starting from a beanName in a BeanFactory.

Sheets allow settings to be overlaid with new values that apply by top-level domains (com, net, org, etc.), by second-level domains (yahoo.com, archive.org, etc.), by subdomains (crawler.archive.org, tech.groups.yahoo.com, etc.), and by leading URI paths (directory.google.com/Top/Computers/, etc.). There is no limit on the length of the domain/path prefix that selects an overlay; the SURT prefix syntax is used.
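
The levels listed above might be written as SURT prefixes along the following lines. This is a sketch: the prefix forms reflect a reading of the SURT prefix syntax, and exampleSheet is a hypothetical sheet name that would have to be defined elsewhere in the job configuration.

```xml
<bean class="org.archive.crawler.spring.SurtPrefixesSheetAssociation">
  <property name="surtPrefixes">
    <list>
      <!-- every host under the org top-level domain -->
      <value>http://(org,</value>
      <!-- archive.org and all of its subdomains -->
      <value>http://(org,archive,</value>
      <!-- only the host crawler.archive.org -->
      <value>http://(org,archive,crawler,)</value>
      <!-- only URIs on directory.google.com below /Top/Computers/ -->
      <value>http://(com,google,directory,)/Top/Computers/</value>
    </list>
  </property>
  <property name="targetSheetNames">
    <list>
      <value>exampleSheet</value>
    </list>
  </property>
</bean>
```

In a real job each level would normally point at its own sheet; they share one association here only to keep the sketch short.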

Creating a new sheet involves configuring the crawler-beans.cxml file, which contains the Spring configuration of a job.

For example, if you have explicit permission to crawl certain domains without the usual polite rate-limiting, then a Sheet can be used to create a less polite crawling policy that is associated with a few such target domains. The configuration of such a Sheet for the domains example.com and example1.com is shown below. This example allows up to 5 parallel outstanding requests at a time (rather than the default 1), and eliminates any usual pauses between requests.

Warning

Unless a target site has given you explicit permission to crawl extra-aggressively, the typical Heritrix defaults, which limit the crawler to no more than one outstanding request at a time, with multiple-second waits between requests, and longer waits when the site is responding more slowly, are the safest course. Less-polite crawling can result in your crawler being blocked entirely by webmasters.

Finally, even with permission, be sure your crawler’s User-Agent string includes a link to valid crawl-operator contact information so you can be alerted to, and correct, any unintended side-effects.

XML

<bean id="sheetOverlaysManager" autowire="byType" class="org.archive.crawler.spring.SheetOverlaysManager">
</bean>

<bean class='org.archive.crawler.spring.SurtPrefixesSheetAssociation'>
  <property name='surtPrefixes'>
    <list>
      <value>http://(com,example,www,)/</value>
      <value>http://(com,example1,www,)/</value>
    </list>
  </property>
  <property name='targetSheetNames'>
    <list>
      <value>lessPolite</value>
    </list>
  </property>
</bean>

<bean id='lessPolite' class='org.archive.spring.Sheet'>
  <property name='map'>
    <map>
      <entry key='disposition.delayFactor' value='0.0'/>
      <entry key='disposition.maxDelayMs' value='0'/>
      <entry key='disposition.minDelayMs' value='0'/>
      <entry key='queueAssignmentPolicy.parallelQueues' value='5'/>
    </map>
  </property>
</bean>

Groovy

sheetOverlaysManager(SheetOverlaysManager) { bean ->
    bean.autowire = 'byType'
}

lessPoliteAssociation(SurtPrefixesSheetAssociation) {
    surtPrefixes = [
        'http://(com,example,www,)/',
        'http://(com,example1,www,)/',
    ]
    targetSheetNames = [
        'lessPolite',
    ]
}

lessPolite(Sheet) {
    map = [
        'disposition.delayFactor': '0.0',
        'disposition.maxDelayMs': '0',
        'disposition.minDelayMs': '0',
        'queueAssignmentPolicy.parallelQueues': '5',
    ]
}

Authentication and Cookies

Heritrix can crawl sites behind login by using HTTP authentication, submitting a form or by loading cookies from a file.

Credential Store

Credentials can be added so that Heritrix can gain access to areas of web sites requiring authentication. Credentials need to be listed in a CredentialStore.

XML

<bean id="credentialStore" class="org.archive.modules.credential.CredentialStore">
  <property name="credentials">
    <map>
      <entry key="exampleHttpCredential" value-ref="exampleHttpCredential"/>
      <entry key="exampleFormCredential" value-ref="exampleFormCredential"/>
    </map>
  </property>
</bean>

Groovy

credentialStore(CredentialStore) {
    credentials = [
        exampleHttpCredential: ref('exampleHttpCredential'),
        exampleFormCredential: ref('exampleFormCredential'),
    ]
}

To enable text console logging of authentication interactions, set the FetchHTTP and PreconditionEnforcer log levels to FINE in conf/logging.properties:

org.archive.modules.fetcher.FetchHTTP.level = FINE
org.archive.crawler.prefetch.PreconditionEnforcer.level = FINE

HTTP Basic and Digest Authentication

In response to a 401 Unauthorized response code Heritrix will do a lookup of a key based on the domain and authentication realm in its CredentialStore. If a match is found, then the credential is loaded into the CrawlURI and the CrawlURI is marked for immediate retry.

When the CrawlURI is retried, the found credentials are added to the request. If the request succeeds with a 200 response code, the credentials are promoted to the CrawlServer and all subsequent requests made against the CrawlServer will preemptively volunteer the credential. If the credential fails with a 401 response code, the URI is no longer retried.

The configured domain should be of the form “hostname:port” unless the port is 80 in which case it must be omitted. For HTTPS URLs without an explicit port use port 443.

XML

<bean id="exampleHttpCredential" class="org.archive.modules.credential.HttpAuthenticationCredential">
  <property name="domain" value="www.example.org:443"/>
  <property name="realm" value="myrealm"/>
  <property name="login" value="user"/>
  <property name="password" value="secret"/>
</bean>

Groovy

exampleHttpCredential(HttpAuthenticationCredential) {
    domain = 'www.example.org:443'
    realm = 'myrealm'
    login = 'user'
    password = 'secret'
}
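
The port rule for the domain property can be illustrated with two further credentials. This is a sketch: the host name, bean ids, realm, and login values are hypothetical, and each credential would also need an entry in the CredentialStore map.

```xml
<!-- protects http://intranet.example.org/ : port 80, so the port is omitted -->
<bean id="plainHttpCredential" class="org.archive.modules.credential.HttpAuthenticationCredential">
  <property name="domain" value="intranet.example.org"/>
  <property name="realm" value="myrealm"/>
  <property name="login" value="user"/>
  <property name="password" value="secret"/>
</bean>

<!-- protects http://intranet.example.org:8080/ : non-default port, so it is included -->
<bean id="altPortCredential" class="org.archive.modules.credential.HttpAuthenticationCredential">
  <property name="domain" value="intranet.example.org:8080"/>
  <property name="realm" value="myrealm"/>
  <property name="login" value="user"/>
  <property name="password" value="secret"/>
</bean>
```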

HTML Form Authentication

Heritrix can be configured to submit credentials to an HTML form using a GET or POST request.

XML

<bean id="exampleFormCredential" class="org.archive.modules.credential.HtmlFormCredential">
  <property name="domain" value="example.com"/>
  <property name="loginUri" value="http://example.com/login"/>
  <property name="formItems">
    <map>
      <entry key="login" value="user"/>
      <entry key="password" value="secret"/>
      <entry key="submit" value="submit"/>
    </map>
  </property>
</bean>

Groovy

exampleFormCredential(HtmlFormCredential) {
    domain = 'example.com'
    loginUri = 'http://example.com/login'
    formItems = [
        login: 'user',
        password: 'secret',
        submit: 'submit',
    ]
}

domain: The domain should be of the form “hostname:port” unless the port is 80, in which case it must be omitted. For HTTPS URLs without an explicit port use port 443.
loginUri: A relative or absolute URI to which the HTML form submits. It is not necessarily the page that contains the HTML form; rather, it is the ACTION URI to which the form submits.
formItems: A listing of HTML form key/value pairs. The submit button usually must be included in the formItems.

Note

There is currently no support for successfully submitting forms with dynamic fields whose required name or value changes for each visitor (such as CSRF tokens).

For a site with an HTML Form credential, a login is performed against all listed HTML Form credential login-uris after the DNS and robots.txt preconditions are fulfilled. The crawler will only view sites that have HTML Form credentials from a logged-in perspective. There is no current way for a single Heritrix job to crawl a site in an unauthenticated state and then re-crawl the site in an authenticated state. (You would have to do this in two separately-configured job launches.)

The form login is only run once. Heritrix continues crawling regardless of whether the login succeeds. There is no way of telling Heritrix to retry authentication if the first attempt is not successful. Neither is there a means for the crawler to report successful or failed authentications. The crawl operator should examine the logs to determine whether authentication succeeded.

Loading Cookies

Heritrix can be configured to load a set of cookies from a file. This can be used for example to crawl a website behind a login form by manually logging in through the browser and then copying the session cookie.

To enable loading of cookies set the cookiesLoadFile property of the BdbCookieStore bean to a ConfigFile:

XML

<bean id="cookieStore" class="org.archive.modules.fetcher.BdbCookieStore">
  <property name="cookiesLoadFile">
     <bean class="org.archive.spring.ConfigFile">
       <property name="path" value="cookies.txt" />
     </bean>
  </property>
</bean>

Groovy

cookieStore(BdbCookieStore) {
    cookiesLoadFile = new ConfigFile(path: 'cookies.txt')
}

The cookies.txt should be in the 7-field tab-separated Netscape cookie file format. An entry might look like:

www.example.org FALSE / FALSE 1311699995 sessionid 60ddb868550a

Cookie file tab-separated fields
1	DOMAIN	The domain that created and has access to the cookie.
2	FLAG	A TRUE or FALSE value indicating if subdomains within the given domain can access the cookie.
3	PATH	The path within the domain that the cookie is valid for.
4	SECURE	A TRUE or FALSE value indicating if the cookie should be sent over HTTPS only.
5	EXPIRATION	Expiration time in seconds since 1970-01-01T00:00:00Z, or -1 for no expiration
6	NAME	The name of the cookie.
7	VALUE	The value of the cookie.

Modifying a Running Job

While changing a job’s XML configuration normally requires relaunching it, some settings can be modified while the crawl is running. This is done through the Browse Beans or the Scripting Console link on the job page. The Bean Browser allows you to edit runtime properties of beans. You can also use the scripting console to programmatically edit a running job.

If changing a non-atomic value, it is a good practice to pause the crawl prior to making the change, as some modifications to composite configuration entities may not occur in a thread-safe manner. An example of a non-atomic change is adding a new Sheet.

Browse Beans

The WUI provides a way to view and edit the Spring beans that make up a crawl configuration. It is important to note that changing the crawl configuration using the Bean Browser will not update the crawler-beans.cxml file. Thus, changing settings with the Bean Browser is not permanent. The Bean Browser should only be used to change the settings of a running crawl. To access the Bean Browser click on the Browse Beans link from the jobs page. The hierarchy of Spring beans will be displayed.

https://raw.githubusercontent.com/wiki/internetarchive/heritrix3/attachments/5735725/5865655.png

You can drill down on individual beans by clicking on them. The example below shows the display after clicking on the seeds bean.

https://raw.githubusercontent.com/wiki/internetarchive/heritrix3/attachments/5735725/5865656.png

Scripting Console

[This section to be written. For now see the Heritrix3 Useful Scripts wiki page.]

Configuring HTTP Proxies

There are two options to specify a proxy for crawling.

The command line options --proxy-host and --proxy-port can be used to define a proxy for all jobs. If only the --proxy-host option is given, a default value of 8000 is used for the proxy port. These proxy settings are also used when connecting to a “DNS-over-HTTP” server (see the section on DNS-over-HTTP below).

Alternatively, one can define a per-job proxy via the httpProxyHost and httpProxyPort properties of the fetchHttp bean. These settings, if both are defined, override the global options. A user and password can also be given in the httpProxyUser and httpProxyPassword properties, which the global options do not support due to incompatibilities between the supported Java versions.
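
A sketch of such a per-job proxy, using the four property names mentioned above. The host, port, user, and password values are hypothetical placeholders, and in an existing job the properties would be added to the fetchHttp bean that is already defined.

```xml
<bean id="fetchHttp" class="org.archive.modules.fetcher.FetchHTTP">
  <!-- ... other fetchHttp properties ... -->
  <!-- host and port must both be set for the per-job proxy to take effect -->
  <property name="httpProxyHost" value="proxy.example.org"/>
  <property name="httpProxyPort" value="3128"/>
  <!-- optional proxy credentials -->
  <property name="httpProxyUser" value="proxyuser"/>
  <property name="httpProxyPassword" value="secret"/>
</bean>
```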

Also the optional “SOCKS5” proxy documented in the next section is used on a per-job basis; there are currently no global options to define it.

Configuring SOCKS5 Proxy

Heritrix crawler traffic can optionally be routed through a SOCKS5 proxy. This overrides any configured HTTP proxy. To enable it, set the socksProxyHost and socksProxyPort properties of the org.archive.modules.fetcher.FetchHTTP bean, as in the example below:

XML

<bean class="org.archive.modules.fetcher.FetchHTTP" id="fetchHttp">
    <!--  ... -->
    <property name="socksProxyHost" value="127.0.0.1"/>
    <property name="socksProxyPort" value="24000"/>
</bean>

Groovy

fetchHttp(FetchHTTP) {
    // ...
    socksProxyHost = '127.0.0.1'
    socksProxyPort = 24000
}

Configuring DNS over HTTP (DoH)

If the local DNS on the server running Heritrix is not able to resolve the DNS names of the crawled sites, e.g. because the server is sitting behind an enterprise firewall and can only resolve names of the local network, then using DNS-over-HTTP (DoH) might be an alternative to fetch DNS information.

To activate this, set the dnsOverHttpServer setting of the fetchDns bean to the URL of a DoH server. If a global proxy has been configured via the --proxy-host and --proxy-port command line options, these proxy settings will be used to contact the DoH server as well. However, due to a limitation of the library in use, username and password information for the proxy is not supported. Also, any per-job proxy settings defined in the fetchHttp bean are not used when contacting the DoH server.
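
A sketch of the setting described above, assuming the fetchDns bean uses the class org.archive.modules.fetcher.FetchDNS. The server URL is a hypothetical placeholder and does not refer to a tested DoH server.

```xml
<bean id="fetchDns" class="org.archive.modules.fetcher.FetchDNS">
  <!-- ... other fetchDns properties ... -->
  <!-- hypothetical DoH endpoint; replace with a server known to work -->
  <property name="dnsOverHttpServer" value="https://doh.example.org/dns-query"/>
</bean>
```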

As the implementation relies on the corresponding client in the “dnsjava” library, which is currently labeled as experimental, this option comes with some limitations:

If you use Java 11 then due to a well known bug it will not close connections to the DoH server unless Heritrix shuts down. As the DoH server might not accept new connections after some limits while these connections are still open, it is not recommended to use this feature when running Heritrix with Java 11.
For other Java versions, the connection to the DoH server will be closed when the garbage collector runs. Depending on the garbage collector used, this will cause a delay of anything between a few seconds and several minutes before the connection is closed. Also note that if the garbage collector is explicitly triggered via the Heritrix UI, the -XX:-DisableExplicitGC option must be added to JAVA_OPTS for Java versions 13 and up, as otherwise this action has no effect.

Without making a recommendation the following DoH servers have been tested with the DoH feature:

However, servers implementing the official RFC 8484 specification unfortunately do not work with the current implementation. This includes, for example, the following server:

https://dns.digitale-gesellschaft.ch/dns-query

This limitation might be overcome by a newer version of the “dnsjava” library.
