Operating Heritrix
Running Heritrix
To launch Heritrix with the Web UI enabled, enter the following command. The username and password for the Web UI are set to “admin” and “admin”, respectively.
$HERITRIX_HOME/bin/heritrix -a admin:admin
By default, the Web UI listening address is only bound to the ‘localhost’ address. Therefore, the Web UI can only be accessed on the same machine from which it was launched. The ‘-b’ option may be used to listen on different/additional addresses. See Security Considerations before changing this default.
Command-line Options
- -a, --web-admin ARG
(Required) Sets the username and password required to access the Web UI. The argument may be a USERNAME:PASSWORD string, such as admin:admin. If the argument begins with "@", the rest of the string is interpreted as a local file name containing the operator login and password.
- -b, --web-bind-hosts HOST
Specifies a comma-separated list of hostnames/IP addresses for the Web UI to bind to. You may use '/' as shorthand for 'all addresses'. Default: localhost/127.0.0.1
- -c, --checkpoint ARG
Recovers from the given checkpoint. May only be used with the --run-job option. The special value 'latest' will recover the latest checkpoint, or, if none exists, will launch a new crawl.
- -j, --job-dirs PATH
Sets the directory Heritrix stores jobs in. Default: $HERITRIX_HOME/jobs
- -l, --logging-properties PATH
Reads logging configuration from a file. Default: $HERITRIX_HOME/conf/logging.properties
- -p, --web-port PORT
Sets the port the Web UI will listen on. Default: 8443
- -r, --run-job JOBNAME
Runs the given Job when Heritrix starts. Heritrix will exit when the job finishes.
- -s, --ssl-params ARG
Specifies a keystore path, keystore password, and key password for HTTPS use. Separate the values with commas and do not include whitespace. By default Heritrix will generate a self-signed certificate the first time it is run.
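For illustration, several of these options can be combined in one launch command; the port, bind address, and jobs path below are example values, not required defaults:
$HERITRIX_HOME/bin/heritrix -a admin:admin -p 9443 -b / -j /data/heritrix/jobs
This starts Heritrix on port 9443, bound to all addresses (see Security Considerations before binding to a public address), with jobs stored under /data/heritrix/jobs.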
Environment Variables
The Heritrix launch script ./bin/heritrix obeys the following environment variables:
- FOREGROUND
Set to any value (e.g. 'true') to run Heritrix in the foreground.
- JAVA_HOME
Directory where Java is installed.
- JAVA_OPTS
Additional options to pass to the JVM. For example, specify -Xmx1024M to allocate 1GB of memory to Heritrix.
- HERITRIX_HOME
Directory where Heritrix is installed.
- HERITRIX_OUT
Path messages will be logged to when running in background mode. Default: $HERITRIX_HOME/heritrix_out.log
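As a minimal sketch, these variables can be set inline when invoking the launch script; the heap size is only an example:
FOREGROUND=true JAVA_OPTS=-Xmx1024M $HERITRIX_HOME/bin/heritrix -a admin:admin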
Running Heritrix under Docker
Heritrix can also be run under Docker. The Web UI is enabled by default and exposed via port 8443. The following command creates and runs a detached container with the Web UI username and password set to "admin" and "admin", respectively. It also maps the local jobs directory into the container. Please ensure that mounted directories already exist and have the correct permissions!
mkdir jobs
docker run --init --rm -d -p 8443:8443 -e "USERNAME=admin" -e "PASSWORD=admin" -v $(pwd)/jobs:/opt/heritrix/jobs iipc/heritrix
See Security Considerations about securely running Heritrix.
Configurations
To allow Heritrix to be run within a container, the environment variables (FOREGROUND, JAVA_HOME, HERITRIX_HOME) are already set and should not be changed. The Heritrix command-line options can't be accessed directly and are only exposed via environment variables:
- USERNAME, PASSWORD
(Required) The Web UI username and password. Will be forwarded to -a, --web-admin ARG.
- CREDSFILE
If either USERNAME or PASSWORD is not set or is empty, CREDSFILE can alternatively be used to supply the Web UI credentials. It should be a path within the Heritrix container, which can be used to bind-mount local files or Docker secrets. See the "@" description for -a, --web-admin ARG.
- JOBNAME
Forwards the job name to the -r, --run-job JOBNAME command-line option, to run a single job and then quit. Note that your container should not have a restart policy set to automatically restart on exit.
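A hedged sketch of using CREDSFILE instead of USERNAME/PASSWORD; the local file name, the mount path inside the container, and the USERNAME:PASSWORD file format are assumptions based on the "@" description for -a, --web-admin ARG:
echo 'admin:s3cretExample' > heritrix.creds
docker run --init --rm -d -p 8443:8443 -e "CREDSFILE=/run/secrets/heritrix.creds" -v $(pwd)/heritrix.creds:/run/secrets/heritrix.creds:ro -v $(pwd)/jobs:/opt/heritrix/jobs iipc/heritrix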
Security Considerations
Heritrix is a large and active network application with security implications, both on the local machine where it runs and remotely on the machines it contacts.
Understanding the Risks
It is important to recognize that the Web UI allows remote control of the crawler in ways that could potentially disrupt a crawl, change the crawler’s behavior, read or write locally-accessible files, and perform or trigger other actions in the Java VM or local machine by the execution of arbitrary operator-supplied scripts.
Unauthorized access to the Web UI could end or corrupt a crawl. It could also change the crawler’s behavior to be a nuisance to other network hosts. Files accessible to the crawler process could potentially be deleted, corrupted, or replaced, which could cause extensive problems on the crawling machine.
Another potential risk is that worst-case or maliciously-crafted content, in conjunction with crawler issues, could disrupt the crawl or other files and operations on the local system. For example, in the past, without malicious intent, some rich-media content has caused runaway memory use in third-party libraries used by the crawler. This resulted in memory-exhaustion that stopped and corrupted the crawl in progress. Similarly, atypical input patterns have caused runaway CPU use by crawler link-extraction regular expressions, causing severely slow crawls. Crawl operators should monitor their crawls closely and use the project discussion list and issue database to stay current on crawler issues.
Network Access Control
Launched without any specified bind-address (‘-b’ flag), the crawler’s Web UI only binds to the localhost/loopback address (127.0.0.1), and therefore is only network-accessible from the same machine on which it was launched.
If practical, this default setting should be maintained. A technique such as SSH tunneling could be used by authorized users of the crawling machine to enable Web access from their local machine to the crawling machine.
For example, consider Heritrix running on a machine 'crawler.example.com', with its Web UI only listening/bound on its localhost address. Assuming a user named 'crawloperator' has SSH access to 'crawler.example.com', she can issue the following SSH command from her local machine:
ssh -L localhost:9999:localhost:8443 crawloperator@crawler.example.com -N
This tells SSH to open a tunnel which forwards connections to "localhost:9999" (on the local machine) to the remote machine's own idea of "localhost:8443". As a result, the crawler's Web UI will be available via "https://localhost:9999/" for as long as the tunnel exists (until the ssh command is killed or the connection is otherwise broken). No one else on the network may directly connect to port 8443 on 'crawler.example.com' (since it is only listening on the local loopback address), and no one elsewhere on the net may directly connect to the operator's port 9999 (since it also is only listening on the local loopback address).
If you need Heritrix’s listening port bound to a public address, the ‘-b’ command-line flag may be used. This flag takes, as an argument, the hostname/address to use. The ‘/’ character can be used to indicate all addresses.
If you use this option, take special care to choose login credentials that are unique, hard to guess, and resistant to brute-force search. You may still want to consider using other network/firewall policies to block access from unauthorized origins.
Login Authentication Access Control
The administrative login and password only offer rudimentary protection against unauthorized access. For best security, you should be sure to:
- Use a strong, unique username and password combination to secure the Web UI. Heritrix uses HTTPS to encrypt communication between the client and the Web UI. Keep in mind that setting the username and password on the command line may make their values visible to other users of the crawling machine, for example via the output of a tool like 'ps' that shows the command lines used to launch processes. Prior to Heritrix 3.1 these values were also echoed in plain text to heritrix_out.log for operator reference; as of Heritrix 3.1, the administrative username and password are no longer echoed to heritrix_out.log. Also, if the parameter supplied to the -a command-line option is a string beginning with "@", the rest of the string is interpreted as a local file name containing the operator login and password. Thus, the credentials are not visible to other users of the machine via the process listing (ps) command.
- Launch the Heritrix-hosting Java VM with a user account that has the minimum privileges necessary for operating the crawler. This will limit the damage in the event that the Web UI is accessed maliciously.
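A minimal sketch of the "@" credentials-file approach; the file path and credential value are examples, and the USERNAME:PASSWORD file format is an assumption based on the description of -a, --web-admin ARG above:
echo 'operator:aStrongUniquePassword' > $HERITRIX_HOME/admin.creds
chmod 600 $HERITRIX_HOME/admin.creds
$HERITRIX_HOME/bin/heritrix -a @$HERITRIX_HOME/admin.creds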
Log Files
Each crawl job has its own set of log files, found in the logs subdirectory of a job launch directory.
Logging can be configured by modifying the logging.properties file located under the $HERITRIX_HOME/conf directory.
alerts.log
This log contains alerts that indicate problems with a crawl.
crawl.log
Each URI that Heritrix attempts to fetch will cause a log line to be written to the crawl.log file. Below is a two-line extract from the log.
2011-06-23T17:12:08.802Z 200 1299 http://content-5.powells.com/robots.txt LREP http://content-5.powells.com/cgi-bin/imageDB.cgi?isbn=9780385518635 text/plain #014 20110623171208574+225 sha1:YIUOKDGOLGI5JYHDTXRFFQ5FF4N2EJRV - -
2011-06-23T17:12:09.591Z 200 15829 http://www.identitytheory.com/etexts/poetics.html L http://www.identitytheory.com/ text/html #025 20110623171208546+922 sha1:7AJUMSDTOMT4FN7MBFGGNJU3Z56MLCMW - -
- Field 1. Timestamp
The timestamp in ISO8601 format, to millisecond resolution. The time is the instant of logging.
- Field 2. Fetch Status Code
Usually this is the HTTP response code but it can also be a negative number if URI processing was unexpectedly terminated.
- Field 3. Document Size
The size of the downloaded document in bytes. For HTTP, this is the size of content only. The size excludes the HTTP response headers. For DNS, the size field is the total size for the DNS response.
- Field 4. Downloaded URI
The URI of the document downloaded.
- Field 5. Discovery Path
The breadcrumb codes (discovery path) showing the trail of downloads that led to the downloaded URI. The length of the discovery path is limited to the last 50 hop-types. For example, a 62-hop path might appear as “12+LLRLLLRELLLLRLLLRELLLLRLLLRELLLLRLLLRELLLLRLLLRELE”.
The breadcrumb codes are as follows.
R
Redirect
E
Embed
X
Speculative embed (aggressive/Javascript link extraction)
L
Link
P
Prerequisite (as for DNS or robots.txt before another URI)
- Field 6. Referrer
The URI that immediately preceded the downloaded URI. This is the referrer. Both the discovery path and the referrer will be empty for seed URIs.
- Field 7. Mime Type
The downloaded document mime type.
- Field 8. Worker Thread ID
The id of the worker thread that downloaded the document.
- Field 9. Fetch Timestamp
The timestamp in RFC2550/ARC condensed digits-only format indicating when the network fetch was started. If appropriate the millisecond duration of the fetch is appended to the timestamp with a “;” character as separator.
- Field 10. SHA1 Digest
The SHA1 digest of the content only (headers are not digested).
- Field 11. Source Tag
The source tag inherited by the URI, if source tagging is enabled.
- Field 12. Annotations
If an annotation has been set, it will be displayed. Possible annotations include: the number of times the URI was tried; the literal “lenTrunc” if the download was truncated due to exceeding configured size limits; the literal “timeTrunc” if the download was truncated due to exceeding configured time limits; or “midFetchTrunc” if a midfetch filter determined the download should be truncated.
- Field 13. WARC Filename
The name of the WARC/ARC file to which the crawled content is written. This value will only be written if the logExtraInfo property of the loggerModule bean is set to true. This logged information will be written in JSON format.
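Because the fields are whitespace-separated, the log is easy to summarize with standard shell tools. As an informal example (not a Heritrix feature), this tallies fetch status codes (field 2), assuming it is run from a job launch directory:
awk '{print $2}' logs/crawl.log | sort | uniq -c | sort -rn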
progress-statistics.log
This log is written by the StatisticsTracker bean. At configurable intervals, a log line detailing the progress of the crawl is written to this file.
- Field 1. timestamp
Timestamp in ISO8601 format indicating when the log line was written.
- Field 2. discovered
Number of URIs discovered to date.
- Field 3. queued
Number of URIs currently queued.
- Field 4. downloaded
Number of URIs downloaded to date.
- Field 5. doc/s(avg)
Number of documents downloaded per second since the last snapshot. The value in parentheses is measured since the crawl began.
- Field 6. KB/s(avg)
Amount in kilobytes downloaded per second since the last snapshot. The value in parentheses is measured since the crawl began.
- Field 7. dl-failures
Number of URIs that Heritrix has failed to download.
- Field 8. busy-thread
Number of toe threads busy processing a URI.
- Field 9. mem-use-KB
Amount of memory in use by the Java Virtual Machine.
- Field 10. heap-size-KB
The current heap size of the Java Virtual Machine.
- Field 11. congestion
The congestion ratio is a rough estimate of how much initial capacity, as a multiple of current capacity, would be necessary to crawl the current workload at the maximum rate available given politeness settings. This value is calculated by comparing the number of internal queues that are progressing against those that are waiting for a thread to become available.
- Field 12. max-depth
The size of the Frontier queue with the largest number of queued URIs.
- Field 13. avg-depth
The average size of all the Frontier queues.
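As an informal example (not a Heritrix feature), the field positions above can be used to watch queue growth from the shell, assuming the command is run from a job launch directory; it prints the timestamp, queued, and downloaded columns:
awk '{print $1, $3, $4}' logs/progress-statistics.log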
runtime-errors.log
This log captures unexpected exceptions and errors that occur during the crawl. Some may be due to hardware limitations (out of memory, although that error may occur without being written to this log), but most are probably due to software bugs, either in Heritrix’s core or, more likely, in one of its pluggable classes.
uri-errors.log
This log stores errors that resulted from attempted URI fetches. Usually the cause is non-existent URIs. This log is usually only of interest to advanced users trying to explain unexpected crawl behavior.
Reports
Reports are found in the “reports” directory, which exists under the directory of a specific job launch.
Crawl Summary (crawl-report.txt)
This file contains useful metrics about completed jobs. The report is created by the StatisticsTracker bean. This file is written at the end of the crawl.
Below is sample output from this file:
Crawl Name: basic
Crawl Status: Finished
Duration Time: 1h33m38s651ms
Total Seeds Crawled: 1
Total Seeds not Crawled: 0
Total Hosts Crawled: 1
Total URIs Processed: 1337
URIs Crawled successfully: 1337
URIs Failed to Crawl: 0
URIs Disregarded: 0
Processed docs/sec: 0.24
Bandwidth in Kbytes/sec: 4
Total Raw Data Size in Bytes: 23865329 (23 MB)
Novel Bytes: 23877375 (23 MB)
- Crawl Name
The user-defined name of the crawl.
- Crawl Status
The status of the crawl, such as “Aborted” or “Finished.”
- Duration Time
The duration of the crawl to the nearest millisecond.
- Total Seeds Crawled
The number of seeds that were successfully crawled.
- Total Seeds Not Crawled
The number of seeds that were not successfully crawled.
- Total Hosts Crawled
The number of hosts that were crawled.
- Total URIs Processed
The number of URIs that were processed.
- URIs Crawled Successfully
The number of URIs that were crawled successfully.
- URIs Failed to Crawl
The number of URIs that could not be crawled.
- URIs Disregarded
The number of URIs that were not selected for crawling.
- Processed docs/sec
The average number of documents processed per second.
- Bandwidth in Kbytes/sec
The average number of kilobytes processed per second.
- Total Raw Data Size in Bytes
The total amount of data crawled.
- Novel Bytes
New bytes since last crawl.
Seeds (seeds-report.txt)
This file contains the crawling status of each seed.
This file is created by the StatisticsTracker bean and is written at the end of the crawl.
Below is sample output from this report:
[code] [status] [seed] [redirect]
200 CRAWLED http://www.smokebox.net
- code
Status code for the seed URI
- status
Human readable description of whether the seed was crawled. For example, “CRAWLED.”
- seed
The seed URI.
- redirect
The URI to which the seed redirected.
Hosts (hosts-report.txt)
This file contains an overview of the hosts that were crawled. It also displays the number of documents crawled and the bytes downloaded per host.
This file is created by the StatisticsTracker bean and is written at the end of the crawl.
Below is sample output from this file:
1337 23877316 www.smokebox.net 0 0
1 59 dns: 0 0
0 0 dns: 0 0
- #urls
The number of URIs crawled for the host.
- #bytes
The number of bytes crawled for the host.
- host
The hostname.
- #robots
The number of URIs, for this host, excluded because of robots.txt restrictions. This number does not include linked URIs from the specifically excluded URIs.
- #remaining
The number of URIs, for this host, that have not been crawled yet, but are in the queue.
- #novel-urls
The number of new URIs crawled for this host since the last crawl.
- #novel-bytes
The amount of new bytes crawled for this host since the last crawl.
- #dup-by-hash-urls
The number of URIs, for this host, that had the same hash code and are essentially duplicates.
- #dup-by-hash-bytes
The number of bytes of content, for this host, having the same hashcode.
- #not-modified-urls
The number of URIs, for this host, that returned a 304 status code.
- #not-modified-bytes
The number of bytes of content, for this host, whose URIs returned a 304 status code.
Mimetypes (mimetype-report.txt)
This file contains a report displaying the number of documents downloaded per mime type. Also, the amount of data downloaded per mime type is displayed.
This file is created by the StatisticsTracker bean and is written at the end of the crawl.
Below is sample output from this report:
624 13248443 image/jpeg
450 8385573 text/html
261 2160104 image/gif
1 74708 application/x-javascript
1 59 text/dns
1 8488 text/plain
- #urls
The number of URIs crawled for a given mime-type.
- #bytes
The number of bytes crawled for a given mime-type.
- mime-types
The mime-type.
ResponseCode (responsecode-report.txt)
This file contains a report displaying the number of documents downloaded per status code. It covers successful codes only. For failure codes see the crawl.log file.
This file is created by the StatisticsTracker bean and is written at the end of the crawl.
Below is sample output from this report:
[#urls] [rescode]
1306 200
31 404
1 1
- #urls
The number of URIs crawled for a given response code.
- rescode
The response code.
Processors (processors-report.txt)
This report shows the activity of each processor involved in the crawl.
For example, the FetchHTTP processor is included in the report. For this processor the number of URIs fetched is displayed. The report is organized by chain (Candidate, Fetch, and Disposition) and lists each processor in each chain, in the order they appear in the crawler-beans.cxml configuration file.
Below is sample output from this report:
CandidateChain - Processors report - 200910300032
Number of Processors: 2
Processor: org.archive.crawler.prefetch.CandidateScoper
Processor: org.archive.crawler.prefetch.FrontierPreparer
FetchChain - Processors report - 200910300032
Number of Processors: 9
Processor: org.archive.crawler.prefetch.Preselector
Processor: org.archive.crawler.prefetch.PreconditionEnforcer
Processor: org.archive.modules.fetcher.FetchDNS
Processor: org.archive.modules.fetcher.FetchHTTP
Function: Fetch HTTP URIs
CrawlURIs handled: 1337
Recovery retries: 0
Processor: org.archive.modules.extractor.ExtractorHTTP
Function: Extracts URIs from HTTP response headers
CrawlURIs handled: 1337 Links extracted: 0
Processor: org.archive.modules.extractor.ExtractorHTML
Function: Link extraction on HTML documents
CrawlURIs handled: 449
Links extracted: 6894
...
FrontierSummary (frontier-summary-report.txt)
This link displays a report showing the hosts that are queued for capture. The hosts are contained in multiple queues. The details of each Frontier queue are reported.
ToeThreads (threads-report.txt)
This link displays a report showing the activity of each thread used by Heritrix. The amount of time the thread has been running is displayed as well as thread state and thread Blocked/Waiting status.
Action Directory
Each job directory contains an action directory. By placing files in the action directory you can trigger actions in a running crawl job, such as the addition of new URIs to the crawl.
At a regular interval (by default less than a minute), the crawl will notice any new files in this directory, and take action based on their filename suffix and their contents. When the action is done, the file will be moved to the nearby ‘done’ directory. (For this reason, files should be composed outside the action directory, then moved there as an atomic whole. Otherwise, a file may be processed-and-moved while still being composed.)
The following file suffixes are supported:
- .seeds
A .seeds file should contain seeds that the Heritrix operator wants to include in the crawl. Placing a .seeds file in the action directory will add the seeds to the running crawl. The same directives as may be used in seeds-lists during initial crawl configuration may be used here.
If seeds introduced into the crawl this way are already in the frontier (perhaps because they were already seeds), this method does not force them to be rescheduled.
- .recover
A .recover file will be treated as a traditional recovery journal. (The recovery journal can approximately reproduce the state of a crawl’s queues and already-included set, by repeating all URI-completion and URI-discovery events. A recovery journal reproduces less state than a proper checkpoint.) In a first pass, all lines beginning with Fs in the recovery journal will be considered included, so that they cannot be enqueued again. Then, in a second pass, lines starting with F+ will be re-enqueued for crawling (if not precluded by the first pass).
- .include
A .include file will be treated as a recovery journal, but all URIs no matter what their line-prefix will be marked as already included, preventing them from being re-enqueued from that point on. (Already-enqueued URIs will still be eligible for crawling when they come up.) Using a .include file is a way to suppress the re-crawling of URIs.
- .schedule
A .schedule file will be treated as a recovery journal, but all URIs no matter what their line-prefix will be offered for enqueueing. (However, if they are recognized as already-included, they will not be enqueued.) Using a .schedule file is a way to include URIs in a running crawl by inserting them into the Heritrix crawling queues.
- .force
A .force file will be treated as a recovery journal with all the URIs marked for force scheduling. Using a .force file is a way to guarantee that already-included URIs will be re-enqueued (and thus eventually re-crawled).
Any of these files may be gzipped. Any of the files in recovery journal format (.recover, .include, .schedule, .force) may have a .s inserted prior to the functional suffix (for example, frontier.s.recover.gz), which will cause the URIs to be scope-tested before any other insertion occurs.
For example, you could place the following example.schedule file in the action directory to schedule a URL:
F+ http://example.com
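To follow the compose-outside-then-move advice above, the file can be written elsewhere and then moved into place in one step; mv is atomic when both paths are on the same filesystem. The job path below is only an example:
printf 'F+ http://example.com\n' > /opt/heritrix/jobs/myjob/example.schedule
mv /opt/heritrix/jobs/myjob/example.schedule /opt/heritrix/jobs/myjob/action/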
In order to use the action directory, the ActionDirectory
bean must be
configured in the crawler-beans.cxml
file as illustrated below.
<bean id="actionDirectory" class="org.archive.crawler.framework.ActionDirectory">
<property name="actionDir" value="action" />
<property name="initialDelaySeconds" value="10" />
<property name="delaySeconds" value="30" />
</bean>
The recovery journal directives are listed below:
- F+
Add
- Fe
Emit
- Fi
Include
- Fd
Disregard
- Fr
Re-enqueued
- Fs
Success
- Ff
Failure
Note that the recovery journal format’s ‘F+’ lines may include a ‘hops-path’ and ‘via URI’, which are preserved when a URI is enqueued via the above mechanisms, but that this may not be a complete representation of all URI state from its discovery in a normal crawl.
Checkpointing
Checkpointing a crawl job writes a representation of the current state of the job under the checkpoints directory, which can be used to restart the job from the same point.
Checkpointed state includes serialization of the main crawl job objects, copies of the current set of bdbje log files, and other files that represent the state of the crawl. The checkpoint directory contains all that is required to recover a crawl. Checkpointing also rotates the crawl logs, including the recover.gz log, if enabled. Log files are NOT copied to the checkpoint directory. They are left under the logs directory and are distinguished by a suffix. The suffix is the checkpoint name. For example, for checkpoint cp00001-20220930061713 the crawl log would be named crawl.log.cp00001-20220930061713.
To make checkpointing faster and reduce disk space usage, hard links are used, on systems that support them, to collect the BerkeleyDB-JE files required to reproduce the crawler’s state.
To run a checkpoint, click the checkpoint button on the job page of the WUI or invoke the checkpoint functionality through the REST API. While checkpointing, the crawl status will show as CHECKPOINTING. When the checkpoint has completed, the crawler will resume crawling, unless it was in the paused state when the checkpoint was invoked. In this case, the crawler will re-enter the paused state.
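For example, a checkpoint can be requested over the REST API with a command along these lines, assuming the default port 8443, admin:admin credentials, the self-signed certificate (hence -k), and a job named myjob:
curl -k -u admin:admin --anyauth --location -d "action=checkpoint" https://localhost:8443/engine/job/myjob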
Recovery from a checkpoint has much in common with the recovery of a crawl using the frontier.recover.gz journal.
Automated Checkpointing
To configure Heritrix to automatically run checkpoints periodically, set the
checkpointService.checkpointIntervalMinutes
property:
<bean id="checkpointService" class="org.archive.crawler.framework.CheckpointService">
<property name="checkpointIntervalMinutes" value="60"/>
<property name="checkpointOnShutdown" value="true"/>
<!-- <property name="checkpointsDir" value="checkpoints"/> -->
<property name="forgetAllButLatest" value="true"/>
</bean>
When checkpointOnShutdown is enabled, Heritrix will create a checkpoint if the job is running when the JVM is gracefully shut down. Note that if Heritrix is killed, crashes, or the server it is running on unexpectedly loses power, the shutdown checkpoint will not be created. Consequently, it may be ideal to enable both shutdown and interval checkpoints together.
Setting forgetAllButLatest will ensure only the latest checkpoint is kept.
Restarting from a Checkpoint
The web UI provides an option to restart a crawl from a checkpoint:
- Checkpoint the running crawl by clicking the “checkpoint” button.
- When the checkpoint ends (a message will be displayed informing the operator of this event), terminate the crawl by clicking the “terminate” button.
- Tear down the job by clicking the “teardown” button.
- Re-build the job by clicking the “build” button. At this point a dropdown box should appear under the command buttons. The dropdown box lists the names of the previously invoked checkpoints.
- Select a checkpoint from the dropdown. The selected checkpoint will be used to start the newly built job.
- Click the “launch” button.
- Click the “unpause” button.
The job will now begin running from the chosen checkpoint.
When running a job from the command line with the --run-job CLI option, you can use the --checkpoint option to restart the job from a named checkpoint. The special name latest will restart from the latest checkpoint if any exist; otherwise it will launch a new crawl.
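For example (the job name is illustrative):
$HERITRIX_HOME/bin/heritrix -a admin:admin --run-job myjob --checkpoint latest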
Crawl Recovery
During normal operation, the Heritrix Frontier keeps a journal. The journal is kept in the logs directory and is named frontier.recover.gz. If a crash occurs during a crawl, the frontier.recover.gz journal can be used to recreate the approximate status of the crawler at the time of the crash. In some cases, recovery may take an extended period of time, but it is usually much quicker than repeating the crashed crawl.
If using this process, you are starting an all-new crawl with your same (or modified) configuration, but this new crawl will take an extended detour at the beginning, during which it uses the prior crawl’s frontier.recover.gz output(s) to simulate the frontier status (discovered URIs, enqueued URIs) of the previous crawl. You should move aside all ARC/WARCs, logs, and checkpoints from the earlier crawl, retaining the logs and ARC/WARCs as a record of the crawl so far.
Any ARC/WARC files that exist with the .open
suffix were not properly
closed by the previous run, and may include corrupt/truncated data in
their last partial record. You may rename files with a .warc.gz.open
suffix to .warc.gz
, but consider validating such ARC/WARCs (by
zcat’ing the file to /dev/null to check gzip validity, or other ARC/WARC
tools for record completeness) before removing the “.open” suffix.
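A minimal sketch of that gzip-validity check for a hypothetical leftover file; if zcat reports an error, inspect the file before renaming it:
zcat FILE.warc.gz.open > /dev/null && mv FILE.warc.gz.open FILE.warc.gz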
Full recovery
To run the recovery process, relaunch the crashed crawler and copy the frontier.recover.gz file into the Action Directory. Then re-start the crawl. Heritrix will automatically load the recovery file and begin placing its URIs into the Frontier for crawling.
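For instance, assuming the previous crawl’s logs were set aside under a directory named old-crawl and the relaunched job lives at /opt/heritrix/jobs/myjob (both paths are examples):
cp old-crawl/logs/frontier.recover.gz /opt/heritrix/jobs/myjob/action/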
If using a .recover.gz
file, a single complete file must be used.
(This is so that the action directory processing of one file at a time
can do the complete first pass of ‘includes’, then the complete full
pass of ‘schedules’, from one file. Supplying multiple .recover.gz
files in series will result in an includes/schedules,
includes/schedules, etc. cycle which will not produce the desired effect
on the frontier.)
While the file is being processed, any checkpoints (manual or auto-periodic) will not be a valid snapshot of the crawler state. (The frontier-recovery log process happens via a separate thread/path outside the newer checkpointing system.) Only when the file processing is completed (file moved to ‘done’) will the crawler be in an accurately checkpointable state.
Once URIs start appearing in the queues (the recovery has entered the ‘schedules’ pass), the crawler may be unpaused to begin fetching URIs while the rest of the ‘schedules’ recovery pass continues. However, the above note about checkpoints still applies: only when the frontier-recovery file-processing is finished may an accurate checkpoint occur. Also, unpausing the crawl in this manner may result in some URIs being rediscovered via new paths before the original discovery is replayed via the recovery process. (Many crawls may not mind this slight deviation from the recovered crawl’s state, but if your scoping is very path- or hop-dependent it could make a difference in what is scope-included.)
Note
Feeding the entire frontier back to the crawler is likely to produce many “Problem line” warnings in the job log. Some operators find it useful to allow the entire recovery file to be ingested by the crawler before attempting to resume (unpause), to help isolate this chatter and to minimize generating duplicate crawl data during recovery.
Split Recovery
An alternate way to run the recovery process is illustrated below. By eliminating irrelevant lines early (outside the recovery process), it may allow the recovery process to complete more quickly than the standard process. It also allows the process to proceed from many files, rather than a single file, so it may give a better running indication of progress and more chances to checkpoint the recovery.
To run the alternate recovery process:
- Move aside prior logs and ARCs/WARCs as above.
- Relaunch the crashed crawler.
- Split any source frontier.recover.gz files using commands like the following:
zcat frontier.recover.gz | grep '^Fs' | gzip > frontier.include.gz
zcat frontier.recover.gz | grep '^F+' | gzip > frontier.schedule.gz
- Build and launch the previously failed job (with the same or adjusted configuration). The job will now be paused.
- Move the frontier.include.gz file(s) into the job’s action directory (see Action Directory above). (If you have many, you may move them all in at once, or in small batches to better monitor their progress. At any point when all previously-presented files are processed, that is, moved to the ‘done’ directory, it is possible to make a valid checkpoint.)
- You may watch the progress of this ‘includes’ phase by viewing the web UI or progress-statistics.log and seeing the discovered count rise.
- When all .include files are finished loading, repeat the process with all the .schedule files.
- When you notice a large number (many thousands) of URIs in the queued count, you may unpause the crawl to let new crawling proceed in parallel with the enqueueing of older URIs.
You may drop all .include and .schedule files into the action directory before launch, if you are confident that the lexicographic ordering of their names will do the right thing (present all .include files first, and the .schedule files in the same order as the original crawl). But that leaves little opportunity to adjust or checkpoint the process: the action directory will discover them all and process them all in one tight loop.
Note
To be sure of success and of the current crawl status in the face of any possible IO/format errors, in large recoveries of millions of records you may want to wait for each step to complete before moving another file in or unpausing the job. Instead of watching progress-statistics.log, simply wait for the file to move from action to action/done. Then add the second file and wait again. Finally, unpause the crawler.
A recovery of 100M URIs may take days, so please be patient.