REST API
========

This manual describes the REST application programming interface (API) of the Heritrix Web crawler. This document is intended for application developers and administrators interested in controlling the Heritrix Web crawler through its REST API.

Any client that supports HTTPS can be used to invoke the Heritrix API. The examples in this document use the command line tool curl, which is typically found in most Unix environments. curl is available for many systems, including Windows.

Get Engine Status
~~~~~~~~~~~~~~~~~

.. http:get:: https://(heritrixhost):8443/engine

    Returns information about this instance of Heritrix such as version number, memory usage and the list of crawl jobs.

    **XML Example:**

    .. code:: bash

        curl -v -k -u admin:admin --anyauth --location -H "Accept: application/xml" https://localhost:8443/engine

    Response:

    .. code:: xml

        3.3.0-SNAPSHOT-2017-07-12T04:17:56Z
        69529904 589824000 2885681152
        /heritrix/jobs
        https://localhost:8443/engine/jobsdir/
        rescan add create
        myjob https://localhost:8443/engine/job/myjob false 0 false Unbuilt false
        /heritrix/jobs/myjob/crawler-beans.cxml
        https://localhost:8443/engine/jobdir/crawler-beans.cxml
        myjob
        Defaults (XML) Defaults (Groovy) myprofile

    **JSON Example:**

    .. code:: bash

        curl -v -k -u admin:admin --anyauth --location -H "Accept: application/json" https://localhost:8443/engine

    Response:

    .. code:: json

        {
          "availableActions": [
            "rescan",
            "add",
            "create"
          ],
          "heapReport": {
            "usedBytes": 69529904,
            "totalBytes": 589824000,
            "maxBytes": 2885681152
          },
          "jobsDirUrl": "https://localhost:8443/engine/jobsdir/",
          "heritrixVersion": "3.3.0-SNAPSHOT-2017-07-12T04:17:56Z",
          "jobsDir": "/heritrix/jobs",
          "jobs": [{
            "isProfile": false,
            "launchCount": 0,
            "statusDescription": "Unbuilt",
            "hasApplicationContext": false,
            "shortName": "myjob",
            "isLaunchInfoPartial": false,
            "url": "https://localhost:8443/engine/job/myjob",
            "key": "myjob",
            "primaryConfig": "/heritrix/jobs/myjob/crawler-beans.cxml",
            "primaryConfigUrl": "https://localhost:8443/engine/jobdir/crawler-beans.cxml"
          }]
        }

Create New Job
~~~~~~~~~~~~~~

.. http:post:: https://(heritrixhost):8443/engine [action=create]

    Creates a new crawl job from the selected profile. If no profile is supplied, ``Defaults (XML)`` is used. Built-in profiles include ``Defaults (XML)`` and ``Defaults (Groovy)``. Existing job profiles may also be selected by passing the profile job's short name.

    :form action: must be ``create``
    :form createpath: the name of the new job
    :form profile: optional profile name. Existing profiles use their job short names.

    **HTML Example:**

    .. code:: bash

        curl -v -d "createpath=myjob&action=create" -k -u admin:admin --anyauth --location \
            https://localhost:8443/engine

    To create a job from the Groovy defaults:

    .. code:: bash

        curl -v --data-urlencode "createpath=mygroovyjob" --data-urlencode "action=create" \
            --data-urlencode "profile=Defaults (Groovy)" -k -u admin:admin --anyauth --location \
            https://localhost:8443/engine

    **XML Example:**

    .. code:: bash

        curl -v -d "createpath=myjob&action=create" -k -u admin:admin --anyauth --location \
            -H "Accept: application/xml" https://localhost:8443/engine

    **JSON Example:**

    .. code:: bash

        curl -v -d "createpath=myjob&action=create" -k -u admin:admin --anyauth --location \
            -H "Accept: application/json" https://localhost:8443/engine

Add Job Directory
~~~~~~~~~~~~~~~~~

.. http:post:: https://(heritrixhost):8443/engine [action=add]

    Adds a new job directory to the Heritrix configuration. The directory must contain a cxml configuration file.

    :form action: must be ``add``
    :form addpath: the job directory to add

    **HTML Example:**

    .. code:: bash

        curl -v -d "action=add&addpath=/Users/hstern/job" -k -u admin:admin --anyauth --location https://localhost:8443/engine

    **XML Example:**

    .. code:: bash

        curl -v -d "action=add&addpath=/Users/hstern/job" -k -u admin:admin --anyauth --location -H "Accept: application/xml" https://localhost:8443/engine

    **JSON Example:**

    .. code:: bash

        curl -v -d "action=add&addpath=/Users/hstern/job" -k -u admin:admin --anyauth --location -H "Accept: application/json" https://localhost:8443/engine

Get Job Status
~~~~~~~~~~~~~~

.. http:get:: https://(heritrixhost):8443/engine/job/(jobname)

    Returns status information and statistics about the chosen job.

    **XML Example:**

    .. code:: bash

        curl -v -k -u admin:admin --anyauth --location -H "Accept: application/xml" https://localhost:8443/engine/job/myjob

    Response:

    .. code:: xml

        myjob FINISHED FINISHED Finished: FINISHED teardown 1 2020-04-01T02:07:42.531Z false
        /heritrix/jobs/myjob/crawler-beans.cxml
        https://localhost:8443/engine/job/myjob/jobdir/crawler-beans.cxml
        https://localhost:8443/engine/job/myjob/job/myjob
        2020-04-01T03:50:44.708Z INFO FINISHED 20200401020744
        2020-04-01T03:50:42.670Z INFO EMPTY 20200401020744
        2020-04-01T03:50:42.669Z INFO STOPPING 20200401020744
        3920 0 3920 0 0 0 0 0 2177235508 3920 2177235508 3920 2177235508 3920
        0.0 0.6354171124312226 0 344 0 0 NaN 0 0 6169176 1h42m49s176ms
        1 0 0 0 0 0 0 0 1 FINISH ... ... false false false true 549
        /heritrix/jobs/myjob/20200401020744/logs/alerts.log
        /heritrix/jobs/myjob/20200401020744/logs/crawl.log
        CrawlSummaryReport CrawlSummary ...
        66893400 589824000 2885681152

    **JSON Example:**

    .. code:: bash

        curl -v -k -u admin:admin --anyauth --location -H "Accept: application/json" https://localhost:8443/engine/job/myjob

    Response:

    .. code:: json

        {
          "uriTotalsReport": {
            "downloadedUriCount": 3920,
            "queuedUriCount": 2,
            "futureUriCount": 0,
            "totalUriCount": 3920
          },
          "reports": [
            {
              "shortName": "CrawlSummary",
              "className": "CrawlSummaryReport"
            }, // ...
          ],
          "rateReport": {
            "currentDocsPerSecond": 0.6354171124312226,
            "averageKiBPerSec": 344,
            "currentKiBPerSec": 0,
            "averageDocsPerSecond": 0
          },
          "lastLaunch": "2020-04-01T02:07:42.531Z",
          "frontierReport": {
            "lastReachedState": "FINISH",
            "totalQueues": 1,
            "queueReadiedCount": 0,
            "inProcessQueues": 0,
            "readyQueues": 0,
            "activeQueues": 0,
            "inactiveQueues": 0,
            "ineligibleQueues": 0,
            "retiredQueues": 0,
            "exhaustedQueues": 1,
            "snoozedQueues": 0
          },
          "sizeTotalsReport": {
            "notModifiedCount": 0,
            "total": 2177235508,
            "notModified": 0,
            "dupByHashCount": 0,
            "novelCount": 3920,
            "sizeOnDisk": 0,
            "totalCount": 3920,
            "dupByHash": 0,
            "novel": 2177235508,
            "warcNovelContentBytes": 2177235508,
            "warcNovelUrls": 3920
          },
          "checkpointFiles": [],
          "crawlLogTail": [
            // ...
          ],
          "primaryConfig": "/heritrix/jobs/myjob/crawler-beans.cxml",
          "primaryConfigUrl": "https://localhost:8443/engine/job/myjob/jobdir/crawler-beans.cxml",
          "crawlLogFileUrl": "jobdir/20200401020744/logs/crawl.log",
          "heapReport": {
            "usedBytes": 66893400,
            "totalBytes": 589824000,
            "maxBytes": 2885681152
          },
          "isRunning": false,
          "isLaunchable": false,
          "isLaunchInfoPartial": false,
          "alertLogFilePath": "/heritrix/jobs/myjob/20200401020744/logs/alerts.log",
          "alertLogFileUrl": "jobdir/20200401020744/logs/alerts.log",
          "availableActions": ["teardown"],
          "launchCount": 1,
          "isProfile": false,
          "loadReport": {
            "totalThreads": 0,
            "busyThreads": 0,
            "averageQueueDepth": 0,
            "deepestQueueDepth": 0,
            "congestionRatio": null
          },
          "crawlExitStatus": "FINISHED",
          "jobLogTail": [
            "2020-04-01T03:50:44.708Z INFO FINISHED 20200401020744",
            "2020-04-01T03:50:42.670Z INFO EMPTY 20200401020744",
            "2020-04-01T03:50:42.669Z INFO STOPPING 20200401020744"
          ],
          "url": "https://localhost:8443/engine/job/myjob/job/myjob",
          "elapsedReport": {
            "elapsedPretty": "1h42m49s176ms",
            "elapsedMilliseconds": 6169176
          },
          "statusDescription": "Finished: FINISHED",
          "configFiles": [
            // ...
          ],
          "hasApplicationContext": true,
          "alertCount": 549,
          "crawlLogFilePath": "/heritrix/jobs/myjob/20200401020744/logs/crawl.log",
          "shortName": "myjob",
          "crawlControllerState": "FINISHED"
        }

Build Job Configuration
~~~~~~~~~~~~~~~~~~~~~~~

.. http:post:: https://(heritrixhost):8443/engine/job/(jobname) [action=build]

    Builds the job configuration for the chosen job. It reads an XML descriptor file and uses Spring to build the Java objects that are necessary for running the crawl. Before a crawl can be run it must be built.

    :form action: must be ``build``

    **HTML Example:**

    .. code:: bash

        curl -v -d "action=build" -k -u admin:admin --anyauth --location https://localhost:8443/engine/job/myjob

    **XML Example:**

    .. code:: bash

        curl -v -d "action=build" -k -u admin:admin --anyauth --location -H "Accept: application/xml" https://localhost:8443/engine/job/myjob

    **JSON Example:**

    .. code:: bash

        curl -v -d "action=build" -k -u admin:admin --anyauth --location -H "Accept: application/json" https://localhost:8443/engine/job/myjob

Launch Job
~~~~~~~~~~

.. http:post:: https://(heritrixhost):8443/engine/job/(jobname) [action=launch]

    Launches a crawl job. The job can be launched in the "paused" state or the "unpaused" state. If launched in the "unpaused" state the job will immediately begin crawling.

    :form action: must be ``launch``
    :form checkpoint: optional field: If supplied, Heritrix will attempt to launch from a checkpoint. Should be the name of a checkpoint (e.g. ``cp00001-20180102121229``) or (since version 3.3) the special value ``latest``, which will automatically select the most recent checkpoint. If no ``checkpoint`` is specified (or if the ``latest`` checkpoint is requested and there are no valid checkpoints) a new crawl will be launched.

    **HTML Example:**

    .. code:: bash

        curl -v -d "action=launch" -k -u admin:admin --anyauth --location https://localhost:8443/engine/job/myjob

    **XML Example:**

    .. code:: bash

        curl -v -d "action=launch" -k -u admin:admin --anyauth --location -H "Accept: application/xml" https://localhost:8443/engine/job/myjob

    **JSON Example:**

    .. code:: bash

        curl -v -d "action=launch" -k -u admin:admin --anyauth --location -H "Accept: application/json" https://localhost:8443/engine/job/myjob

Rescan Job Directory
~~~~~~~~~~~~~~~~~~~~

.. http:post:: https://(heritrixhost):8443/engine [action=rescan]

    Rescans the main job directory and returns an HTML page containing all the job names. It also returns information about the jobs, such as the location of the job configuration file and the number of job launches.

    :form action: must be ``rescan``

    **HTML Example:**

    .. code:: bash

        curl -v -d "action=rescan" -k -u admin:admin --anyauth --location https://localhost:8443/engine

    **XML Example:**

    .. code:: bash

        curl -v -d "action=rescan" -k -u admin:admin --anyauth --location -H "Accept: application/xml" https://localhost:8443/engine

    **JSON Example:**

    .. code:: bash

        curl -v -d "action=rescan" -k -u admin:admin --anyauth --location -H "Accept: application/json" https://localhost:8443/engine

Pause Job
~~~~~~~~~

.. http:post:: https://(heritrixhost):8443/engine/job/(jobname) [action=pause]

    Pauses an unpaused job. No crawling will occur while a job is paused.

    :form action: must be ``pause``

    **HTML Example:**

    .. code:: bash

        curl -v -d "action=pause" -k -u admin:admin --anyauth --location https://localhost:8443/engine/job/myjob

    **XML Example:**

    .. code:: bash

        curl -v -d "action=pause" -k -u admin:admin --anyauth --location -H "Accept: application/xml" https://localhost:8443/engine/job/myjob

    **JSON Example:**

    .. code:: bash

        curl -v -d "action=pause" -k -u admin:admin --anyauth --location -H "Accept: application/json" https://localhost:8443/engine/job/myjob

Unpause Job
~~~~~~~~~~~

.. http:post:: https://(heritrixhost):8443/engine/job/(jobname) [action=unpause]

    Unpauses a paused job. Crawling will resume (or begin, in the case of a job launched in the paused state) if possible.

    :form action: must be ``unpause``

    **HTML Example:**

    .. code:: bash

        curl -v -d "action=unpause" -k -u admin:admin --anyauth --location https://localhost:8443/engine/job/myjob

    **XML Example:**

    .. code:: bash

        curl -v -d "action=unpause" -k -u admin:admin --anyauth --location -H "Accept: application/xml" https://localhost:8443/engine/job/myjob

    **JSON Example:**

    .. code:: bash

        curl -v -d "action=unpause" -k -u admin:admin --anyauth --location -H "Accept: application/json" https://localhost:8443/engine/job/myjob

Terminate Job
~~~~~~~~~~~~~

.. http:post:: https://(heritrixhost):8443/engine/job/(jobname) [action=terminate]

    Terminates a running job.

    :form action: must be ``terminate``

    **HTML Example:**

    .. code:: bash

        curl -v -d "action=terminate" -k -u admin:admin --anyauth --location https://localhost:8443/engine/job/myjob

    **XML Example:**

    .. code:: bash

        curl -v -d "action=terminate" -k -u admin:admin --anyauth --location -H "Accept: application/xml" https://localhost:8443/engine/job/myjob

    **JSON Example:**

    .. code:: bash

        curl -v -d "action=terminate" -k -u admin:admin --anyauth --location -H "Accept: application/json" https://localhost:8443/engine/job/myjob

Teardown Job
~~~~~~~~~~~~

.. http:post:: https://(heritrixhost):8443/engine/job/(jobname) [action=teardown]

    Removes the Spring code that is used to run the job. Once a job is torn down it must be rebuilt in order to run.

    :form action: must be ``teardown``

    **HTML Example:**

    .. code:: bash

        curl -v -d "action=teardown" -k -u admin:admin --anyauth --location https://localhost:8443/engine/job/myjob

    **XML Example:**

    .. code:: bash

        curl -v -d "action=teardown" -k -u admin:admin --anyauth --location -H "Accept: application/xml" https://localhost:8443/engine/job/myjob

    **JSON Example:**

    .. code:: bash

        curl -v -d "action=teardown" -k -u admin:admin --anyauth --location -H "Accept: application/json" https://localhost:8443/engine/job/myjob

Copy Job
~~~~~~~~

.. http:post:: https://(heritrixhost):8443/engine/job/(jobname) [copyTo]

    Copies an existing job configuration to a new job configuration. If the ``asProfile`` option is submitted with value ``on``, then the copy is a non-runnable profile. Profiles are listed as options when creating new jobs, and profiles with built-in names override the built-ins.

    :form copyTo: the name of the new job or profile configuration
    :form asProfile: whether to copy the job as a runnable configuration or as a non-runnable profile. The value ``on`` means the job will be copied as a profile. If omitted the job will be copied as a runnable configuration.

    **HTML Example:**

    .. code:: bash

        curl -v -d "copyTo=mycopy&asProfile=on" -k -u admin:admin --anyauth --location https://localhost:8443/engine/job/myjob

    **XML Example:**

    .. code:: bash

        curl -v -d "copyTo=mycopy&asProfile=on" -k -u admin:admin --anyauth --location -H "Accept: application/xml" https://localhost:8443/engine/job/myjob

    **JSON Example:**

    .. code:: bash

        curl -v -d "copyTo=mycopy&asProfile=on" -k -u admin:admin --anyauth --location -H "Accept: application/json" https://localhost:8443/engine/job/myjob

Delete Job
~~~~~~~~~~

.. http:post:: https://(heritrixhost):8443/engine/job/(jobname) [action=delete]

    Deletes an existing job. This removes everything related to the job (configuration, statistics, logs and results), including the whole job folder. Anything you want to keep must first be copied out of the job folder. Deleting a job is only possible if no active application context for the job exists, which is the case for new or unbuilt jobs. If a job has been built, it must first be torn down to allow deletion.

    :form action: must be ``delete``

    **HTML Example:**

    .. code:: bash

        curl -v -d "action=delete" -k -u admin:admin --anyauth --location https://localhost:8443/engine/job/myjob

    **XML Example:**

    .. code:: bash

        curl -v -d "action=delete" -k -u admin:admin --anyauth --location -H "Accept: application/xml" https://localhost:8443/engine/job/myjob

    **JSON Example:**

    .. code:: bash

        curl -v -d "action=delete" -k -u admin:admin --anyauth --location -H "Accept: application/json" https://localhost:8443/engine/job/myjob

Checkpoint Job
~~~~~~~~~~~~~~

.. http:post:: https://(heritrixhost):8443/engine/job/(jobname) [action=checkpoint]

    Checkpoints the chosen job. Checkpointing writes the current state of a crawl to the file system so that the crawl can be recovered if it fails.

    :form action: must be ``checkpoint``

    **HTML Example:**

    .. code:: bash

        curl -v -d "action=checkpoint" -k -u admin:admin --anyauth --location https://localhost:8443/engine/job/myjob

    **XML Example:**

    .. code:: bash

        curl -v -d "action=checkpoint" -k -u admin:admin --anyauth --location -H "Accept: application/xml" https://localhost:8443/engine/job/myjob

    **JSON Example:**

    .. code:: bash

        curl -v -d "action=checkpoint" -k -u admin:admin --anyauth --location -H "Accept: application/json" https://localhost:8443/engine/job/myjob

Execute Script in Job
~~~~~~~~~~~~~~~~~~~~~

.. http:post:: https://(heritrixhost):8443/engine/job/(jobname)/script

    Executes a script. The script can be written as Beanshell, ECMAScript, Groovy, or AppleScript.

    :form engine: the script engine to use. One of ``beanshell``, ``js``, ``groovy`` or ``AppleScriptEngine``.
    :form script: the script code to execute

    **HTML Example:**

    .. code:: bash

        curl -v -d "engine=beanshell&script=System.out.println%28%22test%22%29%3B" -k -u admin:admin --anyauth --location https://localhost:8443/engine/job/myjob/script

    **XML Example:**

    .. code:: bash

        curl -v -d "engine=beanshell&script=System.out.println%28%22test%22%29%3B" -k -u admin:admin --anyauth --location -H "Accept: application/xml" https://localhost:8443/engine/job/myjob/script

    **JSON Example:**

    .. code:: bash

        curl -v -d "engine=beanshell&script=System.out.println%28%22test%22%29%3B" -k -u admin:admin --anyauth --location -H "Accept: application/json" https://localhost:8443/engine/job/myjob/script

Submitting a CXML Job Configuration File
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. http:put:: https://(heritrixhost):8443/engine/job/(jobname)/jobdir/crawler-beans.cxml

    Submits the contents of a CXML file for a chosen job. CXML files are the configuration files used to control a crawl job. Each job has a single CXML file.

    **Example:**

    .. code:: bash

        curl -v -T my-crawler-beans.cxml -k -u admin:admin --anyauth --location https://localhost:8443/engine/job/myjob/jobdir/crawler-beans.cxml

    :statuscode 200: On success, the Heritrix REST API will return an HTTP 200 with no body.
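
Because a successful upload returns a 200 with no body, a script has to look at the status code to know whether the configuration was accepted. The following is a minimal sketch of one way to do that, not part of Heritrix itself. The function names ``cxml_url`` and ``put_cxml`` and the variables ``HERITRIX_URL`` and ``HERITRIX_AUTH`` are hypothetical, and the defaults simply mirror the host and credentials used in the examples above.

.. code:: bash

    # Hypothetical settings; the defaults mirror the examples in this manual.
    HERITRIX_URL="${HERITRIX_URL:-https://localhost:8443/engine}"
    HERITRIX_AUTH="${HERITRIX_AUTH:-admin:admin}"

    # Print the URL of the CXML file for the job named in $1.
    cxml_url() {
        printf '%s/job/%s/jobdir/crawler-beans.cxml\n' "$HERITRIX_URL" "$1"
    }

    # Upload the local file $2 as the configuration of job $1.
    # Succeeds only if curl itself succeeds and the final status is 200.
    put_cxml() {
        status=$(curl -s -o /dev/null -w '%{http_code}' -T "$2" -k \
            -u "$HERITRIX_AUTH" --anyauth --location "$(cxml_url "$1")") || return 1
        [ "$status" = "200" ]
    }

With these definitions, ``put_cxml myjob my-crawler-beans.cxml && echo uploaded`` prints ``uploaded`` only when the server answers with a 200. The sketch uses ``-s`` instead of ``-v`` so that only the status code is captured.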
Conventions and Assumptions
~~~~~~~~~~~~~~~~~~~~~~~~~~~

The following curl parameters are used when invoking the API.

+----------------+----------------------------------------------------+
| curl Parameter | Description                                        |
+================+====================================================+
| -v             | Verbose. Output a detailed account of the curl     |
|                | command to standard out.                           |
+----------------+----------------------------------------------------+
| -d             | Data. These are the name/value pairs that are      |
|                | sent in the body of a POST.                        |
+----------------+----------------------------------------------------+
| -k             | Insecure. Allows connections to SSL sites without  |
|                | verifying their certificates.                      |
+----------------+----------------------------------------------------+
| -u             | User. Allows the submission of a username and      |
|                | password to authenticate the HTTP request.         |
+----------------+----------------------------------------------------+
| --anyauth      | Any authentication type. Allows authentication of  |
|                | the request based on any type of authentication    |
|                | method.                                            |
+----------------+----------------------------------------------------+
| --location     | Follows HTTP redirects. This option is used so     |
|                | that API calls that return data (such as HTML)     |
|                | will not halt upon receipt of a redirect code      |
|                | (such as an HTTP 303).                             |
+----------------+----------------------------------------------------+
| -H             | Set the value of an HTTP header. For example,      |
|                | "Accept: application/xml".                         |
+----------------+----------------------------------------------------+

It is assumed that the reader has a working knowledge of the HTTP protocol and Heritrix functionality. The examples also assume that Heritrix is run with an administrative username and password that are both ``admin``.
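
Since every POST example repeats the same options, a client script may prefer to collect them in one place. The following is a minimal sketch under stated assumptions, not an official client. The function name ``heritrix_post`` and the variables ``HERITRIX_URL``, ``HERITRIX_AUTH`` and ``HERITRIX_DRY_RUN`` are hypothetical. Form fields are passed with ``--data-urlencode`` so that values containing spaces or parentheses do not need manual escaping.

.. code:: bash

    # Hypothetical settings; the defaults mirror the examples in this manual.
    HERITRIX_URL="${HERITRIX_URL:-https://localhost:8443/engine}"
    HERITRIX_AUTH="${HERITRIX_AUTH:-admin:admin}"

    # POST form fields using the curl options described in the table above.
    #   $1    path below the engine URL ("" for the engine itself)
    #   $2    value of the Accept header
    #   $3... form fields written as name=value
    # Written for plain POSIX sh, so the helper variables are global.
    heritrix_post() {
        path="$1"; accept="$2"; shift 2
        # Rotate through the fields, re-adding each behind --data-urlencode.
        n=$#
        while [ "$n" -gt 0 ]; do
            set -- "$@" --data-urlencode "$1"
            shift
            n=$((n - 1))
        done
        # When HERITRIX_DRY_RUN is non-empty, print the command instead of running it.
        ${HERITRIX_DRY_RUN:+echo} curl -v -k -u "$HERITRIX_AUTH" --anyauth --location \
            -H "Accept: $accept" "$@" "$HERITRIX_URL$path"
    }

For example, ``heritrix_post /job/myjob application/json action=build`` sends the same request as the JSON example for building a job. Setting ``HERITRIX_DRY_RUN=1`` first prints the command without contacting the server; the printed form drops the shell quoting, so it is meant for inspection rather than copy and paste.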
About the REST implementation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Representational State Transfer (REST) is a software architecture for distributed hypermedia systems such as the World Wide Web (WWW). REST is built on the concept of representations of resources. Resources can be any coherent and meaningful concept that may be addressed. Each resource is identified by a URI. The representation of the resource is typically a document that captures the current or intended state of the resource. An example of a representation of a resource is an HTML page.

Heritrix uses REST to expose its functionality. The REST implementation used by Heritrix is Restlet. Restlet implements the concepts defined by REST, including resources and representations. It also provides a REST container that processes RESTful requests. The container is the Noelios Restlet Engine. For detailed information on Restlet, visit http://www.restlet.org/.

Heritrix exposes its REST functionality through HTTPS. The HTTPS protocol is used to send requests to retrieve or modify configuration settings and manage crawl jobs.