Crawler configuration parameters

appId

The ID of the Algolia application where the crawler stores the records it extracts.

apiKey

API key for your targeted application.

indexPrefix

Prefix added to the names of all indices defined in the crawler’s configuration.

rateLimit

Number of concurrent tasks per second that can run for this configuration.

schedule

How often a complete crawl should be performed.
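
Taken together, these top-level settings might look like the following sketch of a configuration object. All values are placeholders, and the natural-language schedule string is only an assumed example format.

```ts
// Sketch of the top-level settings described above; every value is a placeholder.
const config = {
  appId: 'YOUR_APP_ID',    // application that receives the crawler's records
  apiKey: 'YOUR_API_KEY',  // API key for that application
  indexPrefix: 'crawler_', // prepended to every index name in this configuration
  rateLimit: 8,            // concurrent tasks per second for this configuration
  schedule: 'every 1 day', // how often a complete crawl runs (format assumed)
};
```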

startUrls

The crawler uses these URLs as entry points to start crawling.

sitemaps

URLs found in sitemaps are treated as startUrls for the crawler: they are used as starting points for the crawl.
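
As a sketch, the crawl entry points might be declared as follows; the URLs are placeholders.

```ts
const entryPoints = {
  startUrls: ['https://www.example.com/'],           // explicit entry points for the crawl
  sitemaps: ['https://www.example.com/sitemap.xml'], // URLs found in these sitemaps become startUrls
};
```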

ignoreRobotsTxtRules

When set to true, the crawler will ignore rules set in your robots.txt.

ignoreNoIndex

Whether the Crawler should extract records from a page whose robots meta tag contains noindex or none.

ignoreNoFollowTo

Whether the Crawler should follow links with the rel="nofollow" attribute and extract links from a page whose robots meta tag contains nofollow or none.

ignoreCanonicalTo

Whether the Crawler should extract records from a page that has a canonical URL specified.
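
The four robots-related switches are shown here as booleans in a small sketch; whether some of them also accept URL patterns is not covered here.

```ts
const robotsBehavior = {
  ignoreRobotsTxtRules: false, // false: respect the rules in robots.txt
  ignoreNoIndex: false,        // false: skip record extraction on pages marked noindex/none
  ignoreNoFollowTo: false,     // false: don't follow rel="nofollow" links
  ignoreCanonicalTo: false,    // false: skip record extraction on pages with a canonical URL
};
```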

extraUrls

URLs listed in extraUrls are treated as additional startUrls: they are used as extra starting points for the crawl.

maxDepth

Limits the processing of URLs to the specified depth, inclusive.

maxUrls

Limits the number of URLs your crawler can process.

saveBackup

Whether to save a backup of your production index before it is overwritten by the index generated during a crawl.

renderJavaScript

When true, all web pages are rendered with a headless Chrome browser, and the crawler uses the rendered HTML.
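
A sketch of the crawl-scope and rendering settings above; the numbers are arbitrary examples.

```ts
const crawlScope = {
  maxDepth: 5,             // process URLs up to this depth, inclusive
  maxUrls: 100000,         // upper bound on the number of URLs the crawler processes
  saveBackup: true,        // back up the production index before it is overwritten
  renderJavaScript: false, // true would render every page with headless Chrome first
};
```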

initialIndexSettings

Defines the settings for the indices that the crawler updates.
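
A sketch of initialIndexSettings, assuming the object is keyed by index name and that the values are regular Algolia index settings; the attribute names are illustrative.

```ts
const initialIndexSettings = {
  docs: {
    // Assumed: standard Algolia index settings applied when the index is created.
    searchableAttributes: ['title', 'content'],
    customRanking: ['desc(popularity)'],
  },
};
```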

exclusionPatterns

Tells the crawler which URLs to ignore or exclude.

ignoreQueryParams

Filters out specified query parameters from crawled URLs. This can help you avoid indexing duplicate URLs. You can use wildcards to pattern match.
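
Both settings take lists of patterns. The sketch below assumes glob-style wildcards for exclusionPatterns; for ignoreQueryParams, wildcard matching is stated above.

```ts
const urlFiltering = {
  exclusionPatterns: ['https://www.example.com/admin/**'], // URLs to ignore (glob syntax assumed)
  ignoreQueryParams: ['utm_*', 'ref'],                     // query parameters stripped from crawled URLs
};
```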

requestOptions

Modifies the behavior of all requests made by the crawler.

linkExtractor

Override the default logic used to extract URLs from pages.
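
As an illustrative sketch only: a custom linkExtractor might post-filter the URLs produced by the default logic. The ({ defaultExtractor }) signature is an assumption, not something documented on this page.

```ts
// Hypothetical signature: receives a defaultExtractor helper returning the default URL list.
const linkExtractor = ({ defaultExtractor }: { defaultExtractor: () => string[] }) => {
  // Keep only links that point into the /docs/ section.
  return defaultExtractor().filter((link) => link.includes('/docs/'));
};
```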

externalData

Defines the list of external data sources you want to use for this configuration and makes them available to your extractor function.

login

This property defines how the crawler acquires a session to access protected content.

safetyChecks

A configurable collection of safety checks to make sure the crawl was successful.
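
A sketch of one such check, limiting how many records may disappear between crawls before the new index is published; the option names here are assumptions.

```ts
const safetyChecks = {
  beforeIndexPublishing: {
    maxLostRecordsPercentage: 10, // assumed option: block publishing if more than 10% of records vanish
  },
};
```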

actions

Determines which web pages are translated into Algolia records and in what way.
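
An action pairs URL patterns with a target index and a record-extraction function. The sketch below assumes an action shape with indexName, pathsToMatch, and a recordExtractor receiving the page URL and a Cheerio-like $ helper; it is illustrative rather than a drop-in configuration.

```ts
const actions = [
  {
    indexName: 'docs',                                 // target index; indexPrefix is prepended to it
    pathsToMatch: ['https://www.example.com/docs/**'], // pages this action applies to (glob syntax assumed)
    recordExtractor: ({ url, $ }: { url: URL; $: any }) => [
      {
        objectID: url.href,       // every Algolia record needs an objectID
        title: $('title').text(), // example attributes pulled from the page
        content: $('main').text(),
      },
    ],
  },
];
```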

discoveryPatterns

Indicates additional web pages that the Crawler should visit.
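
A sketch, assuming the same glob-style URL patterns as elsewhere in the configuration:

```ts
const discoveryPatterns = ['https://www.example.com/categories/**']; // additional pages for the crawler to visit
```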

hostnameAliases

Defines mappings to replace given hostname(s).

pathAliases

Defines mappings to replace a given path in URLs for a given hostname.
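
A sketch of both alias maps; nesting pathAliases by hostname is an assumption.

```ts
const aliases = {
  hostnameAliases: {
    'dev.example.com': 'www.example.com', // replace the staging hostname with the production one
  },
  pathAliases: {
    'www.example.com': { '/old-docs': '/docs' }, // assumed shape: per-hostname path replacements
  },
};
```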

cache

Turns the crawler's cache on or off.
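
A sketch, assuming the cache is toggled with an enabled flag:

```ts
const cache = { enabled: true }; // assumed shape: keep the crawler's cache turned on
```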
