linkExtractor
Type: function
Parameter syntax
linkExtractor: ({ $, url, defaultExtractor }) => {
  ...
  // return ['https://...']
}

About this parameter

Override the default logic used to extract URLs from pages.

By default, we queue all URLs that comply with pathsToMatch, fileTypesToMatch, and exclusions. You can override this default logic by providing a custom function that runs on each crawled page and returns the URLs to queue.

The expected return value is an array of URLs (as strings).

Examples

  {
    linkExtractor: ({ $, url, defaultExtractor }) => {
      if (/example\.com\/doc\//.test(url.href)) {
        // For all pages under /doc, only queue the first found link
        return defaultExtractor().slice(0,1);
      }
      // Otherwise, use the default logic (queue all found links)
      return defaultExtractor();
    },
  }
{
  linkExtractor: ({ $, url, defaultExtractor }) => {
    // This turns off link discovery, except for URLs listed in sitemap.xml
    return /sitemap\.xml/.test(url.href) ? defaultExtractor() : [];
  },
}
{
  linkExtractor: ({ $ }) => {
    // Access the DOM and extract what you specify
    return [$('.my-link').attr('href')];
  },
}

Parameters

url
type: URL
Optional

URL of the resource that was just crawled.

defaultExtractor
type: function
Optional

Default function used internally by the Crawler to discover URLs from a resource’s content. It returns an array of strings containing all URLs found on the current resource that match pathsToMatch, fileTypesToMatch, and exclusions.
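Because defaultExtractor already applies your matching configuration, a common pattern is to post-filter its output rather than re-implement extraction. The sketch below keeps only links on the same origin as the crawled page; sameOriginOnly is a hypothetical helper name, not part of the Crawler API.

```javascript
// Sketch, assuming defaultExtractor() returns absolute URL strings.
// Keeps only links whose origin matches the crawled page's origin.
function sameOriginOnly({ url, defaultExtractor }) {
  return defaultExtractor().filter(
    (link) => new URL(link).origin === url.origin
  );
}
```

In a config this could be used directly as `linkExtractor: sameOriginOnly`, since it accepts the same destructured arguments.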

$
type: object (Cheerio instance)
Optional

A Cheerio instance containing the HTML of the crawled page.
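When extracting hrefs yourself with $, note that they may be relative, while the Crawler expects URL strings. One way to handle this is to resolve each href against the page's url with the URL constructor, as in this sketch (resolveLinks is a hypothetical helper, not part of the Crawler API):

```javascript
// Sketch: resolve possibly-relative hrefs against the crawled page's URL,
// dropping empty or missing values.
function resolveLinks(hrefs, baseUrl) {
  return hrefs
    .filter(Boolean)
    .map((href) => new URL(href, baseUrl).href);
}

// Possible use inside a Crawler config:
// linkExtractor: ({ $, url }) =>
//   resolveLinks($('a[href]').map((i, el) => $(el).attr('href')).get(), url),
```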
