Jan. 17, 2022

Crawler: Actions

Type: Action[]

Required

Parameter syntax

{
  actions: [
    {
      indexName: 'index_name',
      pathsToMatch: ['url_path', ...],
      fileTypesToMatch: ['file_type', ...],
      autoGenerateObjectIDs: true|false,
      schedule: 'every 1 day',
      recordExtractor: ({ url, $, contentLength, fileType, dataSources })  => {
      }
    },
  ],
}

See code examples

About this parameter

Determines which web pages are translated into Algolia records and in what way.

A single action defines:

the subset of your crawler’s websites it targets,
the extraction process for those websites,
and the indices to which the extracted records are pushed.

A single web page can match multiple actions. In this case, your crawler creates a record for each matched action.
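
As a rough sketch of a page matching several actions, the configuration below has two actions whose pathsToMatch overlap on blog pages, so each blog page would produce one record per action. The index names, URLs, and record attributes are hypothetical.

```javascript
// Hypothetical configuration: a page under /blog/ matches both actions,
// so the crawler would create one record for each of them.
const config = {
  actions: [
    {
      indexName: 'all_pages', // assumed index name
      pathsToMatch: ['https://www.example.com/**'],
      recordExtractor: ({ url, $ }) => [
        { objectID: url.href, title: $('head title').text() },
      ],
    },
    {
      indexName: 'blog_posts', // assumed index name
      pathsToMatch: ['https://www.example.com/blog/**'],
      recordExtractor: ({ url, $ }) => [
        { objectID: url.href, title: $('h1').text(), type: 'post' },
      ],
    },
  ],
};
```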

Examples

{
  actions: [
    {
      indexName: 'dev_blog_algolia',
      pathsToMatch: ['https://blog.algolia.com/**'],
      fileTypesToMatch: ['pdf'],
      autoGenerateObjectIDs: false,
      schedule: 'every 1 day',
      recordExtractor: ({ url, $, contentLength, fileType, dataSources })  => {
        ...
      }
    },
  ],
}

Parameters

Action


            name

type: string

Optional

The unique identifier of this action (useful for debugging). Required if schedule is set.


            indexName

type: string

Required

The index name targeted by this action. This value is appended to the indexPrefix, when specified.


            schedule

type: string

Optional

How often to perform a complete crawl for this action. See main property schedule for more information.


            pathsToMatch

type: string[]

Required

Determines which webpages match for this action. This list is checked against the url of webpages using micromatch. You can use negation, wildcards and more.
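
A minimal sketch of combining a wildcard with a negated pattern, assuming micromatch-style globs as described above. The URLs and index name are hypothetical.

```javascript
// Hypothetical action: match everything under /docs/,
// except the /docs/internal/ subtree (negated with a leading "!").
const docsAction = {
  indexName: 'docs', // assumed index name
  pathsToMatch: [
    'https://www.example.com/docs/**',
    '!https://www.example.com/docs/internal/**',
  ],
  recordExtractor: ({ url }) => [{ objectID: url.href }],
};
```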


            selectorsToMatch

type: string[]

Optional

Checks for the presence of DOM nodes matching the given selectors: if the page doesn’t contain any node matching the selectors, it’s ignored. You can also check for the absence of selectors by using negation: if you want to ignore pages that contain a .main class, you can put !.main in the list.
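
The following sketch shows one presence check and one absence check side by side. The selectors, URL, and index name are hypothetical, and how the crawler combines several selectors is not covered here.

```javascript
// Hypothetical action using selectorsToMatch.
const productAction = {
  indexName: 'products', // assumed index name
  pathsToMatch: ['https://www.example.com/**'],
  selectorsToMatch: [
    '.product', // presence check: the page should contain a .product node
    '!.main', // absence check: pages that contain a .main node are ignored
  ],
  recordExtractor: ({ url }) => [{ objectID: url.href }],
};
```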


            fileTypesToMatch

type: string[]

default: ["html"]

Optional

Set this value if you want to index documents. Chosen file types will be converted to HTML using Tika, then treated as a normal HTML page. See the documents guide for a list of available fileTypes.


            autoGenerateObjectIDs

type: boolean

default: true

Generate an objectID for records that don’t have one. When set to false, the crawler raises an error if an extracted record doesn’t have an objectID.
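
With autoGenerateObjectIDs set to false, every record returned by the recordExtractor needs its own objectID. A minimal sketch, assuming the page URL is unique enough to serve as the ID; the dry run at the end uses stand-ins for the arguments the crawler would normally provide.

```javascript
// Sketch: set objectID explicitly on each extracted record.
const extractWithId = ({ url, $ }) => [
  {
    objectID: url.href, // assumption: one record per page, so the URL is unique
    title: $('head title').text().trim(),
  },
];

// Local dry run with stand-ins for the crawler-provided arguments.
const fakePage = {
  url: new URL('https://www.example.com/about'),
  $: () => ({ text: () => ' About us ' }), // minimal stub, not a real Cheerio instance
};
const extracted = extractWithId(fakePage);
```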


            recordExtractor

type: function

Required

A recordExtractor is a custom JavaScript function that lets you run your own code to extract what you want from a page. It should return either an array of JSON objects or an empty array. If the function returns an empty array, the page is skipped.

recordExtractor: ({ url, $, contentLength, fileType }) => {
  return [
    {
      url: url.href,
      text: $('p').html(),
      // ...any other attributes you want
    },
  ];
  // return []; skips the page
}
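
Returning an empty array can be made conditional. The sketch below skips pages that have no paragraph text and creates a record otherwise; the skip condition and the attribute names are only illustrative.

```javascript
// Sketch: skip pages without paragraph text, index the others.
const skipEmptyPages = ({ url, $, contentLength }) => {
  const text = $('p').text().trim();
  if (text === '') {
    return []; // empty array: the page is skipped
  }
  return [{ url: url.href, text, contentLength }];
};
```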

action ➔ recordExtractor


            $

type: object (Cheerio instance)

Optional

A Cheerio instance containing the HTML of the crawled page.

url

type: Location object

Optional

A Location object containing the URL and metadata for the crawled page.


            fileType

type: string

Optional

The fileType of the crawled page (e.g.: html, pdf, …).


            contentLength

type: number

Optional

The number of bytes in the crawled page.


            dataSources

type: object

Optional

Object containing the external data sources of the current URL. Each key of the object corresponds to an externalData object.

{
  dataSources: {
    dataSourceId1: { data1: 'val1', data2: 'val2' },
    dataSourceId2: { data1: 'val1', data2: 'val2' },
  }
}
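
A minimal sketch of reading dataSources inside a recordExtractor, reusing the placeholder ID dataSourceId1 and the key data1 from the example above. It assumes a source may have no entry for the current URL, hence the fallback.

```javascript
// Sketch: enrich a record with a value from an external data source.
const withExternalData = ({ url, $, dataSources }) => {
  // Fall back to an empty object if the source has no entry (assumption).
  const extra = (dataSources && dataSources.dataSourceId1) || {};
  return [
    {
      objectID: url.href,
      title: $('head title').text().trim(),
      data1: extra.data1 ?? null, // null when the value is missing
    },
  ];
};
```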


            helpers

type: object

Optional

Collection of functions to help you extract content and generate records.

recordExtractor ➔ helpers


            docsearch

type: function

Optional

You can call the helpers.docsearch() function from your recordExtractor. It automatically extracts content and formats it to be compatible with DocSearch. It produces an optimized number of records for relevancy and hierarchy, and you can use it without DocSearch or to index non-documentation content.

recordExtractor: ({ url, $, helpers }) => {
  return helpers.docsearch({
    aggregateContent: true,
    indexHeadings: true,
    recordVersion: 'v3',
    recordProps: {
      lvl0: {
        selectors: "header h1",
      },
      lvl1: "article h2",
      lvl2: "article h3",
      lvl3: "article h4",
      lvl4: "article h5",
      lvl5: "article h6",
      content: "main p, main li",
    },
  });
}

You can find more examples in the DocSearch documentation.


            splitContentIntoRecords

type: function

Optional

The helpers.splitContentIntoRecords() function is callable from your recordExtractor. It extracts textual content from the resource (that is, an HTML page or document) and splits it into one or more records. Use it to index textual content exhaustively while preventing record_too_big errors.

recordExtractor: ({ url, $, helpers }) => {
  const baseRecord = {
    url,
    title: $('head title').text().trim(),
  };
  const records = helpers.splitContentIntoRecords({
    baseRecord,
    $elements: $('body'),
    maxRecordBytes: 1000,
    textAttributeName: 'text',
    orderingAttributeName: 'part',
  });
  // You can still alter produced records
  // afterwards, if needed.
  return records;
}

In the preceding recordExtractor example, crawling a long HTML page returns an array of records, none of which exceeds the 1000-byte limit set by maxRecordBytes. The records extracted by splitContentIntoRecords look similar to this:

[
  {
    url: 'http://test.com/index.html',
    title: 'Home - Test.com',
    part: 0,
    text: 'Welcome on test.com, the best resource to',
  },
  {
    url: 'http://test.com/index.html',
    title: 'Home - Test.com',
    part: 1,
    text: 'find interesting content online.',
  },
]

Assuming that the automatic generation of objectIDs is enabled in your configuration, the crawler generates an objectID for each of the generated records.
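
If automatic objectID generation is disabled instead, each split record needs an explicit objectID. The sketch below derives one from the page URL and the ordering attribute; this naming scheme is an assumption, not a crawler convention.

```javascript
// Sketch: add an explicit objectID to every record produced by the helper.
const splitWithIds = ({ url, $, helpers }) => {
  const records = helpers.splitContentIntoRecords({
    baseRecord: { url: url.href, title: $('head title').text().trim() },
    $elements: $('body'),
    maxRecordBytes: 1000,
    textAttributeName: 'text',
    orderingAttributeName: 'part',
  });
  // Hypothetical ID scheme: page URL plus the record's position on the page.
  return records.map((record) => ({
    ...record,
    objectID: `${url.href}#${record.part}`,
  }));
};
```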

To prevent duplicate results when a search term appears in several records from the same resource (page), we recommend enabling distinct in your index settings, setting attributeForDistinct and searchableAttributes, and adding a custom ranking that orders records from the first on the page to the last:

initialIndexSettings: {
  'my-index': {
    distinct: true,
    attributeForDistinct: 'url',
    searchableAttributes: [ 'title', 'text' ],
    customRanking: [ 'asc(part)' ],
  }
}

Please be aware that using distinct comes with some specificities.

helpers ➔ splitContentIntoRecords


            $elements

type: string

default: $("body")

Optional

A Cheerio selector that determines from which elements textual content will be extracted and turned into records.


            baseRecord

type: object

default: {}

Optional

Attributes (and their values) to add to all resulting records.


            maxRecordBytes

type: number

default: 10000

Optional

Maximum number of bytes allowed per record, on the resulting Algolia index. You can refer to the record size limits for your plan to prevent any errors regarding record size.


            textAttributeName

type: string

default: text

Optional

Name of the attribute in which to store the text of each record.


            orderingAttributeName

type: string

Optional

Name of the attribute in which to store the number of each record.

helpers ➔ docsearch


            recordProps

type: object

Required

Main docsearch configuration.


            aggregateContent

type: boolean

default: true

Optional

Whether the helpers automatically merge sibling elements and separate them by a line break.

For: <p>Foo</p><p>Bar</p>

{
  aggregateContent: false,
  // creates 2 records
}

{
  aggregateContent: true,
  // creates 1 record
}


            indexHeadings

type: boolean | object

default: true

Optional

Whether the helpers create records for headings.

When false, only records for the content level are created. When an object with from and to is provided, only records for the heading levels in that range are created (for example, lvl4 to lvl6 for { from: 4, to: 6 }).

{
  indexHeadings: false,
  // or, to create heading records for lvl4 to lvl6 only:
  // indexHeadings: { from: 4, to: 6 },
}


            recordVersion

type: string

default: v2

Optional

Change the version of the extracted records. It’s not correlated with the DocSearch version and can be incremented independently.

v2: compatible with DocSearch >= @2
v3: compatible with DocSearch >= @3

docsearch ➔ recordProps


            lvl0

type: object

Required

Select the main category of the page. You should index the title and h1 of the page in lvl1.

{
  lvl0: {
    selectors: '.page-category',
    defaultValue: 'documentation'
  }
}


            lvl1

type: string | string[]

Required

Select the main title of the page.

{
  lvl1: 'head > title'
}


            content

type: string | string[]

Required

Select the content elements of the page.

{
  content: 'body > p, main li'
}


            pageRank

type: string

Optional

Add an attribute pageRank to the extracted records that you can use to boost the relevance of associated records in the index settings. Note that you can pass any numeric value as a string, including negative values.

{
  pageRank: "30"
}


            lvl2, lvl3, lvl4, lvl5, lvl6

type: string | string[]

Optional

Select other headings of the page.

{
  lvl2: "main h2",
  lvl3: "footer h3",
  lvl4: ["h4", "div.important"],
}

type: string | string[] | object

Optional

All extra keys are added to the extracted records.

{
  myCustomAttribute: '.myCustomClass',
  ogDesc: {
    selectors: 'head meta[name="og:desc"]',
    defaultValue: 'Default description'
  }
}
