Crawler: Actions
Type: Action[]
Required
Parameter syntax
{
  actions: [
    {
      indexName: 'index_name',
      pathsToMatch: ['url_path', ...],
      fileTypesToMatch: ['file_type', ...],
      autoGenerateObjectIDs: true|false,
      schedule: 'every 1 day',
      recordExtractor: ({ url, $, contentLength, fileType, dataSources }) => {
      }
    },
  ],
}

About this parameter

Determines which web pages are turned into Algolia records, and how.

A single action defines:

  1. the subset of your crawler’s websites it targets,
  2. the extraction process for those websites,
  3. and the indices to which the extracted records are pushed.

A single web page can match multiple actions. In this case, your crawler creates a record for each matched action.
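
For instance (illustrative URLs and index names), the two actions below both match pages under the same path, so each matching page produces records in both indices:

```javascript
{
  actions: [
    {
      indexName: 'docs_titles',
      pathsToMatch: ['https://www.example.com/docs/**'],
      // One record per page, holding its main title
      recordExtractor: ({ url, $ }) => [
        { url: url.href, title: $('h1').text() },
      ],
    },
    {
      indexName: 'docs_content',
      pathsToMatch: ['https://www.example.com/docs/**'],
      // A second record per page, holding its paragraph text
      recordExtractor: ({ url, $ }) => [
        { url: url.href, text: $('p').text() },
      ],
    },
  ],
}
```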

Examples

{
  actions: [
    {
      indexName: 'dev_blog_algolia',
      pathsToMatch: ['https://blog.algolia.com/**'],
      fileTypesToMatch: ['pdf'],
      autoGenerateObjectIDs: false,
      schedule: 'every 1 day',
      recordExtractor: ({ url, $, contentLength, fileType, dataSources }) => {
        // ...
      }
    },
  ],
}

Parameters

Action

name
type: string
Optional

The unique identifier of this action (useful for debugging). Required if schedule is set.

indexName
type: string
Required

The index name targeted by this action. This value is appended to the indexPrefix, when specified.

schedule
type: string
Optional

How often to perform a complete crawl for this action. See the top-level schedule property for more information.

pathsToMatch
type: string[]
Required

Determines which webpages match this action. This list is checked against the URL of webpages using micromatch, so you can use negation, wildcards, and more.
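
For example (illustrative URLs), wildcards and negation can be combined:

```javascript
pathsToMatch: [
  'https://www.example.com/docs/**',     // every page under /docs/
  '!https://www.example.com/docs/v1/**', // ...except the legacy v1 section
],
```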

selectorsToMatch
type: string[]
Optional

Checks for the presence of DOM nodes matching the given selectors: if the page doesn’t contain any node matching the selectors, it’s ignored. You can also check for the absence of selectors by using negation: if you want to ignore pages that contain a .main class, you can put !.main in the list.
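
For example (illustrative selectors), this only keeps pages that contain an article element and skips any page with a .login-wall element:

```javascript
selectorsToMatch: ['article', '!.login-wall'],
```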

fileTypesToMatch
type: string[]
default: ["html"]
Optional

Set this value if you want to index documents. The chosen file types are converted to HTML using Tika, then treated as normal HTML pages. See the documents guide for a list of available fileTypes.

autoGenerateObjectIDs
type: boolean
default: true
Optional

Generate an objectID for records that don't have one. If you set this parameter to false, the crawler raises an error whenever an extracted record lacks an objectID.

recordExtractor
type: function
Required

A recordExtractor is a custom JavaScript function that lets you run your own code and extract whatever you want from a page. It must return either an array of JSON objects or an empty array. If it returns an empty array, the page is skipped.

recordExtractor: ({ url, $, contentLength, fileType }) => {
  return [
    {
      url: url.href,
      text: $('p').html(),
      // ... anything you want
    }
  ];
  // return []; // skips the page
}

action ➔ recordExtractor

$
type: object (Cheerio instance)
Optional

A Cheerio instance containing the HTML of the crawled page.

url
type: Location object
Optional

A Location object containing the URL and metadata for the crawled page.
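
The object exposes the usual Location/URL fields. As a sketch, here is the kind of data you can read from it, illustrated with the standard URL class (the crawler's object may carry additional metadata):

```javascript
// Illustration only: build a URL object shaped like the `url` argument
// passed to recordExtractor.
const url = new URL('https://blog.algolia.com/engineering/crawler?page=2');

const record = {
  url: url.href,       // 'https://blog.algolia.com/engineering/crawler?page=2'
  host: url.hostname,  // 'blog.algolia.com'
  path: url.pathname,  // '/engineering/crawler'
};
```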

fileType
type: string
Optional

The fileType of the crawled page (e.g. html, pdf, …).

contentLength
type: number
Optional

The number of bytes in the crawled page.

dataSources
type: object
Optional

Object containing the external data sources of the current URL. Each key of the object corresponds to the ID of a configured external data source.

{
  dataSources: {
    dataSourceId1: { data1: 'val1', data2: 'val2' },
    dataSourceId2: { data1: 'val1', data2: 'val2' },
  }
}

helpers
type: object
Optional

Collection of functions to help you extract content and generate records.

recordExtractor ➔ helpers

docsearch
type: function
Optional

You can call the helpers.docsearch() function from your recordExtractor. It automatically extracts content and formats it to be compatible with DocSearch. It produces an optimized number of records for relevancy and hierarchy, and you can use it without DocSearch or to index non-documentation content.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
recordExtractor: ({ url, $, helpers }) => {
  return helpers.docsearch({
    aggregateContent: true,
    indexHeadings: true,
    recordVersion: 'v3',
    recordProps: {
      lvl0: {
        selectors: "header h1",
      },
      lvl1: "article h2",
      lvl2: "article h3",
      lvl3: "article h4",
      lvl4: "article h5",
      lvl5: "article h6",
      content: "main p, main li",
    },
  });
}

You can find more examples in the DocSearch documentation.

splitContentIntoRecords
type: function
Optional

The helpers.splitContentIntoRecords() function is callable from your recordExtractor. It extracts the textual content from the resource (an HTML page or document) and splits it into one or more records. Use it to index textual content exhaustively while preventing record_too_big errors.

recordExtractor: ({ url, $, helpers }) => {
  const baseRecord = {
    url,
    title: $('head title').text().trim(),
  };
  const records = helpers.splitContentIntoRecords({
    baseRecord,
    $elements: $('body'),
    maxRecordBytes: 1000,
    textAttributeName: 'text',
    orderingAttributeName: 'part',
  });
  // You can still alter produced records
  // afterwards, if needed.
  return records;
}

With the preceding recordExtractor, crawling a long HTML page returns an array of records, none of which exceeds 1,000 bytes. The records extracted by splitContentIntoRecords look similar to this:

[
  {
    url: 'http://test.com/index.html',
    title: 'Home - Test.com',
    part: 0,
    text: 'Welcome to test.com, the best resource to',
  },
  {
    url: 'http://test.com/index.html',
    title: 'Home - Test.com',
    part: 1,
    text: 'find interesting content online.',
  }
]

If automatic objectID generation is enabled in your configuration, the crawler generates an objectID for each of these records.

To prevent duplicate results when a search term appears in several records from the same page, we recommend enabling distinct in your index settings, setting attributeForDistinct and searchableAttributes, and adding a custom ranking that orders each page's records from first part to last:

initialIndexSettings: {
  'my-index': {
    distinct: true,
    attributeForDistinct: 'url',
    searchableAttributes: [ 'title', 'text' ],
    customRanking: [ 'asc(part)' ],
  }
}

Be aware that using distinct comes with some caveats.

helpers ➔ splitContentIntoRecords

$elements
type: string
default: $("body")
Optional

A Cheerio selector that determines which elements' textual content is extracted and turned into records.

baseRecord
type: object
default: {}
Optional

Attributes (and their values) to add to all resulting records.

maxRecordBytes
type: number
default: 10000
Optional

Maximum number of bytes allowed per record in the resulting Algolia index. Check the record size limits for your plan to avoid record size errors.
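
As a rough sketch of why this matters, assuming record size is measured on the serialized JSON (the crawler's exact accounting may differ), you can estimate a record's size like this:

```javascript
const record = {
  url: 'http://test.com/index.html',
  title: 'Home - Test.com',
  part: 0,
  text: 'Welcome to test.com, the best resource to',
};

// Estimate the record's size in bytes as its UTF-8 serialized JSON.
const recordBytes = Buffer.byteLength(JSON.stringify(record), 'utf8');
const fitsInBudget = recordBytes <= 1000; // true for this small record
```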

textAttributeName
type: string
default: text
Optional

Name of the attribute in which to store the text of each record.

orderingAttributeName
type: string
Optional

Name of the attribute in which to store the sequence number (part) of each record.

helpers ➔ docsearch

recordProps
type: object
Required

Main DocSearch configuration.

aggregateContent
type: boolean
default: true
Optional

Whether the helpers automatically merge sibling elements and separate them by a line break.

For <p>Foo</p><p>Bar</p>:

{
  aggregateContent: false,
  // creates 2 records
}

{
  aggregateContent: true,
  // creates 1 record
}

indexHeadings
type: boolean | object
default: true
Optional

Whether the helpers create records for headings.

When false, only records for the content level are created. When an object { from, to } is provided, only records for the heading levels in that range are created.

{
  indexHeadings: false
}

{
  indexHeadings: { from: 4, to: 6 }
}

recordVersion
type: string
default: v2
Optional

Change the version of the extracted records. It’s not correlated with the DocSearch version and can be incremented independently.

  • v2: compatible with DocSearch >= @2
  • v3: compatible with DocSearch >= @3

docsearch ➔ recordProps

lvl0
type: object
Required

Select the main category of the page. You should index the title and h1 of the page in lvl1.

{
  lvl0: {
    selectors: '.page-category',
    defaultValue: 'documentation'
  }
}

lvl1
type: string | string[]
Required

Select the main title of the page.

{
  lvl1: 'head > title'
}

content
type: string | string[]
Required

Select the content elements of the page.

{
  content: 'body > p, main li'
}

pageRank
type: string
Optional

Add an attribute pageRank to the extracted records that you can use to boost the relevance of associated records in the index settings. Note that you can pass any numeric value as a string, including negative values.

{
  pageRank: "30"
}
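
You can then boost records with a higher pageRank through your index settings, for example (hypothetical index name; assumes pageRank orders as intended for your values):

```javascript
initialIndexSettings: {
  'my-index': {
    customRanking: ['desc(pageRank)'],
  },
},
```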
lvl2, lvl3, lvl4, lvl5, lvl6
type: string | string[]
Optional

Select other headings of the page.

{
  lvl2: "main h2",
  lvl3: "footer h3",
  lvl4: ["h4", "div.important"],
}

*
type: string | string[] | object
Optional

All extra keys are added to the extracted records.

{
  myCustomAttribute: '.myCustomClass',
  ogDesc: {
    selectors: 'head meta[name="og:desc"]',
    defaultValue: 'Default description'
  }
}