Jan. 17, 2022

Crawler: Login

Type: object

Parameter syntax

login: {
  fetchRequest: {
    url: 'your_url',
    requestOptions: {
      ...
    }
  }
}

login: {
  browserRequest: {
    url: 'your_login_page',
    username: 'login',
    password: 'password',
  }
}

login: {
  oauthRequest: {
    accessTokenRequest: {
      url: 'url_of_access_token_endpoint',
      grant_type: 'client_credentials',
      client_id: 'client-identifier',
      client_secret: 'client-secret',
    }
  }
}

About this parameter

This property defines how the crawler acquires a session to access protected content.

The crawler supports multiple ways to authenticate to protected websites, such as:

Basic auth through an HTTP request
Basic auth by visiting a login page in a web browser and submitting the login form as a human would
OAuth 2.0

Basic auth

The crawler extracts the Set-Cookie response header from the login page, stores that cookie and sends it in a Cookie header when crawling all pages of the website defined in the configuration.

This cookie is only fetched at the beginning of each complete crawl. If it expires, we won’t renew it automatically.
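
As an illustration of the cookie handling described above, the following sketch turns the values of `Set-Cookie` response headers into a single `Cookie` request header. The function name `buildCookieHeader` is hypothetical, and this is not the crawler's actual implementation; it only shows the general idea of keeping each cookie's `name=value` pair and dropping attributes such as `Path` or `Expires`.

```javascript
// Hypothetical helper, for illustration only: not the crawler's internal code.
// Takes an array of Set-Cookie header values and returns a Cookie header value.
function buildCookieHeader(setCookieHeaders) {
  return setCookieHeaders
    // Keep the part before the first ';' (the name=value pair),
    // discarding attributes such as Path, Expires, or HttpOnly.
    .map((header) => header.split(';')[0].trim())
    // Ignore anything that is not a name=value pair.
    .filter((pair) => pair.includes('='))
    .join('; ');
}

const cookieHeader = buildCookieHeader([
  'session=abc123; Path=/; HttpOnly',
  'theme=dark; Expires=Wed, 21 Oct 2026 07:28:00 GMT',
]);
console.log(cookieHeader); // session=abc123; theme=dark
```

Note that the sketch keeps no expiry information, which mirrors the behavior described above: a cookie obtained once at the start of a crawl is reused as is.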

There are two ways the crawler can interact with your login page:

By doing a direct request with the credentials to your login endpoint, like a standard cURL command.
By emulating a web browser: loading your login page, entering the credentials, and submitting the login form.

OAuth 2.0

The crawler supports OAuth 2.0 Client Credentials Grant flow. It performs an Access Token Request using the provided credentials, stores the fetched token in an Authorization header and sends it when crawling all pages of the website that are defined in the configuration.

This token is only fetched at the beginning of each complete crawl. If it expires, it won’t be renewed automatically.

Client authentication is performed by passing the client credentials (client_id / client_secret) in the request body, as described in the OAuth 2.0 RFC (RFC 6749).
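
To make the request-body authentication concrete, here is a minimal sketch of how such an Access Token Request body could be assembled from the `accessTokenRequest` settings. The helper name `buildAccessTokenBody` is hypothetical and the code is not the crawler's implementation; it assumes the form-encoded body format that the OAuth 2.0 specification uses for token requests.

```javascript
// Hypothetical helper, for illustration only: not the crawler's internal code.
// Builds an application/x-www-form-urlencoded body from accessTokenRequest settings.
function buildAccessTokenBody({
  grant_type,
  client_id,
  client_secret,
  scope,
  extraParameters = {},
}) {
  // The client credentials travel in the body, next to the grant type.
  const params = new URLSearchParams({ grant_type, client_id, client_secret });
  // `scope` is optional, so only add it when provided.
  if (scope !== undefined) params.set('scope', scope);
  // Implementation-specific parameters (for example `resource`) are appended as is.
  for (const [name, value] of Object.entries(extraParameters)) {
    params.set(name, value);
  }
  return params.toString();
}

const tokenRequestBody = buildAccessTokenBody({
  grant_type: 'client_credentials',
  client_id: 'my-client-id',
  client_secret: 'my-client-secret',
  extraParameters: { resource: 'https://protected.example.com/' },
});
console.log(tokenRequestBody);
```

The resulting string is what would be sent as the POST body to the access token endpoint given in `url`.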

The following providers are supported. Reach out if you need support for other providers.

Azure AD v1.0

Examples

{
  login: {
    fetchRequest: {
      url: 'https://example.com/secure/login-with-post',
      requestOptions: {
        method: 'POST',
        headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
        body: 'id=my-id&password=my-password',
        timeout: 5000 // in milliseconds
      }
    }
  }
}

  {
    login: {
      browserRequest: {
        url: 'https://example.com/secure/login-page',
        username: 'my-id',
        password: 'my-password',
      }
    }
  }

  {
    login: {
      oauthRequest: {
        accessTokenRequest: {
          url: 'https://example.com/oauth2/token',
          grant_type: 'client_credentials',
          client_id: 'my-client-id',
          client_secret: 'my-client-secret',
          extraParameters: {
            resource: 'https://protected.example.com/'
          }
        }
      }
    }
  }

Parameters

fetchRequest

This allows you to manually craft the login request that the crawler sends.

`url`	type: string Required The URL to target.
`requestOptions`	type: object This object is passed to our extended version of the request library.

fetchRequest ➔ requestOptions

`method`	type: string default: GET The HTTP method to use.
`headers`	type: object default: {} HTTP headers to pass.
`body`	type: string The body of the request.
`timeout`	type: number Time to wait before aborting the request (in milliseconds).

requestOptions ➔ headers

`Content-Type`	type: string
`Authorization`	type: string
`Cookie`	type: string
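
If your login endpoint expects JSON rather than form data, the same options can be combined as in the following sketch. The endpoint URL and the `id` and `password` field names are placeholders (assumptions, not part of this reference); use whatever your own login endpoint expects. Because `body` is a string, the payload is serialized with `JSON.stringify`. The `jsonLoginConfig` variable name is only there to make the snippet runnable on its own.

```javascript
// Illustrative configuration; the URL and field names are placeholders.
const jsonLoginConfig = {
  login: {
    fetchRequest: {
      url: 'https://example.com/secure/login-with-json',
      requestOptions: {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        // `body` must be a string, so serialize the JSON payload.
        body: JSON.stringify({ id: 'my-id', password: 'my-password' }),
        timeout: 5000, // in milliseconds
      },
    },
  },
};
```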

browserRequest

Makes the crawler use a web browser to visit your login page and submit the login form as a human would.

`url`	type: string Required The URL of the login page. The HTML elements expected on this page are `input[type=text]` or `input[type=email]` for the username and `input[type=password]` for the password.
`username`	type: string Required The username.
`password`	type: string Required The password.
`waitTime`	type: object Optional Determines the shortest and longest wait time before considering the login done.

browserRequest ➔ waitTime

`min`	type: number default: 0 Optional If the login ends faster than this minimum execution time, the browser remains open at least this long before returning the cookies.
`max`	type: number default: 20000 Optional At this maximum execution time threshold, the execution stops and the cookies are returned as is.
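
For example, a login form that redirects slowly after submission might need a minimum wait. The sketch below sets both bounds. The specific values are arbitrary, and they are assumed to be in milliseconds by analogy with the `timeout` option of `fetchRequest`; this page does not state the unit for `waitTime` explicitly. The `waitTimeConfig` variable name is only there to make the snippet runnable on its own.

```javascript
// Illustrative configuration; the wait values are arbitrary examples.
const waitTimeConfig = {
  login: {
    browserRequest: {
      url: 'https://example.com/secure/login-page',
      username: 'my-id',
      password: 'my-password',
      waitTime: {
        min: 2000, // keep the browser open at least this long before returning the cookies
        max: 10000, // stop waiting at this threshold and return the cookies as is
      },
    },
  },
};
```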

oauthRequest

Makes the crawler use the OAuth 2.0 Client Credentials Grant flow to generate an Authorization header.

`accessTokenRequest`	type: object Required Object containing the parameters needed to perform an Access Token Request.

oauthRequest ➔ accessTokenRequest

`url`	type: string Required The URL of the access token endpoint.
`grant_type`	type: string Required OAuth grant type. Must be “client_credentials”.
`client_id`	type: string Required The client identifier.
`client_secret`	type: string Required The client secret.
`scope`	type: string The scope of the access request.
`extraParameters`	type: object Object containing implementation-specific parameters that aren’t part of the RFC.

accessTokenRequest ➔ extraParameters

`resource`	type: string Required parameter for Azure AD v1.0 implementations.