1. Overview

Before using the Octoparse API, you need a Standard or Professional account with at least one runnable task set up. (Haven’t got an account? Sign up here.) By connecting to the Octoparse API you can retrieve extracted data and task information and, with the advanced API, control tasks, so that data extraction can be coordinated from your own application.

1.1. Document Version

Current version: V1.0

1.2. Contact

Contact: Octoparse support team

Email: support@octoparse.com

1.3. URI Standard

All requests must be URL-encoded and sent to paths under the base URL:

http://advancedapi.octoparse.com/

For example, a request for 'Get Data by Offset' should be: GET http://advancedapi.octoparse.com/api/alldata/GetDataOfTaskByOffset?taskId={taskId}&offset={offset}&size={size}

Note: {xxxx} in this document represents a placeholder that you need to replace with a real value. For example, if your task ID is abc, offset is 1 and size is 10, then the URL should be http://advancedapi.octoparse.com/api/alldata/GetDataOfTaskByOffset?taskId=abc&offset=1&size=10.
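The placeholder substitution described above can be sketched in Python as follows. This is a minimal illustration, not official client code: the helper name build_url is hypothetical, and only the base URL and the 'Get Data by Offset' path come from this document.

```python
from urllib.parse import urlencode

BASE_URL = "http://advancedapi.octoparse.com/"  # base URL from section 1.3


def build_url(path: str, **params) -> str:
    """Join the base URL with an API path and URL-encode the query parameters.

    build_url is a hypothetical helper used for illustration only.
    """
    url = BASE_URL + path.lstrip("/")
    if params:
        url += "?" + urlencode(params)
    return url


# Reproduces the example URL from the note above.
example = build_url("api/alldata/GetDataOfTaskByOffset", taskId="abc", offset=1, size=10)
```

Using urlencode rather than string concatenation keeps values that contain spaces or reserved characters correctly encoded.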

1.4. Obtain OAuth2.0 Token

Before getting access to the Octoparse API, you need to obtain an Access Token based on OAuth2.0.

1.4.1. Obtain a New Token

You will need your username and password to get a new Access Token.

Request

POST http://advancedapi.octoparse.com/token

Parameters

username={userName}&password={password}&grant_type=password

Request Content Type

application/x-www-form-urlencoded

Response

Access Token

Response Content Type

application/json, text/json

Example
{
    "access_token": "ABCD1234",       // Access token (grants access permission)
    "token_type": "bearer",           // Token type
    "expires_in": 86399,              // Access token lifetime in seconds (reuse the same token within this period)
    "refresh_token": "refresh_token"  // Used to refresh the access token
}

The access token (‘access_token’) is required for every API method. Add it to the HTTP header in the following format:

HeaderName: Authorization
Value: bearer {access_token}

Note: There is a space between ‘bearer’ and the access token. For example, if the Access Token is AA11BB22...CC33, the header should be ‘Authorization: bearer AA11BB22...CC33’. The Access Token has an expiration time, and reusing the same token until it expires is recommended.
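As a rough sketch of the token request and header format above, the following Python builds (but does not send) the request using only the standard library. The function names build_token_request and auth_header are hypothetical; the endpoint, form fields, content type and header format are taken from this section.

```python
from urllib.parse import urlencode
from urllib.request import Request

TOKEN_URL = "http://advancedapi.octoparse.com/token"  # endpoint from section 1.4.1


def build_token_request(username: str, password: str) -> Request:
    """Build the form-encoded POST that asks for a new access token."""
    form = urlencode(
        {"username": username, "password": password, "grant_type": "password"}
    )
    return Request(
        TOKEN_URL,
        data=form.encode("ascii"),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def auth_header(access_token: str) -> dict:
    """Return the Authorization header: 'bearer', one space, then the token."""
    return {"Authorization": "bearer " + access_token}
```

To send the request, pass it to urllib.request.urlopen and parse the response body with json.load; the resulting 'access_token' value is what auth_header expects.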

1.4.2. Refresh Token

Once an Access Token expires, you can obtain a new one with the 'refresh_token'. This is more secure than making a new request with your username and password.

Note: Each 'refresh_token' can only be used once. Use the new 'refresh_token' returned by the current request for the next refresh.

Request

POST http://advancedapi.octoparse.com/token

Parameters

refresh_token={refresh_token}&grant_type=refresh_token

Request Content Type

application/x-www-form-urlencoded

Response

Access Token

Note: The response HTTP status code should be ‘200’. If not, please refer to HTTP Status Code to solve the problem.
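One way to keep track of the token pair is sketched below. It assumes the refresh response has the same fields as the example in 1.4.1 (access_token, expires_in, refresh_token); the class name TokenStore and the 60-second safety margin are illustrative choices, not part of the API.

```python
import time
from urllib.parse import urlencode


class TokenStore:
    """Hold the latest token pair and report when the access token has expired.

    Hypothetical helper: the clock is injectable so the class can be tested.
    """

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self.access_token = None
        self.refresh_token = None
        self._expires_at = 0.0

    def update(self, token_response: dict) -> None:
        """Store a parsed token response; this replaces the single-use refresh token."""
        self.access_token = token_response["access_token"]
        self.refresh_token = token_response["refresh_token"]
        self._expires_at = self._clock() + token_response["expires_in"]

    def is_expired(self, margin: float = 60.0) -> bool:
        """True when the token has expired or will expire within `margin` seconds."""
        return self._clock() >= self._expires_at - margin

    def refresh_form(self) -> str:
        """Form body for POST /token with grant_type=refresh_token."""
        return urlencode(
            {"refresh_token": self.refresh_token, "grant_type": "refresh_token"}
        )
```

Calling update with every token response means the store always holds the newest 'refresh_token', which matters because each one can be used only once.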

2. Instruction

Octoparse limits API usage to 20 requests per second. Reduce your request frequency if you receive status code ‘429’.

Note: Octoparse uses a leaky bucket algorithm to limit API access frequency. At most 100 requests are accepted within any five-second interval; no further requests are accepted until the next five-second interval.
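A client can stay under this limit by throttling itself. The sketch below uses a simple sliding-window counter on the client side; it is not the server's leaky bucket algorithm, only an approximation built from the numbers stated above (100 requests per 5 seconds). The class name WindowLimiter is hypothetical.

```python
import time
from collections import deque


class WindowLimiter:
    """Client-side sliding-window throttle (an approximation, see lead-in)."""

    def __init__(self, max_requests=100, window=5.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window = window
        self._clock = clock
        self._sent = deque()  # timestamps of requests still inside the window

    def wait_time(self) -> float:
        """Seconds to wait before the next request may be sent (0.0 if none)."""
        now = self._clock()
        # Forget requests that have left the window.
        while self._sent and now - self._sent[0] >= self.window:
            self._sent.popleft()
        if len(self._sent) < self.max_requests:
            return 0.0
        # Wait until the oldest remembered request leaves the window.
        return self.window - (now - self._sent[0])

    def record(self) -> None:
        """Note that a request is being sent now."""
        self._sent.append(self._clock())
```

Before each API call, sleep for limiter.wait_time() seconds and then call limiter.record(). A '429' response should still be handled, since the server's accounting may differ from the client's.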

Unusual Status

The response HTTP status code should be ‘200’. If not, please refer to HTTP Status Code to solve the problem.

2.1. Get Task Group Information

2.1.1. List All Task Groups

Request

GET api/TaskGroup

Response

JSON-formatted text containing task group information and the request status

Response Content Type

application/json, text/json

Example
{
  "data": [
    {
      "taskGroupId": 1,
      "taskGroupName": "Example Task Group 1"
    },
    {
      "taskGroupId": 2,
      "taskGroupName": "Example Task Group 2"
    }
  ],
  "error": "success",
  "error_Description": "Action Success"
}
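Every example response in this document wraps its payload in the same envelope ("data", "error", "error_Description"). A small helper can check that envelope once, as sketched below. The names unwrap and ApiError are hypothetical, and treating any "error" value other than "success" as a failure is an assumption, since this document only shows successful responses.

```python
import json


class ApiError(Exception):
    """Raised when the response body does not report success (hypothetical)."""


def unwrap(body: str):
    """Parse a JSON response body and return its "data" member.

    Assumption: a failed call carries a value other than "success" in "error".
    """
    payload = json.loads(body)
    if payload.get("error") != "success":
        raise ApiError(f"{payload.get('error')}: {payload.get('error_Description')}")
    return payload.get("data")


sample = """{
  "data": [
    {"taskGroupId": 1, "taskGroupName": "Example Task Group 1"},
    {"taskGroupId": 2, "taskGroupName": "Example Task Group 2"}
  ],
  "error": "success",
  "error_Description": "Action Success"
}"""
group_names = [group["taskGroupName"] for group in unwrap(sample)]
```

Responses without a "data" member, such as the one returned by Stop Running Task, yield None.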

2.2. Manage Task

2.2.1. List All Tasks in a Group

Request

GET api/Task?taskGroupId={taskGroupId}

Parameters

Parameter | Description | Remark
taskGroupId | Task Group ID | Define this parameter in the request URL.

Response

JSON-formatted text including the task ID (taskId), task name (taskName), creating user's ID (creationUserId) and the request status.

Response Content Type

application/json, text/json

Example
{
  "data": [
    {
      "taskId": "337fd7d7-aded-4081-9104-2b551161ccc8",
      "taskName": "Example Task 1",
      "creationUserId": "5d1e4b3c-645c-44ab-ac0e-bfa9ad600ece"
    },
    {
      "taskId": "4adf489b-f883-43fa-b958-0cfde945ddb7",
      "taskName": "Example Task 2",
      "creationUserId": "5d1e4b3c-645c-44ab-ac0e-bfa9ad600ece"
    }
  ],
  "error": "success",
  "error_Description": "Action Success"
}

2.2.2. Get Task Status

This returns the status of multiple tasks.

Request

POST api/task/GetTaskStatusByIdList

Parameters

Parameter | Description | Remark
taskIdList | JSON-formatted task ID list | Define this parameter in the request body.

Request Content Type

application/json, text/json

Example
{
  "taskIdList": [
    "337fd7d7-aded-4081-9104-2b551161ccc8",
    "4adf489b-f883-43fa-b958-0cfde945ddb7"
  ]
}

Response

Task status code: 0 = Running, 1 = Stopped, 2 = Completed, 3 = Waiting, 5 = Never Run

Response Content Type

application/json, text/json

Example
{
  "data": [
    {
      "taskId": "337fd7d7-aded-4081-9104-2b551161ccc8",
      "taskName": "Example Task 1",
      "status": 1
    },
    {
      "taskId": "4adf489b-f883-43fa-b958-0cfde945ddb7",
      "taskName": "Example Task 2",
      "status": 2
    }
  ],
  "error": "success",
  "error_Description": "Action Success"
}
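The numeric status codes listed above can be turned into readable names on the client side. The following sketch uses a Python IntEnum; the names TaskStatus and summarize are hypothetical, while the code values come from the list in this section.

```python
from enum import IntEnum


class TaskStatus(IntEnum):
    """Task status codes as listed in section 2.2.2."""

    RUNNING = 0
    STOPPED = 1
    COMPLETED = 2
    WAITING = 3
    NEVER_RUN = 5


def summarize(response: dict) -> dict:
    """Map each task name in a parsed status response to its status name."""
    return {
        task["taskName"]: TaskStatus(task["status"]).name
        for task in response["data"]
    }


sample = {
    "data": [
        {"taskId": "337fd7d7-aded-4081-9104-2b551161ccc8", "taskName": "Example Task 1", "status": 1},
        {"taskId": "4adf489b-f883-43fa-b958-0cfde945ddb7", "taskName": "Example Task 2", "status": 2},
    ],
    "error": "success",
    "error_Description": "Action Success",
}
summary = summarize(sample)
```

A code that is not in the list (for example 4) makes TaskStatus raise ValueError, which surfaces unexpected values instead of silently mislabeling them.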

2.2.3. Get Task Parameters

This returns the different parameters for a specific task, for example, the URL from ‘Go To The Web Page’ action, text value from ‘Enter Text’ action and text list/URL list from ‘Loop Item’ action.

Request

POST api/task/GetTaskRulePropertyByName?taskId={taskId}&name={name}

Parameters

Parameter | Description | Remark
taskId | Task ID | Define this parameter in the request URL.
name | Configuration parameter name (navigateAction1.Url, loopAction1.UrlList, loopAction1.TextList, etc.) | Define this parameter in the request URL.

Response

Task parameter values (or arrays of values) and the request status

Response Content Type

application/json, text/json

Example
{
  "data": [
    "http://www.octoparse.com/",
    "http://www.skieer.com/"
  ],
  "error": "success",
  "error_Description": "Action Success"
}

2.2.4. Update Task Parameters

Use this method to update task parameters. Currently it can only update the URL in the ‘Go To The Web Page’ action, the text value in the ‘Enter Text’ action, and the text list/URL list in the ‘Loop Item’ action.

Note: To update text list/URL list values, use the form ['text1','text2','text3',...,'textN'] to represent N items.

Request

POST api/task/UpdateTaskRule

Parameters

Parameter | Description | Remark
ruleParaInfo | Task parameters | Define this parameter in the request body.

Request Content Type

application/json, text/json

Example
{
  "taskId": "337fd7d7-aded-4081-9104-2b551161ccc8",
  "name": "loopAction2.TextList",
  "value": "['Octoparse','Skieer Information']"
}

Response

Whether the task parameter was updated successfully.

Response Content Type

application/json, text/json

Example
{
  "error": "success",
  "error_Description": "Action Success"
}
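The list format in the note above is a string that looks like a list, not a JSON array. A request body for this method could be assembled as sketched below. The helper names are hypothetical, and because this document does not say how a single quote inside an item should be escaped, the sketch rejects such items rather than guessing.

```python
import json


def format_list_value(items) -> str:
    """Render items in the ['text1','text2',...] form used by this method."""
    for item in items:
        if "'" in item:
            # Escaping rules are not documented, so refuse instead of guessing.
            raise ValueError(f"item contains a single quote: {item!r}")
    return "[" + ",".join(f"'{item}'" for item in items) + "]"


def build_update_body(task_id: str, name: str, value) -> str:
    """Build the JSON request body; lists are converted to the string form."""
    if not isinstance(value, str):
        value = format_list_value(value)
    return json.dumps({"taskId": task_id, "name": name, "value": value})


body = build_update_body(
    "337fd7d7-aded-4081-9104-2b551161ccc8",
    "loopAction2.TextList",
    ["Octoparse", "Skieer Information"],
)
```

A plain string value, such as a single URL for navigateAction1.Url, is passed through unchanged.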

2.2.5. Adding URL/Text to a Loop

Use this method to add new URLs/text to an existing loop.

Note: To pass text list/URL list values, use the form ['text1','text2','text3',...,'textN'] to represent N items.

Request

POST api/task/AddUrlOrTextToTask

Parameters

Parameter | Description | Remark
ruleParaInfo | Parameters of any existing loop | Define this parameter in the request body.

Request Content Type

application/json, text/json

Example
{
  "taskId": "4adf489b-f883-43fa-b958-0cfde945ddb7",
  "name": "navigateAction1.Url",
  "value": "http://www.octoparse.com/"
}

Response

Whether the new parameter values were added successfully.

Response Content Type

application/json, text/json

Example
{
  "error": "success",
  "error_Description": "Action Success"
}

2.2.6. Start Running Task

Request

POST api/task/StartTask?taskId={taskId}

Parameters

Parameter | Description | Remark
taskId | Task ID | Define this parameter in the request URL.

Response

Status Codes ("data" parameter in response content): 1 = Task starts successfully, 2 = Task is running, 5 = Task Configuration is incorrect, 6 = Permission denied, 100 = Other Error

Response Content Type

application/json, text/json

Example
{
  "data": 1,
  "error": "success",
  "error_Description": "Action Success"
}
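The "data" value returned by this method can be interpreted with a small lookup, as in the sketch below. The descriptions come from the status code list above; the function names are hypothetical, and treating code 2 (task is running) as an acceptable outcome is a design choice of this sketch, not something the API prescribes.

```python
# Start result codes from section 2.2.6.
START_RESULTS = {
    1: "task started successfully",
    2: "task is running",
    5: "task configuration is incorrect",
    6: "permission denied",
    100: "other error",
}


def describe_start(response: dict) -> str:
    """Return a readable description of the "data" code in a start response."""
    code = response.get("data")
    return START_RESULTS.get(code, f"unknown code {code}")


def is_running_after_start(response: dict) -> bool:
    """True if the task was started by this call (1) or was already running (2)."""
    return response.get("error") == "success" and response.get("data") in (1, 2)


sample = {"data": 1, "error": "success", "error_Description": "Action Success"}
description = describe_start(sample)
```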

2.2.7. Stop Running Task

Request

POST api/task/StopTask?taskId={taskId}

Parameters

Parameter | Description | Remark
taskId | Task ID | Define this parameter in the request URL.

Response

Whether the task was stopped successfully.

Response Content Type

application/json, text/json

Example
{
  "error": "success",
  "error_Description": "Action Success"
}

2.2.8. Clear Data

Request

POST api/task/RemoveDataByTaskId?taskId={taskId}

Parameters

Parameter | Description | Remark
taskId | Task ID | Define this parameter in the request URL.

Response

Whether the data was cleared successfully.

Response Content Type

application/json, text/json

Example
{
  "error": "success",
  "error_Description": "Action Success"
}

2.3. Export Data

2.3.1. Export Non-exported Data

This returns non-exported data. After the export, the data is tagged status = exporting (rather than status = exported), so the same set of data can be exported multiple times using this method. Once you have confirmed receipt of the data and wish to update its status to ‘exported’, follow section 2.3.2 to update the status.

Note: If the export is interrupted (e.g. by a network failure), export the data set again using this method.

Request

GET api/notexportdata/gettop?taskId={taskId}&size={size}

Parameters

Parameter | Description | Remark
taskId | Task ID | Define this parameter in the request URL.
size | Number of data rows to return (from 1 to 1000) | Define this parameter in the request URL.

Response

Data and request status

Response Content Type

application/json, text/json

Example
{
  "data": {
    "total": 100000,
    "currentTotal": 4,
    "dataList": [
      {
        "state": "Texas",
        "city": "Plano"
      },
      {
        "state": "Texas",
        "city": "Houston"
      },
      {
        "state": "Texas",
        "city": "Austin"
      },
      {
        "state": "Texas",
        "city": "Arlington"
      }
    ]
  },
  "error": "success",
  "error_Description": "Action Success"
}

2.3.2. Update Data Status

This updates data status from ‘exporting’ to ‘exported’.

Note: Before using this method, confirm that the data exported via ‘Export Non-exported Data’ (api/notexportdata/gettop) has been retrieved successfully.

Request

POST api/notexportdata/update?taskId={taskId}

Parameters

Parameter | Description | Remark
taskId | Task ID | Define this parameter in the request URL.

Response

Whether the data status was updated successfully.

Response Content Type

application/json, text/json

Example
{
  "error": "success",
  "error_Description": "Action Success"
}
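Sections 2.3.1 and 2.3.2 together form a two-step cycle: fetch a batch, then confirm it. A possible loop is sketched below. The HTTP calls are passed in as functions so the sketch stays independent of any HTTP library; export_all and its parameter names are hypothetical, and it assumes that "dataList" is empty or missing once no non-exported data remains.

```python
def export_all(fetch_top, mark_exported, handle_rows, size=1000):
    """Export all non-exported rows in batches and return how many were handled.

    fetch_top(size)   -> parsed "data" object of api/notexportdata/gettop
    mark_exported()   -> calls api/notexportdata/update for the same task
    handle_rows(rows) -> stores the rows on the caller's side
    """
    if not 1 <= size <= 1000:
        raise ValueError("size must be between 1 and 1000")
    exported = 0
    while True:
        batch = fetch_top(size)
        rows = batch.get("dataList") or []
        if not rows:  # assumption: nothing left to export
            return exported
        # Store the rows first. If this raises, the batch is never confirmed
        # and can be fetched again, as the note in 2.3.1 describes.
        handle_rows(rows)
        mark_exported()
        exported += len(rows)
```

Confirming only after handle_rows returns keeps a failed or interrupted batch in the 'exporting' state, so it is returned again on the next call.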

2.4. Get Data

2.4.1. Get Data by Offset

To get data, the request must include the task ID, offset and size. For the initial request, set offset to 0 (offset=0) and size to a value in [1,1000]. Each response returns a new offset (any value greater than 0), which must be used for the next request. For example, if a task has 1000 data rows, a request with offset = 0 and size = 100 returns the first 100 rows together with an offset X (X may be any number greater than or equal to 100, not necessarily 100). The second request should use offset = X and size = 100 to get the next 100 rows (rows 101 to 200) and a new offset for the request that follows.

Note: This method only retrieves data and does not affect the data status. (Non-exported data remains non-exported.)

Request

GET api/alldata/GetDataOfTaskByOffset?taskId={taskId}&offset={offset}&size={size}

Parameters

Parameter | Description | Remark
taskId | Task ID | Define this parameter in the request URL.
offset | If offset is less than or equal to 0, data will be returned starting from the first row. | Define this parameter in the request URL.
size | Number of data rows to return (from 1 to 1000) | Define this parameter in the request URL.

Response

Data and request status

Response Content Type

application/json, text/json

Example
{
  "data": {
    "offset": 4,
    "total": 100000,
    "restTotal": 99996,
    "dataList": [
      {
        "state": "Texas",
        "city": "Plano"
      },
      {
        "state": "Texas",
        "city": "Houston"
      },
      {
        "state": "Texas",
        "city": "Austin"
      },
      {
        "state": "Texas",
        "city": "Arlington"
      }
    ]
  },
  "error": "success",
  "error_Description": "Action Success"
}
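The offset handling described in this section can be sketched as a generator that always feeds the returned offset into the next request. The HTTP call is injected as fetch_page so the sketch does not depend on a particular HTTP library. The name iter_rows is hypothetical, and stopping on an empty "dataList" or on "restTotal" equal to 0 is an assumption about how the end of the data is signaled.

```python
def iter_rows(fetch_page, size=100):
    """Yield every data row of a task, one page at a time.

    fetch_page(offset, size) -> parsed "data" object of
    api/alldata/GetDataOfTaskByOffset for the task in question.
    """
    if not 1 <= size <= 1000:
        raise ValueError("size must be between 1 and 1000")
    offset = 0  # the first request always starts at offset 0
    while True:
        page = fetch_page(offset, size)
        rows = page.get("dataList") or []
        if not rows:  # assumption: an empty page means no more data
            return
        yield from rows
        if page.get("restTotal") == 0:  # assumption: nothing remains
            return
        # Treat the returned offset as opaque; never compute it from row counts.
        offset = page["offset"]
```

Because the function is a generator, the size check runs when iteration starts, and rows can be processed as they arrive without holding the full data set in memory.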

3. Reference

3.1. HTTP Status Code

Whenever an error code is returned, refer to the following status codes to resolve the problem.

HTTP Status Code | Inner Status Code | Description
200 | ok | Operation successful.
400 | invalid_grant | Incorrect username or password.
400 | unsupported_grant_type | Incorrect POST format. The correct format is username={username}&password={password}&grant_type=password.
401 | unauthorized | Access Token is invalid because it is expired or unauthorized. Please get a new token.
403 | user_not_allowed | Permission denied. Please upgrade to Standard Plan to use Data API; upgrade to Professional Plan to use Advanced API.
404 | not_found | The HTTP request is not recognized. Please request with the correct URL.
405 | method_not_allowed | The HTTP method is not supported. Please use the method supported by the interface.
429 | quota_exceeded | The request frequency has exceeded the limit. Please reduce access frequency to stay within 20 requests per second.
503 | service_unavailable | The server is temporarily unavailable. Please try again later.
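A client can turn the status codes above into a small decision function. The sketch below is one possible policy, not a rule set by the API: the function name next_step and the returned labels are hypothetical, and retrying on 429 and 503 simply follows the "reduce access frequency" and "try again later" advice in the descriptions.

```python
def next_step(status: int) -> str:
    """Suggest what a client might do after receiving an HTTP status code."""
    if status == 200:
        return "ok"
    if status == 401:
        # Token expired or invalid: obtain a new one (see section 1.4).
        return "get_new_token"
    if status in (429, 503):
        # Rate limited or temporarily unavailable: wait, then retry.
        return "retry_later"
    # 400, 403, 404, 405 and anything else need a fix on the caller's side.
    return "fail"
```

Retrying is only useful for conditions that can clear on their own; the remaining codes point to a problem with credentials, plan level, URL or HTTP method that a retry will not fix.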

3.2. Example Code

Example Code:

C#: ApiSamples/Code/CSharp/

Python: ApiSamples/Code/Python/

(Other languages will be coming out soon.)

3.3. Terms and Conditions

Terms and Conditions