1. Overview
Before using the Octoparse Advanced API, you need a Professional account with at least one runnable task set up. (Haven’t got an account? Sign up here.) By connecting to the Octoparse API, you can retrieve extracted data and task information, and with the Advanced API you can also control tasks, so that data extraction can be coordinated with your own application.
1.2. Contact
Contact: Octoparse support team
Email: support@octoparse.com
1.3. URI Standard
All requests should be URL encoded with the base URL:
https://advancedapi.octoparse.com/
For example, a request for 'Get Data by Offset' should be: GET https://advancedapi.octoparse.com/api/alldata/GetDataOfTaskByOffset?taskId={taskId}&offset={offset}&size={size}
Note: {xxxx} in this document represents a placeholder that you need to replace with a real value. For example, if your task ID is abc, offset is 1 and size is 10, then the URL should be https://advancedapi.octoparse.com/api/alldata/GetDataOfTaskByOffset?taskId=abc&offset=1&size=10.
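The placeholder substitution and URL encoding described above can be sketched with the Python standard library. The helper name `build_url` is an assumption made for illustration; it is not part of the Octoparse API.

```python
from urllib.parse import urlencode, urljoin

BASE_URL = "https://advancedapi.octoparse.com/"


def build_url(path, **params):
    """Join the base URL with a path and URL-encode the query parameters."""
    url = urljoin(BASE_URL, path.lstrip("/"))
    if params:
        url += "?" + urlencode(params)
    return url


# Reproduces the example from the note above.
example_url = build_url(
    "api/alldata/GetDataOfTaskByOffset", taskId="abc", offset=1, size=10
)
print(example_url)
```

Passing the values through `urlencode` means that characters such as spaces or `&` inside a value are escaped rather than breaking the query string.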
1.4. Obtain OAuth2.0 Token
Before getting access to the Octoparse API, you need to obtain an Access Token based on OAuth2.0.
1.4.1. Obtain a New Token
You will need your username and password to get a new Access Token.
Request
POST https://advancedapi.octoparse.com/token
Parameters
username={userName}&password={password}&grant_type=password
Request Content Type
application/x-www-form-urlencoded
Response
Access Token
Response Content Type
application/json, text/json
Example
{
  "access_token": "ABCD1234",       // Access permission
  "token_type": "bearer",           // Token type
  "expires_in": 86399,              // Access Token expiration time in seconds (it is recommended to use the same token repeatedly within this time frame)
  "refresh_token": "refresh_token"  // Used to refresh the Access Token
}
The Access Token ('access_token') is required for every API method. Add it to the HTTP header in the following format.
HeaderName: Authorization Value: bearer {access_token}
Note: There is a space between 'bearer' and the Access Token. For example, if the Access Token is AA11BB22...CC33, the header should be 'Authorization: bearer AA11BB22...CC33'. The Access Token has an expiration time, and it is recommended to reuse the same token until it expires.
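A minimal sketch of the token request and the resulting header, using only the Python standard library. The function names are assumptions for illustration, and `fetch_token` needs network access and valid credentials, so it is defined but not called here.

```python
import json
import urllib.parse
import urllib.request

TOKEN_URL = "https://advancedapi.octoparse.com/token"


def build_token_request(username, password):
    """Build (but do not send) the POST request for a new Access Token."""
    body = urllib.parse.urlencode(
        {"username": username, "password": password, "grant_type": "password"}
    ).encode("ascii")
    return urllib.request.Request(
        TOKEN_URL,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


def auth_header(token_response):
    """Turn a parsed token response into the Authorization header."""
    return {"Authorization": "bearer " + token_response["access_token"]}


def fetch_token(username, password):
    """Send the request and return the parsed JSON response (needs network)."""
    with urllib.request.urlopen(build_token_request(username, password)) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

The dictionary returned by `auth_header` can be merged into the headers of any later request.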
1.4.2. Refresh Token
Once an Access Token expires, you can obtain a new one with the 'refresh_token'. Using the refresh token is more secure than making a new request with your username and password.
Note: Each 'refresh_token' can only be used once. The new 'refresh token' returned from the current request should be used for the next request.
Request
POST https://advancedapi.octoparse.com/token
Parameters
refresh_token={refresh_token}&grant_type=refresh_token
Request Content Type
application/x-www-form-urlencoded
Response
Access Token
Note: The response HTTP status code should be ‘200’. If not, please refer to HTTP Status Code to solve the problem.
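Because each refresh token can be used only once, a client has to keep the newest one after every token response. The sketch below shows one way to track that state. The class name `TokenState` and the 60-second safety margin are assumptions, not part of the API.

```python
import time
import urllib.parse


class TokenState:
    """Holds the latest token response and the time at which it expires."""

    def __init__(self, token_response, now=None):
        self.update(token_response, now)

    def update(self, token_response, now=None):
        """Store a new token response (from a login or a refresh)."""
        now = time.time() if now is None else now
        self.access_token = token_response["access_token"]
        # Each refresh_token is single-use, so always keep the newest one.
        self.refresh_token = token_response["refresh_token"]
        self.expires_at = now + token_response["expires_in"]

    def is_expired(self, now=None, margin=60):
        """True once the token is within `margin` seconds of expiring."""
        now = time.time() if now is None else now
        return now >= self.expires_at - margin

    def refresh_body(self):
        """Form-encoded body for the refresh request to /token."""
        return urllib.parse.urlencode(
            {"refresh_token": self.refresh_token, "grant_type": "refresh_token"}
        )
```

A caller would check `is_expired()` before each request, POST `refresh_body()` to the token endpoint when it returns true, and pass the parsed response to `update()`.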
2. Instruction
Octoparse limits API usage to 20 requests/second. Please reduce access frequency if you receive status code ‘429’.
Note: Octoparse uses a leaky bucket algorithm to limit API access frequency. The maximum number of requests is 100 within any five-second time interval; no more requests will be taken thereafter until the next 5-second time interval.
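To stay under the limit described above, a client can throttle itself before sending. The following is a client-side sliding-window sketch, not the server's leaky bucket implementation. The class name `WindowLimiter` is an assumption, and the clock and sleep functions are injectable so the logic can be tested without waiting.

```python
import time
from collections import deque


class WindowLimiter:
    """Allow at most `max_calls` within any window of `period` seconds."""

    def __init__(self, max_calls=100, period=5.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock
        self.sleep = sleep
        self.calls = deque()  # timestamps of recent requests, oldest first

    def acquire(self):
        """Block until one more request may be sent, then record it."""
        while True:
            now = self.clock()
            # Drop timestamps that have left the window.
            while self.calls and now - self.calls[0] >= self.period:
                self.calls.popleft()
            if len(self.calls) < self.max_calls:
                self.calls.append(now)
                return
            # Window is full: wait until the oldest request leaves it.
            self.sleep(self.period - (now - self.calls[0]))
```

Calling `acquire()` immediately before each API request keeps the client at or below 100 requests per five seconds; a response with status code 429 should still be handled, since other clients on the same account may share the quota.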
Unusual Status
The response HTTP status code should be ‘200’. If not, please refer to HTTP Status Code to solve the problem.
2.1. Get Task Group Information
2.1.1. List All Task Groups
Request
Response
JSON-formatted text containing task group information and request status
Response Content Type
application/json, text/json
Example
{ "data": [ { "taskGroupId": 1, "taskGroupName": "Example Task Group 1" }, { "taskGroupId": 2, "taskGroupName": "Example Task Group 2" } ], "error": "success", "error_Description": "Operation successes." }
2.2. Manage Task
2.2.1. List All Tasks in a Group
Request
Parameters
Parameter | Description | Remark |
---|---|---|
taskGroupId | Task Group ID | Please define the parameter in request URL. |
Response
JSON-formatted text including the task ID (taskId), task name (taskName), user ID (creationUserId) and request status.
Response Content Type
application/json, text/json
Example
{ "data": [ { "taskId": "337fd7d7-aded-4081-9104-2b551161ccc8", "taskName": "Example Task 1", "creationUserId": "5d1e4b3c-645c-44ab-ac0e-bfa9ad600ece" }, { "taskId": "4adf489b-f883-43fa-b958-0cfde945ddb7", "taskName": "Example Task 2", "creationUserId": "5d1e4b3c-645c-44ab-ac0e-bfa9ad600ece" } ], "error": "success", "error_Description": "Operation successes." }
2.2.2. Get Task Status
This returns the status of multiple tasks.
Request
Parameters
Parameter | Description | Remark |
---|---|---|
taskIdList | JSON-formatted task ID list | Please define the parameter in request body. |
Request Content Type
application/json, text/json
Example
{ "taskIdList": [ "337fd7d7-aded-4081-9104-2b551161ccc8", "4adf489b-f883-43fa-b958-0cfde945ddb7" ] }
Response
Task status code: 0 = Running, 1 = Stopped, 2 = Completed, 3 = Waiting, 5 = Never Run
Response Content Type
application/json, text/json
Example
{ "data": [ { "taskId": "337fd7d7-aded-4081-9104-2b551161ccc8", "taskName": "Example Task 1", "status": 1 }, { "taskId": "4adf489b-f883-43fa-b958-0cfde945ddb7", "taskName": "Example Task 2", "status": 2 } ], "error": "success", "error_Description": "Operation successes." }
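The numeric status codes in the response can be translated into readable names with a small lookup. This sketch works on the response text only; it does not call the API. Treating any `error` value other than "success" as a failure is an assumption based on the examples in this document.

```python
import json

# Status codes as listed in the Response description above.
TASK_STATUS = {
    0: "Running",
    1: "Stopped",
    2: "Completed",
    3: "Waiting",
    5: "Never Run",
}


def summarize_status(response_text):
    """Map each taskId in a status response to a readable status name."""
    payload = json.loads(response_text)
    if payload.get("error") != "success":
        raise RuntimeError(payload.get("error_Description", "unknown error"))
    return {
        item["taskId"]: TASK_STATUS.get(
            item["status"], "Unknown (%s)" % item["status"]
        )
        for item in payload["data"]
    }
```

Codes that are not in the list above are reported as "Unknown" rather than raising, so that a new status value does not break the caller.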
2.2.3. Get Task Parameters
This returns the different parameters for a specific task, for example, the URL from ‘Go To The Web Page’ action, text value from ‘Enter Text’ action and text list/URL list from ‘Loop Item’ action.
Request
Parameters
Parameter | Description | Remark |
---|---|---|
taskId | Task ID | Please define the parameter in request URL. |
name | Configuration parameter name (navigateAction1.Url, loopAction1.UrlList, loopAction1.TextList, etc.) | Please define the parameter in request URL. |
Response
Task parameters values (or value arrays) and request status
Response Content Type
application/json, text/json
Example
{ "data": [ "http://www.octoparse.com/", "http://www.skieer.com/" ], "error": "success", "error_Description": "Operation successes." }
2.2.4. Update Task Parameters
Use this method to update task parameters. Currently it can only update the URL in the 'Go To The Web Page' action, the text value in the 'Enter Text' action, and the text list/URL list in the 'Loop Item' action.
Note: To update text list/URL list values, use ['text1', 'text2', 'text3', 'textN'] to represent N items.
Request
Parameters
Parameter | Description | Remark |
---|---|---|
ruleParaInfo | Task parameters | Please define the parameter in request body. |
Request Content Type
application/json, text/json
Example
{ "taskId": "337fd7d7-aded-4081-9104-2b551161ccc8", "name": "loopAction2.TextList", "value": [ "Octoparse", "Skieer Information" ] }
Response
Whether the task parameter was updated successfully.
Response Content Type
application/json, text/json
Example
{ "error": "success", "error_Description": "Operation successes." }
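The request body shown in the example above can be produced with a small helper. The function name `build_rule_para_info` is an assumption. The list form follows the example in this section; passing a single string for a non-list parameter is also an assumption, since this document shows only the list form.

```python
import json


def build_rule_para_info(task_id, name, value):
    """Return the JSON request body for updating one task parameter."""
    if isinstance(value, (list, tuple)):
        # Text lists and URL lists are sent as a JSON array of strings.
        value = [str(item) for item in value]
    return json.dumps({"taskId": task_id, "name": name, "value": value})


body = build_rule_para_info(
    "337fd7d7-aded-4081-9104-2b551161ccc8",
    "loopAction2.TextList",
    ["text1", "text2", "text3"],
)
print(body)
```

The resulting string would be sent as the request body with the content type application/json.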
2.2.5. Adding URL/Text to a Loop
Use this method to add new URLs/text to an existing loop.
Note: To add text list/URL list values, use ['text1', 'text2', 'text3', 'textN'] to represent N items.
Request
Parameters
Parameter | Description | Remark |
---|---|---|
ruleParaInfo | Parameters of any existing loop | Please define the parameter in request body. |
Request Content Type
application/json, text/json
Example
{ "taskId": "4adf489b-f883-43fa-b958-0cfde945ddb7", "name": "loopAction1.UrlList", "value": [ "http://www.octoparse.com/", "http://www.skieer.com/" ] }
Response
Whether the new parameter values were added successfully.
Response Content Type
application/json, text/json
Example
{ "error": "success", "error_Description": "Operation successes." }
2.2.6. Start Running Task
Request
Parameters
Parameter | Description | Remark |
---|---|---|
taskId | Task ID | Please define the parameter in request URL. |
Response
Status Codes ("data" parameter in response content): 1 = Task starts successfully, 2 = Task is running, 5 = Task Configuration is incorrect, 6 = Permission denied, 100 = Other Error
Response Content Type
application/json, text/json
Example
{ "data": 1, "error": "success", "error_Description": "Operation successes." }
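The "data" value in the start response can be checked against the codes listed above. This sketch works on an already parsed response. Treating codes 1 and 2 as acceptable outcomes (the task is running either way) is an assumption, and the function name is illustrative.

```python
# Codes as listed in the Response description above.
START_RESULT = {
    1: "Task starts successfully",
    2: "Task is running",
    5: "Task configuration is incorrect",
    6: "Permission denied",
    100: "Other error",
}


def check_start_result(payload):
    """Return (ok, message) for a parsed 'Start Running Task' response."""
    code = payload["data"]
    message = START_RESULT.get(code, "Unknown code %s" % code)
    # Assumption: 1 (started) and 2 (already running) both mean the task is running.
    return code in (1, 2), message


ok, message = check_start_result(
    {"data": 1, "error": "success", "error_Description": "Operation successes."}
)
print(ok, message)
```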
2.2.7. Stop Running Task
Request
Parameters
Parameter | Description | Remark |
---|---|---|
taskId | Task ID | Please define the parameter in request URL. |
Response
Whether the task was stopped successfully.
Response Content Type
application/json, text/json
Example
{ "error": "success", "error_Description": "Operation successes." }
2.2.8. Clear Data
Request
Parameters
Parameter | Description | Remark |
---|---|---|
taskId | Task ID | Please define the parameter in request URL. |
Response
Whether the data was cleared successfully.
Response Content Type
application/json, text/json
Example
{ "error": "success", "error_Description": "Operation successes." }
2.3. Export Data
2.3.1. Export Non-exported Data
This returns non-exported data. After the export, the data is tagged status = exporting (instead of status = exported), so the same set of data can be exported multiple times using this method. Once you have confirmed receipt of the data and want to update its status to 'exported', follow 2.3.2 to update the status.
Note: If the export is interrupted (e.g. by a network interruption), re-export the data set using this method.
Request
Parameters
Parameter | Description | Remark |
---|---|---|
taskId | Task ID | Please define the parameter in request URL. |
size | The number of data rows (range from 1 to 1000) | Please define the parameter in request URL. |
Response
Data and request status
Response Content Type
application/json, text/json
Example
{ "data": { "total": 100000, "currentTotal": 4, "dataList": [ { "State": "Texas", "City": "Plano" }, { "State": "Texas", "City": "Houston" }, { "State": "Texas", "City": "Austin" }, { "State": "Texas", "City": "Arlington" } ] }, "error": "success", "error_Description": "Operation successes." }
2.3.2. Update Data Status
This updates data status from 'exporting' to 'exported'.
Note: Before using this method, confirm that the data exported via the export method in 2.3.1 (api/notexportdata/gettop) has been retrieved successfully.
Request
Parameters
Parameter | Description | Remark |
---|---|---|
taskId | Task ID | Please define the parameter in request URL. |
Response
Whether the data status was updated successfully.
Response Content Type
application/json, text/json
Example
{ "error": "success", "error_Description": "Operation successes." }
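The export and status-update methods in 2.3.1 and 2.3.2 form a two-step cycle: export a batch, store it safely, then mark it as exported. The sketch below shows that cycle with the HTTP calls injected as functions, so no endpoint names are assumed. The names `fetch_batch`, `mark_exported` and `save` are hypothetical, and stopping on an empty `dataList` is an assumption about how the API signals that nothing is left.

```python
def export_all(fetch_batch, mark_exported, save, size=1000):
    """Export non-exported rows batch by batch, confirming each batch.

    fetch_batch(size) -> parsed response of the export method (2.3.1)
    mark_exported()   -> calls the status update method (2.3.2)
    save(rows)        -> stores the rows on the caller's side
    """
    if not 1 <= size <= 1000:
        raise ValueError("size must be between 1 and 1000")
    exported = 0
    while True:
        rows = fetch_batch(size)["data"].get("dataList") or []
        if not rows:
            return exported
        # Store first: if this raises, the batch stays 'exporting'
        # and can be exported again.
        save(rows)
        # Only now move the batch from 'exporting' to 'exported'.
        mark_exported()
        exported += len(rows)
```

Saving before marking means an interruption never loses data: the worst case is that the same batch is exported a second time.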
2.4. Get Data
2.4.1. Get Data by Offset
To get data, the request must include the offset, size and task ID parameters. For the initial request, use offset = 0 and a size between 1 and 1000. Each response returns an offset (which can be any value greater than 0) that should be used for the next request. For example, if a task has 1000 data rows, a request with offset = 0 and size = 100 returns the first 100 rows of data and an offset X (X can be any number greater than or equal to 100). The second request should use the returned offset (offset = X, size = 100) to get the next 100 rows (rows 101 to 200), along with the new offset to use for the request that follows.
Note: This method only retrieves data and does not affect the data status. (Non-exported data remains non-exported.)
Request
Parameters
Parameter | Description | Remark |
---|---|---|
taskId | Task ID | Please define the parameter in request URL. |
offset | If offset is less than or equal to 0, data will be returned starting from the first row. | Please define the parameter in request URL. |
size | The number of data rows that will be returned (range from 1 to 1000) | Please define the parameter in request URL. |
Response
Data and request status
Response Content Type
application/json, text/json
Example
{ "data": { "offset": 4, "total": 100000, "restTotal": 99996, "dataList": [ { "State": "Texas", "City": "Plano" }, { "State": "Texas", "City": "Houston" }, { "State": "Texas", "City": "Austin" }, { "State": "Texas", "City": "Arlington" } ] }, "error": "success", "error_Description": "Operation successes." }
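Since the returned offset is opaque, a client should feed it back unchanged rather than computing its own. The following sketch pages through a task with the HTTP call injected as `fetch_page`, a hypothetical function that requests api/alldata/GetDataOfTaskByOffset and returns the parsed response. Stopping when `restTotal` reaches 0 or `dataList` is empty is an assumption inferred from the example response.

```python
def iter_rows(fetch_page, size=100):
    """Yield every row of a task by following the offset from each response."""
    if not 1 <= size <= 1000:
        raise ValueError("size must be between 1 and 1000")
    offset = 0  # the initial request always starts at offset 0
    while True:
        data = fetch_page(offset, size)["data"]
        rows = data.get("dataList") or []
        for row in rows:
            yield row
        # Assumption: no rows, or restTotal of 0, means everything was read.
        if not rows or data.get("restTotal", 1) <= 0:
            return
        # Use the offset exactly as returned; do not compute offset + size.
        offset = data["offset"]
```

Because this is a generator, rows are processed as they arrive and only one page is held in memory at a time.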
3. Reference
3.1. HTTP Status Code
Whenever an error code is returned, refer to the following status codes to solve the problem.
HTTP Status Code | Inner Status Code | Description |
---|---|---|
200 | ok | Operation successful. |
400 | invalid_grant | Incorrect username or password. |
400 | unsupported_grant_type | Incorrect POST format. The correct format should be username={username}&password={password}&grant_type=password. |
401 | unauthorized | Access Token is invalid because it is expired or unauthorized. Please get a new token. |
403 | user_not_allowed | Permission denied. Please upgrade to Standard Plan to use Data API; upgrade to Professional Plan to use Advanced API. |
404 | not_found | The HTTP request is not recognized. Please request with the correct URL. |
405 | method_not_allowed | The HTTP method is not supported. Please use the method supported by the interface. |
429 | quota_exceeded | The request frequency has exceeded the limit. Please reduce access frequency to less than 20 times per second. |
503 | service_unavailable | The server is temporarily unavailable. Please try again later. |
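A client can use the status codes above to decide what to do next. The grouping below is one possible reading of the table, not behavior prescribed by the API, and the action names are illustrative.

```python
def next_action(http_status):
    """Suggest how a client might react to an HTTP status code."""
    if http_status == 200:
        return "ok"
    if http_status == 401:
        # Token expired or invalid: obtain a new token, then retry.
        return "get_new_token"
    if http_status in (429, 503):
        # Rate limit hit or server temporarily unavailable: wait, then retry.
        return "retry_later"
    # 400, 403, 404, 405 and anything else: the request itself must change.
    return "fix_request"


for status in (200, 401, 429, 404):
    print(status, next_action(status))
```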
3.2. Example Code
Example Code:
Python: ApiSamples/Code/Python/
Java: ApiSamples/Code/Java/
PHP: ApiSamples/Code/Php/
(Other languages will be coming out soon.)