Browser Emulation Using PhantomJs Cloud
Problem: Some sites have messy markup or render their content with JavaScript, which makes them nearly impossible to scrape with the Website Agent alone (described in issue #888).
Solution: Use PhantomJs Cloud to emulate the browser and return a fully rendered DOM. This allows the Website Agent to then properly scrape dynamic content from JavaScript-heavy pages.
There are two ways to generate URLs for PhantomJs Cloud:
- PhantomJs Cloud Agent (simpler but limited)
- Manually
Before you begin, sign up for an account at https://phantomjscloud.com/. Then copy your API key and add it to your Huginn credentials:
[
{
"id": 1,
"user_id": 1,
"credential_name": "phantomjs_cloud",
"credential_value": "YOUR-KEY",
"mode": "text"
}
]
Note: The PhantomJs Cloud Agent provides only a limited subset of the most commonly used options.
With the PhantomJs Cloud Agent, the workflow to fetch the page is as follows:
- RssAgent - provides example URLs to fetch
- PhantomJsCloudAgent - to set up PhantomJs Cloud options
- WebsiteAgent - to fetch the page using PhantomJs Cloud
- DataOutputAgent - to output RSS
Full scenario can be found here
Name: PhantomJS Cloud - In - RSS
{
"expected_update_period_in_days": "5",
"clean": "true",
"url": "http://xkcd.com/rss.xml"
}
Name: PhantomJS Cloud - Process - Options
Event sources: PhantomJS Cloud - In - RSS
Propagate immediately: Yes
{
"mode": "clean",
"api_key": "{% credential phantomjs_cloud %}",
"url": "{{url}}",
"render_type": "html",
"output_as_json_radio": "false",
"output_as_json": "false",
"ignore_images_radio": "false",
"ignore_images": "false",
"user_agent": "Mozilla/5.0 (BlackBerry; U; BlackBerry 9900; en) AppleWebKit/534.11+ (KHTML, like Gecko) Version/7.1.0.346 Mobile Safari/534.11+",
"wait_interval": "1000"
}
Name: PhantomJS Cloud - Process - Fetch Page
Event sources: PhantomJS Cloud - Process - Options
Propagate immediately: Yes
{
"expected_update_period_in_days": "2",
"url_from_event": "{{url}}",
"type": "html",
"mode": "on_change",
"extract": {
"title": {
"css": "title",
"value": "normalize-space(.)"
},
"body": {
"css": "body #comic",
"value": "./node()"
}
}
}
Name: PhantomJS Cloud - Out - RSS
Event sources: PhantomJS Cloud - Process - Fetch Page
Propagate immediately: Yes
{
"secrets": [
"phantom"
],
"expected_receive_period_in_days": 2,
"template": {
"title": "XKCD comics as a feed",
"description": "This is a feed of recent XKCD comics, generated by Huginn",
"item": {
"title": "{{title}}",
"description": "{{body}}"
}
}
}
When generating the URL manually, the workflow to fetch the page is as follows:
- RssAgent - provides example URLs to fetch
- EventFormattingAgent - to set up PhantomJs Cloud options
- JavascriptAgent - to properly encode the [REQUEST-JSON] using encodeURIComponent()
- WebsiteAgent - to fetch the page using PhantomJs Cloud
- DataOutputAgent - to output RSS
Full scenario can be found here
Name: PhantomJS Cloud - In - RSS
{
"expected_update_period_in_days": "5",
"clean": "true",
"url": "http://xkcd.com/rss.xml"
}
Name: PhantomJS Cloud - Process - Format
Event sources: PhantomJS Cloud - In - RSS
Propagate immediately: Yes
{
"instructions": {
"message": {
"url": "{{url}}",
"renderType": "html",
"requestSettings": {
"userAgent": "Mozilla/5.0 (BlackBerry; U; BlackBerry 9900; en) AppleWebKit/534.11+ (KHTML, like Gecko) Version/7.1.0.346 Mobile Safari/534.11+"
}
}
},
"mode": "clean"
}
For more options, refer to the official PhantomJs Cloud API documentation.
Name: PhantomJS Cloud - Process - JS Escape
Event sources: PhantomJS Cloud - Process - Format
Propagate immediately: Yes
{
"language": "JavaScript",
"code": "Agent.receive = function() {\r\n var events = this.incomingEvents();\r\n for(var i = 0; i < events.length; i++) {\r\n var js = JSON.stringify(events[i].payload.message);\r\n this.log('Message to escape: ' + js);\r\n this.createEvent({ 'url': encodeURIComponent(js) });\r\n var callCount = this.memory('callCount') || 0;\r\n this.memory('callCount', callCount + 1);\r\n }\r\n}",
"expected_receive_period_in_days": "2",
"expected_update_period_in_days": "2"
}
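For readability, here is the `code` string from the options above in unescaped form, wrapped in a small stand-in for the JavascriptAgent runtime so that it can be run outside Huginn. The `makeMockAgent` helper and the sample payload are hypothetical and exist only for illustration; inside Huginn, `incomingEvents`, `createEvent`, `memory`, and `log` are provided by the agent itself.

```javascript
// Hypothetical stand-in for the JavascriptAgent runtime (not part of Huginn).
function makeMockAgent(incoming) {
  const store = {};
  return {
    created: [],
    logs: [],
    incomingEvents() { return incoming; },
    createEvent(payload) { this.created.push(payload); },
    log(message) { this.logs.push(message); },
    // One argument reads a value, two arguments write it.
    memory(key, value) {
      if (arguments.length === 2) { store[key] = value; }
      return store[key];
    }
  };
}

// Sample event shaped like the output of the Format agent (illustrative values).
const Agent = makeMockAgent([
  { payload: { message: { url: "http://xkcd.com/1/", renderType: "html" } } }
]);

// The agent code from the "code" option, unescaped.
Agent.receive = function () {
  var events = this.incomingEvents();
  for (var i = 0; i < events.length; i++) {
    // Serialize the request object, then percent-encode it for use in a query string.
    var js = JSON.stringify(events[i].payload.message);
    this.log('Message to escape: ' + js);
    this.createEvent({ 'url': encodeURIComponent(js) });
    // Keep a running count of processed events in agent memory.
    var callCount = this.memory('callCount') || 0;
    this.memory('callCount', callCount + 1);
  }
};

Agent.receive();
console.log(Agent.created[0].url);
```

Each incoming event yields one outgoing event whose `url` field holds the encoded request JSON, which is what the next agent interpolates into its `url_from_event` option.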
Note: Huginn's uri_escape filter does not escape strings the same way as JavaScript's encodeURIComponent(), which is why the JavascriptAgent is used here.
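One way escaping schemes commonly differ is in how they treat spaces: `encodeURIComponent()` emits `%20`, while form-style escaping emits `+`. The sketch below illustrates that difference on a short request object. Whether this is the exact discrepancy with `uri_escape` is an assumption, and `formStyleEscape` is a hypothetical helper that imitates form-style escaping, not Huginn's implementation.

```javascript
// Hypothetical helper imitating form-style escaping (spaces become "+").
// This is NOT Huginn's uri_escape; it only illustrates one way schemes can differ.
function formStyleEscape(text) {
  return encodeURIComponent(text).replace(/%20/g, "+");
}

// Illustrative request with spaces in the user agent string.
const request = JSON.stringify({
  requestSettings: { userAgent: "Mozilla/5.0 (BlackBerry; U)" }
});

const strict = encodeURIComponent(request);
const formStyle = formStyleEscape(request);

// Show how each scheme represents the spaces.
console.log(strict);
console.log(formStyle);

// decodeURIComponent does not turn "+" back into a space,
// so the two encodings decode to different strings.
console.log(decodeURIComponent(strict) === request);
console.log(decodeURIComponent(formStyle) === request);
```

A server that decodes the query strictly would see literal `+` characters in the second form, so the JSON it receives would no longer match what was sent.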
Name: PhantomJS Cloud - Process - Fetch Page
Event sources: PhantomJS Cloud - Process - JS Escape
Propagate immediately: Yes
{
"expected_update_period_in_days": "2",
"url_from_event": "https://PhantomJsCloud.com/api/browser/v2/{%credential phantomjs_cloud%}/?request={{url}}",
"type": "html",
"mode": "on_change",
"extract": {
"title": {
"css": "title",
"value": "normalize-space(.)"
},
"body": {
"css": "body #comic",
"value": "./node()"
}
}
}
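The `url_from_event` template above combines the API key with the encoded request. A minimal sketch of the same composition outside Huginn follows; `buildPhantomUrl` is a hypothetical helper and `YOUR-KEY` is a placeholder, while the endpoint path is copied from the template.

```javascript
// Hypothetical helper mirroring the url_from_event template above.
function buildPhantomUrl(apiKey, request) {
  const base = "https://PhantomJsCloud.com/api/browser/v2/";
  // Serialize and percent-encode the request, as the JS Escape agent does.
  return base + apiKey + "/?request=" + encodeURIComponent(JSON.stringify(request));
}

const fetchUrl = buildPhantomUrl("YOUR-KEY", {
  url: "http://xkcd.com/1/",
  renderType: "html"
});

console.log(fetchUrl);

// Parsing the result back confirms that the request survives the round trip.
const parsed = new URL(fetchUrl);
const decoded = JSON.parse(parsed.searchParams.get("request"));
console.log(decoded.url);
```

Checking the round trip this way is a quick test that the encoded request is still valid JSON before pointing the WebsiteAgent at it.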
Name: PhantomJS Cloud - Out - RSS
Event sources: PhantomJS Cloud - Process - Fetch Page
Propagate immediately: Yes
{
"secrets": [
"phantom"
],
"expected_receive_period_in_days": 2,
"template": {
"title": "XKCD comics as a feed",
"description": "This is a feed of recent XKCD comics, generated by Huginn",
"item": {
"title": "{{title}}",
"description": "{{body}}"
}
}
}