Browser Emulation Using PhantomJs Cloud

Irfan Charania edited this page Dec 21, 2016 · 2 revisions

Problem: Some sites have messy markup and/or render their content with JavaScript, which makes them next to impossible to scrape with the Website Agent directly. (Described in issue #888)

Solution: Use PhantomJs Cloud to emulate the browser and return a fully rendered DOM. This allows the Website Agent to then properly scrape dynamic content from JavaScript-heavy pages.

There are two ways to generate URLs for PhantomJs Cloud:

  1. PhantomJs Cloud Agent (simpler but limited)
  2. Manually

Credentials

Before you begin, sign up for an account at https://phantomjscloud.com/. Then copy your API key and add it to your Huginn credentials:

[
  {
    "id": 1,
    "user_id": 1,
    "credential_name": "phantomjs_cloud",
    "credential_value": "YOUR-KEY",
    "mode": "text"
  }
]

Option 1: PhantomJs Cloud Agent

Note: This agent provides only a limited subset of the most commonly used options.

The workflow to fetch the page is as follows:

  1. RssAgent - provides example URLs to fetch
  2. PhantomJsCloudAgent - to set up PhantomJs Cloud options
  3. WebsiteAgent - to fetch the page using PhantomJs Cloud
  4. DataOutputAgent - to output RSS

Full scenario can be found here

1. RssAgent

Name: PhantomJS Cloud - In - RSS

{
  "expected_update_period_in_days": "5",
  "clean": "true",
  "url": "http://xkcd.com/rss.xml"
}

2. PhantomJsCloudAgent

Name: PhantomJS Cloud - Process - Options
Event sources: PhantomJS Cloud - In - RSS
Propagate immediately: Yes

{
  "mode": "clean",
  "api_key": "{% credential phantomjs_cloud %}",
  "url": "{{url}}",
  "render_type": "html",
  "output_as_json_radio": "false",
  "output_as_json": "false",
  "ignore_images_radio": "false",
  "ignore_images": "false",
  "user_agent": "Mozilla/5.0 (BlackBerry; U; BlackBerry 9900; en) AppleWebKit/534.11+ (KHTML, like Gecko) Version/7.1.0.346 Mobile Safari/534.11+",
  "wait_interval": "1000"
}

3. WebsiteAgent

Name: PhantomJS Cloud - Process - Fetch Page
Event sources: PhantomJS Cloud - Process - Options
Propagate immediately: Yes

{
  "expected_update_period_in_days": "2",
  "url_from_event": "{{url}}",
  "type": "html",
  "mode": "on_change",
  "extract": {
    "title": {
      "css": "title",
      "value": "normalize-space(.)"
    },
    "body": {
      "css": "body #comic",
      "value": "./node()"
    }
  }
}

4. DataOutputAgent

Name: PhantomJS Cloud - Out - RSS
Event sources: PhantomJS Cloud - Process - Fetch Page
Propagate immediately: Yes

{
  "secrets": [
    "phantom"
  ],
  "expected_receive_period_in_days": 2,
  "template": {
    "title": "XKCD comics as a feed",
    "description": "This is a feed of recent XKCD comics, generated by Huginn",
    "item": {
      "title": "{{title}}",
      "description": "{{body}}"
    }
  }
}

Option 2: Manually

The workflow to fetch the page is as follows:

  1. RssAgent - provides example URLs to fetch
  2. EventFormattingAgent - to set up PhantomJs Cloud options
  3. JavascriptAgent - to properly encode the [REQUEST-JSON] using encodeURIComponent()
  4. WebsiteAgent - to fetch the page using PhantomJs Cloud
  5. DataOutputAgent - to output RSS

Full scenario can be found here

1. RssAgent

Name: PhantomJS Cloud - In - RSS

{
  "expected_update_period_in_days": "5",
  "clean": "true",
  "url": "http://xkcd.com/rss.xml"
}

2. EventFormattingAgent

Name: PhantomJS Cloud - Process - Format
Event sources: PhantomJS Cloud - In - RSS
Propagate immediately: Yes

{
  "instructions": {
    "message": {
      "url": "{{url}}",
      "renderType": "html",
      "requestSettings": {
        "userAgent": "Mozilla/5.0 (BlackBerry; U; BlackBerry 9900; en) AppleWebKit/534.11+ (KHTML, like Gecko) Version/7.1.0.346 Mobile Safari/534.11+"
      }
    }
  },
  "mode": "clean"
}

For more options, refer to the official PhantomJs Cloud API documentation.

3. JavascriptAgent

Name: PhantomJS Cloud - Process - JS Escape
Event sources: PhantomJS Cloud - Process - Format
Propagate immediately: Yes

{
  "language": "JavaScript",
  "code": "Agent.receive = function() {\r\n  var events = this.incomingEvents();\r\n  for(var i = 0; i < events.length; i++) {\r\n    var js = JSON.stringify(events[i].payload.message);\r\n    this.log('Message to escape: ' + js);\r\n    this.createEvent({ 'url': encodeURIComponent(js) });\r\n    var callCount = this.memory('callCount') || 0;\r\n    this.memory('callCount', callCount + 1);\r\n  }\r\n}",
  "expected_receive_period_in_days": "2",
  "expected_update_period_in_days": "2"
}
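
Because the "code" option is stored as a single escaped string, it is hard to read in place. The sketch below expands it and runs it against a small stand-in for the Agent object so that it can be tried under Node outside Huginn. The makeStubAgent helper and the sample payload are hypothetical; the stub implements only the four methods the script calls and is not Huginn's actual object.

```javascript
// Hypothetical stand-in for the object Huginn hands to a JavascriptAgent.
// It implements only the methods used by the script below.
function makeStubAgent(payloads) {
  var store = {};
  return {
    created: [],
    logs: [],
    incomingEvents: function () {
      return payloads.map(function (p) { return { payload: p }; });
    },
    createEvent: function (payload) { this.created.push(payload); },
    log: function (message) { this.logs.push(message); },
    memory: function (key, value) {
      if (arguments.length === 2) { store[key] = value; }
      return store[key];
    }
  };
}

// Sample incoming event, shaped like the EventFormattingAgent output.
var Agent = makeStubAgent([
  { message: { url: 'http://xkcd.com/1/', renderType: 'html' } }
]);

// Same logic as the "code" option above, with the \r\n escapes expanded.
Agent.receive = function () {
  var events = this.incomingEvents();
  for (var i = 0; i < events.length; i++) {
    // Serialize the request object, then percent-encode it for use in a URL.
    var js = JSON.stringify(events[i].payload.message);
    this.log('Message to escape: ' + js);
    this.createEvent({ 'url': encodeURIComponent(js) });
    // Keep a running count of processed events in agent memory.
    var callCount = this.memory('callCount') || 0;
    this.memory('callCount', callCount + 1);
  }
};

Agent.receive();
console.log(Agent.created[0].url);
```

Each incoming event produces one outgoing event whose url field holds the encoded request JSON, which is the value the WebsiteAgent in the next step reads as {{url}}.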

Note: Huginn's uri_escape filter does not escape strings the same way as JavaScript's encodeURIComponent(), which is why a JavascriptAgent is used for this step.
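
To illustrate the kind of difference the note refers to, here is a minimal sketch comparing encodeURIComponent() with form-style escaping. It assumes that uri_escape behaves like form-style escaping (spaces written as "+", and ! ' ( ) * percent-encoded); that is an assumption, not something this page confirms. The formStyleEscape helper and the shortened user agent string are hypothetical.

```javascript
// Assumption: form-style escaping writes a space as "+" and also escapes
// the characters ! ' ( ) *, which encodeURIComponent leaves untouched.
function formStyleEscape(text) {
  return encodeURIComponent(text)
    .replace(/[!'()*]/g, function (c) {
      return '%' + c.charCodeAt(0).toString(16).toUpperCase();
    })
    .replace(/%20/g, '+');
}

// A request containing spaces and parentheses, as real user agents do.
var request = JSON.stringify({
  url: 'http://xkcd.com/',
  requestSettings: { userAgent: 'Mozilla/5.0 (BlackBerry; U)' }
});

var componentEncoded = encodeURIComponent(request);
var formEncoded = formStyleEscape(request);

console.log(componentEncoded);
console.log(formEncoded);
console.log(componentEncoded === formEncoded);
```

The two results differ wherever the request contains a space or a parenthesis. Since decodeURIComponent() does not turn "+" back into a space, a form-escaped user agent string would not necessarily round-trip to the original value.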

4. WebsiteAgent

Name: PhantomJS Cloud - Process - Fetch Page
Event sources: PhantomJS Cloud - Process - JS Escape
Propagate immediately: Yes

{
  "expected_update_period_in_days": "2",
  "url_from_event": "https://PhantomJsCloud.com/api/browser/v2/{%credential phantomjs_cloud%}/?request={{url}}",
  "type": "html",
  "mode": "on_change",
  "extract": {
    "title": {
      "css": "title",
      "value": "normalize-space(.)"
    },
    "body": {
      "css": "body #comic",
      "value": "./node()"
    }
  }
}
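
For reference, the "url_from_event" template above amounts to plain string concatenation once the credential and {{url}} are interpolated. The sketch below reproduces that assembly in JavaScript; buildPhantomJsCloudUrl is a hypothetical helper for illustration, and 'YOUR-KEY' is the placeholder from the Credentials section, not a working key.

```javascript
// Hypothetical helper that mirrors the "url_from_event" template above:
// base URL + API key + "/?request=" + percent-encoded request JSON.
function buildPhantomJsCloudUrl(apiKey, pageRequest) {
  var encoded = encodeURIComponent(JSON.stringify(pageRequest));
  return 'https://PhantomJsCloud.com/api/browser/v2/' + apiKey + '/?request=' + encoded;
}

var pageRequest = { url: 'http://xkcd.com/', renderType: 'html' };
var url = buildPhantomJsCloudUrl('YOUR-KEY', pageRequest);
console.log(url);

// Decoding the query value recovers the original request object,
// which is a quick way to check an assembled URL by hand.
var query = url.split('?request=')[1];
var decoded = JSON.parse(decodeURIComponent(query));
console.log(decoded.renderType);
```

Decoding the query string this way is a convenient sanity check when the WebsiteAgent returns an error and the request JSON is suspected to be malformed.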

5. DataOutputAgent

Name: PhantomJS Cloud - Out - RSS
Event sources: PhantomJS Cloud - Process - Fetch Page
Propagate immediately: Yes

{
  "secrets": [
    "phantom"
  ],
  "expected_receive_period_in_days": 2,
  "template": {
    "title": "XKCD comics as a feed",
    "description": "This is a feed of recent XKCD comics, generated by Huginn",
    "item": {
      "title": "{{title}}",
      "description": "{{body}}"
    }
  }
}