Closed
Description
Yesterday I was crawling this website with cheerio and found that $(elem).find(xxclass) did not always return the expected result. I tracked the problem down and found that htmlparser2 was not producing the correct DOM tree. I also noticed that the HTML file contains lots of comments, which seem to affect the parsing.
Here is part of the suspicious HTML:
<div id="main">
<div class="hfeed">
<div id="post-9977" class="hentry post publish post-1 odd author-qt category-truyen-cuoi-hang-ngay">
<div class="sticky-header">
<h2 class="post-title entry-title"><a href="xxx">Chuyện vợ chồng anh Lương chị Ví</a></h2>
<div class="byline">
<time class="published">August 25, 2015</time> · by <span class="author vcard">
<a class="url fn n" rel="author" href="xxxx" title="Trùm Cười">Trùm Cười</a></span> · in <span class="category"><a href="http://cuoivuive.com/truyen-cuoi/truyen-cuoi-hang-ngay" rel="tag">Truyện cười hàng ngày</a></span>
</div>
</div>
<!-- .sticky-header -->
<div class="entry-summary">
<p>Anh Lương đi làm xa nhà tháng mới về một lần, hôm nay nghe tin anh Lương về lòng chị Ví thấy vui vui lạ. Chiều nay nghe tin anh Lương về, lòng chị Ví bỗng thấy vui lạ. Hai…</p>
<div class="ssb-share ssb-share-9977 defualt" post_id="9977">
<div class="defualt-button-fb">
<iframe src="xxx" ></iframe>
</div>
<div class="defualt-button-twitter">
<iframe id="twitter-widget-0" src="xxx"></iframe>
<br>
<script>
!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?"http":"https";if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+"://platform.twitter.com/widgets.js";fjs.parentNode.insertBefore(js,fjs);}}(document, "script", "twitter-wjs");
</script>
</div>
<div class="defualt-button-gplus">
<script type="text/javascript" src="xxx" gapi_processed="true"></script>
<p></p>
<div id="___plusone_0" >
<iframe src="xxx"></iframe>
</div>
<p></p>
</div>
</div>
</div>
<!-- .entry-summary -->
</div>
<!-- .hentry -->
<div class="hentry"> many hentry under</div>
<!-- .hentry -->
<div class="hentry"> many hentry under</div>
<!-- .hentry -->
<div class="hentry"> many hentry under</div>
<!-- .hentry -->
</div>
</div>
And this is the parsed DOM tree. Notice that .entry-summary is a comment, yet it has children and its prev is #main!
{ data: ' .entry-summary ',
type: 'comment',
next: [Circular],
prev:
{ type: 'tag',
name: 'div',
attribs: { id: 'main' },
children:
[ [Object],
[Object],
[Object],
[Object],
[Object],
[Object],
[length]: 6 ],
next: [Circular],
prev:
{ data: '\n\t\n\t\t\t\t\n\t\t\t\t\n\t\t\t',
type: 'text',
next: [Circular],
prev: [Object],
parent: [Circular] },
parent: [Circular] },
parent: [Circular] }
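A check like the following could flag this kind of anomaly automatically. It is only a minimal sketch: it assumes nodes are plain objects with type, children, prev and parent fields, as in the dump above, and findAnomalies is a hypothetical helper, not part of htmlparser2 or domhandler.

```javascript
// Hypothetical helper: walks a parsed tree and reports nodes whose links
// look inconsistent. Assumes the node shape seen in the dump above.
function findAnomalies(nodes, parent = null, out = []) {
  nodes.forEach((node, i) => {
    const expectedPrev = i > 0 ? nodes[i - 1] : null;
    // A comment node should never carry children.
    if (node.type === "comment" && node.children && node.children.length > 0) {
      out.push({ node, problem: "comment has children" });
    }
    // prev should point at the preceding sibling in the same list.
    if ((node.prev || null) !== expectedPrev) {
      out.push({ node, problem: "prev is not the preceding sibling" });
    }
    // parent should point at the node whose children list we are walking.
    if ((node.parent || null) !== parent) {
      out.push({ node, problem: "parent link mismatch" });
    }
    if (node.children) findAnomalies(node.children, node, out);
  });
  return out;
}
```

When called on dom[0].children, the second argument should be dom[0], so that the parent check compares against the right node.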
Below is my script to show the DOM tree parsed by htmlparser2.
var request = require('request');
var htmlparser = require('htmlparser2');
var util = require('util');
var fs = require('fs');

request('http://cuoivuive.com/', function (error, response, body) {
  if (error) {
    console.log(error);
    return; // nothing to parse if the request failed
  }
  console.log('parsing..');
  // Parse only the <body>...</body> part of the page.
  var dom = htmlparser.parseDOM(response.body.match(/(<body[\s\S]*<\/body>)/)[1]);
  // Dump the children of <body>, including hidden properties, 9 levels deep.
  fs.writeFileSync('result.js', util.inspect(dom[0].children, { showHidden: true, depth: 9 }), 'utf-8');
  console.log('parsed..');
});
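The util.inspect output is dominated by [Circular] links, which makes the nesting hard to read. As an alternative, a compact outline that follows only children may be easier to compare against the original HTML. This is a sketch under the assumption that element nodes carry name and attribs, and comment nodes carry data; outline is a hypothetical helper name.

```javascript
// Hypothetical helper: returns one indented line per element or comment,
// following only `children` so circular prev/next/parent links are ignored.
function outline(nodes, depth = 0, lines = []) {
  for (const node of nodes) {
    const indent = "  ".repeat(depth);
    if (node.type === "comment") {
      lines.push(indent + "<!--" + node.data + "-->");
    } else if (node.name) {
      // Assumed element shape: { name, attribs: { id, class, ... } }
      const attribs = node.attribs || {};
      const id = attribs.id ? "#" + attribs.id : "";
      const cls = attribs.class
        ? "." + attribs.class.trim().split(/\s+/).join(".")
        : "";
      lines.push(indent + "<" + node.name + id + cls + ">");
    }
    // Text nodes are skipped; anything with children is descended into.
    if (node.children) outline(node.children, depth + 1, lines);
  }
  return lines;
}
```

For example, outline(dom[0].children).join('\n') could be written to a file in place of the util.inspect dump.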
I am really confused...
Activity
fb55 commented on Nov 26, 2015
Definitely a bug in the domhandler module, and I think I know what causes it. Thanks for the report!
duziaqin commented on Dec 1, 2015
I think I have found the problem. I fetched the HTML file with wget and found a strange p right at the place where things went wrong. When that p closes, it closes .defualt-button-gplus and .ssb-share ssb-share-9977 too. I found the relevant source in Parser.js.
So apparently there is nothing wrong with the comment; the problem is the p... It is the browser that makes it right.
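The effect described here can be illustrated with a toy model of stack-based close-tag handling. This is not htmlparser2's actual code: closeTag, the stack layout and the labels are assumptions for illustration only, showing how closing an outer tag can also close every element opened after it.

```javascript
// Toy model only: open elements are kept on a stack, and a closing tag
// pops everything back to (and including) the nearest match.
function closeTag(stack, tag) {
  let pos = -1;
  for (let i = stack.length - 1; i >= 0; i--) {
    if (stack[i].tag === tag) {
      pos = i;
      break;
    }
  }
  if (pos === -1) return []; // unmatched close tag: nothing is closed
  // Remove the match and everything above it; report innermost first.
  return stack.splice(pos).reverse().map((entry) => entry.label);
}

// Hypothetical situation: a p is still open when the share widgets begin.
const openElements = [
  { tag: "div", label: ".entry-summary" },
  { tag: "p", label: "p" },
  { tag: "div", label: ".ssb-share" },
  { tag: "div", label: ".defualt-button-gplus" },
];
const closedByP = closeTag(openElements, "p");
console.log(closedByP);
```

In this model the closing p takes .defualt-button-gplus and .ssb-share down with it, so whatever follows is attached at the wrong level of the tree.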
fb55 commented on Sep 13, 2020
Looks like this has been fixed at some point in the last five years 🙂