parse wrong DOM tree #159
Comments
Definitely a bug in the |
I think I have found the problem. I fetched the HTML file with wget and found there is a strange
when the
I found the source script in Parser.js
So, apparently, there is nothing with the |
Looks like this has been fixed at some point in the last five years 🙂
I was crawling this website yesterday with cheerio and found that $(elem).find(xxclass) did not always return the expected result. I tracked the problem down and found that htmlparser2 was not producing the correct DOM tree. I also noticed that the HTML file contains a lot of comments, which seem to affect the parsing.
Here is part of the suspicious HTML:
And this is the parsed DOM tree. Notice that .entry-summary is a comment, yet it has children and its prev is #main:
Below is my script to show the DOM tree parsed by htmlparser2.
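The script originally posted here is not preserved in this copy of the thread. As a rough stand-in, the sketch below shows one way such a dump could look. It is not the author's code: `dumpTree`, `commentsWithChildren`, and the hand-built `sample` tree are hypothetical names, and the node shape (`type`, `name`, `data`, `children`) is assumed to mirror the objects htmlparser2 produces.

```javascript
// Print an indented outline of a DOM tree whose nodes look like
// htmlparser2's output: { type, name?, data?, attribs?, children? }.
function dumpTree(nodes, depth = 0, lines = []) {
  for (const node of nodes) {
    const indent = "  ".repeat(depth);
    if (node.type === "tag" || node.type === "script" || node.type === "style") {
      lines.push(`${indent}<${node.name}>`);
    } else if (node.type === "comment") {
      lines.push(`${indent}<!-- ${node.data.trim()} -->`);
    } else if (node.type === "text") {
      const text = node.data.trim();
      if (text) lines.push(`${indent}"${text}"`); // skip whitespace-only text
    }
    if (Array.isArray(node.children)) {
      dumpTree(node.children, depth + 1, lines);
    }
  }
  return lines;
}

// Collect comment nodes that have children. In a correctly parsed tree
// this list should be empty, so any hit points at the symptom described above.
function commentsWithChildren(nodes, found = []) {
  for (const node of nodes) {
    const kids = Array.isArray(node.children) ? node.children : [];
    if (node.type === "comment" && kids.length > 0) found.push(node);
    commentsWithChildren(kids, found);
  }
  return found;
}

// Hand-built sample standing in for real parser output (hypothetical data).
const sample = [
  {
    type: "tag",
    name: "div",
    attribs: { id: "main" },
    children: [
      { type: "comment", data: " .entry-summary " },
      {
        type: "tag",
        name: "p",
        attribs: {},
        children: [{ type: "text", data: "hello" }],
      },
    ],
  },
];

console.log(dumpTree(sample).join("\n"));
console.log("comments with children:", commentsWithChildren(sample).length);
```

With a real page, the `sample` array would be replaced by the parser's output. In recent htmlparser2 releases that is, as far as I know, available as `require("htmlparser2").parseDocument(html).children`; older releases used a `DomHandler` passed to `Parser` instead.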
I am really confused...