parse wrong DOM tree #159

Closed
duziaqin opened this issue Nov 26, 2015 · 3 comments

@duziaqin

I was crawling this website yesterday with cheerio and found that $(elem).find(xxclass) did not always return the expected result. I tracked the problem down to htmlparser2, which does not build the correct DOM tree. I also noticed that the HTML file contains a lot of comments, which seem to affect the parsing.
Here is part of the suspicious HTML:

<div id="main">
  <div class="hfeed">
    <div id="post-9977" class="hentry post publish post-1 odd author-qt category-truyen-cuoi-hang-ngay">                    
        <div class="sticky-header">
            <h2 class="post-title entry-title"><a href="xxx">Chuyện vợ chồng anh Lương chị Ví</a></h2>
            <div class="byline">
                <time class="published">August 25, 2015</time> · by <span class="author vcard">
                <a class="url fn n" rel="author" href="xxxx" title="Trùm Cười">Trùm Cười</a></span> · in <span class="category"><a href="http://cuoivuive.com/truyen-cuoi/truyen-cuoi-hang-ngay" rel="tag">Truyện cười hàng ngày</a></span>
            </div>                                      
        </div>
        <!-- .sticky-header -->
        <div class="entry-summary">
            <p>Anh Lương đi làm xa nhà tháng mới về một lần, hôm nay nghe tin anh Lương về lòng chị Ví thấy vui vui lạ. Chiều nay nghe tin anh Lương về, lòng chị Ví bỗng thấy vui lạ. Hai…</p>
                           <div class="ssb-share ssb-share-9977 defualt" post_id="9977">
                <div class="defualt-button-fb">
                    <iframe src="xxx" ></iframe>
                </div>
                <div class="defualt-button-twitter">
                    <iframe id="twitter-widget-0"  src="xxx"></iframe>
                    <br>
                    <script>
                        !function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?"http":"https";if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+"://platform.twitter.com/widgets.js";fjs.parentNode.insertBefore(js,fjs);}}(document, "script", "twitter-wjs");
                    </script>
                </div>
                <div class="defualt-button-gplus">
                    <script type="text/javascript" src="xxx" gapi_processed="true"></script>
                    <p></p>
                    <div id="___plusone_0" >
                        <iframe src="xxx"></iframe>
                    </div>
                    <p></p>
                </div>
            </div>                                                              
        </div>
    <!-- .entry-summary -->
    </div>
    <!-- .hentry -->
    <div class="hentry"> many hentry under</div>
    <!-- .hentry -->
    <div class="hentry"> many hentry under</div>
    <!-- .hentry -->
    <div class="hentry"> many hentry under</div>
    <!-- .hentry -->
  </div>
</div>

And this is the parsed DOM tree. Notice that .entry-summary is a comment node but has children, and its prev is #main:

{ data: ' .entry-summary ',
  type: 'comment',
  next: [Circular],
  prev:
   { type: 'tag',
     name: 'div',
     attribs: { id: 'main' },
     children:
      [ [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [length]: 6 ],
     next: [Circular],
     prev:
      { data: '\n\t\n\t\t\t\t\n\t\t\t\t\n\t\t\t',
        type: 'text',
        next: [Circular],
        prev: [Object],
        parent: [Circular] },
     parent: [Circular] },
  parent: [Circular] }

Below is the script I used to dump the DOM tree parsed by htmlparser2.

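var request = require('request');
var htmlparser = require("htmlparser2");
var util = require('util');
var fs = require('fs');

request('http://cuoivuive.com/', function(error, response, body) {
  if(error) {
    console.log(error);
    return; // there is no response body to parse
  }

  // Keep only the <body> element; stop if the page has none.
  var match = body.match(/(<body[\s\S]*<\/body>)/);
  if(!match) {
    console.log('no <body> found');
    return;
  }

  console.log('parsing..');
  var dom = htmlparser.parseDOM(match[1]);

  // Dump the children of the first top-level node to a file for inspection.
  fs.writeFileSync('result.js', util.inspect(dom[0].children, { showHidden: true, depth: 9 }), 'utf-8');

  console.log('parsed..');
});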

I am really confused...

@fb55
Owner

fb55 commented Nov 26, 2015

Definitely a bug in the domhandler module and I think I know what causes it. Thanks for the report!

@duziaqin
Author

duziaqin commented Dec 1, 2015

> Definitely a bug in the domhandler module and I think I know what causes it. Thanks for the report!

I think I have found the problem. I fetched the HTML file with wget and found a strange p element right where the parse goes wrong.

<div class="entry-summary">

  <p>Anh Lương đi l&agrave;m xa nh&agrave; th&aacute;ng mới về một lần, h&ocirc;m nay nghe tin anh Lương về l&ograve;ng chị V&iacute; thấy vui vui lạ. Chiều nay nghe tin anh Lương về, l&ograve;ng chị V&iacute; bỗng thấy vui lạ. Hai&#8230;
    <div class='ssb-share ssb-share-9977 defualt' post_id='9977'>
      <div class="defualt-button-fb">
        <iframe src="xxx"></iframe>
      </div>
      <div class="defualt-button-twitter">
        <a href="https://twitter.com/share" class="twitter-share-button" data-url="http://cuoivuive.com/chuyen-vo-chong-anh-luong-chi-vi">Tweet</a>
        <br />
        <script>
        </script>
      </div>
      <div class="defualt-button-gplus">
        <script type="text/javascript" src="https://apis.google.com/js/platform.js"></script>
  </p>
  <div class="g-plusone" data-size="medium" data-href="http://cuoivuive.com/chuyen-vo-chong-anh-luong-chi-vi"></div>
  </p>
  </div>
  </div>

</div>
<!-- .entry-summary -->

</div>
<!-- .hentry -->

When the p closes, it also closes .defualt-button-gplus and .ssb-share ssb-share-9977:

script start; attribs: {"type":"text/javascript","src":"https://apis.google.com/js/platform.js"}
script   end;
div  .defualt-button-gplus end;
div  .ssb-share ssb-share-9977 defualt end;
p   end;
-->

I found the relevant source in Parser.js:

Parser.prototype.onclosetag = function(name){
    this._updatePosition(1);

    if(this._lowerCaseTagNames){
        name = name.toLowerCase();
    }

    if(this._stack.length && (!(name in voidElements) || this._options.xmlMode)){
        var pos = this._stack.lastIndexOf(name);
        if(pos !== -1){
            if(this._cbs.onclosetag){
                pos = this._stack.length - pos;
                while(pos--) this._cbs.onclosetag(this._stack.pop());
            }
            else this._stack.length = pos;
        } else if(name === "p" && !this._options.xmlMode){
            this.onopentagname(name);
            this._closeCurrentTag();
        }
    } else if(!this._options.xmlMode && (name === "br" || name === "p")){
        this.onopentagname(name);
        this._closeCurrentTag();
    }
};
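
To make the effect of the lastIndexOf branch easier to see, here is a minimal sketch that models only the stack handling of the function above. The closeTag helper and the array of tag names are hypothetical and written for illustration; this is not the library's code, and the void-element and xmlMode checks are left out.

```javascript
// Simplified model of the stack unwinding in onclosetag (illustration only).
// closeTag is a hypothetical helper, not part of htmlparser2.
function closeTag(stack, name) {
  var closed = [];
  // Find the most recently opened element with this name.
  var pos = stack.lastIndexOf(name);
  if (pos !== -1) {
    // Pop that element and everything opened after it.
    var count = stack.length - pos;
    while (count--) closed.push(stack.pop());
  }
  return closed;
}

// Open elements when the first </p> arrives in the fetched HTML:
// div.entry-summary > p > div.ssb-share > div.defualt-button-gplus
var stack = ['div', 'p', 'div', 'div'];
var closed = closeTag(stack, 'p');

console.log(closed); // elements closed, innermost first
console.log(stack);  // elements still open
```

Under this model, the </p> closes both nested divs before closing the p itself, which is the same order as in the event log above.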

So apparently the comment has nothing to do with it; the problem is the p.
The browser repairs the markup when it builds its own DOM, which is why the tree looks right there.
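
For comparison, here is a rough sketch of the two HTML tree-construction rules that I believe explain the browser result: a div start tag implicitly closes an open p, and a stray </p> with no open p becomes an empty p element. The simulate function and its token format are hypothetical, made up for this illustration; it is a toy model, not a real HTML parser and not htmlparser2.

```javascript
// Toy model of two browser parsing rules (illustration only, not a real parser).
// Tokens are tag names; a leading '/' marks an end tag.
function simulate(tokens) {
  var stack = [];
  var log = [];
  function open(name) { stack.push(name); log.push('<' + name + '>'); }
  function close() { log.push('</' + stack.pop() + '>'); }

  tokens.forEach(function (tok) {
    if (tok[0] !== '/') {
      // Rule 1: a <div> start tag implicitly closes an open <p>.
      if (tok === 'div' && stack.lastIndexOf('p') !== -1) {
        while (stack[stack.length - 1] !== 'p') close();
        close();
      }
      open(tok);
      return;
    }
    var name = tok.slice(1);
    var pos = stack.lastIndexOf(name);
    if (pos === -1) {
      // Rule 2: a stray </p> becomes an empty <p></p>; other stray end tags are dropped.
      if (name === 'p') { open('p'); close(); }
      return;
    }
    // Matching end tag: close everything down to and including the match.
    while (stack.length > pos) close();
  });
  return log.join('');
}

// Same tag sequence as the fetched HTML, reduced to div and p.
var result = simulate(['div', 'p', 'div', 'div', '/p', '/p', '/div', '/div', '/div']);
console.log(result);
```

With these rules the nested divs stay inside .entry-summary and the two stray </p> tags turn into empty p elements, which matches the two <p></p> seen in the HTML from the first comment.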

@fb55
Owner

fb55 commented Sep 13, 2020

Looks like this has been fixed at some point in the last five years 🙂

@fb55 fb55 closed this as completed Sep 13, 2020