parse wrong DOM tree #159

Closed
Description

@duziaqin

I was crawling this website with cheerio yesterday and found that $(elem).find(xxclass) did not always return the expected result. I tracked the problem down to htmlparser2, which does not build the correct DOM tree. I also noticed that the HTML file contains a lot of comments, which seem to affect the parsing.
Here is part of the suspicious HTML:

<div id="main">
  <div class="hfeed">
    <div id="post-9977" class="hentry post publish post-1 odd author-qt category-truyen-cuoi-hang-ngay">                    
        <div class="sticky-header">
            <h2 class="post-title entry-title"><a href="xxx">Chuyện vợ chồng anh Lương chị Ví</a></h2>
            <div class="byline">
                <time class="published">August 25, 2015</time> · by <span class="author vcard">
                <a class="url fn n" rel="author" href="xxxx" title="Trùm Cười">Trùm Cười</a></span> · in <span class="category"><a href="http://cuoivuive.com/truyen-cuoi/truyen-cuoi-hang-ngay" rel="tag">Truyện cười hàng ngày</a></span>
            </div>                                      
        </div>
        <!-- .sticky-header -->
        <div class="entry-summary">
            <p>Anh Lương đi làm xa nhà tháng mới về một lần, hôm nay nghe tin anh Lương về lòng chị Ví thấy vui vui lạ. Chiều nay nghe tin anh Lương về, lòng chị Ví bỗng thấy vui lạ. Hai…</p>
                           <div class="ssb-share ssb-share-9977 defualt" post_id="9977">
                <div class="defualt-button-fb">
                    <iframe src="xxx" ></iframe>
                </div>
                <div class="defualt-button-twitter">
                    <iframe id="twitter-widget-0"  src="xxx"></iframe>
                    <br>
                    <script>
                        !function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?"http":"https";if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+"://platform.twitter.com/widgets.js";fjs.parentNode.insertBefore(js,fjs);}}(document, "script", "twitter-wjs");
                    </script>
                </div>
                <div class="defualt-button-gplus">
                    <script type="text/javascript" src="xxx" gapi_processed="true"></script>
                    <p></p>
                    <div id="___plusone_0" >
                        <iframe src="xxx"></iframe>
                    </div>
                    <p></p>
                </div>
            </div>                                                              
        </div>
    <!-- .entry-summary -->
    </div>
    <!-- .hentry -->
    <div class="hentry"> many hentry under</div>
    <!-- .hentry -->
    <div class="hentry"> many hentry under</div>
    <!-- .hentry -->
    <div class="hentry"> many hentry under</div>
    <!-- .hentry -->
  </div>
</div>

And this is the parsed DOM tree. Notice that .entry-summary is a comment but has children, and that its prev is #main:

{ data: ' .entry-summary ',
  type: 'comment',
  next: [Circular],
  prev:
   { type: 'tag',
     name: 'div',
     attribs: { id: 'main' },
     children:
      [ [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [length]: 6 ],
     next: [Circular],
     prev:
      { data: '\n\t\n\t\t\t\t\n\t\t\t\t\n\t\t\t',
        type: 'text',
        next: [Circular],
        prev: [Object],
        parent: [Circular] },
     parent: [Circular] },
  parent: [Circular] }

Below is the script I used to dump the DOM tree parsed by htmlparser2:

var request = require('request');
var htmlparser = require("htmlparser2");
var util = require('util');
var fs = require('fs');

request('http://cuoivuive.com/', function(error, response, body) {
  if(error) {
    console.log(error);
    return; // nothing to parse if the request failed
  }

  console.log('parsing..');
  // Extract the <body>...</body> part of the page and parse it.
  var dom = htmlparser.parseDOM(response.body.match(/(<body[\s\S]*<\/body>)/)[1]);

  // Dump the children of the first parsed node for inspection.
  fs.writeFileSync('result.js', util.inspect(dom[0].children,{ showHidden: true, depth: 9 }) , 'utf-8');

  console.log('parsed..');
});
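
A compact outline can be easier to scan than the util.inspect output with its [Circular] entries. The following is only a sketch: the `outline` helper and the hand-built `sample` tree are hypothetical, and it assumes nodes carry just the fields visible in the dump above (type, name, attribs, data, children).

```javascript
// Print a compact outline of a DOM tree. Only type, name, attribs, data and
// children are read; prev/next/parent links are never followed, so circular
// references are not a problem.
function outline(nodes, depth, lines) {
  depth = depth || 0;
  lines = lines || [];
  nodes.forEach(function (node) {
    var indent = new Array(depth + 1).join('  ');
    if (node.type === 'comment') {
      lines.push(indent + '<!--' + node.data + '-->');
    } else if (node.type === 'text') {
      var text = node.data.trim();
      if (text) lines.push(indent + JSON.stringify(text)); // skip whitespace-only text
    } else {
      var attribs = node.attribs || {};
      var label = node.name || node.type;
      if (attribs.id) label += '#' + attribs.id;
      if (attribs.class) label += '.' + attribs.class.split(/\s+/).join('.');
      lines.push(indent + label);
      if (node.children) outline(node.children, depth + 1, lines);
    }
  });
  return lines;
}

// Hand-built sample in the assumed node shape (hypothetical data).
var sample = [
  { type: 'tag', name: 'div', attribs: { id: 'main' }, children: [
    { type: 'tag', name: 'div', attribs: { class: 'hfeed' }, children: [
      { type: 'comment', data: ' .hentry ' }
    ] }
  ] }
];

console.log(outline(sample).join('\n'));
```

With the script above, the same helper could be applied to the parsed dom array before writing the result to disk, which makes a misplaced comment or element visible by its indentation.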

I am really confused...

Activity

fb55 (Owner) commented on Nov 26, 2015

Definitely a bug in the domhandler module and I think I know what causes it. Thanks for the report!

duziaqin (Author) commented on Dec 1, 2015

> Definitely a bug in the domhandler module and I think I know what causes it. Thanks for the report!

I think I have found the problem. I fetched the HTML file with wget and found a strange p right at the place where the parse goes wrong:

<div class="entry-summary">

  <p>Anh Lương đi l&agrave;m xa nh&agrave; th&aacute;ng mới về một lần, h&ocirc;m nay nghe tin anh Lương về l&ograve;ng chị V&iacute; thấy vui vui lạ. Chiều nay nghe tin anh Lương về, l&ograve;ng chị V&iacute; bỗng thấy vui lạ. Hai&#8230;
    <div class='ssb-share ssb-share-9977 defualt' post_id='9977'>
      <div class="defualt-button-fb">
        <iframe src="xxx"></iframe>
      </div>
      <div class="defualt-button-twitter">
        <a href="https://twitter.com/share" class="twitter-share-button" data-url="http://cuoivuive.com/chuyen-vo-chong-anh-luong-chi-vi">Tweet</a>
        <br />
        <script>
        </script>
      </div>
      <div class="defualt-button-gplus">
        <script type="text/javascript" src="https://apis.google.com/js/platform.js"></script>
  </p>
  <div class="g-plusone" data-size="medium" data-href="http://cuoivuive.com/chuyen-vo-chong-anh-luong-chi-vi"></div>
  </p>
  </div>
  </div>

</div>
<!-- .entry-summary -->

</div>
<!-- .hentry -->

When the p closes, it closes .defualt-button-gplus and .ssb-share ssb-share-9977 too:

script start; attribs: {"type":"text/javascript","src":"https://apis.google.com/js/platform.js"}
script   end;
div  .defualt-button-gplus end;
div  .ssb-share ssb-share-9977 defualt end;
p   end;
-->

I found the relevant source in Parser.js:

Parser.prototype.onclosetag = function(name){
    this._updatePosition(1);

    if(this._lowerCaseTagNames){
        name = name.toLowerCase();
    }

    if(this._stack.length && (!(name in voidElements) || this._options.xmlMode)){
        var pos = this._stack.lastIndexOf(name);
        if(pos !== -1){
            if(this._cbs.onclosetag){
                pos = this._stack.length - pos;
                while(pos--) this._cbs.onclosetag(this._stack.pop());
            }
            else this._stack.length = pos;
        } else if(name === "p" && !this._options.xmlMode){
            this.onopentagname(name);
            this._closeCurrentTag();
        }
    } else if(!this._options.xmlMode && (name === "br" || name === "p")){
        this.onopentagname(name);
        this._closeCurrentTag();
    }
};

So apparently the comments have nothing to do with it; the p is the problem.
It is the browser that repairs the markup.
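
To illustrate the behaviour, here is a simplified, dependency-free model of the onclosetag logic quoted above. It is a sketch, not the library's code: `closeTag`, the events array, and the sample stack are made up for illustration, and the stack only mirrors the elements that are open when the first stray </p> arrives in the wget output.

```javascript
// Simplified model of the quoted onclosetag (HTML mode only; void elements
// and the empty-stack branch are omitted).
function closeTag(stack, name, events) {
  var pos = stack.lastIndexOf(name);
  if (pos !== -1) {
    // A matching open tag exists: pop everything above it, then the tag itself.
    var count = stack.length - pos;
    while (count--) events.push('close ' + stack.pop());
  } else if (name === 'p') {
    // No open <p>: the quoted code appears to open and immediately close an empty one.
    events.push('open p');
    events.push('close p');
  }
  return events;
}

// Open elements when the first stray </p> is reached:
// div.entry-summary > p > div.ssb-share > div.defualt-button-gplus
var stack = ['div', 'p', 'div', 'div'];

var first = closeTag(stack, 'p', []);
console.log(first); // [ 'close div', 'close div', 'close p' ]
console.log(stack); // [ 'div' ]

// After the g-plusone div has opened and closed, the second stray </p> arrives.
var second = closeTag(stack, 'p', []);
console.log(second); // [ 'open p', 'close p' ]
```

The first call reproduces the order of the end events in the log above: both divs are closed before the p, so everything that follows ends up one level higher than the author of the page intended.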

fb55 (Owner) commented on Sep 13, 2020

Looks like this has been fixed at some point in the last five years 🙂

parse wrong DOM tree · Issue #159 · fb55/htmlparser2