3 Star 1 Fork 0

Gitee 极速下载/node-htmlparser

加入 Gitee
与超过 1200万 开发者一起发现、参与优秀开源项目,私有仓库也完全免费 :)
免费加入
此仓库是为了提升国内下载速度的镜像仓库,每日同步一次。 原始仓库: https://github.com/tautologistics/node-htmlparser
克隆/下载
贡献代码
同步代码
Loading...
README
MIT

#NodeHtmlParser A forgiving HTML/XML/RSS parser written in JS for both the browser and NodeJS (yes, despite the name it works just fine in any modern browser). The parser can handle streams (chunked data) and supports custom handlers for writing custom DOMs/output.

##Installing

npm install htmlparser

##Running Tests

###Run tests under node: node runtests.js

###Run tests in browser: View runtests.html in any browser

##Usage In Node

var htmlparser = require("htmlparser");
var rawHtml = "Xyz <script language= javascript>var foo = '<<bar>>';< /  script><!--<!-- Waah! -- -->";
var handler = new htmlparser.DefaultHandler(function (error, dom) {
	if (error)
		[...do something for errors...]
	else
		[...parsing done, do something...]
});
var parser = new htmlparser.Parser(handler);
parser.parseComplete(rawHtml);
sys.puts(sys.inspect(handler.dom, false, null));

##Usage In Browser

var handler = new Tautologistics.NodeHtmlParser.DefaultHandler(function (error, dom) {
	if (error)
		[...do something for errors...]
	else
		[...parsing done, do something...]
});
var parser = new Tautologistics.NodeHtmlParser.Parser(handler);
parser.parseComplete(document.body.innerHTML);
alert(JSON.stringify(handler.dom, null, 2));

##Example output

[ { raw: 'Xyz ', data: 'Xyz ', type: 'text' }
  , { raw: 'script language= javascript'
  , data: 'script language= javascript'
  , type: 'script'
  , name: 'script'
  , attribs: { language: 'javascript' }
  , children: 
     [ { raw: 'var foo = \'<bar>\';<'
       , data: 'var foo = \'<bar>\';<'
       , type: 'text'
       }
     ]
  }
, { raw: '<!-- Waah! -- '
  , data: '<!-- Waah! -- '
  , type: 'comment'
  }
]

##Streaming To Parser

while (...) {
	...
	parser.parseChunk(chunk);
}
parser.done();	

##Streaming To Parser in Node

fs.createReadStream('./path_to_file.html').pipe(parser);

##Parsing RSS/Atom Feeds

new htmlparser.RssHandler(function (error, dom) {
	...
});

##DefaultHandler Options

###Usage

var handler = new htmlparser.DefaultHandler(
	  function (error) { ... }
	, { verbose: false, ignoreWhitespace: true }
	);

###Option: ignoreWhitespace Indicates whether the DOM should exclude text nodes that consists solely of whitespace. The default value is "false".

####Example: true

The following HTML:

<font>
	<br>this is the text
<font>

becomes:

[ { raw: 'font'
  , data: 'font'
  , type: 'tag'
  , name: 'font'
  , children: 
     [ { raw: 'br', data: 'br', type: 'tag', name: 'br' }
     , { raw: 'this is the text\n'
       , data: 'this is the text\n'
       , type: 'text'
       }
     , { raw: 'font', data: 'font', type: 'tag', name: 'font' }
     ]
  }
]

####Example: false

The following HTML:

<font>
	<br>this is the text
<font>

becomes:

[ { raw: 'font'
  , data: 'font'
  , type: 'tag'
  , name: 'font'
  , children: 
     [ { raw: '\n\t', data: '\n\t', type: 'text' }
     , { raw: 'br', data: 'br', type: 'tag', name: 'br' }
     , { raw: 'this is the text\n'
       , data: 'this is the text\n'
       , type: 'text'
       }
     , { raw: 'font', data: 'font', type: 'tag', name: 'font' }
     ]
  }
]

###Option: verbose Indicates whether to include extra information on each node in the DOM. This information consists of the "raw" attribute (original, unparsed text found between "<" and ">") and the "data" attribute on "tag", "script", and "comment" nodes. The default value is "true".

####Example: true The following HTML:

<a href="test.html">xxx</a>

becomes:

[ { raw: 'a href="test.html"'
  , data: 'a href="test.html"'
  , type: 'tag'
  , name: 'a'
  , attribs: { href: 'test.html' }
  , children: [ { raw: 'xxx', data: 'xxx', type: 'text' } ]
  }
]

####Example: false The following HTML:

<a href="test.html">xxx</a>

becomes:

[ { type: 'tag'
  , name: 'a'
  , attribs: { href: 'test.html' }
  , children: [ { data: 'xxx', type: 'text' } ]
  }
]

###Option: enforceEmptyTags Indicates whether the DOM should prevent children on tags marked as empty in the HTML spec. Typically this should be set to "true" HTML parsing and "false" for XML parsing. The default value is "true".

####Example: true The following HTML:

<link>text</link>

becomes:

[ { raw: 'link', data: 'link', type: 'tag', name: 'link' }
, { raw: 'text', data: 'text', type: 'text' }
]

####Example: false The following HTML:

<link>text</link>

becomes:

[ { raw: 'link'
  , data: 'link'
  , type: 'tag'
  , name: 'link'
  , children: [ { raw: 'text', data: 'text', type: 'text' } ]
  }
]

##DomUtils

###TBD (see utils_example.js for now)

##Related Projects

Looking for CSS selectors to search the DOM? Try Node-SoupSelect, a port of SoupSelect to NodeJS: http://github.com/harryf/node-soupselect

There's also a port of hpricot to NodeJS that uses HtmlParser for HTML parsing: http://github.com/silentrob/Apricot

Copyright 2010 - 2012, Chris Winberry <chris@winberry.net>. All rights reserved. Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

简介

NodeHtmlParser 一个可以用 JS 编写的 HTML / XML / RSS 解析器,适用于浏览器和 NodeJS(是的,尽管它的名称在任何现代浏览器中都可以正常使用) 展开 收起
README
MIT
取消

发行版

暂无发行版

贡献者 (7)

全部

近期动态

5年前创建了仓库
不能加载更多了
马建仓 AI 助手
尝试更多
代码解读
代码找茬
代码优化
JavaScript
1
https://gitee.com/mirrors/node-htmlparser.git
git@gitee.com:mirrors/node-htmlparser.git
mirrors
node-htmlparser
node-htmlparser
master

搜索帮助