# jsoup
**Repository Path**: liuher/jsoup
## Basic Information
- **Project Name**: jsoup
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 24
- **Created**: 2023-12-28
- **Last Updated**: 2023-12-28
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
# jsoup
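## Introduction
A fast and forgiving HTML parser:
- Fetch and parse HTML from a URL, file, or string;
- Convert an HTML document into a DOM structure and extract attributes and text from its elements;
- Manipulate HTML elements, attributes, and text;
- Sanitize user-submitted HTML, keeping only whitelisted elements and whitelisted attributes, checked element by element;
- Output tidy HTML or XHTML.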
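## Installation
Install the packages that match the feature you need:
Scenario 1: HTML operations (parsing, extracting from, and sanitizing HTML documents)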
```
npm install sanitize-html --save
npm install @types/sanitize-html --save-dev
```
Scenario 2: converting HTML to tidy XHTML
```
npm install @ohos/htmltoxml --save
```
For OpenHarmony npm environment configuration and more, see [How to install OpenHarmony npm packages](https://gitee.com/openharmony-tpc/docs/blob/master/OpenHarmony_npm_usage.md).
## Usage
### HTML operations
#### Parse HTML and extract attributes and text from elements
- Build a Parser with a handler
```
import { Parser } from 'htmlparser2'
// `html` is the HTML string to parse
const parser = new Parser({
  // Called once an opening tag and all of its attributes have been read
  onopentag(name, attributes) {
    // attributes is an object, so serialize it instead of printing [object Object]
    console.info(`jsoup onopentag name --> ${name} attributes --> ${JSON.stringify(attributes)}`)
  },
  ontext(text) {
    console.info("jsoup text -->", text);
  },
  onopentagname(name) {
    console.info("jsoup tagName -->", name);
  },
  onattribute(name, value) {
    console.info(`jsoup attribName name --> ${name} value --> ${value}`)
  },
  onclosetag(tagname) {
    console.info("jsoup closeTag --> ", tagname);
  },
});
parser.write(html);
parser.end();
```
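The callbacks above only log what they see. To extract data, a handler can keep state in a closure. The following is a minimal sketch of that idea: `createLinkCollector` and `links` are hypothetical names, and only the `onopentag(name, attributes)` callback shape shown above is assumed.

```typescript
// Hypothetical helper: builds a handler that records the href of every <a> tag.
interface LinkCollector {
  links: string[];
  handler: {
    onopentag(name: string, attributes: { [key: string]: string }): void;
  };
}

function createLinkCollector(): LinkCollector {
  const links: string[] = [];
  return {
    links,
    handler: {
      onopentag(name, attributes) {
        // Only anchors that actually carry an href are of interest.
        if (name === 'a' && attributes.href !== undefined) {
          links.push(attributes.href);
        }
      },
    },
  };
}

// Intended use with htmlparser2 (not executed in this sketch):
//   const collector = createLinkCollector();
//   const parser = new Parser(collector.handler);
//   parser.write(html);
//   parser.end();
//   console.info('jsoup links -->', JSON.stringify(collector.links));
```

Keeping the state next to the handler avoids global variables and lets several independent parsers run side by side.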
- Build a Parser with a DomHandler
```
import { Parser } from 'htmlparser2'
import { DomHandler } from 'domhandler'
import * as DomUtils from 'domutils'
const handler = new DomHandler((error, dom) => {
  if (error) {
    // Handle error
  } else {
    // Parsing completed, do something
    console.info('jsoup dom.toString()=' + dom + "");
    // Find all <style> elements in the parsed DOM
    let elements = DomUtils.getElementsByTagName('style', dom)
    console.info('jsoup elements.length=', elements.length);
    // Guard against documents that contain no <style> element
    if (elements.length > 0) {
      let element = elements[0]
      console.info('jsoup element=', Object.keys(element));
    }
    let text = DomUtils.getText(elements)
    console.info('jsoup text=', text);
  }
});
const parser = new Parser(handler, { decodeEntities: true });
parser.write(html);
parser.end();
```
- Parse with parseDocument
```
import { parseDocument } from 'htmlparser2'
import { Document } from 'domhandler'
import * as DomUtils from 'domutils'
let dom: Document = parseDocument(html)
// Work with the parsed DOM object through DomUtils
// Get elements by tag name
let element = DomUtils.getElementsByTagName('style', dom)
// Get the text content
let text = DomUtils.getText(element)
// Check whether the node is a tag
let isTag = DomUtils.isTag(element[0])
// Check whether the node is CDATA
let isCDATA = DomUtils.isCDATA(element[0])
// Check whether the node is text
let isText = DomUtils.isText(element[0])
// Check whether the node is a comment
let isComment = DomUtils.isComment(element[0])
// Get the children of a given element (here, the first <body>)
let body = DomUtils.getElementsByTagName('body', dom)
let children = DomUtils.getChildren(body[0])
```
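For intuition about what text extraction over the parsed tree involves, here is a simplified model of the traversal. The `SketchNode` interface is a hypothetical stand-in for the real domhandler node classes, and `collectText` is an illustration only, not the library's implementation.

```typescript
// Hypothetical, simplified node shape: text nodes carry `data`, tags carry `children`.
interface SketchNode {
  type: 'text' | 'tag' | 'comment';
  data?: string;
  children?: SketchNode[];
}

// Concatenate the text of a node (or a list of nodes), skipping comments.
function collectText(node: SketchNode | SketchNode[]): string {
  if (Array.isArray(node)) {
    return node.map((child) => collectText(child)).join('');
  }
  if (node.type === 'text') {
    return node.data || '';
  }
  if (node.type === 'tag' && node.children) {
    return collectText(node.children);
  }
  return '';
}

const sampleTree: SketchNode = {
  type: 'tag',
  children: [
    { type: 'text', data: 'Hello ' },
    { type: 'comment', data: 'ignored' },
    { type: 'tag', children: [{ type: 'text', data: 'world' }] },
  ],
};
console.info('jsoup sketch text -->', collectText(sampleTree));
```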
#### Obtain the HTML text
- Fetch the HTML text from a URL
```
import http from '@ohos.net.http';
let httpRequest = http.createHttp()
httpRequest.request('http://106.15.92.248/share/html.txt')
  .then((data) => {
    console.log("jsoup url html=" + JSON.stringify(data))
    // TODO do something
    if (data.result && typeof data.result === 'string') {
      parser.write(data.result);
      parser.end();
    }
  })
  .catch((err) => {
    console.error('jsoup connect error:' + JSON.stringify(err));
  })
```
- Read the HTML text from a file stream
```
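import fileio from '@ohos.fileio';
// `stream` is assumed to be a fileio Stream opened beforehand,
// and `html.length` the number of bytes to read from it
let buf = new ArrayBuffer(html.length)
stream.readSync(buf, {
  offset: 0, length: html.length, position: 0
})
// Convert the bytes to a string (one character per byte)
let dom = String.fromCharCode.apply(null, new Uint8Array(buf))
// TODO do something
parser.write(dom);
parser.end();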
```
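`String.fromCharCode.apply` maps every byte to one character and passes the whole buffer as function arguments, so it is only dependable for small ASCII files. One possible alternative, assuming a standard `TextDecoder` is available (on OpenHarmony, `util.TextDecoder` from `@ohos.util` plays a similar role), is sketched here; `bytesToString` is a hypothetical helper name.

```typescript
// Hypothetical helper: decode UTF-8 bytes into a string.
function bytesToString(bytes: Uint8Array): string {
  const decoder = new TextDecoder('utf-8');
  return decoder.decode(bytes);
}

// Round trip with a non-ASCII character, which a per-byte conversion would garble.
const sampleBytes = new TextEncoder().encode('<p>café</p>');
const decodedHtml = bytesToString(sampleBytes);
console.info('jsoup decoded html -->', decodedHtml);
```

With the stream example above, the call would be `bytesToString(new Uint8Array(buf))`.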
- Read the HTML text from a rawfile
```
import util from '@ohos.util';
// Note: this variable must first be assigned in MainAbility: globalThis.Context = this.context;
if (!globalThis.Context) {
  console.log('jsoup global Context is undefined');
  return;
}
globalThis.Context.resourceManager.getRawFile(filePath)
  .then((data) => {
    var textDecoder = new util.TextDecoder("utf-8", {
      ignoreBOM: true
    })
    var result: string = textDecoder.decode(data, {
      stream: false
    })
    // TODO do something
    parser.write(result);
    parser.end();
  })
  .catch((err) => {
    console.log("jsoup getHtmlFromRawFile err=" + err)
  })
```
- Read the HTML text from a file path
```
import fileio from '@ohos.fileio';
if (!globalThis.Context) {
  console.log('jsoup global Context is undefined');
  return;
}
var filePath = globalThis.Context.filesDir + '/jsoup.html';
fileio.readText(filePath)
  .then((data) => {
    console.log("jsoup getHtmlFromFilePath text=" + data);
    // TODO do something
    parser.write(data);
    parser.end();
  })
  .catch((err) => {
    console.log("jsoup getHtmlFromFilePath err=" + err)
  })
```
#### Sanitize HTML and manipulate HTML elements, attributes, and text
- Import the module
```
import SanitizeHtml from 'sanitize-html'
```
- Sanitize HTML
Use the default list of tags and attributes:
```
const clean = SanitizeHtml(dirty);
```
Keep only specific allowed tags and attributes:
```
const clean = SanitizeHtml(dirty, {
  allowedTags: [ 'b', 'i', 'em', 'strong', 'a' ],
  allowedAttributes: {
    'a': [ 'href' ]
  },
  allowedIframeHostnames: ['www.youtube.com']
});
```
Add tags on top of the default list:
```
const clean = SanitizeHtml(dirty, {
  allowedTags: SanitizeHtml.defaults.allowedTags.concat([ 'img' ])
});
```
Escape disallowed tags instead of removing them:
```
const clean = SanitizeHtml('before <img src="test.png" /> after', {
  disallowedTagsMode: 'escape',
  allowedTags: [],
  allowedAttributes: false
})
```
Allow all tags or all attributes:
```
allowedTags: false,
allowedAttributes: false
```
Disallow all tags:
```
allowedTags: [],
allowedAttributes: {}
```
Allow specific CSS classes on specific elements:
```
const clean = SanitizeHtml(dirty, {
  allowedTags: [ 'p', 'em', 'strong' ],
  allowedClasses: {
    'p': [ 'fancy', 'simple' ]
  }
});
```
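Conceptually, `allowedClasses` keeps the listed class names on a matching element and drops the rest. The sketch below models that filtering on a plain `class` attribute string. `filterClasses` and `AllowedClasses` are hypothetical names, this is not sanitize-html's implementation, and any more advanced matching the library may offer is deliberately left out.

```typescript
// Hypothetical model of per-tag class filtering.
type AllowedClasses = { [tag: string]: string[] };

function filterClasses(tag: string, classAttr: string, allowed: AllowedClasses): string {
  // Tags without an entry keep no classes at all.
  const permitted = allowed[tag] || [];
  return classAttr
    .split(/\s+/)
    .filter((cls) => cls.length > 0 && permitted.indexOf(cls) !== -1)
    .join(' ');
}

const sampleAllowed: AllowedClasses = { p: ['fancy', 'simple'] };
console.info(filterClasses('p', 'fancy big simple', sampleAllowed)); // "fancy simple"
console.info(filterClasses('em', 'fancy', sampleAllowed));           // ""
```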
Allow specific CSS styles on specific elements:
```
const clean = SanitizeHtml(dirty, {
  allowedTags: ['p'],
  allowedAttributes: {
    'p': ["style"],
  },
  allowedStyles: {
    '*': {
      // Match HEX and RGB
      'color': [/^#(0x)?[0-9a-f]+$/i, /^rgb\(\s*(\d{1,3})\s*,\s*(\d{1,3})\s*,\s*(\d{1,3})\s*\)$/],
      'text-align': [/^left$/, /^right$/, /^center$/],
      // Match any number with px, em, or %
      'font-size': [/^\d+(?:px|em|%)$/]
    },
    'p': {
      'font-size': [/^\d+rem$/]
    }
  }
});
```
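To see which style values the patterns above would accept, the regular expressions can be exercised on their own. This sketch only tests the regexes; `matchesAny` is a hypothetical helper and sanitize-html itself is not invoked.

```typescript
// The same patterns as in the allowedStyles example.
const colorPatterns: RegExp[] = [
  /^#(0x)?[0-9a-f]+$/i,
  /^rgb\(\s*(\d{1,3})\s*,\s*(\d{1,3})\s*,\s*(\d{1,3})\s*\)$/,
];
const fontSizePatterns: RegExp[] = [/^\d+(?:px|em|%)$/];

// Hypothetical helper: true when at least one pattern matches the value.
function matchesAny(patterns: RegExp[], value: string): boolean {
  return patterns.some((pattern) => pattern.test(value));
}

console.info(matchesAny(colorPatterns, '#ff0000'));        // true
console.info(matchesAny(colorPatterns, 'rgb(255, 0, 0)')); // true
console.info(matchesAny(colorPatterns, 'red'));            // false: named colors are not listed
console.info(matchesAny(fontSizePatterns, '12px'));        // true
console.info(matchesAny(fontSizePatterns, '1.5em'));       // false: \d+ does not match decimals
```

Checking patterns this way before passing them to the sanitizer helps catch values that would be silently dropped, such as decimal font sizes.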
- Transform tags
```
const dirty='