# jsoup

**Repository Path**: echaya2022/jsoup

## Basic Information

- **Project Name**: jsoup
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 24
- **Created**: 2022-09-01
- **Last Updated**: 2022-09-08

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# jsoup

## Introduction

- Fetches and parses HTML from a URL, an HTML string, a file stream, a file path, or a rawfile path;
- Operates on HTML elements, attributes, and text;
- Sanitizes HTML so that it can be trusted.

![preview.gif](preview/preview.gif)

## Download and Installation

```
npm install @ohos/jsoup --save
```

For OpenHarmony npm environment setup and further details, see [How to install OpenHarmony npm packages](https://gitee.com/openharmony-tpc/docs/blob/master/OpenHarmony_npm_usage.md).

## Usage

1. Import the dependencies

```
import { Jsoup, SanitizeHtml, Parser, DomHandler, Document, DomUtils } from '@ohos/jsoup'
```

2. Parse HTML

```
const html = ` Document

kkkk

hhhhh

cshi
wwww

wjdwekfe>>>>>

dsjfw<<<<

dksfmjk owqkdo `
```

Parsing method 1:

```
// Event-style parsing: the parser invokes each callback as it reads the input.
const parser = new Parser.Parser({
  onopentag(name, attributes) {
    // attributes is an object, so serialize it before logging
    console.info(`jsoup onopentag name --> ${name} attributes --> ${JSON.stringify(attributes)}`)
  },
  ontext(text) {
    console.info("jsoup text -->", text);
  },
  onopentagname(name) {
    console.info("jsoup tagName -->", name);
  },
  onattribute(name, value) {
    console.info(`jsoup attribName name --> ${name} value --> ${value}`)
  },
  onclosetag(tagname) {
    console.info("jsoup closeTag --> ", tagname);
  },
});
parser.write(html);
parser.end();

// Or: let a DomHandler build the DOM and receive it in a callback.
const handler = new DomHandler((error, dom) => {
  if (error) {
    // Handle error
  } else {
    // Parsing completed, do something
  }
});
const domParser = new Parser.Parser(handler, { decodeEntities: true });
domParser.write(html);
domParser.end();
```

Parsing method 2:

```
let dom: Document = Parser.parseDocument(html)
```

3. Get the HTML

- Method 1: get the HTML text from a URL

```
import http from '@ohos.net.http'

let httpRequest = http.createHttp()
httpRequest.request('http://106.15.92.248/share/html.txt')
  .then((data) => {
    console.log("jsoup url html=" + JSON.stringify(data))
    // Only feed the parser when the response body is a string
    if (data.result && typeof data.result === 'string') {
      parser.write(data.result);
      parser.end();
    }
  })
  .catch((err) => {
    console.error('jsoup connect error:' + JSON.stringify(err));
  })
```

- Method 2: get the HTML text from a file stream

```
// stream is a fileio.Stream opened on the HTML file; the call returns the HTML text
let htmlText = Jsoup.parseHtmlFromFile(stream, html.length)
```

- Method 3: get the HTML text from a rawfile

```
import util from '@ohos.util'

// Note: assign this variable in MainAbility first: globalThis.Context = this.context;
if (!globalThis.Context) {
  console.log('jsoup global Context is undefined');
  return;
}
var filePath = globalThis.Context.filesDir + '/testHtml.html';
globalThis.Context.resourceManager.getRawFile(filePath)
  .then((data) => {
    // Decode the raw bytes as UTF-8 text
    var textDecoder = new util.TextDecoder("utf-8", { ignoreBOM: true })
    var result: string = textDecoder.decode(data, { stream: false })
    console.log("jsoup getHtmlFromRawFile text=" + result);
    this.createFile(filePath);
    this.writeFile(filePath, result);
  })
  .catch((err) => {
    console.log("jsoup getHtmlFromRawFile err=" + err)
  })
```

- Method 4: get the HTML text from a file path

```
import fileio from '@ohos.fileio'

if (!globalThis.Context) {
  console.log('jsoup global Context is undefined');
  return;
}
var filePath = globalThis.Context.filesDir + '/testHtml.html';
fileio.readText(filePath)
  .then((data) => {
    console.log("jsoup getHtmlFromFilePath text=" + data);
    parser.write(data);
    parser.end();
  })
  .catch((err) => {
    console.log("jsoup getHtmlFromFilePath err=" + err)
  })
```

4. Extract HTML attributes

```
// Extract the CSS
Jsoup.parseCSS(html)
```

Extraction on an already parsed DOM object:

```
// Get elements by tag name
let element = DomUtils.getElementsByTagName('style', dom)
// Get the text
let text = DomUtils.getText(element)
// Check whether the element is a tag
let isTag = DomUtils.isTag(element[0])
// Check whether the element is CDATA
let isCDATA = DomUtils.isCDATA(element[0])
// Check whether the element is Text
let isText = DomUtils.isText(element[0])
// Check whether the element is a Comment
let isComment = DomUtils.isComment(element[0])
// Get the children of a given element (here, the first body element)
let body = DomUtils.getElementsByTagName('body', dom)
let children = DomUtils.getChildren(body[0])
```

5. Sanitize HTML

```
const clean = SanitizeHtml('before after', {
  disallowedTagsMode: 'escape',
  allowedTags: [],
  allowedAttributes: false
})
```

## API Reference

1. Parse HTML given as a string

Method 1:

```
interface ParserOptions {
  decodeEntities?: boolean;
  lowerCaseTags?: boolean;
  lowerCaseAttributeNames?: boolean;
  recognizeCDATA?: boolean;
  recognizeSelfClosing?: boolean;
}

interface Handler {
  onparserinit(parser: Parser): void;
  onreset(): void;
  onend(): void;
  onerror(error: Error): void;
  onclosetag(name: string): void;
  onopentagname(name: string): void;
  onattribute(name: string, value: string, quote?: string | undefined | null): void;
  onopentag(name: string, attribs: { [s: string]: string; }): void;
  ontext(data: string): void;
  oncomment(data: string): void;
  oncdatastart(): void;
  oncdataend(): void;
  oncommentend(): void;
  onprocessinginstruction(name: string, data: string): void;
}

const parser = new Parser.Parser(cbs: Partial<Handler> | null, options?: ParserOptions)
parser.write(html)
parser.end();
```

Method 2:

```
parseDocument(data: string, options?: Options): Document
```

2. Extract HTML attributes

For the DomUtils interface definitions, see the [Doc](https://domutils.js.org/modules.html).

```
Jsoup.parseCSS(html: string): string
```

3. Get HTML from a file stream

```
Jsoup.parseHtmlFromFile(stream: fileio.Stream, htmlLength: number): string
```

4. Sanitize HTML

```
SanitizeHtml(dirty: string, options?: sanitize.IOptions): string

// Configurable options:
interface Attributes { [attr: string]: string; }
interface Tag { tagName: string; attribs: Attributes; text?: string | undefined; }
type Transformer = (tagName: string, attribs: Attributes) => Tag;
type AllowedAttribute = string | { name: string; multiple?: boolean | undefined; values: string[] };

allowedAttributes?: Record<string, AllowedAttribute[]> | false;
allowedStyles?: { [index: string]: { [index: string]: RegExp[] } };
allowedClasses?: { [index: string]: boolean | Array<string | RegExp> }
allowedIframeDomains?: string[];
allowedIframeHostnames?: string[];
allowIframeRelativeUrls?: boolean;
allowedSchemes?: string[] | boolean;
allowedSchemesByTag?: { [index: string]: string[] } | boolean;
allowedSchemesAppliedToAttributes?: string[];
allowedScriptDomains?: string[];
allowedScriptHostnames?: string[];
allowProtocolRelative?: boolean;
allowedTags?: string[] | false;
allowVulnerableTags?: boolean;
textFilter?: ((text: string, tagName: string) => string);
exclusiveFilter?: ((frame: IFrame) => boolean);
nonTextTags?: string[];
selfClosing?: string[];
transformTags?: { [tagName: string]: string | Transformer };
parser?: ParserOptions;
disallowedTagsMode?: 'discard' | 'escape' | 'recursiveEscape';
enforceHtmlBoundary?: boolean;
```

## Compatibility

Supports OpenHarmony API version 9 and above.

## Directory Structure

````
|---- jsoup
|     |---- entry  # sample code folder
|           |----src
|                |----addTag.ets
|                |----index.ets
|     |---- jsoup  # jsoup library folder
|           |----src
|                |----main
|                     |----ets
|                          |----common  # templates
|                               |----Cleaner.ts  # HTML clean
|                               |----Jsoup.ts  # HTML parsing
|           |---- index.ts  # public interface
|     |---- README.md  # installation and usage
````

## Contributing

If you run into any problem while using the library, please open an [Issue](https://gitee.com/openharmony-sig/jsoup/issues). You are also very welcome to send us a [PR](https://gitee.com/openharmony-sig/jsoup/pulls).

## License

This project is licensed under the [MIT](https://gitee.com/openharmony-sig/jsoup/blob/master/LICENSE) license. Feel free to enjoy and take part in open source.
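
To make the `disallowedTagsMode: 'escape'` setting from the sanitizing step under Usage more concrete, here is a minimal, self-contained sketch of what escaping markup means. It is not the `@ohos/jsoup` implementation: `escapeMarkup`, `HTML_ESCAPES`, and the sample string are hypothetical names and data introduced only for illustration, and the real `SanitizeHtml` may differ in which characters it escapes.

```typescript
// Toy stand-in for an "escape" policy applied to markup that is not allowed.
// Assumption: this only illustrates the idea; it is not the library's code.
const HTML_ESCAPES: Record<string, string> = {
  '&': '&amp;',
  '<': '&lt;',
  '>': '&gt;',
  '"': '&quot;',
};

// Hypothetical helper: turns markup characters into HTML entities.
function escapeMarkup(dirty: string): string {
  // A single pass over the input, so the '&' produced by one replacement
  // is never escaped a second time.
  return dirty.replace(/[&<>"]/g, (ch) => HTML_ESCAPES[ch]);
}

// Hypothetical input: the tag stays visible as text instead of being dropped.
const sample = 'before <b>bold</b> after';
console.log(escapeMarkup(sample));
// prints: before &lt;b&gt;bold&lt;/b&gt; after
```

With `allowedTags: []` every tag counts as disallowed, so under this reading an escape-style policy would keep such tags as literal text rather than silently removing them.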