# RamX

**Repository Path**: rxlib/ram-x

## Basic Information

- **Project Name**: RamX
- **Description**: RamX is an HTML5 DOM parsing library written in pure C++17 with no external dependencies. It parses any HTML string into a complete DOM tree with browser-consistent behavior (error tolerance, quirks mode, etc.) and provides the standard DOM interface: `getElementById`, `getElementsByTagName`, `querySelector`, etc.
- **Primary Language**: C++
- **License**: Apache-2.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-04-10
- **Last Updated**: 2026-04-10

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# RamX

**Pure C++17 HTML5 DOM Parser with XPath 1.0 Support**

```
License: Apache-2.0
Language: C++17
Dependencies: None (pure standard library)
Platforms: Windows / Linux / macOS
```

***

## What Is It?

RamX is a pure C++17 HTML5 DOM parsing library with **zero external dependencies**.

It provides four core capabilities:

- **HTML Parsing** - Parses any HTML string into a complete DOM tree, with browser-grade error tolerance and quirks mode support
- **DOM API** - Full standard DOM interface: `getElementById`, `getElementsByTagName`, `querySelector`, etc.
- **XPath 1.0** - Built-in XPath parser and evaluator supporting expressions like `//div[@id='main']//p`
- **Resource Loading** - Built-in `ResourceLoader` with customizable hooks for intercepting and processing external resources (CSS, JS, images, etc.)

RamX's parsing logic and DOM tree-building algorithm are inspired by Google's **[Gumbo HTML Parser](https://github.com/google/gumbo-parser)**.
The tree construction algorithm, data structures, and error handling follow Gumbo's design while being rewritten entirely in modern C++. RamX then extends this foundation with a complete DOM API, native XPath support, and a flexible resource loading system with customizable hooks.

***

## License

RamX is open-source under the **Apache License 2.0**, the same license as Gumbo.

> Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
>
> ```
> http://www.apache.org/licenses/LICENSE-2.0
> ```
>
> This project is licensed under the **Apache License 2.0**. See the [LICENSE](./LICENSE) file for the full terms.

***

## Directory Structure

```
RamX/
├── include/                 # Header files
│   ├── RamX.h               # Main umbrella header
│   ├── base.h               # Base types (NodeType, QuirksMode, ResourceLoader, etc.)
│   ├── core.h               # Core classes (Node, HTMLElement, HTMLDocument)
│   ├── collections.h        # Collection classes (NodeList, HTMLCollection, DOMTokenList)
│   ├── events.h             # Event system (Event, MouseEvent, etc.)
│   ├── style.h              # CSS style object
│   ├── tokenizer.h          # HTML5 tokenizer
│   ├── parser.h             # HTML5 parser
│   ├── encoding.h           # Encoding detection and conversion
│   └── xpath.h              # XPath 1.0 parser and evaluator
├── src/                     # Source files
│   ├── core.cpp             # Core implementation
│   ├── tokenizer.cpp        # HTML5 tokenizer implementation
│   ├── parser.cpp           # HTML5 parser implementation
│   ├── encoding.cpp         # Encoding detection and conversion
│   ├── xpath.cpp            # XPath 1.0 implementation
│   └── resource_loader.cpp  # ResourceLoader implementation
├── test/                    # Test cases
├── CMakeLists.txt           # CMake build config
└── LICENSE                  # Apache License 2.0
```

***

## Build

### Requirements

- C++17 compiler (GCC 9+ / Clang 10+ / MSVC 2019+)
- CMake 3.14+

### Steps

```bash
# 1. Create the build directory
mkdir build && cd build

# 2. Configure CMake
cmake .. -DCMAKE_BUILD_TYPE=Release

# 3. Build
cmake --build . --config Release

# 4. Run the basic tests
cd Release
./RamX_test_basic
```

### Dependencies

**None.** RamX uses only the C++ standard library. No Boost, ICU, or other third-party libraries are required, so it can be integrated directly into any C++ project.

***

## Quick Start

### Include the Header

```cpp
#include <RamX.h>

// Optional: use the namespace for convenience
using namespace RamX;
```

### Simplest Usage: Parse HTML

```cpp
#include <RamX.h>
#include <iostream>

using namespace RamX;

int main() {
    auto doc = parseHtml("<html><head><title>Hello</title></head>"
                         "<body><p>World</p></body></html>");

    std::cout << doc->title << std::endl;             // Hello
    std::cout << doc->body->innerText() << std::endl; // World
    return 0;
}
```

### Encoding Detection

RamX supports automatic encoding detection and conversion for GBK, Big5, UTF-16, and other encodings:

```cpp
#include <RamX.h>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>

using namespace RamX;

int main() {
    // Read a file with unknown encoding as raw bytes
    std::ifstream file("page.html", std::ios::binary);
    std::string htmlBytes((std::istreambuf_iterator<char>(file)),
                          std::istreambuf_iterator<char>());

    // Detect the encoding
    EncodingType encoding = EncodingDetector::detect(htmlBytes);
    std::cout << "Detected: " << static_cast<int>(encoding) << std::endl;

    // Parse with automatic conversion
    auto doc = parseWithEncoding(htmlBytes, encoding);

    // Or parse directly (auto-detects the encoding)
    auto doc2 = parseHtml(htmlBytes);
    return 0;
}
```

## Feature Tour

All code below is taken from `test/test_basic.cpp`, `test/test_xpath.cpp`, and `test/test_resource_loader.cpp`. These are real, runnable test cases.

### Parse HTML

```cpp
auto doc = parseHtml(R"(
<!DOCTYPE html>
<html>
<head><title>My Page</title></head>
<body>
  <div id="main" class="container active">
    <p class="text">Paragraph 1</p>
    <p class="text highlight">Paragraph 2</p>
  </div>
</body>
</html>
)");

doc->title;       // "My Page"
doc->body;        // the <body> element
doc->head;        // the <head> element
doc->doctypeName; // "html"
```

### DOM Queries

```cpp
// By ID (returns the first match)
auto main = doc->getElementById("main"); // the <div id="main"> element
// By tag name (returns an HTMLCollection)
auto paragraphs = doc->getElementsByTagName("p"); // two <p> elements

// By class name
auto textItems = doc->getElementsByClassName("text"); // two matches

// Multiple classes (AND logic): elements that have both .text and .highlight
auto highlighted = doc->getElementsByClassNames("text highlight"); // 1 match

// By name attribute
auto inputs = doc->getElementsByName("field1");

// CSS selectors
auto firstP  = doc->querySelector("p");      // first <p>
auto allDivs = doc->querySelectorAll("div"); // all <div> elements
auto byId    = doc->querySelector("#main");  // id selector
auto byClass = doc->querySelector(".text");  // class selector
```

### Element Attributes

```cpp
auto div = doc->getElementById("main");

div->getAttribute("class");          // "container active"
div->hasAttribute("id");             // true
div->setAttribute("data-id", "123");
div->removeAttribute("class");

// classList (browser-style DOMTokenList).
// The results below assume the element still has class="container active".
div->classList->add("new-class");
div->classList->remove("active");
div->classList->contains("container"); // true
div->classList->length;                // 2
```

### Node Operations

```cpp
auto parent = doc->createElement("div");
auto child1 = doc->createElement("span");
auto child2 = doc->createElement("a");

parent->appendChild(child1);
parent->appendChild(child2);

parent->firstChild();        // child1
parent->lastChild();         // child2
child1->nextSibling();       // child2
child2->previousSibling();   // child1
parent->hasChildNodes();     // true
parent->childNodes.size();   // 2

parent->removeChild(child1);
```

### Content Access

```cpp
auto div = doc->getElementById("main");

div->innerHTML();                        // get the inner HTML
div->setInnerHTML("<p>New content</p>");

div->innerText();                        // plain text (without tags)
div->setInnerText("Plain text");

div->textContent();                      // all text (including hidden text)
```

### Node Insertion

```cpp
auto container = doc->getElementById("main");
auto newDiv = doc->createElement("div");
container->appendChild(newDiv); // append at the end

// Insert at specific positions
auto sibling = doc->querySelector("p");
sibling->insertAdjacentElement("beforebegin", doc->createElement("h2")); // as the previous sibling
sibling->insertAdjacentElement("afterend", doc->createElement("h3"));    // as the next sibling

// Insert inside an element
newDiv->insertAdjacentHTML("afterbegin", "Hi");
newDiv->insertAdjacentText("beforeend", "click me");
```

### Resource Loading with Hooks

RamX provides a built-in `ResourceLoader` class that lets you intercept and handle external resource requests (stylesheets, scripts, images, etc.) through customizable hooks.

#### Basic Usage

```cpp
#include <RamX.h>
#include <iostream>
#include <memory>
#include <string>

using namespace RamX;

int main() {
    // Create the ResourceLoader
    auto loader = std::make_shared<ResourceLoader>();

    // Define the hooks
    ResourceLoader::ResourceHook hook;
    hook.beforeFetch = [](const std::string& url, std::shared_ptr<Node> requesterNode) {
        std::cout << "Before fetch: " << url << std::endl;
    };
    hook.afterFetch = [](const std::string& data, std::shared_ptr<Node> requesterNode) {
        std::cout << "After fetch: " << data << std::endl;
    };

    // Register the hooks for specific resource types
    loader->onResourceHook(ResourceType::LINK_SRC, hook);   // CSS files
    loader->onResourceHook(ResourceType::SCRIPT_SRC, hook); // JavaScript files
    loader->onResourceHook(ResourceType::IMG_SRC, hook);    // Images

    // Create parsing options with the ResourceLoader
    Options options;
    options.resourceLoader = loader;

    // Parse HTML with resource loading enabled.
    // The file names in this markup are placeholders.
    auto doc = parseHtml(R"(
        <html>
        <head>
          <link rel="stylesheet" href="style.css">
          <script src="app.js"></script>
        </head>
        <body>
          <img src="image.png" alt="Test image">
        </body>
        </html>
    )", options);
    return 0;
}
```

#### Handling Inline Content

ResourceLoader also handles inline script and style content through `SCRIPT_TEXT` and
`CSS_TEXT` resource types:

```cpp
// Register hooks for inline content.
// Designated initializers are a C++20 feature, so plain member assignment is used
// here to stay within C++17.
ResourceLoader::ResourceHook scriptHook;
scriptHook.afterFetch = [](const std::string& content, std::shared_ptr<Node> node) {
    std::cout << "Inline script content: " << content << std::endl;
};
loader->onResourceHook(ResourceType::SCRIPT_TEXT, scriptHook);

ResourceLoader::ResourceHook styleHook;
styleHook.afterFetch = [](const std::string& content, std::shared_ptr<Node> node) {
    std::cout << "Inline style content: " << content << std::endl;
};
loader->onResourceHook(ResourceType::CSS_TEXT, styleHook);
```

#### Custom Resource Processing

You can implement custom resource processing logic in your hooks, such as asynchronous loading, caching, or modifying resource content:

```cpp
ResourceLoader::ResourceHook cssHook;

cssHook.beforeFetch = [](const std::string& url, std::shared_ptr<Node> node) {
    // Mark the resource as loading
    auto element = std::dynamic_pointer_cast<HTMLElement>(node);
    if (element) {
        element->setAttribute("data-loading", "true");
    }
};

cssHook.afterFetch = [](const std::string& data, std::shared_ptr<Node> node) {
    // Mark the resource as loaded
    auto element = std::dynamic_pointer_cast<HTMLElement>(node);
    if (element) {
        element->setAttribute("data-loading", "false");
        element->setAttribute("data-loaded", "true");
        // Process the loaded CSS data here
        std::cout << "CSS loaded: " << data.substr(0, 50) << "..." << std::endl;
    }
};

loader->onResourceHook(ResourceType::LINK_SRC, cssHook);
```

***

## ResourceLoader Hooks: Intercept External Resources

RamX ships a built-in **`ResourceLoader`** that intercepts external resource requests during HTML parsing (CSS, JS, images, videos, iframes, etc.). By registering `beforeFetch` and `afterFetch` hooks, you can inject custom logic before and after each resource is loaded, which suits crawlers, server-side rendering, and offline-first applications.
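As a rough illustration of the before/after hook flow described above, here is a small self-contained C++17 sketch of a hook registry keyed by resource kind. It does not use RamX and is not how RamX is implemented: `Kind`, `Hook`, `MiniLoader`, `demo`, and the in-memory table that stands in for real I/O are all hypothetical names chosen for this example.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration only; these are NOT RamX types.
enum class Kind { Stylesheet, Script, Image };

struct Hook {
    std::function<void(const std::string& url)> beforeFetch;  // runs before the load
    std::function<void(const std::string& data)> afterFetch;  // runs after the load
};

class MiniLoader {
public:
    // Register a hook for one resource kind; several hooks per kind are allowed.
    void on(Kind kind, Hook hook) { hooks_[kind].push_back(std::move(hook)); }

    // Load a resource and fire the matching hooks around the fetch.
    std::string load(Kind kind, const std::string& url) const {
        auto it = hooks_.find(kind);
        if (it != hooks_.end()) {
            for (const auto& h : it->second) {
                if (h.beforeFetch) h.beforeFetch(url);
            }
        }

        std::string data = fetch(url);

        if (it != hooks_.end()) {
            for (const auto& h : it->second) {
                if (h.afterFetch) h.afterFetch(data);
            }
        }
        return data;
    }

private:
    // An in-memory table stands in for network or disk access (assumption).
    static std::string fetch(const std::string& url) {
        static const std::map<std::string, std::string> store = {
            {"style.css", "body { margin: 0; }"},
        };
        auto it = store.find(url);
        return it == store.end() ? std::string() : it->second;
    }

    std::map<Kind, std::vector<Hook>> hooks_;
};

// Performs two loads and returns the sequence of hook events that fired.
std::vector<std::string> demo() {
    std::vector<std::string> log;
    MiniLoader loader;

    Hook hook;
    hook.beforeFetch = [&log](const std::string& url) { log.push_back("before:" + url); };
    hook.afterFetch = [&log](const std::string& data) { log.push_back("after:" + data); };
    loader.on(Kind::Stylesheet, hook);

    loader.load(Kind::Stylesheet, "style.css");  // both hooks fire
    loader.load(Kind::Script, "app.js");         // no hook registered for scripts
    return log;
}
```

Calling `demo()` yields two entries, the `before:` event followed by the `after:` event for the stylesheet; the script load is silent because no hook was registered for that kind. Keying the registry by kind mirrors the per-`ResourceType` registration shown in the README examples, but the real `ResourceLoader` may differ in its signatures and dispatch order.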
### Supported Resource Types

| ResourceType | Description                    |
| ------------ | ------------------------------ |
| `LINK_SRC`   | `<link>` external stylesheet   |
| `CSS_TEXT`   | Inline `<style>` content       |

loader->onResourceHook(ResourceType::VIDEO_SRC, hook); //