# u2pppw

**Repository Path**: qaz9877/u2pppw

## Basic Information

- **Project Name**: u2pppw
- **Description**: A crawler written in Go: building a Go crawler framework, functional programming in Go, concurrent programming in Go, and the Go `net/http` API
- **Primary Language**: Go
- **License**: Zlib
- **Default Branch**: master
- **Homepage**: https://gitee.com/qaz9877/u2pppw.git
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2020-11-10
- **Last Updated**: 2022-08-02

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

**Go crawler implementation**

- Learn the background needed for a Go crawler framework (_basics omitted_)
- Implement the Go crawler framework

**Project name: crawie**

1. Fetch web pages, convert their encoding, and detect the encoding format automatically.
2. Define the framework for the single-process crawler:

```
// ParesResult is what a parser returns: follow-up requests plus extracted items.
type ParesResult struct {
	Requests []Request
	Items    []interface{}
}

// Item is one extracted record together with the page it came from.
type Item struct {
	Url     string
	Id      string
	Payload interface{}
}

// NilParser ignores the page body and returns an empty result.
func NilParser([]byte) ParesResult {
	return ParesResult{}
}
```

3. Rework the code into the pipeline seeds -> engine -> scheduler -> fetcher -> worker -> engine:
   - engine simple.go (currently being implemented)
   - scheduler simple.go
   - scheduler quened.go
4. Implementation:

```
type SimpleEngine struct {
}

// Run processes the seed requests one by one, appending any newly
// discovered requests to the end of the queue.
func (e SimpleEngine) Run(seeds ...Request) {
	var requests []Request
	requests = append(requests, seeds...)
	for len(requests) > 0 {
		r := requests[0]
		requests = requests[1:]
		parseResult, err := worker(r)
		if err != nil {
			// Log and skip the page; log.Panicf would stop the whole crawl.
			log.Printf("fetch err: url %s: %v", r.Url, err)
			continue
		}
		requests = append(requests, parseResult.Requests...)
		for _, item := range parseResult.Items {
			log.Printf("Got item %v", item)
		}
	}
}

// worker fetches the page for one request.
func worker(r Request) (ParesResult, error) {
	log.Printf("fetch %s", r.Url)
	body, err := fetcher.Fetch(r.Url)
	if err != nil {
		log.Printf("fetch: error fetching url %s", r.Url)
		return ParesResult{}, err
	}
	return r.
```

5. Implement the simple scheduler.

**concurrent.go code**

```
type ConcurrentEngine struct {
	Scheduler   Scheduler
	WorkerCount int
	ItemChan    chan interface{}
}

type Scheduler interface {
	Submit(Request)
	ConfigureMasterWorkerChan(chan Request)
	WorkerReady(chan Request)
	Run()
}

// Run starts the scheduler and the workers, submits the seeds, and then
// loops forever, forwarding items and resubmitting discovered requests.
func (e *ConcurrentEngine) Run(sends ...Request) {
	out := make(chan ParesResult)
	e.Scheduler.Run()
	for i := 0; i < e.WorkerCount; i++ {
		createWorker(out, e.Scheduler)
	}
	for _, r := range sends {
		e.Scheduler.Submit(r)
	}
	for {
		result := <-out
		for _, item := range result.Items {
			fmt.Printf("Got item:%v \n", item)
			// Pass item as an argument so each goroutine sends its own value.
			go func(item interface{}) { e.ItemChan <- item }(item)
		}
		for _, request := range result.Requests {
			e.Scheduler.Submit(request)
		}
	}
}

func createWorker(out chan ParesResult, s Scheduler) {
	in := make(chan Request)
	go func() {
		for {
			// Tell the scheduler this worker is ready for a request.
			s.WorkerReady(in)
			request := <-in
			result, err := worker(request)
			if err != nil {
				continue
			}
			out <- result
		}
	}()
}
```

**scheduler simple.go implementation**

```
// SimpleScheduler sends every request to one channel shared by all workers.
type SimpleScheduler struct {
	workerChan chan engine.Request
}

func (s *SimpleScheduler) ConfigureMasterWorkerChan(c chan engine.Request) {
	s.workerChan = c
}

// Submit sends in a new goroutine so the caller never blocks on busy workers.
func (s *SimpleScheduler) Submit(r engine.Request) {
	go func() { s.workerChan <- r }()
}
```