青文杰 / colly

reddit.go 1.49 KB
Frank Cash committed on 2018-02-07 14:10 . Reads in reddit link from CLI, async
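package main

import (
	"fmt"
	"os"
	"sync"
	"time"

	"github.com/gocolly/colly"
)

// item holds the data scraped for a single reddit story.
type item struct {
	StoryURL  string
	Source    string
	CrawledAt time.Time
	Comments  string
	Title     string
}

// Usage (the URL is only an example):
//   go run reddit.go https://www.reddit.com/r/programming/
func main() {
	// The collector is asynchronous, so callbacks may run concurrently;
	// mu guards every append to stories.
	var mu sync.Mutex
	stories := []item{}

	// Instantiate default collector
	c := colly.NewCollector(
		// Visit only domains: www.reddit.com
		colly.AllowedDomains("www.reddit.com"),
		colly.Async(true),
	)

	// On every element with the class top-matter call callback
	// This class is unique to the div that holds all information about a story
	c.OnHTML(".top-matter", func(e *colly.HTMLElement) {
		temp := item{}
		temp.StoryURL = e.ChildAttr("a[data-event-action=title]", "href")
		// Source is hard-coded, whichever subreddit was passed on the CLI
		temp.Source = "https://www.reddit.com/r/programming/"
		temp.Title = e.ChildText("a[data-event-action=title]")
		temp.Comments = e.ChildAttr("a[data-event-action=comments]", "href")
		temp.CrawledAt = time.Now()

		mu.Lock()
		stories = append(stories, temp)
		mu.Unlock()
	})

	// On every span tag with the class next-button, follow its link
	c.OnHTML("span.next-button", func(h *colly.HTMLElement) {
		t := h.ChildAttr("a", "href")
		// Request.Visit resolves relative URLs against the current page
		h.Request.Visit(t)
	})

	// Set max Parallelism and introduce a Random Delay
	// A LimitRule needs a domain pattern; without one Limit returns an error
	if err := c.Limit(&colly.LimitRule{
		DomainGlob:  "www.reddit.com",
		Parallelism: 2,
		RandomDelay: 5 * time.Second,
	}); err != nil {
		fmt.Fprintln(os.Stderr, "limit rule:", err)
	}

	// Before making a request print "Visiting ..."
	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL.String())
	})

	// Crawl all reddits the user passes in
	reddits := os.Args[1:]
	for _, reddit := range reddits {
		if err := c.Visit(reddit); err != nil {
			fmt.Fprintln(os.Stderr, "visit", reddit+":", err)
		}
	}

	// Block until all asynchronous requests have finished
	c.Wait()
	fmt.Println(stories)
}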