大家好,我叫谢伟,是一名程序员。
我们已经研究了:
Golang 环境的搭建、设置GOPATH、GOROOT 参数,Govendor 包管理, Goland 集成开发环境
Golang 语言学习专栏 -- 第一期Golang 的基础知识:变量声明、基本数据类型、基本数据结构(map、数组、切片、结构体)、流程控制、循环操作等
Golang 语言学习专栏 -- 第二期Golang 函数:入参、返回值、匿名函数、函数作为参数、函数作为返回值
Golang 语言学习专栏 -- 第三期Golang 结构体:声明和定义、组合、格式化显示、访问字段、方法定义
Golang 语言学习专栏 -- 第四期Golang 错误处理机制
Golang 语言学习专栏 -- 第五期Golang 结构体
Golang 语言学习专栏 -- 第六期
不管学习什么,如果没有得到快速入门的机会,会丧失学习的动力。进而失去深入研究一门技能的机会。这对初学者或者自学者来说,这一点非常的重要,不然的话,会重复的抓起沙子,而建设不了大厦,所以说自信心很重要。
这节呢,使用之前学习的知识。完成一个小任务。
作为程序员呢。我们在专注学习研究技术的同时,也需要关注一些技术的热点。那怎么才能关注技术热点,比如现在的技术人员在研究些什么、关注些什么?
方法当然是上主流的技术社区,了解现在的技术人员在研究些什么东西。
这里我们说的主流的技术社区,认为是 Github 。因为这个托管网站实在是存在太多值得你研究的东西、巨多开源的技术值得你去研究。
Github 专门有一个链接指向当天最热门的项目。从这一个侧面,我们大概可以了解到热门的语言的一些热门项目。
还可以根据编程语言查看热门的项目:
比如:
语言 | 链接 |
---|---|
Python | Github Trending Python |
Go | Github Trending Go |
我们的目的是:抓取这些热门的项目的一些信息。(因为我发现,不管是Python 还是Go 爬虫似乎总能很好的激发学习者的兴趣?)
任务就是上面两张图里的内容:
- 定义抓取字段
- 获取网页信息
- 解析网页信息
- 任务调度
- 函数主入口
这里在提一点:初学者往往不太注重自己的项目的工程结构。什么意思呢?意思是说初学者往往注重在实现部分,认为实现了功能,整个工程就差不多结束了,就理所当然的认为自己的开发任务完成了。实际上在企业里的任务开发和你自己练手玩的项目很不一样,企业里的任务开发往往会根据需求变动,假如在学校里,你做一个项目,老师给你定下了一个任务,中途又改变了,待你代码差不多写好了,又更改了任务目标,看上去你肯定会抱怨老师,实际上这种情形在企业里开发是日常很常见的。所以,刚开始我就建议初学者或者自学者坚持一项好的工程组织结构,以后都在这个项目的组织结构上动态的调整(主体不变,内部细节调整)。事实上很多设计模式或者软件设计架构都是有一套固定的项目组织结构。这样保证项目可扩展性、低耦合等
项目结构
就爬虫项目,给你推荐下面一个工程目录:
workspace
download
download.go
engine
engine.go
object.go
infra
util.go
main
main.go
parse
github
github_trending_parse.go
解释下各个文件的含义:
download
download.go
定位为:下载器
download.go
完成的是:获取网页信息
engine
engine.go
object.go
定位为:调度引擎
engine.go
完成的是:爬虫任务的调度
object.go
完成的是: 定义抓取的字段
infra
util.go
定位为:基础设施
util.go
完成的是:项目需要的一些辅助函数
main
main.go
主函数入口。没什么好说的。
parse
github
github_trending_parse.go
定位为:解析器
github_trending_parse.go
完成的是:解析github 网站的一些解析函数
下载器
// download.go
var (
ErrorNil = errors.New("response is nil")
ErrorWrongCode = errors.New("http response code is wrong")
)
func Download(url string) (*goquery.Document, error) {
var (
resp *http.Response
err error
)
if resp, err = http.Get(url); err != nil {
return nil, ErrorNil
}
defer resp.Body.Close()
if resp.StatusCode != 200 {
return nil, ErrorWrongCode
}
return goquery.NewDocumentFromReader(resp.Body)
}
- 注意函数命名
- 注意错误处理机制:建议每个文件的开头定义一些错误信息
解析器
// github_trending_parse.go
func ParseForGithub(document *goquery.Document) {
document.Find("div.explore-content ol.repo-list li").Each(func(i int, selection *goquery.Selection) {
RespName, _ := infra.HandleCommon(selection.Find("div h3 a").Text())
URL, _ := infra.HandlerURL(selection.Find("div").Eq(0).Find("h3 a").AttrOr("href", "None"))
Description, _ := infra.HandleCommon(selection.Find("div").Eq(2).Find("p").Text())
Stars, _ := infra.HandleCommon(selection.Find("div").Eq(3).Find("a").Eq(0).Text())
Fork, _ := infra.HandleCommon(selection.Find("div").Eq(3).Find("a").Eq(1).Text())
TodayStars, _ := infra.HandleCommon(selection.Find("div").Eq(3).Find("span").Eq(1).Text())
fmt.Println(RespName, URL, Description, Stars, Fork, TodayStars)
})
}
func ParseForDevelopers(document *goquery.Document) {
document.Find("div.explore-content ol li").Each(func(i int, selection *goquery.Selection) {
DevName, _ := infra.HandleCommon(selection.Find("li div div").Eq(1).Find("h2 a").Text())
Description, _ := infra.HandleCommon(selection.Find("li div div").Eq(1).Find("a span").Text())
URL, _ := infra.HandleCommon(selection.Find("li div div").Eq(1).Find("h2 a").AttrOr("href", "None"))
fmt.Println(DevName, Description, URL)
})
}
- 一个解析函数解析:
https://github.com/trending/developers
- 一解析函数解析:
https://github.com/trending
调度器
// enigin.go
package engine
import (
"errors"
"fmt"
"go-example-for-live/seven_learning/download"
"github.com/PuerkitoBio/goquery"
)
var (
ErrorDocWrong = errors.New("document wrong")
)
type Trending struct {
}
func (t Trending) Run(request RequestForGithub) {
var doc *goquery.Document
doc, err := download.Download(request.URL)
if err != nil {
fmt.Println(ErrorDocWrong)
return
}
if doc != nil {
fmt.Println("Game start!")
request.ParseFunc(doc)
} else {
fmt.Println("Game over!")
}
}
- 负责串接:下载器和解析器,获取到抓取的字段
package engine
import "github.com/PuerkitoBio/goquery"
type RequestForGithub struct {
URL string
ParseFunc func(doc *goquery.Document)
}
type Repositories struct {
RespName string
URL string
Stars int
Fork int
TodayStars string
Description string
}
type Developers struct {
DevName string
Description string
URL string
}
- 定义三个结构体:
1、称之为种子:包括URL 和 解析函数
2、Developers 定义为https://github.com/trending/developers
网页的抓取字段
3、Repositories 定义为https://github.com/trending
网页的抓取字段
基础设施
// util.go
package infra
import (
"errors"
"strings"
)
var (
ErrorStringSpace = errors.New("string trim error")
)
func HandleCommon(oldString string) (string, error) {
newReplacer := strings.NewReplacer("\n", "", "\t", "")
return strings.TrimSpace(newReplacer.Replace(oldString)), nil
}
func HandlerURL(oldString string) (string, error) {
return "https://github.com" + strings.TrimSpace(oldString), nil
}
即:一些字符串的处理函数,比如替换函数、拼接函数
主函数入口
// main.go
package main
import (
"go-example-for-live/seven_learning/engine"
"go-example-for-live/seven_learning/parse/github"
)
func main() {
var simplerTest engine.Trending
simplerTest.Run(
engine.RequestForGithub{
URL: "https://github.com/trending",
ParseFunc: github.ParseForGithub,
},
)
simplerTest.Run(
engine.RequestForGithub{
URL: "https://github.com/trending/developers",
ParseFunc: github.ParseForDevelopers,
},
)
}
结果:
Game start!
xingshaocheng / architect-awesome https://github.com/xingshaocheng/architect-awesome 后端架构师技术图谱 13,220 3,150 1,528 stars today
google / gvisor https://github.com/google/gvisor Container Runtime Sandbox 5,080 190
davideuler / architecture.of.internet-product https://github.com/davideuler/architecture.of.internet-product 互联网公司技术架构,微信/淘宝/微博/腾讯/阿里/美团点评/百度/Google/Facebook/Amazon/eBay的架构,欢迎PR补充 7,058 1,123 1,427 stars today
kusti8 / proton-native https://github.com/kusti8/proton-native A React environment for cross platform native desktop apps 6,216 153
github / gh-ost https://github.com/github/gh-ost GitHub's Online Schema Migrations for MySQL 5,136 335
pytorch / ELF https://github.com/pytorch/ELF ELF: a platform for game research 1,525 218
cyanharlow / purecss-francine https://github.com/cyanharlow/purecss-francine HTML/CSS drawing in the style of an 18th-century oil painting. Hand-coded entirely in HTML & CSS. 4,035 169
sallar / github-contributions-chart https://github.com/sallar/github-contributions-chart Generate an image of all your Github contributions 2,228 60
RelaxedJS / ReLaXed https://github.com/RelaxedJS/ReLaXed Create PDF documents using web technologies 7,116 181
sindresorhus / ow https://github.com/sindresorhus/ow Function argument validation for humans 1,790 18
xx45 / dayjs https://github.com/xx45/dayjs Fast 2KB immutable date library alternative to Moment.js with the same modern API 9,310 269
sharkdp / bat https://github.com/sharkdp/bat A cat(1) clone with wings. 2,102 26
CyC2018 / Interview-Notebook https://github.com/CyC2018/Interview-Notebook 技术面试需要掌握的基础知识整理,欢迎编辑~ 21,845 5,763 256 stars today
shimohq / chinese-programmer-wrong-pronunciation https://github.com/shimohq/chinese-programmer-wrong-pronunciation 中国程序员容易发音错误的单词 5,963 530 257 stars today
binhnguyennus / awesome-scalability https://github.com/binhnguyennus/awesome-scalability High Scalability, High Availability, High Stability, High Performance, and High Intelligence Back-End Design Patterns 10,778 810 246 stars today
nhnent / tui.calendar https://github.com/nhnent/tui.calendar A JavaScript calendar that everything you need. 4,759 192
YadiraF / PRNet https://github.com/YadiraF/PRNet The source code of 'Joint 3D Face Reconstruction and Dense Alignment with Position Map Regression Network'. 1,296 108
hasura / skor https://github.com/hasura/skor Listen to postgres events and forward them as JSON payloads to a webhook 946 19
roytseng-tw / Detectron.pytorch https://github.com/roytseng-tw/Detectron.pytorch A pytorch implementation of Detectron. Both training from scratch and inferring directly from pretrained Detectron weights are available. 926 114
iotexproject / iotex-core https://github.com/iotexproject/iotex-core Connecting the physical world, block by block. 654 58
layerJS / layerJS https://github.com/layerJS/layerJS layerJS: Javascript UI composition framework 1,039 26
AllThingsSmitty / css-protips https://github.com/AllThingsSmitty/css-protips A collection of tips to help take your CSS skills pro 11,921 785 188 stars today
tabler / tabler https://github.com/tabler/tabler Tabler is free and open-source HTML Dashboard UI Kit built on Bootstrap 4 13,629 983
tmcw / big https://github.com/tmcw/big presentations for busy messy hackers 2,436 137
cgoldsby / LoginCritter https://github.com/cgoldsby/LoginCritter An animated avatar that responds to text field interactions 3,351 141
Game start!
google (Google) (Google) material-design-icons material-design-icons Material Design icons by Google /google
davideuler (david l euler) (david l euler) architecture.of.internet-product architecture.of.internet-product 互联网公司技术架构,微信/淘宝/微博/腾讯/阿里/美团点评/百度/Google/Facebook/Amazon/eBay的架构,欢迎PR补充 /davideuler
xingshaocheng architect-awesome architect-awesome 后端架构师技术图谱 /xingshaocheng
cyanharlow (Diana Smith) (Diana Smith) purecss-francine purecss-francine HTML/CSS drawing in the style of an 18th-century oil painting. Hand-coded entirely in HTML & CSS. /cyanharlow
kusti8 (Gustav Hansen) (Gustav Hansen) proton-native proton-native A React environment for cross platform native desktop apps /kusti8
pytorch pytorch pytorch Tensors and Dynamic neural networks in Python with strong GPU acceleration /pytorch
github (GitHub) (GitHub) gitignore gitignore A collection of useful .gitignore templates /github
sindresorhus (Sindre Sorhus) (Sindre Sorhus) awesome awesome Curated list of awesome lists /sindresorhus
sallar (Sallar Kaboli) (Sallar Kaboli) github-contributions-chart github-contributions-chart Generate an image of all your Github contributions /sallar
RelaxedJS (ReLaXed) (ReLaXed) ReLaXed ReLaXed Create PDF documents using web technologies /RelaxedJS
symfony (Symfony) (Symfony) symfony symfony The Symfony PHP framework /symfony
facebook (Facebook) (Facebook) react react A declarative, efficient, and flexible JavaScript library for building user interfaces. /facebook
Microsoft (Microsoft) (Microsoft) vscode vscode Visual Studio Code /Microsoft
xx45 dayjs dayjs Fast 2KB immutable date library alternative to Moment.js with the same modern API /xx45
apache (The Apache Software Foundation) (The Apache Software Foundation) incubator-echarts incubator-echarts A powerful, interactive charting and visualization library for browser /apache
tensorflow tensorflow tensorflow Computation using data flow graphs for scalable machine learning /tensorflow
sharkdp (David Peter) (David Peter) fd fd A simple, fast and user-friendly alternative to 'find' /sharkdp
vuejs (vuejs) (vuejs) vue vue A progressive, incrementally-adoptable JavaScript framework for building UI on the web. /vuejs
CyC2018 Interview-Notebook Interview-Notebook 技术面试需要掌握的基础知识整理,欢迎编辑~ /CyC2018
nhnent (NHN Entertainment) (NHN Entertainment) tui.editor tui.editor Markdown WYSIWYG Editor. GFM Standard + Chart & UML Extensible. /nhnent
binhnguyennus (Binh Nguyen) (Binh Nguyen) awesome-scalability awesome-scalability High Scalability, High Availability, High Stability, High Performance, and High Intelligence Back-End Design Patterns /binhnguyennus
shimohq (Shimo Docs) (Shimo Docs) chinese-programmer-wrong-pronunciation chinese-programmer-wrong-pronunciation 中国程序员容易发音错误的单词 /shimohq
YadiraF PRNet PRNet The source code of 'Joint 3D Face Reconstruction and Dense Alignment with Position Map Regression Network'. /YadiraF
yyx990803 (Evan You) (Evan You) pod pod Git push deploy for Node.js /yyx990803
roytseng-tw (Roy) (Roy) Detectron.pytorch Detectron.pytorch A pytorch implementation of Detectron. Both training from scratch and inferring directly from pretrained Detectron weights are available. /roytseng-tw
需要强调的是这个项目的组织结构能够很好的进行扩展:比如说,我又想抓取其他网页。即重新再 parse 定义个新的解析器即可。其他可以复用。
另外,最后抓取的字段并没有填充进定义的结构体内。
再有,看上去这项目没什么值得提的,事实上,已经有人做了这个项目。每天抓取github trending 写入文件并托管在 github 上。有兴趣的可以看看别人的实现方式。
如果你自学者,接触不到企业级的项目,我建议你从 github 上寻找自己感兴趣的编程语言的项目重新写一遍。这样相当于,给自己出了一个题,而又有一份参考答案,能给自己一些反馈,同时不断的精进自己的技术。
全文完。希望大家学的开心。