zhihu-scrapy

A Scrapy Zhihu Crawler

What is zhihu-scrapy?

zhihu-scrapy is a distributed crawler system for crawling the Zhihu website. The data we gather include user profiles, followees, and followers. The collected data can be used for various purposes (e.g. finding communities or identifying popular answer posters).

How does it work?

It combines the following systems:

  1. scrapy (parsing and logging)
  2. selenium (downloading and executing JavaScript; see the sketch below)
  3. redis (queueing and storing results)
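
The repository does not spell out here how these pieces are wired together, but a common way to combine Scrapy with Selenium is a downloader middleware that lets a real browser render the page and then hands the HTML back to Scrapy. The sketch below is illustrative only; the class name and browser choice are assumptions, not code from this project.

# Illustrative Selenium downloader middleware for Scrapy (not taken from this repository).
from scrapy.http import HtmlResponse
from selenium import webdriver

class SeleniumMiddleware(object):
    def __init__(self):
        # Any Selenium-supported browser works; Firefox is used here as an example.
        self.driver = webdriver.Firefox()

    def process_request(self, request, spider):
        # Let the browser fetch the page and execute its JavaScript,
        # then return the rendered HTML to Scrapy for parsing.
        self.driver.get(request.url)
        return HtmlResponse(url=request.url, body=self.driver.page_source,
                            encoding='utf-8', request=request)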

The crawler system consists of one main redis server that manages crawling records; each crawling machine also runs a local redis server for storing user data.
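
As a rough sketch of that split, a crawling machine might pop the next user id from a queue on the main redis server and write the profile it scrapes into its local redis instance. The key names below ('user_ids', 'user:<id>') are assumptions chosen for illustration, not the keys the project actually uses.

# Sketch of the two-tier redis layout, with assumed key names.
import json
import redis

main_redis = redis.Redis(host='10.0.0.5', port=6379)    # replace with the main server's IP (REDIS_HOST)
local_redis = redis.Redis(host='localhost', port=6379)  # per-machine store for scraped data

user_id = main_redis.spop('user_ids')                   # next user to crawl (assumed key name)
if user_id is not None:
    profile = {'id': user_id.decode('utf-8')}           # placeholder for the scraped profile
    local_redis.set('user:' + profile['id'], json.dumps(profile))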

How to get started?

Start a redis server on the main server and on each crawling machine.

Add initial user IDs to the main redis server with Monitor, for example:

>> from zhihu.utils import Monitor
>> init_list = ['first-id',]
>> Monitor.add_user_ids(init_list)

In zhihu/settings.py, set REDIS_HOST to the IP address of the main redis server.
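
For example (the address below is a placeholder; use your own main server's IP):

# zhihu/settings.py
REDIS_HOST = '10.0.0.5'   # IP address of the main redis server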

Use scrapy crawl zhihu_people to start a crawler.

How to solve captchas?

We provide the Monitor class to monitor crawlers, including solving captchas for them. To solve captchas for all crawlers that need one, use:

>> from zhihu.utils import Monitor
>> m = Monitor()
>> m.solve_captchas()

How to add accounts?

Each crawler needs to fetch an account from the account pool to start. To add accounts to the account pool, use:

>> from zhihu.utils import Monitor
>> m = Monitor()
>> m.add_account('username', 'password')

How to check stats?

>> from zhihu.utils import Monitor
>> m = Monitor()
>> m.stats()

License: GPL v3

