Create a Web Crawler using the Scrapy module in Python and store the data in JSON format (Dec. 11, 2020)

What is a Web Crawler?

Web crawlers are programs written to crawl, or retrieve, data from websites.

Use Cases-

  • Search engines such as Google and Yahoo use crawlers to index data for their search results.
  • Price-comparison sites use crawlers to collect prices and other product specifications so they can compare prices across various platforms.
  • Some news websites use crawlers to gather news from many sources automatically.
  • Many websites crawl data from other sites to track market rates and apply relevant offers on their own site.

Scrapy-

Scrapy is a free and open-source web-crawling framework written in Python. Originally designed for web scraping, it can also be used to extract data using APIs or as a general-purpose web crawler. It is currently maintained by Scrapinghub Ltd., a web-scraping development and services company. 

Let's start the tutorial-

We assume that you have Python installed on your system. If you have not installed it yet, see this article.

Install Scrapy

Run pip install scrapy in your console to install Scrapy. The Scrapy shell lets you try Scrapy interactively from the command prompt; run scrapy shell <url> to open it.
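For example, a quick session might look like this (the URL is simply the one used later in this tutorial, and the title selector is only illustrative):

pip install scrapy
scrapy shell https://www.codever.co.in

# inside the shell, the downloaded page is available as `response`
>>> response.status
>>> response.css('title::text').get()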

Create a scrapy project

Run scrapy startproject <project name> in your console to start a project. For example, I am creating a project named codever by running scrapy startproject codever. It will create the following files and directories.

codever/
    scrapy.cfg            # deploy configuration file

    codever/              # project's Python module, you'll import your code from here
        __init__.py

        items.py          # project items definition file

        middlewares.py    # project middlewares file

        pipelines.py      # project pipelines file

        settings.py       # project settings file

        spiders/          # a directory where you'll later put your spiders
            __init__.py

Create Spider

A spider is a class that inherits from scrapy.Spider. It contains the name of the spider and a start_requests() function. This function yields requests for the URLs of the websites you want to scrape. The scrapy.Request class is used to fetch the data; it takes two parameters, the URL and a callback function. The callback function performs the various operations on the fetched pages.

# Import the required module. For this spider we only need scrapy.
import scrapy

# Create a class which inherits from scrapy.Spider.
class CodeverSpider(scrapy.Spider):
    name = "codever"                              # name used to run the spider

    def start_requests(self):                     # initial function
        url = 'https://www.codever.co.in'         # URL to fetch

        # yield lets the function return multiple requests; with a loop, all the
        # URLs can be yielded, which is not possible with a single return.
        # scrapy.Request takes the URL and a callback, and passes the response
        # to that callback.
        yield scrapy.Request(url, self.parse)

    # This is the callback function that receives each response.
    def parse(self, response):
        categories = response.css('.card-title:nth-child(1)')   # category elements via CSS selector
        for category in categories:
            # yield a dict with the element's text; Scrapy expects items (dicts),
            # requests, or None from a callback, not plain strings
            yield {'category': category.css('::text').get()}

This is a basic Scrapy program. Here I have used a CSS selector; Scrapy also supports XPath selectors. The following two selectors work the same way.

response.css('.card-title:nth-child(1)')
response.xpath("//*[@class='card-title'][1]")
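As a quick sketch of how either selector would be used to pull out the element's text (the ::text pseudo-element and the /text() step are standard Scrapy selector syntax; the class name is just the one used above):

response.css('.card-title:nth-child(1)::text').get()
response.xpath("//*[@class='card-title'][1]/text()").get()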

The following two snippets also behave the same way. In the second one, the Request objects are created automatically from start_urls and parse() is used as the default callback.

urls = [ 'https://www.codever.co.in',
         ......................................................,
         ......................................................,
         ......................................................,
       ]
for url in urls:
    yield scrapy.Request(url, self.parse)

start_urls = [ 'https://www.codever.co.in',
               ......................................................,
               ......................................................,
               ......................................................,
             ]
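Put together, a minimal spider using the start_urls shortcut could look like the sketch below (the class name and selector simply reuse the ones from the earlier example):

import scrapy

class CodeverSpider(scrapy.Spider):
    name = "codever"
    # Scrapy builds the initial requests from this list and sends
    # every response to parse() by default.
    start_urls = ['https://www.codever.co.in']

    def parse(self, response):
        for category in response.css('.card-title:nth-child(1)'):
            yield {'category': category.css('::text').get()}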

Using command line tags

def start_requests(self):
    urls = [....]

    # getattr() reads the argument passed on the command line with -a; any
    # variable name can be used instead of tag. The third parameter is the
    # default value used when no argument is given.
    # scrapy crawl <spider-name> -a <variable_name>=<value>
    # scrapy crawl codever -a tag=python
    tag = getattr(self, 'tag', None)

    for url in urls:
        if tag is not None:
            # assumes the base URL ends with a trailing slash
            url = url + 'tag/' + tag

        yield scrapy.Request(url, self.parse)
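For reference, arguments passed with -a become attributes on the spider instance, which is why getattr(self, 'tag', None) picks them up. For example (the value python is just an illustration):

scrapy crawl codever -a tag=python
# inside the spider, self.tag == 'python', so each URL becomes <base-url>/tag/python
# (assuming the base URL ends with a trailing slash)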

Handling Multiple Pages

def parse(self, response):
    .....
    .....
    .....
    # extract the href of the next-page link
    next_page = response.css('<selector for next-page>::attr(href)').get()
    if next_page is not None:
        yield scrapy.Request(response.urljoin(next_page), self.parse)
        # yield response.follow(next_page, self.parse) can also be used;
        # it resolves relative URLs for you, so urljoin() is not needed
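Tying the pieces together, a parse() callback that both yields the category items and follows pagination could look like this sketch (the 'a.next-page' selector is a hypothetical placeholder; use whatever selector matches the next-page link on your target site):

def parse(self, response):
    # yield one item per category element on the current page
    for category in response.css('.card-title:nth-child(1)'):
        yield {'category': category.css('::text').get()}

    # follow the next-page link, if any ('a.next-page' is a placeholder selector)
    next_page = response.css('a.next-page::attr(href)').get()
    if next_page is not None:
        yield response.follow(next_page, self.parse)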

 

Run Spider

scrapy crawl codever -o categories.json

Run the above command at the command prompt. Before running, make sure your working directory is the project directory, <path-to-project>/<project-name>. For example, my location is E:/scrapy/codever, where codever is my project name. Your output will be saved in the categories.json file.
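As a quick sanity check, the exported file can be loaded back with the standard json module (a minimal sketch, assuming the spider yielded {'category': ...} items as shown above):

import json

# load the list of items exported by the -o categories.json option
with open('categories.json', encoding='utf-8') as f:
    categories = json.load(f)

for item in categories:
    print(item['category'])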

 

Don't forget to rate and comment.

Comment if you want more on web scraping, or suggest a topic you would like to learn about.

