Scrapy and Elasticsearch

On 29.07.2014 I gave a talk at Search Meetup Karlsruhe on using Scrapy with Elasticsearch; the slides are available here. This post evolved from the talk and introduces you to web scraping and search with Scrapy and Elasticsearch.

Web Crawling

You might think that web crawling and scraping is only for search engines like Google and Bing. But a lot of companies use it for other purposes: price comparison, financial risk information and portals all need a way to get at the data, and at least sometimes that way is to retrieve it from some public website. Besides these cases where the data is not in your hands, scraping can also make sense when the data is already aggregated somewhere: for intranet and portal search engines it can be easier to just scrape the frontend instead of building data import facilities for different, sometimes even old, systems.

The Example

In this post we are looking at a rather artificial example: Crawling the meetup.com page for recent meetups to make them available for search. Why artificial? Because meetup.com has an API that provides all the data in a more convenient way. But imagine there is no other way and we would like to build a custom search on this information, probably by adding other event sites as well.

This is a part of the Search Meetup Karlsruhe page that displays the recent meetups.

We can see that there is already some information we are interested in like the title and the link to the meetup page.

Roll Your Own?

When deciding to do web scraping you might be tempted to build it yourself with a script or some code. How hard can it be to fetch a website, parse its source and extract all links to follow?

For demoing some of the features of Akka I once built a simple web crawler that visits a website, follows all links and indexes the content in Lucene. While this is not a lot of code, you will quickly notice that it is not suited for real-world use: it hammers the crawled site with as many requests as possible, there is no way to make it behave nicely by respecting the robots.txt, and additional processing of the content is hard to add afterwards. All of this is reason enough to lean towards a ready-made solution.

Scrapy

Scrapy is a framework for building crawlers and processing the extracted data. It is implemented in Python and does asynchronous, non-blocking networking. It is easily extensible, not only via the item pipeline the content flows through. Finally, it already comes with lots of features that you would otherwise have to build yourself.

In Scrapy you implement a spider that visits a certain page and extracts items from the content. The items then flow through the item pipeline and end up in a feed exporter that writes the data to a file. At every stage of the process you can add custom logic.

This is a very simplified diagram that doesn't take the asynchronous nature of Scrapy into account. See the Scrapy documentation for a more detailed view.

For installing Scrapy I am using pip, which should be available on all systems. You can then run pip install scrapy to get it.

To get started with Scrapy you can use the scaffolding feature to create a new project. Just issue something like scrapy startproject meetup and Scrapy will generate quite a few files for you.

meetup/
meetup/scrapy.cfg
meetup/meetup
meetup/meetup/settings.py
meetup/meetup/__init__.py
meetup/meetup/items.py
meetup/meetup/pipelines.py
meetup/meetup/spiders
meetup/meetup/spiders/__init__.py

For now we can concentrate on items.py, which describes the structure of the data to crawl, and the spiders directory where we can put our spiders.

Our First Spider

First we need to define what data structure we would like to retrieve. This is described as an Item that is created by a Spider and then flows through the item pipeline. For our case we can put this into items.py:

from scrapy.item import Item, Field

class MeetupItem(Item):
    title = Field()
    link = Field()
    description = Field()

Our MeetupItem defines three fields for the title, the link and a description we can search on. For real-world use cases it would contain more information like the date and time or details on the participants.

To fetch data and create items we need to implement a Spider. We create a file meetup_spider.py in the spiders directory:

from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from meetup.items import MeetupItem

class MeetupSpider(BaseSpider):
    name = "meetup"
    allowed_domains = ["meetup.com"]
    start_urls = [
        "http://www.meetup.com/Search-Meetup-Karlsruhe/"
    ]

    def parse(self, response):
        responseSelector = Selector(response)
        for sel in responseSelector.css('li.past.line.event-item'):
            item = MeetupItem()
            item['title'] = sel.css('a.event-title::text').extract()
            item['link'] = sel.xpath('a/@href').extract()
            yield item

Our spider extends BaseSpider and defines a name, the allowed domains and a start URL. Scrapy requests the start URL and passes the response to the parse method. We then use a Selector to extract the data using either CSS or XPath expressions; both are shown in the example above.

Every item we create is yielded from the method. If we had to visit another page we could also yield a Request object and Scrapy would then crawl that page as well.
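Just as a sketch of how that mechanism could look (the spider name and the pagination selector are made up for illustration, the meetup page may not even have such a link), a variant of our spider could follow a next-page link and parse it with the same callback:

from scrapy.http import Request
from scrapy.selector import Selector
from scrapy.spider import BaseSpider
from meetup.items import MeetupItem

class PaginatedMeetupSpider(BaseSpider):
    """Hypothetical sketch: same as MeetupSpider, but it also follows a
    next-page link by yielding a Request with itself as the callback."""
    name = "meetupPaginated"
    allowed_domains = ["meetup.com"]
    start_urls = ["http://www.meetup.com/Search-Meetup-Karlsruhe/"]

    def parse(self, response):
        responseSelector = Selector(response)
        for sel in responseSelector.css('li.past.line.event-item'):
            item = MeetupItem()
            item['title'] = sel.css('a.event-title::text').extract()
            item['link'] = sel.xpath('a/@href').extract()
            yield item
        # The CSS class of the pagination link is made up for illustration.
        next_page = responseSelector.css('a.next::attr(href)').extract()
        if next_page:
            yield Request(next_page[0], callback=self.parse)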

We can run our meetup spider from the project directory by issuing scrapy crawl meetup -o talks.json. This will use the spider and write the items as JSON to a file:

2014-07-24 18:27:59+0200 [scrapy] INFO: Scrapy 0.20.0 started (bot: meetup)
[...]
2014-07-24 18:28:00+0200 [meetup] DEBUG: Crawled (200) <GET http://www.meetup.com/Search-Meetup-Karlsruhe/> (referer: None)
2014-07-24 18:28:00+0200 [meetup] DEBUG: Scraped from <200 http://www.meetup.com/Search-Meetup-Karlsruhe/>
{'link': [u'http://www.meetup.com/Search-Meetup-Karlsruhe/events/178746832/'],
'title': [u'Neues in Elasticsearch 1.1 und Logstash in der Praxis']}
2014-07-24 18:28:00+0200 [meetup] DEBUG: Scraped from <200 http://www.meetup.com/Search-Meetup-Karlsruhe/>
{'link': [u'http://www.meetup.com/Search-Meetup-Karlsruhe/events/161417512/'],
'title': [u'Erstes Treffen mit Kurzvortr\xe4gen']}
2014-07-24 18:28:00+0200 [meetup] INFO: Closing spider (finished)
2014-07-24 18:28:00+0200 [meetup] INFO: Stored jsonlines feed (2 items) in: talks.json
2014-07-24 18:28:00+0200 [meetup] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 244,
'downloader/request_count': 1,
[...]
'start_time': datetime.datetime(2014, 7, 24, 16, 27, 59, 540300)}
2014-07-24 18:28:00+0200 [meetup] INFO: Spider closed (finished)

You can see that Scrapy visited the page and extracted two items. Finally it prints some stats on the crawl. The file talks.json contains our items as well:

{"link": ["http://www.meetup.com/Search-Meetup-Karlsruhe/events/178746832/"], "title": ["Neues in Elasticsearch 1.1 und Logstash in der Praxis"]}
{"link": ["http://www.meetup.com/Search-Meetup-Karlsruhe/events/161417512/"], "title": ["Erstes Treffen mit Kurzvortr\u00e4gen"]}

This is fine, but there is a problem: we don't have all the data we would like to have, we are missing the description. This information is not fully available on the overview page, so we need to crawl the detail pages of the meetups as well.

The Crawl Spider

We still need to start from our overview page because this is where all the recent meetups are listed. But for retrieving the item data we need to go to the detail pages.

As mentioned already, we could solve our new requirement with the spider above by returning Request objects and adding a new callback function. But we can also solve it another way, by using the CrawlSpider, which can be configured with a Rule that tells it where to extract the links to visit.

In case you are confused, welcome to the world of Scrapy! When working with Scrapy you will regularly find cases where there are several ways to do a thing.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from meetup.items import MeetupItem

class MeetupDetailSpider(CrawlSpider):
    name = "meetupDetail"
    allowed_domains = ["meetup.com"]
    start_urls = ["http://www.meetup.com/Search-Meetup-Karlsruhe/"]
    rules = [Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@id="recentMeetups"]//a[@class="event-title"]')), callback='parse_meetup')]

    def parse_meetup(self, response):
        sel = Selector(response)
        item = MeetupItem()
        item['title'] = sel.xpath('//h1[@itemprop="name"]/text()').extract()
        item['link'] = response.url
        item['description'] = sel.xpath('//div[@id="past-event-description-wrap"]//text()').extract()
        yield item

Besides the information we set for our other spider, we now also add a Rule object. It extracts the links from the recent meetup list and passes the responses of the linked pages to the supplied callback. You can also add rules that visit links by path, e.g. all URLs that contain the fragment /articles/.
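Such a path-based rule is just another link extractor configuration; a minimal sketch with a hypothetical URL pattern and callback name:

# Hypothetical sketch: follow every link whose URL matches the regular
# expression in allow and pass the response to a parse_article callback.
rules = [
    Rule(SgmlLinkExtractor(allow=(r'/articles/', )), callback='parse_article'),
]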

Our parse_meetup method now doesn't work on the overview page but on the detail pages whose links are extracted by the rule. A detail page has all the information we need, so we can now also fill the description of our item.

Now that we have all the information we can do something useful with it: Index it in Elasticsearch.

Elasticsearch

Elasticsearch support for Scrapy is available by installing a module: pip install "ScrapyElasticSearch". It takes the items created by your spiders and indexes them in Elasticsearch using the library pyes.

Looking at the Scrapy architecture above, you might expect that the module is implemented as a feed exporter that exports the items to Elasticsearch instead of the filesystem. For reasons unknown to me, exporting to a database or search engine is instead done using an ItemPipeline, a component in the item pipeline. Confused?
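If it helps to clear up the confusion: an item pipeline component is simply a class with a process_item method that sees every item the spiders return. A minimal sketch of a custom one (hypothetical, unrelated to the Elasticsearch module) that drops items without a description would go into pipelines.py:

from scrapy.exceptions import DropItem

class RequireDescriptionPipeline(object):
    """Hypothetical pipeline stage: drop items that have no description."""

    def process_item(self, item, spider):
        if not item.get('description'):
            raise DropItem("Missing description in %s" % item.get('link'))
        return item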

To configure Scrapy to push the items to Elasticsearch you of course need to have an instance running somewhere. The pipeline itself is configured in the file settings.py:

ITEM_PIPELINES = [
    'scrapyelasticsearch.ElasticSearchPipeline',
]

ELASTICSEARCH_SERVER = 'localhost'
ELASTICSEARCH_PORT = 9200
ELASTICSEARCH_INDEX = 'meetups'
ELASTICSEARCH_TYPE = 'meetup'
ELASTICSEARCH_UNIQ_KEY = 'link'

The configuration should be straightforward: we enable the module by adding it to ITEM_PIPELINES and configure additional information like the host, index and type name. The next time you crawl, Scrapy will automatically push your data to Elasticsearch.
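To check that the items really arrived you can query Elasticsearch directly; a quick sketch, assuming the requests library and the index and type configured above:

# Assumes the requests library and the meetups index configured above.
import requests

response = requests.get('http://localhost:9200/meetups/meetup/_search?q=elasticsearch')
print(response.json()['hits']['total'])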

I am not sure how much of an issue this is when crawling, but note that the module doesn't use bulk indexing; it indexes each item on its own. With very large amounts of data this could become a problem, but it should be totally fine for most uses. Also, you of course need to make sure that your mapping is in place before indexing data if you need some predefined handling.
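A sketch of how that could look, again assuming the requests library and an Elasticsearch 1.x instance on localhost: create the index with an explicit mapping before the first crawl.

# Assumption: the requests library and a local Elasticsearch 1.x instance.
# Creates the meetups index with an explicit mapping before the first crawl.
import json
import requests

mapping = {
    "mappings": {
        "meetup": {
            "properties": {
                "title":       {"type": "string"},
                "link":        {"type": "string", "index": "not_analyzed"},
                "description": {"type": "string"}
            }
        }
    }
}

response = requests.put('http://localhost:9200/meetups', data=json.dumps(mapping))
print(response.json())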

Conclusion

I hope you could see how useful Scrapy can be and how easy it is to push the data to stores like Elasticsearch. Some of Scrapy's approaches can be quite confusing at first, but nevertheless it is an extremely useful tool. Give it a try the next time you are thinking about web scraping.