Running Scrapy on Amazon EC2

Sometimes it can be useful to crawl sites with Scrapy using temporary resources in the cloud, and Amazon EC2 is well suited to this task. You can launch an Ubuntu instance and schedule your spiders through the Scrapyd API. With boto, a Python interface to Amazon Web Services, you can launch instances and install the Scrapy daemon (Scrapyd) by means of the user data feature, which runs a script on first boot.

First, you need an AWS account with your access keys, an EC2 security group accepting TCP connections on port 6800, and a key pair for the selected region. After that you must choose an Ubuntu EC2 image; Canonical publishes a list of official Ubuntu AMIs for each region.
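If you prefer to do this setup programmatically, boto can create the security group and key pair as well. A minimal sketch, using the same credentials as in the console session below (the group name scrapyd and the key name main are just examples):

import boto.ec2

conn = boto.ec2.connect_to_region("us-east-1",
    aws_access_key_id='my_aws_access_key',
    aws_secret_access_key='my_aws_secret_key')

# Security group exposing Scrapyd's HTTP interface on port 6800
sg = conn.create_security_group('scrapyd', 'Scrapyd HTTP interface')
sg.authorize(ip_protocol='tcp', from_port=6800, to_port=6800, cidr_ip='0.0.0.0/0')

# Key pair for SSH access; save() writes main.pem into the given directory
kp = conn.create_key_pair('main')
kp.save('~/.ssh')

If you go this route, pass security_groups=['scrapyd'] instead of ['default'] when launching the instance below.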

We need to create a user data script with the following content:

#!/bin/bash
# filename: user_data-scrapyd.sh

# Add Scrapy repositories and key
echo "deb http://archive.scrapy.org/ubuntu precise main" > /etc/apt/sources.list.d/scrapy.list
curl -s http://archive.scrapy.org/ubuntu/archive.key | apt-key add -

# Install the Scrapyd Debian package
apt-get update && apt-get -y install scrapyd-0.16

# Restart scrapyd
service scrapyd restart

This script is executed as root by cloud-init on the instance's first boot, so no sudo is needed. Finally, we can launch an instance using boto from the Python console:

>>> with open('user_data-scrapyd.sh') as f:
...  user_data = f.read()
...
>>>
>>> import boto.ec2
>>> conn = boto.ec2.connect_to_region("us-east-1",
...    aws_access_key_id='my_aws_access_key',
...    aws_secret_access_key='my_aws_secret_key')
>>> conn.run_instances('ami-da0d9eb3', user_data=user_data, key_name='main', instance_type='t1.micro', security_groups=['default'])
Reservation:r-89a2aef3
>>> reservations = conn.get_all_instances()
>>> reservations
[Reservation:r-89a2aef3]
>>> instances = reservations[0].instances
>>> instances
[Instance:i-d4a2c1b8]
>>> i = instances[0]
>>> i.state
u'pending'
>>> i.update()
u'running'
>>> u'http://%s:6800/' % i.public_dns_name
u'http://ec2-54-224-86-173.compute-1.amazonaws.com:6800/'
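The instance takes a minute or two to reach the running state, so instead of calling update() by hand you can poll for it. A minimal sketch (wait_until_running is a hypothetical helper, not part of boto):

import time

def wait_until_running(instance, interval=5):
    """Poll the EC2 API until the instance leaves the 'pending' state."""
    while instance.update() == 'pending':
        time.sleep(interval)
    return instance.state

wait_until_running(i)  # u'running'

Note that even once EC2 reports the instance as running, Scrapyd only starts answering after the user data script has finished installing the package.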

Point your browser to this URL and you should see the Scrapyd monitoring page.
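Before you can schedule anything, your project must be deployed to the new instance. With the scrapy deploy command from this Scrapy release, that takes a deploy target in your project's scrapy.cfg (the target name ec2 below is just an example):

# scrapy.cfg
[deploy:ec2]
url = http://ec2-54-224-86-173.compute-1.amazonaws.com:6800/
project = myproject

$ scrapy deploy ec2

With the project deployed, you can schedule a spider run: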

$ curl http://ec2-54-224-86-173.compute-1.amazonaws.com:6800/schedule.json -d project=myproject -d spider=spider1
{"status": "ok", "jobid": "70d1b1a6d6f111e0be5c001e648c5a52"}

You can find a copy of these scripts on GitHub: ec2-scrapyd.