Sunday, January 7, 2018

Timber Spider

Getting there first

My parents are part of a timeshare pool in which they can trade weeks at their location for weeks at other places in the pool. Weeks are submitted to the pool by other owners, semi-randomly, and are reserved on a first-come-first-served basis. Some places get reserved more quickly than others. My dad checks the website that lists the locations almost every day. I thought... why not every hour? I decided to send an MMS any time a really good week was added, and an email any time a week of any quality was added or removed. Hopefully, by getting alerts sent to our phones, we could secure our favorite weeks before anyone else.

Making a robospider

After a few iterations I decided that scrapy was the right tool. It has good Stack Overflow coverage, handles a lot of the crawling automation for you, has built-in item processing pipelines, etc.
The availability website is behind a password-protected login, so I needed a spider that could handle the authentication step. I decided to use the InitSpider, as it handles authentication before crawling begins.
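Roughly what that looks like (the URLs, form fields, and logged-in check below are placeholders, not the real site's):

from scrapy.http import Request, FormRequest
from scrapy.spiders.init import InitSpider

class TimeshareSpider(InitSpider):
    name = 'timeshare'
    login_page = 'https://example.com/login'          # placeholder URL
    start_urls = ['https://example.com/availability']  # placeholder URL

    def init_request(self):
        # runs before the normal crawl: fetch the login page first
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        # fill in and submit the login form
        return FormRequest.from_response(
            response,
            formdata={'username': 'me', 'password': 'secret'},
            callback=self.check_login)

    def check_login(self, response):
        # only kick off the real crawl if the login actually worked
        if 'Logout' in response.body:
            return self.initialized()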
Next I told the spider to seek out the "next >" links on the page to create follow-up requests. The availabilities are listed in a standard table, so I parse each row into a scrapy Item. I then set up an item processing pipeline, activated via ITEM_PIPELINES in the project settings, to do the alerting.
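A sketch of the parsing end, with made-up selectors standing in for the real table's markup:

import scrapy

class WeekItem(scrapy.Item):
    location = scrapy.Field()
    week = scrapy.Field()
    quality = scrapy.Field()

# the parse callback (a method on the spider class above)
def parse(self, response):
    # one Item per row of the availability table
    for row in response.css('table.availability tr'):
        item = WeekItem()
        item['location'] = row.css('td:nth-child(1)::text').extract_first()
        item['week'] = row.css('td:nth-child(2)::text').extract_first()
        yield item
    # queue a follow-up request for the "next >" page, if there is one
    next_href = response.xpath('//a[contains(., "next >")]/@href').extract_first()
    if next_href:
        yield scrapy.Request(response.urljoin(next_href), callback=self.parse)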

Alert system

Items gleaned by the spider are passed through a pipeline, which assembles them into a dictionary and compares it against the previous run's dictionary to find the differences. The current dictionary is shelved when the spider closes, becoming the "previous" dictionary for the next run. I decided to shelve the dictionaries because it seemed really easy to do, and I thought something like a JSON export was overkill. Turns out shelving is really easy (except when it's not).
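A sketch of that pipeline (the class name, shelf filename, and item fields are mine, not necessarily what the real project uses):

import shelve

class AlertPipeline(object):
    def open_spider(self, spider):
        self.current = {}

    def process_item(self, item, spider):
        # key each row by something unique enough to diff on
        key = '%s|%s' % (item['location'], item['week'])
        self.current[key] = dict(item)
        return item

    def close_spider(self, spider):
        db = shelve.open('weeks.shelf')
        previous = db.get('previous', {})
        added = [k for k in self.current if k not in previous]
        removed = [k for k in previous if k not in self.current]
        # the current dict becomes "previous" for the next run
        db['previous'] = self.current
        db.close()
        # ...hand added/removed off to the alerting code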
The comparison creates three lists: newly added really good weeks, newly added weeks of any quality, and removed weeks. Later in the pipeline, the text message and email are formatted. Both required setting up a dummy Gmail account to send alerts to my family.
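Roughly how the sending works (the address, app password, and recipients below are placeholders, and I'm assuming the MMS goes out through a carrier's email-to-MMS gateway, which is the usual trick):

import smtplib
from email.mime.text import MIMEText

def send_alert(subject, body, recipients):
    msg = MIMEText(body)
    msg['Subject'] = subject
    msg['From'] = 'dummy.alerts@gmail.com'
    msg['To'] = ', '.join(recipients)
    server = smtplib.SMTP_SSL('smtp.gmail.com', 465)
    server.login('dummy.alerts@gmail.com', 'app-password')
    server.sendmail(msg['From'], recipients, msg.as_string())
    server.quit()

# email alert for any change
send_alert('Weeks changed', 'Added: ...\nRemoved: ...', ['family@example.com'])
# MMS alert for really good weeks, via a carrier gateway address
send_alert('Really good week!', 'Grab it now', ['5555551234@vzwpix.com'])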

Installing and Automating

I had to get scrapy installed in a conda environment on my pi (this turned out to be a problem, as you'll see). I made sure to install the following before attempting to install scrapy:
sudo apt-get install libffi-dev
sudo apt-get install libxml2-dev
sudo apt-get install libxslt1-dev
sudo apt-get install python-dev

It took a while for lxml to build on my little pi (~20 minutes), but after that step scrapy seemed to be available. I loaded my spider, hit go, and ran smack into my first issue:

Traceback (most recent call last):
  File "/home/pi/berryconda2/lib/python2.7/site-packages/twisted/internet/defer.py", line 1386, in _inlineCallbacks
    result = g.send(result)
  File "/home/pi/berryconda2/lib/python2.7/site-packages/scrapy/crawler.py", line 98, in crawl
    six.reraise(*exc_info)
  File "/home/pi/berryconda2/lib/python2.7/site-packages/scrapy/crawler.py", line 82, in crawl
    yield self.engine.open_spider(self.spider, start_requests)
ImportError: No module named _bsddb

OK, so I gotta figure out what this bsddb thing is. Apparently it's some Berkeley DB interface that Twisted appeared to be using.

Getting bsddb3

First thing was to get Berkeley DB built for my pi. I followed the install instructions here:

Make sure you install the older version of BDB (5.3.28)

(Actually, I first tried to just pip install bsddb3, but got smacked down; hence building Berkeley DB myself.)
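For reference, the build boils down to something like this (the download URL may have since moved behind Oracle's login wall, and the prefix shown is Berkeley DB's default):

wget http://download.oracle.com/berkeley-db/db-5.3.28.tar.gz
tar xzf db-5.3.28.tar.gz
cd db-5.3.28/build_unix
../dist/configure
make
sudo make install
# point pip at the fresh build
BERKELEYDB_DIR=/usr/local/BerkeleyDB.5.3 pip install bsddb3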

After a while I decided to just drop shelve, since it seemed to be the thing pulling in the bsddb dependency. Once I dropped it I encountered a new error, this time about an ECC curve.
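I won't claim this is exactly what I swapped in, but the obvious replacement is the kind of JSON export I'd originally called overkill, something like:

import json

def load_previous(path='weeks.json'):
    # first run: no file yet, so start with an empty dict
    try:
        with open(path) as f:
            return json.load(f)
    except IOError:
        return {}

def save_current(weeks, path='weeks.json'):
    with open(path, 'w') as f:
        json.dump(weeks, f)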

After bashing my skull against the problem for a while, I went to the Twisted IRC channel, where someone confirmed that my pyopenssl build was missing something. They said they'd try to reproduce the issue, so I waited for a bit.

Forgetting bsddb3, something else is wrong

I tried updating cryptography, twisted, and pyopenssl to the most recent versions I could. I did everything short of updating openssl on the pi to 1.1 (I was running Jessie at the time). Nothing seemed to work, so I hopped back into the Twisted IRC and got some help from runciter. After running some tests, they concluded that conda was my problem: specifically berryconda and its old openssl libs. I needed to bump to openssl 1.1, but I couldn't easily do that on Jessie, so I decided to upgrade to Stretch. After upgrading to Stretch and uninstalling scrapy, twisted, pyopenssl, and cryptography, I installed scrapy again and, voilà, it worked.

Wrapping it up

The last step was to add a simple cron job to execute my script on the hour, every hour from 6:00 to 23:00.
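The crontab entry looks something like this (the project directory and spider name are placeholders; the scrapy path matches my berryconda install):

0 6-23 * * * cd /home/pi/timeshare && /home/pi/berryconda2/bin/scrapy crawl timeshare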
