In this post, our CTO Alex López explains why it's important to keep the creative juices flowing in a development team and how his team built a functional crawler in just one day during the very first Plytix hackathon.
Eat like a bird and poo like an elephant. That's one of our core principles at Plytix: always strive to make the biggest possible impact with the smallest possible means. To accomplish that, you need to be passionate about what you do. Lack of passion leads to impersonal or careless work, which in the end produces standard, or substandard, results. That's a serious issue that can keep companies from succeeding. In order to do something truly original, you need to nurture the professional passions of your team.
In the Plytix IT department, everybody is passionately engaged in what we build. The coding we do is very challenging, and therefore very rewarding when we get things done. But to maintain momentum and spark professional passion, it's crucial to change up the context once in a while, especially when working with software developers and system engineers.
Established guidelines can potentially dampen creativity, which would be a shame considering that everyone in our team has magnificent ideas and a remarkable bucket list of new technologies they’d like to try. I'm highly committed to helping my team members achieve their goals and support their personal development, which is why we plan to arrange an internal hackathon every few months or so.
In November 2015, we had the chance to run our very first hackathon. The whole Plytix team met up in Málaga for a few days of intense work, inspiring workshops and lots of socializing. One day was reserved for department activities, and we seized the opportunity for an express hackathon. It’s quite a big challenge to define, organize and deliver something decent in such a short time, so we had to prove yet again why we’re proud to be part of Plytix.
The hackathon challenge
As we were so limited in time, I suggested a project we had talked about a few times before. We all found the idea interesting and challenging (and, of course, potentially useful for Plytix): building a web crawler.
There are tons of crawlers out there, and since not all of them are built with the greater good in mind, the concept of a ‘crawler’ is negatively perceived by some people. However, web crawlers are not bad boys in themselves. They’re practical - and legal - tools that, written responsibly and respectfully, are extremely useful and absolutely harmless. Their sole purpose is - well, should be - to explore the web in search of information that is already publicly available. Since we value our integrity, we of course wrote the crawler accordingly, taking precautions to maintain transparency:
- The crawler should identify itself as “Plytix” (via its user-agent string), so that any system administrator would know it was us walking around their site.
- It should have a reasonable rate limit to avoid stressing out the servers it frequented.
- It should have a clear goal, rather than just crawling anything.
Regarding the final point, the subset of websites Plytix is interested in is, of course, e-commerce sites. Ideally, the crawler should be able to identify web shops and, if possible, proceed to scrape product information. This last part is particularly difficult, since each site has a different structure (templates, layout, code, you name it), which makes it much harder to build a scraper that works across sites.
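The first two precautions are easy to picture in code. Here's a minimal, stdlib-only sketch of a polite fetcher - the class name, the two-second delay, and the `fetch` helper are illustrative assumptions, not our actual implementation:

```python
import time
import urllib.request
from urllib.parse import urlparse

USER_AGENT = "Plytix"   # identifies the crawler to site owners
MIN_DELAY = 2.0         # seconds between requests to the same host (arbitrary choice)

class PoliteFetcher:
    """Fetches pages with a fixed User-Agent and a per-host rate limit."""

    def __init__(self, user_agent=USER_AGENT, min_delay=MIN_DELAY):
        self.user_agent = user_agent
        self.min_delay = min_delay
        self._last_request = {}  # host -> timestamp of the last request

    def _wait_if_needed(self, host):
        # Sleep until at least min_delay has passed since the last hit on this host.
        last = self._last_request.get(host)
        if last is not None:
            elapsed = time.monotonic() - last
            if elapsed < self.min_delay:
                time.sleep(self.min_delay - elapsed)
        self._last_request[host] = time.monotonic()

    def fetch(self, url):
        host = urlparse(url).netloc
        self._wait_if_needed(host)
        req = urllib.request.Request(url, headers={"User-Agent": self.user_agent})
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.read()
```

Frameworks like Scrapy offer the same politeness knobs as settings, so in practice you rarely write this by hand.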
Given all these considerations, I prepared a little list of useful links to visit for the team, so we had enough ideas to start a discussion about how to face the project (there are plenty of resources out there, but I kept the list short, since we only had one day for the hackathon):
- Introduction to web-crawling in Python
- My Python Web Crawler
- Web Scraping with Scrapy and MongoDB
- How to crawl a quarter billion webpages in 40 hours
Once everybody had read the links, it was time to open a discussion about how to face the development. Given the time constraint, we had to balance ambition and functionality. We knew there’d be a lot of things to improve later, no matter what we did, but we were determined to write the crawler with the ambition of creating something immediately useful. Even though having some “working” code by the end of the day was essential to us, the project also had to meet our quality standards, meaning that if we later decided to include it in our work pipeline, we wouldn’t have to start from scratch.
That resolution in mind, we had these issues to discuss:
- Which technology to use for the scraper? We all agreed that Scrapy was very promising, offering a nice framework that seemed to solve a lot of common problems in these kinds of projects. However, its learning curve was clearly steeper than that of scraping libraries such as pyquery and Beautiful Soup, which are basically HTML content parsers with very intuitive interfaces.
- How to build the URL database? This question was of course closely related to the previous one. We could trust Scrapy’s ability to run a spider on every single target domain and review all the pages on it, thus only maintaining domains in our URL database. Or we could use a scraping library to simply find all the URLs in the visited pages and then handle them ourselves, sorting out which ones were already in our database and which should be added.
- How to design the system architecture to properly balance the workload across a threaded crawler? Should we implement a hash-based or other fast-read lookup - like the Bloom filter proposed in the last article on our initial list of resources?
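The Bloom filter from that last article deserves a quick illustration: it answers “have we seen this URL before?” in constant time and fixed memory, never missing a URL it has seen, at the cost of occasional false positives. A stdlib-only sketch (sizes and hash choices here are arbitrary):

```python
import hashlib

class BloomFilter:
    """Probabilistic set: no false negatives, rare false positives."""

    def __init__(self, size_bits=8192, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions from slices of a single SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        for i in range(self.num_hashes):
            chunk = digest[i * 4:(i + 1) * 4]
            yield int.from_bytes(chunk, "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True only if every derived bit is set; a miss on any bit
        # guarantees the item was never added.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

For a crawler, the appeal is that the filter stays small no matter how many URLs pass through it; the trade-off is that a false positive means occasionally skipping a page you never actually visited.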
How to build a crawler
What did we do in the end? Well, after a brief discussion we decided to use Scrapy. We were a bit concerned by the learning curve, but we were also confident about the potential of the framework and all the headaches it could save us in the immediate future. This decision laid the ground for the rest of the work, making the process run quite smoothly:
- We would have a domains database, thus only storing the domain, rather than all the internal URLs of each site. This would significantly streamline the process of checking URLs.
- Moreover, we were not interested in processing all the domains out there, but only those identified as webshops. So, we would maintain another list of domains that had been visited, but had been discarded for not being e-commerce sites. Both lists would be handled by a crawler manager.
- Where to start? We decided to go for a few well-known web shops and then expand from there. Should we have been lacking ideas, though, Alexa’s list of the top million sites could also have been a good starting point.
- For each domain we visited, a new Scrapy spider would be created in real time for that specific domain, thus avoiding unnecessary or cyclic visits. Each spider would have two scrapers: one for URL links to other domains, which would be sent to the crawler manager, who would then determine which list to put them on (if any); the other scraper would search for product information, which was the main purpose of the project.
- As we knew that the product information would vary from site to site, we started out searching only for standard Product schema-like information, expecting to later broaden the search.
- We decided to avoid the use of a Bloom filter given its complexity and the limited time we had on our hands. Instead, we focused our efforts on building a scalable setup for the project in Amazon EC2.
We made the project more manageable by dividing the work into these areas:
- System: Build a scalable architecture in Amazon EC2 along with all the scripts for automated deployment, updates, etc. We used Ansible for automation (as we do in all Plytix projects).
- Crawler: Check and maintain DB, and launch a Scrapy instance for each of the domains to be visited. As mentioned, we would trust Scrapy’s spider engine to visit all the pages inside the domain.
- Scraper: Build the two scrapers (parsers) - one scraper for URLs (which were sent to the crawler manager) and one scraper for Products (sent to a Product DB).
- Interface: Build a basic admin panel where we could start and stop the crawler, track visited URLs, present all the scraped information, etc.
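To give a feel for the Scraper area: sites that use schema.org microdata mark product fields with `itemprop` attributes, so even a stdlib parser can pull them out. Our real scraper ran inside Scrapy; this standalone sketch (with a hypothetical class name and an arbitrary subset of properties) just shows the idea:

```python
from html.parser import HTMLParser

class ProductSchemaParser(HTMLParser):
    """Collects values from elements carrying schema.org `itemprop` attributes."""

    def __init__(self):
        super().__init__()
        self.product = {}
        self._current_prop = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        prop = attrs.get("itemprop")
        if prop in ("name", "price", "description", "brand"):
            if "content" in attrs:
                # e.g. <meta itemprop="price" content="9.99">
                self.product[prop] = attrs["content"]
            else:
                # value is the element's text, captured in handle_data
                self._current_prop = prop

    def handle_data(self, data):
        if self._current_prop and data.strip():
            self.product[self._current_prop] = data.strip()
            self._current_prop = None

parser = ProductSchemaParser()
parser.feed('<div itemscope itemtype="https://schema.org/Product">'
            '<span itemprop="name">Blue Mug</span>'
            '<meta itemprop="price" content="9.99"></div>')
# parser.product now holds {"name": "Blue Mug", "price": "9.99"}
```

This is also why we started with schema-like information only: markup this regular is scrapable across sites, while free-form product pages are not.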
So, what was the end result of our speedy hackathon? I have to say we managed to build quite a decent solution!
Scrapy was definitely the right choice, if you ask me. We had to read more documentation and some tutorials before being ready to go, but it saved us a lot of time and effort, as expected. The product scraper was of course very limited, but we managed to fetch a bunch of product information, and the crawler worked just fine. The admin panel we ended up with was very basic, but it gave a solid overview of the product information found by the scrapers.
All of this was achieved in only one day! So I’m naturally excited to find out how much we will be able to accomplish when we book a few days or even a week for our next hackathon. The experience was fun and inspiring, and we got a very promising product draft, which will probably be added to our internal tools in the near future.
If you have any questions or comments about our hackathon or the crawler we built, feel free to write a comment below or get in touch with us here.