Design a Web Crawler
Difficulty: Hard
Definition
A Web Crawler is a bot that downloads content from all over the Internet, or World Wide Web. It is also referred to as a spider, spider bot, worm, or simply a bot.
Search companies like Google and Bing use such bots to discover and download every webpage on the internet. The intention is to index that data and return the most relevant search results to the user.
Real-world Use Cases
- Search Engine: Platforms like Google and Bing use crawlers to index webpages in order to provide the most relevant search results.
One can think of a web crawler as a bot that organizes the content of the internet into a searchable form. It is like organizing a phone book alphabetically so it is easy to search by name; a crawler does the same for the web. Imagine the vastness of these search engines crawling through billions of pages to provide the most relevant results back to the user. From all the crawled data, search indices are created and used to fulfill user requests, much like turning to the back of a book to find where a certain phrase or topic is referenced.
- Malware Detection: Used to detect attacks like phishing.
- Web Analytics: Focuses on analyzing patterns in content across the internet. This kind of data is used by companies that try to generalize about the content of the internet.
- Copyright Infringement: Scans the internet for copyright violations of protected data.
- Special-purpose Index: Builds indices to understand multimedia files on the internet.
Requirements & Features to be supported
Feature List
- Crawling rate / Politeness
- Distributed Crawling
- Detecting duplicates
- Priority Crawling
Functional requirements
- Is there a single seed URL or multiple seed URLs?
- What to crawl: all URLs, multimedia files, page content? And which parts need to be persisted?
- Should only English-language articles be crawled?
- What types of documents are supported?
For this article, assume multimedia files are out of scope. The only work is to extract URLs from a page and, if no URLs are present, persist the page data to some local storage.
Non-functional requirements
- High Availability
- Scalability: Fetch content as fast as possible. The service should be able to scale to fetch billions of web pages.
- Fault tolerance: Timeouts can be discarded. If a URL fails to be fetched or rendered, discard the URL.
- Extensibility: Adding a new feature should be easy. One can expect to support new types of documents or multimedia files later, and doing so should not affect the existing application.
Assumptions
It is highly recommended to define the following in the interview, based on the above discussion with the interviewer:
Scale Estimations
- Expected number of URLs to crawl (writes per second):
- Expected number of API calls (reads per sec):
High Level Design
A Web Crawler has the following main components:
- Seeder (Url Adder)
- Crawler (Writer path)
- Indexer (Reader path)
Seeder
A simple component that adds the URLs to be crawled to a queue. It is the initiator for the system.
Crawler
Input: a list of seed URLs to begin crawling from. A queue data structure looks like a natural fit for holding the list of seed URLs.
The worker node will pull one URL from the queue and perform the following actions (a minimal sketch of this loop follows the list):
- Extract text from the URL.
- Send the extracted text to the document indexing service.
- Look for any URLs on the page. If any are found, check whether those links have already been visited. This ensures the crawler works on each URL only once and does not get stuck in a loop. A key-value based storage sounds reasonable for tracking visited URLs.
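A minimal sketch of this worker loop, assuming an in-memory queue and set stand in for the distributed queue and the visited-URL key-value store, and that requests and BeautifulSoup are available for fetching and parsing (send_to_indexer is a hypothetical placeholder, not a real service client):

```python
import queue
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

url_frontier = queue.Queue()   # stand-in for a distributed queue
visited = set()                # stand-in for a key-value store of crawled URLs


def send_to_indexer(url, text):
    # Placeholder for a call to the document indexing service.
    print(f"indexing {url}: {len(text)} characters")


def crawl_worker():
    while True:
        url = url_frontier.get()          # pull one URL from the queue
        if url in visited:                # skip URLs we have already crawled
            url_frontier.task_done()
            continue
        visited.add(url)

        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()
        except requests.RequestException:
            url_frontier.task_done()      # per the requirements, discard failed URLs
            continue

        soup = BeautifulSoup(resp.text, "html.parser")
        text = soup.get_text(separator=" ", strip=True)
        send_to_indexer(url, text)        # hand the extracted text to the indexer

        # Look for links on the page and enqueue the ones not seen yet.
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"])
            if link not in visited:
                url_frontier.put(link)

        url_frontier.task_done()
```

In the real system the queue would be a distributed queue and the visited check an atomic key-value operation, as discussed in the scaling section below.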
Indexer
Input: the content of the URL that needs to be indexed for ranking and searching capabilities.
The job of the document indexer is:
- Store the document content in some storage, such as a blob store or document store. A NoSQL store might not work well if the documents are large, so a blob store makes sense here.
- Create an inverted index based on keywords and phrases. A service like Elasticsearch or Lucene can be used to build such an index. The index keeps an association back to the original URL, so keyword and text-search queries can be resolved to URLs.
- Expose a query API that returns results from the inverted index. Inputs to the API could be the query and the number of results, and perhaps a pagination parameter in case the number of results is huge. A toy sketch of this read path follows.
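To make the read path concrete, here is a toy in-memory inverted index with a paginated query function; in practice Elasticsearch or Lucene would provide this, and the index_document and query names are illustrative:

```python
from collections import defaultdict

# keyword -> set of URLs whose content contains that keyword
inverted_index = defaultdict(set)


def index_document(url: str, text: str) -> None:
    """Tokenize the document and record which URLs contain each keyword."""
    for token in text.lower().split():
        inverted_index[token].add(url)


def query(q: str, limit: int = 10, offset: int = 0) -> list:
    """Return URLs matching all query terms, with simple pagination."""
    terms = q.lower().split()
    if not terms:
        return []
    matches = set.intersection(*(inverted_index[t] for t in terms))
    return sorted(matches)[offset:offset + limit]


index_document("https://example.com/a", "web crawler design basics")
index_document("https://example.com/b", "distributed crawler queue design")
print(query("crawler design", limit=10))  # both URLs match
```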
Challenges
Scalability
Scaling Crawler
- The worker queue can be a distributed queue. Any distributed queue has a visibility timeout, meaning a message stays in the queue but is made invisible until the worker finishes processing it. If the timeout elapses, the message becomes visible in the queue again, and what happens next depends on the redrive policy. If the redrive policy's maximum receive count is exceeded, the message is pushed to a dead-letter queue.
- A key-value based datastore is pretty fast, and additionally a cache like Redis or Memcached can be used. NoSQL databases typically come with master-slave replication, which ensures fault tolerance and supports re-partitioning, when required, using algorithms like consistent hashing (a small sketch follows).
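A minimal consistent-hash ring, sketching how keys of the visited-URL store or cache could be spread across nodes so that re-partitioning moves only a small fraction of keys (node names and the replica count are illustrative):

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class ConsistentHashRing:
    """Maps URLs to storage nodes; adding or removing a node only moves nearby keys."""

    def __init__(self, nodes, replicas=100):
        self._keys = []    # sorted hashes of virtual nodes
        self._nodes = {}   # virtual-node hash -> physical node
        for node in nodes:
            self.add_node(node, replicas)

    def add_node(self, node, replicas=100):
        # Place several virtual points per node to spread keys evenly.
        for i in range(replicas):
            h = _hash(f"{node}#{i}")
            bisect.insort(self._keys, h)
            self._nodes[h] = node

    def get_node(self, key: str) -> str:
        # Walk clockwise on the ring to the first virtual node at or after the key.
        h = _hash(key)
        idx = bisect.bisect(self._keys, h) % len(self._keys)
        return self._nodes[self._keys[idx]]


ring = ConsistentHashRing(["kv-node-1", "kv-node-2", "kv-node-3"])
print(ring.get_node("https://example.com/some/page"))
```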
Identifying bottlenecks:
- With billions of URLs to crawl, the crawler will make a large number of requests to the NoSQL database just to check whether a URL has already been visited and, if not, to add it to the key-value store. How do we ensure that only one worker thread is working on a given URL?
- What if a webpage is temporarily down or unavailable? Consider the scenario where the crawler receives a 5xx response from the website and the server takes around 50 seconds to respond. During this time, the message queue releases the same URL to another worker, which eventually suffers the same fate.
- How should dead links be handled? Perhaps by tagging such URLs as blacklisted, or storing them in a blacklist NoSQL store. This prevents other crawlers from adding such URLs back to the queue for processing.
- What to do if robots.txt is present? Perhaps the domain's disallowed paths need to be stored and treated as blacklisted (see the robots.txt sketch after this list).
- What if the crawler failed to crawl some webpage? Such scenarios need further investigation; the URL should be sent to the dead-letter queue.
- What if parallel crawlers are running on different pages of the same website? How do we throttle requests in this case, and how do we ensure multiple crawlers are not working on the same sub-domain? A streaming queue could be more useful here than a messaging queue, with the domain name as the partition key. With the domain name as the partition key, all sub-URLs belonging to that domain are handled by one worker only (a sketch of this appears after the list).
- How many parallel requests should be made to the subdomains of a website? Websites like Wikipedia can handle many concurrent requests, but a small website might not. A rate-limiting factor could be applied to the number of connections opened by the worker node, since a site like Wikipedia can tolerate more connections than a small business website. This value can be analyzed and tweaked by trial and error.
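For the robots.txt question above, Python's standard-library urllib.robotparser can check whether a given path is allowed before the URL is enqueued (a small sketch; the user-agent string and URLs are illustrative):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://en.wikipedia.org/robots.txt")
rp.read()  # fetch and parse robots.txt for the domain

# Only enqueue URLs that the site allows our crawler to fetch.
if rp.can_fetch("MyCrawlerBot", "https://en.wikipedia.org/wiki/Web_crawler"):
    print("allowed to crawl")
else:
    print("disallowed; skip or blacklist this path")
```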
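And a sketch of using the domain name as the partition key plus a simple per-domain politeness delay, so one worker owns each domain and throttles its own requests (the partition count and minimum delay are assumptions to tune):

```python
import time
import zlib
from urllib.parse import urlparse

NUM_PARTITIONS = 16          # one worker consumes each partition
MIN_DELAY_SECONDS = 1.0      # politeness gap between requests to one domain
last_fetch_at = {}           # domain -> timestamp of the last request


def partition_for(url: str) -> int:
    """Use the domain as the partition key so all URLs of a site go to one worker."""
    domain = urlparse(url).netloc
    return zlib.crc32(domain.encode()) % NUM_PARTITIONS


def wait_for_politeness(url: str) -> None:
    """Block until MIN_DELAY_SECONDS have passed since the last request to this domain."""
    domain = urlparse(url).netloc
    elapsed = time.monotonic() - last_fetch_at.get(domain, 0.0)
    if elapsed < MIN_DELAY_SECONDS:
        time.sleep(MIN_DELAY_SECONDS - elapsed)
    last_fetch_at[domain] = time.monotonic()


print(partition_for("https://en.wikipedia.org/wiki/Web_crawler"))
```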
Scaling Indexer
- The indexer is all about read operations. To make reads more efficient and scalable, we can do two things here:
1. Add more read replicas to support more requests.
2. Add a caching layer in between. Considering the 80/20 rule, 80% of requests can be served by 20% of the data held in cache.
- A distributed cache like Redis or Memcached can be used. What about a write-through cache, where the crawler directly updates the cache? What kind of caching strategy should be used here? Probably LRU. What would be an ideal TTL? A small read-through LRU sketch follows.
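A small read-through cache with LRU eviction and a TTL, sitting in front of the query API from the indexer sketch above (the capacity and TTL values are assumptions):

```python
import time
from collections import OrderedDict


class LruTtlCache:
    """LRU cache with a per-entry TTL, sitting in front of the inverted-index query."""

    def __init__(self, capacity=1000, ttl_seconds=300):
        self.capacity = capacity
        self.ttl = ttl_seconds
        self._store = OrderedDict()   # key -> (value, expires_at)

    def get(self, key):
        item = self._store.get(key)
        if item is None:
            return None
        value, expires_at = item
        if time.monotonic() > expires_at:     # expired entry: drop it
            del self._store[key]
            return None
        self._store.move_to_end(key)          # mark as most recently used
        return value

    def put(self, key, value):
        self._store[key] = (value, time.monotonic() + self.ttl)
        self._store.move_to_end(key)
        if len(self._store) > self.capacity:  # evict the least recently used entry
            self._store.popitem(last=False)


cache = LruTtlCache()


def cached_query(q: str):
    results = cache.get(q)
    if results is None:          # cache miss: hit the index, then populate the cache
        results = query(q)       # query() from the inverted-index sketch above
        cache.put(q, results)
    return results
```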
Identifying bottlenecks:
- What if a page that has already been indexed is updated? How frequently should these changes be reflected?
- De-duplication of documents might need to be considered when updating a document in blob storage (a sketch follows).
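One simple way to de-duplicate is to compare a content hash of the newly fetched page against the hash stored for that URL, and only rewrite the blob when the content actually changed (write_to_blob_store is a placeholder, not a real client API):

```python
import hashlib

# url -> SHA-256 hash of the last stored content (stand-in for a key-value store)
content_hashes = {}


def write_to_blob_store(url: str, content: str) -> None:
    # Placeholder for the real blob-store client.
    print(f"storing {len(content)} bytes for {url}")


def store_if_changed(url: str, content: str) -> bool:
    """Write to blob storage only when the page content has actually changed."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    if content_hashes.get(url) == digest:
        return False                      # identical content: skip the write and re-index
    content_hashes[url] = digest
    write_to_blob_store(url, content)
    return True
```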
Metrics & Logging
Crawler
- How many times was the worker rate limited?
- Latency of every operation
- Unresponsive websites
Indexer
Originally published at https://raghavnyati.ghost.io on December 9, 2020.