It seems that a MessageQueue should be a good architectural solution for building a web crawler, but I still can't understand how to do it.

Let's consider the first case, with a shared database. It is pretty clear how to do it: the algorithm is the classical graph traversal.

There are multiple Workers and a shared database.

- I manually put the first url into the database

while true

  - worker gets a random discovered url from the database.
  - worker parses it and gets the list of all links on the page.
  - worker marks the url in the database as processed.
  - worker looks up the found links in the database and separates them
    into processed, discovered and new ones.
  - worker adds the new links to the database as discovered.
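The loop above can be sketched roughly as follows. This is a minimal single-worker illustration, not a definitive implementation: `FAKE_WEB` and `fetch_links` are hypothetical stand-ins for real HTTP fetching and HTML parsing, the `urls` table layout is an assumption, and an in-memory SQLite database plays the role of the shared database. With several real workers, picking a discovered url would also have to be made atomic so that two workers don't take the same one.

```python
import sqlite3

# Hypothetical in-memory "web": url -> links found on that page.
FAKE_WEB = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["d"],
    "d": [],
}

def fetch_links(url):
    """Stand-in for downloading a page and extracting its links."""
    return FAKE_WEB.get(url, [])

def crawl(seed):
    db = sqlite3.connect(":memory:")
    # Assumed schema: one row per url, status is 'discovered' or 'processed'.
    db.execute("CREATE TABLE urls (url TEXT PRIMARY KEY, status TEXT NOT NULL)")
    db.execute("INSERT INTO urls VALUES (?, 'discovered')", (seed,))
    while True:
        # Take any url that is still only discovered.
        row = db.execute(
            "SELECT url FROM urls WHERE status = 'discovered' LIMIT 1"
        ).fetchone()
        if row is None:
            break  # nothing left to crawl
        url = row[0]
        links = fetch_links(url)
        db.execute("UPDATE urls SET status = 'processed' WHERE url = ?", (url,))
        for link in links:
            # The primary key makes this a no-op for links that are already
            # known (processed or discovered), so only new ones are added.
            db.execute(
                "INSERT OR IGNORE INTO urls VALUES (?, 'discovered')", (link,)
            )
    rows = db.execute("SELECT url FROM urls WHERE status = 'processed'")
    return sorted(r[0] for r in rows)
```

Calling `crawl("a")` visits every page reachable from the seed exactly once, even though the pages link back to each other.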

Let's consider the second case, with a MessageQueue.

There is a MessageQueue containing urls that should be processed 
and multiple Workers.

- I manually put the first url in the Queue.

while true

  - worker takes the next discovered url from the Queue.
  - worker parses it and gets the list of all links on the page.
  - what does it do next? How does it separate the found links into
    processed, discovered and new ones?
  - worker puts the list of new urls into the Queue as discovered.



You would set up separate queues for these, which would stream back to your database. The idea is that you could have multiple workers going, and a feedback loop to send the newly discovered URLs back into the queue for processing, and to the database for storage.

How do I separate the links found on the page into processed, discovered and new ones? It's clear how to do it with a DB: just look up every link in the DB. But how do I do it with a MessageQueue?

You would probably still look up in the DB the links that are coming in from the queue.

So, the workflow looks like this:

- Link gets dropped on the queue.
- Queue worker picks it up, and checks the DB to see if the link has been processed.
- If not processed, make a call to the website to retrieve the other outbound links.
- Parse the page, and drop each outbound link onto the queue for processing.
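As a rough sketch of that workflow, under a few assumptions: Python's standard `queue.Queue` stands in for the message queue, a plain `set` named `processed` stands in for the database table of processed links, and `FAKE_WEB` / `fetch_links` are hypothetical replacements for the real HTTP call and page parsing. A real broker and database would differ in the details, but the shape of the loop would be similar.

```python
import queue

# Hypothetical in-memory "web": url -> outbound links on that page.
FAKE_WEB = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["d"],
    "d": [],
}

def fetch_links(url):
    """Stand-in for calling the website and parsing the page."""
    return FAKE_WEB.get(url, [])

def run_worker(q, processed):
    """Drain the queue; `processed` plays the role of the database."""
    fetched = []
    while True:
        try:
            url = q.get_nowait()  # worker picks the next link off the queue
        except queue.Empty:
            break
        if url in processed:      # DB check: already processed, drop the message
            continue
        processed.add(url)        # record the link as processed
        fetched.append(url)
        for link in fetch_links(url):
            q.put(link)           # feed every outbound link back onto the queue
    return fetched
```

Seeding the queue with one link and running the worker, for example `q = queue.Queue(); q.put("a"); run_worker(q, set())`, fetches each reachable page once. Duplicate messages do reach the queue here, but the lookup discards them before any page is fetched.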

Is it ok to keep all discovered urls in the MessageQueue? What if there are thousands of sites with thousands of pages? There would be millions of messages waiting in the Queue.

Probably not; this is what a database is for. Once things are processed, you should drop them from the queue. Queues are made for... queuing. Message transport. Not for data-storage. Databases are made for data-storage.

Now, until they are processed, yes, you can leave them on the queue. If you are worried about queue capacity, you could modify the workflow so that the queue worker removes any links that are already processed, which should reduce the depth of your queue. It might even be more efficient.
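One possible way to read that modification, sketched under assumptions: the worker checks each outbound link against the database before publishing it, so links that are already known never take up space on the queue. Here `known` is a hypothetical `set` standing in for the database, and `enqueue_new` is an illustrative helper name, not part of any queue library.

```python
import queue

def enqueue_new(q, known, links):
    """Put only never-seen links on the queue and return how many were added.

    `known` stands in for the database of discovered and processed links.
    """
    added = 0
    for link in links:
        if link not in known:
            known.add(link)  # record it as discovered before queuing
            q.put(link)
            added += 1
    return added
```

With this filter the queue depth is bounded by the number of distinct links that have not been processed yet, rather than by the total number of links seen across all pages.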


