How to use MessageQueue in Crawler?

Tag: architecture,web-crawler,message-queue

It seems that a MessageQueue should be a good architectural fit for building a Web Crawler, but I still can't see how to do it.

Let's consider the first case, with a shared database. It is pretty clear how to do it; the algorithm is the classical Graph Traversal:

There are multiple Workers and a shared database.

- I manually put the first url into the database

while true

  - worker gets a random discovered url from the database.
  - worker parses it and gets the list of all links on the page.
  - worker marks the url in the database as processed.
  - worker looks up the found links in the database and separates them
    into processed, discovered and new ones.
  - worker adds the new links to the database as discovered.
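
The shared-database loop above can be sketched roughly as follows. This is only a minimal single-worker illustration under stated assumptions: `FAKE_WEB` and `fetch_links` are hypothetical stand-ins for real downloading and parsing, the `urls` table layout is invented for the example, and an in-memory SQLite database plays the role of the shared database.

```python
import sqlite3

# Hypothetical in-memory "web": each url maps to the links found on that page.
FAKE_WEB = {
    "http://a.example/": ["http://a.example/1", "http://a.example/2"],
    "http://a.example/1": ["http://a.example/", "http://a.example/2"],
    "http://a.example/2": [],
}


def fetch_links(url):
    """Stand-in for downloading a page and extracting its links."""
    return FAKE_WEB.get(url, [])


def crawl(db, seed):
    """Run the traversal until no discovered url is left; return processed urls."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS urls (url TEXT PRIMARY KEY, state TEXT NOT NULL)"
    )
    # Manually put the first url into the database.
    db.execute("INSERT OR IGNORE INTO urls VALUES (?, 'discovered')", (seed,))
    while True:
        # Take a random url that is still in the 'discovered' state.
        row = db.execute(
            "SELECT url FROM urls WHERE state = 'discovered' "
            "ORDER BY RANDOM() LIMIT 1"
        ).fetchone()
        if row is None:
            break  # nothing left to crawl
        url = row[0]
        links = fetch_links(url)
        db.execute("UPDATE urls SET state = 'processed' WHERE url = ?", (url,))
        # The primary key makes the database do the separation: links already
        # stored (processed or discovered) are ignored, new ones are inserted.
        db.executemany(
            "INSERT OR IGNORE INTO urls VALUES (?, 'discovered')",
            [(link,) for link in links],
        )
        db.commit()
    rows = db.execute("SELECT url FROM urls WHERE state = 'processed' ORDER BY url")
    return [r[0] for r in rows]


if __name__ == "__main__":
    print(crawl(sqlite3.connect(":memory:"), "http://a.example/"))
```

With several workers, the step that picks a discovered url would also have to claim it atomically so that two workers do not crawl the same page; the sketch leaves that out.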

Let's consider the second case, with a MessageQueue.

There is a MessageQueue containing the urls that should be processed,
and multiple Workers.

- I manually put the first url in the Queue.

while true

  - worker takes the next discovered url from the Queue.
  - worker parses it and gets the list of all links on the page.
  - what does it do next? How does it separate the found links into
    processed, discovered and new ones?
  - worker puts the list of new urls into the Queue as discovered.



What does it do next? How does it separate the found links into processed, discovered and new ones?

You would set up separate queues for these, which would stream back to your database. The idea is that you can have multiple workers running, with a feedback loop that sends the newly discovered URLs back into the queue for processing and to the database for storage.

How do I separate the links found on the page into processed, discovered and new ones? It's clear how to do it with a DB: just look up every link in the DB. But how do I do it with a MessageQueue?

You would probably still look up in the DB the links that are coming in from the queue.

So, the workflow looks like this:

- Link gets dropped on the queue.
- Queue worker picks it up and checks the db to see if the link is processed.
- If not processed, make a call to the website to retrieve other outbound links.
- Parse the page, and drop each outbound link onto the queue for processing.
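
A rough sketch of that workflow, under a few assumptions: Python's in-process `queue.Queue` stands in for a real message queue, a plain `set` stands in for the database of processed links, and `FAKE_WEB` and `fetch_links` are hypothetical replacements for the actual HTTP request and page parsing.

```python
import queue

# Hypothetical in-memory "web": each url maps to the links found on that page.
FAKE_WEB = {
    "http://a.example/": ["http://a.example/1", "http://a.example/2"],
    "http://a.example/1": ["http://a.example/", "http://a.example/2"],
    "http://a.example/2": [],
}


def fetch_links(url):
    """Stand-in for calling the website and parsing the outbound links."""
    return FAKE_WEB.get(url, [])


def worker(url_queue, processed):
    """Consume urls until the queue is empty, feeding new links back into it."""
    while True:
        try:
            url = url_queue.get_nowait()  # queue worker picks up a link
        except queue.Empty:
            return  # a worker on a real broker would block and wait instead
        if url in processed:  # the "check the db" step
            continue  # already processed: just drop the message
        links = fetch_links(url)
        processed.add(url)  # record the result in the "db"
        for link in links:
            url_queue.put(link)  # drop each outbound link onto the queue


if __name__ == "__main__":
    q = queue.Queue()
    q.put("http://a.example/")  # manually seed the queue
    seen = set()
    worker(q, seen)
    print(sorted(seen))
```

In this shape the queue may hold the same link several times; the check against the database when a message is picked up is what keeps a page from being fetched twice.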

Is it OK to keep all discovered urls in the MessageQueue? What if there are thousands of sites with thousands of pages? There would be millions of messages waiting in the Queue.

Probably not; this is what a database is for. Once things are processed, you should drop them from the queue. Queues are made for... queuing. Message transport. Not for data storage. Databases are made for data storage.

Now, until they are processed, yes, you can leave them on the queue. If you are worried about queue capacity, you could modify the workflow so that the queue worker removes any links that are already processed. That should reduce the depth of your queue, and it might even be more efficient.
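
One possible reading of that suggestion, sketched here as an assumption rather than as the only way to do it: filter the outbound links against the set of already known links before they are put on the queue, so duplicates never add to the queue depth. The helper name `enqueue_new` is hypothetical, a `set` stands in for the database and a `deque` for the queue.

```python
from collections import deque


def enqueue_new(links, known, pending):
    """Queue only links that are not known yet; return how many were queued."""
    added = 0
    for link in links:
        if link not in known:  # lookup that would hit the database
            known.add(link)  # remember it so later duplicates are skipped
            pending.append(link)
            added += 1
    return added


if __name__ == "__main__":
    known = {"http://a.example/"}
    pending = deque()
    found = ["http://a.example/", "http://a.example/1", "http://a.example/1"]
    print(enqueue_new(found, known, pending), list(pending))
```

The trade-off is one extra lookup per discovered link in exchange for a queue that only ever contains links still waiting to be crawled.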

