Scraping a Heritrix page using Python's requests module

ssl,python-requests,heritrix
I want to scrape a Heritrix home page using Python's requests module. When I try to open this page in Chrome, I get the error: "This server could not prove that it is 10.100.121.41; its security certificate is not trusted by your computer's operating system. This may be caused by..."
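
An error like this usually means the server's certificate is not signed by a CA the client trusts (a self-signed certificate, for instance). Two common options are to trust that certificate explicitly, or to skip verification on a network you fully control. With requests, the corresponding knob is the `verify` argument (a CA bundle path, or `False`). The sketch below shows the same idea using only the standard library, so it is a stand-in rather than requests code; `make_context`, `fetch`, and `ca_file` are illustrative names that do not come from the question.

```python
import ssl
import urllib.request


def make_context(ca_file=None):
    """Build an SSL context for a server with an untrusted certificate.

    ca_file is a hypothetical path to the server's certificate (PEM),
    exported beforehand. Without it, verification is switched off,
    which is only reasonable on a trusted internal network.
    """
    if ca_file:
        # Verify the server against the supplied PEM file.
        return ssl.create_default_context(cafile=ca_file)
    ctx = ssl.create_default_context()
    # check_hostname must be disabled before verify_mode can be CERT_NONE.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    return ctx


def fetch(url, ca_file=None, timeout=10):
    """Return the response body of url as bytes."""
    context = make_context(ca_file)
    with urllib.request.urlopen(url, context=context, timeout=timeout) as resp:
        return resp.read()
```

Calling `fetch("https://10.100.121.41/")` would then skip verification; the exact URL, port, and any login the page needs are assumptions to adjust for the actual Heritrix instance.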

Heritrix not finding CSS files in conditional comment blocks

java,web-crawler,heritrix
The problem/evidence: Heritrix is not detecting the presence of files in conditional comments that open & close in one string, such as this:
<!--[if (gt IE 8)|!(IE)]><!--> <link rel="stylesheet" href="/css/mod.css" /> <!--<![endif]-->
However, standard conditional blocks like this work fine:
<!--[if lte IE 9]> <script src="/js/ltei9.js"></script> <![endif]-->
I've identified the...
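
To make the difference between the two forms concrete, here is a small sketch that runs both snippets through Python's standard `html.parser`. It does not reproduce Heritrix's extractor and says nothing about the cause of the behaviour described above; it only shows how a generic HTML parser splits each snippet into comments and live markup. `UrlCollector` and `collect` are illustrative names.

```python
from html.parser import HTMLParser

# The two snippets quoted in the question.
REVEALED = ('<!--[if (gt IE 8)|!(IE)]><!--> '
            '<link rel="stylesheet" href="/css/mod.css" /> <!--<![endif]-->')
HIDDEN = '<!--[if lte IE 9]> <script src="/js/ltei9.js"></script> <![endif]-->'


class UrlCollector(HTMLParser):
    """Record href/src values seen in live markup, and raw comment bodies."""

    def __init__(self):
        super().__init__()
        self.live_urls = []
        self.comments = []

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <link ... />.
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.live_urls.append(value)

    def handle_comment(self, data):
        self.comments.append(data)


def collect(html):
    """Parse html and return the populated collector."""
    collector = UrlCollector()
    collector.feed(html)
    collector.close()
    return collector
```

With this parser, `collect(REVEALED).live_urls` contains `/css/mod.css` because the `<link>` sits between two separate comments, while `collect(HIDDEN).live_urls` is empty because the whole `<script>` element is part of a single comment body.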

Heritrix single-site scrape, including required off-site assets

java,web-crawler,heritrix
I believe I need help compiling Heritrix decide rules, although I'm open to other Heritrix suggestions: https://webarchive.jira.com/wiki/display/Heritrix/Configuring+Crawl+Scope+Using+DecideRules I need to scrape an entire copy of a website (in the crawler-beans.cxml seed list), but not scrape any external (off-site) pages. Any external resources needed to render the current website should be downloaded,...
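
The scoping goal described here can be written down as a small decision function before translating it into decide rules. This is a sketch of the logic only, not Heritrix configuration: it assumes each candidate URL comes with a hop-path string in which `L` marks a followed link and `E` marks an embedded resource (image, stylesheet, script). `in_scope` and its parameters are hypothetical names.

```python
from urllib.parse import urlsplit


def in_scope(url, hop_path, seed_host):
    """Decide whether a discovered URL should be fetched.

    url       -- the candidate URL
    hop_path  -- assumed string of hop types from the seed, e.g. "LLE"
    seed_host -- host name of the site being copied
    """
    host = (urlsplit(url).hostname or "").lower()
    if host == seed_host.lower():
        # Everything on the seed's own host is wanted.
        return True
    # Off-site: accept only resources embedded by an accepted page,
    # never pages that were reached by following a link.
    return hop_path.endswith("E")
```

For example, `in_scope("http://cdn.example.net/style.css", "LE", "example.com")` is accepted as an off-site embed, while `in_scope("http://other.example.org/page", "LL", "example.com")` is rejected as an off-site page. Matching on the exact host is a simplification; subdomains such as `www.` would need their own handling.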