

SgmlLinkExtractor in scrapy

web-crawler,scrapy,rules,extractor
I need some enlightenment about SgmlLinkExtractor in Scrapy. For the link example.com/YYYY/MM/DD/title I would write: Rule(SgmlLinkExtractor(allow=[r'\d{4}/\d{2}/\d{2}/\w+']), callback='parse_example'). For the link example.com/news/economic/title, should I write r'\news\category\w+' or r'\news\w+/\w+'? (The category changes, but the URL always contains news.) For the link example.com/article/title, should I write r'\article\w+'? (The URL always contains article.)...
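One way to check candidate patterns before wiring them into a Rule is to try them with Python's re module directly. The sketch below assumes the link extractor applies each allow pattern with a regex search against the full URL, which is worth verifying against your Scrapy version. The helper name matches, the constant names, and the sample URLs are made up for illustration. Note that path separators in a URL are forward slashes: in a regex, a backslash starts an escape, so r'\news' means "a newline followed by ews" rather than "/news".

```python
import re

# Hypothetical patterns for the three URL shapes in the question.
# Forward slashes are literal path separators; \d and \w are the only escapes.
DATE_PATTERN = r'/\d{4}/\d{2}/\d{2}/\w+'   # example.com/YYYY/MM/DD/title
NEWS_PATTERN = r'/news/\w+/\w+'            # example.com/news/<category>/title
ARTICLE_PATTERN = r'/article/\w+'          # example.com/article/title


def matches(pattern, url):
    """Return True if the pattern is found anywhere in the URL (unanchored search)."""
    return re.search(pattern, url) is not None


if __name__ == '__main__':
    # Sample URLs are assumptions for illustration only.
    for pattern, url in [
        (DATE_PATTERN, 'http://example.com/2013/05/17/title'),
        (NEWS_PATTERN, 'http://example.com/news/economic/title'),
        (ARTICLE_PATTERN, 'http://example.com/article/title'),
        # Backslash version from the question: \n is a newline escape.
        (r'\news\w+/\w+', 'http://example.com/news/economic/title'),
    ]:
        print(pattern, url, matches(pattern, url))
```

If the patterns behave as expected here, the same strings could be passed in the allow list of the Rule, for example allow=[r'/news/\w+/\w+'].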

Design pattern for consecutive object value extractor

java,design-patterns,iterator,extractor,reader
Consider an object that extracts object values from a source on a "pull" basis, until a special value (e.g., null) is encountered. In Java, the API could be something like public interface ValueExtractor<T> { public T extractNext(); } Operationally, this is an Iterator but it only has a (sort of)...