How to output data crawled from multiple web pages into a CSV file using Python with Scrapy


Tag: python-2.7,web-scraping,web-crawler,scrapy,scrapy-spider

I have the code below, which crawls all the available pages of a website. The crawl itself works: when I print the `items` list I can see the data. But I get no output when I use a `.csv` file as the destination for the scraped data (command used at the command prompt: `scrapy crawl craig -o test.csv -t csv`). Please help me output the data into a `csv` file.

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.exceptions import CloseSpider
from scrapy.http import Request
from test.items import CraigslistSampleItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

URL = ""

class MySpider(BaseSpider):
    name = "craig"
    allowed_domains = [""]

    #for u in URL:
    start_urls = [URL % 1]

    def __init__(self):
        self.page_number = 1

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//div[@class='thumb']")
        if not titles:
            raise CloseSpider('No more pages')
        items = []
        for titles in titles:
            item = CraigslistSampleItem()
            item["title"] = titles.select("a/@title").extract()
            item["url"] = titles.select("a/@href").extract()
        yield items

        self.page_number += 1
        yield Request(URL % self.page_number)

The spider above yields the `items` list, which is never filled and is a plain list rather than an item, so nothing reaches the CSV file. Yield each item from inside the loop instead:

from scrapy.spider import BaseSpider
from scrapy.exceptions import CloseSpider
from scrapy.http import Request
from test.items import CraigslistSampleItem

URL = ""

class MySpider(BaseSpider):
    name = "craig"
    allowed_domains = [""]

    def start_requests(self):
        # Request pages 0 through 9 up front.
        for i in range(10):
            yield Request(URL % i, callback=self.parse)

    def parse(self, response):
        titles = response.xpath("//div[@class='thumb']")
        if not titles:
            raise CloseSpider('No more pages')
        for title in titles:
            item = CraigslistSampleItem()
            item["title"] = title.xpath("./a/@title").extract()
            item["url"] = title.xpath("./a/@href").extract()
            # Yield each item so the feed exporter can write it as a CSV row.
            yield item
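
The difference between yielding one list and yielding individual items can be illustrated without Scrapy. The sketch below is only a stand-in under stated assumptions: `export_csv` is a hypothetical helper written for this example (it is not Scrapy's feed exporter), plain dicts stand in for `CraigslistSampleItem`, and the sample rows are made up. It assumes Python 3.

```python
import csv
import io


def parse_yielding_list(rows):
    """Mimics the first callback: builds items but yields the still-empty list."""
    items = []
    for title, url in rows:
        item = {"title": title, "url": url}  # built, but never appended to items
    yield items


def parse_yielding_items(rows):
    """Mimics the second callback: yields one item per row."""
    for title, url in rows:
        yield {"title": title, "url": url}


def export_csv(results, fieldnames=("title", "url")):
    """Hypothetical exporter stand-in: writes only dict-like items as CSV rows."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    written = 0
    for result in results:
        if isinstance(result, dict):  # a list is not an item, so it is skipped
            writer.writerow(result)
            written += 1
    return buf.getvalue(), written


# Made-up sample data standing in for scraped titles and links.
rows = [("First ad", "/ads/1.html"), ("Second ad", "/ads/2.html")]
_, count_list = export_csv(parse_yielding_list(rows))
text, count_items = export_csv(parse_yielding_items(rows))
print(count_list, count_items)
print(text)
```

With the list-yielding callback no rows are written, while the item-yielding callback produces one CSV row per scraped entry.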


Need workaround to treat float values as tuples when updating “list” of float values

I am finding errors with the last line of the for loop where I am trying to update the curve value to a list containing the curve value iterations. I get errors like "can only concatenate tuple (not "float") to tuple" and "tuple object has no attribute 'append'". Does anyone...

Python code not executing in order? MySQLdb UPDATE commits in unexpected order

I've got a Python 2.7 script I'm working on that retrieves rows from a MySQL table, loops through the data to process it, then is supposed to do the following things in this order: UPDATE the table rows we just got previously to set a locked value in each row...

BeautifulSoup is not getting all data, only some

import requests from bs4 import BeautifulSoup def trade_spider(max_pages): page = 0 while page <= max_pages: url = '' + str(page * 100) source_code = requests.get(url) plain_text = source_code.text soup = BeautifulSoup(plain_text) for link in soup.findAll('a', {'class':'hdrlnk'}): href = '' + link.get('href') title = link.string print title #print href get_single_item_data(href) page...

How to remove structure with python from this case?

How to remove "table" from HTML using python? I had a case like this: paragraph = ''' <p>Lorem ipsum dolor sit amet, consectetur adipisicing elit. Quidem molestiae consequuntur officiis corporis sint.<br /><br /> <table> <tr> <td> text title </td> <td> text title 2 </td> </tr> </table> <p> lorem ipsum</p> ''' how...

Count function counting only last line of my list

Count function counting only last line of my list N = int(raw_input()) cnt = [] for i in range(N): string = raw_input() for j in range(1,len(string)): if string[j] =='K': cnt.append('R') elif string[j] =='R': cnt.append('R') if string[0] == 'k': cnt.append('k') elif string[0] == 'R': cnt.append('R') print cnt.count('R') if I am giving...

delete data from mysql table based on multiple conditions

I need to delete rows from a MySQL table that is very large. I have multiple conditions to check before deleting the data. My table looks like: data(table name): name mobile email address source xyz 871 [email protected] 1.txt bac 123 null 2.XLS TST 456 [email protected] 3.xls yup 897 null abcde web null [email protected]

Parse text from a .txt file using csv module

I have an email that comes in every day and the format of the email is always the same except some of the data is different. I wrote a VBA Macro that exports the email to a text file. Now that it is a text file I want to parse the...

List of tuples from (a, all b) to (b, all a)

I am starting with a list of tuples (a,all b). I want to end with a list of tuples (b,all a). For example: FROM (a1,[b1,b2,b3]) (a2,[b2]) (a3,[b1,b2]) TO (b1,[a1,a3]) (b2,[a1,a2,a3]) (b3,[a1]) How do I do this using Python 2? Thank you for your help....

Anaconda site-packages

After installing a package in an anaconda environment, I'd like to make some changes to the code in that package. Where can I find the site-packages directory containing the installed packages? I do not find a directory /Users/username/anaconda/lib/python2.7/site-packages...

Benefit of using os.mkdir vs os.system(“mkdir”)

Simple question that I can't find an answer to: Is there a benefit of using os.mkdir("somedir") over os.system("mkdir somedir") or, beyond code portability? Answers should apply to Python 2.7. Edit: the point was raised that a hard-coded directory versus a variable (possibly containing user-defined data) introduces the question of...

How to find longest consistent increment in a python list?

possible_list = [] bigger_list = [] new_list= [0, 25, 2, 1, 14, 1, 14, 1, 4, 6, 6, 7, 0, 10, 11] for i in range(0,len(new_list)): # if the next index is not greater than the length of the list if (i + 1) < (len(new_list)): #if the current value...

from x(defined in program) import y(defined in program) python

I need some assistance since I really have no idea how I can fix this: x="test" y="test2" When I try to import y from x, it says that there is no file with the name "x" (from x import y) Is there any way to import test2 from test...

Python root logger messages not being logged via handler configured with fileConfig

The Problem: Given a logging config and a logger that employs that config, I see log messages from the script in which the log handler is configured, but not from the root logger, to which the same handler is assigned. Details: (Using Python 2.7) I have a module my_mod which...

Parsing Google Custom Search API for Elasticsearch Documents

After retrieving results from the Google Custom Search API and writing it to JSON, I want to parse that JSON to make valid Elasticsearch documents. You can configure a parent - child relationship for nested results. However, this relationship seems to not be inferred by the data structure itself. I've...

How to efficiently extract content from an xml with python?

I have the following xml: <?xml version="1.0" encoding="UTF-8" standalone="no"?><author id="user23"> <document><![CDATA["@username: That boner came at the wrong time ????" HELP I'M DYING ]]></document> <document><![CDATA[Ugh ]]></document> <document><![CDATA[YES !!!! WE GO FOR IT. ]]></document> <document><![CDATA[@username Shout out to me???? ]]></document> </author> What is the most efficient...

monosubstitution cypher : decryption in python 2.7 list trouble

I'm currently coding a simple monosubstitution cypher in python. The encryption goes this way: first a key to encipher is produced this way def non_random_key(key_name): my_alphabet = [] for char in alphabet: my_alphabet.append(char) my_key_alphabet = list(key_name) for char in key_name: my_alphabet.remove(char) return my_key_alphabet + my_alphabet Then a message is encrypted...

Stopping list selection in Python 2.7

Imagine that I have an ordered list of tuples: s = [(0,-1), (1,0), (2,-1), (3,0), (4,0), (5,-1), (6,0), (7,-1)] Given a parameter X, I want to select all the tuples that have a first element equal or greater than X up to but not including the first tuple that has...

What is the difference between <> and == in python?

In [142]: (MON,TUE,WED,THR,FRI,SAT,SUN)<>range(7) Out[142]: True In [143]: (MON,TUE,WED,THR,FRI,SAT,SUN)==range(7) Out[143]: False ...

PySerial client unable to write data

I'm trying to write a python program which can communicate over a serial interface using PySerial module as follows: import serial if __name__ == '__main__': port = "/dev/tnt0" ser = serial.Serial(port, 38400) print print ser.isOpen() x = ser.write('hello') ser.close() print "Done!" But if I execute the above I get...

Slicing a Python OrderedDict

In my code I frequently need to take a subset range of keys+values from a Python OrderedDict (from collections package). Slicing doesn't work (throws TypeError: unhashable type) and the alternative, iterating, is cumbersome: from collections import OrderedDict o = OrderedDict([('a', 1), ('b', 2), ('c', 3), ('d', 4)]) # want to...

Compare 2 separate csv files and write difference to a new csv file - Python 2.7

I am trying to compare two csv files in python and save the difference to a third csv file in python 2.7. import csv f1 = open ("olddata/file1.csv") oldFile1 = csv.reader(f1) oldList1 = [] for row in oldFile1: oldList1.append(row) f2 = open ("newdata/file2.csv") oldFile2 = csv.reader(f2) oldList2 = [] for...

Is there a way to say for every x values, do this?

Before I start, I am new to Python, so any low-level description would be incredibly helpful! I have a list of, let's say, 60 values (representing one hour, from 8:00-9:00) and I want to run average, maximums, minimums, and standard deviation for each set of 15. (I already have the...

Python - Terminate Child Process or PID?

I have a script (simplified below) that initiates another python process. I know the process name and PID for the current and child processes. When I attempt to terminate the child process - menu option (2) - I get the message "local variable 'py_process' referenced before assignment." Suggestions to terminate...

How to skip a function

I am learning python and trying to make a little game. So my question is can you define a function but skip it and use it later. EX. def func() print"1,2,3,4" func() def func2() print "counting" func() func2() How would I skip func but still be able to print it...

Python initialize strings as variable

I have a calculation schema as a string calc = "((k+m+46)/2)" and some strings containing variables like m = 2 k = m*2, all only strings. Now I want to initialize them in Python. My goal is to calculate the variable values with the calculation schema. calc should return 26...

Syntax Error (FROM) in Python, I do not want to use it as function but rather use it as to print something

I am trying to print out usernames from Instagram. When I type in print i.from.username, there is a syntax error because Python thinks that I am using the from function, which I am actually not. for i in a: print i.from.username Is there any way to troubleshoot it? I tried using making a...

How to specify string variables as unicode strings for pattern and text in regex matching?

>>> import re >>> re.match(u'^[一二三四五六七]、', u'一、') If the pattern and the text are stored in variables (for example, they were read from text files), >>> myregex='^[一二三四五六七]、' >>> mytext='一、' How shall I specify myregex and mytext to re.match, in the same way as re.match(u'^[一二三四五六七]、', u'一、')? Thanks....

lookbehind for start of string or a character

The command re.compile(ur"(?<=,| |^)(?:next to|near|beside|opp).+?(?=,|$)", re.IGNORECASE) throws a sre_constants.error: look-behind requires fixed-width pattern error in my program but regex101 shows it to be fine. What I'm trying to do here is to match landmarks from addresses (each address is in a separate string) like: "Opp foobar, foocity" --> Must match...

Strange Behavior: Floating Point Error after Appending to List

I am writing a simple function to step through a range with floating step size. To keep the output neat, I wrote a function, correct, that corrects the floating point error that is common after an arithmetic operation. That is to say: correct(0.3999999999) outputs 0.4, correct(0.1000000001) outputs 0.1, etc. Here's...

Identify that a string could be a datetime object

If I knew the format in which a string represents date-time information, then I can easily use datetime.datetime.strptime(s, fmt). However, without knowing the format of the string beforehand, would it be possible to determine whether a given string contains something that could be parsed as a datetime object with the...

Pass function call as a function argument

Code: def function1(a,b): return a-1,b-1 def function2(c,d): return c+1,d+1 print function1(function2(1,2)) Error: Traceback (most recent call last): File "C:\Users\sony\Desktop\Python\scripts\", line 6, in <module> print function1(function2(1,2)) TypeError: function1() takes exactly 2 arguments (1 given) [Finished in 0.1s with exit code 1] Why the above error? ...

Python Popen - wait vs communicate vs CalledProcessError

Continuing from my previous question I see that to get the error code of a process I spawned via Popen in python I have to call either wait() or communicate() (which can be used to access the Popen stdout and stderr attributes): app7z = '/path/to/7z.exe' command = [app7z, 'a', dstFile.temp,...

Import on class instantiation

I'm creating a module with several classes in it. My problem is that some of these classes need to import very specific modules that need to be manually compiled or need specific hardware to work. There is no interest in importing every specific module up front, and as some modules...

How to check for multiple attributes in a list

I am making a TBRPG game using Python 2.7, and I'm currently making a quest system. I wanted to make a function that checks all of the quests in a list, in this case (quests), and tells you if any of the quests in the list have the same...

how to fetch a column in browse_record_list in orm browse method in openERP

I'm a beginner in openERP. I'm trying to get a column in a table. While using the ORM browse method and iterating that object, I got the result in browse_record_list as browse_record(,21). I want to fetch that particular id 21 alone through that browse method but instead I'm getting the same browse_record as...

Pandas Dataframe Complex Calculation

I have the following dataframe,df: Year totalPubs ActualCitations 0 1994 71 191.002034 1 1995 77 2763.911781 2 1996 69 2022.374474 3 1997 78 3393.094951 I want to write code that would do the following: Citations of currentyear / Sum of totalPubs of the two previous years I want something to...

Sort List of Numbers according to Custom Number Sequence

Question: A set of numbers will be passed as input. Also the redefined relationship of the digits 0-9 in ascending order will be passed as input. Based on the redefined relationship, the set of numbers must be listed in ascending order. Input Format: The first line will contain the...

python - how to properly evaluate a value from a system command

I'm just learning python (using 2.7.8) and I'm trying to figure out what is the best way to evaluate the output of a system command. I've read to use subprocess. For example, I need to run this IF statement and evaluate for anything > 0, then process it. Example of...

Who calls the metaclass

This actually stems from a discussion here on SO. Short version def meta(name, bases, class_dict) return type(name, bases, class_dict) class Klass(object): __metaclass__ = meta meta() is called when Klass class declaration is executed. Which part of the (python internal) code actually calls meta()? Long version When the class is declared,...

Keep strings that occur N times or more

I have a list that is mylist = ['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd'] And I used Counter from collections on this list to get the result: from collections import Counter counts = Counter(mylist) #Counter({'a': 3, 'c': 2, 'b': 2, 'd': 1}) Now I want to subset this...

I need to make sure that only certain characters are in a list?

I have this to get input and put it in a list: def start(): move_order=[raw_input("Enter your moves: ").split()] And I only want the characters A, D, S, C, H (it's for a game >_>) to be allowed. I've tried using the regular expressions stuff: if re.match('[ADSCH]+', [move_order]) is False: print...

How do I copy a row from one pandas dataframe to another pandas dataframe?

I have a dataframe of data that I am trying to append to another dataframe. I have tried various ways with .append() and there has been no successful way. When I print the data from iterrows. I provide 2 possible ways I tried to solve the issue below, one creates...

Why does `for lst in lst:` work? [duplicate]

This question already has an answer here: Why can I use the same name for iterator and sequence in a Python for loop? 6 answers The following code lst = ['foo', 'bar', 'baz'] for lst in lst: print lst gives me this output foo bar baz I would expect...

How can I resolve my variable's unexpected output?

I have a variable in django named optional_message. If I debug the variable then it says Swenskt but when I try to print the variable on my page the following comes out: (u'Swenskt',) and the variable can't be tested for its length etc. What should I do if I only...

Python split by comma delimiter and strip

I am trying to open a file in python in read mode then write stripped and split data to an output file. I am unsure how to split and strip the same data. Do I need to create a different line? Do I need to write the data out first?...

multiple iteration of the same list

I have one list of data as follows: from shapely.geometry import box data = [box(1,2,3,4), box(4,5,6,7), box(1,2,3,4)] sublists = [A,B,C] The list 'data' has following sub-lists: A = box(1,2,3,4) B = box(4,5,6,7) C = box(1,2,3,4) I have to check if sub-lists intersect. If intersect they should put in one tuple;...

Adding time/duration from CSV file

I am trying to add time/duration values from a CSV file that I have but I have failed so far. Here's the sample csv that I'm trying to add up. Is getting this output possible? Output: I have been trying to add up the datetime but I always fail: finput...

Use NamedTemporaryFile to read from stdout via subprocess on Linux

import subprocess import tempfile fd = tempfile.NamedTemporaryFile() print(fd) print( p = subprocess.Popen("date", stdout=fd).communicate() print(p[0]) fd.close() This returns: <open file '<fdopen>', mode 'w' at 0x7fc27eb1e810> /tmp/tmp8kX9C1 None Instead, I would like it to return something like: Tue Jun 23 10:23:15 CEST 2015 I tried adding mode="w", as well as delete=False, but...

Problems with tk entry and optionmenu widget in currency converter

I am currently having problems with my currency converter program in my python class. I am trying to convert an amount, the entry widget, from the starting currency, the first option menu, to the desired currency, the second option menu. I have spent hours reading the documentation, so I hope...

Can't get value from xpath python

I want to get values from page:,Actimel-cytryna-miod-Danone.html I can get all values from first section, but I can't get values from table "Wartości odżywcze" I use this xpath: ''.join(tree2.xpath("//html/body/div[1]/div[3]/article/div[2]/div/div[4]/div[3]/div/div[1]/div[3]/table[1]/tr[3]/td[2]/span/text()")) But I'm not getting anything. With xpath like this: ''.join(tree2.xpath("//html/body/div[1]/div[3]/article/div[2]/div/div[4]/div[3]/div/div[1]/div[3]/table[1]/tr[3]/td[2]//text()")) I'm...