Scrapy + Selenium requests each URL twice





















import scrapy
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException, WebDriverException


class ProductSpider(scrapy.Spider):
    name = "product_spider"
    allowed_domains = ['ebay.com']
    start_urls = ['http://www.ebay.com/sch/i.html?_odkw=books&_osacat=0&_trksid=p2045573.m570.l1313.TR0.TRC0.Xpython&_nkw=python&_sacat=0&_from=R40']

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        # response.url has already been downloaded by Scrapy's scheduler;
        # this line fetches the same URL a second time in the browser.
        self.driver.get(response.url)

        while True:
            try:
                next_page = self.driver.find_element_by_xpath('//td[@class="pagn-next"]/a')
                next_page.click()

                # get the data and write it to scrapy items
            except (NoSuchElementException, WebDriverException):
                # No "next" link left: the last results page has been reached.
                break

        self.driver.close()


This approach is adapted from the answer to: selenium with scrapy for dynamic page



This solution works, but it requests each URL twice: once through the Scrapy scheduler and again through the Selenium web driver.



That roughly doubles the time the job takes compared to a plain Scrapy request without Selenium. How can I avoid the duplicate request?










python selenium web-scraping scrapy

asked Jun 6 at 7:16, edited Jun 6 at 7:39
Yash Pokar























  • Why do you want to use scrapy if you are already using Selenium?
    – VMRuiz
    Jun 6 at 11:10










  • @VMRuiz Scrapy isn't just about request/response handling and HTML parsing. It comes with a lot more capabilities built in, and the most interesting one is concurrency.
    – Yash Pokar
    Jun 6 at 11:20










  • In that case, if you only want to render the webpage, you can use scrapy + splash: splash.readthedocs.io/en/stable
    – VMRuiz
    Jun 6 at 11:22










  • I have used Splash, but I couldn't fetch results for one site, whereas Chrome and Firefox are well-known browsers that give the full result.
    – Yash Pokar
    Jun 6 at 11:28
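
For reference, here is a minimal sketch of the scrapy + splash approach suggested in the comments. It assumes the scrapy-splash package is installed and configured in settings.py, and that a Splash instance is listening on localhost:8050; the pagination XPath is illustrative only.

import scrapy
from scrapy_splash import SplashRequest


class SplashProductSpider(scrapy.Spider):
    # Assumes settings.py contains SPLASH_URL = 'http://localhost:8050'
    # plus the scrapy-splash downloader and spider middlewares.
    name = 'splash_products'

    def start_requests(self):
        url = ('http://www.ebay.com/sch/i.html?_odkw=books&_osacat=0'
               '&_trksid=p2045573.m570.l1313.TR0.TRC0.Xpython'
               '&_nkw=python&_sacat=0&_from=R40')
        # Splash renders the page (JavaScript included) before Scrapy sees it,
        # so only one fetch happens per URL.
        yield SplashRequest(url, self.parse, args={'wait': 2})

    def parse(self, response):
        # response.body is the rendered HTML; follow pagination as usual.
        next_page = response.xpath('//td[@class="pagn-next"]/a/@href').extract_first()
        if next_page:
            yield SplashRequest(response.urljoin(next_page), self.parse,
                                args={'wait': 2})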














1 Answer

















accepted answer, posted Jun 6 at 8:15 by Yash Pokar










Here is a trick that can help solve this problem.



Create a web service that drives Selenium, and run it locally:



from flask import Flask, request, make_response
from flask_restful import Resource, Api
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

app = Flask(__name__)
api = Api(app)

class Selenium(Resource):
    _driver = None

    @staticmethod
    def getDriver():
        # Lazily create a single headless Chrome instance shared by all requests.
        if not Selenium._driver:
            chrome_options = Options()
            chrome_options.add_argument("--headless")

            Selenium._driver = webdriver.Chrome(chrome_options=chrome_options)
        return Selenium._driver

    @property
    def driver(self):
        return Selenium.getDriver()

    def get(self):
        url = str(request.args['url'])

        # Render the page in the browser, then hand back its final HTML.
        self.driver.get(url)

        return make_response(self.driver.page_source)

api.add_resource(Selenium, '/')

if __name__ == '__main__':
    app.run(debug=True)


Now http://127.0.0.1:5000/?url=https://stackoverflow.com/users/5939254/yash-pokar will return the rendered web page, fetched through the headless Selenium Chrome driver.
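
A quick way to sanity-check the service, as a minimal sketch using the third-party requests package (the target URL is just an example):

import requests

# Ask the local Selenium service to render a page and return its HTML.
resp = requests.get('http://127.0.0.1:5000/',
                    params={'url': 'https://stackoverflow.com/users/5939254/yash-pokar'})
print(resp.status_code)
print(resp.text[:300])  # first 300 characters of the rendered page source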



Now, here is how the spider looks:



import scrapy
import urllib  # Python 2 here; on Python 3, use urllib.parse.quote


class ProductSpider(scrapy.Spider):
    name = 'products'
    allowed_domains = ['ebay.com']
    urls = [
        'http://www.ebay.com/sch/i.html?_odkw=books&_osacat=0&_trksid=p2045573.m570.l1313.TR0.TRC0.Xpython&_nkw=python&_sacat=0&_from=R40',
    ]

    def start_requests(self):
        for url in self.urls:
            # Route each request through the local Selenium service so the
            # page is rendered (and downloaded) exactly once.
            url = 'http://127.0.0.1:5000/?url={}'.format(urllib.quote(url))
            yield scrapy.Request(url)

    def parse(self, response):
        yield {
            'field': response.xpath('//td[@class="pagn-next"]/a').extract(),
        }
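
As an aside, not part of the original answer: the duplicate fetch can also be avoided without a separate service by moving Selenium into a Scrapy downloader middleware. When process_request returns a Response, Scrapy skips its own downloader, so each URL is fetched exactly once, by Selenium. A minimal sketch, with the module path myproject.middlewares as a placeholder:

from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.chrome.options import Options


class SeleniumMiddleware(object):
    """Fetch every request with one shared headless Chrome instance."""

    def __init__(self):
        options = Options()
        options.add_argument('--headless')
        self.driver = webdriver.Chrome(chrome_options=options)

    def process_request(self, request, spider):
        self.driver.get(request.url)
        # Returning a Response here short-circuits Scrapy's downloader,
        # so the page is downloaded once (by Selenium) instead of twice.
        return HtmlResponse(self.driver.current_url,
                            body=self.driver.page_source,
                            encoding='utf-8',
                            request=request)

Enable it in settings.py with DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.SeleniumMiddleware': 543}.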





