Scrapy + Selenium requests each URL twice





















import scrapy
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException, WebDriverException


class ProductSpider(scrapy.Spider):
    name = "product_spider"
    allowed_domains = ['ebay.com']
    start_urls = ['http://www.ebay.com/sch/i.html?_odkw=books&_osacat=0&_trksid=p2045573.m570.l1313.TR0.TRC0.Xpython&_nkw=python&_sacat=0&_from=R40']

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        # response.url has already been downloaded by Scrapy's scheduler;
        # this line fetches the same URL a second time in the browser.
        self.driver.get(response.url)

        while True:
            try:
                next_page = self.driver.find_element_by_xpath('//td[@class="pagn-next"]/a')
                next_page.click()

                # get the data and write it to scrapy items
            except (NoSuchElementException, WebDriverException):
                # No "next" link left: the last results page has been reached.
                break

        self.driver.close()


This approach is adapted from the answer to: selenium with scrapy for dynamic page



This solution works, but it requests each URL twice: once through the Scrapy scheduler and again through the Selenium web driver.



That roughly doubles the time the job takes compared to a plain Scrapy request without Selenium. How can I avoid the duplicate request?










python selenium web-scraping scrapy

asked Jun 6 at 7:16, edited Jun 6 at 7:39
Yash Pokar























  • Why do you want to use scrapy if you are already using Selenium?
    – VMRuiz
    Jun 6 at 11:10










  • @VMRuiz Scrapy isn't just about request/response handling and HTML parsing. It comes with a lot more capabilities built in, and the most interesting one is concurrency.
    – Yash Pokar
    Jun 6 at 11:20










  • In that case, if you only want to render the webpage, you can use scrapy + splash: splash.readthedocs.io/en/stable
    – VMRuiz
    Jun 6 at 11:22










  • I have used Splash, but I couldn't fetch results for one site, whereas Chrome and Firefox are well-known browsers that give the full result.
    – Yash Pokar
    Jun 6 at 11:28
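
For reference, here is a minimal sketch of the scrapy + splash approach suggested in the comments. It assumes the scrapy-splash package is installed and configured in settings.py, and that a Splash instance is listening on localhost:8050; the pagination XPath is illustrative only.

import scrapy
from scrapy_splash import SplashRequest


class SplashProductSpider(scrapy.Spider):
    # Assumes settings.py contains SPLASH_URL = 'http://localhost:8050'
    # plus the scrapy-splash downloader and spider middlewares.
    name = 'splash_products'

    def start_requests(self):
        url = ('http://www.ebay.com/sch/i.html?_odkw=books&_osacat=0'
               '&_trksid=p2045573.m570.l1313.TR0.TRC0.Xpython'
               '&_nkw=python&_sacat=0&_from=R40')
        # Splash renders the page (JavaScript included) before Scrapy sees it,
        # so only one fetch happens per URL.
        yield SplashRequest(url, self.parse, args={'wait': 2})

    def parse(self, response):
        # response.body is the rendered HTML; follow pagination as usual.
        next_page = response.xpath('//td[@class="pagn-next"]/a/@href').extract_first()
        if next_page:
            yield SplashRequest(response.urljoin(next_page), self.parse,
                                args={'wait': 2})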














1 Answer

















accepted answer, posted Jun 6 at 8:15 by Yash Pokar










Here is a trick that can help solve this problem.



Create a web service that drives Selenium, and run it locally:



from flask import Flask, request, make_response
from flask_restful import Resource, Api
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

app = Flask(__name__)
api = Api(app)

class Selenium(Resource):
    _driver = None

    @staticmethod
    def getDriver():
        # Lazily create a single headless Chrome instance shared by all requests.
        if not Selenium._driver:
            chrome_options = Options()
            chrome_options.add_argument("--headless")

            Selenium._driver = webdriver.Chrome(chrome_options=chrome_options)
        return Selenium._driver

    @property
    def driver(self):
        return Selenium.getDriver()

    def get(self):
        url = str(request.args['url'])

        # Render the page in the browser, then hand back its final HTML.
        self.driver.get(url)

        return make_response(self.driver.page_source)

api.add_resource(Selenium, '/')

if __name__ == '__main__':
    app.run(debug=True)


Now http://127.0.0.1:5000/?url=https://stackoverflow.com/users/5939254/yash-pokar will return the rendered web page, fetched through the headless Selenium Chrome driver.
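
A quick way to sanity-check the service, as a minimal sketch using the third-party requests package (the target URL is just an example):

import requests

# Ask the local Selenium service to render a page and return its HTML.
resp = requests.get('http://127.0.0.1:5000/',
                    params={'url': 'https://stackoverflow.com/users/5939254/yash-pokar'})
print(resp.status_code)
print(resp.text[:300])  # first 300 characters of the rendered page source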



Now, here is how the spider looks:



import scrapy
import urllib  # Python 2 here; on Python 3, use urllib.parse.quote


class ProductSpider(scrapy.Spider):
    name = 'products'
    allowed_domains = ['ebay.com']
    urls = [
        'http://www.ebay.com/sch/i.html?_odkw=books&_osacat=0&_trksid=p2045573.m570.l1313.TR0.TRC0.Xpython&_nkw=python&_sacat=0&_from=R40',
    ]

    def start_requests(self):
        for url in self.urls:
            # Route each request through the local Selenium service so the
            # page is rendered (and downloaded) exactly once.
            url = 'http://127.0.0.1:5000/?url={}'.format(urllib.quote(url))
            yield scrapy.Request(url)

    def parse(self, response):
        yield {
            'field': response.xpath('//td[@class="pagn-next"]/a').extract(),
        }
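
As an aside, not part of the original answer: the duplicate fetch can also be avoided without a separate service by moving Selenium into a Scrapy downloader middleware. When process_request returns a Response, Scrapy skips its own downloader, so each URL is fetched exactly once, by Selenium. A minimal sketch, with the module path myproject.middlewares as a placeholder:

from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.chrome.options import Options


class SeleniumMiddleware(object):
    """Fetch every request with one shared headless Chrome instance."""

    def __init__(self):
        options = Options()
        options.add_argument('--headless')
        self.driver = webdriver.Chrome(chrome_options=options)

    def process_request(self, request, spider):
        self.driver.get(request.url)
        # Returning a Response here short-circuits Scrapy's downloader,
        # so the page is downloaded once (by Selenium) instead of twice.
        return HtmlResponse(self.driver.current_url,
                            body=self.driver.page_source,
                            encoding='utf-8',
                            request=request)

Enable it in settings.py with DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.SeleniumMiddleware': 543}.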





