Scrape webpage for dataset









I am trying to scrape the contents of a webpage so I can check whether a
Stata dataset exists.



I have put together a few lines of code but they don't work:



tempfile page
copy "https://www.stata-press.com/data/r15/u.html" "`page'"
tempname fh
file open `fh' using "`page'", read
file read `fh' line
while r(eof)==0 {
    if "`line'"=="regsmpl.dta" dis "Dataset exists"
    else dis "Dataset doesn't exist"
    file read `fh' line
}
file close `fh'


Any ideas will be highly appreciated.










      stata






      asked Nov 9 at 18:35 by user10630389
      edited Nov 9 at 18:51 by Pearly Spencer






















          1 Answer (accepted)










          You could first read the entire page into a string scalar using the fileread() function:



          local dataset regsmpl
          scalar page = fileread("https://www.stata-press.com/data/r15/u.html")
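It may be worth confirming that the read actually succeeded before searching the result. As far as the Stata documentation describes it, fileread() does not abort on failure but returns a string beginning with "fileread() error" followed by a return code. The sketch below shows one possible check; it deliberately reads a temporary file name that was never written, so it runs offline, and the names missing, contents and failed are made up for illustration.

```stata
* Assumption: on failure fileread() returns the text "fileread() error #".
* tempfile only reserves a name; no file is written, so the read fails.
tempfile missing
scalar contents = fileread("`missing'")

* failed is 1 when the result starts with the error prefix, 0 otherwise
local failed = (strpos(contents, "fileread() error") == 1)

if `failed' display as error "Could not read the file: " contents
else display "Read " strlen(contents) " characters"
```

The same test could be applied to the scalar holding the downloaded page before running either check on it.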


          Once the scalar has been created, you can check for the dataset in one of two ways.



          Solution 1: Check if the dataset is mentioned in the page



          if strmatch(page, "*`dataset'.dta*") display "Page mentions dataset"
          else display "No trace of dataset in page"


          Solution 2: Check if there is an actual link pointing to the dataset



          local link = ustrregexm(page, `"<a [^>]*\bhref\s*=\s*"([^"]*`dataset'.dta[^"]*)"')
          local url = trim(ustrregexs(1))

          if "`url'" != "" display "The link is: `url'"
          else display "There is no such link"
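To see what the capture group returns, the link pattern can be tried on a literal string instead of the live page. This is only a sketch: the HTML fragment is invented and may not resemble the real page's markup, the local names html and found are illustrative, and the dot before dta is escaped here so that it matches a literal period.

```stata
local dataset regsmpl

* Invented HTML fragment standing in for one line of the page
local html `"<li><a class="x" href="https://www.stata-press.com/data/r15/regsmpl.dta">regsmpl.dta</a></li>"'

* Group 1 captures everything between the quotes of the href attribute
local url ""
local found = ustrregexm(`"`html'"', `"<a [^>]*\bhref\s*=\s*"([^"]*`dataset'\.dta[^"]*)"')
if `found' local url = ustrregexs(1)

display "found = `found'"
display "url   = `url'"
```

Because the match is anchored on an anchor tag with an href attribute, plain text that merely mentions the file name would not count as a link.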


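          Your line-by-line approach can also work, using either strmatch() or a regular expression. Test for a substring rather than exact equality, and report the result once after the loop instead of on every line: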



          tempname fh
          file open `fh' using "https://www.stata-press.com/data/r15/u.html", read
          file read `fh' line

          local tag = 0
          while r(eof) == 0 {
              * flag any line that mentions the dataset
              if strmatch(`"`line'"', "*regsmpl.dta*") local tag = 1
              file read `fh' line
          }
          file close `fh'

          if `tag' == 1 display "Dataset exists"
          else display "Dataset doesn't exist"




          tempname fh
          file open `fh' using "https://www.stata-press.com/data/r15/u.html", read
          file read `fh' line

          local tag = 0
          while r(eof) == 0 {
              local link = ustrregexm(`"`line'"', `"<a [^>]*\bhref\s*=\s*"([^"]*`dataset'.dta[^"]*)"')
              if `link' == 1 {
                  * keep the captured href of the matching line
                  local url = trim(ustrregexs(1))
                  local tag = 1
              }
              file read `fh' line
          }
          file close `fh'


          if `tag' == 1 display "The link is: `url'"
          else display "There is no such link"
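The line-by-line variant can also be tried without a network connection. The sketch below writes a small made-up page to a temporary file, scans it, and stops at the first matching line; the handle names wh and rh, the counter lineno and the file contents are all assumptions for illustration.

```stata
* Write a three-line stand-in page to a temporary file (invented content)
tempfile page
tempname wh
file open `wh' using "`page'", write text replace
file write `wh' "<html><body>" _n
file write `wh' `"<a href="regsmpl.dta">regsmpl.dta</a>"' _n
file write `wh' "</body></html>" _n
file close `wh'

* Scan the file and leave the loop at the first line mentioning the dataset
tempname rh
local tag = 0
local lineno = 0
file open `rh' using "`page'", read text
file read `rh' line
while r(eof) == 0 {
    local ++lineno
    if strmatch(`"`line'"', "*regsmpl.dta*") {
        local tag = 1
        continue, break
    }
    file read `rh' line
}
file close `rh'

if `tag' == 1 display "Dataset found on line `lineno'"
else display "Dataset not found"
```

Breaking out early avoids reading the rest of the file once the answer is known, which is one design choice among several; collecting every match would need the loop to run to the end.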





          answered Nov 9 at 18:49 by Pearly Spencer
          edited Nov 11 at 0:04



























                     
