What can R do about a messy data format?










36















Sometimes I see data posted in a Stack Overflow question formatted like in this question. This is not the first time, so I have decided to ask a question about it, and answer the question with a way to make the posted data palatable.



I will post the dataset example here just in case the question is deleted.



+------------+------+------+----------+--------------------------+
| Date | Emp1 | Case | Priority | PriorityCountinLast7days |
+------------+------+------+----------+--------------------------+
| 2018-06-01 | A | A1 | 0 | 0 |
| 2018-06-03 | A | A2 | 0 | 1 |
| 2018-06-03 | A | A3 | 0 | 2 |
| 2018-06-03 | A | A4 | 1 | 1 |
| 2018-06-03 | A | A5 | 2 | 1 |
| 2018-06-04 | A | A6 | 0 | 3 |
| 2018-06-01 | B | B1 | 0 | 1 |
| 2018-06-02 | B | B2 | 0 | 2 |
| 2018-06-03 | B | B3 | 0 | 3 |
+------------+------+------+----------+--------------------------+


As you can see this is not the right way to post data. As a user wrote in a comment,




It must've taken a bit of time to format the data the way you're
showing it here. Unfortunately this is not a good format for us to
copy & paste.




I believe this says it all. The asker is well intended and it took some work and time to try to be nice, but the result is not good.



What can R code do to make that table usable, if anything? Will it take a great deal of trouble?










share|improve this question
























  • This is an interesting question by itself. However, I fear that a good answer might encourage people in presenting their dataset like that.

    – nicola
    Aug 26 '18 at 6:39











  • @nicola I surely hope not! It's already bad to see it every now and then, maybe it will have the opposite effect :).

    – Rui Barradas
    Aug 26 '18 at 6:48











  • I would enjoy a general smart messy format reader for SO. that would read those, and printed data.frames/ data.tables / tibbles, commented or not, with row numbers or not, recognize dates, and ideally try to guess a way to sort out strings containing spaces, I thought of building it in fact but never got around it.

    – Moody_Mudskipper
    Aug 29 '18 at 20:25















36















Sometimes I see data posted in a Stack Overflow question formatted like in this question. This is not the first time, so I have decided to ask a question about it, and answer the question with a way to make the posted data palatable.



I will post the dataset example here just in case the question is deleted.



+------------+------+------+----------+--------------------------+
| Date | Emp1 | Case | Priority | PriorityCountinLast7days |
+------------+------+------+----------+--------------------------+
| 2018-06-01 | A | A1 | 0 | 0 |
| 2018-06-03 | A | A2 | 0 | 1 |
| 2018-06-03 | A | A3 | 0 | 2 |
| 2018-06-03 | A | A4 | 1 | 1 |
| 2018-06-03 | A | A5 | 2 | 1 |
| 2018-06-04 | A | A6 | 0 | 3 |
| 2018-06-01 | B | B1 | 0 | 1 |
| 2018-06-02 | B | B2 | 0 | 2 |
| 2018-06-03 | B | B3 | 0 | 3 |
+------------+------+------+----------+--------------------------+


As you can see this is not the right way to post data. As a user wrote in a comment,




It must've taken a bit of time to format the data the way you're
showing it here. Unfortunately this is not a good format for us to
copy & paste.




I believe this says it all. The asker is well intended and it took some work and time to try to be nice, but the result is not good.



What can R code do to make that table usable, if anything? Will it take a great deal of trouble?










share|improve this question
























  • This is an interesting question by itself. However, I fear that a good answer might encourage people in presenting their dataset like that.

    – nicola
    Aug 26 '18 at 6:39











  • @nicola I surely hope not! It's already bad to see it every now and then, maybe it will have the opposite effect :).

    – Rui Barradas
    Aug 26 '18 at 6:48











  • I would enjoy a general smart messy format reader for SO. that would read those, and printed data.frames/ data.tables / tibbles, commented or not, with row numbers or not, recognize dates, and ideally try to guess a way to sort out strings containing spaces, I thought of building it in fact but never got around it.

    – Moody_Mudskipper
    Aug 29 '18 at 20:25













36












36








36


8






Sometimes I see data posted in a Stack Overflow question formatted like in this question. This is not the first time, so I have decided to ask a question about it, and answer the question with a way to make the posted data palatable.



I will post the dataset example here just in case the question is deleted.



+------------+------+------+----------+--------------------------+
| Date | Emp1 | Case | Priority | PriorityCountinLast7days |
+------------+------+------+----------+--------------------------+
| 2018-06-01 | A | A1 | 0 | 0 |
| 2018-06-03 | A | A2 | 0 | 1 |
| 2018-06-03 | A | A3 | 0 | 2 |
| 2018-06-03 | A | A4 | 1 | 1 |
| 2018-06-03 | A | A5 | 2 | 1 |
| 2018-06-04 | A | A6 | 0 | 3 |
| 2018-06-01 | B | B1 | 0 | 1 |
| 2018-06-02 | B | B2 | 0 | 2 |
| 2018-06-03 | B | B3 | 0 | 3 |
+------------+------+------+----------+--------------------------+


As you can see this is not the right way to post data. As a user wrote in a comment,




It must've taken a bit of time to format the data the way you're
showing it here. Unfortunately this is not a good format for us to
copy & paste.




I believe this says it all. The asker is well intended and it took some work and time to try to be nice, but the result is not good.



What can R code do to make that table usable, if anything? Will it take a great deal of trouble?










share|improve this question
















Sometimes I see data posted in a Stack Overflow question formatted like in this question. This is not the first time, so I have decided to ask a question about it, and answer the question with a way to make the posted data palatable.



I will post the dataset example here just in case the question is deleted.



+------------+------+------+----------+--------------------------+
| Date | Emp1 | Case | Priority | PriorityCountinLast7days |
+------------+------+------+----------+--------------------------+
| 2018-06-01 | A | A1 | 0 | 0 |
| 2018-06-03 | A | A2 | 0 | 1 |
| 2018-06-03 | A | A3 | 0 | 2 |
| 2018-06-03 | A | A4 | 1 | 1 |
| 2018-06-03 | A | A5 | 2 | 1 |
| 2018-06-04 | A | A6 | 0 | 3 |
| 2018-06-01 | B | B1 | 0 | 1 |
| 2018-06-02 | B | B2 | 0 | 2 |
| 2018-06-03 | B | B3 | 0 | 3 |
+------------+------+------+----------+--------------------------+


As you can see this is not the right way to post data. As a user wrote in a comment,




It must've taken a bit of time to format the data the way you're
showing it here. Unfortunately this is not a good format for us to
copy & paste.




I believe this says it all. The asker is well intended and it took some work and time to try to be nice, but the result is not good.



What can R code do to make that table usable, if anything? Will it take a great deal of trouble?







r dataframe






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Aug 26 '18 at 11:13









Peter Mortensen

13.5k1984111




13.5k1984111










asked Aug 26 '18 at 6:21









Rui BarradasRui Barradas

16.4k51730




16.4k51730












  • This is an interesting question by itself. However, I fear that a good answer might encourage people in presenting their dataset like that.

    – nicola
    Aug 26 '18 at 6:39











  • @nicola I surely hope not! It's already bad to see it every now and then, maybe it will have the opposite effect :).

    – Rui Barradas
    Aug 26 '18 at 6:48











  • I would enjoy a general smart messy format reader for SO. that would read those, and printed data.frames/ data.tables / tibbles, commented or not, with row numbers or not, recognize dates, and ideally try to guess a way to sort out strings containing spaces, I thought of building it in fact but never got around it.

    – Moody_Mudskipper
    Aug 29 '18 at 20:25

















  • This is an interesting question by itself. However, I fear that a good answer might encourage people in presenting their dataset like that.

    – nicola
    Aug 26 '18 at 6:39











  • @nicola I surely hope not! It's already bad to see it every now and then, maybe it will have the opposite effect :).

    – Rui Barradas
    Aug 26 '18 at 6:48











  • I would enjoy a general smart messy format reader for SO. that would read those, and printed data.frames/ data.tables / tibbles, commented or not, with row numbers or not, recognize dates, and ideally try to guess a way to sort out strings containing spaces, I thought of building it in fact but never got around it.

    – Moody_Mudskipper
    Aug 29 '18 at 20:25
















This is an interesting question by itself. However, I fear that a good answer might encourage people in presenting their dataset like that.

– nicola
Aug 26 '18 at 6:39





This is an interesting question by itself. However, I fear that a good answer might encourage people in presenting their dataset like that.

– nicola
Aug 26 '18 at 6:39













@nicola I surely hope not! It's already bad to see it every now and then, maybe it will have the opposite effect :).

– Rui Barradas
Aug 26 '18 at 6:48





@nicola I surely hope not! It's already bad to see it every now and then, maybe it will have the opposite effect :).

– Rui Barradas
Aug 26 '18 at 6:48













I would enjoy a general smart messy format reader for SO. that would read those, and printed data.frames/ data.tables / tibbles, commented or not, with row numbers or not, recognize dates, and ideally try to guess a way to sort out strings containing spaces, I thought of building it in fact but never got around it.

– Moody_Mudskipper
Aug 29 '18 at 20:25





I would enjoy a general smart messy format reader for SO. that would read those, and printed data.frames/ data.tables / tibbles, commented or not, with row numbers or not, recognize dates, and ideally try to guess a way to sort out strings containing spaces, I thought of building it in fact but never got around it.

– Moody_Mudskipper
Aug 29 '18 at 20:25












6 Answers
6






active

oldest

votes


















24














Using data.table::fread:



x = '
+------------+------+------+----------+--------------------------+
| Date | Emp1 | Case | Priority | PriorityCountinLast7days |
+------------+------+------+----------+--------------------------+
| 2018-06-01 | A | A1 | 0 | 0 |
| 2018-06-03 | A | A2 | 0 | 1 |
| 2018-06-03 | A | A3 | 0 | 2 |
| 2018-06-03 | A | A4 | 1 | 1 |
| 2018-06-03 | A | A5 | 2 | 1 |
| 2018-06-04 | A | A6 | 0 | 3 |
| 2018-06-01 | B | B1 | 0 | 1 |
| 2018-06-02 | B | B2 | 0 | 2 |
| 2018-06-03 | B | B3 | 0 | 3 |
+------------+------+------+----------+--------------------------+
'

fread(gsub('\+.+\n' ,'', x, perl = T), drop=c(1,7))

# Date Emp1 Case Priority PriorityCountinLast7days
# 1: 2018-06-01 A A1 0 0
# 2: 2018-06-03 A A2 0 1
# 3: 2018-06-03 A A3 0 2
# 4: 2018-06-03 A A4 1 1
# 5: 2018-06-03 A A5 2 1
# 6: 2018-06-04 A A6 0 3
# 7: 2018-06-01 B B1 0 1
# 8: 2018-06-02 B B2 0 2
# 9: 2018-06-03 B B3 0 3


The gsub part removes the horizontal rules. drop removes the extra columns caused by delimiters at the line ends.






share|improve this answer
































    18














    The short answer to the question is yes, R code can solve that mess and no, it doesn't take that much trouble.



    The first step after copying & pasting the table into an R session is to read it in with read.table setting the header, sep, comment.char and strip.white arguments.



    Credits for reminding me of arguments comment.char and strip.white go to @nicola, and his comment.



    dat <- read.table(text = "
    +------------+------+------+----------+--------------------------+
    | Date | Emp1 | Case | Priority | PriorityCountinLast7days |
    +------------+------+------+----------+--------------------------+
    | 2018-06-01 | A | A1 | 0 | 0 |
    | 2018-06-03 | A | A2 | 0 | 1 |
    | 2018-06-03 | A | A3 | 0 | 2 |
    | 2018-06-03 | A | A4 | 1 | 1 |
    | 2018-06-03 | A | A5 | 2 | 1 |
    | 2018-06-04 | A | A6 | 0 | 3 |
    | 2018-06-01 | B | B1 | 0 | 1 |
    | 2018-06-02 | B | B2 | 0 | 2 |
    | 2018-06-03 | B | B3 | 0 | 3 |
    +------------+------+------+----------+--------------------------+
    ", header = TRUE, sep = "|", comment.char = "+", strip.white = TRUE)


    But as you can see there are some issues with the result.



    dat
    X Date Emp1 Case Priority PriorityCountinLast7days X.1
    1 NA 2018-06-01 A A1 0 0 NA
    2 NA 2018-06-03 A A2 0 1 NA
    3 NA 2018-06-03 A A3 0 2 NA
    4 NA 2018-06-03 A A4 1 1 NA
    5 NA 2018-06-03 A A5 2 1 NA
    6 NA 2018-06-04 A A6 0 3 NA
    7 NA 2018-06-01 B B1 0 1 NA
    8 NA 2018-06-02 B B2 0 2 NA
    9 NA 2018-06-03 B B3 0 3 NA


    To have separators start and end each data row made R believe those separators mark extra columns, which is not what is meant by the original question's OP.



    So the second step is to keep only the real columns. I will do this subsetting the columns by their numbers, easily done, they usually are the first and last columns.



    dat <- dat[-c(1, ncol(dat))]
    dat
    Date Emp1 Case Priority PriorityCountinLast7days
    1 2018-06-01 A A1 0 0
    2 2018-06-03 A A2 0 1
    3 2018-06-03 A A3 0 2
    4 2018-06-03 A A4 1 1
    5 2018-06-03 A A5 2 1
    6 2018-06-04 A A6 0 3
    7 2018-06-01 B B1 0 1
    8 2018-06-02 B B2 0 2
    9 2018-06-03 B B3 0 3


    That wasn't too hard, much better.

    In this case there is still a problem, to coerce column Date to class Date.



    dat$Date <- as.Date(dat$Date)


    And the result is satisfactory.



    str(dat)
    'data.frame': 9 obs. of 5 variables:
    $ Date : Date, format: "2018-06-01" "2018-06-03" ...
    $ Emp1 : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 2 2 2
    $ Case : Factor w/ 9 levels "A1","A2","A3",..: 1 2 3 4 5 6 7 8 9
    $ Priority : int 0 0 0 1 2 0 0 0 0
    $ PriorityCountinLast7days: int 0 1 2 1 1 3 1 2 3


    Note that I have not set the more or less standard argument stringsAsFactors = FALSE. If needed, this should be done when running read.table.



    The whole process took only 3 lines of base R code.



    Finally, the end result in dput format, like it should be in the first place.



    dat <-
    structure(list(Date = structure(c(17683, 17685, 17685, 17685,
    17685, 17686, 17683, 17684, 17685), class = "Date"), Emp1 = c("A",
    "A", "A", "A", "A", "A", "B", "B", "B"), Case = c("A1", "A2",
    "A3", "A4", "A5", "A6", "B1", "B2", "B3"), Priority = c(0, 0,
    0, 1, 2, 0, 0, 0, 0), PriorityCountinLast7days = c(0, 1, 2, 1,
    1, 3, 1, 2, 3)), row.names = c(NA, -9L), class = "data.frame")





    share|improve this answer




















    • 1





      @李哲源 I don't know about "safer", I find it more natural to use read table functions since after all that does look like a table.

      – Rui Barradas
      Aug 26 '18 at 6:46






    • 1





      I was about to post utils:::head.default(read.table("clipboard",comment.char="+",sep="|",strip.white=TRUE,header=TRUE,flush=TRUE)[-1],-1), which is not different from what you did (in my line, you manage the first and third line and remove the undesired columns).

      – nicola
      Aug 26 '18 at 8:13


















    5














    md_table <- scan(text = "
    +------------+------+------+----------+--------------------------+
    | Date | Emp1 | Case | Priority | PriorityCountinLast7days |
    +------------+------+------+----------+--------------------------+
    | 2018-06-01 | A | A1 | 0 | 0 |
    | 2018-06-03 | A | A2 | 0 | 1 |
    | 2018-06-03 | A | A3 | 0 | 2 |
    | 2018-06-03 | A | A4 | 1 | 1 |
    | 2018-06-03 | A | A5 | 2 | 1 |
    | 2018-06-04 | A | A6 | 0 | 3 |
    | 2018-06-01 | B | B1 | 0 | 1 |
    | 2018-06-02 | B | B2 | 0 | 2 |
    | 2018-06-03 | B | B3 | 0 | 3 |
    +------------+------+------+----------+--------------------------+",
    what = "", sep = "", comment.char = "+", quiet = TRUE)

    ## it is clear that there are 5 columns
    mat <- matrix(md_table[md_table != "|"], ncol = 5, byrow = TRUE)
    # [,1] [,2] [,3] [,4] [,5]
    # [1,] "Date" "Emp1" "Case" "Priority" "PriorityCountinLast7days"
    # [2,] "2018-06-01" "A" "A1" "0" "0"
    # [3,] "2018-06-03" "A" "A2" "0" "1"
    # [4,] "2018-06-03" "A" "A3" "0" "2"
    # [5,] "2018-06-03" "A" "A4" "1" "1"
    # [6,] "2018-06-03" "A" "A5" "2" "1"
    # [7,] "2018-06-04" "A" "A6" "0" "3"
    # [8,] "2018-06-01" "B" "B1" "0" "1"
    # [9,] "2018-06-02" "B" "B2" "0" "2"
    #[10,] "2018-06-03" "B" "B3" "0" "3"




    ## a data frame with all character columns
    dat <- setNames(data.frame(mat[-1, ], stringsAsFactors = FALSE), mat[1, ])
    # Date Emp1 Case Priority PriorityCountinLast7days
    #1 2018-06-01 A A1 0 0
    #2 2018-06-03 A A2 0 1
    #3 2018-06-03 A A3 0 2
    #4 2018-06-03 A A4 1 1
    #5 2018-06-03 A A5 2 1
    #6 2018-06-04 A A6 0 3
    #7 2018-06-01 B B1 0 1
    #8 2018-06-02 B B2 0 2
    #9 2018-06-03 B B3 0 3




    ## or maybe just use `type.convert` on some columns?
    dat <- lapply(dat, type.convert)





    share|improve this answer
































      2














      Well, about this specific dataset I used the import feature in RStudio, but I took one additional step beforehand.



      1. Copy the dataset into the Notepad file.

      2. Replace all | characters with ,


      3. Import the Notepad file using read.csv to RStudio using this code (seperate columns by ,).

      But, if you mean use the R to fully understand it in one step, then I have no idea.






      share|improve this answer

























      • Notepad? Are you assuming Microsoft Windows?

        – Peter Mortensen
        Aug 26 '18 at 11:16











      • @PeterMortensen Yes, but it could be any other editor as well.

        – Salman Lashkarara
        Aug 26 '18 at 11:26


















      2














      The issue isn't so much how many lines of code it takes, two or five, not much difference. The question is more whether it will work beyond the example you posted here.



      I haven't come across this sort of thing in the wild, but I had a go at constructing another example that I thought could conceivably exist.



      I've since come across a couple more cases and added them to the test suite.



      x1 <- "
      +------------+------+------+----------+--------------------------+
      | Date | Emp1 | Case | Priority | PriorityCountinLast7days |
      +------------+------+------+----------+--------------------------+
      | 2018-06-01 | A | A1 | 0 | 0 |
      | 2018-06-03 | A | A2 | 0 | 1 |
      | 2018-06-02 | B | B2 | 0 | 2 |
      | 2018-06-03 | B | B3 | 0 | 3 |
      +------------+------+------+----------+--------------------------+
      "

      x2 <- "
      ––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
      Date | Emp1 | Case | Priority | PriorityCountinLast7days
      ––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
      2018-06-01 | A | A|1 | 0 | 0
      2018-06-03 | A | A|2 | 0 | 1
      2018-06-02 | B | B|2 | 0 | 2
      2018-06-03 | B | B|3 | 0 | 3
      ––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
      "

      x3 <- "
      Maths | English | Science | History | Class

      0.1 | 0.2 | 0.3 | 0.2 | Y2

      0.9 | 0.5 | 0.7 | 0.4 | Y1

      0.2 | 0.4 | 0.6 | 0.2 | Y2

      0.9 | 0.5 | 0.2 | 0.7 | Y1
      "

      x4 <- "
      Season | Team | W | AHWO
      -------------------------------------
      1 | 2017/2018 | TeamA | 2 | 1.75
      2 | 2017/2018 | TeamB | 1 | 1.85
      3 | 2017/2018 | TeamC | 1 | 1.70
      4 | 2017/2018 | TeamD | 0 | 3.10
      5 | 2016/2017 | TeamA | 1 | 1.49
      6 | 2016/2017 | TeamB | 3 | 1.51
      7 | 2016/2017 | TeamC | 2 | 1.90
      8 | 2016/2017 | TeamD | 0 | N/A
      "


      My go at a function



      f <- function(x, header=TRUE, na.strings=c("NA", "N/A"), 
      stringsAsFactors=FALSE, ...) ", " ", x)

      # read the result as a table
      read.table(text=paste(x, collapse="n"), header=header,
      na.strings=na.strings, stringsAsFactors=stringsAsFactors, ...)


      lapply(c(x1, x2, x3, x4, x5), f)


      Output



      [[1]]
      Date Emp1 Case Priority PriorityCountinLast7days
      1 2018-06-01 A A1 0 0
      2 2018-06-03 A A2 0 1
      3 2018-06-02 B B2 0 2
      4 2018-06-03 B B3 0 3

      [[2]]
      Date Emp1 Case Priority PriorityCountinLast7days
      1 2018-06-01 A A|1 0 0
      2 2018-06-03 A A|2 0 1
      3 2018-06-02 B B|2 0 2
      4 2018-06-03 B B|3 0 3

      [[3]]
      Maths English Science History Class
      1 0.1 0.2 0.3 0.2 Y2
      2 0.9 0.5 0.7 0.4 Y1
      3 0.2 0.4 0.6 0.2 Y2
      4 0.9 0.5 0.2 0.7 Y1

      [[4]]
      Season Team W AHWO
      1 2017/2018 TeamA 2 1.75
      2 2017/2018 TeamB 1 1.85
      3 2017/2018 TeamC 1 1.70
      4 2017/2018 TeamD 0 3.10
      5 2016/2017 TeamA 1 1.49
      6 2016/2017 TeamB 3 1.51
      7 2016/2017 TeamC 2 1.90
      8 2016/2017 TeamD 0 NA



      x3 is from here (will have to look at the edit history).

      x4 is from here






      share|improve this answer

























      • Great answer, upvote. The more general the solution the better.

        – Rui Barradas
        Oct 2 '18 at 13:27


















      -3














      As it was suggested, you could use dput to save the content of a dataframe to a file, open the file in a text editor and paste its content. An example of mtcar's dataset limited to first 10 rows:



      dput(mtcars %>% head(10), file = 'reproducible.txt')


      The content of reproducible.txt can be used to make a dataframe/tibble as shown below. In such a case data the format is machine readable, but it is hard to be undestood by human at first glance (without pasting into R).



      df <- structure(list(mpg = c(21, 21, 22.8, 21.4, 18.7, 18.1, 14.3,
      24.4, 22.8, 19.2), cyl = c(6, 6, 4, 6, 8, 6, 8, 4, 4, 6), disp = c(160,
      160, 108, 258, 360, 225, 360, 146.7, 140.8, 167.6), hp = c(110,
      110, 93, 110, 175, 105, 245, 62, 95, 123), drat = c(3.9, 3.9,
      3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92), wt = c(2.62,
      2.875, 2.32, 3.215, 3.44, 3.46, 3.57, 3.19, 3.15, 3.44), qsec = c(16.46,
      17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20, 22.9, 18.3), vs = c(0,
      0, 1, 1, 0, 1, 0, 1, 1, 1), am = c(1, 1, 1, 0, 0, 0, 0, 0, 0,
      0), gear = c(4, 4, 4, 3, 3, 3, 3, 4, 4, 4), carb = c(4, 4, 1,
      1, 2, 1, 4, 2, 2, 4)), .Names = c("mpg", "cyl", "disp", "hp",
      "drat", "wt", "qsec", "vs", "am", "gear", "carb"), row.names = c("Mazda RX4",
      "Mazda RX4 Wag", "Datsun 710", "Hornet 4 Drive", "Hornet Sportabout",
      "Valiant", "Duster 360", "Merc 240D", "Merc 230", "Merc 280"), class = "data.frame")





      share|improve this answer




















      • 5





        Thanks for answering, but I think you have not understood what I am asking. In the case of a dataframe already existing in an R session, there is no problem, the problem is to have a table written in a text document with plus signs, dashes an pipe signs. How can that type of table be read in by R?

        – Rui Barradas
        Aug 26 '18 at 6:53










      Your Answer






      StackExchange.ifUsing("editor", function ()
      StackExchange.using("externalEditor", function ()
      StackExchange.using("snippets", function ()
      StackExchange.snippets.init();
      );
      );
      , "code-snippets");

      StackExchange.ready(function()
      var channelOptions =
      tags: "".split(" "),
      id: "1"
      ;
      initTagRenderer("".split(" "), "".split(" "), channelOptions);

      StackExchange.using("externalEditor", function()
      // Have to fire editor after snippets, if snippets enabled
      if (StackExchange.settings.snippets.snippetsEnabled)
      StackExchange.using("snippets", function()
      createEditor();
      );

      else
      createEditor();

      );

      function createEditor()
      StackExchange.prepareEditor(
      heartbeatType: 'answer',
      autoActivateHeartbeat: false,
      convertImagesToLinks: true,
      noModals: true,
      showLowRepImageUploadWarning: true,
      reputationToPostImages: 10,
      bindNavPrevention: true,
      postfix: "",
      imageUploader:
      brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
      contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
      allowUrls: true
      ,
      onDemand: true,
      discardSelector: ".discard-answer"
      ,immediatelyShowMarkdownHelp:true
      );



      );













      draft saved

      draft discarded


















      StackExchange.ready(
      function ()
      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f52023709%2fwhat-can-r-do-about-a-messy-data-format%23new-answer', 'question_page');

      );

      Post as a guest















      Required, but never shown

























      6 Answers
      6






      active

      oldest

      votes








      6 Answers
      6






      active

      oldest

      votes









      active

      oldest

      votes






      active

      oldest

      votes









      24














      Using data.table::fread:



      x = '
      +------------+------+------+----------+--------------------------+
      | Date | Emp1 | Case | Priority | PriorityCountinLast7days |
      +------------+------+------+----------+--------------------------+
      | 2018-06-01 | A | A1 | 0 | 0 |
      | 2018-06-03 | A | A2 | 0 | 1 |
      | 2018-06-03 | A | A3 | 0 | 2 |
      | 2018-06-03 | A | A4 | 1 | 1 |
      | 2018-06-03 | A | A5 | 2 | 1 |
      | 2018-06-04 | A | A6 | 0 | 3 |
      | 2018-06-01 | B | B1 | 0 | 1 |
      | 2018-06-02 | B | B2 | 0 | 2 |
      | 2018-06-03 | B | B3 | 0 | 3 |
      +------------+------+------+----------+--------------------------+
      '

      fread(gsub('\+.+\n' ,'', x, perl = T), drop=c(1,7))

      # Date Emp1 Case Priority PriorityCountinLast7days
      # 1: 2018-06-01 A A1 0 0
      # 2: 2018-06-03 A A2 0 1
      # 3: 2018-06-03 A A3 0 2
      # 4: 2018-06-03 A A4 1 1
      # 5: 2018-06-03 A A5 2 1
      # 6: 2018-06-04 A A6 0 3
      # 7: 2018-06-01 B B1 0 1
      # 8: 2018-06-02 B B2 0 2
      # 9: 2018-06-03 B B3 0 3


      The gsub part removes the horizontal rules. drop removes the extra columns caused by delimiters at the line ends.






      share|improve this answer





























        24














        Using data.table::fread:



        x = '
        +------------+------+------+----------+--------------------------+
        | Date | Emp1 | Case | Priority | PriorityCountinLast7days |
        +------------+------+------+----------+--------------------------+
        | 2018-06-01 | A | A1 | 0 | 0 |
        | 2018-06-03 | A | A2 | 0 | 1 |
        | 2018-06-03 | A | A3 | 0 | 2 |
        | 2018-06-03 | A | A4 | 1 | 1 |
        | 2018-06-03 | A | A5 | 2 | 1 |
        | 2018-06-04 | A | A6 | 0 | 3 |
        | 2018-06-01 | B | B1 | 0 | 1 |
        | 2018-06-02 | B | B2 | 0 | 2 |
        | 2018-06-03 | B | B3 | 0 | 3 |
        +------------+------+------+----------+--------------------------+
        '

        fread(gsub('\+.+\n' ,'', x, perl = T), drop=c(1,7))

        # Date Emp1 Case Priority PriorityCountinLast7days
        # 1: 2018-06-01 A A1 0 0
        # 2: 2018-06-03 A A2 0 1
        # 3: 2018-06-03 A A3 0 2
        # 4: 2018-06-03 A A4 1 1
        # 5: 2018-06-03 A A5 2 1
        # 6: 2018-06-04 A A6 0 3
        # 7: 2018-06-01 B B1 0 1
        # 8: 2018-06-02 B B2 0 2
        # 9: 2018-06-03 B B3 0 3


        The gsub part removes the horizontal rules. drop removes the extra columns caused by delimiters at the line ends.






        share|improve this answer



























          24












          24








          24







          Using data.table::fread:



          x = '
          +------------+------+------+----------+--------------------------+
          | Date | Emp1 | Case | Priority | PriorityCountinLast7days |
          +------------+------+------+----------+--------------------------+
          | 2018-06-01 | A | A1 | 0 | 0 |
          | 2018-06-03 | A | A2 | 0 | 1 |
          | 2018-06-03 | A | A3 | 0 | 2 |
          | 2018-06-03 | A | A4 | 1 | 1 |
          | 2018-06-03 | A | A5 | 2 | 1 |
          | 2018-06-04 | A | A6 | 0 | 3 |
          | 2018-06-01 | B | B1 | 0 | 1 |
          | 2018-06-02 | B | B2 | 0 | 2 |
          | 2018-06-03 | B | B3 | 0 | 3 |
          +------------+------+------+----------+--------------------------+
          '

          fread(gsub('\+.+\n' ,'', x, perl = T), drop=c(1,7))

          # Date Emp1 Case Priority PriorityCountinLast7days
          # 1: 2018-06-01 A A1 0 0
          # 2: 2018-06-03 A A2 0 1
          # 3: 2018-06-03 A A3 0 2
          # 4: 2018-06-03 A A4 1 1
          # 5: 2018-06-03 A A5 2 1
          # 6: 2018-06-04 A A6 0 3
          # 7: 2018-06-01 B B1 0 1
          # 8: 2018-06-02 B B2 0 2
          # 9: 2018-06-03 B B3 0 3


          The gsub part removes the horizontal rules. drop removes the extra columns caused by delimiters at the line ends.






          share|improve this answer















          Using data.table::fread:



          x = '
          +------------+------+------+----------+--------------------------+
          | Date | Emp1 | Case | Priority | PriorityCountinLast7days |
          +------------+------+------+----------+--------------------------+
          | 2018-06-01 | A | A1 | 0 | 0 |
          | 2018-06-03 | A | A2 | 0 | 1 |
          | 2018-06-03 | A | A3 | 0 | 2 |
          | 2018-06-03 | A | A4 | 1 | 1 |
          | 2018-06-03 | A | A5 | 2 | 1 |
          | 2018-06-04 | A | A6 | 0 | 3 |
          | 2018-06-01 | B | B1 | 0 | 1 |
          | 2018-06-02 | B | B2 | 0 | 2 |
          | 2018-06-03 | B | B3 | 0 | 3 |
          +------------+------+------+----------+--------------------------+
          '

          fread(gsub('\+.+\n' ,'', x, perl = T), drop=c(1,7))

          # Date Emp1 Case Priority PriorityCountinLast7days
          # 1: 2018-06-01 A A1 0 0
          # 2: 2018-06-03 A A2 0 1
          # 3: 2018-06-03 A A3 0 2
          # 4: 2018-06-03 A A4 1 1
          # 5: 2018-06-03 A A5 2 1
          # 6: 2018-06-04 A A6 0 3
          # 7: 2018-06-01 B B1 0 1
          # 8: 2018-06-02 B B2 0 2
          # 9: 2018-06-03 B B3 0 3


          The gsub part removes the horizontal rules. drop removes the extra columns caused by delimiters at the line ends.







          share|improve this answer














          share|improve this answer



          share|improve this answer








          edited Aug 26 '18 at 18:59

























          answered Aug 26 '18 at 7:42









          dwwdww

          14.7k22655




          14.7k22655























              18














              The short answer to the question is yes, R code can solve that mess and no, it doesn't take that much trouble.



              The first step after copying & pasting the table into an R session is to read it in with read.table setting the header, sep, comment.char and strip.white arguments.



              Credits for reminding me of arguments comment.char and strip.white go to @nicola, and his comment.



              dat <- read.table(text = "
              +------------+------+------+----------+--------------------------+
              | Date | Emp1 | Case | Priority | PriorityCountinLast7days |
              +------------+------+------+----------+--------------------------+
              | 2018-06-01 | A | A1 | 0 | 0 |
              | 2018-06-03 | A | A2 | 0 | 1 |
              | 2018-06-03 | A | A3 | 0 | 2 |
              | 2018-06-03 | A | A4 | 1 | 1 |
              | 2018-06-03 | A | A5 | 2 | 1 |
              | 2018-06-04 | A | A6 | 0 | 3 |
              | 2018-06-01 | B | B1 | 0 | 1 |
              | 2018-06-02 | B | B2 | 0 | 2 |
              | 2018-06-03 | B | B3 | 0 | 3 |
              +------------+------+------+----------+--------------------------+
              ", header = TRUE, sep = "|", comment.char = "+", strip.white = TRUE)


              But as you can see there are some issues with the result.



              dat
              X Date Emp1 Case Priority PriorityCountinLast7days X.1
              1 NA 2018-06-01 A A1 0 0 NA
              2 NA 2018-06-03 A A2 0 1 NA
              3 NA 2018-06-03 A A3 0 2 NA
              4 NA 2018-06-03 A A4 1 1 NA
              5 NA 2018-06-03 A A5 2 1 NA
              6 NA 2018-06-04 A A6 0 3 NA
              7 NA 2018-06-01 B B1 0 1 NA
              8 NA 2018-06-02 B B2 0 2 NA
              9 NA 2018-06-03 B B3 0 3 NA


              To have separators start and end each data row made R believe those separators mark extra columns, which is not what is meant by the original question's OP.



              So the second step is to keep only the real columns. I will do this subsetting the columns by their numbers, easily done, they usually are the first and last columns.



              dat <- dat[-c(1, ncol(dat))]
              dat
              Date Emp1 Case Priority PriorityCountinLast7days
              1 2018-06-01 A A1 0 0
              2 2018-06-03 A A2 0 1
              3 2018-06-03 A A3 0 2
              4 2018-06-03 A A4 1 1
              5 2018-06-03 A A5 2 1
              6 2018-06-04 A A6 0 3
              7 2018-06-01 B B1 0 1
              8 2018-06-02 B B2 0 2
              9 2018-06-03 B B3 0 3


              That wasn't too hard, much better.

              In this case there is still a problem, to coerce column Date to class Date.



              dat$Date <- as.Date(dat$Date)


              And the result is satisfactory.



              str(dat)
              'data.frame': 9 obs. of 5 variables:
              $ Date : Date, format: "2018-06-01" "2018-06-03" ...
              $ Emp1 : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 2 2 2
              $ Case : Factor w/ 9 levels "A1","A2","A3",..: 1 2 3 4 5 6 7 8 9
              $ Priority : int 0 0 0 1 2 0 0 0 0
              $ PriorityCountinLast7days: int 0 1 2 1 1 3 1 2 3


              Note that I have not set the more or less standard argument stringsAsFactors = FALSE. If needed, this should be done when running read.table.



              The whole process took only 3 lines of base R code.



              Finally, the end result in dput format, like it should be in the first place.



              dat <-
              structure(list(Date = structure(c(17683, 17685, 17685, 17685,
              17685, 17686, 17683, 17684, 17685), class = "Date"), Emp1 = c("A",
              "A", "A", "A", "A", "A", "B", "B", "B"), Case = c("A1", "A2",
              "A3", "A4", "A5", "A6", "B1", "B2", "B3"), Priority = c(0, 0,
              0, 1, 2, 0, 0, 0, 0), PriorityCountinLast7days = c(0, 1, 2, 1,
              1, 3, 1, 2, 3)), row.names = c(NA, -9L), class = "data.frame")





              share|improve this answer




















              • 1





                @李哲源 I don't know about "safer", I find it more natural to use read table functions since after all that does look like a table.

                – Rui Barradas
                Aug 26 '18 at 6:46






              • 1





                I was about to post utils:::head.default(read.table("clipboard",comment.char="+",sep="|",strip.white=TRUE,header=TRUE,flush=TRUE)[-1],-1), which is not different from what you did (in my line, you manage the first and third line and remove the undesired columns).

                – nicola
                Aug 26 '18 at 8:13















              18














              The short answer to the question is yes, R code can solve that mess and no, it doesn't take that much trouble.



              The first step after copying & pasting the table into an R session is to read it in with read.table setting the header, sep, comment.char and strip.white arguments.



              Credits for reminding me of arguments comment.char and strip.white go to @nicola, and his comment.



              dat <- read.table(text = "
              +------------+------+------+----------+--------------------------+
              | Date | Emp1 | Case | Priority | PriorityCountinLast7days |
              +------------+------+------+----------+--------------------------+
              | 2018-06-01 | A | A1 | 0 | 0 |
              | 2018-06-03 | A | A2 | 0 | 1 |
              | 2018-06-03 | A | A3 | 0 | 2 |
              | 2018-06-03 | A | A4 | 1 | 1 |
              | 2018-06-03 | A | A5 | 2 | 1 |
              | 2018-06-04 | A | A6 | 0 | 3 |
              | 2018-06-01 | B | B1 | 0 | 1 |
              | 2018-06-02 | B | B2 | 0 | 2 |
              | 2018-06-03 | B | B3 | 0 | 3 |
              +------------+------+------+----------+--------------------------+
              ", header = TRUE, sep = "|", comment.char = "+", strip.white = TRUE)


              But as you can see there are some issues with the result.



              dat
              X Date Emp1 Case Priority PriorityCountinLast7days X.1
              1 NA 2018-06-01 A A1 0 0 NA
              2 NA 2018-06-03 A A2 0 1 NA
              3 NA 2018-06-03 A A3 0 2 NA
              4 NA 2018-06-03 A A4 1 1 NA
              5 NA 2018-06-03 A A5 2 1 NA
              6 NA 2018-06-04 A A6 0 3 NA
              7 NA 2018-06-01 B B1 0 1 NA
              8 NA 2018-06-02 B B2 0 2 NA
              9 NA 2018-06-03 B B3 0 3 NA


              To have separators start and end each data row made R believe those separators mark extra columns, which is not what is meant by the original question's OP.



              So the second step is to keep only the real columns. I will do this subsetting the columns by their numbers, easily done, they usually are the first and last columns.



              dat <- dat[-c(1, ncol(dat))]
              dat
              Date Emp1 Case Priority PriorityCountinLast7days
              1 2018-06-01 A A1 0 0
              2 2018-06-03 A A2 0 1
              3 2018-06-03 A A3 0 2
              4 2018-06-03 A A4 1 1
              5 2018-06-03 A A5 2 1
              6 2018-06-04 A A6 0 3
              7 2018-06-01 B B1 0 1
              8 2018-06-02 B B2 0 2
              9 2018-06-03 B B3 0 3


              That wasn't too hard, much better.

              In this case there is still a problem, to coerce column Date to class Date.



              dat$Date <- as.Date(dat$Date)


              And the result is satisfactory.



              str(dat)
              'data.frame': 9 obs. of 5 variables:
              $ Date : Date, format: "2018-06-01" "2018-06-03" ...
              $ Emp1 : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 2 2 2
              $ Case : Factor w/ 9 levels "A1","A2","A3",..: 1 2 3 4 5 6 7 8 9
              $ Priority : int 0 0 0 1 2 0 0 0 0
              $ PriorityCountinLast7days: int 0 1 2 1 1 3 1 2 3


              Note that I have not set the more or less standard argument stringsAsFactors = FALSE. If needed, this should be done when running read.table.



              The whole process took only 3 lines of base R code.



              Finally, the end result in dput format, like it should be in the first place.



              dat <-
              structure(list(Date = structure(c(17683, 17685, 17685, 17685,
              17685, 17686, 17683, 17684, 17685), class = "Date"), Emp1 = c("A",
              "A", "A", "A", "A", "A", "B", "B", "B"), Case = c("A1", "A2",
              "A3", "A4", "A5", "A6", "B1", "B2", "B3"), Priority = c(0, 0,
              0, 1, 2, 0, 0, 0, 0), PriorityCountinLast7days = c(0, 1, 2, 1,
              1, 3, 1, 2, 3)), row.names = c(NA, -9L), class = "data.frame")





              share|improve this answer




















              • 1





                @李哲源 I don't know about "safer", I find it more natural to use read table functions since after all that does look like a table.

                – Rui Barradas
                Aug 26 '18 at 6:46






              • 1





                I was about to post utils:::head.default(read.table("clipboard",comment.char="+",sep="|",strip.white=TRUE,header=TRUE,flush=TRUE)[-1],-1), which is not different from what you did (in my line, you manage the first and third line and remove the undesired columns).

                – nicola
                Aug 26 '18 at 8:13













              18












              18








              18







              The short answer to the question is yes, R code can solve that mess and no, it doesn't take that much trouble.



              The first step after copying & pasting the table into an R session is to read it in with read.table setting the header, sep, comment.char and strip.white arguments.



              Credits for reminding me of arguments comment.char and strip.white go to @nicola, and his comment.



              dat <- read.table(text = "
              +------------+------+------+----------+--------------------------+
              | Date | Emp1 | Case | Priority | PriorityCountinLast7days |
              +------------+------+------+----------+--------------------------+
              | 2018-06-01 | A | A1 | 0 | 0 |
              | 2018-06-03 | A | A2 | 0 | 1 |
              | 2018-06-03 | A | A3 | 0 | 2 |
              | 2018-06-03 | A | A4 | 1 | 1 |
              | 2018-06-03 | A | A5 | 2 | 1 |
              | 2018-06-04 | A | A6 | 0 | 3 |
              | 2018-06-01 | B | B1 | 0 | 1 |
              | 2018-06-02 | B | B2 | 0 | 2 |
              | 2018-06-03 | B | B3 | 0 | 3 |
              +------------+------+------+----------+--------------------------+
              ", header = TRUE, sep = "|", comment.char = "+", strip.white = TRUE)


              But as you can see there are some issues with the result.



              dat
              X Date Emp1 Case Priority PriorityCountinLast7days X.1
              1 NA 2018-06-01 A A1 0 0 NA
              2 NA 2018-06-03 A A2 0 1 NA
              3 NA 2018-06-03 A A3 0 2 NA
              4 NA 2018-06-03 A A4 1 1 NA
              5 NA 2018-06-03 A A5 2 1 NA
              6 NA 2018-06-04 A A6 0 3 NA
              7 NA 2018-06-01 B B1 0 1 NA
              8 NA 2018-06-02 B B2 0 2 NA
              9 NA 2018-06-03 B B3 0 3 NA


              To have separators start and end each data row made R believe those separators mark extra columns, which is not what is meant by the original question's OP.



              So the second step is to keep only the real columns. I will do this subsetting the columns by their numbers, easily done, they usually are the first and last columns.



              dat <- dat[-c(1, ncol(dat))]
              dat
              Date Emp1 Case Priority PriorityCountinLast7days
              1 2018-06-01 A A1 0 0
              2 2018-06-03 A A2 0 1
              3 2018-06-03 A A3 0 2
              4 2018-06-03 A A4 1 1
              5 2018-06-03 A A5 2 1
              6 2018-06-04 A A6 0 3
              7 2018-06-01 B B1 0 1
              8 2018-06-02 B B2 0 2
              9 2018-06-03 B B3 0 3


              That wasn't too hard, much better.

              In this case there is still a problem, to coerce column Date to class Date.



              dat$Date <- as.Date(dat$Date)


              And the result is satisfactory.



              str(dat)
              'data.frame': 9 obs. of 5 variables:
              $ Date : Date, format: "2018-06-01" "2018-06-03" ...
              $ Emp1 : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 2 2 2
              $ Case : Factor w/ 9 levels "A1","A2","A3",..: 1 2 3 4 5 6 7 8 9
              $ Priority : int 0 0 0 1 2 0 0 0 0
              $ PriorityCountinLast7days: int 0 1 2 1 1 3 1 2 3


              Note that I have not set the more or less standard argument stringsAsFactors = FALSE. If needed, this should be done when running read.table.



              The whole process took only 3 lines of base R code.



              Finally, the end result in dput format, like it should be in the first place.



              dat <-
              structure(list(Date = structure(c(17683, 17685, 17685, 17685,
              17685, 17686, 17683, 17684, 17685), class = "Date"), Emp1 = c("A",
              "A", "A", "A", "A", "A", "B", "B", "B"), Case = c("A1", "A2",
              "A3", "A4", "A5", "A6", "B1", "B2", "B3"), Priority = c(0, 0,
              0, 1, 2, 0, 0, 0, 0), PriorityCountinLast7days = c(0, 1, 2, 1,
              1, 3, 1, 2, 3)), row.names = c(NA, -9L), class = "data.frame")





              share|improve this answer















              The short answer to the question is yes, R code can solve that mess and no, it doesn't take that much trouble.



              The first step after copying & pasting the table into an R session is to read it in with read.table setting the header, sep, comment.char and strip.white arguments.



              Credits for reminding me of arguments comment.char and strip.white go to @nicola, and his comment.



              dat <- read.table(text = "
              +------------+------+------+----------+--------------------------+
              | Date | Emp1 | Case | Priority | PriorityCountinLast7days |
              +------------+------+------+----------+--------------------------+
              | 2018-06-01 | A | A1 | 0 | 0 |
              | 2018-06-03 | A | A2 | 0 | 1 |
              | 2018-06-03 | A | A3 | 0 | 2 |
              | 2018-06-03 | A | A4 | 1 | 1 |
              | 2018-06-03 | A | A5 | 2 | 1 |
              | 2018-06-04 | A | A6 | 0 | 3 |
              | 2018-06-01 | B | B1 | 0 | 1 |
              | 2018-06-02 | B | B2 | 0 | 2 |
              | 2018-06-03 | B | B3 | 0 | 3 |
              +------------+------+------+----------+--------------------------+
              ", header = TRUE, sep = "|", comment.char = "+", strip.white = TRUE)


              But as you can see there are some issues with the result.



              dat
              X Date Emp1 Case Priority PriorityCountinLast7days X.1
              1 NA 2018-06-01 A A1 0 0 NA
              2 NA 2018-06-03 A A2 0 1 NA
              3 NA 2018-06-03 A A3 0 2 NA
              4 NA 2018-06-03 A A4 1 1 NA
              5 NA 2018-06-03 A A5 2 1 NA
              6 NA 2018-06-04 A A6 0 3 NA
              7 NA 2018-06-01 B B1 0 1 NA
              8 NA 2018-06-02 B B2 0 2 NA
              9 NA 2018-06-03 B B3 0 3 NA


              To have separators start and end each data row made R believe those separators mark extra columns, which is not what is meant by the original question's OP.



              So the second step is to keep only the real columns. I will do this subsetting the columns by their numbers, easily done, they usually are the first and last columns.



              dat <- dat[-c(1, ncol(dat))]
              dat
              Date Emp1 Case Priority PriorityCountinLast7days
              1 2018-06-01 A A1 0 0
              2 2018-06-03 A A2 0 1
              3 2018-06-03 A A3 0 2
              4 2018-06-03 A A4 1 1
              5 2018-06-03 A A5 2 1
              6 2018-06-04 A A6 0 3
              7 2018-06-01 B B1 0 1
              8 2018-06-02 B B2 0 2
              9 2018-06-03 B B3 0 3


              That wasn't too hard, much better.

              In this case there is still a problem, to coerce column Date to class Date.



              dat$Date <- as.Date(dat$Date)


              And the result is satisfactory.



              str(dat)
              'data.frame': 9 obs. of 5 variables:
              $ Date : Date, format: "2018-06-01" "2018-06-03" ...
              $ Emp1 : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 2 2 2
              $ Case : Factor w/ 9 levels "A1","A2","A3",..: 1 2 3 4 5 6 7 8 9
              $ Priority : int 0 0 0 1 2 0 0 0 0
              $ PriorityCountinLast7days: int 0 1 2 1 1 3 1 2 3


              Note that I have not set the more or less standard argument stringsAsFactors = FALSE. If needed, this should be done when running read.table.



              The whole process took only 3 lines of base R code.



              Finally, the end result in dput format, like it should be in the first place.



              dat <-
              structure(list(Date = structure(c(17683, 17685, 17685, 17685,
              17685, 17686, 17683, 17684, 17685), class = "Date"), Emp1 = c("A",
              "A", "A", "A", "A", "A", "B", "B", "B"), Case = c("A1", "A2",
              "A3", "A4", "A5", "A6", "B1", "B2", "B3"), Priority = c(0, 0,
              0, 1, 2, 0, 0, 0, 0), PriorityCountinLast7days = c(0, 1, 2, 1,
              1, 3, 1, 2, 3)), row.names = c(NA, -9L), class = "data.frame")






              share|improve this answer














              share|improve this answer



              share|improve this answer








              edited Aug 27 '18 at 2:01

























              answered Aug 26 '18 at 6:41









              Rui BarradasRui Barradas

              16.4k51730




              16.4k51730







              • 1





                @李哲源 I don't know about "safer", I find it more natural to use read table functions since after all that does look like a table.

                – Rui Barradas
                Aug 26 '18 at 6:46






              • 1





                I was about to post utils:::head.default(read.table("clipboard",comment.char="+",sep="|",strip.white=TRUE,header=TRUE,flush=TRUE)[-1],-1), which is not different from what you did (in my line, you manage the first and third line and remove the undesired columns).

                – nicola
                Aug 26 '18 at 8:13












              • 1





                @李哲源 I don't know about "safer", I find it more natural to use read table functions since after all that does look like a table.

                – Rui Barradas
                Aug 26 '18 at 6:46






              • 1





                I was about to post utils:::head.default(read.table("clipboard",comment.char="+",sep="|",strip.white=TRUE,header=TRUE,flush=TRUE)[-1],-1), which is not different from what you did (in my line, you manage the first and third line and remove the undesired columns).

                – nicola
                Aug 26 '18 at 8:13







              1




              1





              @李哲源 I don't know about "safer", I find it more natural to use read table functions since after all that does look like a table.

              – Rui Barradas
              Aug 26 '18 at 6:46





              @李哲源 I don't know about "safer", I find it more natural to use read table functions since after all that does look like a table.

              – Rui Barradas
              Aug 26 '18 at 6:46




              1




              1





              I was about to post utils:::head.default(read.table("clipboard",comment.char="+",sep="|",strip.white=TRUE,header=TRUE,flush=TRUE)[-1],-1), which is not different from what you did (in my line, you manage the first and third line and remove the undesired columns).

              – nicola
              Aug 26 '18 at 8:13





              I was about to post utils:::head.default(read.table("clipboard",comment.char="+",sep="|",strip.white=TRUE,header=TRUE,flush=TRUE)[-1],-1), which is not different from what you did (in my line, you manage the first and third line and remove the undesired columns).

              – nicola
              Aug 26 '18 at 8:13











              5














              md_table <- scan(text = "
              +------------+------+------+----------+--------------------------+
              | Date | Emp1 | Case | Priority | PriorityCountinLast7days |
              +------------+------+------+----------+--------------------------+
              | 2018-06-01 | A | A1 | 0 | 0 |
              | 2018-06-03 | A | A2 | 0 | 1 |
              | 2018-06-03 | A | A3 | 0 | 2 |
              | 2018-06-03 | A | A4 | 1 | 1 |
              | 2018-06-03 | A | A5 | 2 | 1 |
              | 2018-06-04 | A | A6 | 0 | 3 |
              | 2018-06-01 | B | B1 | 0 | 1 |
              | 2018-06-02 | B | B2 | 0 | 2 |
              | 2018-06-03 | B | B3 | 0 | 3 |
              +------------+------+------+----------+--------------------------+",
              what = "", sep = "", comment.char = "+", quiet = TRUE)

              ## it is clear that there are 5 columns
              mat <- matrix(md_table[md_table != "|"], ncol = 5, byrow = TRUE)
              # [,1] [,2] [,3] [,4] [,5]
              # [1,] "Date" "Emp1" "Case" "Priority" "PriorityCountinLast7days"
              # [2,] "2018-06-01" "A" "A1" "0" "0"
              # [3,] "2018-06-03" "A" "A2" "0" "1"
              # [4,] "2018-06-03" "A" "A3" "0" "2"
              # [5,] "2018-06-03" "A" "A4" "1" "1"
              # [6,] "2018-06-03" "A" "A5" "2" "1"
              # [7,] "2018-06-04" "A" "A6" "0" "3"
              # [8,] "2018-06-01" "B" "B1" "0" "1"
              # [9,] "2018-06-02" "B" "B2" "0" "2"
              #[10,] "2018-06-03" "B" "B3" "0" "3"




              ## a data frame with all character columns
              dat <- setNames(data.frame(mat[-1, ], stringsAsFactors = FALSE), mat[1, ])
              # Date Emp1 Case Priority PriorityCountinLast7days
              #1 2018-06-01 A A1 0 0
              #2 2018-06-03 A A2 0 1
              #3 2018-06-03 A A3 0 2
              #4 2018-06-03 A A4 1 1
              #5 2018-06-03 A A5 2 1
              #6 2018-06-04 A A6 0 3
              #7 2018-06-01 B B1 0 1
              #8 2018-06-02 B B2 0 2
              #9 2018-06-03 B B3 0 3




              ## or maybe just use `type.convert` on some columns?
              dat <- lapply(dat, type.convert)





              share|improve this answer





























                5














                md_table <- scan(text = "
                +------------+------+------+----------+--------------------------+
                | Date | Emp1 | Case | Priority | PriorityCountinLast7days |
                +------------+------+------+----------+--------------------------+
                | 2018-06-01 | A | A1 | 0 | 0 |
                | 2018-06-03 | A | A2 | 0 | 1 |
                | 2018-06-03 | A | A3 | 0 | 2 |
                | 2018-06-03 | A | A4 | 1 | 1 |
                | 2018-06-03 | A | A5 | 2 | 1 |
                | 2018-06-04 | A | A6 | 0 | 3 |
                | 2018-06-01 | B | B1 | 0 | 1 |
                | 2018-06-02 | B | B2 | 0 | 2 |
                | 2018-06-03 | B | B3 | 0 | 3 |
                +------------+------+------+----------+--------------------------+",
                what = "", sep = "", comment.char = "+", quiet = TRUE)

                ## it is clear that there are 5 columns
                mat <- matrix(md_table[md_table != "|"], ncol = 5, byrow = TRUE)
                # [,1] [,2] [,3] [,4] [,5]
                # [1,] "Date" "Emp1" "Case" "Priority" "PriorityCountinLast7days"
                # [2,] "2018-06-01" "A" "A1" "0" "0"
                # [3,] "2018-06-03" "A" "A2" "0" "1"
                # [4,] "2018-06-03" "A" "A3" "0" "2"
                # [5,] "2018-06-03" "A" "A4" "1" "1"
                # [6,] "2018-06-03" "A" "A5" "2" "1"
                # [7,] "2018-06-04" "A" "A6" "0" "3"
                # [8,] "2018-06-01" "B" "B1" "0" "1"
                # [9,] "2018-06-02" "B" "B2" "0" "2"
                #[10,] "2018-06-03" "B" "B3" "0" "3"




                ## a data frame with all character columns
                dat <- setNames(data.frame(mat[-1, ], stringsAsFactors = FALSE), mat[1, ])
                # Date Emp1 Case Priority PriorityCountinLast7days
                #1 2018-06-01 A A1 0 0
                #2 2018-06-03 A A2 0 1
                #3 2018-06-03 A A3 0 2
                #4 2018-06-03 A A4 1 1
                #5 2018-06-03 A A5 2 1
                #6 2018-06-04 A A6 0 3
                #7 2018-06-01 B B1 0 1
                #8 2018-06-02 B B2 0 2
                #9 2018-06-03 B B3 0 3




                ## or maybe just use `type.convert` on some columns?
                dat <- lapply(dat, type.convert)





                share|improve this answer



























                  5












                  5








                  5







                  md_table <- scan(text = "
                  +------------+------+------+----------+--------------------------+
                  | Date | Emp1 | Case | Priority | PriorityCountinLast7days |
                  +------------+------+------+----------+--------------------------+
                  | 2018-06-01 | A | A1 | 0 | 0 |
                  | 2018-06-03 | A | A2 | 0 | 1 |
                  | 2018-06-03 | A | A3 | 0 | 2 |
                  | 2018-06-03 | A | A4 | 1 | 1 |
                  | 2018-06-03 | A | A5 | 2 | 1 |
                  | 2018-06-04 | A | A6 | 0 | 3 |
                  | 2018-06-01 | B | B1 | 0 | 1 |
                  | 2018-06-02 | B | B2 | 0 | 2 |
                  | 2018-06-03 | B | B3 | 0 | 3 |
                  +------------+------+------+----------+--------------------------+",
                  what = "", sep = "", comment.char = "+", quiet = TRUE)

                  ## it is clear that there are 5 columns
                  mat <- matrix(md_table[md_table != "|"], ncol = 5, byrow = TRUE)
                  # [,1] [,2] [,3] [,4] [,5]
                  # [1,] "Date" "Emp1" "Case" "Priority" "PriorityCountinLast7days"
                  # [2,] "2018-06-01" "A" "A1" "0" "0"
                  # [3,] "2018-06-03" "A" "A2" "0" "1"
                  # [4,] "2018-06-03" "A" "A3" "0" "2"
                  # [5,] "2018-06-03" "A" "A4" "1" "1"
                  # [6,] "2018-06-03" "A" "A5" "2" "1"
                  # [7,] "2018-06-04" "A" "A6" "0" "3"
                  # [8,] "2018-06-01" "B" "B1" "0" "1"
                  # [9,] "2018-06-02" "B" "B2" "0" "2"
                  #[10,] "2018-06-03" "B" "B3" "0" "3"




                  ## a data frame with all character columns
                  dat <- setNames(data.frame(mat[-1, ], stringsAsFactors = FALSE), mat[1, ])
                  # Date Emp1 Case Priority PriorityCountinLast7days
                  #1 2018-06-01 A A1 0 0
                  #2 2018-06-03 A A2 0 1
                  #3 2018-06-03 A A3 0 2
                  #4 2018-06-03 A A4 1 1
                  #5 2018-06-03 A A5 2 1
                  #6 2018-06-04 A A6 0 3
                  #7 2018-06-01 B B1 0 1
                  #8 2018-06-02 B B2 0 2
                  #9 2018-06-03 B B3 0 3




                  ## or maybe just use `type.convert` on some columns?
                  dat <- lapply(dat, type.convert)





                  share|improve this answer















                  md_table <- scan(text = "
                  +------------+------+------+----------+--------------------------+
                  | Date | Emp1 | Case | Priority | PriorityCountinLast7days |
                  +------------+------+------+----------+--------------------------+
                  | 2018-06-01 | A | A1 | 0 | 0 |
                  | 2018-06-03 | A | A2 | 0 | 1 |
                  | 2018-06-03 | A | A3 | 0 | 2 |
                  | 2018-06-03 | A | A4 | 1 | 1 |
                  | 2018-06-03 | A | A5 | 2 | 1 |
                  | 2018-06-04 | A | A6 | 0 | 3 |
                  | 2018-06-01 | B | B1 | 0 | 1 |
                  | 2018-06-02 | B | B2 | 0 | 2 |
                  | 2018-06-03 | B | B3 | 0 | 3 |
                  +------------+------+------+----------+--------------------------+",
                  what = "", sep = "", comment.char = "+", quiet = TRUE)

                  ## it is clear that there are 5 columns
                  mat <- matrix(md_table[md_table != "|"], ncol = 5, byrow = TRUE)
                  # [,1] [,2] [,3] [,4] [,5]
                  # [1,] "Date" "Emp1" "Case" "Priority" "PriorityCountinLast7days"
                  # [2,] "2018-06-01" "A" "A1" "0" "0"
                  # [3,] "2018-06-03" "A" "A2" "0" "1"
                  # [4,] "2018-06-03" "A" "A3" "0" "2"
                  # [5,] "2018-06-03" "A" "A4" "1" "1"
                  # [6,] "2018-06-03" "A" "A5" "2" "1"
                  # [7,] "2018-06-04" "A" "A6" "0" "3"
                  # [8,] "2018-06-01" "B" "B1" "0" "1"
                  # [9,] "2018-06-02" "B" "B2" "0" "2"
                  #[10,] "2018-06-03" "B" "B3" "0" "3"




                  ## a data frame with all character columns
                  dat <- setNames(data.frame(mat[-1, ], stringsAsFactors = FALSE), mat[1, ])
                  # Date Emp1 Case Priority PriorityCountinLast7days
                  #1 2018-06-01 A A1 0 0
                  #2 2018-06-03 A A2 0 1
                  #3 2018-06-03 A A3 0 2
                  #4 2018-06-03 A A4 1 1
                  #5 2018-06-03 A A5 2 1
                  #6 2018-06-04 A A6 0 3
                  #7 2018-06-01 B B1 0 1
                  #8 2018-06-02 B B2 0 2
                  #9 2018-06-03 B B3 0 3




                  ## or maybe just use `type.convert` on some columns?
                  dat <- lapply(dat, type.convert)






                  share|improve this answer














                  share|improve this answer



                  share|improve this answer








                  edited Aug 26 '18 at 8:05

























                  answered Aug 26 '18 at 6:32









                  李哲源李哲源

                  47.8k1494145




                  47.8k1494145





















                      2














                      Well, about this specific dataset I used the import feature in RStudio, but I took one additional step beforehand.



                      1. Copy the dataset into the Notepad file.

                      2. Replace all | characters with ,


                      3. Import the Notepad file using read.csv to RStudio using this code (seperate columns by ,).

                      But, if you mean use the R to fully understand it in one step, then I have no idea.






                      share|improve this answer

























                      • Notepad? Are you assuming Microsoft Windows?

                        – Peter Mortensen
                        Aug 26 '18 at 11:16











                      • @PeterMortensen Yes, but it could be any other editor as well.

                        – Salman Lashkarara
                        Aug 26 '18 at 11:26















                      2














                      Well, about this specific dataset I used the import feature in RStudio, but I took one additional step beforehand.



                      1. Copy the dataset into the Notepad file.

                      2. Replace all | characters with ,


                      3. Import the Notepad file using read.csv to RStudio using this code (seperate columns by ,).

                      But, if you mean use the R to fully understand it in one step, then I have no idea.






                      share|improve this answer

























                      • Notepad? Are you assuming Microsoft Windows?

                        – Peter Mortensen
                        Aug 26 '18 at 11:16











                      • @PeterMortensen Yes, but it could be any other editor as well.

                        – Salman Lashkarara
                        Aug 26 '18 at 11:26













                      2












                      2








                      2







                      Well, about this specific dataset I used the import feature in RStudio, but I took one additional step beforehand.



                      1. Copy the dataset into the Notepad file.

                      2. Replace all | characters with ,


                      3. Import the Notepad file using read.csv to RStudio using this code (seperate columns by ,).

                      But, if you mean use the R to fully understand it in one step, then I have no idea.






                      share|improve this answer















                      Well, about this specific dataset I used the import feature in RStudio, but I took one additional step beforehand.



                      1. Copy the dataset into the Notepad file.

                      2. Replace all | characters with ,


                      3. Import the Notepad file using read.csv to RStudio using this code (seperate columns by ,).

                      But, if you mean use the R to fully understand it in one step, then I have no idea.







                      share|improve this answer














                      share|improve this answer



                      share|improve this answer








                      edited Aug 26 '18 at 11:15









                      Peter Mortensen

                      13.5k1984111




                      13.5k1984111










                      answered Aug 26 '18 at 6:38









                      Salman LashkararaSalman Lashkarara

                      4,6991560110




                      4,6991560110












                      • Notepad? Are you assuming Microsoft Windows?

                        – Peter Mortensen
                        Aug 26 '18 at 11:16











                      • @PeterMortensen Yes, but it could be any other editor as well.

                        – Salman Lashkarara
                        Aug 26 '18 at 11:26

















                      • Notepad? Are you assuming Microsoft Windows?

                        – Peter Mortensen
                        Aug 26 '18 at 11:16











                      • @PeterMortensen Yes, but it could be any other editor as well.

                        – Salman Lashkarara
                        Aug 26 '18 at 11:26
















                      Notepad? Are you assuming Microsoft Windows?

                      – Peter Mortensen
                      Aug 26 '18 at 11:16





                      Notepad? Are you assuming Microsoft Windows?

                      – Peter Mortensen
                      Aug 26 '18 at 11:16













                      @PeterMortensen Yes, but it could be any other editor as well.

                      – Salman Lashkarara
                      Aug 26 '18 at 11:26





                      @PeterMortensen Yes, but it could be any other editor as well.

                      – Salman Lashkarara
                      Aug 26 '18 at 11:26











                      2














                      The issue isn't so much how many lines of code it takes, two or five, not much difference. The question is more whether it will work beyond the example you posted here.



                      I haven't come across this sort of thing in the wild, but I had a go at constructing another example that I thought could conceivably exist.



                      I've since come across a couple more cases and added them to the test suite.



                      x1 <- "
                      +------------+------+------+----------+--------------------------+
                      | Date | Emp1 | Case | Priority | PriorityCountinLast7days |
                      +------------+------+------+----------+--------------------------+
                      | 2018-06-01 | A | A1 | 0 | 0 |
                      | 2018-06-03 | A | A2 | 0 | 1 |
                      | 2018-06-02 | B | B2 | 0 | 2 |
                      | 2018-06-03 | B | B3 | 0 | 3 |
                      +------------+------+------+----------+--------------------------+
                      "

                      x2 <- "
                      ––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
                      Date | Emp1 | Case | Priority | PriorityCountinLast7days
                      ––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
                      2018-06-01 | A | A|1 | 0 | 0
                      2018-06-03 | A | A|2 | 0 | 1
                      2018-06-02 | B | B|2 | 0 | 2
                      2018-06-03 | B | B|3 | 0 | 3
                      ––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
                      "

                      x3 <- "
                      Maths | English | Science | History | Class

                      0.1 | 0.2 | 0.3 | 0.2 | Y2

                      0.9 | 0.5 | 0.7 | 0.4 | Y1

                      0.2 | 0.4 | 0.6 | 0.2 | Y2

                      0.9 | 0.5 | 0.2 | 0.7 | Y1
                      "

                      x4 <- "
                      Season | Team | W | AHWO
                      -------------------------------------
                      1 | 2017/2018 | TeamA | 2 | 1.75
                      2 | 2017/2018 | TeamB | 1 | 1.85
                      3 | 2017/2018 | TeamC | 1 | 1.70
                      4 | 2017/2018 | TeamD | 0 | 3.10
                      5 | 2016/2017 | TeamA | 1 | 1.49
                      6 | 2016/2017 | TeamB | 3 | 1.51
                      7 | 2016/2017 | TeamC | 2 | 1.90
                      8 | 2016/2017 | TeamD | 0 | N/A
                      "


                      My go at a function



                      f <- function(x, header=TRUE, na.strings=c("NA", "N/A"), 
                      stringsAsFactors=FALSE, ...) ", " ", x)

                      # read the result as a table
                      read.table(text=paste(x, collapse="n"), header=header,
                      na.strings=na.strings, stringsAsFactors=stringsAsFactors, ...)


                      lapply(c(x1, x2, x3, x4, x5), f)


                      Output



                      [[1]]
                      Date Emp1 Case Priority PriorityCountinLast7days
                      1 2018-06-01 A A1 0 0
                      2 2018-06-03 A A2 0 1
                      3 2018-06-02 B B2 0 2
                      4 2018-06-03 B B3 0 3

                      [[2]]
                      Date Emp1 Case Priority PriorityCountinLast7days
                      1 2018-06-01 A A|1 0 0
                      2 2018-06-03 A A|2 0 1
                      3 2018-06-02 B B|2 0 2
                      4 2018-06-03 B B|3 0 3

                      [[3]]
                      Maths English Science History Class
                      1 0.1 0.2 0.3 0.2 Y2
                      2 0.9 0.5 0.7 0.4 Y1
                      3 0.2 0.4 0.6 0.2 Y2
                      4 0.9 0.5 0.2 0.7 Y1

                      [[4]]
                      Season Team W AHWO
                      1 2017/2018 TeamA 2 1.75
                      2 2017/2018 TeamB 1 1.85
                      3 2017/2018 TeamC 1 1.70
                      4 2017/2018 TeamD 0 3.10
                      5 2016/2017 TeamA 1 1.49
                      6 2016/2017 TeamB 3 1.51
                      7 2016/2017 TeamC 2 1.90
                      8 2016/2017 TeamD 0 NA



                      x3 is from here (will have to look at the edit history).

                      x4 is from here






                      share|improve this answer

























                      • Great answer, upvote. The more general the solution the better.

                        – Rui Barradas
                        Oct 2 '18 at 13:27















                      2














                      The issue isn't so much how many lines of code it takes, two or five, not much difference. The question is more whether it will work beyond the example you posted here.



                      I haven't come across this sort of thing in the wild, but I had a go at constructing another example that I thought could conceivably exist.



                      I've since come across a couple more cases and added them to the test suite.



                      x1 <- "
                      +------------+------+------+----------+--------------------------+
                      | Date | Emp1 | Case | Priority | PriorityCountinLast7days |
                      +------------+------+------+----------+--------------------------+
                      | 2018-06-01 | A | A1 | 0 | 0 |
                      | 2018-06-03 | A | A2 | 0 | 1 |
                      | 2018-06-02 | B | B2 | 0 | 2 |
                      | 2018-06-03 | B | B3 | 0 | 3 |
                      +------------+------+------+----------+--------------------------+
                      "

                      x2 <- "
                      ––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
                      Date | Emp1 | Case | Priority | PriorityCountinLast7days
                      ––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
                      2018-06-01 | A | A|1 | 0 | 0
                      2018-06-03 | A | A|2 | 0 | 1
                      2018-06-02 | B | B|2 | 0 | 2
                      2018-06-03 | B | B|3 | 0 | 3
                      ––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
                      "

                      x3 <- "
                      Maths | English | Science | History | Class

                      0.1 | 0.2 | 0.3 | 0.2 | Y2

                      0.9 | 0.5 | 0.7 | 0.4 | Y1

                      0.2 | 0.4 | 0.6 | 0.2 | Y2

                      0.9 | 0.5 | 0.2 | 0.7 | Y1
                      "

                      x4 <- "
                      Season | Team | W | AHWO
                      -------------------------------------
                      1 | 2017/2018 | TeamA | 2 | 1.75
                      2 | 2017/2018 | TeamB | 1 | 1.85
                      3 | 2017/2018 | TeamC | 1 | 1.70
                      4 | 2017/2018 | TeamD | 0 | 3.10
                      5 | 2016/2017 | TeamA | 1 | 1.49
                      6 | 2016/2017 | TeamB | 3 | 1.51
                      7 | 2016/2017 | TeamC | 2 | 1.90
                      8 | 2016/2017 | TeamD | 0 | N/A
                      "


                      My go at a function



                      f <- function(x, header=TRUE, na.strings=c("NA", "N/A"), 
                      stringsAsFactors=FALSE, ...) ", " ", x)

                      # read the result as a table
                      read.table(text=paste(x, collapse="n"), header=header,
                      na.strings=na.strings, stringsAsFactors=stringsAsFactors, ...)


                      lapply(c(x1, x2, x3, x4, x5), f)


                      Output



                      [[1]]
                      Date Emp1 Case Priority PriorityCountinLast7days
                      1 2018-06-01 A A1 0 0
                      2 2018-06-03 A A2 0 1
                      3 2018-06-02 B B2 0 2
                      4 2018-06-03 B B3 0 3

                      [[2]]
                      Date Emp1 Case Priority PriorityCountinLast7days
                      1 2018-06-01 A A|1 0 0
                      2 2018-06-03 A A|2 0 1
                      3 2018-06-02 B B|2 0 2
                      4 2018-06-03 B B|3 0 3

                      [[3]]
                      Maths English Science History Class
                      1 0.1 0.2 0.3 0.2 Y2
                      2 0.9 0.5 0.7 0.4 Y1
                      3 0.2 0.4 0.6 0.2 Y2
                      4 0.9 0.5 0.2 0.7 Y1

                      [[4]]
                      Season Team W AHWO
                      1 2017/2018 TeamA 2 1.75
                      2 2017/2018 TeamB 1 1.85
                      3 2017/2018 TeamC 1 1.70
                      4 2017/2018 TeamD 0 3.10
                      5 2016/2017 TeamA 1 1.49
                      6 2016/2017 TeamB 3 1.51
                      7 2016/2017 TeamC 2 1.90
                      8 2016/2017 TeamD 0 NA



                      x3 is from here (will have to look at the edit history).

                      x4 is from here






                      share|improve this answer

























                      • Great answer, upvote. The more general the solution the better.

                        – Rui Barradas
                        Oct 2 '18 at 13:27













                      2












                      2








                      2







                      The issue isn't so much how many lines of code it takes, two or five, not much difference. The question is more whether it will work beyond the example you posted here.



                      I haven't come across this sort of thing in the wild, but I had a go at constructing another example that I thought could conceivably exist.



                      I've since come across a couple more cases and added them to the test suite.



                      x1 <- "
                      +------------+------+------+----------+--------------------------+
                      | Date | Emp1 | Case | Priority | PriorityCountinLast7days |
                      +------------+------+------+----------+--------------------------+
                      | 2018-06-01 | A | A1 | 0 | 0 |
                      | 2018-06-03 | A | A2 | 0 | 1 |
                      | 2018-06-02 | B | B2 | 0 | 2 |
                      | 2018-06-03 | B | B3 | 0 | 3 |
                      +------------+------+------+----------+--------------------------+
                      "

                      x2 <- "
                      ––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
                      Date | Emp1 | Case | Priority | PriorityCountinLast7days
                      ––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
                      2018-06-01 | A | A|1 | 0 | 0
                      2018-06-03 | A | A|2 | 0 | 1
                      2018-06-02 | B | B|2 | 0 | 2
                      2018-06-03 | B | B|3 | 0 | 3
                      ––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
                      "

                      x3 <- "
                      Maths | English | Science | History | Class

                      0.1 | 0.2 | 0.3 | 0.2 | Y2

                      0.9 | 0.5 | 0.7 | 0.4 | Y1

                      0.2 | 0.4 | 0.6 | 0.2 | Y2

                      0.9 | 0.5 | 0.2 | 0.7 | Y1
                      "

                      x4 <- "
                      Season | Team | W | AHWO
                      -------------------------------------
                      1 | 2017/2018 | TeamA | 2 | 1.75
                      2 | 2017/2018 | TeamB | 1 | 1.85
                      3 | 2017/2018 | TeamC | 1 | 1.70
                      4 | 2017/2018 | TeamD | 0 | 3.10
                      5 | 2016/2017 | TeamA | 1 | 1.49
                      6 | 2016/2017 | TeamB | 3 | 1.51
                      7 | 2016/2017 | TeamC | 2 | 1.90
                      8 | 2016/2017 | TeamD | 0 | N/A
                      "


                      My go at a function



                      f <- function(x, header=TRUE, na.strings=c("NA", "N/A"), 
                      stringsAsFactors=FALSE, ...) ", " ", x)

                      # read the result as a table
                      read.table(text=paste(x, collapse="n"), header=header,
                      na.strings=na.strings, stringsAsFactors=stringsAsFactors, ...)


                      lapply(c(x1, x2, x3, x4, x5), f)


                      Output



                      [[1]]
                      Date Emp1 Case Priority PriorityCountinLast7days
                      1 2018-06-01 A A1 0 0
                      2 2018-06-03 A A2 0 1
                      3 2018-06-02 B B2 0 2
                      4 2018-06-03 B B3 0 3

                      [[2]]
                      Date Emp1 Case Priority PriorityCountinLast7days
                      1 2018-06-01 A A|1 0 0
                      2 2018-06-03 A A|2 0 1
                      3 2018-06-02 B B|2 0 2
                      4 2018-06-03 B B|3 0 3

                      [[3]]
                      Maths English Science History Class
                      1 0.1 0.2 0.3 0.2 Y2
                      2 0.9 0.5 0.7 0.4 Y1
                      3 0.2 0.4 0.6 0.2 Y2
                      4 0.9 0.5 0.2 0.7 Y1

                      [[4]]
                      Season Team W AHWO
                      1 2017/2018 TeamA 2 1.75
                      2 2017/2018 TeamB 1 1.85
                      3 2017/2018 TeamC 1 1.70
                      4 2017/2018 TeamD 0 3.10
                      5 2016/2017 TeamA 1 1.49
                      6 2016/2017 TeamB 3 1.51
                      7 2016/2017 TeamC 2 1.90
                      8 2016/2017 TeamD 0 NA



                      x3 is from here (will have to look at the edit history).

                      x4 is from here






                      share|improve this answer















                      The issue isn't so much how many lines of code it takes, two or five, not much difference. The question is more whether it will work beyond the example you posted here.



                      I haven't come across this sort of thing in the wild, but I had a go at constructing another example that I thought could conceivably exist.



                      I've since come across a couple more cases and added them to the test suite.



                      x1 <- "
                      +------------+------+------+----------+--------------------------+
                      | Date | Emp1 | Case | Priority | PriorityCountinLast7days |
                      +------------+------+------+----------+--------------------------+
                      | 2018-06-01 | A | A1 | 0 | 0 |
                      | 2018-06-03 | A | A2 | 0 | 1 |
                      | 2018-06-02 | B | B2 | 0 | 2 |
                      | 2018-06-03 | B | B3 | 0 | 3 |
                      +------------+------+------+----------+--------------------------+
                      "

                      x2 <- "
                      ––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
                      Date | Emp1 | Case | Priority | PriorityCountinLast7days
                      ––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
                      2018-06-01 | A | A|1 | 0 | 0
                      2018-06-03 | A | A|2 | 0 | 1
                      2018-06-02 | B | B|2 | 0 | 2
                      2018-06-03 | B | B|3 | 0 | 3
                      ––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
                      "

                      x3 <- "
                      Maths | English | Science | History | Class

                      0.1 | 0.2 | 0.3 | 0.2 | Y2

                      0.9 | 0.5 | 0.7 | 0.4 | Y1

                      0.2 | 0.4 | 0.6 | 0.2 | Y2

                      0.9 | 0.5 | 0.2 | 0.7 | Y1
                      "

                      x4 <- "
                      Season | Team | W | AHWO
                      -------------------------------------
                      1 | 2017/2018 | TeamA | 2 | 1.75
                      2 | 2017/2018 | TeamB | 1 | 1.85
                      3 | 2017/2018 | TeamC | 1 | 1.70
                      4 | 2017/2018 | TeamD | 0 | 3.10
                      5 | 2016/2017 | TeamA | 1 | 1.49
                      6 | 2016/2017 | TeamB | 3 | 1.51
                      7 | 2016/2017 | TeamC | 2 | 1.90
                      8 | 2016/2017 | TeamD | 0 | N/A
                      "


                      My go at a function



                      f <- function(x, header=TRUE, na.strings=c("NA", "N/A"), 
                      stringsAsFactors=FALSE, ...) ", " ", x)

                      # read the result as a table
                      read.table(text=paste(x, collapse="n"), header=header,
                      na.strings=na.strings, stringsAsFactors=stringsAsFactors, ...)


                      lapply(c(x1, x2, x3, x4, x5), f)


                      Output



                      [[1]]
                      Date Emp1 Case Priority PriorityCountinLast7days
                      1 2018-06-01 A A1 0 0
                      2 2018-06-03 A A2 0 1
                      3 2018-06-02 B B2 0 2
                      4 2018-06-03 B B3 0 3

                      [[2]]
                      Date Emp1 Case Priority PriorityCountinLast7days
                      1 2018-06-01 A A|1 0 0
                      2 2018-06-03 A A|2 0 1
                      3 2018-06-02 B B|2 0 2
                      4 2018-06-03 B B|3 0 3

                      [[3]]
                      Maths English Science History Class
                      1 0.1 0.2 0.3 0.2 Y2
                      2 0.9 0.5 0.7 0.4 Y1
                      3 0.2 0.4 0.6 0.2 Y2
                      4 0.9 0.5 0.2 0.7 Y1

                      [[4]]
                      Season Team W AHWO
                      1 2017/2018 TeamA 2 1.75
                      2 2017/2018 TeamB 1 1.85
                      3 2017/2018 TeamC 1 1.70
                      4 2017/2018 TeamD 0 3.10
                      5 2016/2017 TeamA 1 1.49
                      6 2016/2017 TeamB 3 1.51
                      7 2016/2017 TeamC 2 1.90
                      8 2016/2017 TeamD 0 NA



                      x3 is from here (will have to look at the edit history).

                      x4 is from here







                      share|improve this answer














                      share|improve this answer



                      share|improve this answer








                      edited Dec 15 '18 at 23:32

























                      answered Oct 2 '18 at 12:26









                      AkselAAkselA

                      4,34621225




                      4,34621225












                      • Great answer, upvote. The more general the solution the better.

                        – Rui Barradas
                        Oct 2 '18 at 13:27

















                      • Great answer, upvote. The more general the solution the better.

                        – Rui Barradas
                        Oct 2 '18 at 13:27
















                      Great answer, upvote. The more general the solution the better.

                      – Rui Barradas
                      Oct 2 '18 at 13:27





                      Great answer, upvote. The more general the solution the better.

                      – Rui Barradas
                      Oct 2 '18 at 13:27











                      -3














                      As it was suggested, you could use dput to save the content of a dataframe to a file, open the file in a text editor and paste its content. An example of mtcar's dataset limited to first 10 rows:



                      dput(mtcars %>% head(10), file = 'reproducible.txt')


                      The content of reproducible.txt can be used to make a dataframe/tibble as shown below. In such a case data the format is machine readable, but it is hard to be undestood by human at first glance (without pasting into R).



                      df <- structure(list(mpg = c(21, 21, 22.8, 21.4, 18.7, 18.1, 14.3,
                      24.4, 22.8, 19.2), cyl = c(6, 6, 4, 6, 8, 6, 8, 4, 4, 6), disp = c(160,
                      160, 108, 258, 360, 225, 360, 146.7, 140.8, 167.6), hp = c(110,
                      110, 93, 110, 175, 105, 245, 62, 95, 123), drat = c(3.9, 3.9,
                      3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92), wt = c(2.62,
                      2.875, 2.32, 3.215, 3.44, 3.46, 3.57, 3.19, 3.15, 3.44), qsec = c(16.46,
                      17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20, 22.9, 18.3), vs = c(0,
                      0, 1, 1, 0, 1, 0, 1, 1, 1), am = c(1, 1, 1, 0, 0, 0, 0, 0, 0,
                      0), gear = c(4, 4, 4, 3, 3, 3, 3, 4, 4, 4), carb = c(4, 4, 1,
                      1, 2, 1, 4, 2, 2, 4)), .Names = c("mpg", "cyl", "disp", "hp",
                      "drat", "wt", "qsec", "vs", "am", "gear", "carb"), row.names = c("Mazda RX4",
                      "Mazda RX4 Wag", "Datsun 710", "Hornet 4 Drive", "Hornet Sportabout",
                      "Valiant", "Duster 360", "Merc 240D", "Merc 230", "Merc 280"), class = "data.frame")





                      share|improve this answer




















                      • 5





                        Thanks for answering, but I think you have not understood what I am asking. In the case of a dataframe already existing in an R session, there is no problem, the problem is to have a table written in a text document with plus signs, dashes an pipe signs. How can that type of table be read in by R?

                        – Rui Barradas
                        Aug 26 '18 at 6:53















                      -3














                      As it was suggested, you could use dput to save the content of a dataframe to a file, open the file in a text editor and paste its content. An example of mtcar's dataset limited to first 10 rows:



                      dput(mtcars %>% head(10), file = 'reproducible.txt')


                      The content of reproducible.txt can be used to make a dataframe/tibble as shown below. In such a case data the format is machine readable, but it is hard to be undestood by human at first glance (without pasting into R).



                      df <- structure(list(mpg = c(21, 21, 22.8, 21.4, 18.7, 18.1, 14.3,
                      24.4, 22.8, 19.2), cyl = c(6, 6, 4, 6, 8, 6, 8, 4, 4, 6), disp = c(160,
                      160, 108, 258, 360, 225, 360, 146.7, 140.8, 167.6), hp = c(110,
                      110, 93, 110, 175, 105, 245, 62, 95, 123), drat = c(3.9, 3.9,
                      3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92), wt = c(2.62,
                      2.875, 2.32, 3.215, 3.44, 3.46, 3.57, 3.19, 3.15, 3.44), qsec = c(16.46,
                      17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20, 22.9, 18.3), vs = c(0,
                      0, 1, 1, 0, 1, 0, 1, 1, 1), am = c(1, 1, 1, 0, 0, 0, 0, 0, 0,
                      0), gear = c(4, 4, 4, 3, 3, 3, 3, 4, 4, 4), carb = c(4, 4, 1,
                      1, 2, 1, 4, 2, 2, 4)), .Names = c("mpg", "cyl", "disp", "hp",
                      "drat", "wt", "qsec", "vs", "am", "gear", "carb"), row.names = c("Mazda RX4",
                      "Mazda RX4 Wag", "Datsun 710", "Hornet 4 Drive", "Hornet Sportabout",
                      "Valiant", "Duster 360", "Merc 240D", "Merc 230", "Merc 280"), class = "data.frame")





                      share|improve this answer




















                      • 5





                        Thanks for answering, but I think you have not understood what I am asking. In the case of a dataframe already existing in an R session, there is no problem, the problem is to have a table written in a text document with plus signs, dashes an pipe signs. How can that type of table be read in by R?

                        – Rui Barradas
                        Aug 26 '18 at 6:53













                      -3












                      -3








                      -3







                      As it was suggested, you could use dput to save the content of a dataframe to a file, open the file in a text editor and paste its content. An example of mtcar's dataset limited to first 10 rows:



                      dput(mtcars %>% head(10), file = 'reproducible.txt')


                      The content of reproducible.txt can be used to make a dataframe/tibble as shown below. In such a case data the format is machine readable, but it is hard to be undestood by human at first glance (without pasting into R).



                      df <- structure(list(mpg = c(21, 21, 22.8, 21.4, 18.7, 18.1, 14.3,
                      24.4, 22.8, 19.2), cyl = c(6, 6, 4, 6, 8, 6, 8, 4, 4, 6), disp = c(160,
                      160, 108, 258, 360, 225, 360, 146.7, 140.8, 167.6), hp = c(110,
                      110, 93, 110, 175, 105, 245, 62, 95, 123), drat = c(3.9, 3.9,
                      3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92), wt = c(2.62,
                      2.875, 2.32, 3.215, 3.44, 3.46, 3.57, 3.19, 3.15, 3.44), qsec = c(16.46,
                      17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20, 22.9, 18.3), vs = c(0,
                      0, 1, 1, 0, 1, 0, 1, 1, 1), am = c(1, 1, 1, 0, 0, 0, 0, 0, 0,
                      0), gear = c(4, 4, 4, 3, 3, 3, 3, 4, 4, 4), carb = c(4, 4, 1,
                      1, 2, 1, 4, 2, 2, 4)), .Names = c("mpg", "cyl", "disp", "hp",
                      "drat", "wt", "qsec", "vs", "am", "gear", "carb"), row.names = c("Mazda RX4",
                      "Mazda RX4 Wag", "Datsun 710", "Hornet 4 Drive", "Hornet Sportabout",
                      "Valiant", "Duster 360", "Merc 240D", "Merc 230", "Merc 280"), class = "data.frame")





                      share|improve this answer















                      As it was suggested, you could use dput to save the content of a dataframe to a file, open the file in a text editor and paste its content. An example of mtcar's dataset limited to first 10 rows:



                      dput(mtcars %>% head(10), file = 'reproducible.txt')


                      The content of reproducible.txt can be used to make a dataframe/tibble as shown below. In such a case data the format is machine readable, but it is hard to be undestood by human at first glance (without pasting into R).



                      df <- structure(list(mpg = c(21, 21, 22.8, 21.4, 18.7, 18.1, 14.3,
                      24.4, 22.8, 19.2), cyl = c(6, 6, 4, 6, 8, 6, 8, 4, 4, 6), disp = c(160,
                      160, 108, 258, 360, 225, 360, 146.7, 140.8, 167.6), hp = c(110,
                      110, 93, 110, 175, 105, 245, 62, 95, 123), drat = c(3.9, 3.9,
                      3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92), wt = c(2.62,
                      2.875, 2.32, 3.215, 3.44, 3.46, 3.57, 3.19, 3.15, 3.44), qsec = c(16.46,
                      17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20, 22.9, 18.3), vs = c(0,
                      0, 1, 1, 0, 1, 0, 1, 1, 1), am = c(1, 1, 1, 0, 0, 0, 0, 0, 0,
                      0), gear = c(4, 4, 4, 3, 3, 3, 3, 4, 4, 4), carb = c(4, 4, 1,
                      1, 2, 1, 4, 2, 2, 4)), .Names = c("mpg", "cyl", "disp", "hp",
                      "drat", "wt", "qsec", "vs", "am", "gear", "carb"), row.names = c("Mazda RX4",
                      "Mazda RX4 Wag", "Datsun 710", "Hornet 4 Drive", "Hornet Sportabout",
                      "Valiant", "Duster 360", "Merc 240D", "Merc 230", "Merc 280"), class = "data.frame")






                      share|improve this answer














                      share|improve this answer



                      share|improve this answer








                      edited Aug 26 '18 at 11:19









                      Peter Mortensen

                      13.5k1984111




                      13.5k1984111










                      answered Aug 26 '18 at 6:39









                      Pawel StradowskiPawel Stradowski

                      14919




                      14919







                      • 5





                        Thanks for answering, but I think you have not understood what I am asking. In the case of a dataframe already existing in an R session, there is no problem, the problem is to have a table written in a text document with plus signs, dashes an pipe signs. How can that type of table be read in by R?

                        – Rui Barradas
                        Aug 26 '18 at 6:53












                      • 5





                        Thanks for answering, but I think you have not understood what I am asking. In the case of a dataframe already existing in an R session, there is no problem, the problem is to have a table written in a text document with plus signs, dashes an pipe signs. How can that type of table be read in by R?

                        – Rui Barradas
                        Aug 26 '18 at 6:53







                      5




                      5





                      Thanks for answering, but I think you have not understood what I am asking. In the case of a dataframe already existing in an R session, there is no problem, the problem is to have a table written in a text document with plus signs, dashes an pipe signs. How can that type of table be read in by R?

                      – Rui Barradas
                      Aug 26 '18 at 6:53





                      Thanks for answering, but I think you have not understood what I am asking. In the case of a dataframe already existing in an R session, there is no problem, the problem is to have a table written in a text document with plus signs, dashes an pipe signs. How can that type of table be read in by R?

                      – Rui Barradas
                      Aug 26 '18 at 6:53

















                      draft saved

                      draft discarded
















































                      Thanks for contributing an answer to Stack Overflow!


                      • Please be sure to answer the question. Provide details and share your research!

                      But avoid


                      • Asking for help, clarification, or responding to other answers.

                      • Making statements based on opinion; back them up with references or personal experience.

                      To learn more, see our tips on writing great answers.




                      draft saved


                      draft discarded














                      StackExchange.ready(
                      function ()
                      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f52023709%2fwhat-can-r-do-about-a-messy-data-format%23new-answer', 'question_page');

                      );

                      Post as a guest















                      Required, but never shown





















































                      Required, but never shown














                      Required, but never shown












                      Required, but never shown







                      Required, but never shown

































                      Required, but never shown














                      Required, but never shown












                      Required, but never shown







                      Required, but never shown







                      Popular posts from this blog

                      How to how show current date and time by default on contact form 7 in WordPress without taking input from user in datetimepicker

                      Syphilis

                      Darth Vader #20