What can R do about a messy data format?

Sometimes I see data posted in a Stack Overflow question formatted like in this question. This is not the first time, so I have decided to ask a question about it, and answer the question with a way to make the posted data palatable.

I will post the dataset example here just in case the question is deleted.

+------------+------+------+----------+--------------------------+
| Date | Emp1 | Case | Priority | PriorityCountinLast7days |
+------------+------+------+----------+--------------------------+
| 2018-06-01 | A | A1 | 0 | 0 |
| 2018-06-03 | A | A2 | 0 | 1 |
| 2018-06-03 | A | A3 | 0 | 2 |
| 2018-06-03 | A | A4 | 1 | 1 |
| 2018-06-03 | A | A5 | 2 | 1 |
| 2018-06-04 | A | A6 | 0 | 3 |
| 2018-06-01 | B | B1 | 0 | 1 |
| 2018-06-02 | B | B2 | 0 | 2 |
| 2018-06-03 | B | B3 | 0 | 3 |
+------------+------+------+----------+--------------------------+

As you can see this is not the right way to post data. As a user wrote in a comment,

It must've taken a bit of time to format the data the way you're
showing it here. Unfortunately this is not a good format for us to
copy & paste.

I believe this says it all. The asker is well intended and it took some work and time to try to be nice, but the result is not good.

What can R code do to make that table usable, if anything? Will it take a great deal of trouble?

edited Aug 26 '18 at 11:13

Peter Mortensen

13.5k1984111

asked Aug 26 '18 at 6:21

Rui Barradas

16.4k51730

This is an interesting question by itself. However, I fear that a good answer might encourage people in presenting their dataset like that.

– nicola
Aug 26 '18 at 6:39

@nicola I surely hope not! It's already bad to see it every now and then, maybe it will have the opposite effect :).

– Rui Barradas
Aug 26 '18 at 6:48

I would enjoy a general smart messy format reader for SO. that would read those, and printed data.frames/ data.tables / tibbles, commented or not, with row numbers or not, recognize dates, and ideally try to guess a way to sort out strings containing spaces, I thought of building it in fact but never got around it.

– Moody_Mudskipper
Aug 29 '18 at 20:25

add a comment |

I will post the dataset example here just in case the question is deleted.

+------------+------+------+----------+--------------------------+
| Date | Emp1 | Case | Priority | PriorityCountinLast7days |
+------------+------+------+----------+--------------------------+
| 2018-06-01 | A | A1 | 0 | 0 |
| 2018-06-03 | A | A2 | 0 | 1 |
| 2018-06-03 | A | A3 | 0 | 2 |
| 2018-06-03 | A | A4 | 1 | 1 |
| 2018-06-03 | A | A5 | 2 | 1 |
| 2018-06-04 | A | A6 | 0 | 3 |
| 2018-06-01 | B | B1 | 0 | 1 |
| 2018-06-02 | B | B2 | 0 | 2 |
| 2018-06-03 | B | B3 | 0 | 3 |
+------------+------+------+----------+--------------------------+

As you can see this is not the right way to post data. As a user wrote in a comment,

It must've taken a bit of time to format the data the way you're
showing it here. Unfortunately this is not a good format for us to
copy & paste.

I believe this says it all. The asker is well intended and it took some work and time to try to be nice, but the result is not good.

What can R code do to make that table usable, if anything? Will it take a great deal of trouble?

edited Aug 26 '18 at 11:13

Peter Mortensen

13.5k1984111

asked Aug 26 '18 at 6:21

Rui Barradas

16.4k51730

This is an interesting question by itself. However, I fear that a good answer might encourage people in presenting their dataset like that.

– nicola
Aug 26 '18 at 6:39

@nicola I surely hope not! It's already bad to see it every now and then, maybe it will have the opposite effect :).

– Rui Barradas
Aug 26 '18 at 6:48

I would enjoy a general smart messy format reader for SO. that would read those, and printed data.frames/ data.tables / tibbles, commented or not, with row numbers or not, recognize dates, and ideally try to guess a way to sort out strings containing spaces, I thought of building it in fact but never got around it.

– Moody_Mudskipper
Aug 29 '18 at 20:25

add a comment |

I will post the dataset example here just in case the question is deleted.

+------------+------+------+----------+--------------------------+
| Date | Emp1 | Case | Priority | PriorityCountinLast7days |
+------------+------+------+----------+--------------------------+
| 2018-06-01 | A | A1 | 0 | 0 |
| 2018-06-03 | A | A2 | 0 | 1 |
| 2018-06-03 | A | A3 | 0 | 2 |
| 2018-06-03 | A | A4 | 1 | 1 |
| 2018-06-03 | A | A5 | 2 | 1 |
| 2018-06-04 | A | A6 | 0 | 3 |
| 2018-06-01 | B | B1 | 0 | 1 |
| 2018-06-02 | B | B2 | 0 | 2 |
| 2018-06-03 | B | B3 | 0 | 3 |
+------------+------+------+----------+--------------------------+

As you can see this is not the right way to post data. As a user wrote in a comment,

It must've taken a bit of time to format the data the way you're
showing it here. Unfortunately this is not a good format for us to
copy & paste.

I believe this says it all. The asker is well intended and it took some work and time to try to be nice, but the result is not good.

What can R code do to make that table usable, if anything? Will it take a great deal of trouble?

edited Aug 26 '18 at 11:13

Peter Mortensen

13.5k1984111

asked Aug 26 '18 at 6:21

Rui Barradas

16.4k51730

I will post the dataset example here just in case the question is deleted.

+------------+------+------+----------+--------------------------+
| Date | Emp1 | Case | Priority | PriorityCountinLast7days |
+------------+------+------+----------+--------------------------+
| 2018-06-01 | A | A1 | 0 | 0 |
| 2018-06-03 | A | A2 | 0 | 1 |
| 2018-06-03 | A | A3 | 0 | 2 |
| 2018-06-03 | A | A4 | 1 | 1 |
| 2018-06-03 | A | A5 | 2 | 1 |
| 2018-06-04 | A | A6 | 0 | 3 |
| 2018-06-01 | B | B1 | 0 | 1 |
| 2018-06-02 | B | B2 | 0 | 2 |
| 2018-06-03 | B | B3 | 0 | 3 |
+------------+------+------+----------+--------------------------+

As you can see this is not the right way to post data. As a user wrote in a comment,

It must've taken a bit of time to format the data the way you're
showing it here. Unfortunately this is not a good format for us to
copy & paste.

I believe this says it all. The asker is well intended and it took some work and time to try to be nice, but the result is not good.

What can R code do to make that table usable, if anything? Will it take a great deal of trouble?

r dataframe

edited Aug 26 '18 at 11:13

Peter Mortensen

13.5k1984111

asked Aug 26 '18 at 6:21

Rui Barradas

16.4k51730

edited Aug 26 '18 at 11:13

Peter Mortensen

13.5k1984111

asked Aug 26 '18 at 6:21

Rui Barradas

16.4k51730

edited Aug 26 '18 at 11:13

Peter Mortensen

13.5k1984111

edited Aug 26 '18 at 11:13

Peter Mortensen

13.5k1984111

edited Aug 26 '18 at 11:13

Peter Mortensen

13.5k1984111

asked Aug 26 '18 at 6:21

Rui Barradas

16.4k51730

asked Aug 26 '18 at 6:21

Rui Barradas

16.4k51730

asked Aug 26 '18 at 6:21

Rui Barradas

16.4k51730

This is an interesting question by itself. However, I fear that a good answer might encourage people in presenting their dataset like that.

– nicola
Aug 26 '18 at 6:39

@nicola I surely hope not! It's already bad to see it every now and then, maybe it will have the opposite effect :).

– Rui Barradas
Aug 26 '18 at 6:48

I would enjoy a general smart messy format reader for SO. that would read those, and printed data.frames/ data.tables / tibbles, commented or not, with row numbers or not, recognize dates, and ideally try to guess a way to sort out strings containing spaces, I thought of building it in fact but never got around it.

– Moody_Mudskipper
Aug 29 '18 at 20:25

add a comment |

This is an interesting question by itself. However, I fear that a good answer might encourage people in presenting their dataset like that.

– nicola
Aug 26 '18 at 6:39

@nicola I surely hope not! It's already bad to see it every now and then, maybe it will have the opposite effect :).

– Rui Barradas
Aug 26 '18 at 6:48

I would enjoy a general smart messy format reader for SO. that would read those, and printed data.frames/ data.tables / tibbles, commented or not, with row numbers or not, recognize dates, and ideally try to guess a way to sort out strings containing spaces, I thought of building it in fact but never got around it.

– Moody_Mudskipper
Aug 29 '18 at 20:25

This is an interesting question by itself. However, I fear that a good answer might encourage people in presenting their dataset like that.

– nicola
Aug 26 '18 at 6:39

@nicola I surely hope not! It's already bad to see it every now and then, maybe it will have the opposite effect :).

– Rui Barradas
Aug 26 '18 at 6:48

I would enjoy a general smart messy format reader for SO. that would read those, and printed data.frames/ data.tables / tibbles, commented or not, with row numbers or not, recognize dates, and ideally try to guess a way to sort out strings containing spaces, I thought of building it in fact but never got around it.

– Moody_Mudskipper
Aug 29 '18 at 20:25

add a comment |

6 Answers
6

active

oldest

votes

Using data.table::fread:

x = '
+------------+------+------+----------+--------------------------+
| Date | Emp1 | Case | Priority | PriorityCountinLast7days |
+------------+------+------+----------+--------------------------+
| 2018-06-01 | A | A1 | 0 | 0 |
| 2018-06-03 | A | A2 | 0 | 1 |
| 2018-06-03 | A | A3 | 0 | 2 |
| 2018-06-03 | A | A4 | 1 | 1 |
| 2018-06-03 | A | A5 | 2 | 1 |
| 2018-06-04 | A | A6 | 0 | 3 |
| 2018-06-01 | B | B1 | 0 | 1 |
| 2018-06-02 | B | B2 | 0 | 2 |
| 2018-06-03 | B | B3 | 0 | 3 |
+------------+------+------+----------+--------------------------+
'

fread(gsub('\+.+\n' ,'', x, perl = T), drop=c(1,7))

# Date Emp1 Case Priority PriorityCountinLast7days
# 1: 2018-06-01 A A1 0 0
# 2: 2018-06-03 A A2 0 1
# 3: 2018-06-03 A A3 0 2
# 4: 2018-06-03 A A4 1 1
# 5: 2018-06-03 A A5 2 1
# 6: 2018-06-04 A A6 0 3
# 7: 2018-06-01 B B1 0 1
# 8: 2018-06-02 B B2 0 2
# 9: 2018-06-03 B B3 0 3

The gsub part removes the horizontal rules. drop removes the extra columns caused by delimiters at the line ends.

edited Aug 26 '18 at 18:59

answered Aug 26 '18 at 7:42

dww

14.7k22655

add a comment |

The short answer to the question is yes, R code can solve that mess and no, it doesn't take that much trouble.

The first step after copying & pasting the table into an R session is to read it in with read.table setting the header, sep, comment.char and strip.white arguments.

Credits for reminding me of arguments comment.char and strip.white go to @nicola, and his comment.

dat <- read.table(text = "
+------------+------+------+----------+--------------------------+
| Date | Emp1 | Case | Priority | PriorityCountinLast7days |
+------------+------+------+----------+--------------------------+
| 2018-06-01 | A | A1 | 0 | 0 |
| 2018-06-03 | A | A2 | 0 | 1 |
| 2018-06-03 | A | A3 | 0 | 2 |
| 2018-06-03 | A | A4 | 1 | 1 |
| 2018-06-03 | A | A5 | 2 | 1 |
| 2018-06-04 | A | A6 | 0 | 3 |
| 2018-06-01 | B | B1 | 0 | 1 |
| 2018-06-02 | B | B2 | 0 | 2 |
| 2018-06-03 | B | B3 | 0 | 3 |
+------------+------+------+----------+--------------------------+
", header = TRUE, sep = "|", comment.char = "+", strip.white = TRUE)

But as you can see there are some issues with the result.

dat
 X Date Emp1 Case Priority PriorityCountinLast7days X.1
1 NA 2018-06-01 A A1 0 0 NA
2 NA 2018-06-03 A A2 0 1 NA
3 NA 2018-06-03 A A3 0 2 NA
4 NA 2018-06-03 A A4 1 1 NA
5 NA 2018-06-03 A A5 2 1 NA
6 NA 2018-06-04 A A6 0 3 NA
7 NA 2018-06-01 B B1 0 1 NA
8 NA 2018-06-02 B B2 0 2 NA
9 NA 2018-06-03 B B3 0 3 NA

To have separators start and end each data row made R believe those separators mark extra columns, which is not what is meant by the original question's OP.

So the second step is to keep only the real columns. I will do this subsetting the columns by their numbers, easily done, they usually are the first and last columns.

dat <- dat[-c(1, ncol(dat))]
dat
 Date Emp1 Case Priority PriorityCountinLast7days
1 2018-06-01 A A1 0 0
2 2018-06-03 A A2 0 1
3 2018-06-03 A A3 0 2
4 2018-06-03 A A4 1 1
5 2018-06-03 A A5 2 1
6 2018-06-04 A A6 0 3
7 2018-06-01 B B1 0 1
8 2018-06-02 B B2 0 2
9 2018-06-03 B B3 0 3

That wasn't too hard, much better.

In this case there is still a problem, to coerce column Date to class Date.

dat$Date <- as.Date(dat$Date)

And the result is satisfactory.

str(dat)
'data.frame': 9 obs. of 5 variables:
 $ Date : Date, format: "2018-06-01" "2018-06-03" ...
 $ Emp1 : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 2 2 2
 $ Case : Factor w/ 9 levels "A1","A2","A3",..: 1 2 3 4 5 6 7 8 9
 $ Priority : int 0 0 0 1 2 0 0 0 0
 $ PriorityCountinLast7days: int 0 1 2 1 1 3 1 2 3

Note that I have not set the more or less standard argument stringsAsFactors = FALSE. If needed, this should be done when running read.table.

The whole process took only 3 lines of base R code.

Finally, the end result in dput format, like it should be in the first place.

dat <-
structure(list(Date = structure(c(17683, 17685, 17685, 17685, 
17685, 17686, 17683, 17684, 17685), class = "Date"), Emp1 = c("A", 
"A", "A", "A", "A", "A", "B", "B", "B"), Case = c("A1", "A2", 
"A3", "A4", "A5", "A6", "B1", "B2", "B3"), Priority = c(0, 0, 
0, 1, 2, 0, 0, 0, 0), PriorityCountinLast7days = c(0, 1, 2, 1, 
1, 3, 1, 2, 3)), row.names = c(NA, -9L), class = "data.frame")

edited Aug 27 '18 at 2:01

answered Aug 26 '18 at 6:41

Rui Barradas

16.4k51730

1

@李哲源 I don't know about "safer", I find it more natural to use read table functions since after all that does look like a table.

– Rui Barradas
Aug 26 '18 at 6:46

1

I was about to post utils:::head.default(read.table("clipboard",comment.char="+",sep="|",strip.white=TRUE,header=TRUE,flush=TRUE)[-1],-1), which is not different from what you did (in my line, you manage the first and third line and remove the undesired columns).

– nicola
Aug 26 '18 at 8:13

add a comment |

md_table <- scan(text = "
+------------+------+------+----------+--------------------------+
| Date | Emp1 | Case | Priority | PriorityCountinLast7days |
+------------+------+------+----------+--------------------------+
| 2018-06-01 | A | A1 | 0 | 0 |
| 2018-06-03 | A | A2 | 0 | 1 |
| 2018-06-03 | A | A3 | 0 | 2 |
| 2018-06-03 | A | A4 | 1 | 1 |
| 2018-06-03 | A | A5 | 2 | 1 |
| 2018-06-04 | A | A6 | 0 | 3 |
| 2018-06-01 | B | B1 | 0 | 1 |
| 2018-06-02 | B | B2 | 0 | 2 |
| 2018-06-03 | B | B3 | 0 | 3 |
+------------+------+------+----------+--------------------------+",
what = "", sep = "", comment.char = "+", quiet = TRUE)

## it is clear that there are 5 columns
mat <- matrix(md_table[md_table != "|"], ncol = 5, byrow = TRUE)
# [,1] [,2] [,3] [,4] [,5] 
# [1,] "Date" "Emp1" "Case" "Priority" "PriorityCountinLast7days"
# [2,] "2018-06-01" "A" "A1" "0" "0" 
# [3,] "2018-06-03" "A" "A2" "0" "1" 
# [4,] "2018-06-03" "A" "A3" "0" "2" 
# [5,] "2018-06-03" "A" "A4" "1" "1" 
# [6,] "2018-06-03" "A" "A5" "2" "1" 
# [7,] "2018-06-04" "A" "A6" "0" "3" 
# [8,] "2018-06-01" "B" "B1" "0" "1" 
# [9,] "2018-06-02" "B" "B2" "0" "2" 
#[10,] "2018-06-03" "B" "B3" "0" "3"

## a data frame with all character columns
dat <- setNames(data.frame(mat[-1, ], stringsAsFactors = FALSE), mat[1, ])
# Date Emp1 Case Priority PriorityCountinLast7days
#1 2018-06-01 A A1 0 0
#2 2018-06-03 A A2 0 1
#3 2018-06-03 A A3 0 2
#4 2018-06-03 A A4 1 1
#5 2018-06-03 A A5 2 1
#6 2018-06-04 A A6 0 3
#7 2018-06-01 B B1 0 1
#8 2018-06-02 B B2 0 2
#9 2018-06-03 B B3 0 3

## or maybe just use `type.convert` on some columns?
dat <- lapply(dat, type.convert)

edited Aug 26 '18 at 8:05

answered Aug 26 '18 at 6:32

李哲源

47.8k1494145

add a comment |

Well, about this specific dataset I used the import feature in RStudio, but I took one additional step beforehand.

Copy the dataset into the Notepad file.

Replace all | characters with ,

Import the Notepad file using read.csv to RStudio using this code (seperate columns by ,).

But, if you mean use the R to fully understand it in one step, then I have no idea.

edited Aug 26 '18 at 11:15

Peter Mortensen

13.5k1984111

answered Aug 26 '18 at 6:38

Salman Lashkarara

4,6991560110

Notepad? Are you assuming Microsoft Windows?

– Peter Mortensen
Aug 26 '18 at 11:16

@PeterMortensen Yes, but it could be any other editor as well.

– Salman Lashkarara
Aug 26 '18 at 11:26

add a comment |

The issue isn't so much how many lines of code it takes, two or five, not much difference. The question is more whether it will work beyond the example you posted here.

I haven't come across this sort of thing in the wild, but I had a go at constructing another example that I thought could conceivably exist.

I've since come across a couple more cases and added them to the test suite.

x1 <- "
+------------+------+------+----------+--------------------------+
| Date | Emp1 | Case | Priority | PriorityCountinLast7days |
+------------+------+------+----------+--------------------------+
| 2018-06-01 | A | A1 | 0 | 0 |
| 2018-06-03 | A | A2 | 0 | 1 |
| 2018-06-02 | B | B2 | 0 | 2 |
| 2018-06-03 | B | B3 | 0 | 3 |
+------------+------+------+----------+--------------------------+
"

x2 <- "
––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
 Date | Emp1 | Case | Priority | PriorityCountinLast7days 
––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
 2018-06-01 | A | A|1 | 0 | 0 
 2018-06-03 | A | A|2 | 0 | 1 
 2018-06-02 | B | B|2 | 0 | 2 
 2018-06-03 | B | B|3 | 0 | 3 
––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
"

x3 <- "
 Maths | English | Science | History | Class

 0.1 | 0.2 | 0.3 | 0.2 | Y2

 0.9 | 0.5 | 0.7 | 0.4 | Y1

 0.2 | 0.4 | 0.6 | 0.2 | Y2

 0.9 | 0.5 | 0.2 | 0.7 | Y1
"

x4 <- "
 Season | Team | W | AHWO
-------------------------------------
1 | 2017/2018 | TeamA | 2 | 1.75
2 | 2017/2018 | TeamB | 1 | 1.85
3 | 2017/2018 | TeamC | 1 | 1.70
4 | 2017/2018 | TeamD | 0 | 3.10
5 | 2016/2017 | TeamA | 1 | 1.49
6 | 2016/2017 | TeamB | 3 | 1.51
7 | 2016/2017 | TeamC | 2 | 1.90
8 | 2016/2017 | TeamD | 0 | N/A 
"

My go at a function

f <- function(x, header=TRUE, na.strings=c("NA", "N/A"), 
 stringsAsFactors=FALSE, ...) ", " ", x)

 # read the result as a table
 read.table(text=paste(x, collapse="n"), header=header, 
 na.strings=na.strings, stringsAsFactors=stringsAsFactors, ...) 


lapply(c(x1, x2, x3, x4, x5), f)

Output

[[1]]
 Date Emp1 Case Priority PriorityCountinLast7days
1 2018-06-01 A A1 0 0
2 2018-06-03 A A2 0 1
3 2018-06-02 B B2 0 2
4 2018-06-03 B B3 0 3

[[2]]
 Date Emp1 Case Priority PriorityCountinLast7days
1 2018-06-01 A A|1 0 0
2 2018-06-03 A A|2 0 1
3 2018-06-02 B B|2 0 2
4 2018-06-03 B B|3 0 3

[[3]]
 Maths English Science History Class
1 0.1 0.2 0.3 0.2 Y2
2 0.9 0.5 0.7 0.4 Y1
3 0.2 0.4 0.6 0.2 Y2
4 0.9 0.5 0.2 0.7 Y1

[[4]]
 Season Team W AHWO
1 2017/2018 TeamA 2 1.75
2 2017/2018 TeamB 1 1.85
3 2017/2018 TeamC 1 1.70
4 2017/2018 TeamD 0 3.10
5 2016/2017 TeamA 1 1.49
6 2016/2017 TeamB 3 1.51
7 2016/2017 TeamC 2 1.90
8 2016/2017 TeamD 0 NA

x3 is from here (will have to look at the edit history).

x4 is from here

edited Dec 15 '18 at 23:32

answered Oct 2 '18 at 12:26

AkselA

4,34621225

Great answer, upvote. The more general the solution the better.

– Rui Barradas
Oct 2 '18 at 13:27

add a comment |

-3

As it was suggested, you could use dput to save the content of a dataframe to a file, open the file in a text editor and paste its content. An example of mtcar's dataset limited to first 10 rows:

dput(mtcars %>% head(10), file = 'reproducible.txt')

The content of reproducible.txt can be used to make a dataframe/tibble as shown below. In such a case data the format is machine readable, but it is hard to be undestood by human at first glance (without pasting into R).

df <- structure(list(mpg = c(21, 21, 22.8, 21.4, 18.7, 18.1, 14.3,
24.4, 22.8, 19.2), cyl = c(6, 6, 4, 6, 8, 6, 8, 4, 4, 6), disp = c(160,
160, 108, 258, 360, 225, 360, 146.7, 140.8, 167.6), hp = c(110,
110, 93, 110, 175, 105, 245, 62, 95, 123), drat = c(3.9, 3.9,
3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92), wt = c(2.62,
2.875, 2.32, 3.215, 3.44, 3.46, 3.57, 3.19, 3.15, 3.44), qsec = c(16.46,
17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20, 22.9, 18.3), vs = c(0,
0, 1, 1, 0, 1, 0, 1, 1, 1), am = c(1, 1, 1, 0, 0, 0, 0, 0, 0,
0), gear = c(4, 4, 4, 3, 3, 3, 3, 4, 4, 4), carb = c(4, 4, 1,
1, 2, 1, 4, 2, 2, 4)), .Names = c("mpg", "cyl", "disp", "hp",
"drat", "wt", "qsec", "vs", "am", "gear", "carb"), row.names = c("Mazda RX4",
"Mazda RX4 Wag", "Datsun 710", "Hornet 4 Drive", "Hornet Sportabout",
"Valiant", "Duster 360", "Merc 240D", "Merc 230", "Merc 280"), class = "data.frame")

edited Aug 26 '18 at 11:19

Peter Mortensen

13.5k1984111

answered Aug 26 '18 at 6:39

Pawel Stradowski

14919

5

Thanks for answering, but I think you have not understood what I am asking. In the case of a dataframe already existing in an R session, there is no problem, the problem is to have a table written in a text document with plus signs, dashes an pipe signs. How can that type of table be read in by R?

– Rui Barradas
Aug 26 '18 at 6:53

add a comment |

Your Answer

StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);

);

draft saved

draft discarded

StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f52023709%2fwhat-can-r-do-about-a-messy-data-format%23new-answer', 'question_page');

);

Post as a guest

Name

Required, but never shown

6 Answers
6

active

oldest

votes

6 Answers
6

active

oldest

votes

Using data.table::fread:

x = '
+------------+------+------+----------+--------------------------+
| Date | Emp1 | Case | Priority | PriorityCountinLast7days |
+------------+------+------+----------+--------------------------+
| 2018-06-01 | A | A1 | 0 | 0 |
| 2018-06-03 | A | A2 | 0 | 1 |
| 2018-06-03 | A | A3 | 0 | 2 |
| 2018-06-03 | A | A4 | 1 | 1 |
| 2018-06-03 | A | A5 | 2 | 1 |
| 2018-06-04 | A | A6 | 0 | 3 |
| 2018-06-01 | B | B1 | 0 | 1 |
| 2018-06-02 | B | B2 | 0 | 2 |
| 2018-06-03 | B | B3 | 0 | 3 |
+------------+------+------+----------+--------------------------+
'

fread(gsub('\+.+\n' ,'', x, perl = T), drop=c(1,7))

# Date Emp1 Case Priority PriorityCountinLast7days
# 1: 2018-06-01 A A1 0 0
# 2: 2018-06-03 A A2 0 1
# 3: 2018-06-03 A A3 0 2
# 4: 2018-06-03 A A4 1 1
# 5: 2018-06-03 A A5 2 1
# 6: 2018-06-04 A A6 0 3
# 7: 2018-06-01 B B1 0 1
# 8: 2018-06-02 B B2 0 2
# 9: 2018-06-03 B B3 0 3

The gsub part removes the horizontal rules. drop removes the extra columns caused by delimiters at the line ends.

edited Aug 26 '18 at 18:59

answered Aug 26 '18 at 7:42

dww

14.7k22655

add a comment |

Using data.table::fread:

x = '
+------------+------+------+----------+--------------------------+
| Date | Emp1 | Case | Priority | PriorityCountinLast7days |
+------------+------+------+----------+--------------------------+
| 2018-06-01 | A | A1 | 0 | 0 |
| 2018-06-03 | A | A2 | 0 | 1 |
| 2018-06-03 | A | A3 | 0 | 2 |
| 2018-06-03 | A | A4 | 1 | 1 |
| 2018-06-03 | A | A5 | 2 | 1 |
| 2018-06-04 | A | A6 | 0 | 3 |
| 2018-06-01 | B | B1 | 0 | 1 |
| 2018-06-02 | B | B2 | 0 | 2 |
| 2018-06-03 | B | B3 | 0 | 3 |
+------------+------+------+----------+--------------------------+
'

fread(gsub('\+.+\n' ,'', x, perl = T), drop=c(1,7))

# Date Emp1 Case Priority PriorityCountinLast7days
# 1: 2018-06-01 A A1 0 0
# 2: 2018-06-03 A A2 0 1
# 3: 2018-06-03 A A3 0 2
# 4: 2018-06-03 A A4 1 1
# 5: 2018-06-03 A A5 2 1
# 6: 2018-06-04 A A6 0 3
# 7: 2018-06-01 B B1 0 1
# 8: 2018-06-02 B B2 0 2
# 9: 2018-06-03 B B3 0 3

The gsub part removes the horizontal rules. drop removes the extra columns caused by delimiters at the line ends.

edited Aug 26 '18 at 18:59

answered Aug 26 '18 at 7:42

dww

14.7k22655

add a comment |

Using data.table::fread:

x = '
+------------+------+------+----------+--------------------------+
| Date | Emp1 | Case | Priority | PriorityCountinLast7days |
+------------+------+------+----------+--------------------------+
| 2018-06-01 | A | A1 | 0 | 0 |
| 2018-06-03 | A | A2 | 0 | 1 |
| 2018-06-03 | A | A3 | 0 | 2 |
| 2018-06-03 | A | A4 | 1 | 1 |
| 2018-06-03 | A | A5 | 2 | 1 |
| 2018-06-04 | A | A6 | 0 | 3 |
| 2018-06-01 | B | B1 | 0 | 1 |
| 2018-06-02 | B | B2 | 0 | 2 |
| 2018-06-03 | B | B3 | 0 | 3 |
+------------+------+------+----------+--------------------------+
'

fread(gsub('\+.+\n' ,'', x, perl = T), drop=c(1,7))

# Date Emp1 Case Priority PriorityCountinLast7days
# 1: 2018-06-01 A A1 0 0
# 2: 2018-06-03 A A2 0 1
# 3: 2018-06-03 A A3 0 2
# 4: 2018-06-03 A A4 1 1
# 5: 2018-06-03 A A5 2 1
# 6: 2018-06-04 A A6 0 3
# 7: 2018-06-01 B B1 0 1
# 8: 2018-06-02 B B2 0 2
# 9: 2018-06-03 B B3 0 3

The gsub part removes the horizontal rules. drop removes the extra columns caused by delimiters at the line ends.

edited Aug 26 '18 at 18:59

answered Aug 26 '18 at 7:42

dww

14.7k22655

Using data.table::fread:

x = '
+------------+------+------+----------+--------------------------+
| Date | Emp1 | Case | Priority | PriorityCountinLast7days |
+------------+------+------+----------+--------------------------+
| 2018-06-01 | A | A1 | 0 | 0 |
| 2018-06-03 | A | A2 | 0 | 1 |
| 2018-06-03 | A | A3 | 0 | 2 |
| 2018-06-03 | A | A4 | 1 | 1 |
| 2018-06-03 | A | A5 | 2 | 1 |
| 2018-06-04 | A | A6 | 0 | 3 |
| 2018-06-01 | B | B1 | 0 | 1 |
| 2018-06-02 | B | B2 | 0 | 2 |
| 2018-06-03 | B | B3 | 0 | 3 |
+------------+------+------+----------+--------------------------+
'

fread(gsub('\+.+\n' ,'', x, perl = T), drop=c(1,7))

# Date Emp1 Case Priority PriorityCountinLast7days
# 1: 2018-06-01 A A1 0 0
# 2: 2018-06-03 A A2 0 1
# 3: 2018-06-03 A A3 0 2
# 4: 2018-06-03 A A4 1 1
# 5: 2018-06-03 A A5 2 1
# 6: 2018-06-04 A A6 0 3
# 7: 2018-06-01 B B1 0 1
# 8: 2018-06-02 B B2 0 2
# 9: 2018-06-03 B B3 0 3

The gsub part removes the horizontal rules. drop removes the extra columns caused by delimiters at the line ends.

edited Aug 26 '18 at 18:59

answered Aug 26 '18 at 7:42

dww

14.7k22655

edited Aug 26 '18 at 18:59

answered Aug 26 '18 at 7:42

dww

14.7k22655

answered Aug 26 '18 at 7:42

dww

14.7k22655

answered Aug 26 '18 at 7:42

dww

14.7k22655

add a comment |

The short answer to the question is yes, R code can solve that mess and no, it doesn't take that much trouble.

The first step after copying & pasting the table into an R session is to read it in with read.table setting the header, sep, comment.char and strip.white arguments.

Credits for reminding me of arguments comment.char and strip.white go to @nicola, and his comment.

dat <- read.table(text = "
+------------+------+------+----------+--------------------------+
| Date | Emp1 | Case | Priority | PriorityCountinLast7days |
+------------+------+------+----------+--------------------------+
| 2018-06-01 | A | A1 | 0 | 0 |
| 2018-06-03 | A | A2 | 0 | 1 |
| 2018-06-03 | A | A3 | 0 | 2 |
| 2018-06-03 | A | A4 | 1 | 1 |
| 2018-06-03 | A | A5 | 2 | 1 |
| 2018-06-04 | A | A6 | 0 | 3 |
| 2018-06-01 | B | B1 | 0 | 1 |
| 2018-06-02 | B | B2 | 0 | 2 |
| 2018-06-03 | B | B3 | 0 | 3 |
+------------+------+------+----------+--------------------------+
", header = TRUE, sep = "|", comment.char = "+", strip.white = TRUE)

But as you can see there are some issues with the result.

dat
 X Date Emp1 Case Priority PriorityCountinLast7days X.1
1 NA 2018-06-01 A A1 0 0 NA
2 NA 2018-06-03 A A2 0 1 NA
3 NA 2018-06-03 A A3 0 2 NA
4 NA 2018-06-03 A A4 1 1 NA
5 NA 2018-06-03 A A5 2 1 NA
6 NA 2018-06-04 A A6 0 3 NA
7 NA 2018-06-01 B B1 0 1 NA
8 NA 2018-06-02 B B2 0 2 NA
9 NA 2018-06-03 B B3 0 3 NA

To have separators start and end each data row made R believe those separators mark extra columns, which is not what is meant by the original question's OP.

So the second step is to keep only the real columns. I will do this subsetting the columns by their numbers, easily done, they usually are the first and last columns.

dat <- dat[-c(1, ncol(dat))]
dat
 Date Emp1 Case Priority PriorityCountinLast7days
1 2018-06-01 A A1 0 0
2 2018-06-03 A A2 0 1
3 2018-06-03 A A3 0 2
4 2018-06-03 A A4 1 1
5 2018-06-03 A A5 2 1
6 2018-06-04 A A6 0 3
7 2018-06-01 B B1 0 1
8 2018-06-02 B B2 0 2
9 2018-06-03 B B3 0 3

That wasn't too hard, much better.

In this case there is still a problem, to coerce column Date to class Date.

dat$Date <- as.Date(dat$Date)

And the result is satisfactory.

str(dat)
'data.frame': 9 obs. of 5 variables:
 $ Date : Date, format: "2018-06-01" "2018-06-03" ...
 $ Emp1 : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 2 2 2
 $ Case : Factor w/ 9 levels "A1","A2","A3",..: 1 2 3 4 5 6 7 8 9
 $ Priority : int 0 0 0 1 2 0 0 0 0
 $ PriorityCountinLast7days: int 0 1 2 1 1 3 1 2 3

Note that I have not set the more or less standard argument stringsAsFactors = FALSE. If needed, this should be done when running read.table.

The whole process took only 3 lines of base R code.

Finally, the end result in dput format, like it should be in the first place.

dat <-
structure(list(Date = structure(c(17683, 17685, 17685, 17685, 
17685, 17686, 17683, 17684, 17685), class = "Date"), Emp1 = c("A", 
"A", "A", "A", "A", "A", "B", "B", "B"), Case = c("A1", "A2", 
"A3", "A4", "A5", "A6", "B1", "B2", "B3"), Priority = c(0, 0, 
0, 1, 2, 0, 0, 0, 0), PriorityCountinLast7days = c(0, 1, 2, 1, 
1, 3, 1, 2, 3)), row.names = c(NA, -9L), class = "data.frame")

edited Aug 27 '18 at 2:01

answered Aug 26 '18 at 6:41

Rui Barradas

16.4k51730

1

@李哲源 I don't know about "safer", I find it more natural to use read table functions since after all that does look like a table.

– Rui Barradas
Aug 26 '18 at 6:46

1

I was about to post utils:::head.default(read.table("clipboard",comment.char="+",sep="|",strip.white=TRUE,header=TRUE,flush=TRUE)[-1],-1), which is not different from what you did (in my line, you manage the first and third line and remove the undesired columns).

– nicola
Aug 26 '18 at 8:13

add a comment |

The short answer to the question is yes, R code can solve that mess and no, it doesn't take that much trouble.

The first step after copying & pasting the table into an R session is to read it in with read.table setting the header, sep, comment.char and strip.white arguments.

Credits for reminding me of arguments comment.char and strip.white go to @nicola, and his comment.

dat <- read.table(text = "
+------------+------+------+----------+--------------------------+
| Date | Emp1 | Case | Priority | PriorityCountinLast7days |
+------------+------+------+----------+--------------------------+
| 2018-06-01 | A | A1 | 0 | 0 |
| 2018-06-03 | A | A2 | 0 | 1 |
| 2018-06-03 | A | A3 | 0 | 2 |
| 2018-06-03 | A | A4 | 1 | 1 |
| 2018-06-03 | A | A5 | 2 | 1 |
| 2018-06-04 | A | A6 | 0 | 3 |
| 2018-06-01 | B | B1 | 0 | 1 |
| 2018-06-02 | B | B2 | 0 | 2 |
| 2018-06-03 | B | B3 | 0 | 3 |
+------------+------+------+----------+--------------------------+
", header = TRUE, sep = "|", comment.char = "+", strip.white = TRUE)

But as you can see there are some issues with the result.

dat
 X Date Emp1 Case Priority PriorityCountinLast7days X.1
1 NA 2018-06-01 A A1 0 0 NA
2 NA 2018-06-03 A A2 0 1 NA
3 NA 2018-06-03 A A3 0 2 NA
4 NA 2018-06-03 A A4 1 1 NA
5 NA 2018-06-03 A A5 2 1 NA
6 NA 2018-06-04 A A6 0 3 NA
7 NA 2018-06-01 B B1 0 1 NA
8 NA 2018-06-02 B B2 0 2 NA
9 NA 2018-06-03 B B3 0 3 NA

To have separators start and end each data row made R believe those separators mark extra columns, which is not what is meant by the original question's OP.

So the second step is to keep only the real columns. I will do this subsetting the columns by their numbers, easily done, they usually are the first and last columns.

dat <- dat[-c(1, ncol(dat))]
dat
 Date Emp1 Case Priority PriorityCountinLast7days
1 2018-06-01 A A1 0 0
2 2018-06-03 A A2 0 1
3 2018-06-03 A A3 0 2
4 2018-06-03 A A4 1 1
5 2018-06-03 A A5 2 1
6 2018-06-04 A A6 0 3
7 2018-06-01 B B1 0 1
8 2018-06-02 B B2 0 2
9 2018-06-03 B B3 0 3

That wasn't too hard, much better.

In this case there is still a problem, to coerce column Date to class Date.

dat$Date <- as.Date(dat$Date)

And the result is satisfactory.

str(dat)
'data.frame': 9 obs. of 5 variables:
 $ Date : Date, format: "2018-06-01" "2018-06-03" ...
 $ Emp1 : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 2 2 2
 $ Case : Factor w/ 9 levels "A1","A2","A3",..: 1 2 3 4 5 6 7 8 9
 $ Priority : int 0 0 0 1 2 0 0 0 0
 $ PriorityCountinLast7days: int 0 1 2 1 1 3 1 2 3

Note that I have not set the more or less standard argument stringsAsFactors = FALSE. If needed, this should be done when running read.table.

The whole process took only 3 lines of base R code.

Finally, the end result in dput format, like it should be in the first place.

dat <-
structure(list(Date = structure(c(17683, 17685, 17685, 17685, 
17685, 17686, 17683, 17684, 17685), class = "Date"), Emp1 = c("A", 
"A", "A", "A", "A", "A", "B", "B", "B"), Case = c("A1", "A2", 
"A3", "A4", "A5", "A6", "B1", "B2", "B3"), Priority = c(0, 0, 
0, 1, 2, 0, 0, 0, 0), PriorityCountinLast7days = c(0, 1, 2, 1, 
1, 3, 1, 2, 3)), row.names = c(NA, -9L), class = "data.frame")

edited Aug 27 '18 at 2:01

answered Aug 26 '18 at 6:41

Rui Barradas

16.4k51730

1

@李哲源 I don't know about "safer", I find it more natural to use read table functions since after all that does look like a table.

– Rui Barradas
Aug 26 '18 at 6:46

1

I was about to post utils:::head.default(read.table("clipboard",comment.char="+",sep="|",strip.white=TRUE,header=TRUE,flush=TRUE)[-1],-1), which is not different from what you did (in my line, you manage the first and third line and remove the undesired columns).

– nicola
Aug 26 '18 at 8:13

add a comment |

The short answer to the question is yes, R code can solve that mess and no, it doesn't take that much trouble.

The first step after copying & pasting the table into an R session is to read it in with read.table setting the header, sep, comment.char and strip.white arguments.

Credits for reminding me of arguments comment.char and strip.white go to @nicola, and his comment.

dat <- read.table(text = "
+------------+------+------+----------+--------------------------+
| Date | Emp1 | Case | Priority | PriorityCountinLast7days |
+------------+------+------+----------+--------------------------+
| 2018-06-01 | A | A1 | 0 | 0 |
| 2018-06-03 | A | A2 | 0 | 1 |
| 2018-06-03 | A | A3 | 0 | 2 |
| 2018-06-03 | A | A4 | 1 | 1 |
| 2018-06-03 | A | A5 | 2 | 1 |
| 2018-06-04 | A | A6 | 0 | 3 |
| 2018-06-01 | B | B1 | 0 | 1 |
| 2018-06-02 | B | B2 | 0 | 2 |
| 2018-06-03 | B | B3 | 0 | 3 |
+------------+------+------+----------+--------------------------+
", header = TRUE, sep = "|", comment.char = "+", strip.white = TRUE)

But as you can see there are some issues with the result.

dat
 X Date Emp1 Case Priority PriorityCountinLast7days X.1
1 NA 2018-06-01 A A1 0 0 NA
2 NA 2018-06-03 A A2 0 1 NA
3 NA 2018-06-03 A A3 0 2 NA
4 NA 2018-06-03 A A4 1 1 NA
5 NA 2018-06-03 A A5 2 1 NA
6 NA 2018-06-04 A A6 0 3 NA
7 NA 2018-06-01 B B1 0 1 NA
8 NA 2018-06-02 B B2 0 2 NA
9 NA 2018-06-03 B B3 0 3 NA

To have separators start and end each data row made R believe those separators mark extra columns, which is not what is meant by the original question's OP.

So the second step is to keep only the real columns. I will do this subsetting the columns by their numbers, easily done, they usually are the first and last columns.

dat <- dat[-c(1, ncol(dat))]
dat
 Date Emp1 Case Priority PriorityCountinLast7days
1 2018-06-01 A A1 0 0
2 2018-06-03 A A2 0 1
3 2018-06-03 A A3 0 2
4 2018-06-03 A A4 1 1
5 2018-06-03 A A5 2 1
6 2018-06-04 A A6 0 3
7 2018-06-01 B B1 0 1
8 2018-06-02 B B2 0 2
9 2018-06-03 B B3 0 3

That wasn't too hard, much better.

In this case there is still a problem, to coerce column Date to class Date.

dat$Date <- as.Date(dat$Date)

And the result is satisfactory.

str(dat)
'data.frame': 9 obs. of 5 variables:
 $ Date : Date, format: "2018-06-01" "2018-06-03" ...
 $ Emp1 : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 2 2 2
 $ Case : Factor w/ 9 levels "A1","A2","A3",..: 1 2 3 4 5 6 7 8 9
 $ Priority : int 0 0 0 1 2 0 0 0 0
 $ PriorityCountinLast7days: int 0 1 2 1 1 3 1 2 3

Note that I have not set the more or less standard argument stringsAsFactors = FALSE. If needed, this should be done when running read.table.

The whole process took only 3 lines of base R code.

Finally, the end result in dput format, like it should be in the first place.

dat <-
structure(list(Date = structure(c(17683, 17685, 17685, 17685, 
17685, 17686, 17683, 17684, 17685), class = "Date"), Emp1 = c("A", 
"A", "A", "A", "A", "A", "B", "B", "B"), Case = c("A1", "A2", 
"A3", "A4", "A5", "A6", "B1", "B2", "B3"), Priority = c(0, 0, 
0, 1, 2, 0, 0, 0, 0), PriorityCountinLast7days = c(0, 1, 2, 1, 
1, 3, 1, 2, 3)), row.names = c(NA, -9L), class = "data.frame")

edited Aug 27 '18 at 2:01

answered Aug 26 '18 at 6:41

Rui Barradas

16.4k51730

The short answer to the question is yes, R code can solve that mess and no, it doesn't take that much trouble.

The first step after copying & pasting the table into an R session is to read it in with read.table setting the header, sep, comment.char and strip.white arguments.

Credits for reminding me of arguments comment.char and strip.white go to @nicola, and his comment.

dat <- read.table(text = "
+------------+------+------+----------+--------------------------+
| Date | Emp1 | Case | Priority | PriorityCountinLast7days |
+------------+------+------+----------+--------------------------+
| 2018-06-01 | A | A1 | 0 | 0 |
| 2018-06-03 | A | A2 | 0 | 1 |
| 2018-06-03 | A | A3 | 0 | 2 |
| 2018-06-03 | A | A4 | 1 | 1 |
| 2018-06-03 | A | A5 | 2 | 1 |
| 2018-06-04 | A | A6 | 0 | 3 |
| 2018-06-01 | B | B1 | 0 | 1 |
| 2018-06-02 | B | B2 | 0 | 2 |
| 2018-06-03 | B | B3 | 0 | 3 |
+------------+------+------+----------+--------------------------+
", header = TRUE, sep = "|", comment.char = "+", strip.white = TRUE)

But as you can see there are some issues with the result.

dat
 X Date Emp1 Case Priority PriorityCountinLast7days X.1
1 NA 2018-06-01 A A1 0 0 NA
2 NA 2018-06-03 A A2 0 1 NA
3 NA 2018-06-03 A A3 0 2 NA
4 NA 2018-06-03 A A4 1 1 NA
5 NA 2018-06-03 A A5 2 1 NA
6 NA 2018-06-04 A A6 0 3 NA
7 NA 2018-06-01 B B1 0 1 NA
8 NA 2018-06-02 B B2 0 2 NA
9 NA 2018-06-03 B B3 0 3 NA

To have separators start and end each data row made R believe those separators mark extra columns, which is not what is meant by the original question's OP.

So the second step is to keep only the real columns. I will do this subsetting the columns by their numbers, easily done, they usually are the first and last columns.

dat <- dat[-c(1, ncol(dat))]
dat
 Date Emp1 Case Priority PriorityCountinLast7days
1 2018-06-01 A A1 0 0
2 2018-06-03 A A2 0 1
3 2018-06-03 A A3 0 2
4 2018-06-03 A A4 1 1
5 2018-06-03 A A5 2 1
6 2018-06-04 A A6 0 3
7 2018-06-01 B B1 0 1
8 2018-06-02 B B2 0 2
9 2018-06-03 B B3 0 3

That wasn't too hard, much better.

In this case there is still a problem, to coerce column Date to class Date.

dat$Date <- as.Date(dat$Date)

And the result is satisfactory.

str(dat)
'data.frame': 9 obs. of 5 variables:
 $ Date : Date, format: "2018-06-01" "2018-06-03" ...
 $ Emp1 : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 2 2 2
 $ Case : Factor w/ 9 levels "A1","A2","A3",..: 1 2 3 4 5 6 7 8 9
 $ Priority : int 0 0 0 1 2 0 0 0 0
 $ PriorityCountinLast7days: int 0 1 2 1 1 3 1 2 3

Note that I have not set the more or less standard argument stringsAsFactors = FALSE. If needed, this should be done when running read.table.

The whole process took only 3 lines of base R code.

Finally, the end result in dput format, like it should be in the first place.

dat <-
structure(list(Date = structure(c(17683, 17685, 17685, 17685, 
17685, 17686, 17683, 17684, 17685), class = "Date"), Emp1 = c("A", 
"A", "A", "A", "A", "A", "B", "B", "B"), Case = c("A1", "A2", 
"A3", "A4", "A5", "A6", "B1", "B2", "B3"), Priority = c(0, 0, 
0, 1, 2, 0, 0, 0, 0), PriorityCountinLast7days = c(0, 1, 2, 1, 
1, 3, 1, 2, 3)), row.names = c(NA, -9L), class = "data.frame")

edited Aug 27 '18 at 2:01

answered Aug 26 '18 at 6:41

Rui Barradas

16.4k51730

edited Aug 27 '18 at 2:01

answered Aug 26 '18 at 6:41

Rui Barradas

16.4k51730

answered Aug 26 '18 at 6:41

Rui Barradas

16.4k51730

answered Aug 26 '18 at 6:41

Rui Barradas

16.4k51730

1

@李哲源 I don't know about "safer", I find it more natural to use read table functions since after all that does look like a table.

– Rui Barradas
Aug 26 '18 at 6:46

1

I was about to post utils:::head.default(read.table("clipboard",comment.char="+",sep="|",strip.white=TRUE,header=TRUE,flush=TRUE)[-1],-1), which is not different from what you did (in my line, you manage the first and third line and remove the undesired columns).

– nicola
Aug 26 '18 at 8:13

add a comment |

1

@李哲源 I don't know about "safer", I find it more natural to use read table functions since after all that does look like a table.

– Rui Barradas
Aug 26 '18 at 6:46

1

I was about to post utils:::head.default(read.table("clipboard",comment.char="+",sep="|",strip.white=TRUE,header=TRUE,flush=TRUE)[-1],-1), which is not different from what you did (in my line, you manage the first and third line and remove the undesired columns).

– nicola
Aug 26 '18 at 8:13

@李哲源 I don't know about "safer", I find it more natural to use read table functions since after all that does look like a table.

– Rui Barradas
Aug 26 '18 at 6:46

I was about to post utils:::head.default(read.table("clipboard",comment.char="+",sep="|",strip.white=TRUE,header=TRUE,flush=TRUE)[-1],-1), which is not different from what you did (in my line, you manage the first and third line and remove the undesired columns).

– nicola
Aug 26 '18 at 8:13

add a comment |

md_table <- scan(text = "
+------------+------+------+----------+--------------------------+
| Date | Emp1 | Case | Priority | PriorityCountinLast7days |
+------------+------+------+----------+--------------------------+
| 2018-06-01 | A | A1 | 0 | 0 |
| 2018-06-03 | A | A2 | 0 | 1 |
| 2018-06-03 | A | A3 | 0 | 2 |
| 2018-06-03 | A | A4 | 1 | 1 |
| 2018-06-03 | A | A5 | 2 | 1 |
| 2018-06-04 | A | A6 | 0 | 3 |
| 2018-06-01 | B | B1 | 0 | 1 |
| 2018-06-02 | B | B2 | 0 | 2 |
| 2018-06-03 | B | B3 | 0 | 3 |
+------------+------+------+----------+--------------------------+",
what = "", sep = "", comment.char = "+", quiet = TRUE)

## it is clear that there are 5 columns
mat <- matrix(md_table[md_table != "|"], ncol = 5, byrow = TRUE)
# [,1] [,2] [,3] [,4] [,5] 
# [1,] "Date" "Emp1" "Case" "Priority" "PriorityCountinLast7days"
# [2,] "2018-06-01" "A" "A1" "0" "0" 
# [3,] "2018-06-03" "A" "A2" "0" "1" 
# [4,] "2018-06-03" "A" "A3" "0" "2" 
# [5,] "2018-06-03" "A" "A4" "1" "1" 
# [6,] "2018-06-03" "A" "A5" "2" "1" 
# [7,] "2018-06-04" "A" "A6" "0" "3" 
# [8,] "2018-06-01" "B" "B1" "0" "1" 
# [9,] "2018-06-02" "B" "B2" "0" "2" 
#[10,] "2018-06-03" "B" "B3" "0" "3"

## a data frame with all character columns
dat <- setNames(data.frame(mat[-1, ], stringsAsFactors = FALSE), mat[1, ])
# Date Emp1 Case Priority PriorityCountinLast7days
#1 2018-06-01 A A1 0 0
#2 2018-06-03 A A2 0 1
#3 2018-06-03 A A3 0 2
#4 2018-06-03 A A4 1 1
#5 2018-06-03 A A5 2 1
#6 2018-06-04 A A6 0 3
#7 2018-06-01 B B1 0 1
#8 2018-06-02 B B2 0 2
#9 2018-06-03 B B3 0 3

## or maybe just use `type.convert` on some columns?
dat <- lapply(dat, type.convert)

edited Aug 26 '18 at 8:05

answered Aug 26 '18 at 6:32

李哲源

47.8k1494145

add a comment |

md_table <- scan(text = "
+------------+------+------+----------+--------------------------+
| Date | Emp1 | Case | Priority | PriorityCountinLast7days |
+------------+------+------+----------+--------------------------+
| 2018-06-01 | A | A1 | 0 | 0 |
| 2018-06-03 | A | A2 | 0 | 1 |
| 2018-06-03 | A | A3 | 0 | 2 |
| 2018-06-03 | A | A4 | 1 | 1 |
| 2018-06-03 | A | A5 | 2 | 1 |
| 2018-06-04 | A | A6 | 0 | 3 |
| 2018-06-01 | B | B1 | 0 | 1 |
| 2018-06-02 | B | B2 | 0 | 2 |
| 2018-06-03 | B | B3 | 0 | 3 |
+------------+------+------+----------+--------------------------+",
what = "", sep = "", comment.char = "+", quiet = TRUE)

## it is clear that there are 5 columns
mat <- matrix(md_table[md_table != "|"], ncol = 5, byrow = TRUE)
# [,1] [,2] [,3] [,4] [,5] 
# [1,] "Date" "Emp1" "Case" "Priority" "PriorityCountinLast7days"
# [2,] "2018-06-01" "A" "A1" "0" "0" 
# [3,] "2018-06-03" "A" "A2" "0" "1" 
# [4,] "2018-06-03" "A" "A3" "0" "2" 
# [5,] "2018-06-03" "A" "A4" "1" "1" 
# [6,] "2018-06-03" "A" "A5" "2" "1" 
# [7,] "2018-06-04" "A" "A6" "0" "3" 
# [8,] "2018-06-01" "B" "B1" "0" "1" 
# [9,] "2018-06-02" "B" "B2" "0" "2" 
#[10,] "2018-06-03" "B" "B3" "0" "3"

## a data frame with all character columns
dat <- setNames(data.frame(mat[-1, ], stringsAsFactors = FALSE), mat[1, ])
# Date Emp1 Case Priority PriorityCountinLast7days
#1 2018-06-01 A A1 0 0
#2 2018-06-03 A A2 0 1
#3 2018-06-03 A A3 0 2
#4 2018-06-03 A A4 1 1
#5 2018-06-03 A A5 2 1
#6 2018-06-04 A A6 0 3
#7 2018-06-01 B B1 0 1
#8 2018-06-02 B B2 0 2
#9 2018-06-03 B B3 0 3

## or maybe just use `type.convert` on some columns?
dat <- lapply(dat, type.convert)

edited Aug 26 '18 at 8:05

answered Aug 26 '18 at 6:32

李哲源

47.8k1494145

add a comment |

md_table <- scan(text = "
+------------+------+------+----------+--------------------------+
| Date | Emp1 | Case | Priority | PriorityCountinLast7days |
+------------+------+------+----------+--------------------------+
| 2018-06-01 | A | A1 | 0 | 0 |
| 2018-06-03 | A | A2 | 0 | 1 |
| 2018-06-03 | A | A3 | 0 | 2 |
| 2018-06-03 | A | A4 | 1 | 1 |
| 2018-06-03 | A | A5 | 2 | 1 |
| 2018-06-04 | A | A6 | 0 | 3 |
| 2018-06-01 | B | B1 | 0 | 1 |
| 2018-06-02 | B | B2 | 0 | 2 |
| 2018-06-03 | B | B3 | 0 | 3 |
+------------+------+------+----------+--------------------------+",
what = "", sep = "", comment.char = "+", quiet = TRUE)

## it is clear that there are 5 columns
mat <- matrix(md_table[md_table != "|"], ncol = 5, byrow = TRUE)
# [,1] [,2] [,3] [,4] [,5] 
# [1,] "Date" "Emp1" "Case" "Priority" "PriorityCountinLast7days"
# [2,] "2018-06-01" "A" "A1" "0" "0" 
# [3,] "2018-06-03" "A" "A2" "0" "1" 
# [4,] "2018-06-03" "A" "A3" "0" "2" 
# [5,] "2018-06-03" "A" "A4" "1" "1" 
# [6,] "2018-06-03" "A" "A5" "2" "1" 
# [7,] "2018-06-04" "A" "A6" "0" "3" 
# [8,] "2018-06-01" "B" "B1" "0" "1" 
# [9,] "2018-06-02" "B" "B2" "0" "2" 
#[10,] "2018-06-03" "B" "B3" "0" "3"

## a data frame with all character columns
dat <- setNames(data.frame(mat[-1, ], stringsAsFactors = FALSE), mat[1, ])
# Date Emp1 Case Priority PriorityCountinLast7days
#1 2018-06-01 A A1 0 0
#2 2018-06-03 A A2 0 1
#3 2018-06-03 A A3 0 2
#4 2018-06-03 A A4 1 1
#5 2018-06-03 A A5 2 1
#6 2018-06-04 A A6 0 3
#7 2018-06-01 B B1 0 1
#8 2018-06-02 B B2 0 2
#9 2018-06-03 B B3 0 3

## or maybe just use `type.convert` on some columns?
dat <- lapply(dat, type.convert)

edited Aug 26 '18 at 8:05

answered Aug 26 '18 at 6:32

李哲源

47.8k1494145

md_table <- scan(text = "
+------------+------+------+----------+--------------------------+
| Date | Emp1 | Case | Priority | PriorityCountinLast7days |
+------------+------+------+----------+--------------------------+
| 2018-06-01 | A | A1 | 0 | 0 |
| 2018-06-03 | A | A2 | 0 | 1 |
| 2018-06-03 | A | A3 | 0 | 2 |
| 2018-06-03 | A | A4 | 1 | 1 |
| 2018-06-03 | A | A5 | 2 | 1 |
| 2018-06-04 | A | A6 | 0 | 3 |
| 2018-06-01 | B | B1 | 0 | 1 |
| 2018-06-02 | B | B2 | 0 | 2 |
| 2018-06-03 | B | B3 | 0 | 3 |
+------------+------+------+----------+--------------------------+",
what = "", sep = "", comment.char = "+", quiet = TRUE)

## it is clear that there are 5 columns
mat <- matrix(md_table[md_table != "|"], ncol = 5, byrow = TRUE)
# [,1] [,2] [,3] [,4] [,5] 
# [1,] "Date" "Emp1" "Case" "Priority" "PriorityCountinLast7days"
# [2,] "2018-06-01" "A" "A1" "0" "0" 
# [3,] "2018-06-03" "A" "A2" "0" "1" 
# [4,] "2018-06-03" "A" "A3" "0" "2" 
# [5,] "2018-06-03" "A" "A4" "1" "1" 
# [6,] "2018-06-03" "A" "A5" "2" "1" 
# [7,] "2018-06-04" "A" "A6" "0" "3" 
# [8,] "2018-06-01" "B" "B1" "0" "1" 
# [9,] "2018-06-02" "B" "B2" "0" "2" 
#[10,] "2018-06-03" "B" "B3" "0" "3"

## a data frame with all character columns
dat <- setNames(data.frame(mat[-1, ], stringsAsFactors = FALSE), mat[1, ])
# Date Emp1 Case Priority PriorityCountinLast7days
#1 2018-06-01 A A1 0 0
#2 2018-06-03 A A2 0 1
#3 2018-06-03 A A3 0 2
#4 2018-06-03 A A4 1 1
#5 2018-06-03 A A5 2 1
#6 2018-06-04 A A6 0 3
#7 2018-06-01 B B1 0 1
#8 2018-06-02 B B2 0 2
#9 2018-06-03 B B3 0 3

## or maybe just use `type.convert` on some columns?
dat <- lapply(dat, type.convert)

edited Aug 26 '18 at 8:05

answered Aug 26 '18 at 6:32

李哲源

47.8k1494145

edited Aug 26 '18 at 8:05

answered Aug 26 '18 at 6:32

李哲源

47.8k1494145

answered Aug 26 '18 at 6:32

李哲源

47.8k1494145

answered Aug 26 '18 at 6:32

李哲源

47.8k1494145

add a comment |

Well, about this specific dataset I used the import feature in RStudio, but I took one additional step beforehand.

Copy the dataset into the Notepad file.

Replace all | characters with ,

Import the Notepad file using read.csv to RStudio using this code (seperate columns by ,).

But, if you mean use the R to fully understand it in one step, then I have no idea.

edited Aug 26 '18 at 11:15

Peter Mortensen

13.5k1984111

answered Aug 26 '18 at 6:38

Salman Lashkarara

4,6991560110

Notepad? Are you assuming Microsoft Windows?

– Peter Mortensen
Aug 26 '18 at 11:16

@PeterMortensen Yes, but it could be any other editor as well.

– Salman Lashkarara
Aug 26 '18 at 11:26

add a comment |

Well, about this specific dataset I used the import feature in RStudio, but I took one additional step beforehand.

Copy the dataset into the Notepad file.

Replace all | characters with ,

Import the Notepad file using read.csv to RStudio using this code (seperate columns by ,).

But, if you mean use the R to fully understand it in one step, then I have no idea.

edited Aug 26 '18 at 11:15

Peter Mortensen

13.5k1984111

answered Aug 26 '18 at 6:38

Salman Lashkarara

4,6991560110

Notepad? Are you assuming Microsoft Windows?

– Peter Mortensen
Aug 26 '18 at 11:16

@PeterMortensen Yes, but it could be any other editor as well.

– Salman Lashkarara
Aug 26 '18 at 11:26

add a comment |

Well, about this specific dataset I used the import feature in RStudio, but I took one additional step beforehand.

Copy the dataset into the Notepad file.

Replace all | characters with ,

Import the Notepad file using read.csv to RStudio using this code (seperate columns by ,).

But, if you mean use the R to fully understand it in one step, then I have no idea.

edited Aug 26 '18 at 11:15

Peter Mortensen

13.5k1984111

answered Aug 26 '18 at 6:38

Salman Lashkarara

4,6991560110

Well, about this specific dataset I used the import feature in RStudio, but I took one additional step beforehand.

Copy the dataset into the Notepad file.

Replace all | characters with ,

Import the Notepad file using read.csv to RStudio using this code (seperate columns by ,).

But, if you mean use the R to fully understand it in one step, then I have no idea.

edited Aug 26 '18 at 11:15

Peter Mortensen

13.5k1984111

answered Aug 26 '18 at 6:38

Salman Lashkarara

4,6991560110

edited Aug 26 '18 at 11:15

Peter Mortensen

13.5k1984111

edited Aug 26 '18 at 11:15

Peter Mortensen

13.5k1984111

edited Aug 26 '18 at 11:15

Peter Mortensen

13.5k1984111

answered Aug 26 '18 at 6:38

Salman Lashkarara

4,6991560110

answered Aug 26 '18 at 6:38

Salman Lashkarara

4,6991560110

answered Aug 26 '18 at 6:38

Salman Lashkarara

4,6991560110

Notepad? Are you assuming Microsoft Windows?

– Peter Mortensen
Aug 26 '18 at 11:16

@PeterMortensen Yes, but it could be any other editor as well.

– Salman Lashkarara
Aug 26 '18 at 11:26

add a comment |

Notepad? Are you assuming Microsoft Windows?

– Peter Mortensen
Aug 26 '18 at 11:16

@PeterMortensen Yes, but it could be any other editor as well.

– Salman Lashkarara
Aug 26 '18 at 11:26

Notepad? Are you assuming Microsoft Windows?

– Peter Mortensen
Aug 26 '18 at 11:16

@PeterMortensen Yes, but it could be any other editor as well.

– Salman Lashkarara
Aug 26 '18 at 11:26

add a comment |

The issue isn't so much how many lines of code it takes, two or five, not much difference. The question is more whether it will work beyond the example you posted here.

I haven't come across this sort of thing in the wild, but I had a go at constructing another example that I thought could conceivably exist.

I've since come across a couple more cases and added them to the test suite.

x1 <- "
+------------+------+------+----------+--------------------------+
| Date | Emp1 | Case | Priority | PriorityCountinLast7days |
+------------+------+------+----------+--------------------------+
| 2018-06-01 | A | A1 | 0 | 0 |
| 2018-06-03 | A | A2 | 0 | 1 |
| 2018-06-02 | B | B2 | 0 | 2 |
| 2018-06-03 | B | B3 | 0 | 3 |
+------------+------+------+----------+--------------------------+
"

x2 <- "
––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
 Date | Emp1 | Case | Priority | PriorityCountinLast7days 
––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
 2018-06-01 | A | A|1 | 0 | 0 
 2018-06-03 | A | A|2 | 0 | 1 
 2018-06-02 | B | B|2 | 0 | 2 
 2018-06-03 | B | B|3 | 0 | 3 
––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
"

x3 <- "
 Maths | English | Science | History | Class

 0.1 | 0.2 | 0.3 | 0.2 | Y2

 0.9 | 0.5 | 0.7 | 0.4 | Y1

 0.2 | 0.4 | 0.6 | 0.2 | Y2

 0.9 | 0.5 | 0.2 | 0.7 | Y1
"

x4 <- "
 Season | Team | W | AHWO
-------------------------------------
1 | 2017/2018 | TeamA | 2 | 1.75
2 | 2017/2018 | TeamB | 1 | 1.85
3 | 2017/2018 | TeamC | 1 | 1.70
4 | 2017/2018 | TeamD | 0 | 3.10
5 | 2016/2017 | TeamA | 1 | 1.49
6 | 2016/2017 | TeamB | 3 | 1.51
7 | 2016/2017 | TeamC | 2 | 1.90
8 | 2016/2017 | TeamD | 0 | N/A 
"

My go at a function

f <- function(x, header=TRUE, na.strings=c("NA", "N/A"), 
 stringsAsFactors=FALSE, ...) ", " ", x)

 # read the result as a table
 read.table(text=paste(x, collapse="n"), header=header, 
 na.strings=na.strings, stringsAsFactors=stringsAsFactors, ...) 


lapply(c(x1, x2, x3, x4, x5), f)

Output

[[1]]
 Date Emp1 Case Priority PriorityCountinLast7days
1 2018-06-01 A A1 0 0
2 2018-06-03 A A2 0 1
3 2018-06-02 B B2 0 2
4 2018-06-03 B B3 0 3

[[2]]
 Date Emp1 Case Priority PriorityCountinLast7days
1 2018-06-01 A A|1 0 0
2 2018-06-03 A A|2 0 1
3 2018-06-02 B B|2 0 2
4 2018-06-03 B B|3 0 3

[[3]]
 Maths English Science History Class
1 0.1 0.2 0.3 0.2 Y2
2 0.9 0.5 0.7 0.4 Y1
3 0.2 0.4 0.6 0.2 Y2
4 0.9 0.5 0.2 0.7 Y1

[[4]]
 Season Team W AHWO
1 2017/2018 TeamA 2 1.75
2 2017/2018 TeamB 1 1.85
3 2017/2018 TeamC 1 1.70
4 2017/2018 TeamD 0 3.10
5 2016/2017 TeamA 1 1.49
6 2016/2017 TeamB 3 1.51
7 2016/2017 TeamC 2 1.90
8 2016/2017 TeamD 0 NA

x3 is from here (will have to look at the edit history).

x4 is from here

edited Dec 15 '18 at 23:32

answered Oct 2 '18 at 12:26

AkselA

4,34621225

Great answer, upvote. The more general the solution the better.

– Rui Barradas
Oct 2 '18 at 13:27

add a comment |

The issue isn't so much how many lines of code it takes, two or five, not much difference. The question is more whether it will work beyond the example you posted here.

I haven't come across this sort of thing in the wild, but I had a go at constructing another example that I thought could conceivably exist.

I've since come across a couple more cases and added them to the test suite.

x1 <- "
+------------+------+------+----------+--------------------------+
| Date | Emp1 | Case | Priority | PriorityCountinLast7days |
+------------+------+------+----------+--------------------------+
| 2018-06-01 | A | A1 | 0 | 0 |
| 2018-06-03 | A | A2 | 0 | 1 |
| 2018-06-02 | B | B2 | 0 | 2 |
| 2018-06-03 | B | B3 | 0 | 3 |
+------------+------+------+----------+--------------------------+
"

x2 <- "
––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
 Date | Emp1 | Case | Priority | PriorityCountinLast7days 
––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
 2018-06-01 | A | A|1 | 0 | 0 
 2018-06-03 | A | A|2 | 0 | 1 
 2018-06-02 | B | B|2 | 0 | 2 
 2018-06-03 | B | B|3 | 0 | 3 
––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
"

x3 <- "
 Maths | English | Science | History | Class

 0.1 | 0.2 | 0.3 | 0.2 | Y2

 0.9 | 0.5 | 0.7 | 0.4 | Y1

 0.2 | 0.4 | 0.6 | 0.2 | Y2

 0.9 | 0.5 | 0.2 | 0.7 | Y1
"

x4 <- "
 Season | Team | W | AHWO
-------------------------------------
1 | 2017/2018 | TeamA | 2 | 1.75
2 | 2017/2018 | TeamB | 1 | 1.85
3 | 2017/2018 | TeamC | 1 | 1.70
4 | 2017/2018 | TeamD | 0 | 3.10
5 | 2016/2017 | TeamA | 1 | 1.49
6 | 2016/2017 | TeamB | 3 | 1.51
7 | 2016/2017 | TeamC | 2 | 1.90
8 | 2016/2017 | TeamD | 0 | N/A 
"

My go at a function

f <- function(x, header=TRUE, na.strings=c("NA", "N/A"), 
 stringsAsFactors=FALSE, ...) ", " ", x)

 # read the result as a table
 read.table(text=paste(x, collapse="n"), header=header, 
 na.strings=na.strings, stringsAsFactors=stringsAsFactors, ...) 


lapply(c(x1, x2, x3, x4, x5), f)

Output

[[1]]
 Date Emp1 Case Priority PriorityCountinLast7days
1 2018-06-01 A A1 0 0
2 2018-06-03 A A2 0 1
3 2018-06-02 B B2 0 2
4 2018-06-03 B B3 0 3

[[2]]
 Date Emp1 Case Priority PriorityCountinLast7days
1 2018-06-01 A A|1 0 0
2 2018-06-03 A A|2 0 1
3 2018-06-02 B B|2 0 2
4 2018-06-03 B B|3 0 3

[[3]]
 Maths English Science History Class
1 0.1 0.2 0.3 0.2 Y2
2 0.9 0.5 0.7 0.4 Y1
3 0.2 0.4 0.6 0.2 Y2
4 0.9 0.5 0.2 0.7 Y1

[[4]]
 Season Team W AHWO
1 2017/2018 TeamA 2 1.75
2 2017/2018 TeamB 1 1.85
3 2017/2018 TeamC 1 1.70
4 2017/2018 TeamD 0 3.10
5 2016/2017 TeamA 1 1.49
6 2016/2017 TeamB 3 1.51
7 2016/2017 TeamC 2 1.90
8 2016/2017 TeamD 0 NA

x3 is from here (will have to look at the edit history).

x4 is from here

edited Dec 15 '18 at 23:32

answered Oct 2 '18 at 12:26

AkselA

4,34621225

Great answer, upvote. The more general the solution the better.

– Rui Barradas
Oct 2 '18 at 13:27

add a comment |

The issue isn't so much how many lines of code it takes, two or five, not much difference. The question is more whether it will work beyond the example you posted here.

I haven't come across this sort of thing in the wild, but I had a go at constructing another example that I thought could conceivably exist.

I've since come across a couple more cases and added them to the test suite.

x1 <- "
+------------+------+------+----------+--------------------------+
| Date | Emp1 | Case | Priority | PriorityCountinLast7days |
+------------+------+------+----------+--------------------------+
| 2018-06-01 | A | A1 | 0 | 0 |
| 2018-06-03 | A | A2 | 0 | 1 |
| 2018-06-02 | B | B2 | 0 | 2 |
| 2018-06-03 | B | B3 | 0 | 3 |
+------------+------+------+----------+--------------------------+
"

x2 <- "
––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
 Date | Emp1 | Case | Priority | PriorityCountinLast7days 
––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
 2018-06-01 | A | A|1 | 0 | 0 
 2018-06-03 | A | A|2 | 0 | 1 
 2018-06-02 | B | B|2 | 0 | 2 
 2018-06-03 | B | B|3 | 0 | 3 
––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
"

x3 <- "
 Maths | English | Science | History | Class

 0.1 | 0.2 | 0.3 | 0.2 | Y2

 0.9 | 0.5 | 0.7 | 0.4 | Y1

 0.2 | 0.4 | 0.6 | 0.2 | Y2

 0.9 | 0.5 | 0.2 | 0.7 | Y1
"

x4 <- "
 Season | Team | W | AHWO
-------------------------------------
1 | 2017/2018 | TeamA | 2 | 1.75
2 | 2017/2018 | TeamB | 1 | 1.85
3 | 2017/2018 | TeamC | 1 | 1.70
4 | 2017/2018 | TeamD | 0 | 3.10
5 | 2016/2017 | TeamA | 1 | 1.49
6 | 2016/2017 | TeamB | 3 | 1.51
7 | 2016/2017 | TeamC | 2 | 1.90
8 | 2016/2017 | TeamD | 0 | N/A 
"

My go at a function

f <- function(x, header=TRUE, na.strings=c("NA", "N/A"), 
 stringsAsFactors=FALSE, ...) ", " ", x)

 # read the result as a table
 read.table(text=paste(x, collapse="n"), header=header, 
 na.strings=na.strings, stringsAsFactors=stringsAsFactors, ...) 


lapply(c(x1, x2, x3, x4, x5), f)

Output

[[1]]
 Date Emp1 Case Priority PriorityCountinLast7days
1 2018-06-01 A A1 0 0
2 2018-06-03 A A2 0 1
3 2018-06-02 B B2 0 2
4 2018-06-03 B B3 0 3

[[2]]
 Date Emp1 Case Priority PriorityCountinLast7days
1 2018-06-01 A A|1 0 0
2 2018-06-03 A A|2 0 1
3 2018-06-02 B B|2 0 2
4 2018-06-03 B B|3 0 3

[[3]]
 Maths English Science History Class
1 0.1 0.2 0.3 0.2 Y2
2 0.9 0.5 0.7 0.4 Y1
3 0.2 0.4 0.6 0.2 Y2
4 0.9 0.5 0.2 0.7 Y1

[[4]]
 Season Team W AHWO
1 2017/2018 TeamA 2 1.75
2 2017/2018 TeamB 1 1.85
3 2017/2018 TeamC 1 1.70
4 2017/2018 TeamD 0 3.10
5 2016/2017 TeamA 1 1.49
6 2016/2017 TeamB 3 1.51
7 2016/2017 TeamC 2 1.90
8 2016/2017 TeamD 0 NA

x3 is from here (will have to look at the edit history).

x4 is from here

edited Dec 15 '18 at 23:32

answered Oct 2 '18 at 12:26

AkselA

4,34621225

The issue isn't so much how many lines of code it takes, two or five, not much difference. The question is more whether it will work beyond the example you posted here.

I haven't come across this sort of thing in the wild, but I had a go at constructing another example that I thought could conceivably exist.

I've since come across a couple more cases and added them to the test suite.

x1 <- "
+------------+------+------+----------+--------------------------+
| Date | Emp1 | Case | Priority | PriorityCountinLast7days |
+------------+------+------+----------+--------------------------+
| 2018-06-01 | A | A1 | 0 | 0 |
| 2018-06-03 | A | A2 | 0 | 1 |
| 2018-06-02 | B | B2 | 0 | 2 |
| 2018-06-03 | B | B3 | 0 | 3 |
+------------+------+------+----------+--------------------------+
"

x2 <- "
––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
 Date | Emp1 | Case | Priority | PriorityCountinLast7days 
––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
 2018-06-01 | A | A|1 | 0 | 0 
 2018-06-03 | A | A|2 | 0 | 1 
 2018-06-02 | B | B|2 | 0 | 2 
 2018-06-03 | B | B|3 | 0 | 3 
––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
"

x3 <- "
 Maths | English | Science | History | Class

 0.1 | 0.2 | 0.3 | 0.2 | Y2

 0.9 | 0.5 | 0.7 | 0.4 | Y1

 0.2 | 0.4 | 0.6 | 0.2 | Y2

 0.9 | 0.5 | 0.2 | 0.7 | Y1
"

x4 <- "
 Season | Team | W | AHWO
-------------------------------------
1 | 2017/2018 | TeamA | 2 | 1.75
2 | 2017/2018 | TeamB | 1 | 1.85
3 | 2017/2018 | TeamC | 1 | 1.70
4 | 2017/2018 | TeamD | 0 | 3.10
5 | 2016/2017 | TeamA | 1 | 1.49
6 | 2016/2017 | TeamB | 3 | 1.51
7 | 2016/2017 | TeamC | 2 | 1.90
8 | 2016/2017 | TeamD | 0 | N/A 
"

My go at a function

f <- function(x, header=TRUE, na.strings=c("NA", "N/A"), 
 stringsAsFactors=FALSE, ...) ", " ", x)

 # read the result as a table
 read.table(text=paste(x, collapse="n"), header=header, 
 na.strings=na.strings, stringsAsFactors=stringsAsFactors, ...) 


lapply(c(x1, x2, x3, x4, x5), f)

Output

[[1]]
 Date Emp1 Case Priority PriorityCountinLast7days
1 2018-06-01 A A1 0 0
2 2018-06-03 A A2 0 1
3 2018-06-02 B B2 0 2
4 2018-06-03 B B3 0 3

[[2]]
 Date Emp1 Case Priority PriorityCountinLast7days
1 2018-06-01 A A|1 0 0
2 2018-06-03 A A|2 0 1
3 2018-06-02 B B|2 0 2
4 2018-06-03 B B|3 0 3

[[3]]
 Maths English Science History Class
1 0.1 0.2 0.3 0.2 Y2
2 0.9 0.5 0.7 0.4 Y1
3 0.2 0.4 0.6 0.2 Y2
4 0.9 0.5 0.2 0.7 Y1

[[4]]
 Season Team W AHWO
1 2017/2018 TeamA 2 1.75
2 2017/2018 TeamB 1 1.85
3 2017/2018 TeamC 1 1.70
4 2017/2018 TeamD 0 3.10
5 2016/2017 TeamA 1 1.49
6 2016/2017 TeamB 3 1.51
7 2016/2017 TeamC 2 1.90
8 2016/2017 TeamD 0 NA

x3 is from here (will have to look at the edit history).

x4 is from here

edited Dec 15 '18 at 23:32

answered Oct 2 '18 at 12:26

AkselA

4,34621225

edited Dec 15 '18 at 23:32

answered Oct 2 '18 at 12:26

AkselA

4,34621225

answered Oct 2 '18 at 12:26

AkselA

4,34621225

answered Oct 2 '18 at 12:26

AkselA

4,34621225

Great answer, upvote. The more general the solution the better.

– Rui Barradas
Oct 2 '18 at 13:27

add a comment |

Great answer, upvote. The more general the solution the better.

– Rui Barradas
Oct 2 '18 at 13:27

Great answer, upvote. The more general the solution the better.

– Rui Barradas
Oct 2 '18 at 13:27

add a comment |

-3

As it was suggested, you could use dput to save the content of a dataframe to a file, open the file in a text editor and paste its content. An example of mtcar's dataset limited to first 10 rows:

dput(mtcars %>% head(10), file = 'reproducible.txt')

df <- structure(list(mpg = c(21, 21, 22.8, 21.4, 18.7, 18.1, 14.3,
24.4, 22.8, 19.2), cyl = c(6, 6, 4, 6, 8, 6, 8, 4, 4, 6), disp = c(160,
160, 108, 258, 360, 225, 360, 146.7, 140.8, 167.6), hp = c(110,
110, 93, 110, 175, 105, 245, 62, 95, 123), drat = c(3.9, 3.9,
3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92), wt = c(2.62,
2.875, 2.32, 3.215, 3.44, 3.46, 3.57, 3.19, 3.15, 3.44), qsec = c(16.46,
17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20, 22.9, 18.3), vs = c(0,
0, 1, 1, 0, 1, 0, 1, 1, 1), am = c(1, 1, 1, 0, 0, 0, 0, 0, 0,
0), gear = c(4, 4, 4, 3, 3, 3, 3, 4, 4, 4), carb = c(4, 4, 1,
1, 2, 1, 4, 2, 2, 4)), .Names = c("mpg", "cyl", "disp", "hp",
"drat", "wt", "qsec", "vs", "am", "gear", "carb"), row.names = c("Mazda RX4",
"Mazda RX4 Wag", "Datsun 710", "Hornet 4 Drive", "Hornet Sportabout",
"Valiant", "Duster 360", "Merc 240D", "Merc 230", "Merc 280"), class = "data.frame")

edited Aug 26 '18 at 11:19

Peter Mortensen

13.5k1984111

answered Aug 26 '18 at 6:39

Pawel Stradowski

14919

5

Thanks for answering, but I think you have not understood what I am asking. In the case of a dataframe already existing in an R session, there is no problem, the problem is to have a table written in a text document with plus signs, dashes an pipe signs. How can that type of table be read in by R?

– Rui Barradas
Aug 26 '18 at 6:53

add a comment |

-3

As it was suggested, you could use dput to save the content of a dataframe to a file, open the file in a text editor and paste its content. An example of mtcar's dataset limited to first 10 rows:

dput(mtcars %>% head(10), file = 'reproducible.txt')

df <- structure(list(mpg = c(21, 21, 22.8, 21.4, 18.7, 18.1, 14.3,
24.4, 22.8, 19.2), cyl = c(6, 6, 4, 6, 8, 6, 8, 4, 4, 6), disp = c(160,
160, 108, 258, 360, 225, 360, 146.7, 140.8, 167.6), hp = c(110,
110, 93, 110, 175, 105, 245, 62, 95, 123), drat = c(3.9, 3.9,
3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92), wt = c(2.62,
2.875, 2.32, 3.215, 3.44, 3.46, 3.57, 3.19, 3.15, 3.44), qsec = c(16.46,
17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20, 22.9, 18.3), vs = c(0,
0, 1, 1, 0, 1, 0, 1, 1, 1), am = c(1, 1, 1, 0, 0, 0, 0, 0, 0,
0), gear = c(4, 4, 4, 3, 3, 3, 3, 4, 4, 4), carb = c(4, 4, 1,
1, 2, 1, 4, 2, 2, 4)), .Names = c("mpg", "cyl", "disp", "hp",
"drat", "wt", "qsec", "vs", "am", "gear", "carb"), row.names = c("Mazda RX4",
"Mazda RX4 Wag", "Datsun 710", "Hornet 4 Drive", "Hornet Sportabout",
"Valiant", "Duster 360", "Merc 240D", "Merc 230", "Merc 280"), class = "data.frame")

edited Aug 26 '18 at 11:19

Peter Mortensen

13.5k1984111

answered Aug 26 '18 at 6:39

Pawel Stradowski

14919

5

Thanks for answering, but I think you have not understood what I am asking. In the case of a dataframe already existing in an R session, there is no problem, the problem is to have a table written in a text document with plus signs, dashes an pipe signs. How can that type of table be read in by R?

– Rui Barradas
Aug 26 '18 at 6:53

add a comment |

-3

As it was suggested, you could use dput to save the content of a dataframe to a file, open the file in a text editor and paste its content. An example of mtcar's dataset limited to first 10 rows:

dput(mtcars %>% head(10), file = 'reproducible.txt')

df <- structure(list(mpg = c(21, 21, 22.8, 21.4, 18.7, 18.1, 14.3,
24.4, 22.8, 19.2), cyl = c(6, 6, 4, 6, 8, 6, 8, 4, 4, 6), disp = c(160,
160, 108, 258, 360, 225, 360, 146.7, 140.8, 167.6), hp = c(110,
110, 93, 110, 175, 105, 245, 62, 95, 123), drat = c(3.9, 3.9,
3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92), wt = c(2.62,
2.875, 2.32, 3.215, 3.44, 3.46, 3.57, 3.19, 3.15, 3.44), qsec = c(16.46,
17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20, 22.9, 18.3), vs = c(0,
0, 1, 1, 0, 1, 0, 1, 1, 1), am = c(1, 1, 1, 0, 0, 0, 0, 0, 0,
0), gear = c(4, 4, 4, 3, 3, 3, 3, 4, 4, 4), carb = c(4, 4, 1,
1, 2, 1, 4, 2, 2, 4)), .Names = c("mpg", "cyl", "disp", "hp",
"drat", "wt", "qsec", "vs", "am", "gear", "carb"), row.names = c("Mazda RX4",
"Mazda RX4 Wag", "Datsun 710", "Hornet 4 Drive", "Hornet Sportabout",
"Valiant", "Duster 360", "Merc 240D", "Merc 230", "Merc 280"), class = "data.frame")

edited Aug 26 '18 at 11:19

Peter Mortensen

13.5k1984111

answered Aug 26 '18 at 6:39

Pawel Stradowski

14919

As it was suggested, you could use dput to save the content of a dataframe to a file, open the file in a text editor and paste its content. An example of mtcar's dataset limited to first 10 rows:

dput(mtcars %>% head(10), file = 'reproducible.txt')

df <- structure(list(mpg = c(21, 21, 22.8, 21.4, 18.7, 18.1, 14.3,
24.4, 22.8, 19.2), cyl = c(6, 6, 4, 6, 8, 6, 8, 4, 4, 6), disp = c(160,
160, 108, 258, 360, 225, 360, 146.7, 140.8, 167.6), hp = c(110,
110, 93, 110, 175, 105, 245, 62, 95, 123), drat = c(3.9, 3.9,
3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92), wt = c(2.62,
2.875, 2.32, 3.215, 3.44, 3.46, 3.57, 3.19, 3.15, 3.44), qsec = c(16.46,
17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20, 22.9, 18.3), vs = c(0,
0, 1, 1, 0, 1, 0, 1, 1, 1), am = c(1, 1, 1, 0, 0, 0, 0, 0, 0,
0), gear = c(4, 4, 4, 3, 3, 3, 3, 4, 4, 4), carb = c(4, 4, 1,
1, 2, 1, 4, 2, 2, 4)), .Names = c("mpg", "cyl", "disp", "hp",
"drat", "wt", "qsec", "vs", "am", "gear", "carb"), row.names = c("Mazda RX4",
"Mazda RX4 Wag", "Datsun 710", "Hornet 4 Drive", "Hornet Sportabout",
"Valiant", "Duster 360", "Merc 240D", "Merc 230", "Merc 280"), class = "data.frame")

edited Aug 26 '18 at 11:19

Peter Mortensen

13.5k1984111

answered Aug 26 '18 at 6:39

Pawel Stradowski

14919

edited Aug 26 '18 at 11:19

Peter Mortensen

13.5k1984111

edited Aug 26 '18 at 11:19

Peter Mortensen

13.5k1984111

edited Aug 26 '18 at 11:19

Peter Mortensen

13.5k1984111

answered Aug 26 '18 at 6:39

Pawel Stradowski

14919

answered Aug 26 '18 at 6:39

Pawel Stradowski

14919

answered Aug 26 '18 at 6:39

Pawel Stradowski

14919

5

Thanks for answering, but I think you have not understood what I am asking. In the case of a dataframe already existing in an R session, there is no problem, the problem is to have a table written in a text document with plus signs, dashes an pipe signs. How can that type of table be read in by R?

– Rui Barradas
Aug 26 '18 at 6:53

add a comment |

5

Thanks for answering, but I think you have not understood what I am asking. In the case of a dataframe already existing in an R session, there is no problem, the problem is to have a table written in a text document with plus signs, dashes an pipe signs. How can that type of table be read in by R?

– Rui Barradas
Aug 26 '18 at 6:53

Thanks for answering, but I think you have not understood what I am asking. In the case of a dataframe already existing in an R session, there is no problem, the problem is to have a table written in a text document with plus signs, dashes an pipe signs. How can that type of table be read in by R?

– Rui Barradas
Aug 26 '18 at 6:53

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Pfthb