How to reshape data from long to wide format?
up vote
192
down vote
favorite
I'm having trouble rearranging the following data frame:
set.seed(45)
dat1 <- data.frame(
name = rep(c("firstName", "secondName"), each=4),
numbers = rep(1:4, 2),
value = rnorm(8)
)
dat1
name numbers value
1 firstName 1 0.3407997
2 firstName 2 -0.7033403
3 firstName 3 -0.3795377
4 firstName 4 -0.7460474
5 secondName 1 -0.8981073
6 secondName 2 -0.3347941
7 secondName 3 -0.5013782
8 secondName 4 -0.1745357
I want to reshape it so that each unique "name" variable is a rowname, with the "values" as observations along that row and the "numbers" as colnames. Sort of like this:
name 1 2 3 4
1 firstName 0.3407997 -0.7033403 -0.3795377 -0.7460474
5 secondName -0.8981073 -0.3347941 -0.5013782 -0.1745357
I've looked at melt
and cast
and a few other things, but none seem to do the job.
r reshape r-faq
|
show 1 more comment
up vote
192
down vote
favorite
I'm having trouble rearranging the following data frame:
set.seed(45)
dat1 <- data.frame(
name = rep(c("firstName", "secondName"), each=4),
numbers = rep(1:4, 2),
value = rnorm(8)
)
dat1
name numbers value
1 firstName 1 0.3407997
2 firstName 2 -0.7033403
3 firstName 3 -0.3795377
4 firstName 4 -0.7460474
5 secondName 1 -0.8981073
6 secondName 2 -0.3347941
7 secondName 3 -0.5013782
8 secondName 4 -0.1745357
I want to reshape it so that each unique "name" variable is a rowname, with the "values" as observations along that row and the "numbers" as colnames. Sort of like this:
name 1 2 3 4
1 firstName 0.3407997 -0.7033403 -0.3795377 -0.7460474
5 secondName -0.8981073 -0.3347941 -0.5013782 -0.1745357
I've looked at melt
and cast
and a few other things, but none seem to do the job.
r reshape r-faq
1
possible duplicate of Reshape three column data frame to matrix
– Frank
Oct 8 '13 at 20:53
3
@Frank: this is a much better title. long-form and wide-form are the standard terms used. The other answer cannot be found by searching on those terms.
– smci
Apr 11 '14 at 5:21
2
@smci Having a better title is not a good reason not to link the questions, is it? If the questions are essentially the same, it's better for future visitors that they be linked so that all the answers can be found easily. You could mark the duplicate in the other direction instead, I suppose... Also, I do not know why you have made new tags here.
– Frank
Apr 11 '14 at 16:49
1
^^ @Frank - this answer cannot be found by searching on the terms users are likely to use. It's not a question of aesthetics.
– smci
Apr 11 '14 at 21:03
1
@smci By "this answer" you mean... the one I linked to? Yeah, I agree that it is unlikely but someone might land on one or the other in their search, and we want that person to find all the relevant answers (here and there). It's the exact same question so the answers in both places are relevant -- that's why it should be marked as a dupe. The titles don't matter; and which one is a dupe of the other doesn't matter either (as far as I can think of). If you're talking about the tags, maybe you should take it to meta...?
– Frank
Apr 11 '14 at 21:13
|
show 1 more comment
up vote
192
down vote
favorite
up vote
192
down vote
favorite
I'm having trouble rearranging the following data frame:
set.seed(45)
dat1 <- data.frame(
name = rep(c("firstName", "secondName"), each=4),
numbers = rep(1:4, 2),
value = rnorm(8)
)
dat1
name numbers value
1 firstName 1 0.3407997
2 firstName 2 -0.7033403
3 firstName 3 -0.3795377
4 firstName 4 -0.7460474
5 secondName 1 -0.8981073
6 secondName 2 -0.3347941
7 secondName 3 -0.5013782
8 secondName 4 -0.1745357
I want to reshape it so that each unique "name" variable is a rowname, with the "values" as observations along that row and the "numbers" as colnames. Sort of like this:
name 1 2 3 4
1 firstName 0.3407997 -0.7033403 -0.3795377 -0.7460474
5 secondName -0.8981073 -0.3347941 -0.5013782 -0.1745357
I've looked at melt
and cast
and a few other things, but none seem to do the job.
r reshape r-faq
I'm having trouble rearranging the following data frame:
set.seed(45)
dat1 <- data.frame(
name = rep(c("firstName", "secondName"), each=4),
numbers = rep(1:4, 2),
value = rnorm(8)
)
dat1
name numbers value
1 firstName 1 0.3407997
2 firstName 2 -0.7033403
3 firstName 3 -0.3795377
4 firstName 4 -0.7460474
5 secondName 1 -0.8981073
6 secondName 2 -0.3347941
7 secondName 3 -0.5013782
8 secondName 4 -0.1745357
I want to reshape it so that each unique "name" variable is a rowname, with the "values" as observations along that row and the "numbers" as colnames. Sort of like this:
name 1 2 3 4
1 firstName 0.3407997 -0.7033403 -0.3795377 -0.7460474
5 secondName -0.8981073 -0.3347941 -0.5013782 -0.1745357
I've looked at melt
and cast
and a few other things, but none seem to do the job.
r reshape r-faq
r reshape r-faq
edited Mar 22 '17 at 16:25
Taryn♦
187k45284348
187k45284348
asked May 4 '11 at 22:27
Steve
1,82472428
1,82472428
1
possible duplicate of Reshape three column data frame to matrix
– Frank
Oct 8 '13 at 20:53
3
@Frank: this is a much better title. long-form and wide-form are the standard terms used. The other answer cannot be found by searching on those terms.
– smci
Apr 11 '14 at 5:21
2
@smci Having a better title is not a good reason not to link the questions, is it? If the questions are essentially the same, it's better for future visitors that they be linked so that all the answers can be found easily. You could mark the duplicate in the other direction instead, I suppose... Also, I do not know why you have made new tags here.
– Frank
Apr 11 '14 at 16:49
1
^^ @Frank - this answer cannot be found by searching on the terms users are likely to use. It's not a question of aesthetics.
– smci
Apr 11 '14 at 21:03
1
@smci By "this answer" you mean... the one I linked to? Yeah, I agree that it is unlikely but someone might land on one or the other in their search, and we want that person to find all the relevant answers (here and there). It's the exact same question so the answers in both places are relevant -- that's why it should be marked as a dupe. The titles don't matter; and which one is a dupe of the other doesn't matter either (as far as I can think of). If you're talking about the tags, maybe you should take it to meta...?
– Frank
Apr 11 '14 at 21:13
|
show 1 more comment
1
possible duplicate of Reshape three column data frame to matrix
– Frank
Oct 8 '13 at 20:53
3
@Frank: this is a much better title. long-form and wide-form are the standard terms used. The other answer cannot be found by searching on those terms.
– smci
Apr 11 '14 at 5:21
2
@smci Having a better title is not a good reason not to link the questions, is it? If the questions are essentially the same, it's better for future visitors that they be linked so that all the answers can be found easily. You could mark the duplicate in the other direction instead, I suppose... Also, I do not know why you have made new tags here.
– Frank
Apr 11 '14 at 16:49
1
^^ @Frank - this answer cannot be found by searching on the terms users are likely to use. It's not a question of aesthetics.
– smci
Apr 11 '14 at 21:03
1
@smci By "this answer" you mean... the one I linked to? Yeah, I agree that it is unlikely but someone might land on one or the other in their search, and we want that person to find all the relevant answers (here and there). It's the exact same question so the answers in both places are relevant -- that's why it should be marked as a dupe. The titles don't matter; and which one is a dupe of the other doesn't matter either (as far as I can think of). If you're talking about the tags, maybe you should take it to meta...?
– Frank
Apr 11 '14 at 21:13
1
1
possible duplicate of Reshape three column data frame to matrix
– Frank
Oct 8 '13 at 20:53
possible duplicate of Reshape three column data frame to matrix
– Frank
Oct 8 '13 at 20:53
3
3
@Frank: this is a much better title. long-form and wide-form are the standard terms used. The other answer cannot be found by searching on those terms.
– smci
Apr 11 '14 at 5:21
@Frank: this is a much better title. long-form and wide-form are the standard terms used. The other answer cannot be found by searching on those terms.
– smci
Apr 11 '14 at 5:21
2
2
@smci Having a better title is not a good reason not to link the questions, is it? If the questions are essentially the same, it's better for future visitors that they be linked so that all the answers can be found easily. You could mark the duplicate in the other direction instead, I suppose... Also, I do not know why you have made new tags here.
– Frank
Apr 11 '14 at 16:49
@smci Having a better title is not a good reason not to link the questions, is it? If the questions are essentially the same, it's better for future visitors that they be linked so that all the answers can be found easily. You could mark the duplicate in the other direction instead, I suppose... Also, I do not know why you have made new tags here.
– Frank
Apr 11 '14 at 16:49
1
1
^^ @Frank - this answer cannot be found by searching on the terms users are likely to use. It's not a question of aesthetics.
– smci
Apr 11 '14 at 21:03
^^ @Frank - this answer cannot be found by searching on the terms users are likely to use. It's not a question of aesthetics.
– smci
Apr 11 '14 at 21:03
1
1
@smci By "this answer" you mean... the one I linked to? Yeah, I agree that it is unlikely but someone might land on one or the other in their search, and we want that person to find all the relevant answers (here and there). It's the exact same question so the answers in both places are relevant -- that's why it should be marked as a dupe. The titles don't matter; and which one is a dupe of the other doesn't matter either (as far as I can think of). If you're talking about the tags, maybe you should take it to meta...?
– Frank
Apr 11 '14 at 21:13
@smci By "this answer" you mean... the one I linked to? Yeah, I agree that it is unlikely but someone might land on one or the other in their search, and we want that person to find all the relevant answers (here and there). It's the exact same question so the answers in both places are relevant -- that's why it should be marked as a dupe. The titles don't matter; and which one is a dupe of the other doesn't matter either (as far as I can think of). If you're talking about the tags, maybe you should take it to meta...?
– Frank
Apr 11 '14 at 21:13
|
show 1 more comment
9 Answers
9
active
oldest
votes
up vote
186
down vote
accepted
Using reshape
function:
reshape(dat1, idvar = "name", timevar = "numbers", direction = "wide")
12
+1 and you don't need to rely on external packages, sincereshape
comes withstats
. Not to mention that it's faster! =)
– aL3xa
May 5 '11 at 0:07
102
Good luck figuring out the arguments you need though
– hadley
May 5 '11 at 1:40
2
@hadley, yepp... that's true! =)
– aL3xa
May 5 '11 at 9:45
2
reshape
is an outstanding example for a horrible function API. It is very close to useless.
– NoBackingDown
Oct 26 '17 at 15:18
2
Thereshape
comments and similar argument names aren't all that helpful. However, I have found that for long to wide, you need to providedata =
your data.frame,idvar
= the variable that identifies your groups,v.names
= the variables that will become multiple columns in wide format,timevar
= the variable containing the values that will be appended tov.names
in wide format,direction = wide
, andsep = "_"
. Clear enough? ;)
– Brian D
Nov 17 '17 at 17:11
|
show 2 more comments
up vote
103
down vote
The new (in 2014) tidyr
package also does this simply, with gather()
/spread()
being the terms for melt
/cast
.
library(tidyr)
spread(dat1, key = numbers, value = value)
From github,
tidyr
is a reframing ofreshape2
designed to accompany the tidy data framework, and to work hand-in-hand withmagrittr
anddplyr
to build a solid pipeline for data analysis.
Just as
reshape2
did less than reshape,tidyr
does less thanreshape2
. It's designed specifically for tidying data, not the general reshaping thatreshape2
does, or the general aggregation that reshape did. In particular, built-in methods only work for data frames, andtidyr
provides no margins or aggregation.
2
Just wanted to add a link to the R Cookbook page that discusses the use of these functions fromtidyr
andreshape2
. It provides good examples and explanations.
– Jake
Apr 12 '17 at 13:01
add a comment |
up vote
63
down vote
You can do this with the reshape()
function, or with the melt()
/ cast()
functions in the reshape package. For the second option, example code is
library(reshape)
cast(dat1, name ~ numbers)
Or using reshape2
library(reshape2)
dcast(dat1, name ~ numbers)
thank you! I can't believe I didn't see that - I just looked at 'cast' before posting, but couldn't get it to work how I wanted.
– Steve
May 4 '11 at 22:45
7
+1 And usereshape2
for a performance gain.
– Andrie
May 6 '11 at 11:56
1
It might be worth noting that just usingcast
ordcast
will not work nicely if you don't have a clear "value" column. Trydat <- data.frame(id=c(1,1,2,2),blah=c(8,4,7,6),index=c(1,2,1,2)); dcast(dat, id ~ index); cast(dat, id ~ index)
and you will not get what you expect. You need to explicitly note thevalue/value.var
-cast(dat, id ~ index, value="blah")
anddcast(dat, id ~ index, value.var="blah")
for instance.
– thelatemail
Jun 21 '17 at 22:37
add a comment |
up vote
27
down vote
Another option if performance is a concern is to use data.table
's extension of reshape2
's melt & dcast functions
(Reference: Efficient reshaping using data.tables)
library(data.table)
setDT(dat1)
dcast(dat1, name ~ numbers, value.var = "value")
# name 1 2 3 4
# 1: firstName 0.1836433 -0.8356286 1.5952808 0.3295078
# 2: secondName -0.8204684 0.4874291 0.7383247 0.5757814
And, as of data.table v1.9.6 we can cast on multiple columns
## add an extra column
dat1[, value2 := value * 2]
## cast multiple value columns
dcast(dat1, name ~ numbers, value.var = c("value", "value2"))
# name value_1 value_2 value_3 value_4 value2_1 value2_2 value2_3 value2_4
# 1: firstName 0.1836433 -0.8356286 1.5952808 0.3295078 0.3672866 -1.6712572 3.190562 0.6590155
# 2: secondName -0.8204684 0.4874291 0.7383247 0.5757814 -1.6409368 0.9748581 1.476649 1.1515627
3
data.table
approach is the best ! very efficient ... you will see the difference whenname
is a combination of 30-40 columns !!
– joel.wilson
Aug 31 '17 at 12:06
3
I personally think this is the best answer
– cloudscomputes
Oct 20 '17 at 4:37
add a comment |
up vote
22
down vote
Using your example dataframe, we could:
xtabs(value ~ name + numbers, data = dat1)
2
this one is good, but the result is of format table which not may be not so easy to handle as data.frame or data.table, both has plenty of packages
– cloudscomputes
Oct 20 '17 at 4:44
add a comment |
up vote
14
down vote
Other two options:
Base package:
df <- unstack(dat1, form = value ~ numbers)
rownames(df) <- unique(dat1$name)
df
sqldf
package:
library(sqldf)
sqldf('SELECT name,
MAX(CASE WHEN numbers = 1 THEN value ELSE NULL END) x1,
MAX(CASE WHEN numbers = 2 THEN value ELSE NULL END) x2,
MAX(CASE WHEN numbers = 3 THEN value ELSE NULL END) x3,
MAX(CASE WHEN numbers = 4 THEN value ELSE NULL END) x4
FROM dat1
GROUP BY name')
add a comment |
up vote
9
down vote
Using base R aggregate
function:
aggregate(value ~ name, dat1, I)
# name value.1 value.2 value.3 value.4
#1 firstName 0.4145 -0.4747 0.0659 -0.5024
#2 secondName -0.8259 0.1669 -0.8962 0.1681
1
why use the paste function instead of theI
(identity function)?
– Onyambu
Dec 25 '17 at 4:04
add a comment |
up vote
4
down vote
There's very powerful new package from genius data scientists at Win-Vector (folks that made vtreat
, seplyr
and replyr
) called cdata
. It implements "coordinated data" principles described in this document and also in this blog post. The idea is that regardless how you organize your data, it should be possible to identify individual data points using a system of "data coordinates". Here's a excerpt from the recent blog post by John Mount:
The whole system is based on two primitives or operators
cdata::moveValuesToRowsD() and cdata::moveValuesToColumnsD(). These
operators have pivot, un-pivot, one-hot encode, transpose, moving
multiple rows and columns, and many other transforms as simple special
cases.
It is easy to write many different operations in terms of the
cdata primitives. These operators can work-in memory or at big data
scale (with databases and Apache Spark; for big data use the
cdata::moveValuesToRowsN() and cdata::moveValuesToColumnsN()
variants). The transforms are controlled by a control table that
itself is a diagram of (or picture of) the transform.
We will first build the control table (see blog post for details) and then perform the move of data from rows to columns.
library(cdata)
# first build the control table
pivotControlTable <- buildPivotControlTableD(table = dat1, # reference to dataset
columnToTakeKeysFrom = 'numbers', # this will become column headers
columnToTakeValuesFrom = 'value', # this contains data
sep="_") # optional for making column names
# perform the move of data to columns
dat_wide <- moveValuesToColumnsD(tallTable = dat1, # reference to dataset
keyColumns = c('name'), # this(these) column(s) should stay untouched
controlTable = pivotControlTable# control table above
)
dat_wide
#> name numbers_1 numbers_2 numbers_3 numbers_4
#> 1 firstName 0.3407997 -0.7033403 -0.3795377 -0.7460474
#> 2 secondName -0.8981073 -0.3347941 -0.5013782 -0.1745357
add a comment |
up vote
2
down vote
The base reshape
function works perfectly fine:
df <- data.frame(
year = c(rep(2000, 12), rep(2001, 12)),
month = rep(1:12, 2),
values = rnorm(24)
)
df_wide <- reshape(df, idvar="year", timevar="month", v.names="values", direction="wide", sep="_")
df_wide
Where
idvar
is the column of classes that separates rowstimevar
is the column of classes to cast widev.names
is the column containing numeric valuesdirection
specifies wide or long format- the optional
sep
argument is the separator used in betweentimevar
class names andv.names
in the outputdata.frame
.
If no idvar
exists, create one before using the reshape()
function:
df$id <- c(rep("year1", 12), rep("year2", 12))
df_wide <- reshape(df, idvar="id", timevar="month", v.names="values", direction="wide", sep="_")
df_wide
Just remember that idvar
is required! The timevar
and v.names
part is easy. The output of this function is more predictable than some of the others, as everything is explicitly defined.
add a comment |
9 Answers
9
active
oldest
votes
9 Answers
9
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
186
down vote
accepted
Using reshape
function:
reshape(dat1, idvar = "name", timevar = "numbers", direction = "wide")
12
+1 and you don't need to rely on external packages, sincereshape
comes withstats
. Not to mention that it's faster! =)
– aL3xa
May 5 '11 at 0:07
102
Good luck figuring out the arguments you need though
– hadley
May 5 '11 at 1:40
2
@hadley, yepp... that's true! =)
– aL3xa
May 5 '11 at 9:45
2
reshape
is an outstanding example for a horrible function API. It is very close to useless.
– NoBackingDown
Oct 26 '17 at 15:18
2
Thereshape
comments and similar argument names aren't all that helpful. However, I have found that for long to wide, you need to providedata =
your data.frame,idvar
= the variable that identifies your groups,v.names
= the variables that will become multiple columns in wide format,timevar
= the variable containing the values that will be appended tov.names
in wide format,direction = wide
, andsep = "_"
. Clear enough? ;)
– Brian D
Nov 17 '17 at 17:11
|
show 2 more comments
up vote
186
down vote
accepted
Using reshape
function:
reshape(dat1, idvar = "name", timevar = "numbers", direction = "wide")
12
+1 and you don't need to rely on external packages, sincereshape
comes withstats
. Not to mention that it's faster! =)
– aL3xa
May 5 '11 at 0:07
102
Good luck figuring out the arguments you need though
– hadley
May 5 '11 at 1:40
2
@hadley, yepp... that's true! =)
– aL3xa
May 5 '11 at 9:45
2
reshape
is an outstanding example for a horrible function API. It is very close to useless.
– NoBackingDown
Oct 26 '17 at 15:18
2
Thereshape
comments and similar argument names aren't all that helpful. However, I have found that for long to wide, you need to providedata =
your data.frame,idvar
= the variable that identifies your groups,v.names
= the variables that will become multiple columns in wide format,timevar
= the variable containing the values that will be appended tov.names
in wide format,direction = wide
, andsep = "_"
. Clear enough? ;)
– Brian D
Nov 17 '17 at 17:11
|
show 2 more comments
up vote
186
down vote
accepted
up vote
186
down vote
accepted
Using reshape
function:
reshape(dat1, idvar = "name", timevar = "numbers", direction = "wide")
Using reshape
function:
reshape(dat1, idvar = "name", timevar = "numbers", direction = "wide")
answered May 4 '11 at 23:20
Chase
47.9k12115147
47.9k12115147
12
+1 and you don't need to rely on external packages, sincereshape
comes withstats
. Not to mention that it's faster! =)
– aL3xa
May 5 '11 at 0:07
102
Good luck figuring out the arguments you need though
– hadley
May 5 '11 at 1:40
2
@hadley, yepp... that's true! =)
– aL3xa
May 5 '11 at 9:45
2
reshape
is an outstanding example for a horrible function API. It is very close to useless.
– NoBackingDown
Oct 26 '17 at 15:18
2
Thereshape
comments and similar argument names aren't all that helpful. However, I have found that for long to wide, you need to providedata =
your data.frame,idvar
= the variable that identifies your groups,v.names
= the variables that will become multiple columns in wide format,timevar
= the variable containing the values that will be appended tov.names
in wide format,direction = wide
, andsep = "_"
. Clear enough? ;)
– Brian D
Nov 17 '17 at 17:11
|
show 2 more comments
12
+1 and you don't need to rely on external packages, sincereshape
comes withstats
. Not to mention that it's faster! =)
– aL3xa
May 5 '11 at 0:07
102
Good luck figuring out the arguments you need though
– hadley
May 5 '11 at 1:40
2
@hadley, yepp... that's true! =)
– aL3xa
May 5 '11 at 9:45
2
reshape
is an outstanding example for a horrible function API. It is very close to useless.
– NoBackingDown
Oct 26 '17 at 15:18
2
Thereshape
comments and similar argument names aren't all that helpful. However, I have found that for long to wide, you need to providedata =
your data.frame,idvar
= the variable that identifies your groups,v.names
= the variables that will become multiple columns in wide format,timevar
= the variable containing the values that will be appended tov.names
in wide format,direction = wide
, andsep = "_"
. Clear enough? ;)
– Brian D
Nov 17 '17 at 17:11
12
12
+1 and you don't need to rely on external packages, since
reshape
comes with stats
. Not to mention that it's faster! =)– aL3xa
May 5 '11 at 0:07
+1 and you don't need to rely on external packages, since
reshape
comes with stats
. Not to mention that it's faster! =)– aL3xa
May 5 '11 at 0:07
102
102
Good luck figuring out the arguments you need though
– hadley
May 5 '11 at 1:40
Good luck figuring out the arguments you need though
– hadley
May 5 '11 at 1:40
2
2
@hadley, yepp... that's true! =)
– aL3xa
May 5 '11 at 9:45
@hadley, yepp... that's true! =)
– aL3xa
May 5 '11 at 9:45
2
2
reshape
is an outstanding example for a horrible function API. It is very close to useless.– NoBackingDown
Oct 26 '17 at 15:18
reshape
is an outstanding example for a horrible function API. It is very close to useless.– NoBackingDown
Oct 26 '17 at 15:18
2
2
The
reshape
comments and similar argument names aren't all that helpful. However, I have found that for long to wide, you need to provide data =
your data.frame, idvar
= the variable that identifies your groups, v.names
= the variables that will become multiple columns in wide format, timevar
= the variable containing the values that will be appended to v.names
in wide format, direction = wide
, and sep = "_"
. Clear enough? ;)– Brian D
Nov 17 '17 at 17:11
The
reshape
comments and similar argument names aren't all that helpful. However, I have found that for long to wide, you need to provide data =
your data.frame, idvar
= the variable that identifies your groups, v.names
= the variables that will become multiple columns in wide format, timevar
= the variable containing the values that will be appended to v.names
in wide format, direction = wide
, and sep = "_"
. Clear enough? ;)– Brian D
Nov 17 '17 at 17:11
|
show 2 more comments
up vote
103
down vote
The new (in 2014) tidyr
package also does this simply, with gather()
/spread()
being the terms for melt
/cast
.
library(tidyr)
spread(dat1, key = numbers, value = value)
From github,
tidyr
is a reframing ofreshape2
designed to accompany the tidy data framework, and to work hand-in-hand withmagrittr
anddplyr
to build a solid pipeline for data analysis.
Just as
reshape2
did less than reshape,tidyr
does less thanreshape2
. It's designed specifically for tidying data, not the general reshaping thatreshape2
does, or the general aggregation that reshape did. In particular, built-in methods only work for data frames, andtidyr
provides no margins or aggregation.
2
Just wanted to add a link to the R Cookbook page that discusses the use of these functions fromtidyr
andreshape2
. It provides good examples and explanations.
– Jake
Apr 12 '17 at 13:01
add a comment |
up vote
103
down vote
The new (in 2014) tidyr
package also does this simply, with gather()
/spread()
being the terms for melt
/cast
.
library(tidyr)
spread(dat1, key = numbers, value = value)
From github,
tidyr
is a reframing ofreshape2
designed to accompany the tidy data framework, and to work hand-in-hand withmagrittr
anddplyr
to build a solid pipeline for data analysis.
Just as
reshape2
did less than reshape,tidyr
does less thanreshape2
. It's designed specifically for tidying data, not the general reshaping thatreshape2
does, or the general aggregation that reshape did. In particular, built-in methods only work for data frames, andtidyr
provides no margins or aggregation.
2
Just wanted to add a link to the R Cookbook page that discusses the use of these functions fromtidyr
andreshape2
. It provides good examples and explanations.
– Jake
Apr 12 '17 at 13:01
add a comment |
up vote
103
down vote
up vote
103
down vote
The new (in 2014) tidyr
package also does this simply, with gather()
/spread()
being the terms for melt
/cast
.
library(tidyr)
spread(dat1, key = numbers, value = value)
From github,
tidyr
is a reframing ofreshape2
designed to accompany the tidy data framework, and to work hand-in-hand withmagrittr
anddplyr
to build a solid pipeline for data analysis.
Just as
reshape2
did less than reshape,tidyr
does less thanreshape2
. It's designed specifically for tidying data, not the general reshaping thatreshape2
does, or the general aggregation that reshape did. In particular, built-in methods only work for data frames, andtidyr
provides no margins or aggregation.
The new (in 2014) tidyr
package also does this simply, with gather()
/spread()
being the terms for melt
/cast
.
library(tidyr)
spread(dat1, key = numbers, value = value)
From github,
tidyr
is a reframing ofreshape2
designed to accompany the tidy data framework, and to work hand-in-hand withmagrittr
anddplyr
to build a solid pipeline for data analysis.
Just as
reshape2
did less than reshape,tidyr
does less thanreshape2
. It's designed specifically for tidying data, not the general reshaping thatreshape2
does, or the general aggregation that reshape did. In particular, built-in methods only work for data frames, andtidyr
provides no margins or aggregation.
edited Jul 29 '15 at 16:39
answered Jul 29 '14 at 19:37
Gregor
61.3k988163
61.3k988163
2
Just wanted to add a link to the R Cookbook page that discusses the use of these functions fromtidyr
andreshape2
. It provides good examples and explanations.
– Jake
Apr 12 '17 at 13:01
add a comment |
2
Just wanted to add a link to the R Cookbook page that discusses the use of these functions fromtidyr
andreshape2
. It provides good examples and explanations.
– Jake
Apr 12 '17 at 13:01
2
2
Just wanted to add a link to the R Cookbook page that discusses the use of these functions from
tidyr
and reshape2
. It provides good examples and explanations.– Jake
Apr 12 '17 at 13:01
Just wanted to add a link to the R Cookbook page that discusses the use of these functions from
tidyr
and reshape2
. It provides good examples and explanations.– Jake
Apr 12 '17 at 13:01
add a comment |
up vote
63
down vote
You can do this with the reshape()
function, or with the melt()
/ cast()
functions in the reshape package. For the second option, example code is
library(reshape)
cast(dat1, name ~ numbers)
Or using reshape2
library(reshape2)
dcast(dat1, name ~ numbers)
thank you! I can't believe I didn't see that - I just looked at 'cast' before posting, but couldn't get it to work how I wanted.
– Steve
May 4 '11 at 22:45
7
+1 And usereshape2
for a performance gain.
– Andrie
May 6 '11 at 11:56
1
It might be worth noting that just usingcast
ordcast
will not work nicely if you don't have a clear "value" column. Trydat <- data.frame(id=c(1,1,2,2),blah=c(8,4,7,6),index=c(1,2,1,2)); dcast(dat, id ~ index); cast(dat, id ~ index)
and you will not get what you expect. You need to explicitly note thevalue/value.var
-cast(dat, id ~ index, value="blah")
anddcast(dat, id ~ index, value.var="blah")
for instance.
– thelatemail
Jun 21 '17 at 22:37
add a comment |
up vote
63
down vote
You can do this with the reshape()
function, or with the melt()
/ cast()
functions in the reshape package. For the second option, example code is
library(reshape)
cast(dat1, name ~ numbers)
Or using reshape2
library(reshape2)
dcast(dat1, name ~ numbers)
thank you! I can't believe I didn't see that - I just looked at 'cast' before posting, but couldn't get it to work how I wanted.
– Steve
May 4 '11 at 22:45
7
+1 And usereshape2
for a performance gain.
– Andrie
May 6 '11 at 11:56
1
It might be worth noting that just usingcast
ordcast
will not work nicely if you don't have a clear "value" column. Trydat <- data.frame(id=c(1,1,2,2),blah=c(8,4,7,6),index=c(1,2,1,2)); dcast(dat, id ~ index); cast(dat, id ~ index)
and you will not get what you expect. You need to explicitly note thevalue/value.var
-cast(dat, id ~ index, value="blah")
anddcast(dat, id ~ index, value.var="blah")
for instance.
– thelatemail
Jun 21 '17 at 22:37
add a comment |
up vote
63
down vote
up vote
63
down vote
You can do this with the reshape()
function, or with the melt()
/ cast()
functions in the reshape package. For the second option, example code is
library(reshape)
cast(dat1, name ~ numbers)
Or using reshape2
library(reshape2)
dcast(dat1, name ~ numbers)
You can do this with the reshape()
function, or with the melt()
/ cast()
functions in the reshape package. For the second option, example code is
library(reshape)
cast(dat1, name ~ numbers)
Or using reshape2
library(reshape2)
dcast(dat1, name ~ numbers)
edited May 26 '15 at 14:52
David Arenburg
77.3k1092156
77.3k1092156
answered May 4 '11 at 22:42
Ista
7,03312426
7,03312426
thank you! I can't believe I didn't see that - I just looked at 'cast' before posting, but couldn't get it to work how I wanted.
– Steve
May 4 '11 at 22:45
7
+1 And usereshape2
for a performance gain.
– Andrie
May 6 '11 at 11:56
1
It might be worth noting that just usingcast
ordcast
will not work nicely if you don't have a clear "value" column. Trydat <- data.frame(id=c(1,1,2,2),blah=c(8,4,7,6),index=c(1,2,1,2)); dcast(dat, id ~ index); cast(dat, id ~ index)
and you will not get what you expect. You need to explicitly note thevalue/value.var
-cast(dat, id ~ index, value="blah")
anddcast(dat, id ~ index, value.var="blah")
for instance.
– thelatemail
Jun 21 '17 at 22:37
add a comment |
thank you! I can't believe I didn't see that - I just looked at 'cast' before posting, but couldn't get it to work how I wanted.
– Steve
May 4 '11 at 22:45
7
+1 And usereshape2
for a performance gain.
– Andrie
May 6 '11 at 11:56
1
It might be worth noting that just usingcast
ordcast
will not work nicely if you don't have a clear "value" column. Trydat <- data.frame(id=c(1,1,2,2),blah=c(8,4,7,6),index=c(1,2,1,2)); dcast(dat, id ~ index); cast(dat, id ~ index)
and you will not get what you expect. You need to explicitly note thevalue/value.var
-cast(dat, id ~ index, value="blah")
anddcast(dat, id ~ index, value.var="blah")
for instance.
– thelatemail
Jun 21 '17 at 22:37
thank you! I can't believe I didn't see that - I just looked at 'cast' before posting, but couldn't get it to work how I wanted.
– Steve
May 4 '11 at 22:45
thank you! I can't believe I didn't see that - I just looked at 'cast' before posting, but couldn't get it to work how I wanted.
– Steve
May 4 '11 at 22:45
7
7
+1 And use
reshape2
for a performance gain.– Andrie
May 6 '11 at 11:56
+1 And use
reshape2
for a performance gain.– Andrie
May 6 '11 at 11:56
1
1
It might be worth noting that just using
cast
or dcast
will not work nicely if you don't have a clear "value" column. Try dat <- data.frame(id=c(1,1,2,2),blah=c(8,4,7,6),index=c(1,2,1,2)); dcast(dat, id ~ index); cast(dat, id ~ index)
and you will not get what you expect. You need to explicitly note the value/value.var
- cast(dat, id ~ index, value="blah")
and dcast(dat, id ~ index, value.var="blah")
for instance.– thelatemail
Jun 21 '17 at 22:37
It might be worth noting that just using
cast
or dcast
will not work nicely if you don't have a clear "value" column. Try dat <- data.frame(id=c(1,1,2,2),blah=c(8,4,7,6),index=c(1,2,1,2)); dcast(dat, id ~ index); cast(dat, id ~ index)
and you will not get what you expect. You need to explicitly note the value/value.var
- cast(dat, id ~ index, value="blah")
and dcast(dat, id ~ index, value.var="blah")
for instance.– thelatemail
Jun 21 '17 at 22:37
add a comment |
up vote
27
down vote
Another option if performance is a concern is to use data.table
's extension of reshape2
's melt & dcast functions
(Reference: Efficient reshaping using data.tables)
library(data.table)
setDT(dat1)
dcast(dat1, name ~ numbers, value.var = "value")
# name 1 2 3 4
# 1: firstName 0.1836433 -0.8356286 1.5952808 0.3295078
# 2: secondName -0.8204684 0.4874291 0.7383247 0.5757814
And, as of data.table v1.9.6 we can cast on multiple columns
## add an extra column
dat1[, value2 := value * 2]
## cast multiple value columns
dcast(dat1, name ~ numbers, value.var = c("value", "value2"))
# name value_1 value_2 value_3 value_4 value2_1 value2_2 value2_3 value2_4
# 1: firstName 0.1836433 -0.8356286 1.5952808 0.3295078 0.3672866 -1.6712572 3.190562 0.6590155
# 2: secondName -0.8204684 0.4874291 0.7383247 0.5757814 -1.6409368 0.9748581 1.476649 1.1515627
3
data.table
approach is the best ! very efficient ... you will see the difference whenname
is a combination of 30-40 columns !!
– joel.wilson
Aug 31 '17 at 12:06
3
I personally think this is the best answer
– cloudscomputes
Oct 20 '17 at 4:37
add a comment |
up vote
27
down vote
Another option if performance is a concern is to use data.table
's extension of reshape2
's melt & dcast functions
(Reference: Efficient reshaping using data.tables)
library(data.table)
setDT(dat1)
dcast(dat1, name ~ numbers, value.var = "value")
# name 1 2 3 4
# 1: firstName 0.1836433 -0.8356286 1.5952808 0.3295078
# 2: secondName -0.8204684 0.4874291 0.7383247 0.5757814
And, as of data.table v1.9.6 we can cast on multiple columns
## add an extra column
dat1[, value2 := value * 2]
## cast multiple value columns
dcast(dat1, name ~ numbers, value.var = c("value", "value2"))
# name value_1 value_2 value_3 value_4 value2_1 value2_2 value2_3 value2_4
# 1: firstName 0.1836433 -0.8356286 1.5952808 0.3295078 0.3672866 -1.6712572 3.190562 0.6590155
# 2: secondName -0.8204684 0.4874291 0.7383247 0.5757814 -1.6409368 0.9748581 1.476649 1.1515627
3
data.table
approach is the best ! very efficient ... you will see the difference whenname
is a combination of 30-40 columns !!
– joel.wilson
Aug 31 '17 at 12:06
3
I personally think this is the best answer
– cloudscomputes
Oct 20 '17 at 4:37
add a comment |
up vote
27
down vote
up vote
27
down vote
Another option if performance is a concern is to use data.table
's extension of reshape2
's melt & dcast functions
(Reference: Efficient reshaping using data.tables)
library(data.table)
setDT(dat1)
dcast(dat1, name ~ numbers, value.var = "value")
# name 1 2 3 4
# 1: firstName 0.1836433 -0.8356286 1.5952808 0.3295078
# 2: secondName -0.8204684 0.4874291 0.7383247 0.5757814
And, as of data.table v1.9.6 we can cast on multiple columns
## add an extra column
dat1[, value2 := value * 2]
## cast multiple value columns
dcast(dat1, name ~ numbers, value.var = c("value", "value2"))
# name value_1 value_2 value_3 value_4 value2_1 value2_2 value2_3 value2_4
# 1: firstName 0.1836433 -0.8356286 1.5952808 0.3295078 0.3672866 -1.6712572 3.190562 0.6590155
# 2: secondName -0.8204684 0.4874291 0.7383247 0.5757814 -1.6409368 0.9748581 1.476649 1.1515627
Another option if performance is a concern is to use data.table
's extension of reshape2
's melt & dcast functions
(Reference: Efficient reshaping using data.tables)
library(data.table)
setDT(dat1)
dcast(dat1, name ~ numbers, value.var = "value")
# name 1 2 3 4
# 1: firstName 0.1836433 -0.8356286 1.5952808 0.3295078
# 2: secondName -0.8204684 0.4874291 0.7383247 0.5757814
And, as of data.table v1.9.6 we can cast on multiple columns
## add an extra column
dat1[, value2 := value * 2]
## cast multiple value columns
dcast(dat1, name ~ numbers, value.var = c("value", "value2"))
# name value_1 value_2 value_3 value_4 value2_1 value2_2 value2_3 value2_4
# 1: firstName 0.1836433 -0.8356286 1.5952808 0.3295078 0.3672866 -1.6712572 3.190562 0.6590155
# 2: secondName -0.8204684 0.4874291 0.7383247 0.5757814 -1.6409368 0.9748581 1.476649 1.1515627
edited Mar 27 '16 at 22:51
answered Mar 27 '16 at 22:35
SymbolixAU
16k32885
16k32885
3
data.table
approach is the best ! very efficient ... you will see the difference whenname
is a combination of 30-40 columns !!
– joel.wilson
Aug 31 '17 at 12:06
3
I personally think this is the best answer
– cloudscomputes
Oct 20 '17 at 4:37
add a comment |
3
data.table
approach is the best ! very efficient ... you will see the difference whenname
is a combination of 30-40 columns !!
– joel.wilson
Aug 31 '17 at 12:06
3
I personally think this is the best answer
– cloudscomputes
Oct 20 '17 at 4:37
3
3
data.table
approach is the best ! very efficient ... you will see the difference when name
is a combination of 30-40 columns !!– joel.wilson
Aug 31 '17 at 12:06
data.table
approach is the best ! very efficient ... you will see the difference when name
is a combination of 30-40 columns !!– joel.wilson
Aug 31 '17 at 12:06
3
3
I personally think this is the best answer
– cloudscomputes
Oct 20 '17 at 4:37
I personally think this is the best answer
– cloudscomputes
Oct 20 '17 at 4:37
add a comment |
up vote
22
down vote
Using your example dataframe, we could:
xtabs(value ~ name + numbers, data = dat1)
2
this one is good, but the result is of format table which not may be not so easy to handle as data.frame or data.table, both has plenty of packages
– cloudscomputes
Oct 20 '17 at 4:44
add a comment |
up vote
22
down vote
Using your example dataframe, we could:
xtabs(value ~ name + numbers, data = dat1)
2
this one is good, but the result is of format table which not may be not so easy to handle as data.frame or data.table, both has plenty of packages
– cloudscomputes
Oct 20 '17 at 4:44
add a comment |
up vote
22
down vote
up vote
22
down vote
Using your example dataframe, we could:
xtabs(value ~ name + numbers, data = dat1)
Using your example dataframe, we could:
xtabs(value ~ name + numbers, data = dat1)
edited Sep 2 '16 at 7:37
zx8754
28.5k76394
28.5k76394
answered May 4 '11 at 22:58
Jim M.
4,48411430
4,48411430
2
this one is good, but the result is of format table which not may be not so easy to handle as data.frame or data.table, both has plenty of packages
– cloudscomputes
Oct 20 '17 at 4:44
add a comment |
2
this one is good, but the result is of format table which not may be not so easy to handle as data.frame or data.table, both has plenty of packages
– cloudscomputes
Oct 20 '17 at 4:44
2
2
this one is good, but the result is of format table which not may be not so easy to handle as data.frame or data.table, both has plenty of packages
– cloudscomputes
Oct 20 '17 at 4:44
this one is good, but the result is of format table which not may be not so easy to handle as data.frame or data.table, both has plenty of packages
– cloudscomputes
Oct 20 '17 at 4:44
add a comment |
up vote
14
down vote
Other two options:
Base package:
df <- unstack(dat1, form = value ~ numbers)
rownames(df) <- unique(dat1$name)
df
sqldf
package:
library(sqldf)
sqldf('SELECT name,
MAX(CASE WHEN numbers = 1 THEN value ELSE NULL END) x1,
MAX(CASE WHEN numbers = 2 THEN value ELSE NULL END) x2,
MAX(CASE WHEN numbers = 3 THEN value ELSE NULL END) x3,
MAX(CASE WHEN numbers = 4 THEN value ELSE NULL END) x4
FROM dat1
GROUP BY name')
add a comment |
up vote
14
down vote
Other two options:
Base package:
df <- unstack(dat1, form = value ~ numbers)
rownames(df) <- unique(dat1$name)
df
sqldf
package:
library(sqldf)
sqldf('SELECT name,
MAX(CASE WHEN numbers = 1 THEN value ELSE NULL END) x1,
MAX(CASE WHEN numbers = 2 THEN value ELSE NULL END) x2,
MAX(CASE WHEN numbers = 3 THEN value ELSE NULL END) x3,
MAX(CASE WHEN numbers = 4 THEN value ELSE NULL END) x4
FROM dat1
GROUP BY name')
add a comment |
up vote
14
down vote
up vote
14
down vote
Other two options:
Base package:
df <- unstack(dat1, form = value ~ numbers)
rownames(df) <- unique(dat1$name)
df
sqldf
package:
library(sqldf)
sqldf('SELECT name,
MAX(CASE WHEN numbers = 1 THEN value ELSE NULL END) x1,
MAX(CASE WHEN numbers = 2 THEN value ELSE NULL END) x2,
MAX(CASE WHEN numbers = 3 THEN value ELSE NULL END) x3,
MAX(CASE WHEN numbers = 4 THEN value ELSE NULL END) x4
FROM dat1
GROUP BY name')
Other two options:
Base package:
df <- unstack(dat1, form = value ~ numbers)
rownames(df) <- unique(dat1$name)
df
sqldf
package:
library(sqldf)
sqldf('SELECT name,
MAX(CASE WHEN numbers = 1 THEN value ELSE NULL END) x1,
MAX(CASE WHEN numbers = 2 THEN value ELSE NULL END) x2,
MAX(CASE WHEN numbers = 3 THEN value ELSE NULL END) x3,
MAX(CASE WHEN numbers = 4 THEN value ELSE NULL END) x4
FROM dat1
GROUP BY name')
answered Jul 14 '15 at 17:44
mpalanco
7,07313239
7,07313239
add a comment |
add a comment |
up vote
9
down vote
Using base R aggregate
function:
aggregate(value ~ name, dat1, I)
# name value.1 value.2 value.3 value.4
#1 firstName 0.4145 -0.4747 0.0659 -0.5024
#2 secondName -0.8259 0.1669 -0.8962 0.1681
1
why use the paste function instead of theI
(identity function)?
– Onyambu
Dec 25 '17 at 4:04
add a comment |
up vote
9
down vote
Using base R aggregate
function:
aggregate(value ~ name, dat1, I)
# name value.1 value.2 value.3 value.4
#1 firstName 0.4145 -0.4747 0.0659 -0.5024
#2 secondName -0.8259 0.1669 -0.8962 0.1681
1
why use the paste function instead of theI
(identity function)?
– Onyambu
Dec 25 '17 at 4:04
add a comment |
up vote
9
down vote
up vote
9
down vote
Using base R aggregate
function:
aggregate(value ~ name, dat1, I)
# name value.1 value.2 value.3 value.4
#1 firstName 0.4145 -0.4747 0.0659 -0.5024
#2 secondName -0.8259 0.1669 -0.8962 0.1681
Using base R aggregate
function:
aggregate(value ~ name, dat1, I)
# name value.1 value.2 value.3 value.4
#1 firstName 0.4145 -0.4747 0.0659 -0.5024
#2 secondName -0.8259 0.1669 -0.8962 0.1681
edited Dec 25 '17 at 4:05
Onyambu
15.2k1519
15.2k1519
answered Sep 2 '16 at 7:52
Ronak Shah
27.5k93551
27.5k93551
1
why use the paste function instead of theI
(identity function)?
– Onyambu
Dec 25 '17 at 4:04
add a comment |
1
why use the paste function instead of theI
(identity function)?
– Onyambu
Dec 25 '17 at 4:04
1
1
why use the paste function instead of the
I
(identity function)?– Onyambu
Dec 25 '17 at 4:04
why use the paste function instead of the
I
(identity function)?– Onyambu
Dec 25 '17 at 4:04
add a comment |
up vote
4
down vote
There's very powerful new package from genius data scientists at Win-Vector (folks that made vtreat
, seplyr
and replyr
) called cdata
. It implements "coordinated data" principles described in this document and also in this blog post. The idea is that regardless how you organize your data, it should be possible to identify individual data points using a system of "data coordinates". Here's a excerpt from the recent blog post by John Mount:
The whole system is based on two primitives or operators
cdata::moveValuesToRowsD() and cdata::moveValuesToColumnsD(). These
operators have pivot, un-pivot, one-hot encode, transpose, moving
multiple rows and columns, and many other transforms as simple special
cases.
It is easy to write many different operations in terms of the
cdata primitives. These operators can work-in memory or at big data
scale (with databases and Apache Spark; for big data use the
cdata::moveValuesToRowsN() and cdata::moveValuesToColumnsN()
variants). The transforms are controlled by a control table that
itself is a diagram of (or picture of) the transform.
We will first build the control table (see blog post for details) and then perform the move of data from rows to columns.
library(cdata)
# first build the control table
pivotControlTable <- buildPivotControlTableD(table = dat1, # reference to dataset
columnToTakeKeysFrom = 'numbers', # this will become column headers
columnToTakeValuesFrom = 'value', # this contains data
sep="_") # optional for making column names
# perform the move of data to columns
dat_wide <- moveValuesToColumnsD(tallTable = dat1, # reference to dataset
keyColumns = c('name'), # this(these) column(s) should stay untouched
controlTable = pivotControlTable# control table above
)
dat_wide
#> name numbers_1 numbers_2 numbers_3 numbers_4
#> 1 firstName 0.3407997 -0.7033403 -0.3795377 -0.7460474
#> 2 secondName -0.8981073 -0.3347941 -0.5013782 -0.1745357
add a comment |
up vote
4
down vote
There's very powerful new package from genius data scientists at Win-Vector (folks that made vtreat
, seplyr
and replyr
) called cdata
. It implements "coordinated data" principles described in this document and also in this blog post. The idea is that regardless how you organize your data, it should be possible to identify individual data points using a system of "data coordinates". Here's a excerpt from the recent blog post by John Mount:
The whole system is based on two primitives or operators
cdata::moveValuesToRowsD() and cdata::moveValuesToColumnsD(). These
operators have pivot, un-pivot, one-hot encode, transpose, moving
multiple rows and columns, and many other transforms as simple special
cases.
It is easy to write many different operations in terms of the
cdata primitives. These operators can work-in memory or at big data
scale (with databases and Apache Spark; for big data use the
cdata::moveValuesToRowsN() and cdata::moveValuesToColumnsN()
variants). The transforms are controlled by a control table that
itself is a diagram of (or picture of) the transform.
We will first build the control table (see blog post for details) and then perform the move of data from rows to columns.
library(cdata)
# first build the control table
pivotControlTable <- buildPivotControlTableD(table = dat1, # reference to dataset
columnToTakeKeysFrom = 'numbers', # this will become column headers
columnToTakeValuesFrom = 'value', # this contains data
sep="_") # optional for making column names
# perform the move of data to columns
dat_wide <- moveValuesToColumnsD(tallTable = dat1, # reference to dataset
keyColumns = c('name'), # this(these) column(s) should stay untouched
controlTable = pivotControlTable# control table above
)
dat_wide
#> name numbers_1 numbers_2 numbers_3 numbers_4
#> 1 firstName 0.3407997 -0.7033403 -0.3795377 -0.7460474
#> 2 secondName -0.8981073 -0.3347941 -0.5013782 -0.1745357
add a comment |
up vote
4
down vote
up vote
4
down vote
There's very powerful new package from genius data scientists at Win-Vector (folks that made vtreat
, seplyr
and replyr
) called cdata
. It implements "coordinated data" principles described in this document and also in this blog post. The idea is that regardless how you organize your data, it should be possible to identify individual data points using a system of "data coordinates". Here's a excerpt from the recent blog post by John Mount:
The whole system is based on two primitives or operators
cdata::moveValuesToRowsD() and cdata::moveValuesToColumnsD(). These
operators have pivot, un-pivot, one-hot encode, transpose, moving
multiple rows and columns, and many other transforms as simple special
cases.
It is easy to write many different operations in terms of the
cdata primitives. These operators can work-in memory or at big data
scale (with databases and Apache Spark; for big data use the
cdata::moveValuesToRowsN() and cdata::moveValuesToColumnsN()
variants). The transforms are controlled by a control table that
itself is a diagram of (or picture of) the transform.
We will first build the control table (see blog post for details) and then perform the move of data from rows to columns.
library(cdata)
# first build the control table
pivotControlTable <- buildPivotControlTableD(table = dat1, # reference to dataset
columnToTakeKeysFrom = 'numbers', # this will become column headers
columnToTakeValuesFrom = 'value', # this contains data
sep="_") # optional for making column names
# perform the move of data to columns
dat_wide <- moveValuesToColumnsD(tallTable = dat1, # reference to dataset
keyColumns = c('name'), # this(these) column(s) should stay untouched
controlTable = pivotControlTable# control table above
)
dat_wide
#> name numbers_1 numbers_2 numbers_3 numbers_4
#> 1 firstName 0.3407997 -0.7033403 -0.3795377 -0.7460474
#> 2 secondName -0.8981073 -0.3347941 -0.5013782 -0.1745357
There's very powerful new package from genius data scientists at Win-Vector (folks that made vtreat
, seplyr
and replyr
) called cdata
. It implements "coordinated data" principles described in this document and also in this blog post. The idea is that regardless how you organize your data, it should be possible to identify individual data points using a system of "data coordinates". Here's a excerpt from the recent blog post by John Mount:
The whole system is based on two primitives or operators
cdata::moveValuesToRowsD() and cdata::moveValuesToColumnsD(). These
operators have pivot, un-pivot, one-hot encode, transpose, moving
multiple rows and columns, and many other transforms as simple special
cases.
It is easy to write many different operations in terms of the
cdata primitives. These operators can work-in memory or at big data
scale (with databases and Apache Spark; for big data use the
cdata::moveValuesToRowsN() and cdata::moveValuesToColumnsN()
variants). The transforms are controlled by a control table that
itself is a diagram of (or picture of) the transform.
We will first build the control table (see blog post for details) and then perform the move of data from rows to columns.
library(cdata)
# first build the control table
pivotControlTable <- buildPivotControlTableD(table = dat1, # reference to dataset
columnToTakeKeysFrom = 'numbers', # this will become column headers
columnToTakeValuesFrom = 'value', # this contains data
sep="_") # optional for making column names
# perform the move of data to columns
dat_wide <- moveValuesToColumnsD(tallTable = dat1, # reference to dataset
keyColumns = c('name'), # this(these) column(s) should stay untouched
controlTable = pivotControlTable# control table above
)
dat_wide
#> name numbers_1 numbers_2 numbers_3 numbers_4
#> 1 firstName 0.3407997 -0.7033403 -0.3795377 -0.7460474
#> 2 secondName -0.8981073 -0.3347941 -0.5013782 -0.1745357
answered Dec 23 '17 at 23:01
dmi3kno
1,738521
1,738521
add a comment |
add a comment |
up vote
2
down vote
The base reshape
function works perfectly fine:
df <- data.frame(
year = c(rep(2000, 12), rep(2001, 12)),
month = rep(1:12, 2),
values = rnorm(24)
)
df_wide <- reshape(df, idvar="year", timevar="month", v.names="values", direction="wide", sep="_")
df_wide
Where
idvar
is the column of classes that separates rowstimevar
is the column of classes to cast widev.names
is the column containing numeric valuesdirection
specifies wide or long format- the optional
sep
argument is the separator used in betweentimevar
class names andv.names
in the outputdata.frame
.
If no idvar
exists, create one before using the reshape()
function:
df$id <- c(rep("year1", 12), rep("year2", 12))
df_wide <- reshape(df, idvar="id", timevar="month", v.names="values", direction="wide", sep="_")
df_wide
Just remember that idvar
is required! The timevar
and v.names
part is easy. The output of this function is more predictable than some of the others, as everything is explicitly defined.
add a comment |
up vote
2
down vote
The base reshape
function works perfectly fine:
df <- data.frame(
year = c(rep(2000, 12), rep(2001, 12)),
month = rep(1:12, 2),
values = rnorm(24)
)
df_wide <- reshape(df, idvar="year", timevar="month", v.names="values", direction="wide", sep="_")
df_wide
Where
idvar
is the column of classes that separates rowstimevar
is the column of classes to cast widev.names
is the column containing numeric valuesdirection
specifies wide or long format- the optional
sep
argument is the separator used in betweentimevar
class names andv.names
in the outputdata.frame
.
If no idvar
exists, create one before using the reshape()
function:
df$id <- c(rep("year1", 12), rep("year2", 12))
df_wide <- reshape(df, idvar="id", timevar="month", v.names="values", direction="wide", sep="_")
df_wide
Just remember that idvar
is required! The timevar
and v.names
part is easy. The output of this function is more predictable than some of the others, as everything is explicitly defined.
add a comment |
up vote
2
down vote
up vote
2
down vote
The base reshape
function works perfectly fine:
df <- data.frame(
year = c(rep(2000, 12), rep(2001, 12)),
month = rep(1:12, 2),
values = rnorm(24)
)
df_wide <- reshape(df, idvar="year", timevar="month", v.names="values", direction="wide", sep="_")
df_wide
Where
idvar
is the column of classes that separates rowstimevar
is the column of classes to cast widev.names
is the column containing numeric valuesdirection
specifies wide or long format- the optional
sep
argument is the separator used in betweentimevar
class names andv.names
in the outputdata.frame
.
If no idvar
exists, create one before using the reshape()
function:
df$id <- c(rep("year1", 12), rep("year2", 12))
df_wide <- reshape(df, idvar="id", timevar="month", v.names="values", direction="wide", sep="_")
df_wide
Just remember that idvar
is required! The timevar
and v.names
part is easy. The output of this function is more predictable than some of the others, as everything is explicitly defined.
The base reshape
function works perfectly fine:
df <- data.frame(
year = c(rep(2000, 12), rep(2001, 12)),
month = rep(1:12, 2),
values = rnorm(24)
)
df_wide <- reshape(df, idvar="year", timevar="month", v.names="values", direction="wide", sep="_")
df_wide
Where
idvar
is the column of classes that separates rowstimevar
is the column of classes to cast widev.names
is the column containing numeric valuesdirection
specifies wide or long format- the optional
sep
argument is the separator used in betweentimevar
class names andv.names
in the outputdata.frame
.
If no idvar
exists, create one before using the reshape()
function:
df$id <- c(rep("year1", 12), rep("year2", 12))
df_wide <- reshape(df, idvar="id", timevar="month", v.names="values", direction="wide", sep="_")
df_wide
Just remember that idvar
is required! The timevar
and v.names
part is easy. The output of this function is more predictable than some of the others, as everything is explicitly defined.
edited Aug 29 at 3:00
SymbolixAU
16k32885
16k32885
answered Aug 2 at 23:50
Adam Erickson
1,6141320
1,6141320
add a comment |
add a comment |
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f5890584%2fhow-to-reshape-data-from-long-to-wide-format%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
1
possible duplicate of Reshape three column data frame to matrix
– Frank
Oct 8 '13 at 20:53
3
@Frank: this is a much better title. long-form and wide-form are the standard terms used. The other answer cannot be found by searching on those terms.
– smci
Apr 11 '14 at 5:21
2
@smci Having a better title is not a good reason not to link the questions, is it? If the questions are essentially the same, it's better for future visitors that they be linked so that all the answers can be found easily. You could mark the duplicate in the other direction instead, I suppose... Also, I do not know why you have made new tags here.
– Frank
Apr 11 '14 at 16:49
1
^^ @Frank - this answer cannot be found by searching on the terms users are likely to use. It's not a question of aesthetics.
– smci
Apr 11 '14 at 21:03
1
@smci By "this answer" you mean... the one I linked to? Yeah, I agree that it is unlikely but someone might land on one or the other in their search, and we want that person to find all the relevant answers (here and there). It's the exact same question so the answers in both places are relevant -- that's why it should be marked as a dupe. The titles don't matter; and which one is a dupe of the other doesn't matter either (as far as I can think of). If you're talking about the tags, maybe you should take it to meta...?
– Frank
Apr 11 '14 at 21:13